Never Ending Security

It starts all here

Sniffing out probes

WiFi capability comes included on just about every device you can imagine. You can even purchase SD cards that are WiFi capable. Most people carry their phones with them wherever they go. Even if you never use the WiFi in your phone, it is probably giving up your location continuously. It is also probably identifying you uniquely. In almost any computer network, computers use unique numbers to refer to one another. A WiFi capable device has a media access control address, or MAC, assigned to it long before you purchased it. This address is six bytes (48 bits) long, so there are exactly 2^48 = 281,474,976,710,656 unique addresses available. Any time your device uses a WiFi network, it must send this six byte address to uniquely identify itself.
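As a quick sanity check on that figure, and to show how a six byte address is usually written out, here is a minimal Python sketch. The helper name format_mac and the sample bytes are made up for illustration.

```python
# Six bytes give 8 * 6 = 48 bits of address space.
ADDRESS_BYTES = 6
address_space = 2 ** (8 * ADDRESS_BYTES)


def format_mac(raw: bytes) -> str:
    """Render six raw bytes in the usual colon-separated hex notation."""
    if len(raw) != ADDRESS_BYTES:
        raise ValueError("a MAC address is exactly six bytes long")
    return ":".join(f"{b:02x}" for b in raw)


# Hypothetical address, not one observed in the capture.
example = bytes([0x00, 0x11, 0x22, 0xAA, 0xBB, 0xCC])
print(address_space)        # 281474976710656
print(format_mac(example))  # 00:11:22:aa:bb:cc
```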

The process of joining a WiFi network is a straightforward task. First, the device joining can listen for other devices to identify themselves. These identifiers are broadcast continuously and are known as beacons. Beacons are broadcast by devices that act as an access point. Included in the beacon is a Service Set Identifier or SSID. This is the name of the access point. If you have ever been in a busy area and gone to your phone or laptop’s listing of nearby networks, you will have noticed it lists a large number of networks. In such an environment, your device is being constantly inundated with beacons from many networks. If your device receives a beacon from a network it wants to associate with, it can begin the process of joining the network.

The second possibility is that your phone can send out probes. In the case of probes your phone has the option of simply asking “Is anyone out there?” This is known as a broadcast probe. In that case any access point may reply. The other option is a directed probe, in which your phone asks “Is Bob there?” In this case your phone must broadcast not only its unique six byte address but also the SSID of the access point it wants to connect to. Many WiFi capable devices will continually transmit such probes when they are out of range of a known network.

Looking at all this together, we can see that the WiFi signal of your phone can not only uniquely identify you but also identify places you have been. After all, if your phone is probing for a network named “Starbucks” you were either there or freeloading the WiFi from the parking lot.

Putting this knowledge to work

I live along a busy roadway, so I am in a unique position to capture WiFi traffic. There is also a decent amount of pedestrian traffic in the area.


In order to capture as many signals as possible, I set up a high gain antenna pointing at the roadway. It is important to emphasize that the antenna should point down the roadway as much as possible. It is very helpful to think of a directional antenna like a flashlight. If you are standing on the side of the roadway, you can point the flashlight directly at it illuminating a single spot. But if you are very close to the edge of the roadway, you can point almost parallel to it. This illuminates more surface area. In this way, the antenna has as many vehicles in view as possible for as long as possible.

Parabolic antenna for 2.4 GHz

This antenna cost me less than $20 shipped off eBay.

In order to capture WiFi traffic I needed a device that could be hooked to this antenna. This device also needs to support monitor mode. Monitor mode is a way of saying the device can capture all available traffic. I happen to have modified a laptop for such purposes years ago.

Modified laptop

The laptop’s screen is broken, but everything else works fine. I don’t have any pictures of how I performed this mod. It is an IBM R51 laptop. Underneath the keyboard is a micro PCI slot. After removing the original wireless card, I installed an Atheros chipset wireless card. If you intend to buy a wireless card for the purpose of monitoring, I highly recommend Atheros. They are certainly not scientific quality measurement equipment, but most of their products are cheap and are capable of monitor mode. Instead of connecting the card to the internal antennas, I connected it to a coaxial pigtail. The connectors on the WiFi cards themselves are almost always MMCX. On the other end of this coaxial pigtail is a Reverse-Polarity TNC connector. This is brought out to the outside of the laptop case. From there, I can adapt to a type N connector used by the high gain antenna.


For an operating system, I have Ubuntu Server Linux installed on the laptop. You’ll need to compile the aircrack-ng suite. The airmon-ng utility included in it is the easiest way of putting the WiFi card into monitor mode.

With the wireless card in monitor mode, you can now capture packets from it. Initially I tried doing this with Python’s socket module but I found it much easier to do using scapy. Getting scapy to grab packets for you is relatively easy.

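from scapy.all import sniff

def dummyHandler(packet):
    # Called once for every captured frame; this placeholder does nothing.
    return None

# store=0 stops scapy from keeping every captured packet in memory.
sniff(iface="wlan0", prn=dummyHandler, store=0)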


The sniff function runs forever capturing packets from the wlan0 interface. For each packet it calls dummyHandler once with the packet as the argument. Notice the store argument is set to zero. If this is not done, scapy stores all packets in memory indefinitely. This quickly exhausts the available memory on the system.

Frame format

In order to actually make sense of the packet, it is mandatory to understand the WiFi frame format. A great quick reference to that is available here. The basic breakdown of the header is shown here.

  • 2 bytes – Frame control
  • 2 bytes – Duration
  • 6 bytes – Address 1
  • 6 bytes – Address 2
  • 6 bytes – Address 3
  • 2 bytes – Sequence

The frame control consists of a 16 bit integer with many independent bitfields. Normally, any data transmitted over a network is sent in big-endian order. That is to say, the most significant bytes come first. For whatever reason, the IEEE 802.11 standard which defines this format actually specifies that data is sent in a little-endian format. The standard is not publicly available to my knowledge, but this StackOverflow post does an excellent job of explaining things. The scapy module extracts the only two values from the Frame Control bitfield that we care about: packet type and subtype. The argument to the handler function has the type and subtype attributes set on it. The only type that is of interest here is Management packets, which have a type value of zero.
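For readers who want to see where those two values sit in the bitfield, here is a small sketch that decodes them by hand instead of relying on scapy. The function name decode_frame_control is hypothetical; the bit positions assume the usual layout of protocol version in bits 0-1, type in bits 2-3, and subtype in bits 4-7.

```python
import struct


def decode_frame_control(header: bytes):
    """Return (version, frame_type, subtype) from the first two header bytes."""
    # "<H" reads an unsigned 16-bit integer in little-endian byte order.
    (frame_control,) = struct.unpack("<H", header[:2])
    version = frame_control & 0b11            # bits 0-1
    frame_type = (frame_control >> 2) & 0b11  # bits 2-3
    subtype = (frame_control >> 4) & 0b1111   # bits 4-7
    return version, frame_type, subtype


# A probe request is a management frame (type 0) with subtype 4,
# so its first header byte is 0x40; the flags byte is zero here.
print(decode_frame_control(b"\x40\x00"))  # (0, 0, 4)
```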

The payload of the packet is available from scapy as the payload attribute of the argument. It also contains the complete frame header. To extract the additional values, the struct module is useful. Continuing the previous example:

import struct

def handler(packet):
    # Wrap the raw frame in a memoryview so that slicing does not copy data.
    payload = memoryview(bytes(packet.payload))
    HEADER_FMT = "<HH6s6s6sH"
    headerSize = struct.calcsize(HEADER_FMT)
    header = payload[:headerSize]
    frameControl,dur,addr1,addr2,addr3,seq = struct.unpack(HEADER_FMT,header)

    # To DS and From DS are bits 8 and 9 of the little-endian Frame Control.
    TO_DS_BIT = 2**8
    FROM_DS_BIT = 2**9
    fromDs = (FROM_DS_BIT & frameControl) != 0
    toDs = (TO_DS_BIT & frameControl) != 0

    if fromDs and not toDs:
        srcAddr = addr3
    elif not fromDs and not toDs:
        srcAddr = addr2
    elif not fromDs and toDs:
        srcAddr = addr2
    elif fromDs and toDs:
        # Four-address frame: the source is in Address 4, which this
        # header format does not cover, so it is left unset here.
        srcAddr = None

The payload attribute is first converted to bytes and then wrapped in a memoryview. Using a memoryview allows the creation of slices of the original data source without the interpreter having to do the additional work of a copy. On Python 2, buffer(str(packet.payload)) plays the same role. The struct module uses a format string to specify the byte structure of data. It expects the input data to have exactly the length required by the format string, so it is necessary to create a slice of payload before passing it to struct.unpack. For more information on the struct module format string, run help(struct) in the interactive Python interpreter.

The addresses are assigned to addr1,addr2,addr3 because the position of the source address changes based on the value of two bits in the Frame Control bitfield. For the specification of this check the quick reference card.


Probes are management packets with a subtype of four. In the payload of the packet are tagged parameters. The format of the tags is very simple

  • 1 byte – Tag ID
  • 1 byte – Tag Length N
  • N bytes – Content of tag

The only tag that I am extracting is the SSID tag. It has an ID of zero and a length of 0 to 32. If the length is zero, the probe is a broadcast probe. If the length is non-zero, the content is an ASCII string specifying the SSID of the network being probed for.

In order to find the SSID tag, it is required to parse and discard any tag which may precede it. Since the ID and length are just a single byte, concerns about endianness do not apply. It is sufficient to extract each tag, check if the ID is zero, and if not just advance the reference into the payload by the length of the tagged parameter.
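The tag walk described above can be sketched in a few lines of Python. This is an illustration rather than the code from the project; the function name find_ssid and the sample bytes are made up, and the input is assumed to be just the tagged parameters with the fixed header already stripped off.

```python
SSID_TAG_ID = 0


def find_ssid(tags: bytes):
    """Walk the tagged parameters and return the SSID bytes, or None."""
    offset = 0
    while offset + 2 <= len(tags):
        tag_id = tags[offset]
        length = tags[offset + 1]
        content = tags[offset + 2:offset + 2 + length]
        if len(content) < length:
            break  # truncated tag; stop rather than read past the end
        if tag_id == SSID_TAG_ID:
            return content
        offset += 2 + length  # skip this tag and move on to the next
    return None


# Hypothetical tag data: a two-byte tag with ID 1 followed by an SSID tag.
sample = bytes([1, 2, 0x82, 0x84]) + bytes([0, 3]) + b"Bob"
print(find_ssid(sample))         # b'Bob'
print(find_ssid(bytes([0, 0])))  # b'' (a broadcast probe)
```

Returning None when no SSID tag is present keeps that case distinct from a broadcast probe, whose SSID tag exists but is empty.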

Storing gathered data

I ended up creating a simple schema for PostgreSQL to store the observed data. I also added the restriction that if a probe has already been received from the same device for the same SSID in the past five minutes, then it is not added to the database. This prevents devices that are persistently in the area from simply filling the database.
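The five minute rule can be illustrated without a database. The sketch below is an in-memory stand-in, not the project's PostgreSQL logic; the class name ProbeFilter, the sample address, and the dates are all hypothetical, and the window is assumed to be measured from the last probe that was actually recorded.

```python
from datetime import datetime, timedelta

SUPPRESS_WINDOW = timedelta(minutes=5)


class ProbeFilter:
    """Remember the last recorded time for each (station, ssid) pair."""

    def __init__(self):
        self._last_recorded = {}

    def should_record(self, station: str, ssid: str, seen_at: datetime) -> bool:
        key = (station, ssid)
        last = self._last_recorded.get(key)
        if last is not None and seen_at - last < SUPPRESS_WINDOW:
            return False  # the same probe was already stored within the window
        self._last_recorded[key] = seen_at
        return True


probe_filter = ProbeFilter()
start = datetime(2014, 1, 1, 8, 0, 0)
mac = "00:11:22:aa:bb:cc"
print(probe_filter.should_record(mac, "Bob", start))                         # True
print(probe_filter.should_record(mac, "Bob", start + timedelta(minutes=2)))  # False
print(probe_filter.should_record(mac, "Bob", start + timedelta(minutes=6)))  # True
```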

To insert the observations into the database, I used the psycopg2 module. Nothing exciting there.


At this point I’m going to dispense with examples and link the current project on GitHub.

Running it

To run the script you’ll need to do some preparation. Start up a PostgreSQL database if you don’t already have one. Create a database for this and create all the needed tables using the probecap.sql file. In my case I am running the database on a separate machine. As a result, it is very important to have both machines using NTP so the clocks are synchronized.

Next, get your wireless device into monitor mode by using airmon-ng. It can vary from one piece of hardware to the next, but typically all you have to do is run airmon-ng stop wlan0 followed by airmon-ng start wlan0. This has to be run as root. Pay very close attention to the output of the second command, as it tells you the name of the interface the device is listening on in monitor mode. In my case it is mon0.

You also must be root to run the Python script.

python mon0 conf.json

The first argument is the name of the interface. The second is a JSON file containing a single dictionary, which holds the keyword arguments passed to psycopg2.connect. Update the provided conf.json.example to have the details of your PostgreSQL database.
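To make the shape of that file concrete, here is a hedged sketch of how such a configuration could be parsed and handed to psycopg2. The function name parse_connect_args and the example values are hypothetical; host, dbname, and user are ordinary psycopg2.connect keyword arguments, but the keys in the project's own conf.json.example may differ.

```python
import json


def parse_connect_args(text: str) -> dict:
    """Parse the JSON config and check that it is a single dictionary."""
    args = json.loads(text)
    if not isinstance(args, dict):
        raise ValueError("the config must be a single JSON object")
    return args


# Hypothetical contents for conf.json; real values depend on your database.
example_conf = '{"host": "db.example.lan", "dbname": "probecap", "user": "probes"}'
connect_args = parse_connect_args(example_conf)
print(sorted(connect_args))  # ['dbname', 'host', 'user']

# With psycopg2 installed, the dictionary expands into keyword arguments:
#     connection = psycopg2.connect(**connect_args)
```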

What’s next?

Now that I’m gathering data all the time, I’ve got some ideas. First off, I’d expect the number of probes to rise and fall with traffic patterns. Additionally, I should be able to observe the same device in regular daily patterns as people commute to and from work.

Flawed data capture

When I wrote the code to gather probes, I made the assumption that most stations would be sending out directed probes (for a specific SSID) rather than undirected probes (for any SSID). I did not originally record undirected probes. When I started looking at the data, it was obvious that I was discarding a large amount of potentially interesting data. I had observed 9450 unique stations, but only 1947 of those sent directed probes. By discarding the undirected probes, I was only recording probes from about 20% of the stations that I otherwise could be.

As a result, I’ve modified the capture script to record undirected probes from stations. I’m now recording that into a separate database from the original database.

I opted to go ahead and do some analysis now on the existing data set. All of the graphs and data presented here are gathered from the original data set.

The original idea

When I started this I thought that I would be able to analyze the data and notice patterns in the observations of certain stations. So far I have not been able to do this. The reason is that only a fraction of stations are observed sending probes more than once.

Fraction of stations sighted more than once

As this chart shows, less than half of stations are observed more than once. If you are looking for patterns in a specific station’s activity, this means that less than half of the dataset is of interest.


Overall I observed 32022 probes from 1947 unique stations.

Background subtraction

The same probe is recorded only once in a five-minute period. As a result the recorded number of probes for some stations is much lower than the real world number of probes. This means that a station probing for some SSID can be recorded no more than 288 times in a day. This is necessary because in any area there are some number of stations that are always there. Most of those stations are associated with an access point and are not actively sending probes, so the capture script does not record them. However, some fraction of them are not associated and may be constantly probing. Common sources of such probes are things like WiFi capable printers which have never been set up. The five-minute period limit stops those stations from simply flooding the database with records and quickly filling it up. Since I am interested in observing the probes only from the traffic in front of my house, these stations are deemed background noise. I don’t intend to count them in the analyzed data.

The upside to using this five minute period is that any station persistently observed should have an average inter-observation period of five minutes. The inter-observation period is the time between consecutive observations of a station probing for the same SSID. For a persistent station, this period should follow a normal distribution centered around five minutes.

The distribution of the inter-observation period for a non-background station is unknown at this point in time. However, even if it is also a normal distribution it should not be centered around five minutes.

For each combination of station and SSID I calculated the inter-observation periods. I decided that any station without at least 144 independent observation events (twelve hours’ worth at one observation every five minutes) could not be a background station. For the remaining stations, I calculated the 95% confidence interval for the mean inter-observation period. If the value five minutes lies within this interval, I conclude that the station is a background station. That station is excluded from the final results.
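A rough sketch of that test is shown below. It is not the project's code: the function names are hypothetical, timestamps are assumed to be plain numbers of seconds, and the interval uses the normal approximation (mean plus or minus 1.96 standard errors), which is only one of several reasonable ways to build a 95% confidence interval.

```python
import math
import statistics

TARGET_SECONDS = 5 * 60   # the five-minute recording window
MIN_OBSERVATIONS = 144    # threshold quoted in the text
Z_95 = 1.96               # normal approximation for a 95% interval


def inter_observation_periods(timestamps):
    """Seconds between consecutive sightings of one (station, SSID) pair."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def is_background(timestamps):
    """True if five minutes lies inside the 95% interval of the mean period."""
    if len(timestamps) < MIN_OBSERVATIONS:
        return False
    periods = inter_observation_periods(timestamps)
    mean = statistics.mean(periods)
    half_width = Z_95 * statistics.stdev(periods) / math.sqrt(len(periods))
    return mean - half_width <= TARGET_SECONDS <= mean + half_width


# Hypothetical stations: one seen roughly every 300 s, one seen three times.
steady = [i * 300 + (i % 3 - 1) * 2 for i in range(200)]
passerby = [0, 40, 90000]
print(is_background(steady))    # True
print(is_background(passerby))  # False
```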

There is a good chance that this analysis makes some statistical assumption that is untrue. However, starting from the idea that a background station should always be observed and that a station in a vehicle is observed infrequently, it can be concluded that background stations should make up a disproportionate amount of the observed probes. The statistical method I presented identifies stations in agreement with this idea. I identified 12 stations which were responsible for 50.95% of the observed probes.

Probes by day of the week

The first thing I decided to look at was the number of probes per day of the week.

Probes per day of the week

This graph is not particularly interesting. There are more probes recorded on Friday than any other day of the week. This lines up with the idea that more people are out doing things on a Friday than any other day of the week.

Probes by hour of day

Looking at the number of probes per hour of the day is much more interesting. This is a histogram where the bin width is an hour. I chose to normalize the height of each tally by the number of days included in it. This enables the absolute height of each tally to be compared across all three graphs, even though each one does not include the same number of days.
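The normalization can be sketched as follows. This is an illustration, not the plotting code from the project; the function name probes_per_hour and the sample timestamps are hypothetical, and the number of days is assumed to be the count of distinct calendar dates that have at least one probe.

```python
from collections import Counter
from datetime import datetime


def probes_per_hour(timestamps):
    """Average number of probes in each hour of the day, per day observed."""
    counts = Counter(ts.hour for ts in timestamps)
    days = len({ts.date() for ts in timestamps})
    if days == 0:
        return [0.0] * 24
    return [counts[hour] / days for hour in range(24)]


# Hypothetical probes spread over two days.
sample = [
    datetime(2014, 3, 3, 8, 5),
    datetime(2014, 3, 3, 8, 30),
    datetime(2014, 3, 3, 17, 10),
    datetime(2014, 3, 4, 8, 15),
]
rates = probes_per_hour(sample)
print(rates[8], rates[17])  # 1.5 0.5
```

Filtering the timestamps to weekdays or weekends before calling the function would give the other two graphs on the same scale.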

Probes per hour of day

Probes per hour of weekday

Probes per hour of weekend

The second graph showing the probes per hour of the weekday is the most striking. It lines up with the idea that traffic peaks in the morning when people are travelling to work and in the afternoon when they are coming home. There is a school bus stop near my house. I’m guessing most school students also carry a cell phone, which might explain the afternoon values beginning to rise earlier than expected.

The graph for the weekend differs strongly from the weekday graph, with activity peaking in the middle of the day.

Stations Per SSID

The next thing I thought would be interesting would be to see how many SSIDs are shared in common by the stations observed. The first question is how many SSIDs have more than one station probing for them.

SSIDs Probed By One Station

Only 12% of the SSIDs observed had more than one station probing for them. I needed to cut down that part of the dataset even more, so I graphed just the upper quartile.
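Counting stations per SSID is a small grouping exercise. The sketch below assumes the observations are available as (station, SSID) pairs; the function names and the sample data are hypothetical and do not come from the captured dataset.

```python
from collections import defaultdict


def stations_per_ssid(observations):
    """Map each SSID to the number of distinct stations probing for it."""
    stations = defaultdict(set)
    for station, ssid in observations:
        stations[ssid].add(station)
    return {ssid: len(macs) for ssid, macs in stations.items()}


def shared_fraction(counts):
    """Fraction of SSIDs probed for by more than one station."""
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n > 1) / len(counts)


# Hypothetical observations; repeated probes from one station count once.
sample = [
    ("aa", "linksys"),
    ("bb", "linksys"),
    ("aa", "linksys"),
    ("aa", "HomeNet"),
    ("cc", "attwifi"),
]
counts = stations_per_ssid(sample)
print(counts["linksys"])  # 2
```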

Upper Quartile of SSIDs By Station Count

Each SSID is shown along with the number of stations probing for it. Most of these do not stand out very much. “Bright House Networks” is the name of a regional cable provider. “Wayport_Access” is a provider of internet access at McDonald’s.

But what exactly is “Nintendo_3DS_continous_scan_000”? It turns out to be something called StreetPass for the Nintendo 3DS. It is used by the handheld to connect to other handhelds. A good explanation of the WiFi component is found here. The handheld uses these probes as a way to announce its presence and set up what amounts to an ad hoc network. The fact that I saw 353 Nintendo 3DS handhelds is surprising to me.

I have not yet figured out what an SSID of “DIRECT-” corresponds to.

The SSID “attwifi” apparently is used by AT&T to offer wireless service to its customers in public places. Interestingly, it seems that some iPhones attempt to connect to this network even if the user does not instruct them to. This phenomenon is detailed here and here. This makes a great target for a man-in-the-middle attack on iOS devices.

The SSIDs “linksys” and “NETGEAR” are the defaults on many home access points.

Where to go from here

At the moment, I’m sitting on this project. I believe that capturing the undirected probes will give me a much more interesting dataset.

While I was working on ideas for the background subtraction, it dawned on me that you could use the observations to measure wait time in a queue. If I had clear view of a traffic light, this would be a really neat application. Unfortunately I do not at this time.

The first three bytes of each station’s MAC address are assigned by the IEEE to specific manufacturers. Due to this, I essentially have access to a popularity map of device manufacturers. I am not sure what I will do with this at this time, but I think there are some interesting possibilities.
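Grouping stations by manufacturer prefix (the first three bytes of the address, often called the OUI) could look something like this. The function names and sample addresses are hypothetical, and addresses are assumed to be colon-separated hex strings. Turning a prefix into a manufacturer name needs the IEEE registry, which is not included here.

```python
from collections import Counter


def oui(mac: str) -> str:
    """Return the first three bytes of a colon-separated MAC address."""
    return mac.lower()[:8]


def manufacturer_prefix_counts(macs):
    """Count distinct stations per manufacturer prefix."""
    unique_stations = {mac.lower() for mac in macs}
    return Counter(oui(mac) for mac in unique_stations)


# Hypothetical addresses; the duplicate in different case is counted once.
sample = [
    "00:11:22:aa:bb:cc",
    "00:11:22:AA:BB:CC",
    "00:11:22:01:02:03",
    "ac:de:48:00:00:01",
]
prefix_counts = manufacturer_prefix_counts(sample)
print(prefix_counts["00:11:22"])  # 2
```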

Source code

I have updated the project on GitHub with the latest source code. I used the matplotlib Python module to generate the graphs.

Earlier I showed my first analysis of captured WiFi probes. Since then I have collected more data. Most importantly, the capture script now collects all probes instead of just directed probes. The data analysis is mostly unchanged. The background subtraction has been further tuned. The stations per SSID chart now simply shows the top 10 most popular SSIDs rather than the upper quartile.

The biggest change this dataset shows from the last is the probes per hour of weekday. There are strong peaks correlating with 8 AM and 5 PM. This agrees with my hypothesis that WiFi activity correlates with traffic patterns.


Fraction of stations sighted more than once

Probes per day of the week

Probes per hour of day

Probes per hour of weekday

Probes per hour of weekend

SSIDs Probed By One Station

Top 10 SSIDs By Station Count

Each SSID is shown along with the number of stations probing for it.


