Introduction to tracking the behavior of people online

From Knowledge Kitchen


This document broadly outlines some strategies that are in current use to track people's online behavior.

IP Address

Every device connected to the Internet is assigned a unique number known as the Internet Protocol (IP) address. Similar to a mailing address, any data sent to this address will arrive at this device. Without an IP address, a device cannot receive any data from the Internet, since there would be no receiving address. The particular IP number assigned to any given device is either manually entered into the device's network settings or (more commonly) automatically assigned to the device by a DHCP server when the device goes 'online'.

DHCP servers are operated by Internet Service Providers, and their specific job is to hand out IP addresses to the devices on a network. Organizations with large computer networks of their own will sometimes run their own DHCP servers to hand out the IP addresses within their own networks. Since Internet Service Providers lease ranges of IP addresses to the organizations that run local networks, the IP address given to a particular device by an organization's local DHCP server indicates which organization's network the device is currently on, and the general geographic area in which that organization resides.

Although a typical person's IP address may change when they switch networks on their laptops or mobile devices, it is possible for tracking organizations to log the general IP ranges that a given person tends to use. Banks are well known for requiring additional authentication when they detect a log-in attempt from an IP address not already associated with a given account.
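The kind of check described above can be sketched in a few lines of Python. This is only an illustration: the address ranges are reserved documentation blocks used as placeholders, and the function names are hypothetical rather than taken from any real bank's system.

```python
import ipaddress

# Hypothetical ranges previously seen for one account. These are reserved
# documentation-only address blocks, used here purely as placeholders.
KNOWN_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_familiar(ip_text, known_networks=KNOWN_NETWORKS):
    """Return True if the address falls inside any previously seen range."""
    address = ipaddress.ip_address(ip_text)
    return any(address in network for network in known_networks)


def needs_extra_authentication(ip_text):
    """A log-in from an unfamiliar range triggers an additional check."""
    return not is_familiar(ip_text)
```

A log-in from 192.0.2.17 would count as familiar here, while one from 203.0.113.5 would prompt the additional authentication step.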

Internet Service Providers

All the data you send and receive on the Internet passes through the cables and routers provided by your Internet Service Provider; they provide you the Internet as a service, after all. They know who you are (they have to know whom to bill), can see everything you do on the Internet that is not encrypted (all traffic to and from your devices goes through their Internet hubs), and they have the right to sell that data to whomever they want, a right they are most likely exercising as we speak.

In addition to viewing the content passing through their cables to and from your devices, mobile broadband providers can use cell tower triangulation (or your mobile device's GPS data if you're making it even easier for them by sending that across the network) to determine your location for the entire duration that the device is on, even when you're not using it. These movement paths can then be mapped and used to develop a profile of places you go, in addition to the other data they have collected by other means.


Email

When a user registers for an email account with an email service, like Google's Gmail, the terms of service usually require the user to agree to allow the service provider to read and analyze the contents of their emails forever. Besides the contents of the email, the metadata sent with an email can also contain useful data for building profiles and tracking, such as who a user communicates with, when and where they typically send emails, etc.
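To illustrate how much the metadata alone reveals, here is a minimal sketch using Python's standard email library. The message, names, and addresses are all made up, and the function name is hypothetical; the point is only that sender, recipients, and send time can be collected without reading the body at all.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# A made-up message; every name and address here is a placeholder.
RAW_MESSAGE = """\
From: Alice <alice@example.com>
To: Bob <bob@example.com>, Carol <carol@example.org>
Date: Mon, 02 Mar 2015 08:15:00 -0500
Subject: Lunch?

See you at noon.
"""


def extract_metadata(raw):
    """Collect sender, recipients, and send time without reading the body."""
    msg = message_from_string(raw)
    sender = getaddresses([msg["From"]])[0][1]
    recipients = [addr for _name, addr in getaddresses([msg["To"]])]
    sent_at = parsedate_to_datetime(msg["Date"])
    return {"sender": sender, "recipients": recipients, "sent_at": sent_at}
```

Collected over many messages, records like these show who talks to whom and at what hours, and the timezone offset in the date header hints at where the sender was.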


Search

The vast majority of people discover content on the web by using a search engine. The keywords searched can provide useful mineable information on a person's activities, interests, and private information like health conditions, family problems, etc. This is especially useful if it can be combined with the contents of emails, the web sites a user visits, and their physical movements geographically.


Voice

Many devices now offer voice recognition. The words spoken are typically sent to a server which performs speech-to-text conversion. This text (and the original audio, if storage space permits) can be stored by the server and added to the known information about a user.


MAC addresses

Every Bluetooth, WiFi, or other networking interface in a computing device has a unique Media Access Control (MAC) address that never changes. When a device connects to a WiFi network, Ethernet network, or another Bluetooth device, this MAC address is broadcast to the router or other devices on the network. Since the MAC address on a device never changes, companies that provide WiFi networks to places like airports, train stations, coffee shops, retail shops and public parks can track the usage and behavior of a particular device across many different pseudo-public WiFi networks, and develop profiles of the people who own these devices that they are then free to sell or use in any way they wish.

Some more sophisticated mobile devices may broadcast pseudo-random MAC addresses, rather than the device's real MAC address, in order to try to thwart this tracking. However, when MAC addresses are used in tandem with other tracking mechanisms, a WiFi or Bluetooth tracking system may still be able to correctly identify the device based on other aspects of its profiling data.
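One simple signal a tracking system could look at is the 'locally administered' bit in the first octet of a MAC address, which randomized addresses typically have set. The sketch below is an assumption-laden illustration: the function name and the sample addresses are made up, and a set bit does not prove that an address was randomized.

```python
def is_locally_administered(mac):
    """Return True if the MAC address has the 'locally administered' bit set.

    That bit is the second-lowest bit of the first octet. Randomized
    addresses typically set it; factory-assigned addresses leave it clear.
    """
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    return bool(first_octet & 0b00000010)
```

For example, the made-up address da:a1:19:00:00:01 has the bit set, while 00:1a:2b:3c:4d:5e does not.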

Cookies and sessions

In HTTP, a web server can send a header to a web client that instructs the client to store a bit of data. Web browsers are designed to oblige these requests. This bit of data is called a 'cookie', or sometimes a 'session', and it usually takes the form of a key/value pair. Web servers can store many of these in a web browser. Once stored, every subsequent request that client makes to that same server includes those bits of data in the request headers.

Cookies are often used to track individual visitors to web sites. For example, a web server might assign a web browser a unique identification number 666. Every subsequent request that same web browser makes to the server includes that ID number, 666, so the server 'knows' that these requests come from the same web browser, as differentiated from all the other web browsers accessing the same content on the server, which have been assigned different ID numbers.
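The exchange of headers can be sketched with Python's standard http.cookies module. The cookie name visitor_id and both function names are hypothetical choices for this illustration; a real server would pick its own.

```python
from http.cookies import SimpleCookie


def build_set_cookie_header(visitor_id):
    """Server side: produce the value of a Set-Cookie response header."""
    cookie = SimpleCookie()
    cookie["visitor_id"] = str(visitor_id)
    cookie["visitor_id"]["path"] = "/"
    return cookie["visitor_id"].OutputString()


def read_visitor_id(cookie_header):
    """Server side: recover the ID from a later request's Cookie header."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("visitor_id")
    return morsel.value if morsel is not None else None
```

The server would send the first value once in a Set-Cookie header. On each later request the browser sends the stored pair back in its Cookie header, and the second function reads the same ID out again.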

Flash cookies

Web sites that use Flash content are able to store data on a user's device in a storage space separate from that used by regular cookies. This means that when a user deletes the cookies in their web browser, the data in the Flash cookie storage area is not deleted, since it is managed by the Flash application, which is a separate application from the web browser.

Given the declining popularity of the Flash player in web browsers recently in favor of similar functionality offered in HTML 5, the use of Flash cookies has declined, and there is a corresponding surge in the use of HTML's 'local storage' features.

Local storage

Local storage, also known as 'web storage', is a mechanism similar to cookies, where code sent by a web server to a web client can instruct the web browser to store large quantities of data on the user's device. This feature has gained popularity with the decline of Flash and the rise of browser support for HTML 5.

Local storage allows far more data to be stored on the client machine than cookies do. The client sends this data to the server only when requested to do so, usually via JavaScript code on a web page, rather than sending it automatically to the server with every HTTP request the browser makes, as is the case with cookies.

Beacons

Bits of JavaScript code running in the web browser can send data about the user's browsing habits continuously to the web server. These are called beacons.

Images in email

One of the old classic tricks is to place a link to an image into an email. When the recipient opens the email in their email client (whether that be a web-based client or not), the client will usually try to load that image from the server and display it to the user. When the client requests the image from the server, the server 'knows' that the client opened the email.
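A minimal sketch of this trick in Python follows. The tracking endpoint, the parameter names m and r, and both function names are assumptions made for the illustration; the idea is simply that each recipient gets a unique image URL, so the request for the image identifies who opened which message.

```python
import html
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical tracking endpoint on a reserved placeholder domain.
PIXEL_ENDPOINT = "https://tracker.example.com/pixel.gif"


def build_pixel_tag(message_id, recipient_id):
    """Sender side: build an invisible 1x1 image tag unique to one recipient."""
    query = urlencode({"m": message_id, "r": recipient_id})
    url = html.escape(f"{PIXEL_ENDPOINT}?{query}", quote=True)
    return f'<img src="{url}" width="1" height="1" alt="">'


def record_open(requested_url, opened_log):
    """Server side: note which recipient's mail client fetched the image."""
    params = parse_qs(urlparse(requested_url).query)
    event = (params["m"][0], params["r"][0])
    opened_log.append(event)
    return event
```

Email clients that block remote images by default defeat this particular technique until the user chooses to load them.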

Browser fingerprinting

A combination of your web browser's features and settings can allow a web site to uniquely identify you. The chances of any two people having all the same browser settings are very low.

At a base level, these settings could include:

  1. UserAgent
  2. Language
  3. Color Depth
  4. Screen Resolution
  5. Timezone
  6. Has session storage or not
  7. Has local storage or not
  8. Has indexed DB
  9. Has IE specific 'AddBehavior'
  10. Has open DB
  11. CPU class
  12. Platform
  13. DoNotTrack or not
  14. Full list of installed fonts (maintaining their order, which increases the entropy), implemented with Flash.
  15. A list of installed fonts, detected with JS/CSS (side-channel technique) - can detect up to 500 installed fonts without Flash
  16. Canvas fingerprinting
  17. WebGL fingerprinting
  18. Plugins (IE included)
  19. Is AdBlock installed or not
  20. Has the user tampered with the browser's languages
  21. Has the user tampered with its screen resolution
  22. Has the user tampered with its OS
  23. Has the user tampered with its browser
  24. Touch screen detection and capabilities
  25. Pixel Ratio
  26. System's total number of logical processors available to the user agent.

Added to this could be:

  • Multi-monitor detection,
  • Internal HashTable implementation detection
  • WebRTC fingerprinting
  • Math constants
  • Accessibility fingerprinting
  • Camera information
  • DRM support
  • Accelerometer support
  • Virtual keyboards
  • List of supported gestures (for touch-enabled devices)
  • Pixel density
  • Video and audio codecs availability
  • Audio stack fingerprinting

And these could be combined with:

  • IP Address
  • Location
  • Cookies
  • Local storage
  • ...and much more...

What are the chances of any two people having 10 of these settings set to exactly the same values?
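As a rough worked example of that question, the sketch below hashes a set of browser settings into a single identifier and estimates how identifying a combination is. It assumes, unrealistically, that the settings are independent of one another; the function names and the sample figures are made up for illustration.

```python
import hashlib
import json
import math


def fingerprint(settings):
    """Hash a dict of browser settings into one stable identifier."""
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def combined_entropy_bits(share_by_setting):
    """Bits of identifying information, ASSUMING independent settings.

    share_by_setting maps each setting name to the fraction of all users
    who share this visitor's value for that setting.
    """
    return sum(-math.log2(share) for share in share_by_setting.values())
```

If each of 10 settings had a value shared by one user in four, each would contribute 2 bits, for 20 bits in total. That narrows a visitor down to roughly one in 2^20, or about one in a million. Real settings are correlated, so the true figure would be lower than this simple sum suggests.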

Ad networks

Companies that make tools to help web site creators require that those web site creators place code on their sites in order to use their tools. In addition to providing helpful functionality, this code tracks visitors to those sites using many or all of the available contemporary tracking techniques. Since many different web sites may use tools made by the same company, that company is able to track individuals not just in their usage of a single web site, but across all the web sites that use their code.
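The cross-site effect can be sketched as follows. The request log is entirely made up, as are the site names and the function name; it stands in for what a single tool provider's server might record when the same cookie ID arrives from pages on different sites that embed its code.

```python
from collections import defaultdict

# Made-up request log as seen by one third-party tool provider:
# (visitor ID from the provider's cookie, site embedding the code, page).
REQUEST_LOG = [
    ("visitor-666", "news.example", "/politics"),
    ("visitor-123", "shop.example", "/shoes"),
    ("visitor-666", "shop.example", "/baby-strollers"),
    ("visitor-666", "health.example", "/allergies"),
]


def build_profiles(request_log):
    """Group page views by visitor ID, regardless of which site they hit."""
    profiles = defaultdict(list)
    for visitor_id, site, page in request_log:
        profiles[visitor_id].append((site, page))
    return dict(profiles)
```

Here no single site knows that visitor-666 reads political news, shops for strollers, and looks up allergies, but the company whose code runs on all three sites does.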

The current master of this genre of tracking is Google. In addition to other tracking systems in their 'free' web browser Chrome, their search engine, Gmail email system, and all mobile devices running the Android operating system, Google provides tools to help most webmasters do at least the following helpful tasks:

  • automatically place ads on the web pages to generate some revenue for the site managers
  • see analytics about who their visitors are and what they do on the web site
  • enable 'social sharing' among users by placing '+1' buttons on the web site
  • and many more tools.

Nobody knows quite how many websites use Google code. As of 2015, just one product, Google Analytics, was estimated to be in use on upwards of 50 million web sites, including virtually all of the web's most popular web sites. Visitors can be tracked as they jump from one website to another, since all those sites run the same tracking code.

Ad networks with similar methods are run by ad-driven tech companies such as Facebook and Amazon, and many smaller players sell their own tracking data to these big companies.

