Computing Applications Virtual extension

Hidden Surveillance By Web Sites: Web Bugs in Contemporary Use

By David Martin, Hailin Wu, and Adil Alsaid

Posted Dec 1 2003

When a Web browser navigates to a Web page, it usually has to process several distinct network transactions, each referring to a different part of the requested page. These parts include the main HTML source of a Web page and any of its images, scripts, layers, CSS, Java applets, ActiveX objects, and so forth. These network transactions are usually recorded in remote Web server log files, each entry constituting one piece of the user’s clickstream—the sequence of Web sites visited by the user over time. These log files and clickstreams can later be examined in order to better understand visitor traffic patterns, interests, site problems, and even break-in attempts. Each of these logging and analysis purposes amounts to a form of user surveillance, some more intrusive than others.
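As a rough illustration of how individual log entries add up to a clickstream, the following Python sketch groups entries by visitor and orders them by time. The log entries, visitor labels, and example hostnames are all hypothetical; real server logs carry more fields and identify visitors less directly.

```python
from collections import defaultdict

# Hypothetical log entries: (visitor label, timestamp, requested URL).
# These values are invented for illustration only.
LOG = [
    ("visitor-a", 3, "http://news.example/story2.html"),
    ("visitor-b", 1, "http://shop.example/"),
    ("visitor-a", 1, "http://news.example/"),
    ("visitor-a", 2, "http://news.example/logo.gif"),
]


def clickstreams(entries):
    """Group log entries by visitor and order each group by timestamp."""
    by_visitor = defaultdict(list)
    for visitor, when, url in entries:
        by_visitor[visitor].append((when, url))
    # Sorting the (timestamp, url) pairs puts each visitor's requests in order.
    return {v: [url for _, url in sorted(hits)] for v, hits in by_visitor.items()}


if __name__ == "__main__":
    for visitor, urls in clickstreams(LOG).items():
        print(visitor, urls)
```

Note that the single page view by visitor-a produces two entries (the HTML source and an image), which is why one visit can leave several traces.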

Privacy advocates are concerned about the possible uses for these massive data troves, because anyone who knows the URLs that you visit pretty much knows what you are reading on the Web. Given a typical URL, it’s trivial to turn it into the full Web page text: just put the URL into a Web browser. There are currently no federal restrictions on the commercial acquisition and use of clickstream information in the U.S. other than the Federal Trade Commission’s enforcement of site privacy policies under its authority to investigate “unfair and deceptive trade” practices. In other words, the FTC regulates statements about privacy practices, but not the practices themselves. The government does face some restrictions on its own acquisition of clickstream data, but they are low hurdles: the USA PATRIOT Act of October 2001 allows clickstream information to be gathered once the investigators certify before a magistrate that it is “likely […] to be relevant to an ongoing criminal investigation.” The rules appear to be far looser in non-prosecutorial settings. For example, in Knight v. Exeter School District (New Hampshire), the court ruled that a student’s parent had the right to review a school’s proxy log [6]. And in Indiana, a reporter cited the state’s open records law in order to obtain browser history, bookmark, and cookie files from 49 school superintendents [12].

Meanwhile, Web site operators tend to downplay the importance of clickstream logs. They point out that server activity logging is routine, indiscriminate, and enabled by default on Web servers. Many site operators promise not to use the data for personal identification purposes unless compelled to under law. One is left with the impression that while the data fills up disk drives, there may be no real plans for it.

How, then, can we distinguish data collected intentionally for surveillance purposes from data collected routinely as part of a Web site’s operation? We approached this question by singling out certain Web elements that appear to have no purpose other than information gathering and that can also be recognized automatically. Counting these elements then provides a lower bound, possibly a very loose one, on the amount of intentional tracking on the Web. These elements of interest are Web bugs, which are covert (but not too covert) channels between a Web site and a third-party data collector. Web bugs are also known as pixel tags, Web beacons, and clear GIFs.

We started with two lists of Web sites chosen for inclusion in the FTC’s 2000 Web privacy study [5]. The popular list consisted of 91 Web sites designated as the most popular in January 2000, and the random list consisted of 335 consumer-oriented Web sites. Some of these sites no longer exist, so we narrowed the lists down to 84 sites from the popular list and 298 from the random list. We used a modified version of Bugnosis [1, 2], our own Web bug detection engine, to analyze an average of about 90 pages from each site. We also read the privacy policies of the popular sites to see how well they corresponded with each site’s actual practices.

Under our definition, we found that 58% of popular Web sites and 36% of random Web sites contain Web bugs somewhere close to their home page. That’s a lot! After reading the privacy policies of the popular Web sites, we found that 29% (14 out of 48) of those that do contain Web bugs say nothing about third-party content and corresponding surveillance capabilities, let alone the possibility of Web bugs on their sites. Without an appropriate disclosure, the users of these sites really have no way to know that they may be participating in a third-party surveillance network. On the brighter side, 67% (26 out of 39) of the third parties we encountered did provide machine-readable (P3P [9]) privacy policies on their sites.

Two other reports have specifically addressed the distribution of Web bugs. Reinke at E-Soft Inc. concentrated on identifying and ranking the third-party recipients of Web bug data [10]. His report first ranked the third parties by the number of Web sites that carry their Web bugs. Taking the Web sites’ popularities into account, Reinke also ranked the third parties by the total amount of surveillance data the third parties are likely to receive. The sample space was 701,176 pages retrieved from 101,991 different Web sites.

Murray and Cowart at Cyveillance Inc. considered changes in Web bug practices over time by comparing results from two samples, and concluded that Web bugs are almost five times as likely to appear on a random page in 2001 as they were on a random page in 1998 [7]. The authors also noted a very strong correlation between Web bug presence and the presence of content from “leading brand names” on Web pages in 2001. Their sample space included pages from over 1,000,000 sites.

Why Would Anyone Care What We Read?

In an Oct. 4, 2001 press release announcing the development of standards for Web bug use by its members [8], the Network Advertising Initiative wrote:

“Web beacons are a tool that can be used online to deliver a cookie in a third party context. This allows companies to perform many important tasks—including unique visitor counts, web usage patterns, assessments of the efficacy of ad campaigns, delivery of more relevant offers, and tailoring of web site content. The web beacon’s cookie is typically delivered or read through a single pixel on the host site.”

According to the Privacy Foundation [11], Web bugs may also be used in order to:

Count the number of times a particular Web page has been viewed.
Transfer demographic data (gender, age, zip code, and so on) about visitors of a Web site to an Internet marketing company. This information is typically used for online profiling purposes.
Transfer personally identifiable information (name, address, phone number, email address) about visitors of a Web site to an Internet marketing company. This information is typically used for online profiling purposes. It also can be combined with other offline demographic data such as household income, number of family members, type(s) of car(s) owned, mortgage balance, and so forth.
Profile individuals by tracking what Web pages they visit across many different Web sites.
Pass off a person’s search strings from a search engine to an Internet marketing company.
Match a purchase to a banner ad that a person viewed before making the purchase. The Web site that displayed the banner ad is typically given a percentage of the sale.
Count the number of times a banner ad has appeared.
Report back the type and configuration of the Internet browser used by a visitor to a Web site. This information is typically used in aggregate form to determine what kind of content can be put on a Web site to be viewed by most visitors.
Allow a third party to provide server logging to a Web site that cannot do this function itself.

Keep in mind that while not all of these uses are inherently threatening to privacy, they’re all a form of user population surveillance. Web users are at an inherent disadvantage here, because the surveillance mechanism is designed to go unnoticed. Think of a hidden video camera at the entrance of a public library. Maybe it’s only there to count the total number of visitors, or maybe it’s recording every face it sees forever. But it hardly makes sense to ask which it is without knowing the camera is there at all.

Defining Web Bugs

So what are these Web bugs, really?

Generally speaking, a Web bug is any HTML element that is intended to go unnoticed by users, and is present partially for surveillance purposes.

This captures our intent, but we’re going to have real problems writing a computer program that figures out whether an arbitrary HTML element is supposed to be noticed and why it is there. We had to make this definition much tighter.

We start by looking only at third-party images that are tiny (7 square pixels or less). Unfortunately, it’s common practice among Web designers to use tiny images to achieve spacing and alignment effects, since HTML offers no other way to do it. In order to minimize the possibility of confusing spacing images with tracking devices, we narrowed our focus to third-party images that have associated third-party cookies. Since cookies are the standard mechanism for uniquely identifying users, this step eliminates the third parties who clearly aren’t interested in that type of tracking. And then we exclude any images that appear more than once on the page. This helps because, while spacing and alignment images might naturally appear multiple times on a page, there is little sense in embedding more than one identical surveillance device on the same page.

The final ingredient is an expert knowledge database of 29 URL patterns [2]. Matching a pattern in this database can either force an image to be identified as a Web bug or prevent it from being so identified even if it has the other characteristics of a Web bug. But ultimately, this database was not a big decision factor: only 5% of the Web bugs found in this study are due to a database match, and less than 0.1% of apparent Web bugs were ignored because of a database entry.

More specifically, a Web bug is an HTML element that is an image, is tiny, is served by a third party, has an associated cookie, and appears only once on the page. Or, if the element’s URL matches one of the 29 expert knowledge database patterns, then the database decides whether it’s a Web bug or not.
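The rules above can be sketched in Python roughly as follows. This is a minimal sketch, not the Bugnosis implementation: the `Image` record, the `site_of` helper (which treats the last two host labels as the "site"), the two placeholder pattern lists, and the choice to check the "force" patterns before the "prevent" patterns are all assumptions made for illustration. The actual 29 database patterns are not reproduced here.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

MAX_AREA = 7  # "tiny" means 7 square pixels or less

# Hypothetical stand-ins for the expert knowledge database entries.
FORCE_BUG = ["*//tracker.example/*"]
FORCE_NOT_BUG = ["*//spacer.example/*"]


@dataclass
class Image:
    url: str
    width: int
    height: int
    has_cookie: bool  # a third-party cookie is associated with this image


def site_of(url):
    """Crude notion of a 'site': the last two labels of the host name (assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def is_web_bug(image, page_url, all_images):
    """Apply the database override first, then the five structural tests."""
    if any(fnmatchcase(image.url, p) for p in FORCE_BUG):
        return True
    if any(fnmatchcase(image.url, p) for p in FORCE_NOT_BUG):
        return False
    tiny = image.width * image.height <= MAX_AREA
    third_party = site_of(image.url) != site_of(page_url)
    appears_once = sum(1 for other in all_images if other.url == image.url) == 1
    return tiny and third_party and image.has_cookie and appears_once
```

Under these assumptions, a 7×1 third-party image with a cookie that appears once on the page is flagged, while a same-site spacer image or a tiny image repeated on the page is not.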

Other researchers [7, 10] generally allow more images to be considered Web bugs than we do. For instance, they don’t require a Web bug to have an associated cookie or to appear only once on the page. But they’re more demanding about the image size—while we’ll consider a 7×1 image small enough to be intended to go unnoticed, they consider only a 1×1 image—the tiniest possible image—to be hidden.

Although our study focused on Web pages, Web bugs are equally effective when embedded within HTML email. In fact, they are arguably more problematic in email, because by definition, an email Web bug creator already has a significant piece of personal information about the Web bug recipient: the user’s email address.

We used a customized version of Bugnosis [1] to collect and analyze the Web pages we captured. Bugnosis is a Web bug detector we built as a simple privacy monitor add-on for Internet Explorer; see [2] for more details. One of the purposes of this study was to improve the Web bug detection rules for the next version of Bugnosis, so the definition we used in this study is somewhat different from that used by the current version (1.1).

Conclusion

The fact that over half of popular sites and over a third of random sites use Web bugs means that even more sites are doing some type of surveillance, since Web bugs are really only one approach to generating log entries. This points to an extremely significant level of routine surreptitious surveillance on the Web, and ultimately, to the inappropriateness of the popular understanding of the Web as an electronic library.

The American Library Association’s Code of Ethics [3] states that “We protect each library user’s right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired, or transmitted.” And librarians follow through on this promise, routinely turning down requests for increased monitoring. As a result, when a library visitor presents a library card and checks out a book, they can be confident that the librarian’s record keeping will be limited to making sure the book is eventually returned (and perhaps noting how popular the book is). Even bookstores tend to fight when faced with a subpoena for purchase records.

But when visiting a Web site, users really don’t know whether they are contributing to a data bank about the site, or about themselves, or both. Every Web site could be different. Most users probably don’t even think of the data bank when surfing. And although libraries and bookstores simply have no effective means to track the idle browsing of their offerings, Web sites do, and they use it. Meanwhile, the ACM Code of Ethics [4] also calls on its members to exercise restraint: “It is the responsibility of professionals to maintain the privacy and integrity of data describing individuals. […] This imperative implies that […] personal information gathered for a specific purpose not be used for other purposes without consent of the individual(s). These principles […] prohibit procedures that capture or monitor electronic user data, including messages, without the permission of users or bona fide authorization related to system operation and maintenance.” The hidden monitoring of users by Web bugs and the unobservable extraction of clickstreams from server logs certainly appear to be at odds with this principle.

So we are left with some questions. Why do Web architects seem to feel so differently about user tracking than librarians do? Is this Web surveillance regimen more due to technical ease or to business necessity? And if surveillance is a given, how can we at least improve user awareness? P3P [9] begins to help, by relieving users of the absurd requirement that they track down all of the constituent pieces on the Web pages they visit and read each privacy policy before deciding whether to proceed. But P3P is essentially an automatic translator, and translators never get to choose the underlying message. Today, the message is that commercial Web sites want automated feedback, and that mainstream users are expected to not worry about generating a remote clickstream log that could be analyzed, searched, archived, and subpoenaed.

Sidebar: Example Web Bugs

Page URL: http://www.nwfusion.com/
Web bug URL: http://ad.doubleclick.net/ad/idg.pixel.global/idgpixel7.18.00;sz=1x1;ord=121421?
Our associated cookie: id=8000000d02fc90f

On a recent visit to Network World Fusion, we found two similar Web bugs. The first of them was placed on the page using this HTML code:

<IMG height=1 src="http://ad.doubleclick.net/ad/idg.pixel.global/idgpixel7.18.00;sz=1x1;ord=121421?" width=1 align=middle border=0>

The URL indicates the third party receiving the surveillance information while providing a free-form string in which arbitrary parameters can be passed as well. In this case, we see that the recipient is ad.doubleclick.net, but the parameters are not particularly interesting—the “ord” parameter just avoids serving the image out of a cache. When this image was fetched, our computer also transmitted its associated cookie to ad.doubleclick.net. Our cookie was the string “id=8000000d02fc90f”, which uniquely identified our Web browser.
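To make the structure of this URL concrete, here is a small Python sketch that separates the recipient host from the free-form parameters. It assumes the semicolon-separated `name=value` layout visible in this one URL; it is not a description of any documented DoubleClick format, and the function name `split_bug_url` is ours.

```python
from urllib.parse import urlsplit

# The sidebar URL, with the size parameter written as ASCII "1x1".
BUG_URL = "http://ad.doubleclick.net/ad/idg.pixel.global/idgpixel7.18.00;sz=1x1;ord=121421?"


def split_bug_url(url):
    """Return (recipient host, path, parameters) for a Web bug URL.

    Assumes parameters follow the path as ';name=value' segments,
    as they do in the example above.
    """
    parts = urlsplit(url)
    path, *raw_params = parts.path.split(";")
    params = dict(p.split("=", 1) for p in raw_params if "=" in p)
    return parts.hostname, path, params


if __name__ == "__main__":
    host, path, params = split_bug_url(BUG_URL)
    print(host)    # the third party that receives the request
    print(params)  # the free-form parameters carried in the URL
```

The cookie is not part of the URL at all; the browser attaches it to the request separately, which is why the URL alone can look uninteresting.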

These images here show how Bugnosis draws attention to Web bugs:

The image on the left is the original page, with the area around the word “Help” pulled out and magnified. Two Web bugs are positioned right next to the letter “p”; together, they are barely perceptible as a white dot on the original page. (Often, Web bugs are rendered in the “transparent” color, making them totally invisible.) On the right, we see the same page as modified by Bugnosis, which turned the white dots into 18×18 cartoonish bug images. We also brought up Bugnosis’ analysis window, which shows URLs and cookies for the Web bugs discovered on the page.

Page URL: http://sho.com/queer
Web bug URL: http://ad.linksynergy.com/fs-bin/show?id=9ty3z8Cgl5c&bids=28413.10000020
Our associated cookie: lsn_statp=i%2FOeBA%3D%3D;lsn_track=UmFuZG9tSVbn%2BaRRdAYn4BQn0uOnzQIAJMMl596jt5ZDcactK%2F7FCH9YyLVOOsSdnKt7W50uEpM%3D;linkshare_cookie14231=10000013%3A14231

This Web bug was placed on the site for the cable TV series “Queer as Folk”—an adult program about the lives and romances of a group of urban gay people. How many people are comfortable with the thought of unseen monitors noting their interest in particular sexual themes?

Page URL: http://www.msnbc.com/news/attack_front.asp?0SB=C944
Web bug URL: http://www.qwestdex.com/img/tracking/msnbc.gif?r=473
Our associated cookie: profile=userid&dmmjr&mypwd&0&font&regular&userfirstname&David&userlastname&Martin&city&DENVER&state&CO&key1&Home&plat1&39.691399&plng1&-104.928802

This last example is remarkable for the amount of personal information stored in the cookie. Both a username “dmmjr” and password “mypwd” are visible, along with a full name and even the latitude and longitude of an address. (At one point we had typed in the identity information along with a street address at the qwestdex.com site in order to locate a nearby merchant. It did not become apparent until later that this identity information and the computed latitude and longitude were being stored in our qwestdex.com cookie. Web bugs and cookies are not in themselves capable of pinpointing physical locations, but they are good at repeating what they have learned through other means.)

There are certainly some security issues with this example, but keep in mind that information like this can be maintained without being so visible. For example, Qwestdex could have instead stored all of the above information in a database entry indexed by a simple user ID; then this bug would look more like the Network World Fusion example.
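The alternative described here can be sketched in a few lines of Python. This is only an illustration under assumed names: the in-memory `PROFILES` store, the `issue_cookie` and `lookup` functions, and the `id=` cookie name are hypothetical and say nothing about how any real site is built.

```python
import secrets

# Hypothetical server-side store: opaque user ID -> profile details.
PROFILES = {}


def issue_cookie(profile):
    """Store the profile on the server and return a cookie holding only an ID."""
    user_id = secrets.token_hex(8)  # 16 random hex characters
    PROFILES[user_id] = dict(profile)
    return f"id={user_id}"


def lookup(cookie):
    """Recover the stored profile from the opaque ID in the cookie, if any."""
    name, _, user_id = cookie.partition("=")
    return PROFILES.get(user_id) if name == "id" else None


if __name__ == "__main__":
    cookie = issue_cookie({"city": "DENVER", "state": "CO"})
    print(cookie)          # reveals nothing but an opaque identifier
    print(lookup(cookie))  # the server can still recover the full profile
```

The point of the sketch is that an observer of the cookie learns less, while the collector's knowledge of the user is unchanged.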

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

DOI

10.1145/953460.953509

December 2003 Issue

Published: December 1, 2003

Vol. 46 No. 12

Pages: 258-264
