When a Web browser navigates to a Web page, it usually has to process several distinct network transactions, each referring to a different part of the requested page. These parts include the main HTML source of a Web page and any of its images, scripts, layers, CSS, Java applets, ActiveX objects, and so forth. These network transactions are usually recorded in remote Web server log files, each entry constituting one piece of the user’s clickstream—the sequence of Web sites visited by the user over time. These log files and clickstreams can later be examined in order to better understand visitor traffic patterns, interests, site problems, and even break-in attempts. Each of these logging and analysis purposes amounts to a form of user surveillance, some more intrusive than others.
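To make the log-to-clickstream step concrete, here is a minimal sketch of how one server's log entries might be grouped per visitor. It assumes the NCSA Common Log Format and identifies a visitor by client address alone; the article specifies neither, and the names `LOG_LINE`, `clickstreams`, and `SAMPLE_LOG` are hypothetical. A single server's log yields only that site's portion of a user's clickstream.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumption: entries follow the NCSA Common Log Format.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" \d{3} \S+'
)


def clickstreams(lines):
    """Group requested paths by client host, in chronological order."""
    hits_by_host = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip entries that do not fit the assumed format
        when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        hits_by_host[match["host"]].append((when, match["path"]))
    return {
        host: [path for _, path in sorted(hits)]
        for host, hits in hits_by_host.items()
    }


# Made-up entries using documentation-reserved addresses.
SAMPLE_LOG = [
    '192.0.2.7 - - [10/Oct/2001:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.7 - - [10/Oct/2001:13:55:37 -0700] "GET /logo.gif HTTP/1.0" 200 512',
    '198.51.100.3 - - [10/Oct/2001:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.7 - - [10/Oct/2001:13:57:12 -0700] "GET /health/article.html HTTP/1.0" 200 9120',
]

for host, paths in clickstreams(SAMPLE_LOG).items():
    print(host, paths)
```

Note that the page and its embedded image each produce a separate entry, which is why one page view corresponds to several network transactions in the log.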
Privacy advocates are concerned about the possible uses for these massive data troves, because anyone who knows the URLs that you visit pretty much knows what you are reading on the Web. Given a typical URL, it’s trivial to turn it into the full Web page text: just put the URL into a Web browser. There are currently no federal restrictions on the commercial acquisition and use of clickstream information in the U.S. other than the Federal Trade Commission’s enforcement of site privacy policies under its authority to investigate “unfair and deceptive trade” practices. In other words, the FTC regulates statements about privacy practices, but not the practices themselves. The government does face some restrictions on its own acquisition of clickstream data, but they are low hurdles: the USA PATRIOT Act of October 2001 allows clickstream information to be gathered once the investigators certify before a magistrate that it is “likely […] to be relevant to an ongoing criminal investigation.” The rules appear to be far looser in non-prosecutorial settings. For example, in Knight v. Exeter School District (New Hampshire), the court ruled that a student’s parent had the right to review a school’s proxy log [6]. And in Indiana, a reporter cited the state’s open records law in order to obtain browser history, bookmark, and cookie files from 49 school superintendents [12].
Meanwhile, Web site operators tend to downplay the importance of clickstream logs. They point out that server activity logging is routine, indiscriminate, and enabled by default on Web servers. Many site operators promise not to use the data for personal identification purposes unless compelled to under law. One is left with the impression that while the data fills up disk drives, there may be no real plans for it.
How then can we distinguish data collected intentionally for surveillance purposes from data that is collected routinely as part of a Web site’s operation? We approached this question by singling out certain Web elements that appear to have no purpose other than information gathering and that also can be recognized automatically. Counting these elements then provides a lower bound, possibly a very loose one, on the amount of intentional tracking on the Web. These elements of interest are Web bugs, which are covert (but not too covert) channels between a Web site and a third-party data collector. Web bugs are also known as pixel tags, Web beacons, and clear GIFs.
We started with two lists of Web sites chosen for inclusion in the FTC’s 2000 Web privacy study [5]. The popular list consisted of 91 Web sites designated as the most popular in January 2000, and the random list consisted of 335 consumer-oriented Web sites. Some of these sites no longer exist, so we narrowed the lists down to 84 sites from the popular list and 298 from the random list. We used a modified version of Bugnosis [1, 2], our own Web bug detection engine, to analyze approximately 90 pages from each site on average. We also read the privacy policies of the popular sites to see how well they corresponded with the sites’ actual practices.
Under our definition, we found that 58% of popular Web sites and 36% of random Web sites contain Web bugs somewhere close to their home page. That’s a lot! After reading the privacy policies of the popular Web sites, we found that 29% (14 out of 48) of those that do contain Web bugs say nothing about third-party content and corresponding surveillance capabilities, let alone the possibility of Web bugs on their sites. Without an appropriate disclosure, the users of these sites really have no way to know that they may be participating in a third-party surveillance network. On the brighter side, 67% (26 out of 39) of the third parties we encountered did provide machine-readable (P3P [9]) privacy policies on their sites.
Two other reports have specifically addressed the distribution of Web bugs. Reinke at E-Soft Inc. concentrated on identifying and ranking the third-party recipients of Web bug data [10]. His report first ranked the third parties by the number of Web sites that carry their Web bugs. Taking the Web sites’ popularities into account, Reinke also ranked the third parties by the total amount of surveillance data the third parties are likely to receive. The sample space was 701,176 pages retrieved from 101,991 different Web sites.
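The difference between the two kinds of ranking can be illustrated with a small sketch. This is not Reinke's actual method, whose weighting is not described here; it simply assumes each site has a single traffic figure and sums it per third party. The function name, site names, and numbers are all hypothetical.

```python
from collections import defaultdict


def rank_third_parties(observations, site_traffic):
    """Rank third parties two ways.

    observations: iterable of (site, third_party) pairs, one per Web bug seen.
    site_traffic: dict mapping site -> assumed traffic figure.
    Returns (ranking by number of carrying sites, ranking by summed traffic).
    """
    sites_by_party = defaultdict(set)
    for site, party in observations:
        sites_by_party[party].add(site)

    def reach(party):
        return len(sites_by_party[party])

    def exposure(party):
        return sum(site_traffic.get(site, 0) for site in sites_by_party[party])

    # Ties are broken alphabetically so the result is deterministic.
    by_sites = sorted(sites_by_party, key=lambda p: (-reach(p), p))
    by_traffic = sorted(sites_by_party, key=lambda p: (-exposure(p), p))
    return by_sites, by_traffic


# Made-up data: tracker-a is on more sites, tracker-b is on one busy site.
OBSERVATIONS = [
    ("news.example", "tracker-a.example"),
    ("shop.example", "tracker-a.example"),
    ("blog.example", "tracker-a.example"),
    ("portal.example", "tracker-b.example"),
]
TRAFFIC = {"news.example": 100, "shop.example": 50,
           "blog.example": 10, "portal.example": 1000}

by_sites, by_traffic = rank_third_parties(OBSERVATIONS, TRAFFIC)
print(by_sites)
print(by_traffic)
```

In this made-up data the two rankings disagree: a third party present on only one very popular site can receive more data than one spread across several small sites.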
Murray and Cowart at Cyveillance Inc. considered changes in Web bug practices over time by comparing results from two samples, and concluded that Web bugs are almost five times as likely to appear on a random page in 2001 as they were on a random page in 1998 [7]. The authors also noted a very strong correlation between Web bug presence and the presence of content from “leading brand names” on Web pages in 2001. Their sample space included pages from over 1,000,000 sites.
Why Would Anyone Care What We Read?
In an Oct. 4, 2001 press release announcing the development of standards for Web bug use by its members [8], the Network Advertising Initiative wrote:
“Web beacons are a tool that can be used online to deliver a cookie in a third party context. This allows companies to perform many important tasks—including unique visitor counts, web usage patterns, assessments of the efficacy of ad campaigns, delivery of more relevant offers, and tailoring of web site content. The web beacon’s cookie is typically delivered or read through a single pixel on the host site.”
According to the Privacy Foundation [11], Web bugs may also be used in order to:
- Count the number of times a particular Web page has been viewed.
- Transfer demographic data (gender, age, zip code, and so on) about visitors of a Web site to an Internet marketing company. This information is typically used for online profiling purposes.
- Transfer personally identifiable information (name, address, phone number, email address) about visitors of a Web site to an Internet marketing company. This information is typically used for online profiling purposes. It also can be combined with other offline demographic data such as household income, number of family members, type(s) of car(s) owned, mortgage balance, and so forth.
- Profile individuals by tracking what Web pages they visit across many different Web sites.
- Pass off a person’s search strings from a search engine to an Internet marketing company.
- Match a purchase to a banner ad that a person viewed before making the purchase. The Web site that displayed the banner ad is typically given a percentage of the sale.
- Count the number of times a banner ad has appeared.
- Report back the type and configuration of the Internet browser used by a visitor to a Web site. This information is typically used in aggregate form to determine what kind of content can be put on a Web site to be viewed by most visitors.
- Allow a third party to provide server logging to a Web site that cannot do this function itself.
Keep in mind that while not all of these uses are inherently threatening to privacy, they’re all a form of user population surveillance. Web users are at an inherent disadvantage here, because the surveillance mechanism is designed to go unnoticed. Think of a hidden video camera at the entrance of a public library. Maybe it’s only there to count the total number of visitors, or maybe it’s recording every face it sees forever. But it hardly makes sense to ask which it is without knowing the camera is there at all.
Defining Web Bugs
So what are these Web bugs, really?
Generally speaking, a Web bug is any HTML element that is intended to go unnoticed by users, and is present partially for surveillance purposes.
This captures our intent, but we’re going to have real problems writing a computer program that figures out whether an arbitrary HTML element is supposed to be noticed and why it is there. We had to make this definition much tighter.
We start by looking only at third-party images that are tiny (7 square pixels or less). Unfortunately, it’s common practice among Web designers to use tiny images to achieve spacing and alignment effects, since HTML offers no other way to do it. In order to minimize the possibility of confusing spacing images with tracking devices, we narrowed our focus to third-party images that have associated third-party cookies. Since cookies are the standard mechanism for uniquely identifying users, this step eliminates the third parties who clearly aren’t interested in that type of tracking. And then we exclude any images that appear more than once on the page. This helps because, while spacing and alignment images might naturally appear multiple times on a page, there is little sense in embedding more than one identical surveillance device on the same page.
The final ingredient is an expert knowledge database of 29 URL patterns [2]. Matching a pattern in this database can either force an image to be identified as a Web bug or prevent it from being so identified even if it has the other characteristics of a Web bug. But ultimately, this database was not a big decision factor: only 5% of the Web bugs found in this study are due to a database match, and less than 0.1% of apparent Web bugs were ignored because of a database entry.
More specifically, a Web bug is an HTML element that is an image, is tiny, is served by a third party, has an associated cookie, and appears only once on the page. Or, if the element’s URL matches one of the 29 expert knowledge database patterns, then the database decides whether it’s a Web bug or not.
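The rule above can be sketched in a few lines of code. This is an illustration of the stated definition, not the Bugnosis implementation. The `Image` record, the `site_of` helper (which crudely treats the last two host labels as the "site" to decide what counts as third party), and the sample pattern are assumptions introduced here; the real database patterns are not reproduced.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlsplit

MAX_AREA = 7  # "tiny" means 7 square pixels or less


@dataclass
class Image:
    url: str
    width: int
    height: int
    has_cookie: bool  # a cookie accompanies the image request or response


def site_of(url):
    """Crude assumption: the last two host labels identify the site."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def is_web_bug(image, page_url, page_images, patterns=()):
    """Apply the definition: database match first, then the five tests.

    patterns: iterable of (regex, verdict) pairs standing in for the
    expert knowledge database; a match decides the outcome outright.
    """
    for pattern, verdict in patterns:
        if re.search(pattern, image.url):
            return verdict
    copies = sum(1 for other in page_images if other.url == image.url)
    return (
        image.width * image.height <= MAX_AREA        # tiny
        and site_of(image.url) != site_of(page_url)   # third party
        and image.has_cookie                          # associated cookie
        and copies == 1                               # appears only once
    )


# Made-up page contents for illustration.
PAGE = "http://www.example.com/"
tracker = Image("http://ads.example.net/t.gif?id=1", 1, 1, True)
spacer = Image("http://www.example.com/spacer.gif", 1, 1, False)
IMAGES = [tracker, spacer, spacer]

print(is_web_bug(tracker, PAGE, IMAGES))
print(is_web_bug(spacer, PAGE, IMAGES))
```

Checking the database before the other tests mirrors the statement that a pattern match can either force or prevent identification regardless of the image's other characteristics.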
Other researchers [7, 10] generally allow more images to be considered Web bugs than we do. For instance, they don’t require a Web bug to have an associated cookie or to appear only once on the page. But they’re more demanding about the image size—while we’ll consider a 7×1 image small enough to be intended to go unnoticed, they consider only a 1×1 image—the tiniest possible image—to be hidden.
Although our study focused on Web pages, Web bugs are equally effective when embedded within HTML email. In fact, they are arguably more problematic in email, because by definition, an email Web bug creator already has a significant piece of personal information about the Web bug recipient: the user’s email address.
We used a customized version of Bugnosis [1] to collect and analyze the Web pages we captured. Bugnosis is a Web bug detector we built as a simple privacy monitor add-on for Internet Explorer; see [2] for more details. One of the purposes of this study was to improve the Web bug detection rules for the next version of Bugnosis, so the definition we used in this study is somewhat different from that used by the current version (1.1).
Conclusion
The fact that over half of popular sites and over a third of random sites use Web bugs means that even more sites are doing some type of surveillance, since Web bugs are really only one approach to generating log entries. This points to an extremely significant level of routine surreptitious surveillance on the Web, and ultimately, to the inappropriateness of the popular understanding of the Web as an electronic library.
The American Library Association’s Code of Ethics [3] states that “We protect each library user’s right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired, or transmitted.” And librarians follow through on this promise, routinely turning down requests for increased monitoring. As a result, when a library visitor presents a library card and checks out a book, they can be confident that the librarian’s record keeping will be limited to making sure the book is eventually returned (and perhaps noting how popular the book is). Even bookstores tend to fight when faced with a subpoena for purchase records.
But when visiting a Web site, users really don’t know whether they are contributing to a data bank about the site, or about themselves, or both. Every Web site could be different. Most users probably don’t even think of the data bank when surfing. And although libraries and bookstores simply have no effective means to track the idle browsing of their offerings, Web sites do, and they use it. Meanwhile, the ACM Code of Ethics [4] also calls on its members to exercise restraint: “It is the responsibility of professionals to maintain the privacy and integrity of data describing individuals. […] This imperative implies that […] personal information gathered for a specific purpose not be used for other purposes without consent of the individual(s). These principles […] prohibit procedures that capture or monitor electronic user data, including messages, without the permission of users or bona fide authorization related to system operation and maintenance.” The hidden monitoring of users by Web bugs and the unobservable extraction of clickstreams from server logs certainly appear to be at odds with this principle.
So we are left with some questions. Why do Web architects seem to feel so differently about user tracking than librarians do? Is this Web surveillance regimen more due to technical ease or to business necessity? And if surveillance is a given, how can we at least improve user awareness? P3P [9] begins to help, by relieving users of the absurd requirement that they track down all of the constituent pieces on the Web pages they visit and read each privacy policy before deciding whether to proceed. But P3P is essentially an automatic translator, and translators never get to choose the underlying message. Today, the message is that commercial Web sites want automated feedback, and that mainstream users are expected to not worry about generating a remote clickstream log that could be analyzed, searched, archived, and subpoenaed.