Communications of the ACM

Contributed articles

Designing Statistical Privacy For Your Data

Credit: Andrij Borys Associates / Shutterstock

In 2006, AOL released a file containing search queries posed by many of its users. The user names were replaced with random hashes, though the query text was not modified. It turns out some users had queried their own names (so-called "vanity queries") and nearby locations, such as local businesses. As a result, it was not difficult for reporters to find and interview an AOL user1 and then learn personal details about her (such as age and medical history) from the rest of her queries.

Could AOL have protected all its users by also replacing each word in the search queries with a random hash? Probably not; Kumar et al.27 showed that word co-occurrence patterns would provide clues about which hashes correspond to which words, thus allowing an attacker to partially reconstruct the original queries.

Such privacy concerns are not unique to Web-search data. Businesses, government agencies, and research groups routinely collect data about individuals and need to release some form of it for a variety of reasons (such as meeting legal requirements, satisfying business obligations, and encouraging reproducible scientific research). However, they must also protect sensitive information in the raw data, including identities, facts about individuals, trade secrets, and other application-specific considerations.

The privacy challenge is that sensitive information can be inferred from data releases in many ways. Homer et al.20 showed that participants in genomic research studies may be identified from the publication of aggregated research results. Greveler et al.17 showed that smart meter readings can be used to identify the TV shows and movies being watched in a target household. Coull et al.6 showed that webpages viewed by users can be deduced from metadata about network flows, even when server IP addresses are replaced with pseudonyms. And Goljan and Fridrich16 showed how cameras can be identified from noise in the images they produce.

