Why is Privacy So Hard?

Carnegie Mellon Associate Professor Jason Hong

Why is privacy so hard? Why, after so much negative press, are we still being constantly tracked on the web and on our smartphones? Why, after so many years of bad incidents and high-profile data breaches, has the problem of privacy not yet been solved?

This blog article is based on a talk I recently gave at UCSD's Halıcıoğlu Data Science Institute about Security and Privacy at the Edge. I sketched out why privacy is hard, using examples from the web and from smartphones, and extrapolated a bit as to how we'll likely fail in privacy with the emerging Internet of Things as well. My original slides (here) had 10 points, but this list goes all the way to 11. If you're interested in my thoughts about more technical approaches to privacy for the Internet of Things, see my roadmap paper here [1].

1. Privacy is a broad and fuzzy term

Privacy is a broad and ill-defined term that captures a wide range of concerns about our relationships with others. In fact, there isn't a widely agreed upon definition of privacy that fits all of the cases people care about. Privacy has been described as "the right to be let alone" [2], control and feedback over one's data [3], data privacy (which led to the Fair Information Practices [4], which are the basis of the vast majority of legislation on privacy), anonymity (which is a popular definition among computer science researchers), presentation of self [5], boundary negotiation [6], the right to be forgotten, contextual integrity [7] (taking political, ethical, and social norms into account), and more.

There are two points to note here. First, each of these perspectives on privacy leads to a different way of addressing people's concerns. The "right to be let alone" leads to things like do not call lists and spam filters. The "right to be forgotten" leads to people being able to request that web pages about them be deleted from search engines.

Second, data privacy, which is popular among regulators, tends to be procedurally oriented. For example, did you follow this set of rules? Did you check off all of the boxes? These kinds of procedural approaches can cover a wide range of cases, but their effects can also be rather hard to measure. They also place a strong emphasis on following the rules, regardless of whether doing so actually improves outcomes. For example, because of Europe's new General Data Protection Regulation (GDPR), we now get notices from every web site informing us that cookies are being used. Does anyone believe that this has improved privacy?

In contrast to procedural approaches, outcome-oriented approaches are really great in that we can try to measure improvements in privacy, but they also tend to only address narrow cases. For example, k-anonymity [8] and its successor differential privacy [9] both seek to establish mathematical guarantees on identifiability of individuals in a collection of data. We computer scientists love this approach because we are really great at optimizing things, but these approaches only work in cases where there is lots of data. For example, it doesn't help much with cases where you are interacting with others who already know who you are.
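To make the measurement idea concrete, here is a minimal sketch in Python of how one might check k-anonymity for a small table. It is only an illustration under stated assumptions: the function name k_anonymity, the column names, and the toy records are all made up for this example, and it is not the algorithm from [8], which also covers how to generalize and suppress values to reach a target k.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share the
    same values for every quasi-identifier column. Returns 0 if empty."""
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical toy records; columns and values are invented for illustration.
records = [
    {"zip": "152**", "age_range": "20-29", "diagnosis": "flu"},
    {"zip": "152**", "age_range": "20-29", "diagnosis": "asthma"},
    {"zip": "152**", "age_range": "30-39", "diagnosis": "flu"},
    {"zip": "152**", "age_range": "30-39", "diagnosis": "diabetes"},
    {"zip": "152**", "age_range": "30-39", "diagnosis": "flu"},
]

# Treat zip and age_range as the attributes an outsider might already know.
k = k_anonymity(records, ["zip", "age_range"])
print(k)
```

With only five records, the smallest group here has two people in it, so each person hides among just one other. This is one way to see why these approaches need lots of data: the fewer records there are, the coarser the attributes must be made before every group is large enough to offer meaningful protection.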

2. There is a wide range of privacy risks

One of the reasons there are so many perspectives on privacy is that there are so many ways that privacy is being encroached on in modern life. Privacy isn't just about Big Brother, or about corporations collecting lots of data about us. As noted above, privacy is all about our relationships with all of the individuals and organizations we interact with, each of which poses different issues for privacy.

For example, with respect to friends and family, the concerns might be overly protective parents or nosy siblings. With employers, the concerns might include being constantly monitored at work or workplace discrimination. With governments, the concern might be civil liberties. With stalkers and hackers, the concerns might be one's personal safety or theft of highly personal and potentially embarrassing information.

The main point here is that we will need different solutions for each of these different relationships. For example, data privacy is a great framework for corporations and governments, but terrible for friends and families. You aren't going to hand your friends a privacy policy before you start chatting with them.

3. Technological capabilities are rapidly growing

Many information technologies are pushing the boundaries of privacy in unexpected and undesired ways. Data gathering is easier and more pervasive than at any previous point in history. Everything on the web is instrumented, making it almost trivial to collect web clicks, social media posts, likes, and search terms. More and more devices are coming with a rich array of sensors, making it possible to collect motion, location, orientation, and more.

Data storage and querying capabilities are becoming bigger and faster every day. Inferences via machine learning are becoming more powerful, being able to infer, for example, whether someone is depressed [10], someone's sexual orientation [11], or whether someone is pregnant or not [12]. Similarly, data sharing is becoming more widespread. For example, ad networks, ad exchanges, ad optimizers, and more are all sharing data about each of us with dozens of other entities in the advertising ecosystem, creating a complex web of relationships that is hard to understand, let alone communicate to consumers.

4. There are very strong incentives for companies to collect data about us

There is a simple chain of reasoning here as to why companies want to collect more data about us. More data means better machine learning models, which means improvements to the bottom line. For example, an average online ad displayed on a web page will see clickthrough rates of around 0.05% [13]. Anything that can improve those clickthrough rates can be worth millions of dollars.

There's also a new term my research team introduced to me recently: post-purchase monetization. For example, TVs are relatively expensive purchases, and the margins are razor thin. One way of improving sales is to lower initial purchase costs, and then use sensors and tracking to collect and sell data about the owners, primarily to improve targeted advertising.

Money makes the world go round, and especially so to the detriment of privacy.

5. Same device, same data, different perspectives

Technologists often think of computer technologies as being neutral, but that's definitely not the case when it comes to privacy. However, what makes things difficult here is that judging whether a given technology or data is good or bad for privacy sometimes depends on the context of use and the existing relationships between parties.

For example, as part of my dissertation work [14], I looked at how nurses used locator badges in hospitals. These badges could be used to find individuals in a hospital. From the hospital administration's perspective, these badges were useful for coordination purposes (e.g. "Where is Alice?") and for protecting individuals from spurious claims ("The nurse never came to visit me."). However, from many nurses' perspective, there was a common sentiment that these badges would be used for surveillance, for example tracking how long a nurse was in the restroom. In cases where there was clear value for nurses and where management was trusted, the locator badges were viewed mostly positively. However, if there were existing tensions between the nurses and management, the nurses tended to reject the badges. In other words, the exact same technology was viewed very differently depending on a number of external factors.

Here is another example, this time looking at data. Foursquare is a social media app that lets people check in to a place and share those check-ins with others. However, one person took this same check-in data and created Girls Around Me, showing photos from women's profiles on a map. The same data is acceptable in one context but creepy in another.

The challenge for computer scientists is that our systems don't handle this kind of context very well. It's much easier for us to say "this data type will always be handled this way" rather than "in this case, this data type is handled this way, but in this other case, it's handled this other way, and sometimes this third way."

6. The Burden on End-Users is Too High

Today, individuals have to make too many decisions about privacy. Does this web site have good privacy protections? Should I install this app? What are all the settings I need to know? What are all the terms and conditions? What are trackers, cookies, VPNs, anonymizers, advertising IDs, and incognito mode, and how do I use them to protect my privacy?

The Internet of Things also poses new kinds of burdens. For example, individuals have to be constantly vigilant about what devices are around them. In my first year teaching at Carnegie Mellon University, I was meeting with students in their lab space. I didn't know until the end of the semester that we were being broadcast on the Internet the entire time! It turns out that there was a small web camera pointed at the group workspace. There was no malicious intent, and nothing embarrassing was ever sent out, but it was still really surprising to me. These kinds of web cameras are becoming more pervasive and problematic. Earlier this year, a professor in the office next to mine shared a story about how there was a non-obvious webcam in the Airbnb that his family had rented, which was not disclosed clearly on the owner's Airbnb page [15]. This webcam possibly recorded some potentially embarrassing video of his family too.

Simply put, the burden of privacy is too high on end-users, and it's only going to get worse.

7. Developers have low knowledge and awareness of privacy

In surveys and interviews our team did with app developers, we found that the vast majority of app developers knew little about existing privacy laws or privacy frameworks, what privacy issues they should pay attention to, and how to address them [16,17]. I often summarize our research by saying that if you round up, the knowledge that a typical developer has about privacy is zero.

Similarly, developers also have low awareness of privacy problems with their apps. Past work has found that many developers didn't realize how much data their app was collecting, or that it was collecting data at all [16,17,18].

The main culprit here turns out to be third-party libraries. App developers often use third-party libraries to help with common functionality, such as analytics and advertising. In a year-long user study of apps, we found that over 40% of apps collect data only because of these libraries [19].

8. Companies get little pushback on privacy

A few years ago, my team created a web site called PrivacyGrade.org. The basic idea is that we developed a simple way of modeling privacy concerns of smartphone apps, combining really easy kinds of static analysis with crowd data. We applied this privacy model to all the free apps we could get from the Google Play store, and assigned grades of A, B, C, or D to apps. What surprised me was how much positive feedback we got from consumers, journalists, government agencies, and even some app developers.

Later on, I discovered that we had inadvertently addressed what's known as a market failure. Let me explain by using an example. Let's say you want to purchase a web cam. You can go into your favorite electronics store and compare the web cams based on price, color, and features. However, you can't easily compare how good they are with respect to privacy (or security for that matter). As a result, privacy does not really influence customer purchases, and so companies are not incentivized to improve or prioritize privacy.

9. It's not always clear what the right thing to do is

Even if a company wants to be respectful of privacy, it's not always clear how to translate that into a design. For example, the New York Times' privacy policy is currently 15 pages long. These kinds of policies are a widely accepted practice for privacy, but no one really reads these. Some of my colleagues did a rough back-of-the-envelope analysis, and found that it would take 25 full days to read all of the privacy policies that one encounters on the web in a year [20].

For developers, there isn't a clear and widely accepted set of best practices for privacy. What is the best way of informing people of data collection practices? What is the best way of storing data? How can designers best assess what kinds of data uses are and are not acceptable?

Perhaps more difficult is that the business metrics for privacy are also unclear. In the quarterly board meetings, members of the board of directors of a company will discuss metrics such as Lifetime Value, Customer Acquisition Cost, Year over Year Growth, Retention Rates, and more. However, there are no clear metrics for privacy, making it hard to discuss and to see if progress is being made. Also, as noted earlier, there is a strong incentive to collect data because it impacts the bottom line, and so one could go even further by saying that some business metrics implicitly push back against privacy.

10. Machine Learning and Probabilistic Behaviors Make Privacy Hard to Predict

A few months ago, news agencies reported how an Amazon Echo accidentally recorded a family's conversation and sent it to someone on their contact list [21]. The Echo likely misfired and "misheard" ongoing conversation as a command. This incident is a good example of how machine learning and probabilistic behaviors might accidentally cause new kinds of privacy problems in the front end.

However, machine learning and statistical models have the potential to cause many more kinds of unintended privacy problems in the back end. The concern here is that these kinds of statistical and machine learning models might inadvertently lead to a new kind of redlining, which Wikipedia describes as "systematic denial of various services to residents of specific, often racially associated, neighborhoods or communities, either directly or through the selective raising of prices". For example, a 2009 New York Times article reported how Chrome-skull accessories "were in the top one percent of products signaling a risk of default among 85,000 types of purchases analyzed", whereas premium wild birdseed was in the bottom one percent [22]. Are these purchases a possible proxy for protected categories, such as race or gender? Or perhaps they are a proxy for socioeconomic status or even for where a person lives? As machine learning models become more complex, it can be hard to discern if they are unintentionally basing decisions on sensitive or protected categories.

11. Emergent Behaviors Make Privacy Hard to Predict

I once heard a funny story about a prankster who walked by a room full of people wearing Google Glass. He yelled out "Ok Glass, take a picture," and everyone's Glass took a picture of what they were looking at. Now, there aren't any serious privacy implications with this specific story, but it illustrates how there will be emergent behaviors that arise as these IoT devices start interacting with other people and other devices in unexpected ways through the shared medium of the physical world. What kinds of things will happen when drones, smart speakers, autonomous vehicles, smart toilets, and smart toys start unintentionally interacting with one another? What happens when the inputs are probabilistic? Will some devices unexpectedly activate other devices? Will sensitive data be inadvertently shared with other devices? Will devices accidentally reveal things about us to others? What happens when you add pets and children into the mix? Out of all of the items discussed, this one is the hardest to predict, the Unknown Unknown, because IoT isn't widely deployed enough yet and because we don't have a parallel for it with smartphones or the web. In the long run, these emergent behaviors might turn out to be trivial to manage, or they might turn out to be the most challenging of all of the problems identified.

Conclusions

We are at an inflection point for privacy. Massive data collection is happening with the web and with smartphones, and will likely continue with the Internet of Things. However, there are also new laws and regulations, new technologies, and ongoing concerns being raised every day.

Reflecting on the current state of affairs, privacy will require many socio-technical solutions. We computer scientists tend not to like this, as it doesn't lend itself to clean solutions, but this is the nature of privacy and the current landscape. As such, I would encourage other researchers and privacy advocates to consider how to help regulators and how to address market failures.

I would also encourage us all to think about how to shift the burden off of end-users. The solution cannot just be more choice, more options, and more tools for end-users. Everyday consumers are too busy and don't have enough knowledge of all of the implications. I often use the analogy of spam email when talking about ways of shifting the burden off of end-users. In the early 2000s, spam was inundating our inboxes, but over time, ISPs started blocking bad actors, new standards like DKIM and SPF were created and deployed, machine learning became sophisticated enough to filter bad emails, and law enforcement agencies started to take down botnets and the most egregious of spammers. Nowadays, I only get a few spam emails, still annoying but manageable. Can we do the same for privacy?

My last point here is that we need to consider the incentives for privacy a lot more deeply. Right now, the incentives of all the major actors in privacy are seriously misaligned with consumers, and we need to find better ways of making sure that our proposed solutions can align these incentives better.

I'll end this blog entry with a parting thought: How can we create a connected world that we would all want to live in?

References

1. Hong, J. (2017). The privacy landscape of pervasive computing. IEEE Pervasive Computing, 16(3), 40-48. https://ieeexplore.ieee.org/document/7994573/

2. Brandeis, L., & Warren, S. (1890). The right to privacy. Harvard Law Review, 4(5), 193-220. https://louisville.edu/law/library/special-collections/the-louis-d.-brandeis-collection/the-right-to-privacy

3. Bellotti, V., & Sellen, A. (1993). Design for privacy in ubiquitous computing environments. In Proceedings of the Third European Conference on Computer-Supported Cooperative Work (ECSCW’93), 1993. https://www.microsoft.com/en-us/research/uploads/prod/2016/08/design-for-privacy-93.pdf

4. Organization for Economic Co-operation and Development, Guidelines on the Protection of Privacy and Transborder Flows of Personal Data. Technical Report, 1980.

5. Goffman, E. (1978). The presentation of self in everyday life. London: Harmondsworth.

6. Altman, I. (1975). The Environment and Social Behavior: Privacy, Personal Space, Territory, and Crowding.

7. H. Nissenbaum. (2004). Privacy as contextual integrity. Washington Law Review, 79, 119.

8. Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557-570. https://epic.org/privacy/reidentification/Sweeney_Article.pdf

9. Dwork, C. (2011). Differential privacy. Encyclopedia of Cryptography and Security, 338-340.

10. Saeb, S., Zhang, M., Karr, C. J., Schueller, S. M., Corden, M. E., Kording, K. P., & Mohr, D. C. (2015). Mobile phone sensor correlates of depressive symptom severity in daily-life behavior: an exploratory study. Journal of Medical Internet Research, 17(7). https://www.ncbi.nlm.nih.gov/pubmed/26180009

11. Jernigan, C., & Mistree, B. F. (2009). Gaydar: Facebook friendships expose sexual orientation. First Monday, 14(10). https://firstmonday.org/article/view/2611/2302

12. Duhigg, C. (2012). How Companies Learn Your Secrets. The New York Times, Feb-2012. Available: https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html

13. SmartInsight. Average display advertising clickthrough rates. https://www.smartinsights.com/internet-advertising/internet-advertising-analytics/display-advertising-clickthrough-rates/

14. Hong, J. I. An architecture for privacy-sensitive ubiquitous computing. http://www.cmuchimps.org/publications/an_architecture_for_privacy-sensitive_ubiquitous_computing_2010

15. Bigham, J. A Camera is Watching You in Your AirBnB: And, you consented to it. http://jeffreybigham.com/blog/2019/who-is-watching-you-in-your-airbnb.html

16. Balebako, R., Marsh, A., Lin, J., Hong, J. I., & Cranor, L. F. (2014). The privacy and security behaviors of smartphone app developers. Workshop on Usable Security (USEC 2014) http://www.cmuchimps.org/uploads/publication/paper/137/the_privacy_and_security_behaviors_of_smartphone_app_developers.pdf

17. Li, T., Agarwal, Y., & Hong, J. I. (2018). Coconut: An IDE plugin for developing privacy-friendly apps. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(4), 178. http://www.cmuchimps.org/uploads/publication/paper/194/coconut_an_ide_plugin_for_developing_privacy_friendly_apps.pdf

18. Agarwal, Y., & Hall, M. (2013, June). ProtectMyPrivacy: detecting and mitigating privacy leaks on iOS devices using crowdsourcing. In Proceedings of the 11th annual international conference on Mobile systems, applications, and services (pp. 97-110). https://www.synergylabs.org/yuvraj/docs/Agarwal_MobiSys2013_ProtectMyPrivacy.pdf

19. Chitkara, S., Gothoskar, N., Harish, S., Hong, J. I., & Agarwal, Y. (2017). Does this App Really Need My Location?: Context-Aware Privacy Management for Smartphones. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(3), 42. http://www.cmuchimps.org/uploads/publication/paper/185/does_this_app_really_need_my_location_context_aware_privacy_management_for_smartphones.pdf

20. McDonald, A. M., & Cranor, L. F. (2008). The cost of reading privacy policies. ISJLP, 4, 543.

21. Shaban, H. An Amazon Echo recorded a family's conversation, then sent it to a random person in their contacts, report says. The Washington Post, May 24, 2018. https://www.washingtonpost.com/news/the-switch/wp/2018/05/24/an-amazon-echo-recorded-a-familys-conversation-then-sent-it-to-a-random-person-in-their-contacts-report-says/

22. Duhigg, C. (2009). What does your credit-card company know about you. The New York Times. http://www.nytimes.com/2009/05/17/magazine/17credit-t.html


Communications of the ACM (CACM) is now a fully Open Access publication.
