Understanding Mobile App Reviews to Guide Misuse Audits

App reviews can be a valuable resource in identifying exploitable mobile apps.


Traditional audits of mobile apps typically involve reviewing their source code.11,27 Such processes, however, do not account for interpersonal misuse, which comes from app users rather than app developers. Here we introduce the misuse audit, a process for auditing mobile apps through reported cases of misuse.

A misuse occurs when an app user, the abuser, exploits an app to access the information of other users or third parties, the victims. In particular, a misuse audit can identify cases when the victim doesn’t know about the information access (spying) or may know about the access but is uncomfortable with it. Often missed by traditional audits, the latter case includes incidents of forced consent or when public information, such as that found on dating apps, is accessed beyond a victim’s comfort level. We use the term exploitable behavior to denote these two types of information access and exploitable apps to refer to apps that enable exploitable behavior.

Research5,9,15 shows that exploitable apps may cause victims discomfort, fear, and even harm. Preventing such risks might involve highlighting the apps and their exploitable functionalities to users, app-distribution platforms, and app developers. However, identifying these apps is nontrivial, as they do have a legitimate purpose but can be misused by abusers. In this article, we propose an approach for misuse audits, MissAuditor, that identifies exploitable apps and uncovers their exploitable functionalities.

We find that app reviews often describe exploitable functionalities, an app’s potential and actual misuse, and users’ expectations. Such reviews provide evidence of exploitable behavior and thus can guide audits of these apps. Moreover, app reviews are more valuable than app metadata (such as app descriptions) because metadata indicates only an app’s intended purpose, not its actual misuse. In particular, we address the following research questions:

  • RQinfo: What information regarding misuse audit is contained in reviews?

  • RQidentify: How can we conduct a misuse audit through app reviews?

  • RQfunctionality: What exploitable functionalities are present in audited apps?

Example 1 shows three (lightly edited) reviews from Apple’s App Storea that are relevant to exploitable behaviors. Although our study is based on reviews from Apple’s App Store, MissAuditor can be applied to reviews from other sources, including Google’s Play Store.b

In Example 1, the first review, for AirBeam Video,c addresses a scenario in which the app assists a user in accessing a victim’s information without the victim’s knowledge. AirBeam Video is a surveillance app installed on the abuser’s device. Hence, the victim may not be an app user but rather a bystander. In the second review, for Life360,d a user complains about the problem of inappropriate access to their location by their mother. Due to the unequal power dynamics between the victim (the reviewer, in this case) and the abuser (their mother, in this case), the victim is forced to install apps that violate their privacy. The third review, from 3Fun,e describes a story of improper access to profile pictures. Even though the profile pictures are public, the victim is uncomfortable with this access. It is common for users to upload this information to an app with expectations of how other users will access it. As shown in these three cases, information access by others may lead to discomfort, fear, or harm,9,15 all of which indicate misuse.

Reviews contain rich information about apps, which is crucial in auditing them for their misuse potential. For example, not only negative but also positive reviews can reveal an app’s exploitable functionalities. Relevant reviews are written by abusers, victims, and third parties, and each type of review is linguistically distinct, making the mining of reviews challenging. Moreover, while reporting exploitability, reviews have varying degrees of convincingness and severity, which can be leveraged in a misuse audit. We also found that exploitable apps exhibit a variety of exploitable functionalities, from tracking location to monitoring phone activities, such as accessing chats, phone contacts, and call history.

Our work’s novelty lies in introducing the issue of misuse audits. We leverage app reviews, a large, unexplored resource for conducting these misuse audits. We contribute to responsible computing in the following ways:

  • Showing that app reviews are a viable source of information on an app’s exploitable functionalities.

  • Developing MissAuditor, an approach based on app reviews for misuse audits of apps.

  • Creating a ranked list of exploitable apps along with their alarming reviews that reveal exploitable functionalities.

In the rest of this article, we first describe our preliminary investigation, showing that app reviews contain evidence useful for audits. Then we address how to identify exploitable apps using our misuse audit approach, MissAuditor, and discuss the procedure for uncovering exploitable functionalities in audited apps. Finally, we outline some related work on app reviews and mobile apps and discuss the reproducibility of our findings.

RQINFO: App Reviews for Misuse Audits

Seed dataset.  Chatterjee et al.5 identified 2,707 iOS apps as candidates for intimate partner surveillance (IPS)—apps that allow a person to spy on their intimate partner. IPS apps are specific to intimate partners and thus are a subclass of our concept of exploitable apps. We started our analysis with Chatterjee et al.’s IPS candidate list.

During data collection, we found that 1,687 of these 2,707 apps received at least one review on Apple’s App Store, leading to a seed dataset containing 11.57 million reviews. Out of 1,687 apps in the seed dataset, only 210 were confirmed to be IPS by Chatterjee et al.

Investigating reviews.  Since the seed dataset contains 11.57 million reviews, it would be impractical to manually check each review. Therefore, to find reviews revealing exploitable behavior, we sampled them using three keywords: spy, stalk, and stealth. From the 5,287 reviews that contain at least one of our keywords, we randomly sampled 1,000 reviews for manual scrutiny. Using this simple random sample, we investigated whether the sampled reviews revealed any misuse cases and what information they contained regarding misuse. The sample is diverse, involving 179 apps with between 1 and 237 reviews each. Of the 1,000 reviews, we found 403 reporting exploitable behavior and 597 others. Reviews reporting exploitable behavior vary across two dimensions: story and reviewer. Based on the story, we found reviews of the following two types:
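As a concrete illustration of this filtering step, the sketch below keeps reviews matching any of the three keywords (as prefixes, so "spying" and "stalking" also match) and draws a reproducible random sample. The function and variable names are ours, not from the study:

```python
import random
import re

# Keywords used to flag reviews that may describe exploitable behavior.
KEYWORDS = re.compile(r"\b(spy|stalk|stealth)", re.IGNORECASE)

def sample_candidate_reviews(reviews, n, seed=42):
    """Keep reviews matching at least one keyword, then draw a random sample."""
    matching = [r for r in reviews if KEYWORDS.search(r)]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(matching, min(n, len(matching)))

reviews = [
    "Great app for directions.",
    "Used this to spy on my kids, works great.",
    "My ex is stalking me with this thing.",
    "Runs in stealth mode, nobody notices.",
]
print(sample_candidate_reviews(reviews, 2))
```

In the study, this keyword filter narrowed 11.57 million reviews to 5,287 candidates, from which 1,000 were sampled for manual scrutiny.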

  • Exploitable act: Reviews describing someone performing an exploitable behavior. In such reviews, the reviewer is sure about the app’s exploitable functionality. For example, “This app is useless and it just helps overprotected [sic] parents spy on their sad kids.”

  • Potential act: Reviews expressing the possibility of exploitable behavior. The reviewer may not be sure of the exploitable functionality but identifies risks with the app that can be exploitable in the future. For example, “Hate to be a hater. May work well to spy on the kids by ‘accidentally’ leaving iPhone in secret place.”

The functionalities that are or could be misused by abusers (in an exploitable act or a potential act) are called exploitable functionalities. The above two reviews are negative; however, we also found positive reviews indicating an exploitable app. Example 2 shows positive reviews by abusers, who brag about the exploitable functionalities (history tracking, in this case) and sometimes express their delight in the misuse. Other reviews are written by victims, who state their concerns and grievances (frustration at the loss of privacy), and some by third persons reporting app misuse.

Our linguistic analysis revealed that reviews by abusers not only are positive (higher in valence than the other two categories) but also illustrate the abuser’s dominance over other parties. Figure 1 shows this difference in the valence and dominance scores among all three reviewers. In general, these relevant reviews show linguistic variations, complicating the analysis.

Figure 1.  Affective analysis showing valence, arousal, and dominance scores for each type of reviewer. Abusers’ language shows higher dominance and valence than the language used by victims and third persons. Mining app reviews is challenging due to such linguistic variations.

Figure 2 shows the relative distribution of 403 relevant reviews across the type of story and reviewer. Third persons mostly write stories of potential abuse (73%), whereas abusers and victims mainly describe exploitable acts (∼99% and 100%, respectively).

Figure 2.  Out of 403 stories manually identified as relevant, abusers and victims mostly write stories indicating exploitable acts (∼99% and 100%, respectively), whereas third persons write stories indicating potential cases (∼73%).

The above examples indicate that reviews contain rich information about an app’s exploitable behavior. We found two types of stories that are relevant for our purposes. Moreover, these stories are written by three types of reviewers, whose language varies, making the problem of mining these reviews nontrivial. No new types of stories and reviewers were found during our extensive verification of reviews. This indicates that our random sample of 1,000 reviews was a representative set.

RQIDENTIFY: Identifying Exploitable Apps through Misuse Audits

Here we propose MissAuditor, a review-based approach for misuse audits. Figure 3 shows an overview of the MissAuditor approach. First, we collect reviews from the Apple App Store, as shown for the seed dataset in the previous section. Then, we label a subset of the collected reviews for alarmingness (defined below) and train a computational model. We then apply our model to all reviews to identify exploitable apps. Finally, we manually examine reviews of some of the identified exploitable apps to find their exploitable functionalities.

Figure 3.  Overview of MissAuditor.

We envision MissAuditor to be incrementally updated by adding reviews of newly found exploitable apps. Apps with no reviews can be audited as soon as their reviews arrive.

Alarmingness of reviews.  We find that reviews exhibit two characteristics: convincingness, the strength of their claims about misuse, and severity, the gravity of the misuse's effect.

Convincingness depends on the review’s claim about the app’s exploitability. Some reviews report detailed exploitable behavior and are therefore convincing, whereas others contain mere suspicions of exploitable behavior. In Example 3, the first review is unrelated to exploitable behavior and therefore is not convincing. The second review describes the reviewer’s suspicion of the app, which may or may not be true (slightly convincing). The third review, by an abuser, confirms the exploitable behavior but lacks details of the exploitable functionalities or victims. In contrast, the other reviews in Example 3 are convincing because they confirm exploitable behavior and mention the location feature, how to set up devices, or the victims being stalked. Extremely convincing reviews also include cases when the app is used for positive purposes, such as tracking family members or pets for safety, but shows the potential for future misuse. Reviews that are slightly, moderately, or extremely convincing are relevant to misuse audits. Assigning a convincingness score helps rank all reviews according to the strength of their claims.

Severity measures the effect of exploitable behavior on the victim. Example 4 shows a range of reviews varying in severity. The first review is unrelated to exploitable behavior; thus, it is not severe. The second review shows that the exploitable act is performed with consent, making it a slightly severe case. The third review is written by the abuser and lacks the victim’s perspective on the exploitable effect. We assume such acts are performed without consent and consider them moderately severe. The fourth review describes the victim’s misery; the victim even says, “This app has truthfully ruined my teenage years,” which is solid evidence of an extremely severe case. In the fifth review, the victim complains that others can see when he was last active (also known as last-seen information). This information is public, but the victim is nevertheless uncomfortable with the access. App developers should be aware of such misuse.

To capture the above two characteristics, we define the alarmingness of a review as the geometric mean of its convincingness and severity. An app can receive a large number of reviews. Unlike binary classification of reviews, the alarmingness score not only identifies relevant reviews for audit but also ranks them based on the likelihood and danger of the misuse. The higher the alarmingness of the review, the more useful it is in auditing and identifying misuse.
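A minimal sketch of this score, assuming both inputs are on the four-point scale used in our annotation; one useful property of the geometric mean is that it is pulled down whenever either score is low, so a review must be both convincing and severe to rank as highly alarming:

```python
from math import sqrt

def alarmingness(convincingness, severity):
    """Geometric mean of the two scores (each on a 1-4 scale)."""
    return sqrt(convincingness * severity)

# A review that is extremely convincing (4) but only slightly severe (2)
# scores lower than one that is moderately high on both (3, 3).
print(alarmingness(4, 2))  # ≈ 2.83
print(alarmingness(3, 3))  # 3.0
```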

We created a training set of reviews as follows. From the seed dataset, we randomly selected two sets of about 1,000 reviews each: those that match at least one of our keywords and those that do not. After removing duplicate reviews, we were left with 952 and 932 reviews, respectively, combined into our training corpus of 1,884 reviews. Including both matching and nonmatching reviews in about equal numbers makes our training corpus unbiased toward the keywords.

From these 1,884 reviews, we excluded the reviewers’ identifiers, such as usernames. The task was to rate these reviews for convincingness and severity on a four-point Likert scale (1: not, 2: slightly, 3: moderately, 4: extremely). Two authors were annotators. For reviews rated by both, we computed the average convincingness and severity scores. This annotation study had minimal risk and was exempted by the Institutional Review Board (IRB) of our university. We used these annotated reviews as our training data.

We extracted linguistic features of the 1,884 reviews through the Universal Sentence Encoder (USE),4 a widely used approach that has been proven effective for app reviews.7,13 Using USE features, we trained various multitarget (the targets here being convincingness and severity scores) regression models3 and found that the Support Vector Regressor (SVR) outperforms others. We considered using the reviews’ metadata, such as ratings and titles, but found the metadata to be unhelpful for identifying misuse and left it out. Not only reviews with negative ratings and titles but also reviews with positive ratings and titles can indicate misuse (partly discussed earlier in “Investigating reviews”). For example, an abuser writes a review with the title “Great app” and provides a five-star rating, even though the app facilitates misuse. As these nondiscriminatory attributes may confuse the model, we decided not to include them in training, instead using only the review text.
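The training step might look like the following scikit-learn sketch. The random vectors stand in for USE embeddings (in practice produced by the TensorFlow Hub encoder, which outputs 512-dimensional vectors); the data sizes and hyperparameters are illustrative, not those used in the study:

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Placeholder for USE embeddings: one 512-dimensional vector per review.
X_train = rng.normal(size=(200, 512))
# Two regression targets per review: convincingness and severity (1-4 scale).
y_train = rng.uniform(1, 4, size=(200, 2))

# SVR is single-output by nature; MultiOutputRegressor fits one SVR per target.
model = MultiOutputRegressor(SVR(kernel="rbf"))
model.fit(X_train, y_train)

# Predict both scores for five unseen reviews: shape (5, 2).
scores = model.predict(rng.normal(size=(5, 512)))
```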

We applied the trained SVR to predict convincingness and severity scores for all 11.57 million reviews in the seed dataset. As mentioned earlier, the alarmingness of each review is calculated as the geometric mean of its predicted convincingness and severity.

Identifying exploitable apps.  Using statistical methods, we aggregated the alarmingness of reviews and ranked all apps according to their reported misuse. From the seed dataset, our model predicted a total of 100 exploitable apps (including false positives). Moreover, our approach is not dependent on the choice of candidate apps and can be applied to any set of apps. To audit additional apps, we applied our model on 1) a dataset of similar apps and 2) a dataset of 100 popular apps in the “Utilities” category. We found not only IPS but also many general-purpose exploitable apps exhibiting multiple exploitable functionalities, which are described below.
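The article does not pin down the exact aggregation statistic; as one illustrative choice, the sketch below ranks apps by the mean of each app's top-k alarmingness scores, so a few highly alarming reviews outweigh a long tail of benign ones:

```python
from collections import defaultdict

def rank_apps(review_scores, k=5):
    """Rank apps by the mean of their top-k alarmingness scores.

    review_scores: iterable of (app_id, alarmingness) pairs. The top-k
    mean is one illustrative aggregate, not necessarily the study's.
    """
    per_app = defaultdict(list)
    for app, score in review_scores:
        per_app[app].append(score)
    aggregate = {
        app: sum(sorted(s, reverse=True)[:k]) / min(k, len(s))
        for app, s in per_app.items()
    }
    return sorted(aggregate, key=aggregate.get, reverse=True)

scores = [("TrackerX", 3.8), ("TrackerX", 3.5), ("FlashlightY", 1.2),
          ("TrackerX", 1.0), ("FlashlightY", 1.1)]
print(rank_apps(scores, k=2))  # ['TrackerX', 'FlashlightY']
```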

Similar apps.  For each app determined to be exploitable from the seed dataset, we obtained recommendations for similar apps from the Apple App Store’s “You May Also Like” feature. Through this process, we obtained 788 similar apps. Our motivation in using Apple-recommended apps is that these apps should offer functionalities similar to the apps classified as exploitable. Further, we collected reviews (over the period of August 2008 to August 2022) of these 788 apps. We call this the snowball dataset.

From the snowball dataset, our model predicted 90 apps as exploitable (including false positives). Our evaluation on this dataset shows that the model achieves a recall of 71.60%, much higher than that of the baseline approaches. In this scenario, a false negative can lead to harm through misuse, whereas a false positive merely wastes effort on an unnecessary audit. Hence, high recall is more valuable than high precision.

100 popular utility apps.  Surveillance apps that can be misused for spying fall under the “Utilities” category, making it an important category to audit. We considered 100 popular utility apps (mentioned on Apple’s App Store pagef) and collected their reviews. Since a popular app can have many reviews over the years, collecting them can be computationally expensive.

From these 100 apps, our approach, MissAuditor, classified only one app as exploitable. But upon examining its reviews, we found that app to be nonexploitable. We also scrutinized some apps classified as nonexploitable and found their predictions to be true (based on manual verification of reviews).

Some app categories, such as payment apps and calculators, are unlikely to be misused and thus are not good candidates for identification. Iteratively auditing apps (through MissAuditor) that are similar to already identified exploitable apps is a feasible way to uncover a large exploitable landscape.

Relevance of findings.  Some reviews in our datasets are old. For example, the seed dataset was collected over the period of July 2008 to January 2020. To check whether our findings are still relevant, we randomly sampled 50 confirmed exploitable apps from the union of the seed and snowball datasets. We collected new reviews (from January 2020 to April 2024) for these apps and applied MissAuditor to them. Then, we scrutinized the apps’ descriptions (to understand their basic functionalities) and alarming reviews (to identify misuse cases, if any), following the same process as before. We found that the new reviews of three of these 50 apps don’t report misuse. In contrast, the other 47 apps still possess exploitable functionalities that can cause misuse. A high success rate (94%; 47 of 50) on a random sample implies that most of the identified apps still facilitate misuse.

RQFUNCTIONALITY: Uncovering Exploitable Functionalities

Here, we illustrate the idea of uncovering exploitable functionalities using the exploitable apps found from the seed dataset. First, we analyzed an app’s description to understand its functionalities. Then, we analyzed the top alarming reviews (discovered by MissAuditor) for identifying existing misuse and exploitable functionalities. Through this process, we discovered the following types of exploitable functionalities:

  • Monitoring phone activities: Some apps monitor a victim’s phone activities, such as browsing history and text messages. These apps are installed on a victim’s device and their activities can be monitored on another synced device. For example, the SaferKid Text Monitoring Appg allows synced devices to monitor call history, Web history, texts, and so on.

  • Audio or video surveillance: Some apps enable audio or video surveillance without a victim’s knowledge. These apps listen, view, or record a victim’s voice or actions, and some of them need not be installed on the victim’s phone. For example, Find My Kids: Parental Controlh can be misused to record private conversations between people without them knowing.

  • Tracking location: Some global positioning system (GPS) apps enable tracking a victim’s phone, making them uncomfortable with this access. For example, Find My iPhonei is a legitimate app but can be misused to spy on the location of connected devices.

  • Profile stalking: Some apps are misused for stalking user profiles or user-generated content (such as text and images). For example, Kik Messaging and Chat Appj can be misused for stalking images and victims’ other information on the app.

The table shows these four types of exploitable functionalities and some of the alarming reviews reporting them. Some of these reviews are old (2014 or 2012), but we confirmed that similar concerns are being raised in recent reviews of the same apps. For example, the Find My iPhone app lets its users see the location of the connected devices. Due to unequal power dynamics between the abuser and the victim (say, in a family setting),5 the victim is sometimes forced to connect to such apps and allow their device to be located. In other words, legitimate apps designed for locating loved ones can be misused for exploitable behavior when the requirements for consent23 are violated. Hence, they are dual use.5 Through a qualitative study, Freed et al.10 found that many find-my-phone applications are intended for anti-theft and safety purposes but are heavily misused to invade privacy. Chatterjee et al.5 categorize these apps as spyware, especially in the case of IPS. Just because an app has some legitimate uses does not justify its possible misuse. Therefore, we consider these apps to be exploitable. Relying on app reviews highlights both types of exploitable functionalities (always malicious and possibly with legitimate uses), against which future and current app users should be warned. App distribution platforms and developers should also try to mitigate privacy risks.

Table.  Types of exploitable functionalities.
Exploitable Functionality | App Example | Alarming Review
Monitoring phone activities | SaferKid Text Monitoring App | Tracking things like social media, texts, and search history is just a complete disregard of privacy. You have to have trust in your kids … Apps like these shouldn’t be allowed. IF YOU TRUST YOUR KID, DON’T DOWNLOAD. (Date: 2019-12-07)
Audio or video surveillance | Find My Kids: Parental Control | This app proves to have an invasion of privacy. Due to the fact if your kid was at a friend’s house and talking to his friend’s parents, this app records what is going on and is an invasion of privacy. If your child left their phone downstairs or anywhere and they are playing it can record private conversation between adults and is unsafe … (Date: 2019-01-16)
Tracking location | Find My iPhone | It’s supposed to be used to recover a lost phone, not to religiously stalk your children … The fact that a mom actually installed this app onto her son’s phone without his knowledge is flat out wrong. … If you’re constantly monitoring your child 24/7, just imagine what your child will do when they go off to college. … (Date: 2014-02-13)
Profile stalking | Kik Messaging and Chat App | MAKE THIS SAFE. PEOPLE WANT TO USE IT BUT DON’T WANT TO BE STALKED OR ABUSED IN THE APP. PROTECT THE APP OR PEOPLE WON’T USE IT. MY FRIEND HAD A MAN SEND HER A BAD PICTURE IF YOU KNOW WHAT I MEAN. OVERALL THIS APP IS GREAT!!!! (Date: 2015-06-24)

We illustrated the exploitable behavior of the SaferKid Text Monitoring App by installing it on two devices: a parent’s device (iOS version 14.4.1) and a child’s device (Android version 11.0). Figure 4 shows the exploitable features in this app. Activities on the child’s device can be monitored on the synced parent’s device. Figure 4a shows SaferKid’s exploitable functionalities, such as monitoring text messages, Web history, and call history. We verified each of these functionalities. Figure 4b shows the screen displaying all of the victim’s chats. Apps such as SaferKid are advertised as safety apps for children but can be secretly or openly (even forcibly) installed on another device to monitor the user’s activity. Not just parents but anyone can misuse these apps by installing them on a victim’s phone.

Figure 4.  Exploitable functionalities in the SaferKid Text Monitoring App. The actual chats and device’s name are hidden for anonymity. (a) The app provides multiple ways to monitor victim’s phone activities. (b) The victim’s chats can be seen on the abuser’s phone.

Related Work

Auditing and app reviews.  Most research on software audits analyzes the flow and dependencies in source code.11,19,20,27 Only a few audit studies consider data such as system traces.26 All of these studies, however, fail to identify exploitable apps because they focus on the technical aspects of building or running them. Thus, their analyses do not address cases of forced consent and inappropriate access to public information. Our work leverages app reviews, unexplored by prior studies, for misuse audits.

Prior studies show that app reviews are valuable in other ways. Dhinakaran et al.8 mine requirement-related information from app reviews and propose an active-learning framework to fetch relevant ones in a semi-supervised manner. Haque et al.14 leverage reviews to draw implicit and explicit comparisons between competing apps. Guo and Singh13 extract ⟨action, app problem⟩ pairs from the stories expressed in reviews, which can help direct developers to common problems. Chen et al.6 identify the most relevant reviews for developers by grouping informative ones and applying unsupervised techniques. For developers to learn from feature suggestions and bug reports, Kurtanović and Maalej16 filter rationale-backed reviews through classification. Some studies1,18 mine reviews to understand users’ perceptions about apps selling their data. However, app reviews remain largely unexplored for understanding misuse by an app’s users.

Spying through mobile apps.  Chatterjee et al.5 apply search queries and manual verification based on information such as app descriptions and permissions to identify IPS apps. However, in general, the actual usage deviates from the legitimate purpose shown in app descriptions. To identify such misuse, we focus on the evidence provided in app reviews.

Roundy et al.22 identify apps used for phone number spoofing and message bombing, which lie outside our scope of exploitable apps. Conversely, exploitable apps include those that enable stalking public information, which lie outside their scope. Roundy et al. use installation data to uncover spying apps that are installed on infected devices. However, we focus on evidence of exploitable behavior present in app reviews to uncover exploitable apps. Roundy et al. rely upon Norton’s security appk to determine which devices are infected. Thus, their approach would miss apps that a general user can leverage to spy.

All of these studies, along with a few others,24,25,28 investigate how technology is abused for spying. However, they do not support the broader set of exploitable apps. In particular, they do not consider cases when the victim is uncomfortable with the access (for example, of public information) even though aware of it, along with cases involving family, not just strangers, as abusers.

Some studies address user privacy in the context of the user’s information shared with software developers.2,17,21 They do not consider cases where an app user can exploit the app to access the information of a victim (another app user or bystander).

Discussion

We proposed MissAuditor, an approach to automatically analyze app reviews for misuse audits. Specifically, MissAuditor identifies exploitable apps along with their exploitable functionalities from reviews. Doing so reduces the burden of manually installing all apps to perform a misuse audit. MissAuditor predicts exploitable apps from multiple sources. After verifying reviews and app descriptions, we confirmed 156 apps facilitating misuse. To confirm their misuse, we manually investigated their reviews in two steps. First, we scrutinized their top 50 alarming reviews to confirm any exploitable functionalities. Then, we scrutinized their reviews containing our keywords. Solid misuse cases (in at least one of these two steps) give credence to the exploitability of these 156 apps (see full list in Section 9 of the supplemental material).

Reporting to Apple’s App Store and app developers.  We informed Apple’s App Store about the potential misuse of all the exploitable apps that we confirmed above. We found that 17 of them have already been deleted from the App Store, possibly due to privacy concerns or other unknown reasons.

For the apps that we confirmed as exploitable, only 90 developers provided their email or chat support on the App Store. We contacted these developers and informed them about the misuse cases (see template in supplemental material). We heard back from 14 of them. Despite asking specific questions, we received generic replies. Six of them acknowledged the misuse or informed us that their privacy teams take necessary steps in whatever way possible; five developers assumed that app users (including abusers) would never misuse their apps (an unrealistic assumption); two developers suggested in their responses that they were not receptive to our feedback; and the remaining one developer did not reply after we asked more specific questions.

Threats to validity.  We identified two types of threats in our work: threats that we mitigate and threats that remain.

Threats mitigated.  To address a potential bias from our keywords, we constructed a training set that was balanced between reviews that matched and didn’t match our keywords. Using this training set helped our model learn from the context and not from specific keywords. Also, instead of crowd workers, who might not understand the problem well, two coauthors of this article annotated the entire training data. Finally, the ground truth of exploitable apps could be biased if it was formed using only the top alarming reviews. We mitigated this bias by scrutinizing reviews from the seed and snowball datasets (see Sections 5.1 and 6 of the supplemental material) containing our keywords, which can contain evidence missed by the alarming reviews.

Threats remaining.  We investigated only a few thousand apps, which may not represent all apps on Apple’s App Store; MissAuditor’s performance could drop if it were applied to the entire App Store. We also targeted apps and their reviews only on Apple’s App Store, so the performance of our approach could differ on other app stores. In addition, if an app-distribution platform does not offer recommendations for similar apps, finding good candidates for exploitable apps would be challenging; an alternative would be to prioritize applying MissAuditor to the apps flagged by app users (victims). Also, some reviews in our datasets are old. Due to frequent app updates, the types of misuse may change over time. Although we confirmed that our findings are still relevant (see “Relevance of findings” above), a few apps may no longer be exploitable. Finally, some findings are based on a random review sample because manually reading all reviews is not feasible; reading them all might change the results.

Limitations and future directions.  We identified the following limitations of this approach, each of which suggests topics for future research: First, MissAuditor may miss some exploitable apps if they do not have alarming reviews at the time of analysis, possibly because they are new apps. However, this limitation can be overcome if MissAuditor is used along with app descriptions. Users could be warned against apps whose descriptions are similar to already identified exploitable apps (or against similar apps found through a recommendation system). Further, MissAuditor can flag these new apps as soon as their alarming reviews arrive. In this way, our computational model can be updated by including newly found exploitable cases in the training set and iteratively identifying a large landscape of misuse.

Second, uncovering exploitable functionalities involves the manual effort of inspecting top alarming reviews. A future direction is to automate this process. Moreover, each functionality is analyzed in isolation from other functionalities. Future audit studies can work on the interdependence of the functionalities that facilitate misuse.
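One conceivable first step toward automating that inspection is to surface recurring terms across the top alarming reviews as candidate functionality cues. This is a rough sketch under our own assumptions (the function name, stopword list, and frequency heuristic are invented for illustration), not a method from the study.

```python
import re
from collections import Counter

# Minimal stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "to", "and", "is", "it", "my", "of",
             "this", "app", "i", "can", "you", "me", "him", "he", "she"}

def surface_functionality_terms(alarming_reviews, top_k=10):
    """Count, per review, the distinct content words it mentions and
    return the terms shared by the most alarming reviews -- candidate
    exploitable functionalities for a human auditor to inspect."""
    counts = Counter()
    for review in alarming_reviews:
        words = re.findall(r"[a-z']+", review.lower())
        counts.update(w for w in set(words) if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_k)]
```

The output would only rank leads; a human auditor would still confirm which terms actually name an exploitable functionality.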

Third, app reviews describe perceptions of an app and its misuse, and such perceptions are shaped by personal preferences; for example, how much information access is too much to share? Moreover, user interactions and perceptions in cases involving family power dynamics (parent-child and intimate partners), especially regarding forced consent, are nontrivial to analyze. Manual inspection of these reviews by developers can mitigate but not eliminate this difficulty. Future studies can explore recommending apps based on a user’s personal preferences.

Fourth, some reviews may falsely claim exploitable behavior in an app. These could include reviews written by the app’s competitors or by people portraying themselves as victims. Identifying these misleading cases is challenging and outside the scope of our study.

Finally, app reviews can be ambiguous in differentiating inherently malicious functionalities from benign but exploitable ones. For example, the Safer Kid Text Monitoring App is designed to keep loved ones safe (a benign purpose) but is misused according to its reviews. Future research can work to distinguish such benign apps from inherently malicious ones.

To view the supplemental material for this article, please visit https://dl.acm.org/doi/10.1145/3685528 and click on the supplemental material link.

Reproducibility

We have released our entire training data along with the computational model on Zenodo.12 The supplemental material provides details of dataset collection and annotation, training and evaluation, and linguistic analysis.

Acknowledgments

Thanks to the anonymous reviewers for their helpful comments. Thanks to the Department of Defense for partially supporting this research under the Science of Security Lablet. NA acknowledges partial support from EPSRC (EP/W025361/1).

References

1. Besmer, A.R., Watson, J., and Banks, M.S. Investigating user perceptions of mobile app privacy: An analysis of user-submitted app reviews. Intern. J. of Information Security and Privacy (IJISP) 14, 4 (2020), 74–91.
2. Block, K., Narain, S., and Noubir, G. An autonomic and permissionless Android covert channel. In Proceedings of the 10th ACM Conf. on Security and Privacy in Wireless and Mobile Networks (2017), 184–194.
3. Borchani, H. et al. A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5, 5 (2015), 216–233.
4. Cer, D. et al. Universal sentence encoder. CoRR abs/1803.11175 (2018), 1–7.
5. Chatterjee, R. et al. The spyware used in intimate partner violence. In Proceedings of the 39th IEEE Symp. on Security and Privacy (SP). IEEE Press (2018), 441–458.
6. Chen, N. et al. AR-Miner: Mining informative reviews for developers from mobile app marketplace. In Proceedings of the 36th Intern. Conf. on Software Engineering. ACM (2014), 767–778.
7. Devine, P., Koh, Y.S., and Blincoe, K. Evaluating unsupervised text embeddings on software user feedback. In Proceedings of the IEEE 29th Intern. Requirements Engineering Conf. Workshops (REW). IEEE (2021), 87–95.
8. Dhinakaran, V.T. et al. App review analysis via active learning: Reducing supervision effort without compromising classification accuracy. In Proceedings of the 26th IEEE Intern. Requirements Engineering Conf. (RE). IEEE (2018), 170–181.
9. Freed, D. et al. Is my phone hacked? Analyzing clinical computer security interventions with survivors of intimate partner violence. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), Article 202.
10. Freed, D. et al. “A stalker’s paradise”: How intimate partner abusers exploit technology. In Proceedings of the CHI Conf. on Human Factors in Computing Systems. ACM (2018), 1–13.
11. García, C. et al. Bluejay: A cross-tooling audit framework for agile software teams. In Proceedings of the 43rd IEEE Intern. Conf. on Software Engineering: Software Engineering Education and Training (ICSE-SEET). IEEE (2021), 283–288.
12. Garg, V. et al. MissAuditor dataset and software. Zenodo Public Repository (2024); 10.5281/zenodo.12736144
13. Guo, H. and Singh, M.P. Caspar: Extracting and synthesizing user stories of problems from app reviews. In Proceedings of the IEEE 42nd Intern. Conf. on Software Engineering. ACM (2020), 628–640.
14. Haque, A. et al. Pixie: Preference in implicit and explicit comparisons. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Vol. 2. Association for Computational Linguistics (2022), 106–112.
15. Havron, S. et al. Clinical computer security for victims of intimate partner violence. In Proceedings of the 28th USENIX Security Symp. USENIX (2019), 105–122.
16. Kurtanović, Z. and Maalej, W. Mining user rationale from software reviews. In Proceedings of the 25th IEEE Intern. Requirements Engineering Conf. (RE). IEEE Press (2017), 61–70.
17. Marforio, C. et al. Analysis of the communication between colluding applications on modern smartphones. In Proceedings of the 28th Annual Computer Security Applications Conf. ACM (2012), 51–60.
18. Nguyen, D.C. et al. Short text, large effect: Measuring the impact of user reviews on Android app security & privacy. In Proceedings of the 40th IEEE Symp. on Security and Privacy (SP). IEEE (2019), 555–569.
19. Pashchenko, I. et al. Vuln4Real: A methodology for counting actually vulnerable dependencies. IEEE Transactions on Software Engineering 48, 5 (2020), 1592–1609.
20. Perl, H. et al. VCCFinder: Finding potential vulnerabilities in open-source projects to assist code audits. In Proceedings of the 22nd ACM SIGSAC Conf. on Computer and Communications Security. ACM (2015), 426–437.
21. Reardon, J. et al. 50 ways to leak your data: An exploration of apps’ circumvention of the Android permissions system. In Proceedings of the 28th USENIX Security Symp. USENIX (2019), 603–620.
22. Roundy, K.A. et al. The many kinds of creepware used for interpersonal attacks. In Proceedings of the 41st IEEE Symp. on Security and Privacy (SP). IEEE (2020), 753–770.
23. Singh, M.P. Consent as a foundation for responsible autonomy. Proceedings of the 36th AAAI Conf. on Artificial Intelligence (AAAI) 36, 11 (2022), 12301–12306.
24. Tseng, E. et al. The tools and tactics used in intimate partner surveillance: An analysis of online infidelity forums. In Proceedings of the 29th USENIX Security Symp. USENIX (2020), 1893–1909.
25. Tseng, E. et al. A digital safety dilemma: Analysis of computer-mediated computer security interventions for intimate partner violence during COVID-19. In Proceedings of the 2021 Conf. on Human Factors in Computing Systems. ACM (2021), 1–17.
26. Wang, T.-Y., Jin, H., and Nahrstedt, K. mAuditor: Mobile auditing framework for mHealth applications. In Proceedings of the 2015 Workshop on Pervasive Wireless Healthcare. ACM (2015), 7–12.
27. Xia, M. et al. Effective real-time Android application auditing. In Proceedings of the 36th IEEE Symp. on Security and Privacy. IEEE (2015), 899–914.
28. Zou, Y. et al. The role of computer security customer support in helping survivors of intimate partner violence. In Proceedings of the 30th USENIX Security Symp. USENIX (2021), 429–446.
