Following Linguistic Footprints: Automatic Deception Detection in Online Communication

Posted Sep 1 2008

Deception occurs in interpersonal, group, media, and public contexts every day, driven by motivations such as punishment avoidance, aggression, wish fulfillment, and even enjoyment. Some deception does little damage (e.g., white lies), while high-stakes deception can have devastating consequences for both its victims and society at large. Unfortunately, deception rarely takes the form of blatant lies. Human beings, even experienced police officers, judges, and other forensic professionals, are not very good at detecting deception, partly because of their natural truth bias, resulting in an accuracy rate only slightly better than 50%.

Deception is information intentionally transmitted to create a false impression or conclusion.² Deception in face-to-face communication has been extensively studied in social science disciplines to identify cues to deception and to understand human deception detection. As globalization and the popularity of virtual teams grow, people increasingly rely on computer mediation for interpersonal communication, information acquisition, and information dissemination. Online communication enabled by electronic media (e.g., emails and instant messaging) relieves people from contextual restrictions on their behavior and enables selective self-presentation due to the physical separation and optional anonymity of communication partners. As a result, online communication may offer fertile grounds for deception and alter the social and legal distribution of deceptive practice. According to an annual report of the Internet Crime Complaint Center (IC3, http://www.ic3.gov/), in 2007, IC3 processed about 220,000 complaints that could lead to Internet crime investigations by law enforcement and regulatory agencies. The total dollar loss from all cases of fraud was about $239 million, up from about $198 million in 2006. Email was one of the primary mechanisms (73.6%) by which fraudulent contact took place.

Despite ample research on deception and on online communication individually, there was relatively little work aiming to understand their interaction until recently. Through a series of studies on deception in online communication, we have obtained a collection of linguistic cues to deception and developed models for automatic deception detection. These cues and models can be used to assist people in detecting online deception and to increase public awareness of deception in online communication.

Linguistic Cues to Deception

Linguistic cues to deception are rooted in the psychological experience underlying deception. Deception in interpersonal communication is driven by at least two conflicting goals: accomplishing deception and avoiding detection. As a result, a deceiver is engaged in both strategic and non-strategic behaviors.² The former manifest the plans and intentions behind deception, while the latter reflect perceptual, cognitive, and emotional processes during the communication. Behaviors that suggest their conveyor is attempting to deceive are often called cues to deception. These cues can be either verbal or nonverbal.

Deception involves the manipulation of language and careful construction of messages or stories that appear truthful to others. In a taxonomy of cues to deception (Figure 1), linguistic cues, which belong to verbal behavior, differ from content-based cues in that the former focus on how deception is conveyed in a natural language, while the latter focus on what is conveyed. Although a deceiver has some level of control while constructing the content of a story, the language style used to tell the story may contain clues about his/her state of mind.⁶ The “slips” in language can reveal underlying anxiety, guilt, or arousal, so deceptive intention is likely to be reflected in linguistic features. In nonverbal behavior, paralinguistic cues consist of keyboard (e.g., message erasing), participatory (e.g., response delays), sequential (e.g., initiation), and voice-related cues (e.g., voice pitch). Proxemic-kinesic cues involve spatial distance or movements of any part of the body (e.g., body gesture).

Linguistic cues are particularly important to deception detection in online communication for three primary reasons. First, linguistic cues have proven effective for detecting deception in face-to-face communication, based on the results of a myriad of scientific investigations. Second, text messages are the major source of information in online communication, making linguistic cues one of the few cues available to generate impressions and supply relational information.³ In contrast, other cues, such as the physiological measures on which polygraph and galvanic skin response tests are grounded, are often absent. Third, natural language processing techniques make it feasible to automate the extraction of linguistic cues, which is desirable when dealing with a large volume of online messages. By identifying such cues, we can build deception models or intelligent agents to help people detect deceptive messages and deter online deception.

Discovering Linguistic Cues to Deception in Online Communication

The earlier discussion leads to two fundamental research questions: What are effective linguistic cues to deception in online communication? Can automated computational models based on the linguistic cues effectively assist deception detection in online communication?

Online communication has dialogic and dynamic features. In deceiving others, deceivers start by employing various strategies to withhold truthful information, followed by opting for vagueness and uncertainty if withholding does not work, and finally resorting to non-immediacy if the first two fail.² These strategies may be executed through a collection of linguistic choices. Of these choices, many are identified as part of verbal non-immediacy, which refers to linguistic patterns that create a psychological sense of distance between communication partners. Such a psychological sense diminishes the clear linkage between an individual and an action. Moreover, there may be changes in subsequent linguistic behavior due to the adaptation in communicative interaction.

A wide range of linguistic cues to deception have been developed from traditional deception research⁴ and criminal investigation practice. For example, compared with truth, deception contains fewer unique words and self-references, superfluous repetitions of words or phrases, and incomplete sentences. Deception also shows more negative emotion, and sounds more evasive, unclear, uncertain, and impersonal. Linguistic cues are an integral part of a number of systems developed for criminal investigation, such as Criteria-Based Content Analysis (CBCA), Reality Monitoring (RM), and Scientific Content Analysis (SCAN).¹⁰ CBCA systematically assesses the credibility of a verbal statement using criteria such as logical structure, number of details, and accounts of subjective mental state. RM measures verbal differences based on the difference in memory quality between experienced events and imagined events. SCAN differentiates statements of doubtful validity from those that are probably accurate using indicators such as missing links and spontaneous corrections. For example, FBI agents are trained to detect deception in suspects’ written statements by finding deviations from the expected parts of speech.¹

The above traditional linguistic cues are a good starting point for studying deception in online communication. However, given that the language used online differs from that used in traditional communication, existing linguistic cues require validation before they can be applied to online deception, and new cues may emerge in online deception. For example, dominance in online communication is more likely to be exercised through clearly stated subjective orientation and polarity of words than through non-verbal behavior. Moreover, different types of electronic media can bring with them different behavioral affordances that can affect linguistic cues to deception. For example, compared with email, instant messaging 1) allows synchronous and more spontaneous communication and message exchange; 2) gives deceivers less time to plan, edit, and refine messages; and 3) is characterized by less formal language. These characteristics of different media should be taken into consideration when identifying linguistic cues to deception.

Methodology

One of the challenges of conducting deception research is data collection, for which two major research methodologies have been used: field studies and controlled laboratory experiments. Given the ethical implications of deception, the latter approach has been used more frequently.

In our previous laboratory experiments,⁹,¹⁰ we formed groups of two or more members. For each group, members were asked to work together on one or several group tasks via online communication. They were seated in different rooms and did not see each other before, during, or after the experiments. Typically, one group member was randomly chosen to be the deceiver in the task and was instructed to try to deceive other group members, without being noticed, by either intentionally providing wrong information or promoting an idea that was opposite to his/her true belief. Other group members, namely truthful participants, were unaware of the special role of their deceiving partners. The identities of participants were usually kept anonymous in the experiments, but they could be revealed if needed. Groups in which all members told the truth formed the control groups. A quasi-experiment without a control group can also be useful to identify cues to deception within groups where the task of deceivers and that of truth-tellers were the same except for the deception manipulation. At the beginning of each experiment, participants responded to a questionnaire on their demographic information and task-related experience. At the end, the deceivers answered questions regarding their actual deception behavior during the task to enable a manipulation check.

The electronic text messages exchanged among group members during an experiment were automatically recorded for later analysis. In addition to the communication tools, other software could also be employed during experiments to capture message editing behavior. After experiments, the archived messages were extracted and grouped into deception and truth to support the analysis of linguistic cues to deception.

Analysis and Findings

Linguistic cues in online communication can be operationalized as general linguistic knowledge. Compared with content-based cues, linguistic cues avoid the ground-truth problem and are more amenable to simple parsing approaches.¹⁰ Linguistic cues are embedded in various text units, including words, phrases, sentences, and messages. Previous encoding of cues to deception, mostly done by domain or trained experts, involved extensive manual effort. Natural language processing techniques (e.g., morphological, syntactic, lexical semantic, and speech act analyses) and various lexical resources (e.g., Pennebaker et al.’s LIWC and Whissell’s Dictionary of Affect) have recently been applied to extract cues to deception automatically.⁵,¹⁰ Those techniques are able to generate information for encoding linguistic cues in a reliable and efficient manner. For example, a study¹⁰ analyzed linguistic cues by first performing morphological analysis, syntactic analysis, and named entity extraction, and then computing the values of individual cues such as message quantity and complexity. Once the promising cues are encoded, they are tested and validated through statistical analyses.
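As a rough illustration of what computing the values of individual cues can look like, the following minimal Python sketch derives a few surface cues (quantity, lexical diversity, average sentence length, self-reference rate, and uncertainty rate) from a single message. The function name `extract_cues`, the tokenization rules, and the two tiny word lists are assumptions made for this example; they are not LIWC categories and not the cue definitions used in the studies cited above.

```python
import re

# Hypothetical word lists, for illustration only (not taken from LIWC).
SELF_REFERENCES = {"i", "me", "my", "mine", "myself"}
UNCERTAINTY_WORDS = {"maybe", "perhaps", "possibly", "might", "could", "probably"}


def extract_cues(message: str) -> dict:
    """Compute a few simple surface cues for one text message."""
    # Naive sentence split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    # Lowercased word tokens, allowing one internal apostrophe (e.g. "don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", message.lower())
    n_words = len(words)
    if n_words == 0:
        return {"word_count": 0, "lexical_diversity": 0.0,
                "avg_sentence_length": 0.0, "self_reference_rate": 0.0,
                "uncertainty_rate": 0.0}
    return {
        # Quantity: total number of words.
        "word_count": n_words,
        # Diversity: share of distinct words among all words.
        "lexical_diversity": len(set(words)) / n_words,
        # Complexity proxy: mean number of words per sentence.
        "avg_sentence_length": n_words / max(len(sentences), 1),
        # Share of first-person singular references.
        "self_reference_rate": sum(w in SELF_REFERENCES for w in words) / n_words,
        # Share of hedging words.
        "uncertainty_rate": sum(w in UNCERTAINTY_WORDS for w in words) / n_words,
    }
```

Calling `extract_cues` on each archived message yields one cue vector per message, which can then be compared between the deception and truth groups.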

According to the results of a series of empirical studies, we summarize major findings on linguistic cues to online deception in Table 1, which shows that deceptive messages differ from truthful messages in several linguistic dimensions. For example, deceptive messages tend to be longer, more informal and uncertain, more expressive and non-immediate, less complex, and less diverse than truthful messages. Uncertainty and non-immediacy are consistent with deceivers’ general strategies of obfuscation and equivocation. By using indirect and vague language, deceivers may seem submissive, thus attenuating receivers’ suspicion about a message.³ In Table 1, some linguistic cues to deception such as non-immediacy, affect, and language complexity are effective for both asynchronous and synchronous online communication. Some other cues are identified only from synchronous online communication, including cognitive complexity and spontaneous correction.

It should be noted that the interpretation of deception behavior is subject to a variety of contextual factors, including people, activity, and communication environment. Thus, the linguistic cues identified in one deception context may not be extensible to another.

Automatic Deception Detection based on Linguistic Cues

Deception detection generally goes through four phases: 1) identifying and extracting significant cues to deception; 2) building a deception detection model using the identified cues; 3) reasoning about deception with the detection model; and 4) making a detection decision. A deception detection model can be applied manually or automatically.
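The four phases can be sketched as four small functions. The example below is only a toy illustration: it uses a nearest-centroid rule as a stand-in model, two made-up cues, a hypothetical word list, and invented training messages. None of these come from the studies described in this article.

```python
import math
import re

UNCERTAINTY_WORDS = {"maybe", "perhaps", "might", "possibly"}  # hypothetical list


def extract_cues(message):
    """Phase 1: turn a message into a small cue vector."""
    words = re.findall(r"[a-z]+", message.lower())
    n = max(len(words), 1)
    uncertainty = sum(w in UNCERTAINTY_WORDS for w in words) / n
    diversity = len(set(words)) / n
    return [uncertainty, diversity]


def build_model(labeled_messages):
    """Phase 2: store one centroid (mean cue vector) per class label."""
    sums, counts = {}, {}
    for message, label in labeled_messages:
        cues = extract_cues(message)
        acc = sums.setdefault(label, [0.0] * len(cues))
        for i, value in enumerate(cues):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def reason(model, message):
    """Phase 3: distance from the message's cues to each class centroid."""
    cues = extract_cues(message)
    return {label: math.dist(cues, centroid) for label, centroid in model.items()}


def decide(distances):
    """Phase 4: choose the class whose centroid is closest."""
    return min(distances, key=distances.get)


# Invented training messages, for illustration only.
toy_model = build_model([
    ("maybe it might possibly be there maybe", "deceptive"),
    ("the report is on the shared drive", "truthful"),
])
decision = decide(reason(toy_model, "perhaps it might be late"))
```

Keeping the phases separate makes it possible to swap the stand-in model for a stronger one without touching cue extraction or the final decision step.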

Detecting online deception is particularly challenging for three reasons. First, non-verbal behavior of communication partners is largely precluded by the communication medium. Second, the lack of social presence in an online communication environment may prevent people from engaging in realistic dialogue to determine message veracity. Third, deceivers normally have more time and opportunities to plan and rehearse their messages in online communication than in traditional face-to-face communication, thereby lowering levels of arousal and reducing the leakage of deception behavior.¹⁰ The above challenges coupled with the extensive level of online communication make manual approaches to detecting online deception even more ineffective.

Automatic deception detection is a promising solution because most linguistic cues can be extracted automatically. We have developed models using machine learning techniques to support humans in making deception detection decisions. For example, in one of our earlier studies, classification models were developed for deception detection using discriminant analysis, logistic regression, decision trees, and neural networks separately. The evaluation showed that those models produced promising results, with neural networks having an accuracy rate of about 80%. Our results also showed that feature selection could be conducive to improving the performance of deception detection models. Potential benefits include reduced computational complexity and error rate, mitigation of overfitting, and increased generalizability.
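To make one of these model families concrete, here is a minimal sketch of logistic regression trained by batch gradient descent on cue vectors, written in plain Python. The two features (uncertainty rate and lexical diversity), the learning rate, the epoch count, and the six training rows are all invented for illustration. This is not the authors' model or data, and it says nothing about the accuracy figures reported above.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example (predicted probability minus label).
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Average the gradients and take one descent step.
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that cue vector x comes from a deceptive message."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Invented cue vectors: [uncertainty_rate, lexical_diversity]; 1 = deceptive.
X_toy = [[0.30, 0.55], [0.25, 0.60], [0.35, 0.50],
         [0.05, 0.85], [0.02, 0.90], [0.08, 0.80]]
y_toy = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic(X_toy, y_toy)
```

Because the output is a probability rather than a hard label, the decision threshold can be tuned to the cost of false alarms versus missed deception.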

These models of automatic deception detection have their own limitations: they assume that there is no overlap between deception and truth and that cues to deception are orthogonal to one another. In reality, people are rarely 100 percent certain about a detection decision, indicating the theoretical fragility of this dichotomous distinction between deception and truth. Therefore, we extended fuzzy theory and related inference techniques to model and address uncertainty in deception detection, which improved both the interpretability and the reliability of deception detection models.¹² Further, to exploit possible dependency relationships between different cues, we explored using statistical language modeling techniques to build deception detection models, which also makes the models easily generalizable to different languages.
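The idea of replacing a hard deceptive/truthful split with degrees of membership can be illustrated with a small fuzzy rule. The sketch below is a generic textbook-style example, not the fuzzy model from the cited work: the membership breakpoints, the two cues, and the single rule are assumptions chosen only to show the mechanics.

```python
def ramp_up(x, low, high):
    """Membership that rises linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def deception_degree(uncertainty_rate, lexical_diversity):
    """Degree in [0, 1] to which a message matches one hypothetical rule:
    IF uncertainty is high AND diversity is low THEN the message is deceptive.
    """
    # Hypothetical breakpoints; real ones would be derived from data.
    high_uncertainty = ramp_up(uncertainty_rate, 0.05, 0.30)
    low_diversity = 1.0 - ramp_up(lexical_diversity, 0.50, 0.90)
    # The minimum is a common choice for fuzzy AND.
    return min(high_uncertainty, low_diversity)
```

A message with intermediate cue values receives an intermediate degree (for instance, values halfway between the breakpoints give a degree of about 0.5), which a human reviewer can weigh instead of being handed a binary verdict.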

Conclusion

Automatic deception detection is an important yet extremely challenging task, particularly in online communication, because of the absence of most nonverbal behaviors that are traditionally used as cues to deception. It is also non-trivial to understand online messages. For example, instant messages are usually short and informal, which may introduce ambiguity into message analysis. Therefore, there is a demand for intelligent tools that support the identification of linguistic cues to deception.

Following linguistic footprints paves the way for automatic deception detection in online communication. We recognize that cues to deception can be context-dependent. The generality and moderators of linguistic cues to deception in online communication merit future research. We are investigating the impact of individuals’ cultural background, deception experience, and deception skill on online deception behavior in real-world group communication. There are many other interesting issues for future deception research, including combining linguistic cues with content-based cues and non-verbal cues, and deception detection in online social networks and virtual worlds.

Figures

Figure 1. A Taxonomy of Cues to Deception (Adapted from [9])

Tables

Table 1. A Summary of Linguistic Cues to Deception

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

DOI

10.1145/1378727.1389972

September 2008 Issue

Published: September 1, 2008

Vol. 51 No. 9

Pages: 119-122
