As social media emerged as a key source of information, malicious users started to manipulate social platforms to their own ends. Today, online disinformation efforts (so-called infodemics) are routinely plaguing public debates, political events, and information campaigns alike. Detecting fake news via online social media has become a central issue, fostering an arms race between malicious users and platform operators.
To date, two broad strategies have been developed to automatically detect disinformation campaigns on online media: analyzing the information content—leveraging natural language processing techniques3 or authoritative information sources—or analyzing its context, for example by exploring the interplay between end users, publishers, and news pieces.5 In the following paper, the authors focus on the latter strategy by introducing a new graph-based, contextual technique for fake news detection. Their approach rests on two main pillars: a structurally rich graph representation of social context on the one hand, and a dedicated learning framework leveraging an inductive approach to graph representation learning on the other.
The new graph representation introduced by the authors regroups all news articles being analyzed, their sources, the users who have engaged in spreading those articles, as well as all their direct neighbors in the social network. This creates a rich social context graph comprising both homogeneous links (modeling interactions between pairs of users or between sources referring to one another) and heterogeneous links (modeling relationships between a news article and its source, or between a user and the news articles she promotes).
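Such a typed graph can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: node types and relation labels (`publishes`, `spreads`, `follows`) are hypothetical names chosen to mirror the homogeneous and heterogeneous links described above.

```python
from collections import defaultdict

class SocialContextGraph:
    """Toy social context graph: typed nodes (news, source, user)
    connected by labeled edges (illustrative sketch only)."""

    def __init__(self):
        self.node_type = {}            # node id -> "news" | "source" | "user"
        self.edges = defaultdict(set)  # node id -> {(neighbor id, relation)}

    def add_node(self, node_id, ntype):
        self.node_type[node_id] = ntype

    def add_edge(self, u, v, relation):
        # Store both directions so local neighborhoods are easy to query.
        self.edges[u].add((v, relation))
        self.edges[v].add((u, relation))

    def neighbors(self, node_id, relation=None):
        return [v for v, r in self.edges[node_id]
                if relation is None or r == relation]

g = SocialContextGraph()
g.add_node("article_1", "news")
g.add_node("source_1", "source")
g.add_node("alice", "user")
g.add_node("bob", "user")
g.add_edge("source_1", "article_1", "publishes")  # heterogeneous link
g.add_edge("alice", "article_1", "spreads")       # heterogeneous link
g.add_edge("alice", "bob", "follows")             # homogeneous link
```

Keeping the relation label on each edge is what lets a downstream model treat a user-follows-user link differently from a user-spreads-article link.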
Each entity in the graph is initially represented by a static feature vector leveraging classical (for example, GloVe4) embeddings. The social context graph is however highly dynamic, with new nodes and edges constantly appearing following users' actions. As a result, the authors adopt an inductive approach to representation learning, namely GraphSAGE,1 where node embeddings can be created dynamically and efficiently by sampling and aggregating features from a node's local neighborhood. Those embeddings are further enriched with temporal engagement patterns, as the temporality of users' actions is known to play a central role in fake news propagation.2
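The core sample-and-aggregate step behind this inductive approach can be illustrated with a short sketch. This is a simplified, GraphSAGE-style mean aggregation under assumed inputs (a feature dictionary and an adjacency function); the learned linear transform and nonlinearity of a full layer are deliberately omitted.

```python
import random

def aggregate_neighborhood(node, features, neighbors, sample_size=2, seed=0):
    """Sample up to `sample_size` neighbors, mean-aggregate their feature
    vectors, and concatenate the result with the node's own features.
    This mirrors the sample-and-aggregate step of a GraphSAGE-style layer
    (the learned weights and nonlinearity of a real layer are omitted)."""
    rng = random.Random(seed)
    neigh = list(neighbors(node))
    sampled = rng.sample(neigh, min(sample_size, len(neigh)))
    dim = len(features[node])
    # Element-wise mean of the sampled neighbor feature vectors.
    agg = [sum(features[v][i] for v in sampled) / len(sampled)
           for i in range(dim)]
    return features[node] + agg  # concat(self, mean(sampled neighbors))

# Hypothetical two-dimensional features and a tiny adjacency map.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
adj = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
emb = aggregate_neighborhood("a", features, lambda n: adj[n])
# emb concatenates node a's features with the mean of b's and c's.
```

Because the embedding is computed from a node's neighborhood rather than looked up in a fixed table, a node that appears after training (a new user or article) can still be embedded on the fly, which is exactly the property that dynamic social graphs require.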
Assessing the performance of systems like FANG is delicate; empirical evaluations in this context typically rely on large, labeled datasets, where one must pick an unambiguous label (such as fake or not fake) for each document in the collection. However, such decisions are not always clear-cut: some pieces of text can be half-true, while others might be hard to verify, even for human experts, or might even be subjective depending on the exact topic (for example, controversial topics or opinions).
The authors do not innovate in this regard: they conduct their experiments on preexisting tweets that were categorized as either fake or not by two authoritative sources, namely Snopes and Politifact. They measure the performance of their approach by applying a dedicated learning framework on top of the graph representation model described, and show that their technique is effective even with very limited training data. The empirical evidence discussed in the paper is broad and includes various fake news detection experiments, an extrinsic evaluation of the generalizability of the new representation learning framework, as well as microscopic analyses of specific cases from their Twitter dataset.
In the end, FANG significantly outperforms the state of the art on the dataset, with an AUC of 75%. This result also represents a cautionary tale for future research in this domain. Despite using a broad range of features and sophisticated techniques, FANG remains far from optimal performance, which certainly makes sense given the complexity of the task at hand. Furthermore, this relatively modest score was obtained on a highly curated dataset composed of only two unambiguous classes, while reality is unfortunately much more intricate: fact-checking websites such as Snopes typically consider a whole range of non-binary labels to classify the articles they investigate, leveraging fine-grained ratings such as mostly false, satire, misattributed, or unproven for challenging cases.
The paper was recognized with the Best Paper Award at CIKM in late 2020—no small feat considering that almost 1,000 papers were submitted to the conference's main research track. Even if the performance of fully automated approaches like FANG on non-curated datasets remains largely unclear, the paper is a very compelling piece of work combining a new contextual graph model for social media with recent advances in representation learning to tackle an important and timely problem. I hope you enjoy reading it as much as I have.
To view the accompanying paper, visit doi.acm.org/10.1145/3517214
The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.