
Written by Elliot Jaffe

For many, the word propaganda conjures up imagery of totalitarian regimes, indoctrination, and wartime struggles of eras past. In reality, propaganda takes many forms far more common and subtle than these stereotypical cases suggest.

A rising share of Americans now obtain their news online through news outlets and social media. As efficient as these channels are at informing an audience, they are equally efficient at letting an unreliable source misinform, mislead, and inflame. This is because, at its core, propaganda works to manipulate and misrepresent an idea so that the audience agrees with the message and spreads (or propagates) it to others.

With a range of information sources online – niche or mainstream, reviewed publication or provocateur – how can instances of propaganda, or at least the manipulative techniques it relies on, be detected without an expert checking each article and post? Media sites have so far relied on users to flag content, but that method's inaccuracy raises the demand for automatic detection of suspect content. The answer lies in a promising use of artificial intelligence: teaching machines to read online content autonomously and alert us to deceptive posts. But how does a machine actually understand or evaluate an article?

The SemEval-2020 (Semantic Evaluation) International Workshop aimed to tackle this issue. During the contest, international research teams were challenged to hone computers' ability to understand human language. This field, which draws principles from mathematics, computer and data sciences, and linguistics, is known as 'natural language processing' (NLP). NLP aims to give machines the ability to interpret human languages as we do: phrasing, emotion, sarcasm, and all. Though the field is still emerging, many uses of NLP are already commonplace; voice assistants like Alexa and Siri, editing services like Grammarly, Google Translate, and handwriting recognition are some that many of us have likely encountered.

SemEval Task 11, "Detection of propaganda techniques in news articles," breaks detection down into two main strategies for computers to spot message manipulation. First, researchers develop an algorithm that looks for "misuse of logical rules," meaning that a passage whose argument hinges on a logical fallacy is flagged (identification) and labeled with the technique it uses (classification). Second, the algorithm focuses on emotional provocations: loaded and sentimental wording that mischaracterizes the subject, e.g., 'hate,' 'pure,' or 'destroy.'
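
As a rough illustration only – not the method used by any SemEval team – the flag-and-label idea can be sketched in a few lines of Python. Here a hand-picked list of emotionally loaded words stands in for a trained model, and the word list, function name, and sample sentence are all invented for the example:

```python
import re

# Hypothetical list of emotionally loaded words; a real system would learn
# these cues from annotated training data rather than use a fixed list.
LOADED_WORDS = {"hate", "pure", "destroy", "disaster", "traitor"}

def flag_loaded_language(article_text):
    """Return (start, end, label) spans for emotionally loaded words.

    Mirrors the two steps described above: identification (where the
    suspect wording occurs) and classification (which technique it is).
    """
    spans = []
    for match in re.finditer(r"\b\w+\b", article_text):
        if match.group().lower() in LOADED_WORDS:
            spans.append((match.start(), match.end(), "loaded_language"))
    return spans

sample = "They will destroy everything we hold pure."
print(flag_loaded_language(sample))
# [(10, 17, 'loaded_language'), (37, 41, 'loaded_language')]
```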

Yet, if it's hard enough for a person to spot an unsound argument, how can we expect a machine to do any better? After all, where we read and process a sentence, a computer sees only strings of characters. For anything to make sense to a computer, we must feed it numbers; in NLP, this is accomplished by describing each word with a set of numerical values called a vector. Vectors of words with similar definitions and usage tend to be grouped together mathematically.
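
To make the vector idea concrete, here is a toy sketch. The three-dimensional vectors below are made-up numbers chosen purely for illustration; real word vectors are learned from large text corpora and typically have hundreds of dimensions:

```python
import numpy as np

# Made-up three-dimensional "word vectors" for illustration only.
vectors = {
    "destroy":  np.array([0.9, 0.1, 0.3]),
    "demolish": np.array([0.8, 0.2, 0.4]),
    "calm":     np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1.0 means similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["destroy"], vectors["demolish"]))  # high (~0.98)
print(cosine_similarity(vectors["destroy"], vectors["calm"]))      # low  (~0.27)
```

Words that appear in similar contexts end up with similar vectors, which is what lets a machine treat 'destroy' and 'demolish' as near neighbors while keeping 'calm' far away.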

Now that we're operating numerically, a machine can be trained to parse, moving on from single-word definitions to the roles words play within a sentence and the meaning they convey together, known as semantics. For instance, to understand "Keep calm and carry on," we must parse the phrase to know that it's a suggestion aimed at an implied second party and that the verb "keep" really means "remain," not "withhold." "On" in this context must also be understood as part of the phrasal verb "carry on," not as a description of an object's state. By giving plenty of examples (i.e., training data) of words and phrases in context and providing constant feedback on whether the proper syntactic conclusions are being drawn, a computer will gradually learn which adjustments decrease errors and earn more positive feedback.
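
One widely used library, spaCy, can produce this kind of analysis out of the box; the SemEval teams were not required to use it, so treat this only as a sketch of what parsed output looks like (and it assumes the small English model has been downloaded):

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Keep calm and carry on.")

# For each word, print its part of speech, its grammatical role, and the
# word it attaches to; "keep" and "carry" typically come out as verbs,
# with "carry" linked back to "keep" as a conjunct.
for token in doc:
    print(f"{token.text:6} {token.pos_:6} {token.dep_:10} head={token.head.text}")
```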

The distinction here is that the computers in this challenge aren't learning 'proper' logic per se. Instead, they're being trained so that when they read a new article, they can tell us whether its structure or sentiment matches any technique they encountered in the training data, which includes a mix of valid and invalid logical techniques. With this strategy, we can pick up on certain trends: one submission found that cases of the "loaded language" fallacy correlate with higher scores for anger, fear, and sadness. Another saw that the "appeal to fear" fallacy played out over longer stretches of text than did "minimization," helping us know precisely where each fault lies in the article.
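
To give a flavor of how that training-with-feedback works – this is a toy sketch with invented snippets and labels, not the SemEval corpus or any team's actual model – a minimal technique classifier might pair word-frequency features with logistic regression:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training snippets; the real task provides professionally
# annotated news fragments covering more than a dozen techniques.
texts = [
    "These traitors will destroy our pure way of life",      # loaded language
    "Act now or everything you love will be gone forever",   # appeal to fear
    "The committee reviewed the budget figures on Tuesday",  # none
    "It was only a tiny, harmless mistake, nothing at all",  # minimization
]
labels = ["loaded_language", "appeal_to_fear", "none", "minimization"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)  # the "feedback" step: adjust weights to reduce errors

print(model.predict(["Our enemies hate everything pure about this country"]))
```

With only four examples the prediction is little better than a guess; the point is the loop itself: featurize the text, compare the model's guess against the human label, and adjust until the errors shrink.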

Once the results of all submissions to the SemEval task are compiled, the hope is that this automatic detection system will help inform the reader when there's possible manipulation afoot. While the technology is surely impressive, the implications for how users will interact with online content in the future are less clear. This research effort focuses on detecting "propaganda" within the news, but the same techniques could be applied to other forms of media that are not considered explicit propaganda.

If these algorithms are widely implemented, the distinction between detecting misleading messages and automatically censoring competing views becomes more critical. An algorithm has little knowledge of personal values, only data. Still, online users have long seen deliberate distortion and misinformation on social media, including content from dubious publications and elected officials. Sites like Twitter and Facebook occasionally counter these false claims by displaying the content with warnings, providing links to verified sources, or even removing posts altogether. News sites, however, retain a certain autonomy over their content – it's generally impossible to insert a fact-checking statement directly onto a site's page, though flagging the search result in a web browser is one workaround.

At the very least, similar interventions could become more routine if these new detection strategies are put to use, though it is worth noting that not all platforms will follow the same formula. While harmful claims can be spotted in news articles, social media will likely be the first and most visibly regulated information source. For now, the research sticks to detection, and the complete SemEval-2020 findings will be presented in Barcelona in December.