
Content Moderation on a Shoestring

Three Weird Tricks: How Communities Save Money on Moderation

9 June 2025


Content moderation is firmly in the category of necessary evil, like door locks, car insurance, and visits to the dentist. It may be slightly ahead of protection rackets in popularity, but let’s face it: nobody is particularly excited to spend money on it.

Naturally, people look for workarounds.

Trigger Words + Generative AI

Imagine the following situation.

A doorman, guarding the entrance to a building, screens the visitors. He sees a man openly carrying a Crocodile Dundee-style knife and asks him to stop. “Please hold on. I am going to contact our expert, a PhD who will determine whether you’re a criminal.” The PhD arrives in 15 minutes, asks a few questions about the subject’s childhood, shows him an inkblot and asks him to interpret it, and, after a thorough analysis, declares that the visitor might be dangerous. The doorman pays the PhD $100 for the consultation, and the expert departs.

Next time, the same visitor arrives with a gun hidden inside his jacket. The doorman lets him in without uttering a word.

Sounds absurd? And yet this is roughly the way some online communities engineer their content moderation:

Step 1. They start by looking for trigger words inside the messages. You know, plain-text matching, sometimes even just English “bad words”. Or, even worse, bad words from all languages lumped together in one list.

Step 2. If a bad word is spotted, the “expert” (Generative AI) is called in to resolve the situation.
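
To make the flow concrete, here is a minimal sketch of that two-step pipeline in Python. The trigger list and the escalate_to_genai function are hypothetical placeholders, not any particular product's implementation; the point is the control flow.

    # A minimal sketch of the "trigger words + GenAI" setup described above.
    TRIGGER_WORDS = {"kill", "knife", "die"}  # toy list for illustration

    def escalate_to_genai(message: str) -> bool:
        """Placeholder: ask a generative model whether the message is really a problem."""
        raise NotImplementedError  # wire up whatever LLM provider is in use

    def moderate(message: str) -> bool:
        tokens = {t.strip('.,!?"').lower() for t in message.split()}
        if tokens & TRIGGER_WORDS:             # step 1: cheap keyword screen
            return escalate_to_genai(message)  # step 2: the expensive "expert"
        return False  # no trigger word: waved through, like the gun under the jacket

Anything that avoids the trigger list never reaches the “expert” at all; anything that hits it pays the full GenAI price.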

Does it actually work?

Issue 1: GenAI is like Forrest Gump’s box of chocolates.

Sometimes, Generative AI may produce astute insights. Sometimes, it plainly hallucinates the answer, inventing non-existent fragments. Sometimes, it misses the blindingly obvious.

The issue is not even the errors themselves. The issue is inconsistency.

How often does GenAI get it wrong at scale? Who knows! Some say 30%, some say 70%. The fun part is, you can’t even be sure you will get the same accuracy tomorrow. Updates may improve it, or degrade it. Or it may drift even without updates.

And let’s not even get started on the costs and the throughput…

Issue 2: False Positives with Keywords

If you thought that at least the keyword part is ironclad, boy, do I have news for you. Let’s assume only complete words are matched (so that classic does not become clbuttic). But still:

  • casually mentioning a Spanish name like “Loli” (Dolores) causes some social networks to suggest getting help for one’s unhealthy urges
  • saying “that” in Dutch (die) may also trigger some alerts (an advertisement using that word does look terrifying if you assume all words in all foreign languages mean exactly the same as they do in English). Same for saying “end” in Swedish (slut). Bonus if foreign words are added: apparently, some communities blacklist words like shine because it means “die” in Japanese.
  • “I would kill for a cigarette”, “I am dying to see this movie”, … You get the idea (there is a tiny sketch right after this list).
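
A tiny, hypothetical illustration of how easily this happens even with whole-word matching (the word list is made up for the example):

    # Toy whole-word filter; "slut" simply means "end" in Swedish, "Loli" is a Spanish nickname.
    TRIGGER_WORDS = {"kill", "die", "loli", "slut"}

    def flagged(message: str) -> set[str]:
        tokens = {t.strip('.,!?"').lower() for t in message.split()}
        return tokens & TRIGGER_WORDS

    print(flagged("I would kill for a cigarette"))       # {'kill'} - false positive
    print(flagged("Loli, can you pick me up at five?"))  # {'loli'} - false positive
    print(flagged("Programmet är slut"))                 # {'slut'} - Swedish for "the programme has ended"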

The number of false positives is far greater than most people realize. We have multiple customers who initially based their moderation on keywords. One customer, who wanted to pre-screen Facebook content, discovered that their keyword filter was flagging as much as 10% of the entire firehose traffic.

Escalating 10% of content doesn’t save money; it is a highway to bankruptcy.

Issue 3: Inflected Forms

English words don’t have too many inflected forms, in comparison with their European and Middle Eastern counterparts. But even English has exceptions. If you want to screen a word like buy, you also need to add bought; unless you want to use b as a stem. The plural of crisis is not crisises; and so on.

For most languages, multiply the number of inflections by 10 or 30. Oh, and to screen languages without spaces (Chinese, Japanese, Thai), you need to divide the text into words first.

You want to screen the word kill? In an inflected language, add the future will kill (easily 6 forms), the past killed (7 forms), and the present kill (5 more forms). And, of course, repeat that for every language.
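
A quick sketch of the gap, using the same kind of whole-word matching against base forms only (the list is illustrative):

    # Matching only the dictionary forms: everything inflected slips through.
    KEYWORDS = {"kill", "buy", "crisis"}

    def hits(message: str) -> set[str]:
        tokens = {t.strip('.,!?').lower() for t in message.split()}
        return tokens & KEYWORDS

    print(hits("I will kill you"))        # caught: "kill" appears as-is
    print(hits("he killed the deal"))     # missed: "killed"
    print(hits("bought it yesterday"))    # missed: "bought"
    print(hits("multiple crises ahead"))  # missed: "crises"

The alternatives are enumerating every form in every language by hand, or lemmatizing (and, for Chinese, Japanese, or Thai, segmenting) the text first, which is exactly the kind of real NLP the keyword approach was supposed to avoid.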

Issue 4: Where Keywords Don’t Apply

To make it even worse, the vast majority of problematic content is expressed in neutral terms, perfectly fine in polite society. Sometimes it’s not even “terms”. How do you capture a bitcoin address? A regex, you say? Not a bad idea, if you’re prepared to manage a set of tens or hundreds of regexes.
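
For illustration, a rough sketch of what one such regex might look like, covering only the legacy Base58 and Bech32 Bitcoin address formats; real detection would also validate checksums and handle many more payment channels:

    import re

    # Rough sketch: legacy addresses (start with 1 or 3, Base58) and Bech32 (bc1...).
    BTC_ADDRESS = re.compile(
        r"\b(?:[13][a-km-zA-HJ-NP-Z1-9]{25,34}|bc1[ac-hj-np-z02-9]{25,59})\b"
    )

    text = "Send 0.1 BTC to 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2 and I'll ship tonight."
    print(BTC_ADDRESS.findall(text))  # ['1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2']

And that is one pattern for one payment method; multiply it by every cryptocurrency, payment app, and contact channel, and you are maintaining the set of tens or hundreds of regexes mentioned above.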

A fun story. A hotel booking portal had an issue with vendors looking for loopholes to avoid paying the commission. The vendors would encourage the users to take the deal off the platform in exchange for a better price (since they no longer had to pay the middleman). The portal's workaround was to look for email addresses and phone numbers using regexes. There were two issues with this solution, though:

  1. The users learned quickly and simply started writing email addresses with spaces or “at” instead of @, and spelling out phone numbers in words.
  2. The phone number regexes also matched reservation numbers, which meant good-faith users couldn’t exchange the pieces of information they legitimately needed to exchange.

Not only did the workaround leave the loophole open, it also caused problems for honest users.
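
A hypothetical reconstruction (not the portal's actual code) shows both failure modes at once:

    import re

    # Naive phone-number pattern, similar in spirit to the portal's workaround.
    PHONE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

    messages = [
        "Call me on +44 20 7946 0958 for a better rate",                  # intended catch
        "My reservation number is 480 112 9934, is breakfast included?",  # honest user, flagged anyway
        "my number is four four seven nine oh one two three",             # trivially evaded
    ]
    for m in messages:
        print(bool(PHONE.search(m)), "-", m)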

Issue 5: Obfuscation

Overwhelmed? But we’re not done yet.

Spare a thought for algospeak (“unalive”, “seggs”, “tism”), asterisks, and the increasingly creative ways to obfuscate the message.

“We will br*k your l*gs with st*ks.” What should we put in the word list to detect this?

Or, can you distinguish between a and а? Because these are two different letters: one is Latin, the other Cyrillic. And these are just entry-level obfuscations. There is, reportedly, decomposed Korean Hangeul, there are special Unicode characters (full-width, mathematical, etc.), and there is even the exotic, long-dead Etruscan script used to substitute Latin lookalikes.
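
Part of this is fixable with Unicode normalization plus a confusables map; here is a minimal sketch (the map is a tiny hand-picked sample; real homoglyph data, such as Unicode's confusables list, runs to thousands of entries):

    import unicodedata

    # A few Cyrillic lookalikes mapped to their Latin twins (illustrative, far from complete).
    CONFUSABLES = {"а": "a", "е": "e", "о": "o", "с": "c", "р": "p", "х": "x"}

    def unobfuscate(text: str) -> str:
        # NFKC folds full-width and "mathematical" letter variants into plain ASCII
        text = unicodedata.normalize("NFKC", text)
        return "".join(CONFUSABLES.get(ch, ch) for ch in text).lower()

    print(unobfuscate("ｗｅ ｗｉｌｌ ｆｉｎｄ ｙｏｕ"))  # full-width Latin -> "we will find you"
    print(unobfuscate("frее сосаine sаmples"))           # mixed Cyrillic -> "free cocaine samples"

Asterisks, algospeak, and deliberate misspellings, of course, survive this untouched; they need fuzzier matching than any normalization step can provide.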

Using Sentiment Analysis

Some developers believe that content moderation can be solved by sentiment analysis. As in, problematic content = very negative sentiment. Right? Wrong.

There is indeed overlap between many types of problematic content and negative sentiment. But it’s just that, overlap. Negativity in itself does not mean bullying or racism or criminal activity.

More importantly, problematic content may have positive sentiment, such as: “We sell excellent cocaine.” Or, very often, it is neutral. For example, when trying to exchange contact details like in the hotel booking example.
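
You can see the mismatch with any off-the-shelf sentiment scorer; here is a quick sketch with NLTK's VADER (the exact numbers will vary, but “excellent” pulls the first score firmly into positive territory):

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    for text in ["We sell excellent cocaine.",
                 "Call me on +44 20 7946 0958, we can sort it out off the site."]:
        print(sia.polarity_scores(text), "-", text)
    # The first is likely scored as positive, the second as neutral;
    # neither score says anything about whether the content belongs on your platform.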

(Ask yourself: if it were possible, how come the share of content moderation APIs is a fraction of that of sentiment analysis?)

Translate to English and then Moderate English

Imagine that you run a global community. And you already have content moderation for English (for example, built in-house).

We have encountered many otherwise intelligent people who believe that analyzing text in its source language produces the same result as machine-translating it to English and then running English text analysis.

The idea is bad even for cookie-cutter NLP tasks like entity extraction or sentiment analysis.

With content moderation, it is downright horrible — on multiple levels.

Issue 1: It won’t even catch slurs properly.

How many slurs are there in English for Ukrainians? None, right? Obviously, that is not the case in Russian. Meaning, there is no way to retain the offensiveness when the target language does not have equivalent slurs.

And even when English has slurs for a particular group, machine translation usually either removes the word completely, or chooses a neutral form. A question like “are you a <slur>?” is translated to a neutral “are you a <nationality/ethnicity>?”

Translation (both human and machine) is designed to produce sanitized translations, correcting mistakes and often eliminating slang. The translators and machine translation vendors want happy customers that won’t get sued for offensive content they inadvertently published.

And that renders most machine translation engines almost useless for bad language.

Issue 2: More subtle untranslatable differences.

Translation is lossy. While human languages can express roughly the same concepts and ideas, much nuance still gets lost.

While Americans are fighting over “pronouns”, the Russian language has something of a “preposition war” around a grammatical peculiarity. In Russian, both direction and location (to and in) are expressed with one of two prepositions:

  • в (“v”, lit. into)
  • на (“na”, lit. onto)

Most locations are used with v, except islands, peninsulas, compass directions, or plateaus. Only a handful of inland location names are used with na. They are (I might be missing a couple):

  • Altai (a region in Russia)
  • Ukraine
  • areas around mountain ranges: the Urals (a region in Russia), the Caucasus (a region part of which is in Russia), Tian Shan (a region in Central Asia, part of which was in the USSR). The Andes, the Alps, and the Himalayas opt for the standard v.

(Note, directional na can be used when talking about a military offensive, e.g. na Berlin.)

You guessed it: using na before Ukraine implies that it, too, is a region in Russia. This, obviously, does not sit well with Ukrainians, who use v when talking about Ukraine. Some use na when referring to Russia, as a form of linguistic retaliation of sorts.

As expected, all of that is lost in translation, which simply yields neutral in and to in all cases, erasing the emotional component.

Never mind Ukraine, can I use this method elsewhere?

What if we just ignore the biggest military conflict in Europe since WW2? Will the rest work? Still no.

If, say, the source language (like many European languages) embeds gender in every noun and adjective, while the target language (say, English) opts for gender-neutral nouns, a crucial aspect will simply be omitted. The translation may not make sense at all: “I generally can’t stand <ethnicity> but like <ethnicity in feminine>” will often be translated to English as “I generally can’t stand <ethnicity> but like <ethnicity>”.

And these are just a couple of examples.

You Might As Well Toss a Coin

In a perfect world (or in a really small, quiet place), a thumb latch is enough to secure your home. In the real world, nobody in their right mind would entrust their security to this kind of protection.

Content moderation is no different. Sadly, the dangers and the nastiness online go far beyond rude jokes. They have the potential to spill offline and cause tangible financial and even physical damage.

The workarounds described above are wishful thinking. Buy cheap, buy twice.

You want it cheap? You might as well toss a coin, or pick a random sample of posts and forward them to a human moderator for a check.

Or just get an affordable but functional content moderation tool that covers all the bases. Like Tisane.
