Dave Osler

What Are the Top Languages Used on the Dark Web?

Our Head of Product, Dave Osler, looks at the most common languages spoken on the dark web following the launch of Searchlight Cyber’s new AI-powered language translation capabilities.

What Are the Most Used Languages on the Dark Web?

For us, this is more than just an academic question. It was the question we had to ask ourselves as we looked to build our new AI-powered language translation capabilities, launched in our dark web monitoring and investigation products today. What languages would we have to be able to cover in order to help our customers understand what they were reading on the dark web? This is what we found.

The top 10 languages used on the dark web are:

  1. English
  2. Russian
  3. German
  4. French
  5. Spanish
  6. Bulgarian
  7. Indonesian
  8. Turkish
  9. Italian
  10. Dutch

It might be surprising to readers that the vast majority of the data that we parse from the dark web is written in English.

After English, Russian is by far the most popular language on the dark web, accounting for 66 percent of non-English language content. It is followed by German (at 9 percent) and French (at 7 percent). Outside of the top 10 listed above, each language’s share is below 1 percent.

This means that by covering the remaining nine languages – plus Standard Chinese – we are able to automatically translate 94 percent of non-English language content on the dark web for our customers.
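As a quick check on how those figures fit together, the short sketch below simply redoes the arithmetic with the percentages quoted above. The variable names are illustrative only, and the per-language shares for languages other than Russian, German, and French are not published here, so the sketch only derives their combined total.

```python
# Shares of non-English dark web content quoted in this article (percent).
named_shares = {"Russian": 66, "German": 9, "French": 7}

# Share of non-English content covered by automatic translation (percent),
# as quoted in this article.
total_coverage = 94

# Combined share of the three languages with published figures.
named_total = sum(named_shares.values())

# What is left for the other covered languages (Spanish, Bulgarian,
# Indonesian, Turkish, Italian, Dutch, and Standard Chinese) combined.
other_covered = total_coverage - named_total

print(f"Russian, German and French combined: {named_total}%")
print(f"Other covered languages combined: {other_covered}%")
print(f"Non-English content not covered: {100 - total_coverage}%")
```

In other words, Russian, German, and French alone make up 82 percent of non-English content, and the other seven covered languages contribute the remaining 12 percentage points.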

The Challenges of Language Translation

Of course, it is not just the coverage of the dark web that counts but also the quality of the translation. The challenge we wanted to solve for our customers is the barrier to detecting and responding to threats on the dark web when the content isn’t in English.

Historically, analysts and investigators monitoring criminal activity on the dark web have approached this by copying and pasting the content they find into generic translation tools. However, this has several limitations.

Firstly, it is an additional, manual step that analysts need to take – and this is time that can really add up if you imagine having to copy and paste the text of every post in a thread on a Russian-language dark web forum. It’s also a nightmare for building a case because the investigator either has to then find a way to store both the English-language and native-language versions of the posts or translate the posts each time they need to refer back to them.

Secondly, there is a challenge in being able to find content, as you would need to search using terms in the original language. This is possible by translating key search terms but – once again – this is a very manual process with a high degree of error, as nuances and synonyms in the native language are likely to be missed. And sometimes, you don’t know what you don’t know.

Finally, there is the issue of accuracy. Generic language translation tools are not designed to be used on dark web data and a lot of the nuance and context of the original text can be lost, meaning that investigators miss critical information that may be vital for investigating crime or identifying a cyberattack against an organization.

AI-Powered Language Translation

Overcoming these challenges for our customers required us to create our own, bespoke solution: AI-powered translation trained on our dark web dataset.

Our translation tool is powered by a Neural Machine Translation (NMT) system. The advantage of this type of AI is that it isn’t simply translating one word at a time – it takes the sentence as a whole and translates into the target language the way a professional human translator would. This vastly increases the accuracy of the translation as the true meaning of the sentence is captured.

The NMT system has been trained on Searchlight Cyber’s data lake – more than 167 million dark web data points from forums, marketplaces, and threat actor posts – once again increasing the accuracy of the translation for the content that our customers are dealing with.

Translating Russian Slang Used on the Dark Web

One area that we particularly focused on in this release – due to the prominence of the Russian language on the dark web – is training the AI to more accurately translate the Russian slang words that are used on the dark web. This is where the value of using our own dataset really shows – as dark web users have a very specific lexicon that isn’t going to be recognized by generic language translation tools.

With dark web data going back 15 years, Searchlight is in a unique position to be able to train an AI on the language that is used on the dark web – and these nuances really matter. These are just three examples of where our translation tool was able to more accurately decipher the topic of conversation, where a generic language translation tool failed.

Example 1: Brute-force cyberattack

  • Original Russian: Только мне кажется что акк сбручен либо куплен?
  • Generic language translation tool: Am I the only one who thinks that the account has been linked or purchased?
  • Searchlight’s language translator: Am I the only one who thinks the account is brute-forced or purchased?

Example 2: Bitcoins

  • Original Russian: Менял битки, все прошло очень быстро, доволен селлером, рекомендую!
  • Generic language translation tool: I changed the cue balls, everything went very quickly, I’m happy with the seller, I recommend it!
  • Searchlight’s language translator: I changed Bitcoins, everything went very quickly, I’m happy with the seller, I recommend it!

Example 3: Hemp

  • Original Russian: Когда шишки появятся в Нино ??? Жду жду я бы затестил их сам знаешь
  • Generic language translation tool: When will the cones appear in Nino??? Waiting, waiting, I would test them you know
  • Searchlight’s language translator: When will hemp appear in Nizhny Novgorod??? Waiting, waiting, I would test them you know

As you can see, in each of these cases a key component of the message would have been lost using a language translator that wasn’t trained on Russian slang (if you’re interested, Nizhny Novgorod is a city in Russia).

The Benefits of AI-Powered Translation

Language translation is a brilliant use case for AI and this new feature in Cerberus and DarkIQ will deliver immediate value for our customers, helping them in their vital work to combat crime emanating from the dark web:

  • One-click translation within our products – Saving time and effort, with the ability to revert to the original language version if required.
  • Ability to search for non-English language content using English search terms – Vastly improving the ability to find relevant intelligence on dark web marketplaces, forums, and other hidden sites.
  • Increased accuracy – Enabling them to make faster connections, gather critical intelligence, and ultimately take more decisive action with the comfort of information that can be trusted.
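To make the second point concrete, here is a minimal sketch of why keeping a translation alongside the original text lets an English search term find a Russian post. It is not how Cerberus or DarkIQ are implemented: the `Post` class and `search` function are hypothetical names invented for this illustration, and the sample data is taken from the translation examples earlier in this article.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical record that keeps both versions of a post together."""
    original: str      # text exactly as it was posted
    translation: str   # English translation stored alongside it


def search(posts, term):
    """Return posts whose English translation contains the term (case-insensitive)."""
    needle = term.casefold()
    return [post for post in posts if needle in post.translation.casefold()]


# Sample data: two of the Russian examples quoted earlier in this article.
posts = [
    Post(
        "Только мне кажется что акк сбручен либо куплен?",
        "Am I the only one who thinks the account is brute-forced or purchased?",
    ),
    Post(
        "Менял битки, все прошло очень быстро, доволен селлером, рекомендую!",
        "I changed Bitcoins, everything went very quickly, "
        "I’m happy with the seller, I recommend it!",
    ),
]

# An English search term finds the Russian-language post.
hits = search(posts, "bitcoin")
for post in hits:
    print(post.original)
```

Because each record holds both versions, the investigator can search in English and still refer back to the original wording without translating the post again.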

To find out more, visit our dedicated overview page: AI-Powered Language Translation.

Or BOOK A DEMO to see for yourself how language translation is used in our dark web intelligence products.