Paperless-NGX OCR languages | Languages that work vs. languages that don’t work

A delicate topic: which languages Paperless-NGX OCR recognizes well, and which it does not.

Many PDF documents already contain their text as a text layer, and if Paperless is set up correctly, this text is retained.

Now let’s move on to scanned documents, which is where the problems arise. Paperless runs OCR (optical character recognition) on the document to add the missing text and make the document searchable.

However, this only works reliably for common European languages.

Overview

In this article, I draw on my experience trying to set up OCR for the Thai language and script. The language overview further down covers other languages and what OCR text recognition can achieve for each of them.

Well, just configuring the OCR languages can cause the web server to fail to start. But even if you manage to get Paperless to load and use your OCR language, you may only get single-letter recognition or even alphabet soup as a result. Bon appétit 😁😂

Setting Up OCR Languages

There isn’t much to consider here. In the docker-compose.yml, the following entry must be added under environment:

PAPERLESS_OCR_LANGUAGES: deu tha eng

This activates the languages German, Thai, and English for OCR.
Note that the languages must be separated by a space.

⚠️ Do not use:

# PAPERLESS_OCR_LANGUAGE: deu+tha+eng

You should not configure this value in docker-compose.yml. Why? I have had multiple issues with it: the web server would no longer start because the language files were not loaded. Also note that, unlike PAPERLESS_OCR_LANGUAGES, this setting requires a “+” sign as the separator.

Instead, we configure this in the Paperless web interface, where it works reliably.

Go to the Paperless interface: Configuration → OCR Settings

There, enter the required language codes in the Language field, e.g.: deu+tha+eng

Paperless will then apply these languages during OCR, using the language files installed through the PAPERLESS_OCR_LANGUAGES entry.
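Because the two places expect different separators, it is easy to paste the wrong format. Here is a minimal Python sketch of the conversion. The function names are made up for illustration and are not part of Paperless:

```python
def to_ocr_language(languages_value: str) -> str:
    """Convert the space-separated form ("deu tha eng") to the "+" form."""
    return "+".join(languages_value.split())


def to_ocr_languages(language_value: str) -> str:
    """Convert the "+" form ("deu+tha+eng") to the space-separated form."""
    return " ".join(code for code in language_value.split("+") if code)


print(to_ocr_language("deu tha eng"))   # deu+tha+eng
print(to_ocr_languages("deu+tha+eng"))  # deu tha eng
```

The first form belongs in docker-compose.yml, the second in the Language field of the web interface.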

The results with the following approaches:

Using the built-in tools, I managed to get the script recognized and text recognition running.

Result: “ส วั ส ด ี” instead of “สวัสดี”

So the consonants are separated by spaces, as are vowels and tone marks, which makes the text unusable.

The result with Python scripts: “ส วั ส ด ี” instead of “สวัสดี”
The result with external processing in “Stirling PDF”: “ส วั ส ด ี” instead of “สวัสดี”
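This symptom can also be checked automatically. The following Python sketch flags text in which most Thai tokens are only one or two code points long. The function name, the two-code-point limit, and the 0.8 threshold are my own assumptions for illustration, not something Paperless provides, and very short genuine Thai words would be flagged as well.

```python
def looks_fragmented(text: str, threshold: float = 0.8) -> bool:
    """Return True if most Thai tokens in `text` are only 1-2 code points long."""
    # Keep only whitespace-separated tokens that contain a Thai character
    # (Unicode block U+0E00 to U+0E7F).
    thai_tokens = [
        token for token in text.split()
        if any("\u0e00" <= ch <= "\u0e7f" for ch in token)
    ]
    if not thai_tokens:
        return False
    short = sum(1 for token in thai_tokens if len(token) <= 2)
    return short / len(thai_tokens) >= threshold


broken = "\u0e2a \u0e27\u0e31 \u0e2a \u0e14 \u0e35"  # the spaced-out OCR output
correct = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"     # the expected word

print(looks_fragmented(broken))   # True
print(looks_fragmented(correct))  # False
print(correct in broken)          # False: a plain text search cannot find the word
```

The last line shows why such output is useless for searching: the word as a user would type it never appears as a contiguous string.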

What’s the problem?

OK, it doesn’t work, but why? Let’s test other OCR providers, both online services and local software packages…

Result: Funnily enough, it doesn’t work there either: “ส วั ส ด ี” instead of “สวัสดี”.

So it’s clear: Paperless isn’t the problem.

Paperless performs text recognition (OCR) using the best methods available. The real problem is that OCR works poorly or not at all for certain scripts, especially complex non-Latin ones.
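To check for yourself that the OCR engine rather than Paperless is responsible, you can run the same scan through the engine directly. Paperless-ngx performs OCR with Tesseract (through OCRmyPDF), so a rough sketch could call the tesseract command-line tool from Python. It assumes that tesseract and the required language files are installed locally. The file name scan.png and the helper names are placeholders of mine.

```python
import subprocess


def build_tesseract_command(image_path: str, languages: list[str]) -> list[str]:
    """Build a tesseract call that prints the recognized text to stdout."""
    # "stdout" as the output base makes tesseract write the text to standard
    # output; -l takes the languages joined with "+", e.g. "deu+tha+eng".
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def run_ocr(image_path: str, languages: list[str]) -> str:
    """Run tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, languages),
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout


# Only the command is printed here, so this runs without tesseract installed.
print(build_tesseract_command("scan.png", ["deu", "tha", "eng"]))
# ['tesseract', 'scan.png', 'stdout', '-l', 'deu+tha+eng']
```

Calling run_ocr("scan.png", ["tha"]) on a real Thai scan then shows what the engine produces without Paperless in between.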

End of the road:

The only thing that helps here is patience. We have to wait until OCR technologies have advanced to the point where they can correctly handle “exotic languages” such as Thai. Development is constantly progressing, so it’s only a matter of time before these languages are also reliably supported.

✅ Fully Functional Languages (Latin Scripts)

These languages are correctly processed with word and sentence recognition:

German (deu) – Good recognition, umlauts correct
English (eng) – Best results, optimized recognition
French (fra) – Accents and special characters correct
Spanish (spa) – Accents and ñ characters work properly
Italian (ita) – Good sentence recognition
Portuguese (por) – Basic functionality

⚠️ Limited Functionality Languages

Technically installable, but with recognition problems:

Thai (tha)

Result: Individual characters separated by spaces
Example: “ส วั ส ด ี” instead of “สวัสดี”
Usability: Not searchable/processable

Arabic (ara)

Result: Right-to-left flow often broken, character isolation
Problem: Connected script gets dissolved, context lost
Usability: Limited to unusable

Japanese (jpn)

Result: Kanji, Hiragana, Katakana mixed but error-prone
Problem: Complex characters often misrecognized
Usability: Only conditionally suitable for simple texts

Chinese (chi_sim/chi_tra)

Result: Many character errors, similar characters confused
Problem: Thousands of logograms overwhelm the recognition
Usability: Unreliable for serious document processing

Korean (kor)

Result: Hangul syllable block recognition partially functional
Problem: Complex syllables are often missegmented
Usability: Conditionally usable for simple texts

Russian (rus)

Result: Cyrillic letters mostly correct
Problem: Better results than Asian scripts, but worse than Latin
Usability: Acceptable for standard texts

❌ Non-OCR “Languages”

osd – Layout detection only (text orientation, writing system)

Conclusion

Paperless-ngx OCR works reliably only with Latin scripts. Non-Latin scripts are technically supported, but apart from Cyrillic, which is acceptable for standard texts, they deliver practically unusable results due to incorrect character segmentation and recognition errors.

Usability Ranking:

Latin Scripts – ✅ Fully functional
Cyrillic – ⚠️ Limited usability
Arabic/Thai – ❌ Practically unusable
Asian Scripts – ❌ Unreliable to unusable

OCR text recognition is currently not a solution for documents in most non-Latin scripts: the OCR results are not searchable and cannot be processed further.

In practice, this means that the “in-text search” function does not work with these OCR-processed documents, and that search is actually one of the biggest advantages of Paperless as a DMS.

But so what? Paperless is no worse off than other DMS products, because they have the same problem.

I am confident that better language training files will soon be released, particularly for Asian languages, enabling significantly better text recognition.
