Paperless-NGX OCR languages | Languages that work vs. languages that don’t work
A delicate topic: which languages work in Paperless-NGX OCR text recognition, and which do not.
Many PDF files already contain their text as a text layer, and if Paperless is set up correctly, this text is retained.
Now let’s move on to scanned documents – this is where the problems arise. Paperless runs OCR (optical character recognition) on the document to add the missing text and make the document searchable.
However, this only works reliably for languages written in Latin script, such as the common European languages.
Overview
In this article, I draw on my experience of trying to set up the Thai language and script. The language overview further down lists other languages and what you can expect from OCR text recognition for each of them.
Just configuring the OCR languages can already cause the web server to stop starting. But even if you manage to get Paperless to load and use your OCR language, you may only get single-letter recognition or even an alphabet soup as a result – bon appétit 😁😂
Setting Up OCR Languages
There isn’t much to consider here. In the docker-compose.yml, the following entry must be added under environment:
PAPERLESS_OCR_LANGUAGES: deu tha eng
This installs the OCR language files for German, Thai, and English in the container, so that they are available for text recognition.
Note that the languages must be separated by a space.
⚠️ Do not use:
# PAPERLESS_OCR_LANGUAGE: deu+tha+eng
I recommend not setting this value in the docker-compose.yml. Why? It caused me repeated problems: the web server would no longer start because the language files were not loaded. Also note that this setting uses a “+” sign as the separator, not a space.
Instead, we configure this in the Paperless web interface, where it works reliably.
Go to the Paperless interface: Configuration → OCR Settings
There, enter the required language codes in the Language field, e.g.: deu+tha+eng
Paperless will then apply the corresponding language files, which the PAPERLESS_OCR_LANGUAGES entry in the docker-compose.yml installs when the container starts.
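The two settings use different separators, which is easy to mix up. The following is a minimal Python sketch that converts the space-separated list from the docker-compose.yml into the “+”-separated string for the Language field. The function name `to_ocr_language` is made up for this illustration and is not part of Paperless.

```python
def to_ocr_language(install_list):
    """Turn a space-separated list such as 'deu tha eng' into 'deu+tha+eng'.

    Hypothetical helper for illustration only; Paperless does not ship it.
    """
    codes = install_list.split()
    if not codes:
        raise ValueError("no language codes given")
    for code in codes:
        # A '+' inside the space-separated list means the formats were mixed up.
        if "+" in code:
            raise ValueError("use spaces, not '+', in the install list: " + code)
    return "+".join(codes)


if __name__ == "__main__":
    print(to_ocr_language("deu tha eng"))  # deu+tha+eng
```

The check for a stray “+” is only there to catch the typical copy-and-paste mistake between the two formats.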
The results with the different approaches:
Using only the built-in tools, I managed to get the script recognized and the text recognition started.
Result: “ส วั ส ด ี” instead of “สวัสดี”
So the letters are separated by spaces, as are vowels and tone marks – unusable.
The result with Python scripts: “ส วั ส ด ี” instead of “สวัสดี”
The result with external processing in “Stirling PDF”: “ส วั ส ด ี” instead of “สวัสดี”
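To check a larger batch of documents for this kind of output, a rough heuristic can help. The following is only a sketch under my own assumptions, not something Paperless provides: it treats a Thai token as "tiny" if it contains at most one base character (combining vowels and tone marks are not counted) and flags the text if most Thai tokens are tiny. The function names and the threshold of 0.8 are made up for this example.

```python
import unicodedata


def is_thai(ch):
    """True if the character lies in the Unicode Thai block (U+0E00 to U+0E7F)."""
    return "\u0e00" <= ch <= "\u0e7f"


def fragmented_ratio(text):
    """Share of Thai tokens that contain at most one base character."""
    tokens = [t for t in text.split() if any(is_thai(c) for c in t)]
    if not tokens:
        return 0.0

    def base_count(token):
        # Category 'Mn' covers the Thai vowels and tone marks that attach
        # above or below a consonant; they are not counted as base characters.
        return sum(1 for c in token if unicodedata.category(c) != "Mn")

    tiny = sum(1 for t in tokens if base_count(t) <= 1)
    return tiny / len(tokens)


def looks_fragmented(text, threshold=0.8):
    """Assumed threshold: flag the text if 80% of Thai tokens are tiny."""
    return fragmented_ratio(text) >= threshold


if __name__ == "__main__":
    print(looks_fragmented("ส วั ส ด ี"))  # True
    print(looks_fragmented("สวัสดี"))      # False
```

Because Thai is normally written without spaces between words, correctly recognized text produces long tokens, so a high share of tiny tokens is a reasonable warning sign. The heuristic only detects the problem; it does not repair the text.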
What’s the problem?
OK, it doesn’t work, but why? Let’s test other OCR providers, both online services and local software packages…
Funnily enough, it doesn’t work there either. Result: “ส วั ส ด ี” instead of “สวัสดี”.
So it’s clear: Paperless isn’t the problem.
Paperless performs text recognition (OCR) using the best methods available. The real problem is that OCR works poorly or not at all for certain scripts, especially complex non-Latin ones.
End of the road:
The only thing that helps here is patience. We have to wait until OCR technologies have advanced to the point where they can correctly handle “exotic languages” such as Thai. Development is constantly progressing, so it’s only a matter of time before these languages are also reliably supported.
✅ Fully Functional Languages (Latin Scripts)
These languages are correctly processed with word and sentence recognition:
- German (deu) – Good recognition, umlauts correct
- English (eng) – Best results, optimized recognition
- French (fra) – Accents and special characters correct
- Spanish (spa) – Accents and ñ characters work properly
- Italian (ita) – Good sentence recognition
- Portuguese (por) – Basic functionality
⚠️ Limited Functionality Languages
Technically installable, but with recognition problems:
Thai (tha)
- Result: Individual characters separated by spaces
- Example: “ส วั ส ด ี” instead of “สวัสดี”
- Usability: Not searchable/processable
Arabic (ara)
- Result: Right-to-left flow often broken, character isolation
- Problem: Connected script gets dissolved, context lost
- Usability: Limited to unusable
Japanese (jpn)
- Result: Kanji, Hiragana, Katakana mixed but error-prone
- Problem: Complex characters often misrecognized
- Usability: Only conditionally suitable for simple texts
Chinese (chi_sim/chi_tra)
- Result: Many character errors, similar characters confused
- Problem: Thousands of logograms overwhelm the recognition
- Usability: Unreliable for serious document processing
Korean (kor)
- Result: Hangul syllable block recognition partially functional
- Problem: Complex syllables are often missegmented
- Usability: Conditionally usable for simple texts
Russian (rus)
- Result: Cyrillic letters mostly correct
- Problem: Better results than Asian scripts, but worse than Latin
- Usability: Acceptable for standard texts
❌ Non-OCR “Languages”
- osd – Layout detection only (text orientation, writing system)
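If you want to check your own Language field against this overview, the lists can be put into a small lookup. This is a sketch only: the three sets simply mirror the findings of this article and are not an official classification, and the function name `classify` is made up for the example.

```python
# The sets mirror the overview in this article, not an official Paperless list.
FULL = {"deu", "eng", "fra", "spa", "ita", "por"}
LIMITED = {"tha", "ara", "jpn", "chi_sim", "chi_tra", "kor", "rus"}
NON_OCR = {"osd"}


def classify(language_field):
    """Group the codes of a string such as 'deu+tha+eng' by expected usability."""
    result = {"full": [], "limited": [], "non_ocr": [], "unknown": []}
    # Accept both separators: '+' (Language field) and spaces (install list).
    for code in language_field.replace("+", " ").split():
        if code in FULL:
            result["full"].append(code)
        elif code in LIMITED:
            result["limited"].append(code)
        elif code in NON_OCR:
            result["non_ocr"].append(code)
        else:
            result["unknown"].append(code)
    return result


if __name__ == "__main__":
    print(classify("deu+tha+eng"))
```

Codes that are not covered in this article end up under "unknown", so the lookup makes no claim about them.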
Conclusion
Paperless-ngx OCR works reliably only with Latin scripts. Non-Latin scripts are technically supported, but apart from Cyrillic, which is acceptable for standard texts, they deliver practically unusable results due to incorrect character segmentation and recognition errors.
Usability Ranking:
- Latin Scripts – ✅ Fully functional
- Cyrillic – ⚠️ Limited usability
- Arabic/Thai – ❌ Practically unusable
- Asian Scripts – ❌ Unreliable to unusable
OCR text recognition is currently not a solution for documents in most non-Latin scripts: the OCR results are not searchable and cannot be processed further.
The only disadvantage is that the “in-text search” function does not work with these OCR-processed documents, even though that search is actually one of the biggest advantages of the Paperless DMS.
But so what? Paperless is no worse off than other DMS products, because they have the same problem.
I am confident that better language training files will soon be released, particularly for Asian languages, enabling significantly better text recognition.
