Tencent recently released HunyuanOCR, a new open-source OCR model with only 1B parameters. The model is built on Tencent's proprietary Hunyuan multimodal architecture and achieves state-of-the-art (SOTA) results on multiple industry-standard OCR tasks. According to Tencent, HunyuanOCR's "end-to-end" design philosophy lets the model reach its best results quickly, in a single forward inference pass.


HunyuanOCR consists of three core components: a native-resolution visual encoder, an adaptive visual adapter, and a lightweight Hunyuan language model. Unlike other OCR models on the market, HunyuanOCR is trained and run end to end, and it gains strong reasoning capabilities from large-scale application-oriented data and online reinforcement learning.
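
To make the idea of a native-resolution encoder followed by an adapter more concrete, here is a minimal sketch of how the number of visual tokens could scale with image size. The patch size of 16 and the 2x2 merge in the adapter are illustrative assumptions, not HunyuanOCR's published values, and `visual_token_count` is a hypothetical helper rather than part of the model's code.

```python
import math


def visual_token_count(width: int, height: int,
                       patch: int = 16, merge: int = 2) -> tuple[int, int]:
    """Return (encoder_patches, tokens_after_adapter) for one image.

    The image keeps its own aspect ratio instead of being resized to a
    fixed square, so the patch grid follows the image dimensions.
    `patch` and `merge` are assumed values chosen for illustration only.
    """
    cols = math.ceil(width / patch)   # patches per row
    rows = math.ceil(height / patch)  # patches per column
    patches = cols * rows
    # Hypothetical adapter: collapse each merge x merge block into one token
    # before the tokens are handed to the language model.
    tokens = math.ceil(cols / merge) * math.ceil(rows / merge)
    return patches, tokens


print(visual_token_count(1024, 512))  # (2048, 512)
```

Under these assumptions, a wide 1024x512 page yields a 64x32 patch grid, and the adapter reduces it to a quarter as many tokens for the language model.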

On a complex document parsing benchmark, HunyuanOCR scored 94.1, surpassing several leading models, including Google's Gemini3-pro. Its text detection and recognition capabilities are also strong, covering scenarios such as documents, artistic fonts, street scenes, handwriting, advertisements, and receipts, where it compares favorably with other open-source and commercial OCR models. On OCRBench, the model reaches a total score of 860 points, making it the top performer among models with fewer than 3B parameters.

HunyuanOCR also supports translation for 14 languages and performs well on translation tasks. The model can digitize complex documents: it organizes the text in scanned images according to reading order, represents formulas in LaTeX, and represents complex tables in HTML.
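
Because tables come back as HTML, downstream code usually needs to turn that markup into rows and cells. The following is a minimal sketch using only Python's standard library. The `sample` string is a made-up example of what a parsed table might look like, not real model output, and `TableRows` and `table_rows` are hypothetical helper names.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html: str) -> list[list[str]]:
    """Parse an HTML table string into a list of rows of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical model output for a two-row table.
sample = (
    "<table><tr><th>Item</th><th>Price</th></tr>"
    "<tr><td>Tea</td><td>$3.50</td></tr></table>"
)
print(table_rows(sample))  # [['Item', 'Price'], ['Tea', '$3.50']]
```

This sketch ignores `rowspan`, `colspan`, and nested tables, so complex tables would need extra handling before being loaded into a spreadsheet or data frame.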

In terms of applications, HunyuanOCR is suited to tasks such as multilingual document parsing, invoice field extraction, video subtitle recognition, and photo translation, which gives it broad practical potential.

GitHub: https://github.com/Tencent-Hunyuan/HunyuanOCR

Key Points:  

🔍 The 1B-parameter HunyuanOCR model achieves multiple SOTA results through its end-to-end design.  

📄 The model supports complex document parsing, text detection, and recognition, covering various application scenarios.  

🌐 HunyuanOCR also offers translation for 14 languages, making it well suited to photo translation.