Zhipu has officially released and open-sourced GLM-OCR, a professional-grade OCR model. At a lightweight 0.9B parameters, it delivers performance well above its size class, ranks at the top of multiple authoritative benchmarks, and targets real-world business pain points in complex document parsing.
Core Performance: SOTA Results at a Small Size
Despite a parameter count of only 0.9B, GLM-OCR performs surprisingly well. On the authoritative document-parsing leaderboard OmniDocBench V1.5, it took first place with a score of 94.6, approaching the general-purpose large model Gemini-3-Pro. It achieves SOTA (industry-leading) results in text recognition, mathematical formula recognition, complex table parsing, and key information extraction (KIE).

Scenario Breakthrough: Directly Addressing Complex Document Pain Points
GLM-OCR has been specially optimized for six challenging business scenarios and performs stably across them (a usage sketch follows the list):
Complex Tables: Handles merged cells and multi-level headers, outputting standard HTML directly.
Structured Extraction: Intelligently recognizes cards, tickets, and documents, outputting standard JSON.
Handwriting and Code: Copes well with handwritten formulas from education and research, as well as programmers' code screenshots.
Special Markings: Performs strongly on stamp recognition and mixed multilingual layouts.
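As a concrete illustration of the structured-extraction workflow, the sketch below sends a document image to a GLM-OCR endpoint through an OpenAI-compatible API (the kind a local vLLM server exposes). The endpoint URL, model identifier, and prompt wording are illustrative assumptions, not part of the official release.

```python
# Hypothetical call to a GLM-OCR endpoint behind an OpenAI-compatible API
# (e.g., a local vLLM server). Endpoint, model name, and prompt are assumed.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode the document image for inline transmission.
with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="glm-ocr",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Extract the key fields of this invoice as JSON."},
        ],
    }],
)
print(response.choices[0].message.content)
```

Swapping the text prompt for a request like "Convert the table on this page to HTML" would exercise the complex-table path instead.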

Extreme Efficiency: Faster Inference, Lower Cost
In terms of efficiency and cost control, GLM-OCR demonstrates strong commercial competitiveness:
Ultra-fast Inference: PDF processing throughput reaches 1.86 pages/second, significantly better than comparable models; mainstream deployment paths such as vLLM and Ollama are supported (a deployment sketch follows this list).
Outstanding Cost-effectiveness: The API is priced as low as 0.2 yuan per million tokens, roughly one-tenth the cost of traditional OCR solutions; processing 1,000 A4 scanned pages costs about 0.5 yuan (which implies roughly 2.5 million tokens, i.e., about 2,500 tokens per page).
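For local deployment, here is a minimal offline-inference sketch using vLLM's Python API. It assumes the released weights load as a standard vLLM multimodal model; the model ID and the image-placeholder prompt format are assumptions to be replaced with the values from the official model card.

```python
# Minimal vLLM offline-inference sketch for an image-to-text OCR model.
# The model ID and the "<image>" prompt placeholder are assumptions; check
# the official model card for the exact repository name and prompt template.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="zai-org/GLM-OCR", trust_remote_code=True)  # assumed repo name

image = Image.open("page_001.png")
prompt = "<image>\nConvert this scanned page to Markdown."

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```

Served online instead (`vllm serve <model>`), the same weights would sit behind the OpenAI-compatible API used in the extraction example above.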
Technical Insights: Multimodal Architecture and Reinforcement Learning
GLM-OCR inherits its architecture from the GLM-V series and integrates the self-developed CogViT visual encoder. By introducing a **Multi-Token Prediction (MTP) loss** and full-task reinforcement learning, the model's generalization on complex layouts is significantly improved. Its four-times (4×) down-sampling strategy and SwiGLU mechanism ensure that visual information is fused efficiently into the language decoder.
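The release text does not include the projector code, but combining token down-sampling with SwiGLU is a pattern seen in several open vision-language models. The sketch below shows one plausible reading: merging each 2×2 patch neighborhood (a 4× token reduction) and projecting the result into the decoder's hidden size with a SwiGLU MLP. All class names and dimensions are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class SwiGLUProjector(nn.Module):
    """Illustrative sketch: merge every 2x2 patch grid (4x token reduction),
    then project into the language model's hidden size with a SwiGLU MLP.
    Dimensions and structure are assumptions, not GLM-OCR's actual code."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        merged = vision_dim * 4            # 4 neighboring tokens concatenated
        self.gate = nn.Linear(merged, llm_dim)
        self.up = nn.Linear(merged, llm_dim)
        self.down = nn.Linear(llm_dim, llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, H*W, vision_dim), with a square grid and H, W even
        b, n, d = x.shape
        h = w = int(n ** 0.5)
        x = x.view(b, h // 2, 2, w // 2, 2, d)             # carve out 2x2 blocks
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 2) * (w // 2), 4 * d)
        # SwiGLU: down(silu(gate(x)) * up(x))
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))

tokens = torch.randn(1, 1024, 1152)        # e.g. a 32x32 patch grid from the encoder
proj = SwiGLUProjector(vision_dim=1152, llm_dim=2048)
print(proj(tokens).shape)                  # torch.Size([1, 256, 2048]): 4x fewer tokens
```

The appeal of this design is that the decoder sees four times fewer visual tokens, which directly cuts prefill cost on dense documents.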
GLM-OCR has now been open-sourced.
