Grab, a Singapore-based super app company, recently described on its engineering blog how it built its own language model after finding that existing large language models handle Southeast Asian languages poorly. Grab's super app offers ride-hailing, food delivery, shopping, and financial services across Singapore, Malaysia, Indonesia, the Philippines, Vietnam, Thailand, Cambodia, and Myanmar. Several of these countries, including Thailand, Cambodia, and Myanmar, issue documents in scripts other than the Latin alphabet.

For compliance tasks such as customer identity verification, Grab needs to extract information accurately from documents such as ID cards, driver's licenses, and registration certificates. The company tried optical character recognition (OCR) systems but found that they did not cope well with the wide variety of document templates.

In 2025, Grab began exploring whether large language models could solve this problem. Powerful commercial models were capable but often made errors on Southeast Asian languages and responded slowly, while open-source visual large language models were more efficient but still not accurate enough. Grab therefore decided to build its own visual large language model, which encodes a document image as vectors so that the text in it is easier to extract.

Grab chose Alibaba Cloud's Qwen2-VL 2B model as the foundation because of its moderate size, its support for Southeast Asian languages, and its ability to handle images of different resolutions dynamically. Grab then extracted Southeast Asian language content from Common Crawl and built an internal synthetic data pipeline that renders this text as images in a variety of fonts and on a variety of backgrounds. The team first fine-tuned Qwen2-VL with low-rank adaptation (LoRA), which gave good results on Indonesian documents.
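
The article does not describe the internals of Grab's pipeline, so the following is only a minimal sketch of the general idea: pair each line of crawled text with randomly varied rendering settings, so that the same ground-truth text appears under many fonts and backgrounds. The font names, background names, value ranges, and the `RenderSpec` and `sample_specs` names are all hypothetical, and the actual drawing of each image (for example with an imaging library) is left out.

```python
import random
from dataclasses import dataclass

# Hypothetical font and background names; the article does not list the real ones.
FONTS = ["NotoSansThai-Regular", "NotoSans-Regular", "NotoSerif-Regular"]
BACKGROUNDS = ["plain_white", "paper_texture", "card_gradient"]


@dataclass(frozen=True)
class RenderSpec:
    text: str            # ground-truth label the model should read back
    font: str
    font_size: int
    background: str
    rotation_deg: float  # small tilt to mimic imperfect photos


def sample_specs(lines, per_line, seed=0):
    """Pair every text line with `per_line` randomly varied rendering settings."""
    rng = random.Random(seed)  # seeded so the generated dataset is reproducible
    specs = []
    for text in lines:
        for _ in range(per_line):
            specs.append(RenderSpec(
                text=text,
                font=rng.choice(FONTS),
                font_size=rng.randint(14, 40),
                background=rng.choice(BACKGROUNDS),
                rotation_deg=round(rng.uniform(-3.0, 3.0), 2),
            ))
    return specs


# Illustrative sample lines, not taken from Grab's data.
corpus = ["Nama: Budi Santoso", "ชื่อ: สมชาย", "Họ và tên: Nguyễn Văn An"]
specs = sample_specs(corpus, per_line=4)
print(len(specs), specs[0])
```

Each `RenderSpec` would then be handed to a renderer that produces the image, with `text` kept as the training label.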

The adapted model still struggled with Thai and Vietnamese, however, so Grab ultimately moved to full-parameter fine-tuning. By training the model on the distinctive visual patterns of Southeast Asian languages, Grab developed a lightweight visual large language model that outperformed a range of OCR tools and general-purpose models. Grab said that strategic use of high-quality data lets small, specialized models be both efficient and effective.
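
To make the difference between low-rank adaptation and full-parameter fine-tuning concrete, here is a small sketch under stated assumptions. In the standard LoRA formulation, a frozen weight matrix W is updated as W + (alpha / r) * B * A, where B and A are small trainable matrices of rank r. Full fine-tuning trains every entry of W instead. The layer size, rank, and function names below are illustrative only, and the article does not say which values Grab used.

```python
def full_trainable_params(d_out, d_in):
    """Full fine-tuning updates every entry of the d_out x d_in weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """LoRA trains only B (d_out x rank) and A (rank x d_in); W stays frozen."""
    return rank * (d_out + d_in)


def lora_delta(B, A, alpha, rank):
    """Compute the low-rank weight update (alpha / rank) * B @ A with plain lists."""
    scale = alpha / rank
    cols = len(A[0])
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(cols)]
        for i in range(len(B))
    ]


# Illustrative layer size and rank (assumptions, not Grab's settings).
d_out, d_in, rank = 1024, 1024, 8
full = full_trainable_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(full, lora, lora / full)  # 1048576 16384 0.015625

# Tiny rank-1 example of the update that would be added to a frozen weight matrix.
delta = lora_delta(B=[[1.0], [2.0]], A=[[3.0, 4.0]], alpha=2.0, rank=1)
print(delta)  # [[6.0, 8.0], [12.0, 16.0]]
```

Because the update is restricted to a low rank, LoRA is cheap to train but can represent fewer changes to the model. Full-parameter fine-tuning removes that restriction at a higher training cost.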

In the future, Grab plans to continue developing more of its own models to expand its document processing technology.

Key points:

📊 Grab found that existing large language models performed poorly in recognizing Southeast Asian languages and decided to develop its own model to solve the problem.  

🔍 The self-developed visual large language model has made significant progress in processing documents such as ID cards and driver's licenses.  

🚀 Grab will continue to develop more models to meet increasingly complex document processing needs.