
Grab has created a specialized vision language model to tackle the document processing challenges unique to Southeast Asia's diverse linguistic landscape and complex document formats.
For digital services, accurately extracting information from user-submitted documents such as IDs, driver’s licenses, and registration certificates is a critical first step for processes like electronic know-your-customer (eKYC). The task is particularly hard in Southeast Asia (SEA) because of the region’s diverse languages and document formats. Traditional Optical Character Recognition (OCR) systems often struggle with this variety, while proprietary Large Language Models (LLMs) can be error-prone, hallucinate, and respond with high latency. Open-source Vision LLMs are more efficient but lack the accuracy needed for production use.
To address these challenges, Grab developed a lightweight, specialized Vision LLM tailored to SEA’s requirements. This article covers the technical details of that work and why it matters to practitioners in the field.
You’re likely familiar with text-based LLMs that process prompts and generate responses. A Vision LLM extends this capability by enabling the model to understand images. The basic architecture comprises three key components:
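Vision LLMs are commonly described as chaining an image encoder, a vision-language projector, and a language model. The toy sketch below assumes that layout and only shows how data could flow between the parts. Every name, dimension, and weight in it is hypothetical; it is not Grab's implementation and not the Qwen2-VL architecture.

```python
import random

# All sizes below are hypothetical and chosen only to keep the sketch tiny.
PATCH = 2        # patch side length in pixels
VISION_DIM = 4   # width of the image encoder's output vectors
TEXT_DIM = 6     # width of the language model's embedding space

rng = random.Random(0)


def random_matrix(rows, cols):
    """Return a rows x cols matrix of fixed pseudo-random weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


ENCODER_W = random_matrix(VISION_DIM, PATCH * PATCH)
PROJECTOR_W = random_matrix(TEXT_DIM, VISION_DIM)


def image_encoder(image):
    """Split a 2D grayscale image into PATCH x PATCH tiles and embed each tile."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height - PATCH + 1, PATCH):
        for left in range(0, width - PATCH + 1, PATCH):
            flat = [image[top + dy][left + dx]
                    for dy in range(PATCH) for dx in range(PATCH)]
            tokens.append(matvec(ENCODER_W, flat))
    return tokens


def projector(vision_tokens):
    """Map image-encoder vectors into the language model's embedding space."""
    return [matvec(PROJECTOR_W, tok) for tok in vision_tokens]


def embed_text(prompt):
    """Stand-in for the language model's token embedding: one vector per word."""
    vectors = []
    for word in prompt.split():
        word_rng = random.Random(word)  # deterministic vector per word
        vectors.append([word_rng.uniform(-1, 1) for _ in range(TEXT_DIM)])
    return vectors


def build_lm_input(image, prompt):
    """Concatenate projected image tokens with text tokens for the language model."""
    return projector(image_encoder(image)) + embed_text(prompt)


image = [[0.1 * (r + c) for c in range(4)] for r in range(4)]
sequence = build_lm_input(image, "extract the name")
print(len(sequence), len(sequence[0]))
```

A 4x4 image yields four image tokens, and the three-word prompt yields three text tokens, so the language model would receive a single sequence of seven vectors, all of width `TEXT_DIM`. In a real model the language model would then generate the answer from that combined sequence.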
We evaluated several open-source models capable of performing OCR and Key Information Extraction (KIE). After thorough testing, we selected Qwen2-VL 2B as our base multimodal LLM. Here’s why:
We benchmarked Qwen2-VL against miniCPM using Grab’s dataset. The initial results showed low accuracy, primarily due to limited coverage of SEA languages. This motivated us to fine-tune the model to enhance its OCR and KIE capabilities.
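The article does not say how accuracy was scored. One common way to compare models on KIE is field-level exact match: the share of expected fields whose extracted value equals the ground truth. The sketch below assumes that metric; the field names, sample records, and the labels `model_a` and `model_b` are made up for illustration and are not Grab's data or results.

```python
def normalize(value):
    """Collapse whitespace and case so trivial formatting differences are not errors."""
    return " ".join(str(value).split()).casefold()


def field_accuracy(predictions, references):
    """Fraction of reference fields whose predicted value matches after normalization.

    A field missing from a prediction counts as wrong.
    """
    total = 0
    correct = 0
    for pred, ref in zip(predictions, references):
        for field, expected in ref.items():
            total += 1
            if normalize(pred.get(field, "")) == normalize(expected):
                correct += 1
    return correct / total if total else 0.0


# Hypothetical ground truth for two documents.
references = [
    {"name": "Nguyen Van A", "id_number": "012345678"},
    {"name": "Siti Aminah", "id_number": "987654321"},
]

# Hypothetical outputs from two candidate models on the same documents.
model_outputs = {
    "model_a": [
        {"name": "nguyen  van a", "id_number": "012345678"},
        {"name": "Siti Aminah", "id_number": "987654821"},  # one wrong digit
    ],
    "model_b": [
        {"name": "Nguyen Van", "id_number": "012345678"},   # truncated name
        {"name": "Siti Aminah"},                            # id_number missing
    ],
}

scores = {name: field_accuracy(preds, references)
          for name, preds in model_outputs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Exact match is strict by design: for eKYC, one wrong digit in an ID number makes the whole field unusable, so partial credit would overstate quality. A character-level metric could be reported alongside it to show how close the misses are.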

After fine-tuning, we observed significant improvements in accuracy:
To integrate the fine-tuned Vision LLM into our production environment, we:
While our current model has shown promising results, there are still areas for improvement:
Developing a specialized Vision LLM for document processing in Southeast Asia has been a significant step forward for Grab. By addressing the limitations of traditional OCR systems and open-source models, we have created a more accurate and efficient solution tailored to our region’s unique needs. This project not only enhances user experience but also sets a precedent for future AI applications in diverse linguistic environments.
Original Sources
↗ https://engineering.grab.com/custom-vision-llm-at-grab?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
5 November 2025