Grab Develops Specialized Vision LLM for Document Processing in Southeast Asia

The Engineer

5 Nov 2025

Grab has created a specialized vision language model to tackle the document processing challenges unique to Southeast Asia's diverse linguistic landscape and complex document formats.

Accurately extracting information from user-submitted documents such as IDs, driver’s licenses, and registration certificates is a critical first step for processes like electronic know-your-customer (eKYC). The task is particularly challenging in Southeast Asia (SEA) because of the region’s diverse languages and document formats. Traditional Optical Character Recognition (OCR) systems often struggle with this variety, while proprietary Large Language Models (LLMs) can be error-prone, produce hallucinations, and have high latency. Open-source Vision LLMs are more efficient but lack the accuracy needed for production use.

To address these challenges, Grab set out to develop a lightweight, specialized Vision LLM tailored to SEA’s requirements. This article covers the technical details of that work and why it matters to practitioners in the field.

Technical Overview

What is a Vision LLM?

You’re likely familiar with text-based LLMs that process prompts and generate responses. A Vision LLM extends this capability by enabling the model to understand images. The basic architecture comprises three key components:

Image Encoder: Converts an image into a numerical (vectorized) format.
Vision-Language Projector: Translates the image’s vector representation into a form understandable by the language model.
Language Model: Processes the combined image and text input to generate a final text output.
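
The three components above can be illustrated with a toy pipeline. This is a minimal sketch for intuition only, not Grab's or Qwen2-VL's implementation: the patch size, embedding widths, random projector weights, and function names are all hypothetical, and a real model learns these layers and feeds the final sequence to a transformer.

```python
import random

PATCH = 2        # toy patch size in pixels (real encoders use larger patches)
VISION_DIM = 4   # width of each patch vector from the toy image encoder
TEXT_DIM = 3     # embedding width expected by the toy language model


def image_encoder(image):
    """Split a 2D grayscale image into PATCH x PATCH tiles, one vector per tile."""
    vectors = []
    for r in range(0, len(image), PATCH):
        for c in range(0, len(image[0]), PATCH):
            tile = [image[r + i][c + j] for i in range(PATCH) for j in range(PATCH)]
            vectors.append([v / 255.0 for v in tile])  # PATCH * PATCH == VISION_DIM
    return vectors


def projector(vectors, weights):
    """Linear map from VISION_DIM to TEXT_DIM (stand-in for the learned projector)."""
    return [
        [sum(v[k] * weights[k][d] for k in range(VISION_DIM)) for d in range(TEXT_DIM)]
        for v in vectors
    ]


def build_lm_input(image_tokens, text_tokens):
    """The language model consumes image tokens and text tokens as one sequence."""
    return image_tokens + text_tokens


rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(TEXT_DIM)] for _ in range(VISION_DIM)]
image = [[0, 64, 128, 255]] * 4                      # 4x4 toy image -> 4 patches
text_tokens = [[0.0] * TEXT_DIM for _ in range(5)]   # 5 placeholder prompt embeddings

sequence = build_lm_input(projector(image_encoder(image), weights), text_tokens)
print(len(sequence), len(sequence[0]))  # prints: 9 3
```

The point of the projector is visible in the shapes: patch vectors of width 4 become width-3 vectors, so they can sit in the same sequence as the text embeddings.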

Choosing the Base Vision LLM

We evaluated several open-source models capable of performing OCR and Key Information Extraction (KIE). After thorough testing, we selected Qwen2-VL 2B as our base multimodal LLM. Here’s why:

Efficient Size: Small enough for full fine-tuning on GPUs with limited VRAM.
SEA Language Support: Efficient tokenizer for languages like Thai and Vietnamese, indicating strong native vocabulary coverage.
Dynamic Resolution: Unlike models requiring fixed-size inputs, Qwen2-VL can process images in their native resolution, crucial for preventing text distortion during OCR tasks.
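
To make the dynamic-resolution point concrete, the sketch below estimates how many image tokens a document photo produces at its native size, and how much a fixed square input would stretch it. It assumes the publicly described Qwen2-VL scheme (14-pixel patches, with 2x2 neighbouring patch tokens merged into one) and simplifies the rest: it ignores the processor's minimum and maximum pixel limits and the special tokens that mark the start and end of the image. The function names and the 448-pixel square are illustrative assumptions.

```python
PATCH_SIZE = 14            # ViT patch size reported for Qwen2-VL
MERGE = 2                  # 2x2 neighbouring patch tokens are merged into one
UNIT = PATCH_SIZE * MERGE  # 28 pixels per merged token along each axis


def round_to_unit(pixels):
    """Round a side length to the nearest multiple of UNIT (at least one unit)."""
    return max(UNIT, round(pixels / UNIT) * UNIT)


def visual_tokens(width, height):
    """Approximate number of image tokens passed to the language model."""
    w, h = round_to_unit(width), round_to_unit(height)
    return (w // UNIT) * (h // UNIT)


def fixed_size_distortion(width, height, side=448):
    """Ratio of horizontal to vertical scaling when forcing a square input."""
    return (side / width) / (side / height)


print(visual_tokens(1012, 638))
print(round(fixed_size_distortion(1012, 638), 3))
```

For a card-shaped 1012x638 image, a square input would scale the horizontal axis by only about 0.63 of the vertical scale factor, which compresses characters sideways; native-resolution processing keeps the aspect ratio and spends tokens in proportion to image area instead.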

Initial Benchmarks and Fine-Tuning

We benchmarked Qwen2-VL against MiniCPM using Grab’s dataset. The initial results showed low accuracy, primarily due to limited coverage of SEA languages. This motivated us to fine-tune the model to improve its OCR and KIE capabilities.

Data Collection: We gathered a diverse dataset of ID cards, driver’s licenses, and registration certificates from various Southeast Asian countries.
Annotation: We manually annotated this dataset to ensure high-quality training data.
Fine-Tuning Process:
- Used transfer learning techniques to leverage the pre-trained Qwen2-VL model.
- Fine-tuned on our custom dataset using a mix of supervised and semi-supervised learning methods.
- Employed techniques like data augmentation and regularization to improve generalization.
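
The article does not say which augmentations were used. As one hedged illustration of the idea, the sketch below applies photometric augmentation (a random brightness shift plus per-pixel noise) to a grayscale image stored as nested lists. The function name, parameters, and value ranges are hypothetical. Photometric changes like these leave the text on the document unchanged, so the original annotation still applies to every variant.

```python
import random


def augment(image, rng, max_brightness=30, noise_std=8.0):
    """Return a copy of a grayscale image with a brightness shift and pixel noise."""
    shift = rng.randint(-max_brightness, max_brightness)  # one shift per image
    out = []
    for row in image:
        new_row = []
        for px in row:
            value = px + shift + rng.gauss(0.0, noise_std)
            new_row.append(min(255, max(0, int(round(value)))))  # clamp to 0..255
        out.append(new_row)
    return out


rng = random.Random(42)                       # seeded for reproducibility
original = [[200] * 6 for _ in range(4)]      # 4x6 toy image
variants = [augment(original, rng) for _ in range(3)]
```

Each call yields a differently lit, slightly noisy copy of the same document, which is a cheap way to simulate varied phone cameras and lighting without collecting new labels.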

Performance Improvements

After fine-tuning, we observed significant improvements in accuracy:

OCR Accuracy: Increased from an initial 65% to over 90% on our dataset.
KIE Accuracy: Improved from 70% to around 85%, particularly for extracting key information like names, addresses, and dates.
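
The article does not define how these accuracy figures were computed. One common convention, shown here purely as an assumption, is to score OCR as one minus the character error rate (edit distance divided by reference length) and to score KIE as the share of fields extracted exactly. The function names and sample values below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_accuracy(predicted, reference):
    """One minus the character error rate, floored at 0."""
    if not reference:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - edit_distance(predicted, reference) / len(reference))


def field_accuracy(predicted, reference):
    """Share of reference fields whose predicted value matches exactly."""
    if not reference:
        return 1.0
    hits = sum(
        1 for key, value in reference.items()
        if predicted.get(key, "").strip() == value.strip()
    )
    return hits / len(reference)


# Hypothetical sample: one wrong character in the name, one wrong date field.
reference = {"name": "NGUYEN VAN AN", "dob": "01/02/1990"}
predicted = {"name": "NGUYEN VAN AN", "dob": "01/02/1996"}
print(char_accuracy("NGUYEN VAN AM", "NGUYEN VAN AN"))
print(field_accuracy(predicted, reference))
```

The two metrics answer different questions: a single misread digit barely moves character accuracy but makes the whole date field wrong, which is why KIE scores can trail OCR scores on the same documents.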

Implementation Details

To integrate the fine-tuned Vision LLM into our production environment, we:

Containerized the Model: Deployed it in a Docker container to ensure consistency across different environments.
API Gateway: Set up an API gateway to handle requests from various services.
Latency Optimization: Implemented caching and batch processing techniques to reduce latency.
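
The article does not describe what is cached or how requests are batched. The sketch below shows one plausible arrangement under stated assumptions: results are cached by a SHA-256 hash of the image bytes, and only uncached, deduplicated images are sent to the model in fixed-size batches. The class name, `infer_batch` callable, and `fake_infer_batch` stand-in are hypothetical.

```python
import hashlib


class CachedBatchExtractor:
    """Wraps a batch inference function with a content-hash result cache."""

    def __init__(self, infer_batch, batch_size=4):
        self.infer_batch = infer_batch  # callable: list of bytes -> list of results
        self.batch_size = batch_size
        self.cache = {}
        self.model_calls = 0            # counts batches actually sent to the model

    def extract(self, images):
        keys = [hashlib.sha256(img).hexdigest() for img in images]
        # Collect uncached images once each, preserving first-seen order.
        pending = {}
        for key, img in zip(keys, images):
            if key not in self.cache and key not in pending:
                pending[key] = img
        items = list(pending.items())
        for start in range(0, len(items), self.batch_size):
            chunk = items[start:start + self.batch_size]
            results = self.infer_batch([img for _, img in chunk])
            self.model_calls += 1
            for (key, _), result in zip(chunk, results):
                self.cache[key] = result
        # Answer every request, duplicates included, from the cache.
        return [self.cache[key] for key in keys]


def fake_infer_batch(batch):
    """Stand-in for the model: returns the byte length of each image."""
    return [{"size": len(img)} for img in batch]


extractor = CachedBatchExtractor(fake_infer_batch, batch_size=2)
first = extractor.extract([b"id-front", b"id-back", b"id-front", b"licence"])
second = extractor.extract([b"id-back", b"licence"])
```

In this example the first call sends three unique images in two batches, and the second call is served entirely from the cache. A production version would also need cache eviction and a policy for how long extracted personal data may be retained.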

Future Work

While our current model has shown promising results, there are still areas for improvement:

Multilingual Support: Expanding the tokenizer to cover more SEA languages.
Real-Time Processing: Enhancing the model’s performance for real-time applications.
Model Size Reduction: Further optimizing the model size without compromising accuracy.

Conclusion

Fine-tuning a specialized Vision LLM for document processing in Southeast Asia addressed the limitations we saw in traditional OCR systems and off-the-shelf open-source models. On our dataset, OCR accuracy rose from 65% to over 90% and KIE accuracy from 70% to around 85%, giving us a more accurate and efficient solution tailored to the region’s needs. The project improves the user experience today and offers a template for future AI applications in diverse linguistic environments.