Benchmarking Open-Source OCR Models: Qwen 2.5 VL Leads the Pack

Models & Research

The Engineer

9 Apr 2025 · 3 min read

As open-source OCR models surge, Qwen 2.5 VL emerges as the standout performer in recent benchmarks, outpacing both closed-source rivals and traditional OCR services.

For several months, we've been evaluating how well vision models handle Optical Character Recognition (OCR). Our initial benchmark focused on closed-source models like GPT, Gemini, and Claude, comparing them to traditional OCR providers such as AWS, Azure, and GCP.

However, this week has been particularly exciting for open-source Large Language Models (LLMs) with the release of several new models. Here's a quick rundown:

Qwen 2.5 VL (32b variant)
Gemma-3 (27b)
DeepSeek-v3-0324
Llama 4 (Maverick & Scout)
Mistral-OCR (released a couple of weeks ago)

This new benchmark focuses specifically on these open-source vision models. The evaluation includes:

Qwen 2.5 VL 72b
Qwen 2.5 VL 32b
Gemma-3 27b
Mistral-OCR
Llama 3.2 90b
Llama 3.2 11b
Llama 4 Maverick
Llama 4 Scout

Note that DeepSeek-v3 and Llama 3.3 do not yet have native vision support, so they could not be included. Additionally, while there are many other open-source OCR providers (Paddle OCR, Docling, Unstructured, etc.), this evaluation focuses solely on open-source Vision-Language Models (VLMs). A follow-up benchmark will revisit traditional OCR models.

As with our main OCR Benchmark, the dataset and methodologies here are entirely open-source. You can review (and reproduce) the evaluation yourself via our benchmark repository on Github and view all raw data in our Hugging Face repository.

Results at a Glance

After evaluating 1,000 documents for JSON extraction accuracy, cost, and latency, here are the major takeaways among the open-source models:

Qwen 2.5 VL (72B and 32B): These models are by far the most impressive, achieving around 75% accuracy, which is equivalent to GPT-4o’s performance.
- Both Qwen models outperformed mistral-ocr (72.2%), which is specifically trained for OCR.
Gemma-3 (27B): This model came in at 42.9%, which is surprising given its architecture's similarity to Gemini 2.0, which topped the accuracy chart.

Detailed Findings

Qwen 2.5 VL

Accuracy: Both the 72B and 32B variants achieved around 75% accuracy.
Performance: These models showed robust performance across various document types, handling complex layouts and fonts well.
Latency: The 32B variant had slightly lower latency compared to the 72B variant, making it a good choice for real-time applications.

Mistral-OCR

Accuracy: Achieved 72.2% accuracy.
Performance: Despite being specifically trained for OCR, it was outperformed by Qwen 2.5 VL.
Latency: Comparable to other models in the benchmark.

Gemma-3 (27B)

Accuracy: Only achieved 42.9% accuracy.
Performance: We spent extra time verifying its performance due to the low score. The errors were typical VLM challenges, including hallucinations, omitted values, and swapped words.
- Example: An output that appeared correctly formatted at first glance but revealed numerous inaccuracies and missing data upon closer inspection.

Methodology

The evaluation was conducted on a dataset of 1,000 documents, focusing on JSON extraction accuracy, cost, and latency. The models were tested on a variety of document types to ensure comprehensive coverage. Each model's performance was measured against a ground truth dataset to determine accuracy.

Conclusion

Qwen 2.5