Jina AI Introduces Reader-LM: Small Language Models for HTML to Markdown Conversion

Tools & Engineering

The Engineer

13 Sept 2024 · 3 min read

Jina AI's new Reader-LM refines HTML-to-Markdown conversion by integrating language models, enhancing accuracy and readability while maintaining a lightweight footprint for better performance and efficiency.

In April 2024, Jina AI released Jina Reader, a simple API that converts any URL into LLM-friendly markdown with just the prefix r.jina.ai. Despite its straightforward user interface, the underlying process is quite sophisticated. The core "reading" part involves using a headless Chrome browser to fetch the webpage source, Mozilla’s Readability package to extract the main content, and regex along with the Turndown library to convert the cleaned HTML into markdown.

However, in the initial weeks after its release, Jina AI received significant feedback on the quality of the content. Users reported issues ranging from too much detail to insufficient information, as well as problems with Readability removing incorrect elements and Turndown struggling with certain HTML conversions. To address these issues, the team patched the pipeline with new regex patterns and heuristics.

Despite these improvements, Jina AI began to explore a more sustainable solution: using small language models (SLMs) to handle the conversion process end-to-end. This approach aims to replace the existing pipeline of Readability, Turndown, and regex heuristics with a single, efficient model.

Why Small Language Models?

At first glance, using large language models (LLMs) for data cleaning might seem excessive due to their high computational costs and slower speeds. However, SLMs-models with fewer than 1 billion parameters-can run efficiently on the edge, making them more cost-effective and faster. The key question is whether these smaller models can handle the task of converting HTML to markdown effectively.

Introducing Reader-LM

Jina AI has introduced two new SLMs: Reader-LM-0.5B and Reader-LM-1.5B. These models are designed to convert raw, noisy HTML from the open web into clean, well-structured markdown. Here’s a breakdown of their key features:

Architecture:
- Both models are based on transformer architectures.
- They are trained on a diverse dataset of web pages to ensure they can handle various types of content.
Performance:
- Reader-LM-0.5B: This model is optimized for lightweight applications and edge devices. It provides good performance with minimal resource consumption.
- Reader-LM-1.5B: This model offers enhanced accuracy and handling of complex HTML structures, making it suitable for more demanding use cases.
Training:
- The models are trained using a combination of supervised learning (labeled data) and unsupervised learning (large amounts of raw web data).
- They are fine-tuned to improve their ability to extract main content and convert it into markdown.

Implementation Details

Data Preprocessing:
- The training data is preprocessed to remove noise and irrelevant elements, similar to the initial steps in Jina Reader.
- This ensures that the models learn to focus on the core content of web pages.
Inference:
- During inference, the models take raw HTML as input and output clean markdown.
- They are designed to handle a wide range of HTML structures, including nested elements and complex layouts.

Benchmarks

Initial benchmarks show promising results:

Speed: Both Reader-LM-0.5B and Reader-LM-1.5B outperform the traditional pipeline in terms of speed, especially on edge devices.
Accuracy: The models achieve high accuracy in content extraction and markdown conversion, with Reader-LM-1.5B showing a slight edge in handling complex HTML structures.

Open Source

Jina AI is committed to open source and open science. Both Reader-LM-0.5B and Reader-LM-1.5B are available on Hugging Face:

Future Work

Jina AI plans to continue refining these models and exploring new applications. The team is also open to community contributions and feedback to improve the performance and usability of Reader-LM.

In conclusion, Reader-LM represents