
Share
Jina AI's new Reader-LM refines HTML-to-Markdown conversion by integrating language models, enhancing accuracy and readability while maintaining a lightweight footprint for better performance and efficiency.
In April 2024, Jina AI released Jina Reader, a simple API that converts any URL into LLM-friendly markdown with just the prefix r.jina.ai. Despite its straightforward user interface, the underlying process is quite sophisticated. The core "reading" part involves using a headless Chrome browser to fetch the webpage source, Mozilla’s Readability package to extract the main content, and regex along with the Turndown library to convert the cleaned HTML into markdown.
However, in the initial weeks after its release, Jina AI received significant feedback on the quality of the content. Users reported issues ranging from too much detail to insufficient information, as well as problems with Readability removing incorrect elements and Turndown struggling with certain HTML conversions. To address these issues, the team patched the pipeline with new regex patterns and heuristics.
Despite these improvements, Jina AI began to explore a more sustainable solution: using small language models (SLMs) to handle the conversion process end-to-end. This approach aims to replace the existing pipeline of Readability, Turndown, and regex heuristics with a single, efficient model.
At first glance, using large language models (LLMs) for data cleaning might seem excessive due to their high computational costs and slower speeds. However, SLMs-models with fewer than 1 billion parameters-can run efficiently on the edge, making them more cost-effective and faster. The key question is whether these smaller models can handle the task of converting HTML to markdown effectively.
Jina AI has introduced two new SLMs: Reader-LM-0.5B and Reader-LM-1.5B. These models are designed to convert raw, noisy HTML from the open web into clean, well-structured markdown. Here’s a breakdown of their key features:
Architecture:
Performance:
Training:

Data Preprocessing:
Initial benchmarks show promising results:
Jina AI is committed to open source and open science. Both Reader-LM-0.5B and Reader-LM-1.5B are available on Hugging Face:
Jina AI plans to continue refining these models and exploring new applications. The team is also open to community contributions and feedback to improve the performance and usability of Reader-LM.
In conclusion, Reader-LM represents
Tags
Original Sources
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
13 September 2024
133 articles
Related Articles
Related Articles
More Stories