
Choosing the right format for tabular data fed into large language models can save tokens and improve accuracy, both of which matter for efficient, reliable AI systems.
When it comes to building reliable AI systems, especially those involving large language models (LLMs), one often overlooked aspect is the format used to pass tabular data. Whether you’re using markdown tables, CSV, JSON, or something else entirely, your choice can significantly impact both the accuracy of your system and the cost associated with token usage.
If the data isn’t formatted in a way that’s easy for an LLM to consume, you might be unnecessarily reducing the accuracy of your entire pipeline. For instance, if the model has trouble parsing the structure of the data, it may misinterpret or miss key information.
Different formats can vary widely in terms of token usage. Some formats use several times more tokens than others to represent the same data. Since many LLM providers charge based on token consumption, your choice of format can directly affect your inference costs.
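As a rough illustration of that cost effect, the sketch below multiplies prompt tokens by a per-token price. Every number in the usage lines is a made-up placeholder, not a measurement or a real provider's price, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(
    tokens_per_prompt: int,
    num_prompts: int,
    price_per_million_tokens: float,
) -> float:
    """Input-token cost of sending `num_prompts` prompts of a given size."""
    if tokens_per_prompt < 0 or num_prompts < 0 or price_per_million_tokens < 0:
        raise ValueError("all arguments must be non-negative")
    return tokens_per_prompt * num_prompts * price_per_million_tokens / 1_000_000


# Placeholder values for illustration only (not real prices or token counts).
compact = estimate_cost(tokens_per_prompt=50_000, num_prompts=1_000, price_per_million_tokens=1.0)
verbose = estimate_cost(tokens_per_prompt=150_000, num_prompts=1_000, price_per_million_tokens=1.0)
```

At a fixed price and number of prompts, a format that needs three times the tokens costs three times as much, so the ratio between formats matters more than the absolute figures used here.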
To understand which format works best, we ran a controlled experiment using GPT-4.1-nano, the smallest, lowest-cost model in OpenAI's GPT-4.1 family. Each question asked the model to look up a single field of a single record, for example:
Q: "How many years of experience does Grace X413 have? (Return just the number, e.g., '12'.)"
A: "15"
Q: "What is Alice W204's salary? (Return just the number, e.g., '85200'.)"
A: "131370"
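Questions like these lend themselves to exact-match scoring, because each answer is a single number. A minimal sketch of such a scorer follows. The names `normalize_answer` and `accuracy` and the `ask_model` callable are hypothetical, chosen for illustration, and this is not the harness used in the experiment.

```python
import re
from typing import Callable, Iterable, Tuple


def normalize_answer(text: str) -> str:
    """Strip surrounding whitespace and quotes, then drop commas and spaces."""
    cleaned = text.strip().strip("'\"")
    return re.sub(r"[,\s]", "", cleaned)


def accuracy(
    qa_pairs: Iterable[Tuple[str, str]],
    ask_model: Callable[[str], str],
) -> float:
    """Fraction of questions whose normalized answer matches the expected one."""
    pairs = list(qa_pairs)
    if not pairs:
        return 0.0
    correct = sum(
        normalize_answer(ask_model(question)) == normalize_answer(expected)
        for question, expected in pairs
    )
    return correct / len(pairs)
```

Normalizing before comparison keeps an answer such as `'131,370'` from being marked wrong purely because of quoting or a thousands separator.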
We passed a relatively large number of records to the LLM to test its limits. In practice, with large structured datasets, you might want to chunk the data and/or query it to extract only the most relevant information before passing it to the model.
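One way to apply that advice is to filter or chunk records before building the prompt. The sketch below assumes each record is a dict with an `id` key and that the question mentions the record's ID, as the example questions in this article do. `select_relevant` and `chunk` are hypothetical helpers.

```python
from typing import Dict, Iterator, List

Record = Dict[str, str]


def select_relevant(records: List[Record], question: str) -> List[Record]:
    """Keep only records whose ID appears in the question text."""
    return [r for r in records if r["id"] in question]


def chunk(records: List[Record], size: int) -> Iterator[List[Record]]:
    """Yield consecutive slices of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield records[start:start + size]
```

Substring matching on IDs is deliberately simple and can over-select when one ID is a prefix of another, so a real pipeline would likely need a stricter lookup.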

For formats like CSV, HTML tables, and markdown tables that include headers, repeating those headers periodically (e.g., every 100 records) can help the model keep track of which column each value belongs to. However, for simplicity, we didn’t do this in our tests.
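Header repetition of this kind could be sketched as follows for CSV. The function name `csv_with_repeated_header` and its `every` parameter are assumptions made for illustration, and, as noted, nothing like this was used in the tests.

```python
import csv
import io
from typing import List, Sequence


def csv_with_repeated_header(
    header: Sequence[str],
    rows: List[Sequence[str]],
    every: int = 100,
) -> str:
    """Serialize rows as CSV, re-emitting the header before every `every` rows."""
    if every < 1:
        raise ValueError("every must be at least 1")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for index, row in enumerate(rows):
        # Emit the header at row 0, row `every`, row 2 * `every`, and so on.
        if index % every == 0:
            writer.writerow(header)
        writer.writerow(row)
    return buffer.getvalue()
```

Each repeated header adds tokens, so the interval trades a small cost increase against keeping the column names close to the values they describe.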
We evaluated the accuracy of the LLM’s answers across the 11 tested formats:
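While not explicitly tested in our experiment, it’s worth noting that compact delimited formats like CSV tend to be more token-efficient than markdown tables or natural language, because the column names appear once and each record adds only its values and delimiters. JSON that repeats every key in every record usually costs more than CSV for the same data. For example, a simple CSV record might look like this: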
ID,name,age,city,department,salary,experience,project_count
X413,Grace,32,San Francisco,Engineering,150000,15,5
In contrast, the same data in a markdown table would use more tokens due to the additional formatting:
| ID   | Name  | Age | City          | Department  | Salary | Experience | Project Count |
|------|-------|-----|---------------|-------------|--------|------------|---------------|
| X413 | Grace | 32  | San Francisco | Engineering | 150000 | 15         | 5             |
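To compare formats on your own data, you can serialize the same record each way and measure the result. The sketch below uses character counts as a rough stand-in for tokens, which is an assumption: real token counts depend on the model's tokenizer and should be measured with it. `to_csv`, `to_markdown`, and `to_json` are hypothetical helper names.

```python
import csv
import io
import json
from typing import Dict, List

Record = Dict[str, str]


def to_csv(records: List[Record]) -> str:
    """One header line, then one comma-separated line per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_markdown(records: List[Record]) -> str:
    """Pipe-delimited table with a header row and a separator row."""
    columns = list(records[0])
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for record in records:
        lines.append("| " + " | ".join(record[c] for c in columns) + " |")
    return "\n".join(lines) + "\n"


def to_json(records: List[Record]) -> str:
    """Pretty-printed JSON array; every record repeats every key."""
    return json.dumps(records, indent=2) + "\n"


# The record from the CSV example above.
record = {
    "ID": "X413", "name": "Grace", "age": "32", "city": "San Francisco",
    "department": "Engineering", "salary": "150000",
    "experience": "15", "project_count": "5",
}
sizes = {
    "csv": len(to_csv([record])),
    "markdown": len(to_markdown([record])),
    "json": len(to_json([record])),
}
```

For this record the CSV output is the shortest of the three by character count. Swapping `len` for a tokenizer-based count would turn the same comparison into an actual token measurement.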
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
6 October 2025