
Hugging Face’s conversion of the PixParse dataset to Parquet boosts efficiency for OCR and PDF research, offering faster data access and simpler manipulation compared to its previous WebDataset format.
Hugging Face has recently made the PixParse dataset available in the Parquet format. The conversion matters to researchers and practitioners working on Optical Character Recognition (OCR) and PDF processing because it offers faster reads and easier data manipulation.
The PixParse dataset, originally available in WebDataset (WDS) format, has been auto-converted to Parquet. Parquet is a columnar storage file format that is highly efficient for read-heavy workloads and large datasets. In practice, the conversion means faster data access and simpler manipulation than the WDS original offered.
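To make the columnar idea concrete, here is a toy sketch in plain Python. It is not the Parquet format itself: lists stand in for on-disk storage, and the field names (key, num_pages, text) are hypothetical, chosen only for illustration.

```python
# Toy illustration only: Python lists stand in for on-disk storage.
# The field names below are hypothetical, not the dataset's schema.
rows = [
    {"key": "a", "num_pages": 2, "text": "first document"},
    {"key": "b", "num_pages": 5, "text": "second document"},
    {"key": "c", "num_pages": 1, "text": "third document"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited just to total one field.
total_from_rows = sum(row["num_pages"] for row in rows)

# Column layout: only the one column that is needed is read.
total_from_columns = sum(columns["num_pages"])
```

Both totals are the same; the difference is how much data has to be touched to compute them, which is why a column-oriented layout tends to suit read-heavy analytical workloads.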
The PixParse dataset contains approximately 3.41 million rows, of which the first 7,000 are publicly accessible for preview. It has a single training split (train), which holds all of those roughly 3.41 million rows.
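With millions of rows, it is usually more practical to stream a handful of records than to download everything. The sketch below assumes the Hugging Face datasets library is installed and that the repository id is pixparse/idl-wds, which is inferred from the URL in the sample record rather than confirmed elsewhere. Access may also require authentication or accepting terms. The helper names preview_rows and open_train_stream are hypothetical.

```python
from itertools import islice


def preview_rows(rows, limit=5):
    """Return at most `limit` rows from any iterable without consuming the rest."""
    return list(islice(rows, limit))


def open_train_stream(repo_id="pixparse/idl-wds"):
    """Open the train split lazily (needs the `datasets` package and network access).

    The default repo id is an assumption taken from the sample record's URL.
    """
    # Imported here so preview_rows stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train", streaming=True)
```

A typical call would be preview_rows(open_train_stream(), 3), which pulls only the first three records over the network.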

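Each row in the dataset includes several fields. The sample record below, which is cut off partway through, shows three of them: "__key__" (a record identifier), "__url__" (the source archive the record came from), and "json" (a nested object whose "pages" list holds the recognised text lines for each page):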
{
"__key__": "klpb0135",
"__url__": "hf://datasets/pixparse/idl-wds@e70d43a77ae233778613acf860df8b2d91e0673b/idl-train-00000.tar",
"json": {
"pages": [
{
"text": [
"BROWN & WILLIAMSON TOBACCO CORPORATION",
"RESEARCH & DEVELOPMENT",
"INTERNAL CORRESPONDENCE",
"SECRET. Do Not Copy Or Make Notes Of",
"This Page And Do Not Give Any Of The",
"TO :",
"MR. P. H. HARPER",
"Information Contained In This Document",
"To Anyone Except In Conformance With",
"CC:",
"MR. T. E. SANDEFUR",
"The Moorgate Secrecy Protocol.",
"DR. P. L. AULBACH",
"FROM:",
"MR. D. S. ROTH",
"DATE:",
"JANUARY 13, 1983",
"SUBJECT: KENT 80/KS TAR REDUCTION - FINLAND/907",
"Ref S/R 107/82",
"After evaluations and rejections of tar reduction samples 49B, 50B, and",
"51B we have reviewed deliveries and designs of several KENT products to",
"find one which meets the 10.0 mg/cig. (DPM) and 0.7 mg/cig. nicotine",
"constraints.",
"We propose the following two trials.",
"Trial 1",
"Trial 2",
"Blend (%) :",
"MGLF-31",
"69.9",
"69.9",
"Oriental",
"10.3",
"10.3",
"MRT",
"5.0",
"0.0",
"F.C. WTS",
"14.8",
"19.8",
"Final Casing",
"MGE-651",
"MGE-651",
"Final Flavor",
"MGE-
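A record shaped like the sample above can be unpacked with a few lines of Python. This is a minimal sketch that assumes only the three fields visible in the sample; the helper names parse_source_url and page_texts are hypothetical, and the URL parser assumes the file sits directly under the repository root, as it does in the sample.

```python
def parse_source_url(url):
    """Split an hf://datasets/ URL into (repo_id, revision, filename).

    Assumes the layout seen in the sample record: repo@revision/filename.
    """
    prefix = "hf://datasets/"
    if not url.startswith(prefix):
        raise ValueError(f"unexpected URL scheme: {url!r}")
    repo_and_revision, _, filename = url[len(prefix):].rpartition("/")
    repo_id, _, revision = repo_and_revision.partition("@")
    return repo_id, revision or None, filename


def page_texts(record):
    """Return one newline-joined string per page in the record's "json" field."""
    pages = record.get("json", {}).get("pages", [])
    return ["\n".join(page.get("text", [])) for page in pages]


# A shortened stand-in for the sample record (only its first two text lines).
sample = {
    "__key__": "klpb0135",
    "__url__": "hf://datasets/pixparse/idl-wds@e70d43a77ae233778613acf860df8b2d91e0673b/idl-train-00000.tar",
    "json": {
        "pages": [
            {
                "text": [
                    "BROWN & WILLIAMSON TOBACCO CORPORATION",
                    "RESEARCH & DEVELOPMENT",
                ]
            }
        ]
    },
}

repo_id, revision, filename = parse_source_url(sample["__url__"])
texts = page_texts(sample)
```

Joining the lines per page gives a plain-text view of each page, in whatever order the lines are stored in the record.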
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
3 April 2024