RealKIE: Five New Datasets for Enterprise Key Information Extraction

The Engineer

4 Apr 2024

RealKIE introduces five datasets built around enterprise documents such as SEC filings and non-disclosure agreements, designed to challenge existing key information extraction methods.

RealKIE is a new benchmark from researchers Benjamin Townsend, Madison May, Katherine Mackowiak, and Christopher Wells. It comprises five challenging datasets designed to advance key information extraction (KIE) methods in enterprise applications: SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts. Each presents its own challenges, drawn from real-world documents.

What's New and Why It Matters

RealKIE matters for KIE research because it targets the practical problems that arise in enterprise settings. Existing benchmark datasets often lack the complexity and variability of the documents businesses actually process. Here’s what makes these datasets stand out:

Diverse Document Types: The datasets span financial filings, legal agreements, annual reports, and invoices, so models are tested against a variety of formats rather than a single layout.
Realistic Challenges:
- Poor Text Serialization: Extracted text that does not follow the visual reading order of the page, often because of inconsistent formatting and structure.
- Sparse Annotations in Long Documents: Long documents in which only a small fraction of the text is annotated.
- Complex Tabular Layouts: Tables and other structured data whose values must be extracted accurately.
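
To make the sparse-annotation challenge concrete, here is a minimal sketch of one common way to feed a long document to a model with a limited input size: overlapping character windows. The `Span` type, the `chunk_with_spans` helper, the `effective_date` label, and the window sizes are all illustrative assumptions, not part of RealKIE or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into the document
    end: int    # exclusive character offset into the document
    label: str


def chunk_with_spans(text, spans, size=512, stride=256):
    """Yield (chunk_start, chunk_text, spans_inside) for overlapping windows.

    Spans are re-based to chunk-local offsets. A span is kept only if it
    lies entirely inside the window.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    start = 0
    while True:
        end = min(start + size, len(text))
        inside = [
            Span(s.start - start, s.end - start, s.label)
            for s in spans
            if s.start >= start and s.end <= end
        ]
        yield start, text[start:end], inside
        if end == len(text):
            break
        start += stride


# Hypothetical long document with a single annotated field in the middle.
doc = "x" * 1000 + "Effective Date: 1 May 2020" + "y" * 1000
gold = [Span(1016, 1026, "effective_date")]

chunks = list(chunk_with_spans(doc, gold))
labeled = [c for c in chunks if c[2]]
print(f"{len(labeled)} of {len(chunks)} chunks contain a label")
```

Most windows carry no labels at all, which is the class imbalance a model has to cope with on long, sparsely annotated documents. Because the windows overlap, a span cut off at one window's edge is still captured whole by a neighboring window, as long as the span is no longer than `size - stride` characters.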

Dataset Details

1. SEC S1 Filings

Documents: Initial public offering (IPO) filings with detailed financial information.
Challenges: Dense text, complex tables, and varying formats across different companies.

2. US Non-disclosure Agreements (NDAs)

Documents: Legal agreements that protect confidential information.
Challenges: Varied legal jargon, sparse annotations in long paragraphs, and inconsistent formatting.

3. UK Charity Reports

Documents: Annual reports from charitable organizations.
Challenges: Mixed content types (text, tables, images), varying lengths, and detailed financial sections.

4. FCC Invoices

Documents: Billing statements for communication services.
Challenges: Structured data in tabular form, with specific fields requiring accurate extraction.

5. Resource Contracts

Documents: Agreements related to natural resource management.
Challenges: Complex legal language, multiple clauses, and detailed technical specifications.

Annotation Process

The researchers provide an in-depth description of the annotation process, which is crucial for ensuring high-quality data. Key points include:

Manual Annotation: Trained annotators reviewed each document to ensure accuracy.
Consistency Checks: Regular reviews to maintain consistency across annotations.
Tooling: Custom tools were developed to facilitate efficient and accurate annotation.
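
The article does not say how consistency was measured. One generic way to check agreement between two annotators is an exact-match F1 over their labeled spans, sketched below. The `span_agreement_f1` function and the example labels are hypothetical and are not taken from the RealKIE tooling.

```python
def span_agreement_f1(annotator_a, annotator_b):
    """Exact-match F1 between two collections of (start, end, label) spans.

    A span counts as agreed only if both annotators chose the same
    offsets and the same label.
    """
    a, b = set(annotator_a), set(annotator_b)
    if not a and not b:
        return 1.0  # both annotators agree that nothing should be labeled
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)


# Two annotators agree on the party name but disagree on where the date ends.
first = [(0, 5, "party"), (10, 20, "date")]
second = [(0, 5, "party"), (10, 21, "date")]
print(span_agreement_f1(first, second))
```

Exact matching is strict: a one-character boundary disagreement counts as a full miss. A review process might prefer an overlap-based score if boundary noise is expected.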

Document Processing Techniques

To handle the diverse and challenging nature of these documents, the researchers describe several document processing techniques:

Optical Character Recognition (OCR): Used to convert scanned documents into machine-readable text.
Layout Analysis: Algorithms to identify and segment different parts of the document (e.g., tables, paragraphs).
Preprocessing: Cleaning and normalizing the text to improve model performance.
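
As a rough illustration of how OCR output becomes model-ready text, the sketch below groups word boxes into lines by vertical position, reads each line left to right, and then normalizes the result. The `Token` fields, the `line_tolerance` heuristic, and both helper functions are assumptions made for this example; they do not describe the pipeline used in the paper.

```python
import re
import unicodedata
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    x: float       # left edge of the word box
    y: float       # top edge of the word box
    height: float  # box height, used to decide whether two words share a line


def serialize(tokens, line_tolerance=0.5):
    """Group tokens into lines by vertical position, then read left to right."""
    lines = []
    for tok in sorted(tokens, key=lambda t: (t.y, t.x)):
        # Same line if the top edge is close to the first token of the current line.
        if lines and abs(tok.y - lines[-1][0].y) <= line_tolerance * tok.height:
            lines[-1].append(tok)
        else:
            lines.append([tok])
    return "\n".join(
        " ".join(t.text for t in sorted(line, key=lambda t: t.x)) for line in lines
    )


def normalize(text):
    """Apply Unicode NFKC normalization and collapse runs of spaces and tabs."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[ \t]+", " ", text)
    return "\n".join(line.strip() for line in text.splitlines())


# Word boxes arrive in arbitrary order, with slightly uneven baselines.
words = [
    Token("Total", 10, 100, 10),
    Token("$42.00", 200, 101, 10),
    Token("Invoice", 10, 50, 10),
    Token("#123", 80, 49, 10),
]
print(normalize(serialize(words)))
```

A simple top-to-bottom, left-to-right rule like this breaks down on multi-column pages and nested tables, which is one way poor text serialization arises. Note also that normalization shifts character offsets, so it has to be applied before annotations are aligned to the text, or the offsets must be remapped.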

Baseline Modeling Approaches

The paper also presents baseline models for each dataset, providing a starting point for researchers and practitioners. Key approaches include:

Rule-Based Methods: Simple heuristics for extracting specific information.
Machine Learning Models: Supervised learning algorithms trained on annotated data.
Deep Learning Models: Neural networks capable of handling complex patterns in the data.
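
To show how the first two categories relate, the sketch below pairs a regex heuristic with the BIO tagging scheme commonly used to turn labeled spans into per-token training targets for supervised models. The date pattern, the `date` label, and both functions are illustrative assumptions; the paper's actual baselines may be set up differently.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December"
)
# Matches dates written like "1 May 2020"; other date formats are ignored.
DATE_PATTERN = re.compile(rf"\b(\d{{1,2}} (?:{MONTHS}) \d{{4}})\b")


def extract_dates(text):
    """Rule-based extractor: return (start, end, label) for each date match."""
    return [(m.start(1), m.end(1), "date") for m in DATE_PATTERN.finditer(text)]


def to_bio(text, spans):
    """Split on whitespace and tag each token B-/I-/O against the spans."""
    tagged = []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if m.start() < end and m.end() > start:  # token overlaps the span
                prefix = "B-" if m.start() <= start else "I-"
                tag = prefix + label
                break
        tagged.append((m.group(), tag))
    return tagged


sentence = "Effective as of 1 May 2020."
spans = extract_dates(sentence)
for token, tag in to_bio(sentence, spans):
    print(f"{token}\t{tag}")
```

In practice the spans would come from human annotations rather than from the regex, and the resulting token tags would serve as targets for a sequence-labeling model. The heuristic itself is useful mainly as a sanity-check baseline.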

Availability

All annotated data, OCR outputs, and code to reproduce the baselines are available for download at indicodatasolutions.github.io/RealKIE. This transparency supports further research and development in key information extraction technologies.

Conclusion

The RealKIE benchmark is a valuable resource for advancing KIE methods in enterprise applications. By addressing real-world challenges, it provides a realistic testing ground for developing more effective NLP models. For researchers and practitioners alike, the datasets offer a practical way to measure and improve information extraction systems.