DOCCI: A New Dataset for Fine-Grained Vision-Language Research

Models & Research

The Engineer

6 May 2024 · 3 min read

DOCCI pairs complex visual scenes with long, detailed descriptions, giving text-to-image (T2I) and image-to-text (I2T) research a dataset built around subtle differences and connections across sets of images.

Descriptions of Connected and Contrasting Images (DOCCI) - ECCV 2024

Vision-language datasets are crucial for advancing both text-to-image (T2I) and image-to-text (I2T) research. However, existing datasets often fall short in providing the fine-grained detail necessary to train models that can capture complex visual and linguistic relationships. To address this gap, researchers from Google, Princeton University, and UNC Chapel Hill have introduced Descriptions of Connected and Contrasting Images (DOCCI), a dataset featuring long, human-annotated English descriptions for 15,000 images.

What Changed?

DOCCI raises the level of detail available in vision-language datasets. Where other datasets often rely on short captions, DOCCI provides comprehensive, human-written descriptions averaging 136 words per image. The images were taken, curated, and donated by a single researcher with the aim of capturing key challenges such as spatial relations, counting, text rendering, and world knowledge, which keeps the collection consistent and focused on specific research objectives.

Why It Matters

For practitioners in the field of vision-language research, DOCCI offers several significant advantages:

Richer Descriptions: The detailed annotations allow models to learn more nuanced associations between images and text.
Consistency: Because the images were curated by a single researcher, the dataset keeps a consistent focus on the key challenges.
Versatility: DOCCI can be used for both training and evaluating T2I and I2T models, making it a versatile resource.

Dataset Details

Size: 15,000 images
Description Length: Average of 136 words per image
Challenges Addressed:
- Spatial relations (e.g., "the cat is on the left side of the table")
- Counting (e.g., "there are three apples in the basket")
- Text rendering (e.g., "the sign reads 'Welcome to New York'")
- World knowledge (e.g., "the Eiffel Tower is visible in the background")
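
The description-length figure above is a simple per-image statistic. A minimal sketch of how such an average could be computed, assuming a hypothetical record layout (the `Record` class and its fields are illustrative, not the dataset's actual schema) and assuming whitespace-separated word counting, which may differ from how the authors counted:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Record:
    """Hypothetical record layout; the real DOCCI schema may differ."""
    image_id: str
    description: str


def word_count(text: str) -> int:
    # Whitespace splitting is an assumption, not the authors' stated method.
    return len(text.split())


def average_description_length(records: list[Record]) -> float:
    # Mean number of words per description across all records.
    return mean(word_count(r.description) for r in records)


# Toy records built from the illustrative sentences above, not real DOCCI data.
records = [
    Record("example_0001", "The cat is on the left side of the table."),
    Record("example_0002", "There are three apples in the basket."),
]
print(average_description_length(records))
```

Running the same function over the full set of descriptions would give a figure comparable to the reported average, subject to the word-counting assumption.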

Implementation and Results

To test DOCCI as a training resource, the researchers fine-tuned a PaLI 5B model on the dataset. They report the following:

Image-to-Text Generation: The PaLI 5B model fine-tuned on DOCCI showed equal or superior performance compared to larger models such as LLaVA-1.5 7B and InstructBLIP 7B.
Text-to-Image Generation: DOCCI also serves as a useful testbed for T2I generation, highlighting the limitations of current models in capturing long descriptions and fine details.
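
One way a long description can be lost is when a model's text input has a fixed length budget. The following is a rough sketch only: it assumes whitespace tokens and a hypothetical budget of 77, whereas real tokenizers count differently, and the article does not state which limits apply to the models tested. The `retained_fraction` helper is illustrative and not part of any DOCCI tooling.

```python
def retained_fraction(description: str, max_tokens: int) -> float:
    """Fraction of whitespace tokens that fit within a token budget."""
    tokens = description.split()
    if not tokens:
        # An empty description loses nothing to truncation.
        return 1.0
    return min(len(tokens), max_tokens) / len(tokens)


# Placeholder text with the reported average length of 136 words.
long_description = " ".join(["word"] * 136)

# Hypothetical budget; substitute the limit of the model under evaluation.
print(retained_fraction(long_description, max_tokens=77))
```

A value well below 1.0 for typical descriptions would suggest that part of the prompt never reaches the image generator, which is one possible reason fine details go missing.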

Key Takeaways

Training Resource: DOCCI is an effective training resource for I2T models, improving their ability to generate detailed and accurate text.
Evaluation Tool: The dataset can be used to evaluate the strengths and weaknesses of T2I models, particularly in handling complex and long descriptions.
Research Focus: By focusing on specific challenges, DOCCI helps researchers identify areas where current models need improvement.

Conclusion

DOCCI represents a significant step forward in vision-language research by providing detailed, human-annotated descriptions that capture a wide range of visual and linguistic challenges. For practitioners looking to enhance their I2T and T2I models, this dataset offers a valuable resource for both training and evaluation.