
DOCCI pairs complex visual scenes with long, detailed descriptions, supporting T2I and I2T research that depends on subtle differences and connections across sets of images.
Vision-language datasets are crucial for advancing both text-to-image (T2I) and image-to-text (I2T) research. However, existing datasets often fall short in providing the fine-grained detail necessary to train models that can capture complex visual and linguistic relationships. To address this gap, researchers from Google, Princeton University, and UNC Chapel Hill have introduced Descriptions of Connected and Contrasting Images (DOCCI), a dataset featuring long, human-annotated English descriptions for 15,000 images.
DOCCI raises the level of detail available in vision-language datasets. Unlike datasets that rely on short captions, DOCCI provides comprehensive, human-written descriptions averaging 136 words per image. The descriptions are designed to capture key challenges such as spatial relations, counting, text rendering, and world knowledge. The images were curated by a single researcher, which keeps the collection consistent and focused on these research objectives.
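To make the "words per image" figure concrete, here is a minimal sketch of how such a statistic could be computed over a set of image descriptions. The record layout (`image_id`, `description`) and the two sample records are hypothetical, invented for illustration; they are not DOCCI's actual schema or data, and whitespace splitting is only one possible way to count words.

```python
from statistics import mean


def description_stats(records):
    """Summarise description lengths, counting whitespace-separated words.

    `records` is a list of dicts with a "description" key. This layout is
    an assumption for illustration, not DOCCI's published schema.
    """
    lengths = [len(r["description"].split()) for r in records]
    return {
        "count": len(lengths),
        "mean_words": mean(lengths),
        "min_words": min(lengths),
        "max_words": max(lengths),
    }


# Hypothetical sample records, far shorter than real long-form descriptions.
sample = [
    {
        "image_id": "example_0001",
        "description": "A red mug sits to the left of two stacked books on a wooden desk.",
    },
    {
        "image_id": "example_0002",
        "description": "Three pigeons stand under a sign that reads OPEN.",
    },
]

stats = description_stats(sample)
print(stats)
```

Running the same function over a real set of descriptions would give a quick check of how a dataset's caption lengths compare with the average quoted above.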
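For practitioners in vision-language research, the main advantages follow from this design: long, human-written descriptions rather than short captions, explicit coverage of challenges such as spatial relations, counting, text rendering, and world knowledge, and consistent curation.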

To demonstrate DOCCI's usefulness, the researchers fine-tuned a PaLI 5B model on the dataset.
DOCCI represents a significant step forward in vision-language research by providing detailed, human-annotated descriptions that capture a wide range of visual and linguistic challenges. For practitioners looking to enhance their I2T and T2I models, this dataset offers a valuable resource for both training and evaluation.
Original Sources
↗ https://google.github.io/docci/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
6 May 2024