
CUPS harnesses motion and depth from stereo images to create detailed pseudo-labels, enabling a monocular model to achieve top-tier accuracy in unsupervised panoptic segmentation without human-labeled data.
Researchers from TU Darmstadt, TU Munich, University of Oxford, MCML, ELIZA, and hessian.AI have introduced CUPS (Scene-Centric Unsupervised Panoptic Segmentation). The method uses motion and depth cues from stereo pairs to generate high-quality panoptic pseudo-labels, which are then used to train a monocular panoptic segmentation network. Stereo input is needed only to build the pseudo-labels; the trained network works on single images. The authors report state-of-the-art results on complex scene-centric benchmarks without any manual annotations.
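To make the pseudo-label idea concrete, here is a minimal sketch in Python with NumPy. It is not the CUPS implementation: the function names, the majority-vote rule, and the `semantic_id * 1000 + instance_id` encoding (in the style of Cityscapes label IDs) are assumptions chosen for illustration. It shows two generic building blocks: turning stereo disparity into depth with the standard rectified-stereo relation, and fusing a semantic map with instance masks (for example, masks obtained from motion cues) into a single panoptic label map.

```python
import numpy as np


def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Standard rectified-stereo relation: depth = focal * baseline / disparity.

    Disparities are clamped to `eps` to avoid division by zero.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    return focal_px * baseline_m / np.maximum(disparity, eps)


def fuse_panoptic(semantic, instance_masks, divisor=1000):
    """Fuse a semantic map and binary instance masks into a panoptic map.

    Hypothetical encoding: semantic_id * divisor + instance_id,
    with instance_id 0 for pixels not covered by any instance mask.
    """
    semantic = np.asarray(semantic, dtype=np.int64)
    panoptic = semantic * divisor  # every pixel starts with instance_id 0
    for instance_id, mask in enumerate(instance_masks, start=1):
        mask = np.asarray(mask, dtype=bool)
        if not mask.any():
            continue  # skip empty masks
        # Assign the instance the most frequent semantic class under its mask.
        classes, counts = np.unique(semantic[mask], return_counts=True)
        majority = classes[np.argmax(counts)]
        panoptic[mask] = majority * divisor + instance_id
    return panoptic


if __name__ == "__main__":
    semantic = np.array([[0, 0, 1, 1],
                         [0, 1, 1, 1]])
    moving_object = np.array([[0, 0, 1, 1],
                              [0, 0, 1, 1]])
    print(fuse_panoptic(semantic, [moving_object]))
    print(disparity_to_depth([10.0, 20.0], focal_px=1000.0, baseline_m=0.2))
```

In a pipeline of the kind described above, a map like this would serve as the training target for the monocular network; how CUPS actually derives and refines its masks is described in the original source.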
CUPS addresses a gap in unsupervised panoptic segmentation by training directly on scene-centric data rather than on object-centric training sets. Supervised panoptic segmentation, by contrast, relies on manually annotated datasets, which are time-consuming and expensive to create. CUPS removes that dependency, which makes panoptic segmentation easier to apply to real-world imagery with complex scenes.
Pseudo-Label Generation:
Training Pipeline:
CUPS leverages several advanced techniques to achieve its results:

CUPS has been evaluated on several scene-centric benchmarks, including Cityscapes, where the authors report state-of-the-art results among unsupervised methods.
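For readers unfamiliar with how panoptic benchmarks such as Cityscapes are usually scored, the standard metric is Panoptic Quality (PQ): predicted and ground-truth segments are matched when their IoU exceeds 0.5, and PQ is the summed IoU of the matches divided by TP + 0.5 FP + 0.5 FN. The sketch below computes PQ for a single class on toy segment maps. The article does not state which metric or evaluation code was used, so treat this as a generic illustration; the function name and the convention that ID 0 means "no segment" are assumptions.

```python
import numpy as np


def panoptic_quality(pred, gt):
    """Single-class PQ for integer segment-id maps (0 = no segment, assumed)."""
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    pred_ids = [int(i) for i in np.unique(pred) if i != 0]
    gt_ids = [int(i) for i in np.unique(gt) if i != 0]

    matched_pred = set()
    iou_sum = 0.0
    tp = 0
    for g in gt_ids:
        g_mask = gt == g
        for p in pred_ids:
            if p in matched_pred:
                continue
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            iou = inter / union
            # IoU > 0.5 guarantees at most one match per segment,
            # so a simple greedy loop is sufficient.
            if iou > 0.5:
                tp += 1
                iou_sum += float(iou)
                matched_pred.add(p)
                break

    fp = len(pred_ids) - tp  # unmatched predicted segments
    fn = len(gt_ids) - tp    # unmatched ground-truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom > 0 else 0.0


if __name__ == "__main__":
    gt = np.array([[1, 1, 2, 2]])
    pred = np.array([[1, 1, 1, 3]])
    print(panoptic_quality(pred, gt))
```

In the toy example, one predicted segment matches with IoU 2/3, while the other pair overlaps with IoU exactly 0.5 and therefore does not count as a match, which leaves one false positive and one false negative.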
For practitioners and researchers, CUPS offers a powerful tool for unsupervised panoptic segmentation that can be applied to a wide range of real-world scenarios. By eliminating the need for manual annotations, it reduces the overhead associated with data preparation and makes it feasible to apply panoptic segmentation to large-scale datasets and dynamic environments.
CUPS represents a significant advancement in unsupervised panoptic segmentation by focusing on scene-centric data and leveraging motion and depth information from stereo pairs. The method's ability to generate high-quality pseudo-labels and its robust training pipeline make it a valuable addition to the field, paving the way for more efficient and accurate scene understanding.
Original Sources
↗ https://visinf.github.io/cups/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
7 April 2025