
Researchers at UC Berkeley and Peking University introduce SegAnyMo, a novel method using semantic cues alongside motion data to accurately segment moving objects in videos without human annotations.
By Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, Qianqian Wang
UC Berkeley / Peking University
CVPR 2025
Moving object segmentation is a critical task for high-level visual scene understanding and has numerous downstream applications. While humans can effortlessly segment moving objects in videos, previous approaches have often relied on optical flow, which can lead to imperfect predictions due to issues like partial motion, complex deformations, motion blur, and background distractions.
In this paper, we introduce SegAnyMo, a novel approach that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model uses Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, particularly in challenging scenarios and fine-grained segmentation of multiple objects.
SegAnyMo takes 2D tracks and depth maps generated by off-the-shelf models as input. A motion encoder processes these inputs to capture motion patterns and outputs per-track features ("featured tracks"). Next, a tracks decoder integrates DINO features and decodes the featured tracks while keeping motion and semantic information decoupled, ultimately identifying which trajectories are dynamic.
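To make the data flow concrete, here is a minimal NumPy sketch of attention applied to track features. It is not the authors' architecture: the source names Spatio-Temporal Trajectory Attention but does not spell out its layers, so the sketch assumes attention along time within each track, followed by attention across tracks at each time step. The function names (`self_attention`, `encode_tracks`), the residual connections, and the absence of learned projection weights are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Single-head self-attention over the second-to-last axis.

    Assumption: no learned query/key/value projections, to keep the
    sketch short. Input and output shape: (..., L, C).
    """
    scale = np.sqrt(x.shape[-1])
    scores = x @ np.swapaxes(x, -1, -2) / scale  # (..., L, L)
    return softmax(scores, axis=-1) @ x          # (..., L, C)


def encode_tracks(tracks_2d, depths):
    """Hypothetical stand-in for a motion encoder.

    tracks_2d: (N, T, 2) pixel positions of N tracks over T frames.
    depths:    (N, T) depth sampled at each track position.
    Returns per-track features of shape (N, T, 3).
    """
    x = np.concatenate([tracks_2d, depths[..., None]], axis=-1)  # (N, T, 3)
    # Temporal attention: each track attends across its own time steps.
    x = x + self_attention(x)
    # Spatial attention: at each time step, tracks attend to one another.
    xt = np.swapaxes(x, 0, 1)                                    # (T, N, 3)
    xt = xt + self_attention(xt)
    return np.swapaxes(xt, 0, 1)                                 # (N, T, 3)


# Usage with random placeholder data (8 tracks, 16 frames).
rng = np.random.default_rng(0)
features = encode_tracks(rng.random((8, 16, 2)), rng.random((8, 16)))
```

Alternating the two axes keeps each attention matrix small (T by T, then N by N) instead of attending over all N times T tokens at once, which is one common motivation for factorized attention.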
Finally, using SAM2, we group dynamic tracks belonging to the same object and generate fine-grained moving object masks. This grouping step lets the model produce a separate mask for each object, even in complex scenes with multiple moving objects.
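The grouping step can be pictured as a loop that prompts a segmenter with one dynamic point at a time. The sketch below is one plausible reading, not the paper's procedure: the source says only that SAM2 is driven by an iterative prompting strategy. It does not call the real SAM2 API. `segment_fn` is a hypothetical callable that stands in for any promptable segmenter, `toy_segmenter` is a stand-in used only to make the example runnable, and the seed-selection rule and `max_objects` cap are assumptions.

```python
import numpy as np


def iterative_prompting(points, segment_fn, max_objects=10):
    """Group dynamic points into objects by prompting repeatedly.

    points:     (K, 2) integer array of (x, y) positions of dynamic
                tracks in one frame.
    segment_fn: hypothetical callable (x, y) -> boolean mask (H, W).
    Returns a list of masks, one per discovered object.
    """
    remaining = np.ones(len(points), dtype=bool)
    masks = []
    while remaining.any() and len(masks) < max_objects:
        idx = np.flatnonzero(remaining)
        # Assumed rule: prompt with the point nearest the centroid
        # of all points that are still unexplained.
        centroid = points[idx].mean(axis=0)
        dists = np.linalg.norm(points[idx] - centroid, axis=1)
        seed = idx[np.argmin(dists)]
        mask = segment_fn(*points[seed])
        masks.append(mask)
        # Drop every dynamic point that the new mask already covers.
        covered = mask[points[idx, 1], points[idx, 0]]
        remaining[idx[covered]] = False
        # Guarantee progress even if the mask misses its own seed.
        remaining[seed] = False
    return masks


def toy_segmenter(x, y, shape=(32, 32)):
    """Stand-in segmenter: returns the box that contains the prompt."""
    mask = np.zeros(shape, dtype=bool)
    for x0, y0, x1, y1 in [(0, 0, 10, 10), (20, 20, 30, 30)]:
        if x0 <= x < x1 and y0 <= y < y1:
            mask[y0:y1, x0:x1] = True
    return mask


# Five dynamic points: three on one object, two on another.
dynamic_points = np.array([[2, 3], [5, 5], [7, 2], [22, 25], [25, 21]])
object_masks = iterative_prompting(dynamic_points, toy_segmenter)
```

With the toy segmenter, the loop stops after two prompts because every dynamic point is then covered by a mask.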

We observed that in highly challenging scenes, such as those with drastic camera movement or rapid object motion, relying solely on motion information is insufficient. For example, without DINO feature information, the model might incorrectly classify a stationary road surface as dynamic, even though the road lacks the ability to move.
This issue is evident in one of our test videos (first row of the comparison figure). With DINO features, the model distinguishes static from dynamic elements more accurately, which leads to better segmentation results.
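A small worked example shows why 2D motion alone is ambiguous under camera movement. The sketch projects a static 3D point through a translating pinhole camera. The intrinsics (`focal`, `cx`, `cy`), the camera path, and the identity camera rotation are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def project(point_cam, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a 3D point given in camera coordinates."""
    x, y, z = point_cam
    return np.array([focal * x / z + cx, focal * y / z + cy])


# A static world point, for example a spot on the road surface.
static_point = np.array([1.0, 0.0, 5.0])

# The camera translates to the right by 0.1 units per frame
# (identity rotation, so camera coords = world point - camera position).
camera_positions = [np.array([0.1 * t, 0.0, 0.0]) for t in range(5)]

# The resulting 2D track of a point that never moved in 3D.
track = np.array([project(static_point - c) for c in camera_positions])
displacement = np.linalg.norm(track[-1] - track[0])
```

Here the static point drifts by roughly 40 pixels across five frames, so a classifier that looks only at 2D displacement could label it as moving.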
SegAnyMo advances moving object segmentation by combining long-range motion cues with semantic features. The combination improves segmentation accuracy and handles complex scenes more robustly, and its state-of-the-art results on diverse datasets suggest it could benefit downstream computer vision applications.
Original Sources
↗ https://motion-seg.github.io/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
2 April 2025