SegAnyMo: Combining Motion and Semantic Cues for Video Object Segmentation

The Engineer

2 Apr 2025 · 3 min read

Researchers at UC Berkeley and Peking University introduce SegAnyMo, a novel method using semantic cues alongside motion data to accurately segment moving objects in videos without human annotations.

Segment Any Motion in Videos Without Human Annotations

By Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, Qianqian Wang
UC Berkeley / Peking University
CVPR 2025

Abstract

Moving object segmentation is a critical task for high-level visual scene understanding and has numerous downstream applications. While humans can effortlessly segment moving objects in videos, previous approaches have often relied on optical flow, which can lead to imperfect predictions due to issues like partial motion, complex deformations, motion blur, and background distractions.

In this paper, we introduce SegAnyMo, a novel approach that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model uses Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, particularly in challenging scenarios and fine-grained segmentation of multiple objects.

Overview

SegAnyMo takes 2D tracks and depth maps generated by off-the-shelf models as input. These inputs are processed by a motion encoder to capture motion patterns, producing featured tracks. Next, a tracks decoder integrates DINO features to decode the featured tracks by decoupling motion and semantic information, ultimately generating dynamic trajectories.
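To make the input stage concrete, here is a minimal sketch of how 2D tracks and depth maps could be combined into per-track features before they reach a motion encoder. The function name `build_track_features` and the `(x, y, depth, dx, dy)` layout are assumptions chosen for illustration; the article does not specify the exact input encoding SegAnyMo uses.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates
Feature = Tuple[float, float, float, float, float]  # (x, y, depth, dx, dy)


def build_track_features(
    tracks: List[List[Point]],            # tracks[n][t]: position of track n at frame t
    depth_maps: List[List[List[float]]],  # depth_maps[t][row][col]: depth at frame t
) -> List[List[Feature]]:
    """Return features[n][t] = (x, y, depth, dx, dy). Hypothetical layout."""
    features = []
    for track in tracks:
        per_frame = []
        for t, (x, y) in enumerate(track):
            depth = depth_maps[t]
            # Clamp to the image bounds before the nearest-pixel depth lookup.
            row = min(max(int(round(y)), 0), len(depth) - 1)
            col = min(max(int(round(x)), 0), len(depth[0]) - 1)
            # Frame-to-frame displacement; zero for the first frame.
            if t == 0:
                dx, dy = 0.0, 0.0
            else:
                prev_x, prev_y = track[t - 1]
                dx, dy = x - prev_x, y - prev_y
            per_frame.append((x, y, depth[row][col], dx, dy))
        features.append(per_frame)
    return features


# Toy input: one track over two frames of a 2x2 depth map.
tracks = [[(0.0, 0.0), (1.0, 1.0)]]
depth_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
features = build_track_features(tracks, depth_maps)
```

Pairing each 2D position with its depth gives the encoder a rough 3D cue, which is one plausible reason to feed depth alongside the tracks.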

Finally, using SAM2, we group dynamic tracks belonging to the same object and generate fine-grained moving object masks. This process ensures that even in complex scenes with multiple moving objects, the model can accurately segment each object.
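The grouping step can be pictured with a short sketch. It assumes a promptable segmenter exposed as a plain callback, `segment_from_prompt`, which is a hypothetical stand-in and not the SAM2 API. One dynamic point is used as a prompt, every remaining dynamic point that falls inside the returned mask is assigned to the same object, and the loop repeats on the leftovers. The paper's actual prompting strategy may differ in how prompts are chosen and refined.

```python
from typing import Callable, List, Set, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinate in one frame
Mask = Set[Point]        # set of pixels covered by an object mask


def group_dynamic_points(
    dynamic_points: List[Point],
    segment_from_prompt: Callable[[Point], Mask],  # hypothetical segmenter callback
    max_objects: int = 10,
) -> List[Tuple[Mask, List[Point]]]:
    """Iteratively prompt the segmenter until every dynamic point is grouped."""
    remaining = list(dynamic_points)
    objects = []
    while remaining and len(objects) < max_objects:
        prompt = remaining[0]
        mask = segment_from_prompt(prompt)
        members = [p for p in remaining if p in mask]
        if not members:
            # The mask missed its own prompt; keep the prompt alone so the loop progresses.
            members = [prompt]
        objects.append((mask, members))
        remaining = [p for p in remaining if p not in members]
    return objects


def toy_segmenter(prompt: Point) -> Mask:
    """Toy stand-in: the image holds two 5x5 objects, side by side."""
    x0 = 0 if prompt[0] < 5 else 5
    return {(x, y) for x in range(x0, x0 + 5) for y in range(5)}


objects = group_dynamic_points([(1, 1), (6, 2), (2, 3), (7, 4)], toy_segmenter)
```

In this toy run the four dynamic points collapse into two objects, one per square, each with its own mask.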

Why DINO Features?

We observed that in highly challenging scenes, such as those with drastic camera movement or rapid object motion, relying solely on motion information is insufficient. For example, without DINO features, the model might classify a stationary road surface as dynamic, even though a road cannot move.

This issue is evident in one of our test videos (first row of the original figure). The DINO features help the model distinguish static from dynamic elements more accurately, which leads to better segmentation results.

Technical Details

Input Data: 2D tracks and depth maps from off-the-shelf models.
Motion Encoder: Captures motion patterns by processing the input data and producing featured tracks.
Tracks Decoder: Integrates DINO features to decode the featured tracks, decoupling motion and semantic information.
Dynamic Trajectories: The decoder generates dynamic trajectories that capture the movement of objects over time.
SAM2 for Mask Densification: Uses an iterative prompting strategy to group dynamic tracks and generate fine-grained moving object masks.
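One way to read "decoupling motion and semantic information" is sketched below. The assumption, which the article does not confirm, is that attention weights are computed from motion features alone and that each track's semantic (DINO-like) vector is appended only afterwards, so semantics support the decision without steering the attention. The function `decoupled_embedding` and the toy features are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def decoupled_embedding(motion: List[Vector], semantic: List[Vector]) -> List[Vector]:
    """Attend across tracks using motion only, then append each track's semantics."""
    dim = len(motion[0])
    out = []
    for i, query in enumerate(motion):
        # Attention weights depend on motion similarity only.
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in motion])
        attended = [sum(w * v[d] for w, v in zip(weights, motion)) for d in range(dim)]
        # Semantic features are concatenated after attention (assumed design).
        out.append(attended + semantic[i])
    return out


motion = [[1.0, 0.0], [0.0, 1.0]]  # toy per-track motion features
semantic = [[0.5], [0.9]]          # toy per-track semantic features
embedded = decoupled_embedding(motion, semantic)
swapped = decoupled_embedding(motion, [[9.0], [9.0]])
```

Swapping the semantic vectors leaves the motion part of each embedding unchanged, which is the property the term "decoupled" suggests.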

Implementation Notes

Spatio-Temporal Trajectory Attention: This mechanism helps the model focus on relevant motion patterns while ignoring background noise.
Motion-Semantic Decoupled Embedding: By separating motion and semantic information, the model can better handle complex scenes with multiple moving objects.
Benchmarks: The authors report state-of-the-art performance across diverse datasets, particularly in challenging scenarios such as rapid camera movement and scenes with multiple dynamic objects.
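The idea behind attending over both space and time can be illustrated with a small sketch. It assumes a factorized scheme: self-attention first runs along the time axis of each track, then across tracks within each frame. It also omits the learned query, key, and value projections a real layer would have. The function names are hypothetical, and the layer described in the paper may be arranged differently.

```python
import math
from typing import List

Vector = List[float]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Plain dot-product self-attention with no learned projections."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * tok[d] for e, tok in zip(exps, tokens)) for d in range(dim)])
    return out


def spatio_temporal_attention(features: List[List[Vector]]) -> List[List[Vector]]:
    """features[n][t] is the feature vector of track n at frame t."""
    # Temporal pass: each track attends over its own frames.
    temporal = [self_attention(track) for track in features]
    num_frames = len(temporal[0])
    # Spatial pass: within each frame, tracks attend to one another.
    per_frame = [self_attention([track[t] for track in temporal]) for t in range(num_frames)]
    # Restore the [track][frame] layout.
    return [[per_frame[t][n] for t in range(num_frames)] for n in range(len(features))]


# Three identical tracks over two frames; attention should leave them unchanged.
toy_features = [[[1.0, 2.0], [1.0, 2.0]] for _ in range(3)]
attended = spatio_temporal_attention(toy_features)
```

Factorizing the two passes keeps the cost at roughly N·T² + T·N² pairwise scores instead of (N·T)² for full joint attention over N tracks and T frames.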

Conclusion

SegAnyMo combines long-range motion cues with semantic features to segment moving objects in videos without human annotations. The combination improves segmentation accuracy and handles complex scenes, such as those with drastic camera movement or several moving objects, more effectively than motion cues alone. Given its reported state-of-the-art performance on diverse datasets, the method could benefit downstream applications that depend on moving object segmentation.