StreamMOS: Enhancing LiDAR-Based Moving Object Segmentation with Multi-View Perception and Dual-Span Memory

The Engineer

29 Jul 2024

StreamMOS integrates multi-view perception with a dual-span memory so that moving objects are segmented consistently across LiDAR frames in autonomous systems.

In the world of autonomous driving and mobile robotics, accurately segmenting moving objects from LiDAR data is a critical yet challenging task. Most existing methods leverage spatio-temporal information from LiDAR sequences to predict moving objects in the current frame. However, these approaches often treat each prediction as an independent event, leading to inconsistent segmentation results across frames.

To address this issue, researchers Zhiheng Li, Yubo Cui, Jiexi Zhong, and Zheng Fang have introduced StreamMOS (Streaming Moving Object Segmentation), a streaming network with a memory mechanism. It builds strong associations between features and predictions across successive inferences, with the goal of making segmentation more consistent and accurate over time.

Key Technical Innovations

Dual-Span Memory Mechanism:
- Short-Term Memory: Stores historical features that serve as spatial priors for moving objects. These features are fused with the current frame's features to guide inference.
- Long-Term Memory: Stores previous predictions and uses them to refine the current prediction at both the voxel and instance levels through a voting mechanism.
Multi-View Encoder:
- Uses cascade projection and asymmetric convolution to extract motion features from different representations of the scene. Combining several views is intended to make the extracted motion information more complete and robust.

Implementation Details

Short-Term Memory:
- Historical features are stored in a buffer that is updated with each new frame.
- These features are then fused with the current frame's data using temporal fusion techniques, such as concatenation or attention mechanisms.
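The buffer-and-fuse step can be sketched in plain Python. This is a minimal illustration, not the StreamMOS implementation: the class name `ShortTermMemory`, the buffer length, and the use of flat feature lists (one vector per voxel, assumed already aligned to the current frame's pose) are all hypothetical. It shows both fusion options named above: concatenation, with zero-padding when history is missing, and single-head scaled dot-product attention with a residual connection.

```python
import math
from collections import deque


class ShortTermMemory:
    """Hypothetical fixed-length buffer of historical feature vectors.

    Assumption: each feature is a list of floats for one voxel, already
    transformed into the current frame's coordinates.
    """

    def __init__(self, max_frames=2):
        # deque discards the oldest entry automatically once maxlen is reached
        self.buffer = deque(maxlen=max_frames)

    def update(self, features):
        self.buffer.append(list(features))

    def fuse_concat(self, current):
        """Concatenate current features with history (oldest first)."""
        d = len(current)
        history = list(self.buffer)
        # Zero-pad missing frames so the fused vector always has a fixed length
        missing = self.buffer.maxlen - len(history)
        history = [[0.0] * d for _ in range(missing)] + history
        fused = list(current)
        for past in history:
            fused.extend(past)
        return fused

    def fuse_attention(self, current):
        """Current feature attends over stored history; residual keeps the current signal."""
        if not self.buffer:
            return list(current)
        d = len(current)
        scores = [sum(q * k for q, k in zip(current, past)) / math.sqrt(d)
                  for past in self.buffer]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        context = [sum(w * past[i] for w, past in zip(weights, self.buffer))
                   for i in range(d)]
        return [c + h for c, h in zip(current, context)]


# Streaming usage: fuse with earlier frames first, then store the current frame
mem = ShortTermMemory(max_frames=2)
fused_history = []
for feat in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    fused_history.append(mem.fuse_concat(feat))
    mem.update(feat)
attended = mem.fuse_attention([1.0, 0.0])
```

Fusing before calling `update` means each frame only sees features from earlier frames. In a real network the features would be tensors, and aligning them to the current pose would be an explicit step.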
Long-Term Memory:
- Previous predictions are stored and used to refine the current prediction through a voting process, so the model draws on its own past decisions to keep labels consistent from frame to frame.
- Voting operates at two levels: per voxel (a cell of a 3D grid), for fine-grained correction, and per instance, so that all points belonging to one object receive the same label.
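Voting at the two levels can be sketched with `collections.Counter`. This is an illustration under stated assumptions rather than the paper's procedure. Past predictions are assumed to be already transformed into the current frame. Instance ids are assumed to come from a separate clustering step, with -1 meaning "unclustered". The names `voxel_vote` and `instance_vote` and the 0.5 m voxel size are hypothetical.

```python
from collections import Counter

STATIC, MOVING = 0, 1


def voxel_key(point, voxel_size):
    # Integer index of the grid cell that contains a 3D point
    return tuple(int(c // voxel_size) for c in point)


def voxel_vote(points, current_labels, memory, voxel_size=0.5):
    """Refine per-point labels by majority vote with past predictions in the same voxel.

    memory: list of past frames, each a list of (point, label) pairs, assumed
    to be expressed in the current frame's coordinates already.
    """
    votes = {}
    for frame in memory:
        for p, lab in frame:
            votes.setdefault(voxel_key(p, voxel_size), Counter())[lab] += 1
    refined = []
    for p, lab in zip(points, current_labels):
        # Insert the current label first: max() returns the first maximum,
        # so ties keep the current prediction
        counter = Counter({lab: 1})
        counter.update(votes.get(voxel_key(p, voxel_size), Counter()))
        refined.append(max(counter, key=counter.get))
    return refined


def instance_vote(labels, instance_ids):
    """Give every point of an instance its instance's majority label (-1 = unclustered)."""
    per_instance = {}
    for lab, inst in zip(labels, instance_ids):
        if inst >= 0:
            per_instance.setdefault(inst, Counter())[lab] += 1
    majority = {inst: c.most_common(1)[0][0] for inst, c in per_instance.items()}
    return [majority[inst] if inst >= 0 else lab
            for lab, inst in zip(labels, instance_ids)]


points = [(0.1, 0.1, 0.0), (0.2, 0.3, 0.0), (5.0, 5.0, 0.0)]
current = [STATIC, MOVING, MOVING]
memory = [[((0.15, 0.2, 0.0), MOVING)], [((0.3, 0.1, 0.0), MOVING)]]
voxel_refined = voxel_vote(points, current, memory)
instance_refined = instance_vote([MOVING, MOVING, STATIC, STATIC], [3, 3, 3, -1])
```

Letting the current prediction win ties is one design choice. It keeps the newest evidence in charge whenever the history is evenly split.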

Multi-View Encoder:
- Cascade Projection: Projects LiDAR point clouds into multiple views (e.g., bird's-eye view, front view) to capture different perspectives.
- Asymmetric Convolution: Applies convolutional operations with asymmetric kernels to efficiently extract motion features from these views.
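The two example views can be illustrated with standard projections. For the bird's-eye view, the sketch uses an occupancy grid seen from above. For the front view, it uses a spherical range image of the kind used by RangeNet++. Asymmetric convolution is shown in one common form: a 3×1 kernel followed by a 1×3 kernel, which covers a 3×3 neighborhood with 6 weights instead of 9. This sketch does not reproduce StreamMOS's cascade projection or its encoder. The grid sizes, field of view, kernel values, and function names are all hypothetical, and coordinates are assumed to be in meters.

```python
import math


def bev_projection(points, res=1.0, size=4):
    """Bird's-eye-view occupancy: count points per (x, y) cell, centred on the sensor."""
    grid = [[0] * size for _ in range(size)]
    half = size * res / 2
    for x, y, _z in points:
        i, j = int((x + half) // res), int((y + half) // res)
        if 0 <= i < size and 0 <= j < size:  # drop points outside the grid
            grid[i][j] += 1
    return grid


def range_projection(points, h=4, w=8, fov_up=3.0, fov_down=-25.0):
    """Front-view range image via spherical projection; keeps the closest point per pixel."""
    up, down = math.radians(fov_up), math.radians(fov_down)
    img = [[0.0] * w for _ in range(h)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue
        yaw, pitch = math.atan2(y, x), math.asin(z / r)
        u = min(w - 1, max(0, int(0.5 * (1.0 - yaw / math.pi) * w)))
        v = min(h - 1, max(0, int((1.0 - (pitch - down) / (up - down)) * h)))
        if img[v][u] == 0.0 or r < img[v][u]:
            img[v][u] = r
    return img


def conv2d_same(x, k):
    """Zero-padded 2D cross-correlation with an odd-sized kernel; output size = input size."""
    rows, cols = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    ii, jj = i + a - ph, j + b - pw
                    if 0 <= ii < rows and 0 <= jj < cols:
                        acc += k[a][b] * x[ii][jj]
            out[i][j] = acc
    return out


def asymmetric_conv(x, col, row):
    """A 3x1 kernel followed by a 1x3 kernel (6 weights instead of 9)."""
    return conv2d_same(conv2d_same(x, [[c] for c in col]), [list(row)])


points = [(1.0, 0.5, -0.2), (-1.2, 1.5, 0.1), (10.0, -3.0, -1.0)]
bev = bev_projection(points)   # the third point lies outside the 4 m x 4 m grid
front = range_projection(points)
col, row = (1, 2, 1), (-1, 0, 1)  # hypothetical smoothing and gradient kernels
feat = asymmetric_conv(bev, col, row)
full = conv2d_same(bev, [[c * r for r in row] for c in col])
```

Because the factorized kernel equals the outer product of its two 1-D parts, `feat` matches `full`. The saving is in weights and multiplications, at the cost of restricting the kernel to rank one.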

Experimental Results

The researchers evaluated StreamMOS on two datasets, SemanticKITTI and Sipailou Campus, and report performance competitive with existing moving object segmentation methods in dynamic environments.

- SemanticKITTI: accuracy improved by drawing on historical features and long-term predictions.
- Sipailou Campus: results indicate robustness in real-world scenes under varying environmental conditions.
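Accuracy for moving object segmentation on SemanticKITTI is commonly reported as the intersection-over-union (IoU) of the moving class, IoU = TP / (TP + FP + FN), counted over points. The sketch below computes that metric from flat per-point label lists. The function name and the 0/1 label encoding are assumptions, and the example values are toy data, not results from the paper.

```python
def moving_iou(pred, gt, moving=1):
    """IoU of the moving class over all evaluated points: TP / (TP + FP + FN)."""
    tp = fp = fn = 0
    for p, g in zip(pred, gt):
        if p == moving and g == moving:
            tp += 1
        elif p == moving:
            fp += 1  # predicted moving, actually static
        elif g == moving:
            fn += 1  # missed a moving point
    denom = tp + fp + fn
    # Undefined (NaN) when neither prediction nor ground truth contains moving points
    return tp / denom if denom else float("nan")


pred = [1, 1, 0, 0, 1]
gt = [1, 0, 1, 0, 1]
iou = moving_iou(pred, gt)  # TP=2, FP=1, FN=1
```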

Why It Matters

For practitioners working on autonomous driving and mobile robotics, StreamMOS offers a more reliable and consistent approach to moving object segmentation. By integrating both short-term and long-term memory mechanisms, the model can better handle dynamic scenes and provide more accurate predictions over time. This matters most in applications that make decisions in real time, where downstream planning depends on labels that stay stable from frame to frame.

Future Work

The authors plan to release the code for StreamMOS on GitHub, making it accessible for further research and development in the field of LiDAR-based object segmentation.