MVInpainter: Bridging 2D and 3D Editing with Multi-View Consistent Inpainting

The Engineer

19 Aug 2024 · 3 min read

MVInpainter bridges 2D and 3D editing by inpainting multiple views consistently, which lets it handle in-the-wild scenes rather than only category-specific or synthetic data.

NeurIPS 2024

Authors: Chenjie Cao, Chaohui Yu, Yanwei Fu, Fan Wang, Xiangyang Xue
Affiliations: Fudan University, Alibaba DAMO Academy, Hupan Lab
Links: arXiv, Code and Model

Abstract

Recent advancements in Novel View Synthesis (NVS) and 3D generation have been impressive, but they often fall short when applied to real-world, uncontrolled environments. These methods typically focus on specific categories or synthetic assets and rely heavily on camera poses, which limits their practicality.

To address these challenges, researchers from Fudan University, Alibaba DAMO Academy, and Hupan Lab have introduced MVInpainter, which reformulates 3D editing as a multi-view 2D inpainting task. By partially inpainting multiple views with reference guidance, MVInpainter reduces the difficulty of in-the-wild NVS and relies on clues from the unmasked regions instead of explicit pose conditions.

Key Technical Contributions

Multi-View Inpainting: Instead of generating an entirely new view from scratch, MVInpainter partially inpaints multi-view images. This approach reduces the difficulty of handling real-world scenes.
Cross-View Consistency: Video priors and appearance guidance keep the generated views consistent across different perspectives.
Slot Attention for Pose-Free Training: Aggregates high-level optical flow features from unmasked regions to control camera movement without requiring explicit pose information.
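
The grouping idea behind slot attention can be illustrated with a minimal pure-Python sketch. It is a simplification, not the MVInpainter module: the learned projections, LayerNorm, and GRU update of the original slot attention formulation are omitted, the "flow features" are made-up 2D vectors, and the name `slot_attention` is hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def slot_attention(features, slots, iters=3):
    """Simplified slot attention: softmax over slots, weighted mean over inputs.

    Assumption: learned projections, LayerNorm and the GRU update are omitted;
    each slot is simply replaced by the weighted mean of the inputs it claims.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    attn = []
    for _ in range(iters):
        # attn[i][j]: how strongly input i is assigned to slot j.
        # The softmax runs over slots, so slots compete for each input.
        attn = []
        for f in features:
            logits = [scale * dot(f, s) for s in slots]
            m = max(logits)
            exps = [math.exp(l - m) for l in logits]
            z = sum(exps)
            attn.append([e / z for e in exps])
        # Each slot becomes the attention-weighted mean of the inputs.
        new_slots = []
        for j in range(len(slots)):
            total = sum(a[j] for a in attn) + 1e-8
            new_slots.append([
                sum(attn[i][j] * features[i][d] for i in range(len(features))) / total
                for d in range(dim)
            ])
        slots = new_slots
    return slots, attn

# Toy "flow features": two groups of 2D motion vectors (hypothetical values).
flow_features = [[4.0, 0.1], [4.2, -0.1], [0.1, 3.9], [-0.1, 4.1]]
init_slots = [[1.0, 0.0], [0.0, 1.0]]
slots, attn = slot_attention(flow_features, init_slots)
```

In this toy setting each slot settles on one of the two motion groups, which is the sense in which a small set of slots can summarize flow from the unmasked regions into a compact, pose-like signal.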

Architecture Overview

MVInpainter is designed with two variants:

MVInpainter-O: Trained on object-centric data, focusing on object-level NVS.
MVInpainter-F: Trained on forward-facing data, suitable for object removal and scene-level inpainting.

Both variants share a common SD-inpainting backbone but differ in LoRA/motion weights and masking strategies. The key components include:

Reference Key-Value (Ref-KV) Attention: Used in spatial self-attention blocks of the denoising U-Net to provide appearance guidance.
Slot Attention-Based Flow Grouping Module: Learns implicit pose features by aggregating optical flow information from unmasked regions.
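
One way to picture Ref-KV attention is as ordinary self-attention in which each target view's keys and values are extended with tokens from the reference view. The sketch below works under that assumption; the Q/K/V projections are taken to be the identity, and the function names `attention` and `ref_kv_attention` are illustrative rather than taken from the MVInpainter code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention on lists of token vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out

def ref_kv_attention(target_tokens, ref_tokens):
    """Self-attention for one target view with reference tokens appended to K and V.

    Assumption: projections are the identity, so tokens act as Q, K and V.
    """
    keys = target_tokens + ref_tokens
    values = target_tokens + ref_tokens
    return attention(target_tokens, keys, values)

# Toy tokens (hypothetical values): two target tokens, one reference token.
target = [[1.0, 0.0], [0.0, 1.0]]
reference = [[5.0, 5.0]]
plain = attention(target, target, target)
guided = ref_kv_attention(target, reference)
```

Because the queries still come only from the target view, the output keeps the target's token count, while the appended keys and values let reference appearance flow into every target token.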

Masking Adaption

To obtain accurate mask shapes at inference time, MVInpainter employs a masking adaption technique. The process starts from the four points of the object's bottom face and uses a perspective warp, estimated through dense matching, to bring the mask into the correct shape. This keeps the inpainted region contextually appropriate and consistent across multiple views.
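
The perspective-warping step can be sketched as estimating a homography from four point correspondences and applying it to mask coordinates. This is a minimal pure-Python illustration, not the authors' pipeline: the correspondences are hard-coded where a dense matcher would supply them, and `solve_linear`, `homography_from_points`, and `warp_point` are hypothetical helper names.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def homography_from_points(src, dst):
    """Estimate the 3x3 perspective transform mapping 4 src points to 4 dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1), and likewise for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def warp_point(h, point):
    """Apply the homography to a 2D point, including the perspective divide."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)

# Hypothetical bottom-face corners in one view and their matches in another view.
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(10, 20), (30, 22), (28, 40), (12, 38)]
H = homography_from_points(src, dst)
warped_corners = [warp_point(H, p) for p in src]
warped_center = warp_point(H, (0.5, 0.5))
```

In practice an image library's perspective-warp routine would be applied to the whole mask; the explicit solve is shown only to make clear that four correspondences are enough to fix the transform.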

Results

Scene Editing

MVInpainter demonstrates impressive results in various scene editing tasks, including:

Object Removal: Seamlessly removing objects from scenes while maintaining natural-looking backgrounds.
Synthesis: Generating new elements that fit seamlessly into existing scenes.
Insertion: Adding new objects to scenes with consistent lighting and perspective.
Replacement: Replacing objects in scenes with others while preserving the overall scene coherence.

Multi-View Inpainted Images

The multi-view inpainting results show high fidelity and consistency across different views. This is particularly evident in complex, real-world scenarios where traditional methods often struggle.

3DGS (3D Gaussian Splatting)

The inpainted views can also be used for 3D reconstruction with 3D Gaussian Splatting (3DGS), yielding coherent and realistic 3D scenes from multiple inpainted views. Because the inpainting stage itself needs no explicit pose conditions, this makes the approach practical for in-the-wild 3D editing.

Conclusion

MVInpainter is a meaningful step toward bridging 2D and 3D editing. By relying on multi-view consistent inpainting, it reduces the difficulty of real-world NVS tasks and opens up practical applications in 3D generation and scene editing.