MeshFormer: Efficient 3D Mesh Generation with Sparse Views and Explicit 3D Bias

The Engineer

21 Aug 2024

MeshFormer generates textured 3D meshes from sparse views in a single feed-forward pass. It relies on an explicit 3D bias and needs far fewer GPUs and less training time than concurrent methods.

Introduction

At NeurIPS 2024, a team of researchers from UC San Diego, Hillbot Inc., Zhejiang University, and UCLA introduced MeshFormer, a novel approach to high-quality 3D mesh generation. MeshFormer produces textured meshes with fine-grained geometric details in a single feed-forward pass, and it can be trained on just 8 H100 GPUs in two days. By comparison, concurrent methods often require more than one hundred GPUs and complex multi-stage training processes.

Key Contributions

Efficient Training: MeshFormer can be trained on 8 H100 GPUs in just 2 days.
High-Quality Output: Generates textured meshes with fine-grained geometric details in a single feed-forward pass.
3D Inductive Bias: Stores features in a native 3D structure (sparse voxels) and uses normal maps as input guidance to improve mesh quality.

Method Overview

Input and Representation

MeshFormer takes a sparse set of multi-view RGB images and normal maps as input. Both can be estimated with existing 2D diffusion models, and the normal maps in particular help guide the learning of geometry. Internally, the model uses a 3D feature volume in which features are stored in sparse 3D voxels.
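
To make the sparse-voxel idea concrete, here is a minimal sketch in plain Python. It is not MeshFormer's code: the function name `sparse_surface_voxels`, the sphere used as a test shape, the resolution of 32, and the single distance value standing in for a learned feature vector are all assumptions chosen for illustration. The sketch keeps only the voxels that lie close to the surface and compares their count with that of a dense grid.

```python
import math


def sparse_surface_voxels(resolution, radius=0.4, band=1.0):
    """Return {(i, j, k): feature} for voxels near a sphere's surface.

    The sphere is centred in the unit cube. `band` is the half-width of the
    kept shell, measured in voxels. The stored "feature" is just the distance
    to the surface, a stand-in for a learned feature vector (assumption).
    """
    voxel_size = 1.0 / resolution
    voxels = {}
    for i in range(resolution):
        for j in range(resolution):
            for k in range(resolution):
                # Centre of voxel (i, j, k) in [0, 1]^3.
                x, y, z = ((n + 0.5) * voxel_size for n in (i, j, k))
                distance = math.sqrt(
                    (x - 0.5) ** 2 + (y - 0.5) ** 2 + (z - 0.5) ** 2
                ) - radius
                # Keep the voxel only if it lies in a thin shell at the surface.
                if abs(distance) <= band * voxel_size:
                    voxels[(i, j, k)] = distance
    return voxels


resolution = 32
voxels = sparse_surface_voxels(resolution)
dense_count = resolution ** 3
print(f"active voxels: {len(voxels)} of {dense_count}")
```

Because a surface is two-dimensional, the number of active voxels grows roughly with the square of the resolution, while a dense grid grows with its cube.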

Architecture

VoxelFormer: This module combines transformers with 3D convolutions, giving the network an explicit 3D structure and a projective bias.
Sparse Voxel Processing: Processes only the occupied voxels of the grid, so that high-resolution grids can be handled without excessive memory usage.
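
One way to read "projective bias" is that each voxel can be projected into the input views to find the image features it corresponds to. The sketch below shows only that projection step, using a pinhole camera. The camera model, the function name `project_point`, and all numeric values are illustrative assumptions rather than details taken from the paper.

```python
def project_point(point_world, rotation, translation, focal, principal):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    `rotation` is a 3x3 world-to-camera matrix (list of rows) and
    `translation` a 3-vector. Returns (u, v, depth), or None when the point
    lies behind the camera.
    """
    # World -> camera coordinates: cam = R @ p + t.
    cam = [
        sum(rotation[r][c] * point_world[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    x, y, z = cam
    if z <= 0:
        return None
    # Perspective division followed by the intrinsics.
    u = focal * x / z + principal[0]
    v = focal * y / z + principal[1]
    return u, v, z


# Hypothetical camera: axis-aligned, placed 2 units in front of the origin.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
translation = [0.0, 0.0, 2.0]
voxel_center = (0.1, -0.2, 0.0)

pixel = project_point(
    voxel_center, identity, translation, focal=256.0, principal=(128.0, 128.0)
)
print(pixel)
```

In a full model, the returned pixel location would be used to sample per-view image features for the voxel; that sampling step is omitted here.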

Technical Details

Feature Storage: Instead of using a triplane representation, MeshFormer stores features in 3D sparse voxels. This approach allows for better preservation of geometric details and more efficient memory usage.
Transformer + 3D Convolutions: The VoxelFormer module uses transformers to capture global context and 3D convolutions to process local spatial information, so the generated meshes benefit from both global and local features.
Normal Maps: In addition to RGB images, the network takes normal maps as input and also predicts normals as output. The input normal maps can be predicted with 2D diffusion models, and they provide additional guidance for geometry refinement.
Supervision: The model is trained with Signed Distance Function (SDF) supervision combined with surface rendering. Learning the geometry directly in this way removes the need for complex multi-stage training, which makes the model more efficient and easier to train.
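
As a rough illustration of SDF supervision, the sketch below compares predicted signed distances with ground-truth values at a few sample points. The analytic sphere, the L1 form of the loss, the sample points, and the "predicted" values are all assumptions made for this example. The source only states that SDF supervision is combined with surface rendering, not which loss is used.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=0.5):
    """Signed distance to a sphere: negative inside, zero on the surface."""
    return math.dist(point, center) - radius


def sdf_l1_loss(predicted, target):
    """Mean absolute error between predicted and ground-truth SDF values."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Sample points inside, on, and outside the sphere.
samples = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [sphere_sdf(p) for p in samples]  # [-0.5, 0.0, 0.5]

# Hypothetical network outputs at the same points.
predicted = [-0.4, 0.1, 0.5]

loss = sdf_l1_loss(predicted, target)
print(f"SDF L1 loss: {loss:.4f}")
```

The sign convention matters: the surface is the zero level set of the SDF, which is what lets a mesh be extracted from the predicted volume.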

Implementation and Results

MeshFormer can be trained on just 8 H100 GPUs in two days, whereas concurrent methods often require more than one hundred GPUs. The model's performance is validated through benchmarks on datasets such as GSO (Google Scanned Objects) and OmniObject3D, where it consistently produces high-quality meshes with fine-grained details.

Applications

Single-Image-to-3D: MeshFormer can be integrated with 2D diffusion models to enable fast single-image-to-3D reconstruction.
Text-to-3D: Combined with 2D diffusion models in the same way, MeshFormer can also serve text-to-3D tasks, where a textual description is converted into a 3D object.

Conclusion

MeshFormer advances 3D mesh generation by combining a native 3D structure with input guidance from normal maps. Its ability to produce high-quality, textured meshes with fine-grained details in a single pass, while being trained on limited resources, makes it a promising tool for applications in computer vision and graphics.