LVSM: A Purely Transformer-Based Model for Scalable Novel View Synthesis with Minimal 3D Bias

The Engineer

29 Oct 2024

Researchers from Cornell University, The University of Texas at Austin, Adobe Research, and MIT have developed LVSM, a transformer-based model that synthesizes high-quality novel views with minimal 3D inductive bias.

Introduction

The Large View Synthesis Model (LVSM) is a transformer-based approach to novel view synthesis from sparse input views. Developed by researchers from Cornell University, The University of Texas at Austin, Adobe Research, and MIT, LVSM renders novel views in a single feed-forward pass, with no per-scene optimization, and with minimal 3D inductive bias.

Technical Overview

LVSM introduces two main architectures:

Encoder-Decoder LVSM: Encodes input image tokens into a fixed number of 1D latent tokens, which serve as a fully learned scene representation. It then decodes novel-view images from these latents.
Decoder-Only LVSM: Directly maps input images to novel-view outputs, eliminating the need for intermediate scene representations.

Both models bypass the 3D inductive biases used by earlier methods, from explicit 3D representations (e.g., NeRF, 3DGS) to geometry-aware network designs (e.g., epipolar projections, plane sweeps), and adopt a fully data-driven approach. This matters because those earlier methods build assumptions about 3D geometry into the model itself, which can limit how well they scale and generalize.
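Both architectures operate on image tokens. The article does not describe how images are tokenized, so the following is a minimal sketch that assumes the usual ViT-style scheme of cutting an image into non-overlapping patches and flattening each one. The function name `patchify` and the toy image are illustrative, not taken from the LVSM code.

```python
from typing import List

Image = List[List[List[float]]]  # image[row][col][channel]


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W x C image into non-overlapping patch tokens.

    Each token holds the flattened pixel values of one patch x patch block.
    Patches and the pixels inside them are visited in row-major order.
    ViT-style tokenization is an assumption here, not LVSM's confirmed recipe.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")

    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    token.extend(image[row][col])
            tokens.append(token)
    return tokens


# Toy 4 x 4 RGB image whose pixel value encodes its position (row * 4 + col).
toy = [[[float(r * 4 + c)] * 3 for c in range(4)] for r in range(4)]
toy_tokens = patchify(toy, patch=2)
print(len(toy_tokens), len(toy_tokens[0]))  # 4 tokens, 12 values each
```

In a real model each flattened patch would then pass through a learned linear layer to become a token embedding. That step is omitted here.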

Key Features

Minimal 3D Inductive Bias: By avoiding explicit 3D representations, LVSM does not commit to hand-designed geometric assumptions and instead learns what it needs from data.
Scalability and Generalization: The decoder-only variant achieves the better quality, scalability, and zero-shot generalization of the two.
Efficiency: The encoder-decoder variant offers faster inference because the scene is encoded once into a fixed-size latent representation that is independent of the target view.
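The efficiency trade-off can be made concrete with a rough count of self-attention token pairs (sequence length squared). This is a back-of-the-envelope sketch under stated assumptions: full self-attention over the concatenated sequence, 256x256 images, a patch size of 8, and a hypothetical latent count of 1024. None of these values are given in the article, and the count ignores layers, heads, and MLP cost.

```python
def tokens_per_image(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for one image."""
    return (height // patch) * (width // patch)


def decoder_only_pairs_per_target(n_inputs: int, tokens: int) -> int:
    """Each target view is processed together with all input-view tokens."""
    seq = (n_inputs + 1) * tokens
    return seq * seq


def encoder_pairs_once(n_inputs: int, tokens: int, n_latents: int) -> int:
    """The encoder sees input tokens plus latent tokens, once per scene."""
    seq = n_inputs * tokens + n_latents
    return seq * seq


def encoder_decoder_pairs_per_target(tokens: int, n_latents: int) -> int:
    """The decoder sees only the latents plus the target-view tokens."""
    seq = n_latents + tokens
    return seq * seq


TOKENS = tokens_per_image(256, 256, patch=8)  # 32 * 32 = 1024
N_LATENTS = 1024  # hypothetical value, chosen for illustration only

for n_inputs in (2, 8):
    print(
        n_inputs,
        decoder_only_pairs_per_target(n_inputs, TOKENS),
        encoder_pairs_once(n_inputs, TOKENS, N_LATENTS),
        encoder_decoder_pairs_per_target(TOKENS, N_LATENTS),
    )
```

Under these assumptions the per-target cost of the encoder-decoder variant does not depend on the number of input views, while the decoder-only cost grows quadratically with it. The gap therefore widens as more input views or more target views are used.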

Performance

LVSM outperforms previous state-of-the-art methods by 1.5 to 3.5 dB PSNR. Notably, it still surpasses earlier methods when trained with reduced computational resources (1-2 GPUs).
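To put the dB figures in context, PSNR is defined as 10 * log10(MAX^2 / MSE), so a gain of g dB corresponds to the mean squared error shrinking by a factor of 10^(g/10). The sketch below uses the standard definition on flat lists of pixel values in [0, 1]. The helper names are illustrative, and this is not the evaluation code used by the authors.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_reduction_factor(gain_db: float) -> float:
    """How many times smaller the MSE becomes for a given PSNR gain."""
    return 10.0 ** (gain_db / 10.0)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01, i.e. 20 dB.
print(round(psnr([0.0, 0.0], [0.1, 0.1]), 6))

# The reported 1.5 to 3.5 dB range, expressed as MSE reduction factors.
print(round(mse_reduction_factor(1.5), 2), round(mse_reduction_factor(3.5), 2))
```

In other words, a 1.5 to 3.5 dB improvement means the squared pixel error is roughly 1.4 to 2.2 times smaller.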

Quantitative Comparison

| Method               | PSNR (dB) |
|----------------------|-----------|
| Previous SOTA        | 28.0      |
| Encoder-Decoder LVSM | 30.5      |
| Decoder-Only LVSM    | 31.5      |

Qualitative Comparison

LVSM is particularly effective in handling sparse input views. Here are some key results:

Scene-Level Novel View Synthesis (2 Views):
- LVSM outperforms methods such as pixelSplat and MVSplat, which only support 256x256 resolution.
- The quality of the synthesized views is significantly better, with fewer artifacts and more accurate details.
Object-Level Novel View Synthesis (4 Views):
- Results are generated at a higher resolution (512x512) using the decoder-only model.
- In the published result figures, the input images are shown beneath each synthesized view for reference.

Implementation Details

Encoder-Decoder LVSM:
- Input: Sparse image views with camera poses.
- Latent Representation: Encodes input into a fixed number of 1D latent tokens.
- Output: Novel-view images decoded from the latents.
Decoder-Only LVSM:
- Input: Sparse image views with camera poses.
- Output: Directly generates novel-view images without intermediate representations.
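The article does not say how camera poses are presented to the transformer. One common choice in pose-conditioned view synthesis is to give every pixel the Plücker coordinates of its viewing ray, which can then be concatenated with the pixel's color before tokenization. The sketch below illustrates that technique under stated assumptions: a pinhole camera looking along +z, a camera-to-world rotation matrix, and the (direction, moment) ordering. The function names are hypothetical, and whether LVSM matches these conventions is not confirmed by the text above.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def pluecker_ray(u: float, v: float, fx: float, fy: float, cx: float,
                 cy: float, rotation: Sequence[Sequence[float]],
                 origin: Vec3) -> Tuple[float, ...]:
    """Return (d, o x d) for the ray through pixel (u, v).

    rotation is the 3x3 camera-to-world rotation and origin is the camera
    center in world coordinates (both assumed conventions).
    """
    # Ray direction in camera coordinates for a pinhole camera.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into world coordinates and normalize to unit length.
    d_world = [sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(x * x for x in d_world))
    direction = (d_world[0] / norm, d_world[1] / norm, d_world[2] / norm)
    # The moment o x d encodes where the ray sits in space.
    moment = cross(origin, direction)
    return direction + moment


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Central pixel of a camera placed at x = 1 and looking along +z.
ray = pluecker_ray(128.0, 128.0, 200.0, 200.0, 128.0, 128.0,
                   IDENTITY, (1.0, 0.0, 0.0))
print(ray)
```

A useful sanity check for any such embedding is that the direction and moment are always orthogonal, since o x d is perpendicular to d by construction.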

Benchmarks and Datasets

LVSM has been evaluated across multiple datasets, including:

Blender: A synthetic dataset with high-resolution images of various scenes.
LLFF (Local Light Field Fusion): A real-world dataset of forward-facing captures of diverse environments.
DTU (Technical University of Denmark): A dataset with multi-view stereo captures.

In all these datasets, LVSM consistently outperforms previous methods in terms of PSNR and visual quality.

Conclusion

LVSM advances novel view synthesis by replacing hand-designed 3D structure with a fully data-driven transformer. The decoder-only variant delivers the best quality, scalability, and zero-shot generalization, while the encoder-decoder variant offers faster inference. Both apply to scene-level and object-level tasks.