OmniSVG: A Unified Framework for High-Quality SVG Generation

The Engineer

10 Apr 2025

OmniSVG uses a pre-trained vision-language model to generate SVGs, producing high-quality output for complex designs at lower computational cost than earlier methods.

OmniSVG, a model presented at NeurIPS 2025 by researchers from Fudan University and StepFun, addresses the long-standing challenge of generating high-quality Scalable Vector Graphics (SVGs). Earlier methods either produce unstructured output at significant computational cost or are limited to simple monochrome icons. OmniSVG instead builds on pre-trained Vision-Language Models (VLMs) to generate complex SVGs efficiently.

Key Technical Innovations

Unified Framework: OmniSVG uses a pre-trained VLM, specifically Qwen-VL, to handle multimodal inputs such as text and images. Because every input is turned into tokens for the same model, a single network can serve all of the generation modalities.
Tokenization of SVG Commands: By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples the structural logic from low-level geometry. This approach enables efficient training while preserving the complexity and expressiveness of SVGs.
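
To make the tokenization idea concrete, here is a minimal sketch of how path commands and coordinates could be mapped to discrete token ids. It is an illustration, not the OmniSVG tokenizer: the 200x200 canvas, the four-command subset, the id layout and the names `coord_token` and `tokenize_path` are all assumptions made for this example.

```python
import re

CANVAS = 200                     # assumed canvas size in pixels (hypothetical)
COMMANDS = ["M", "L", "C", "Z"]  # assumed subset of absolute path commands
COORD_OFFSET = len(COMMANDS)     # coordinate ids start after the command ids


def coord_token(x: float, y: float) -> int:
    """Quantize a point to the pixel grid and fold (x, y) into one token id."""
    xi = min(max(round(x), 0), CANVAS - 1)
    yi = min(max(round(y), 0), CANVAS - 1)
    return COORD_OFFSET + xi * CANVAS + yi


def tokenize_path(d: str) -> list[int]:
    """Turn a path string such as 'M 10 20 L 30 40 Z' into token ids."""
    tokens = []
    for cmd, args in re.findall(r"([MLCZ])([^MLCZ]*)", d):
        tokens.append(COMMANDS.index(cmd))  # structural token
        nums = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", args)]
        # Consecutive numbers are read as (x, y) pairs: geometry tokens.
        for x, y in zip(nums[0::2], nums[1::2]):
            tokens.append(coord_token(x, y))
    return tokens


print(tokenize_path("M 10 20 L 30 40 Z"))  # [0, 2024, 1, 6044, 3]
```

In this sketch the command ids carry the structure of the drawing and the coordinate ids carry the geometry, which is one way to read the decoupling described above. Folding each point into a single id also keeps sequences shorter than emitting x and y separately.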

Method Overview

OmniSVG's architecture is built on a pre-trained vision-language model (Qwen-VL) that can process both text and image inputs. The key innovation lies in the SVG tokenizer, which converts vector graphics commands into a unified representation space. This tokenization allows the model to handle SVG generation as a sequence prediction task, making it more efficient and scalable.
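
Treating SVG generation as sequence prediction usually means training with a next-token objective. The sketch below computes that loss in plain Python, under the assumption (not confirmed by this article) that only the SVG positions are scored and the text/image prefix is masked out. The function name `next_token_loss` and the argument layout are hypothetical.

```python
import math


def next_token_loss(logits, targets, prefix_len):
    """Mean negative log-likelihood of the target tokens, SVG positions only.

    logits:     one list of raw scores per position (one score per vocab id)
    targets:    the correct token id at each position
    prefix_len: number of leading positions (the condition) to skip
    """
    total, count = 0.0, 0
    for t, (row, target) in enumerate(zip(logits, targets)):
        if t < prefix_len:
            continue  # prefix positions condition the model but are not scored
        peak = max(row)
        # Numerically stable log-sum-exp gives the log of the softmax denominator.
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


# Uniform scores over 4 tokens give a loss of log(4) at every scored position.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
print(next_token_loss(uniform, [0, 1, 2], prefix_len=1))
```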

Generation Process

Input Tokenization: Text and image inputs are converted into prefix tokens.
SVG Tokenization: SVG commands and coordinates are parameterized into discrete tokens.
Sequence Prediction: The pre-trained VLM generates the SVG token sequence conditioned on the prefix tokens, and the generated tokens are mapped back to SVG commands and coordinates.
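
The three steps can be sketched as a greedy decoding loop. This is a toy under stated assumptions: `generate_svg_tokens`, `make_toy_model` and the `EOS` id are hypothetical names, and the stand-in "model" simply replays a fixed script where a real system would call the VLM.

```python
from typing import Callable, List, Sequence

EOS = 0  # assumed end-of-sequence token id (hypothetical)


def generate_svg_tokens(
    prefix: Sequence[int],
    next_token_scores: Callable[[List[int]], List[float]],
    max_new_tokens: int = 64,
) -> List[int]:
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    sequence = list(prefix)  # condition tokens from the text/image input
    generated: List[int] = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(sequence)
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == EOS:
            break
        generated.append(token)
        sequence.append(token)
    return generated


def make_toy_model(prefix_len: int, script: List[int], vocab_size: int):
    """Stand-in for the VLM: scores the next scripted token highest."""
    def scores(sequence: List[int]) -> List[float]:
        step = len(sequence) - prefix_len
        target = script[step] if step < len(script) else EOS
        return [1.0 if i == target else 0.0 for i in range(vocab_size)]
    return scores


model = make_toy_model(prefix_len=3, script=[1, 5, 2, 6, EOS], vocab_size=10)
print(generate_svg_tokens([9, 9, 9], model))  # [1, 5, 2, 6]
```

A production decoder would more likely sample with a temperature or use beam search; greedy selection is used here only to keep the example deterministic.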

Versatility in Generation Modalities

OmniSVG supports three generation modalities:

Text-to-SVG: Converts textual descriptions into high-quality SVGs.
Image-to-SVG: Translates raster images into vector graphics, preserving details and structure.
Character Reference SVG: Generates SVGs based on character references, useful for creating consistent styles across different elements.
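
One way to picture a single model serving all three modalities is that only the prefix changes. The sketch below assembles a prefix per task. Everything in it is an assumption made for illustration: the task tags, the `<img:...>` placeholders, the whitespace text split (a real system would use the VLM's own tokenizer and vision encoder), and the choice to feed character references as image tokens.

```python
from typing import List, Optional

TASKS = ("text-to-svg", "image-to-svg", "character-ref-svg")  # hypothetical tags


def build_prefix(
    task: str,
    text: Optional[str] = None,
    image_tokens: Optional[List[int]] = None,
) -> List[str]:
    """Assemble the condition prefix that precedes the generated SVG tokens."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    if task == "text-to-svg" and not text:
        raise ValueError("text-to-svg needs a text description")
    if task != "text-to-svg" and not image_tokens:
        raise ValueError(f"{task} needs image tokens")

    prefix = [f"<task:{task}>"]
    if image_tokens:
        prefix += [f"<img:{t}>" for t in image_tokens]  # placeholder image ids
    if text:
        prefix += text.lower().split()  # stand-in for a real text tokenizer
    prefix.append("<svg>")  # marks where SVG token generation begins
    return prefix


print(build_prefix("text-to-svg", text="A red heart"))
# ['<task:text-to-svg>', 'a', 'red', 'heart', '<svg>']
```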

MMSVG-2M Dataset

To advance the field of SVG synthesis, the researchers introduced MMSVG-2M, a multimodal dataset containing two million richly annotated SVG assets. This dataset is divided into three subsets:

Icon: Simple, monochrome icons.
Illustration: Complex illustrations with detailed structures.
Character: Intricate character designs.

Alongside MMSVG-2M, the researchers introduce a standardized evaluation protocol for conditional SVG generation tasks, which makes it easier to benchmark and compare different models.
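
As a rough picture of what per-subset benchmarking looks like in practice, the sketch below averages a score over each subset. The `Sample` fields, the `evaluate_by_subset` helper and the exact-match metric are hypothetical; the article does not describe the dataset schema or the metrics of the actual protocol.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Sample:
    subset: str         # "icon", "illustration", or "character"
    prompt: str         # condition, e.g. a caption (assumed field)
    reference_svg: str  # ground-truth SVG source (assumed field)


def evaluate_by_subset(
    samples: Iterable[Sample],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return the mean score of `generate` on each subset."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for sample in samples:
        totals[sample.subset] += score(generate(sample.prompt), sample.reference_svg)
        counts[sample.subset] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Toy run: an "identity" generator scored by exact match against the reference.
data = [
    Sample("icon", "a", "a"),
    Sample("icon", "b", "x"),
    Sample("character", "c", "c"),
]
exact = lambda generated, reference: 1.0 if generated == reference else 0.0
print(evaluate_by_subset(data, lambda prompt: prompt, exact))
# {'icon': 0.5, 'character': 1.0}
```

Reporting each subset separately keeps simple icons from masking weaker results on the more complex illustration and character sets.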

Performance Benchmarks

According to the authors, extensive experiments show that OmniSVG outperforms existing methods in both quality and efficiency. The reported results, by task:

Text-to-SVG: Generates high-fidelity SVGs from textual descriptions, with detailed structures and smooth curves.
Image-to-SVG: Converts raster images into vector graphics while maintaining the original details and aesthetic qualities.
Character Reference SVG: Produces consistent and stylistically coherent SVGs based on character references.

Practical Implications

OmniSVG's ability to generate high-quality, complex SVGs across several modalities makes it a useful tool for graphic designers and researchers. Its efficiency and versatility suggest it could be integrated into professional SVG design workflows.