Nvidia Releases Nemotron-Nano-9B-V2: A Compact, High-Performance SLM with Toggleable Reasoning

Models & Research

The Engineer

19 Aug 2025 · 3 min read

Nvidia's latest SLM, Nemotron-Nano-9B-V2, packs a punch with its reduced parameter count and toggleable reasoning feature, giving users control over AI self-checking for the first time.

Nvidia is making waves in the small language model (SLM) space with the release of Nemotron-Nano-9B-V2, a compact yet powerful model designed to fit on a single Nvidia A10 GPU. This new model not only achieves top performance in its class on selected benchmarks but also introduces a unique feature: toggleable AI reasoning, allowing users to enable or disable self-checking before generating output.

What Changed and Why It Matters

Technical Changes

Parameter Reduction: Nemotron-Nano-9B-V2 has been pruned from 12 billion parameters to 9 billion. This reduction is significant as it optimizes the model for deployment on a single A10 GPU, a popular choice in many production environments.
Hybrid Architecture: The model leverages a hybrid of Transformer and Mamba architectures, which allows it to process larger batch sizes and operate up to 6x faster than similar-sized transformer models.

Why It Matters

GPU Efficiency: By fitting on a single A10 GPU, the model is more accessible for deployment in resource-constrained environments, such as edge devices and smart devices.
Performance: Despite its compact size, Nemotron-Nano-9B-V2 maintains high performance, making it suitable for a wide range of applications, from instruction following to code generation.

Key Features

Multi-Language Support

Nemotron-Nano-9B-V2 handles multiple languages, including:

English
German
Spanish
French
Italian
Japanese
Extended support for Korean, Portuguese, Russian, and Chinese

This broad language coverage makes it a versatile tool for international applications.

Toggleable Reasoning

One of the standout features is the ability to toggle on and off AI reasoning. This feature allows users to enable self-checking before the model outputs an answer, which can be particularly useful in scenarios where accuracy is critical.

Architecture Details

Nemotron-Nano-9B-V2 is based on Nemotron-H, a set of hybrid Mamba-Transformer models detailed in a recent arXiv paper. Unlike pure Transformer models, which can become computationally expensive as sequence lengths grow, the hybrid architecture combines the strengths of both architectures to achieve better performance and efficiency.

Benchmarks and Performance

Performance: The model has achieved top performance in its class on selected benchmarks.
Speed: It processes larger batch sizes and is up to 6x faster than similar-sized transformer models, thanks to its hybrid architecture.

Availability

Nemotron-Nano-9B-V2 and its pre-training datasets are available right now on Hugging Face and through Nvidia’s model catalog.

Context in the Market

While many leading large language models (LLMs) have over 70 billion parameters, Nemotron-Nano-9B-V2 stands out for its compact size and high performance. This makes it a compelling choice for applications where resource efficiency is crucial, such as smart devices and edge computing.

Conclusion

Nvidia's release of Nemotron-Nano-9B-V2 marks a significant step in the development of small language models. By combining parameter reduction, hybrid architecture, multi-language support, and toggleable reasoning, this model offers a powerful yet efficient solution for a wide range of applications. As the demand for AI on resource-constrained devices continues to grow, Nemotron-Nano-9B-V2 is well-positioned to meet those needs.