
OpenAI's new gpt-oss models balance computational efficiency against capability, giving developers strong models that do not need massive computing resources.
OpenAI has released two open-weight models, gpt-oss-120b and gpt-oss-20b. Both pair strong benchmark results with modest hardware requirements, which makes them appealing to practitioners who need to deploy capable models on limited resources.
gpt-oss-120b:
gpt-oss-20b:
Both models are released in MXFP4 precision, which significantly reduces their size and memory requirements:
This means gpt-oss-120b can run on a single NVIDIA H100, and gpt-oss-20b can be deployed on consumer GPUs or laptops with more than 16GB of RAM.
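The hardware claims above can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes the MXFP4 layout from the OCP Microscaling specification (4-bit values with one shared 8-bit scale per block of 32) and, as a simplification, treats every parameter as MXFP4. The 120e9 figure is only the nominal count implied by the model name, not a number from this article, and the estimate ignores KV cache and activations.

```python
# Rough weight-memory estimate for MXFP4-quantized checkpoints.
# Assumptions (not from the article): every parameter is stored in MXFP4,
# and the 120e9 count is the nominal size implied by the model name.

MXFP4_ELEMENT_BITS = 4    # each value is a 4-bit float
MXFP4_SCALE_BITS = 8      # one shared 8-bit scale per block
MXFP4_BLOCK_SIZE = 32     # values per block


def mxfp4_bits_per_param() -> float:
    """Effective bits per parameter, including the shared block scale."""
    return MXFP4_ELEMENT_BITS + MXFP4_SCALE_BITS / MXFP4_BLOCK_SIZE


def weight_gigabytes(num_params: float, bits_per_param: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    bits = mxfp4_bits_per_param()
    models = [("gpt-oss-20b", 20.9e9), ("gpt-oss-120b (nominal)", 120e9)]
    for name, params in models:
        mxfp4_gb = weight_gigabytes(params, bits)
        bf16_gb = weight_gigabytes(params, 16)
        print(f"{name}: ~{mxfp4_gb:.1f} GB in MXFP4 vs ~{bf16_gb:.1f} GB in 16-bit")
```

Under these assumptions the 20B model's weights come to roughly 11 GB and the 120B model's to roughly 64 GB, which is consistent with a 16GB laptop and a single 80 GB H100 respectively. If some layers are kept at higher precision, the real footprint will be somewhat larger than this estimate.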
The efficiency of these models comes largely from their sparse architecture. gpt-oss-120b activates only 4.4% of its total parameters for each token, which speeds up inference. The 20B model can generate dozens of output tokens per second on recent MacBooks because only 3.6B of its 20.9B parameters (about 17%) are active for each token.
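The sparsity figures are simple ratios, and working them through shows how much of each model is left untouched per token. This is a minimal sketch using only the numbers quoted above; the function names are illustrative, and the "dense-equivalent ratio" is a rough proxy for saved work, not a measured speedup.

```python
def active_fraction(active_params: float, total_params: float) -> float:
    """Share of parameters used for a single token."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return active_params / total_params


def dense_equivalent_ratio(fraction: float) -> float:
    """How many times more parameters a dense model of equal size would touch."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1 / fraction


if __name__ == "__main__":
    frac_20b = active_fraction(3.6e9, 20.9e9)
    print(f"gpt-oss-20b active share: {frac_20b:.1%}")  # 17.2%

    frac_120b = 0.044  # quoted directly in the article
    print(f"gpt-oss-120b touches ~1/{dense_equivalent_ratio(frac_120b):.1f} of its weights per token")
```

Note that the 20B model is proportionally less sparse than the 120B model (about 17% active versus 4.4%). Its speed on laptops comes from the small absolute number of active parameters.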
Both models score impressively well for their size and sparsity:

On these scores, gpt-oss-120b is the most capable model that fits on a single H100, and gpt-oss-20b is the strongest option for consumer GPUs. That makes both attractive when performance has to be weighed against hardware limits.
While gpt-oss-120b doesn't surpass DeepSeek R1 (score of 59) or Qwen3 235B (score of 64), it is significantly more efficient:
DeepSeek R1:
Qwen3 235B:
Both gpt-oss models are text-only, similar to competing models from DeepSeek, Alibaba, and others.
The gpt-oss models use a standard Mixture of Experts (MoE) architecture:
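For readers unfamiliar with the pattern, a generic MoE layer can be sketched as follows: a router scores every expert for each token, only the top-k experts run, and their outputs are mixed using softmax weights over the selected scores. This is a toy illustration of the general technique in pure Python. The class name, dimensions, expert count, and top-k value are all hypothetical and are not the gpt-oss configuration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyMoELayer:
    """Toy top-k MoE layer; each 'expert' is a single random linear map."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one score row per expert.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Experts: one dim x dim weight matrix each.
        self.experts = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]

    def forward(self, token):
        """Return the mixed output and the indices of the experts that ran."""
        logits = matvec(self.router, token)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[: self.top_k]
        gates = softmax([logits[i] for i in chosen])
        out = [0.0] * len(token)
        for gate, idx in zip(gates, chosen):
            expert_out = matvec(self.experts[idx], token)  # only chosen experts run
            out = [o + gate * e for o, e in zip(out, expert_out)]
        return out, chosen


if __name__ == "__main__":
    layer = ToyMoELayer(dim=4, num_experts=8, top_k=2)
    output, used = layer.forward([1.0, 0.5, -0.5, 2.0])
    print(f"experts used: {used} of 8")
```

Because only the chosen experts are evaluated, the compute per token scales with `top_k` rather than with the total number of experts, which is the source of the active-parameter savings described earlier.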
OpenAI's gpt-oss models represent a significant step forward in balancing intelligence and efficiency. The gpt-oss-120b and gpt-oss-20b offer competitive performance while being highly resource-efficient, making them ideal for a wide range of applications from high-end servers to consumer devices.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
7 August 2025