
As LLMs evolve, incremental changes such as refined positional embeddings and more efficient use of compute are giving practitioners significant performance gains without radical design overhauls.
In the seven years since the original GPT architecture was introduced, large language models (LLMs) have gone through a series of refinements that, while not revolutionary, have significantly improved efficiency and performance. This article looks at the architectural changes in modern LLMs such as DeepSeek V3 and GLM-5, focusing on what these updates mean for practitioners.
At first glance, today's models might seem structurally similar to their predecessors from 2019. However, several key refinements have emerged:
DeepSeek V3 is one of the latest models to push the boundaries of LLM architecture. Here are some notable changes:
Sparse Mixture-of-Experts (MoE): DeepSeek V3 uses sparse MoE layers, in which a router selects a small subset of experts for each input token. Because only the selected experts run, each token activates a fraction of the model's total parameters, which reduces per-token compute while maintaining or even improving model performance.
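To make the per-token expert selection concrete, here is a minimal sketch of top-k routing in plain Python. It is an illustration under stated assumptions, not DeepSeek V3's actual implementation: the linear router, the renormalization of the selected weights, the toy "scaling" experts, and all function names (`top_k_route`, `moe_forward`, `make_scaling_expert`) are hypothetical and chosen only for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized to sum to 1 (one common convention, assumed here).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(token, router_weights, experts, k):
    """Sparse MoE forward pass for a single token vector.

    token: list of floats (the hidden vector)
    router_weights: one weight vector per expert (hypothetical linear router)
    experts: list of callables mapping a vector to a vector
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    output = [0.0] * len(token)
    for idx, weight in top_k_route(logits, k):
        expert_out = experts[idx](token)  # only the k selected experts run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


def make_scaling_expert(scale):
    """Toy expert that scales its input; real experts are feed-forward networks."""
    return lambda vec: [scale * v for v in vec]


if __name__ == "__main__":
    random.seed(0)
    dim, num_experts, k = 4, 8, 2
    experts = [make_scaling_expert(s + 1) for s in range(num_experts)]
    router = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
    token = [0.5, -1.0, 0.25, 2.0]
    print(moe_forward(token, router, experts, k))
```

With `num_experts = 8` and `k = 2`, only two of the eight expert functions are evaluated for the token. That is the source of the compute savings: total capacity grows with the number of experts, while per-token cost grows only with `k`.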
Mathematical Optimizations: DeepSeek V3 incorporates several mathematical optimizations to reduce computational complexity:
GLM-5 is another significant player in the LLM landscape. Its architecture includes:

Nemotron 3 Super, a recent addition, brings its own set of innovations:
Comparing LLMs is challenging due to variations in datasets, training techniques, and hyperparameters. However, examining architectural changes provides valuable insights into the strategies developers are employing to enhance model performance:
While the core architecture of LLMs remains fundamentally similar, recent advancements in positional embeddings, attention mechanisms, and activation functions have led to more efficient and powerful models. DeepSeek V3, GLM-5, and Nemotron 3 Super exemplify these trends with their innovative approaches to sparsity, hybrid attention, and adaptive depth.
For practitioners, understanding these architectural changes is crucial for selecting the right model and optimizing its performance for specific tasks.
Original Sources
↗ https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
21 July 2025