Modern LLM Architecture Evolution: DeepSeek V3, GLM-5, and Beyond


The Engineer

21 Jul 2025

As LLMs evolve, subtle yet impactful changes like refined positional embeddings and more efficient compute usage are reshaping the landscape, offering practitioners significant performance gains without radical design overhauls.

In the seven years since the original GPT architecture was introduced, large language models (LLMs) have seen a series of refinements that, while not entirely revolutionary, have significantly improved efficiency and performance. This article delves into the architectural changes in modern LLMs like DeepSeek V3 and GLM-5, focusing on how these updates impact practitioners.

Key Structural Changes

At first glance, today's models might seem structurally similar to their predecessors from 2019. However, several key refinements have emerged:

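Positional Embeddings: Learned absolute positional embeddings have largely given way to Rotary Position Embeddings (RoPE). RoPE rotates each query and key vector by an angle proportional to its position, so the attention score between two tokens depends on their relative offset rather than on their absolute positions, which tends to help with longer sequences.
Attention Mechanisms: Multi-Head Attention (MHA) has largely been replaced by Grouped-Query Attention (GQA). In GQA, groups of query heads share a single key/value head, which shrinks the key-value cache and reduces memory traffic during inference.
Activation Functions: The Swish-Gated Linear Unit (SwiGLU) has become a popular replacement for GELU in feed-forward layers. It multiplies a Swish-activated gate projection elementwise with a second linear projection, and is widely reported to improve quality at a comparable parameter budget.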
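
To make the three refinements above concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not any particular model's implementation: the function names (`rope`, `kv_head_for`, `swiglu`), the pairing of adjacent dimensions, and the frequency base of 10000 are choices made for this example, and the SwiGLU down-projection is omitted.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to `pos`."""
    d = len(vec)  # assumed to be even
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """GQA: consecutive query heads are grouped and share one key/value head."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def silu(x):
    """Swish with beta = 1."""
    return x / (1.0 + math.exp(-x))

def swiglu(x, w_gate, w_up):
    """Gate projection passed through SiLU, multiplied elementwise by the up projection.
    Each entry of w_gate / w_up is the weight vector of one hidden unit."""
    return [silu(dot(x, g)) * dot(x, u) for g, u in zip(w_gate, w_up)]

q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.25]
# Same offset (key two positions before the query) at two different absolute positions:
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
# With 8 query heads and 2 key/value heads, heads 0-3 map to KV head 0 and heads 4-7 to KV head 1:
print([kv_head_for(h, 8, 2) for h in range(8)])
print(swiglu([1.0, 2.0], [[1.0, 0.0]], [[0.0, 1.0]]))
```

The two dot products printed first agree up to floating-point error, which is the relative-position property in action. The head mapping shows why GQA stores fewer keys and values: eight query heads need only two key/value heads in this configuration.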

DeepSeek V3

DeepSeek V3 is one of the latest models to push the boundaries of LLM architecture. Here are some notable changes:

Sparse Mixture-of-Experts (MoE): DeepSeek V3 uses sparse MoE layers, in which a router selects a small subset of experts for each input token. Because only the selected experts run, compute per token stays well below what the total parameter count would suggest, while model quality is maintained or even improved.
- Expert Selection: A gating network scores every expert for each token, and only the highest-scoring experts are activated.
- Efficiency: Because each token activates only a few experts, the experts can be distributed across devices and run in parallel, so the parameter count can grow without a proportional increase in per-token compute.
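Mathematical Optimizations: DeepSeek V3 incorporates several mathematical optimizations to reduce computational cost:
- Matrix Factorization: Large matrices are factorized into smaller low-rank components, reducing the number of parameters and the associated compute. One example is Multi-Head Latent Attention (MLA), which compresses keys and values into a low-rank latent vector to shrink the key-value cache.
- Efficient Normalization: Normalization is kept cheap, for example with RMSNorm, which skips the mean-centering step of standard layer normalization without sacrificing performance.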
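
The gating and factorization ideas above can be sketched as follows. This is a generic top-k router with softmax scores, not DeepSeek V3's exact routing scheme; `route_top_k`, `moe_forward`, `make_scaler`, and `low_rank_params` are hypothetical names, and the "experts" are toy functions standing in for feed-forward blocks.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.
    Weights are renormalized so the selected experts' weights sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts; the others never run."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

def low_rank_params(d_in, d_out, rank):
    """Parameter counts: dense d_in x d_out matrix vs. a rank-`rank` factorization."""
    return d_in * d_out, rank * (d_in + d_out)

def make_scaler(s):
    """Toy expert: multiplies its input by a constant."""
    return lambda x: [v * s for v in x]

experts = [make_scaler(s) for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.0, 2.0, 1.0, -1.0]  # in a real model these come from a learned router
print(route_top_k(logits, k=2))
print(moe_forward([1.0], logits, experts, k=2))
print(low_rank_params(4096, 4096, 512))
```

With these example logits only experts 1 and 2 are evaluated, so half of the toy experts contribute no compute for this token. The last line compares a dense 4096 x 4096 matrix with a rank-512 factorization, which needs a quarter of the parameters.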

GLM-5

GLM-5 is another significant player in the LLM landscape. Its architecture includes:

Hybrid Attention Mechanisms: GLM-5 combines GQA with Local Self-Attention (LSA). This hybrid approach balances global and local context, making it suitable for a wide range of tasks.
- GQA for Global Context: Handles long-range dependencies effectively.
- LSA for Local Context: Focuses on nearby tokens, improving efficiency in tasks like translation and summarization.
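
The difference between global and local attention comes down to which positions each token may attend to. The sketch below builds the two masks for a causal model; the function names and the window size are assumptions for illustration, and how the two patterns are combined across layers or heads is not specified here.

```python
def causal_mask(n):
    """Global causal attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def local_causal_mask(n, window):
    """Local attention: token i may attend only to the last `window` tokens, itself included."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]

def attended_pairs(mask):
    """Number of (query, key) pairs that are actually scored."""
    return sum(sum(row) for row in mask)

n, window = 6, 3
for row in local_causal_mask(n, window):
    print("".join("x" if allowed else "." for allowed in row))
print(attended_pairs(causal_mask(n)), attended_pairs(local_causal_mask(n, window)))
```

The global mask grows quadratically with sequence length, while the local mask grows linearly for a fixed window, which is where the efficiency gain comes from.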

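Transformer-XL: GLM-5 builds upon the Transformer-XL architecture, which uses segment-level recurrence: hidden states from the previous segment are cached and reused as extra context for the current one, so the model can capture dependencies that span more than a single segment. This is particularly useful for understanding long narratives or documents.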
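
Segment-level recurrence can be sketched as a loop that carries a bounded memory from one segment to the next. In this simplified illustration, plain tokens stand in for the cached hidden states, and `segment_contexts`, `seg_len`, and `mem_len` are hypothetical names; a real implementation caches per-layer activations and does not backpropagate through the memory.

```python
def segment_contexts(tokens, seg_len, mem_len):
    """Yield (context, segment) pairs: queries come from `segment`,
    keys and values come from `context` = cached memory + segment."""
    memory = []
    for start in range(0, len(tokens), seg_len):
        segment = tokens[start:start + seg_len]
        yield memory + segment, segment
        # Keep only the most recent `mem_len` items for the next segment.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []

for context, segment in segment_contexts(list(range(7)), seg_len=3, mem_len=2):
    print("segment", segment, "attends over", context)
```

Each segment sees its own tokens plus the tail of what came before, so information can propagate across segment boundaries without reprocessing the full history.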

Nemotron 3 Super

Nemotron 3 Super, a recent addition, brings its own set of innovations:

Adaptive Depth: The model dynamically adjusts the number of layers based on the complexity of the input. This adaptive depth mechanism ensures that simpler tasks are processed more efficiently without compromising performance on complex ones.
Parameter Sharing: Nemotron 3 Super shares parameters across different layers to reduce redundancy and improve training efficiency.
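
Both ideas can be illustrated together with an early-exit loop over a single shared layer. This is a generic sketch, not Nemotron 3 Super's actual mechanism, which is not detailed here; `run_adaptive`, the toy `halve` layer, and the `small_enough` exit test are assumptions made for the example.

```python
def run_adaptive(x, layer, max_layers, should_exit):
    """Apply one shared `layer` repeatedly (parameter sharing) and stop as soon as
    `should_exit` is satisfied (adaptive depth). Returns (output, layers_used)."""
    for depth in range(1, max_layers + 1):
        x = layer(x)
        if should_exit(x):
            return x, depth
    return x, max_layers

def halve(v):
    """Toy layer standing in for a transformer block."""
    return v / 2.0

def small_enough(v):
    """Toy exit criterion; a real model might use a learned confidence score."""
    return abs(v) < 1.0

print(run_adaptive(3.0, halve, max_layers=4, should_exit=small_enough))
print(run_adaptive(100.0, halve, max_layers=4, should_exit=small_enough))
```

The "easy" input exits after two applications of the layer, while the "hard" input uses the full budget of four.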

Benchmarking Challenges

Comparing LLMs is challenging due to variations in datasets, training techniques, and hyperparameters. However, examining architectural changes provides valuable insights into the strategies developers are employing to enhance model performance:

Dataset Variability: Different models are trained on diverse datasets, making direct comparisons difficult.
Training Techniques: Advanced techniques like data augmentation and curriculum learning can significantly impact performance.
Hyperparameter Tuning: Tuning hyperparameters is crucial for optimizing model performance, but differences in tuning effort add another confounding factor to any comparison.

Conclusion

While the core architecture of LLMs remains fundamentally similar, recent advancements in positional embeddings, attention mechanisms, and activation functions have led to more efficient and powerful models. DeepSeek V3, GLM-5, and Nemotron 3 Super exemplify these trends with their innovative approaches to sparsity, hybrid attention, and adaptive depth.

For practitioners, understanding these architectural changes is crucial for selecting the right model and optimizing its performance for specific tasks.