Semantic Calibration Emerges Naturally in LLMs, Apple Researchers Find

The Engineer

25 Mar 2026

Researchers at Apple report that base large language models are naturally semantically calibrated: their confidence in the meaning of an answer, not just in the next token, tracks how often that answer is correct.

Large language models (LLMs) have made significant strides in generating human-like text, but they still struggle to provide meaningful confidence estimates for their outputs. Base LLMs are known to be calibrated at the next-token level, yet it has been unclear whether that calibration carries over to the meaning of a full response. A recent paper from Apple researchers Preetum Nakkiran, Arwen Bradley, Adam Goliński, Eugene Ndiaye, Michael Kirchhof, and Sinead Williamson addresses this question.

Key Findings:

- Base LLMs are semantically calibrated: Using a sampling-based notion of semantic calibration, the researchers found that base LLMs give meaningful confidence estimates in open-domain question-answering tasks.
- Theoretical mechanism for semantic calibration: The paper explains why semantic calibration emerges as a byproduct of next-token prediction, leveraging a connection between calibration and local loss optimality.
- B-calibration: The paper defines a general notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise).
- Testable predictions validated through experiments:
  - Base LLMs are semantically calibrated across question-answering tasks.
  - Reinforcement learning (RL) instruction-tuning breaks this calibration.
  - Chain-of-thought reasoning also breaks calibration.

Technical Details:

Semantic Calibration:

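Definition: A model is semantically calibrated when its confidence in the meaning of a response, rather than in individual tokens, matches how often that response is correct. For a calibrated model, answers given with about 70% confidence are correct about 70% of the time.
Sampling-based notion: Confidence is estimated from samples rather than read off token probabilities. The researchers sample responses from the model and check whether the resulting confidence matches the actual correctness of its answers.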
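
One way to picture a sampling-based confidence estimate is the minimal Python sketch below. It assumes that several answers to the same question have already been sampled from a model, and it groups them with a deliberately crude `normalize_answer` helper. Both function names are hypothetical, and the paper's actual procedure for deciding semantic equivalence may differ.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def normalize_answer(text: str) -> str:
    """Hypothetical stand-in for a semantic-equivalence check.

    Strips whitespace and trailing punctuation, then lowercases.
    A real pipeline would need a much stronger equivalence judgment.
    """
    return text.strip().strip(".!?").strip().lower()


def semantic_confidence(
    samples: Iterable[str],
    to_class: Callable[[str], str] = normalize_answer,
) -> Tuple[str, float]:
    """Return the most frequent answer class and its share of the samples."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sampled answer")
    # Collapse each sampled string into its class, then count the classes.
    counts = Counter(to_class(s) for s in samples)
    top_class, top_count = counts.most_common(1)[0]
    # The share of samples in the top class serves as the confidence.
    return top_class, top_count / len(samples)
```

For example, `semantic_confidence(["Paris", "paris.", " Paris ", "Lyon"])` returns `("paris", 0.75)`: three of the four samples fall into the same class, so the estimated confidence is 75%.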

Theoretical Contributions:

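B-calibration: This concept generalizes calibration by grouping responses into equivalence classes and measuring calibration over classes instead of strings. For example, if two responses to the same question are semantically equivalent (e.g., "Paris" and "The capital is Paris"), they are treated as a single class.
Local loss optimality: The theory connects calibration to local optimality of the training loss. Next-token prediction pushes a model toward a point where small adjustments to its predictions no longer reduce the loss, and the paper shows that this leads to semantic calibration as a byproduct.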
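
The idea of calibration over equivalence classes can be written down roughly as follows. This is a sketch in notation chosen here for illustration and not necessarily the paper's: $p_\theta(z \mid x)$ is the model's distribution over response strings $z$ for prompt $x$, $B_x$ maps each string to its class, and $y$ is a reference answer.

```latex
% Collapse the distribution over strings into a distribution over classes.
\pi_x(b) \;=\; \sum_{z \,:\, B_x(z) = b} p_\theta(z \mid x)

% Confidence-style calibration over classes: the top class is correct
% as often as the probability assigned to it says it should be.
\hat{b}(x) \;=\; \arg\max_{b} \, \pi_x(b),
\qquad
\Pr\big[\, B_x(y) = \hat{b}(x) \;\big|\; \pi_x(\hat{b}(x)) = c \,\big] \;=\; c
```

Choosing $B_x$ as the identity map recovers calibration over exact strings, while choosing it to group paraphrases gives the semantic case.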

Experimental Validation:

Base LLMs: Across various question-answering tasks, base LLMs exhibit strong semantic calibration.
RL instruction-tuning: Fine-tuning with reinforcement learning (for example, from human feedback) optimizes a reward rather than next-token likelihood, and it tends to break the natural semantic calibration of base LLMs.
Chain-of-thought reasoning: Having the model reason step by step before answering also disrupts semantic calibration.
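
Claims like these are typically checked by comparing confidence with accuracy over many questions. The sketch below computes a standard binned expected calibration error (ECE) from per-question confidences and correctness labels. The equal-width binning and the function name are assumptions made for illustration, and the paper's exact metric may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |average confidence - accuracy| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to an equal-width confidence bin on [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, float(ok)))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        # Weight each bin's gap by the fraction of predictions it holds.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

On toy inputs, four answers at confidence 0.75 with three correct give an ECE of 0.0, while two answers at confidence 1.0 with only one correct give 0.5. A semantically calibrated model would be expected to score near zero, and a model whose calibration has been broken would score higher.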

Implications:

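Model Evaluation: Knowing when semantic calibration should hold (base models answering directly) and when it should not (after RL instruction-tuning, or with chain-of-thought reasoning) tells practitioners when a model's confidence can be trusted.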
Training Strategies: The findings suggest that certain training strategies, like RL instruction-tuning, may need to be adjusted to maintain semantic calibration.
Practical Applications: For applications where confidence estimates are crucial (e.g., medical diagnosis or legal advice), ensuring semantic calibration can improve reliability.

Conclusion:

This research provides a principled explanation for why semantic calibration emerges in base LLMs and shows that it can be lost under RL instruction-tuning and chain-of-thought reasoning. As LLMs are integrated into real-world applications, preserving semantic calibration will matter for their reliability and trustworthiness.