Navigating LLM Post-Training: From Supervised Fine-Tuning to RLHF


The Engineer

15 Sept 2025 · 3 min read

Explore the complex path from pre-trained LLMs to instruction-following models through supervised fine-tuning and RLHF, delving into dataset creation, loss functions, and reward models that refine AI capabilities.

Turning a pre-trained large language model (LLM) into a finely tuned, instruction-following model takes several post-training steps. This article breaks down the key ones, focusing on supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). We'll cover the technical details, including dataset creation, loss functions, and reward models, to show how these techniques work.

1. The Journey from Pre-training to an Instruct-tuned Model

Pre-trained LLMs are powerful but often lack specific task-oriented skills. Post-training methods like SFT and RLHF help bridge this gap by aligning the model with human preferences and instructions. Here’s a high-level overview:

Pre-training: The model is trained on large amounts of unstructured data to learn general language patterns.
Supervised Fine-Tuning (SFT): The model is fine-tuned on a dataset of human-labeled examples to improve its task-specific performance.
Reinforcement Learning from Human Feedback (RLHF): The model is further optimized using reinforcement learning, guided by human feedback to align its outputs with desired behaviors.

2. E2E Life Cycle of Post-training

The end-to-end life cycle of post-training involves several stages:

Data Collection: Gathering high-quality labeled data for SFT and preference data for RLHF.
Model Training: Fine-tuning the pre-trained model using SFT and then optimizing it with RLHF.
Evaluation: Assessing the model’s performance and making adjustments as needed.

3. What Is Supervised Fine-Tuning (SFT)?

Supervised fine-tuning is a crucial step in aligning LLMs with specific tasks. Here’s how it works:

SFT Dataset

Data Sources: The dataset typically consists of pairs of inputs and desired outputs, often created by human annotators.
Data Quality: High-quality data is essential. This includes accurate labels, diverse examples, and representative tasks.

Data Examples

Input: "What is the capital of France?"
Output: "Paris"
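
As a rough sketch of how a pair like this might be turned into a training example, the snippet below concatenates prompt and response tokens and builds a loss mask that covers only the response. The function name `build_sft_example` is hypothetical, whitespace splitting stands in for a real tokenizer, and training only on response tokens is a common choice assumed here rather than something stated above.

```python
from typing import List, Tuple


def build_sft_example(prompt: str, response: str) -> Tuple[List[str], List[int]]:
    """Join prompt and response tokens and mark which tokens are trained on.

    Assumption: whitespace splitting is a stand-in for a real tokenizer.
    Assumption: only response tokens contribute to the loss (mask = 1).
    """
    prompt_tokens = prompt.split()
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_sft_example("What is the capital of France?", "Paris")
```

Here the six prompt tokens get a mask value of 0 and the single response token gets 1, so only "Paris" would be scored during training under this assumption.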

Data Quality in SFT Datasets

Accuracy: Ensuring that the outputs are correct.
Diversity: Covering a wide range of tasks and inputs.
Representativeness: Reflecting the types of queries the model will encounter in production.

How SFT Data Is Batched and Padded

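Batching: Training examples are grouped into batches so that multiple input-output pairs are processed in a single forward and backward pass.
Padding: Sequences in a batch usually differ in length, so shorter ones are padded with a special token up to a common length. This lets the batch be stored as one rectangular tensor. Padded positions are typically masked out so they do not contribute to attention or the loss.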
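
A minimal sketch of padding a batch, using plain Python lists in place of tensors. The names `pad_batch` and `PAD_ID` are illustrative, and the padding id of 0 is an assumption; real tokenizers define their own padding token.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id; real tokenizers define their own


def pad_batch(
    sequences: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token-id sequences to the longest length in the batch.

    Returns the padded sequences and a mask with 1 for real tokens
    and 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


batch = [[5, 8, 2], [7, 3], [9, 4, 6, 1]]
padded, mask = pad_batch(batch)
```

Every row of `padded` now has length 4, and `mask` records which positions hold real tokens so that later steps can ignore the padding.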

SFT Loss Function - Negative Log Likelihood (NLL)

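Loss Function: The NLL loss measures how much probability the model assigns to the correct next token at each position; the lower that probability, the higher the loss.
Formula: \( \text{NLL} = -\sum_{i=1}^{n} y_i \log(p_i) \)
- Where \( y_i \) is the true label (1 for the correct token, 0 otherwise) and \( p_i \) is the predicted probability of token \( i \). Because only the correct token has \( y_i = 1 \), the sum reduces to \( -\log(p_{\text{correct}}) \) at each position, which is then summed or averaged over the sequence.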
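
To make the loss concrete, here is a small sketch that averages the per-token NLL over a sequence. It assumes we already have the probability the model assigned to each correct token, plus a mask that is 0 for positions to skip (such as padding); the function name `sequence_nll` and the example numbers are illustrative only.

```python
import math
from typing import List


def sequence_nll(token_probs: List[float], loss_mask: List[int]) -> float:
    """Mean negative log-likelihood over positions where loss_mask is 1.

    token_probs[t] is the probability the model gave to the correct
    token at position t (assumed to be > 0).
    """
    total = 0.0
    count = 0
    for p, m in zip(token_probs, loss_mask):
        if m:
            total += -math.log(p)
            count += 1
    return total / count if count else 0.0


# Illustrative values: the last position is padding and is masked out.
probs = [0.9, 0.5, 0.25, 1.0]
mask = [1, 1, 1, 0]
loss = sequence_nll(probs, mask)
```

A confident, correct prediction (probability near 1) adds almost nothing to the loss, while a low probability such as 0.25 adds much more.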

Numerical Stability

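Log-sum-exp Trick: Turning logits into log-probabilities requires the log of a sum of exponentials, which can overflow or underflow in floating point when the logits are large in magnitude. The log-sum-exp trick avoids this by subtracting the maximum before exponentiating.
Formula: \( \log(\sum_{i=1}^{n} e^{x_i}) = x^* + \log(\sum_{i=1}^{n} e^{x_i - x^*}) \)
- Where \( x^* \) is the maximum value among the \( x_i \).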
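
The identity above can be sketched in a few lines of standard-library Python. The helper names `logsumexp` and `log_softmax` are chosen for illustration, and the logits are deliberately large to show the case where a naive `math.exp` call would overflow.

```python
import math
from typing import List


def logsumexp(xs: List[float]) -> float:
    """Compute log(sum(exp(x))) by shifting every value by the maximum."""
    x_max = max(xs)
    return x_max + math.log(sum(math.exp(x - x_max) for x in xs))


def log_softmax(logits: List[float]) -> List[float]:
    """Log-probabilities from logits: x_i - logsumexp(x)."""
    lse = logsumexp(logits)
    return [x - lse for x in logits]


# math.exp(1000.0) raises OverflowError, so the naive formula fails here.
logits = [1000.0, 1001.0, 1002.0]
log_probs = log_softmax(logits)
nll_of_last = -log_probs[2]  # NLL if the third token is the correct one
```

After the shift, the largest exponent is exactly 0, so every `math.exp` call stays within range regardless of how large the original logits are.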

4. What Are the Common RL Training Techniques?

Reinforcement learning (RL) techniques, particularly RLHF, are used to further optimize LLMs. Here’s a breakdown:

RL Rewards

Rewards: These provide feedback to the model during training, guiding it towards better outputs.
Types of Rewards:
- Verifiable Rewards: Based on objective criteria (e.g., correctness).
- Preference Rewards: Based on human preferences (e.g., fluency and coherence).
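
As a simple illustration of a verifiable reward, the sketch below scores a response by exact match against a reference answer. The function name `exact_match_reward` and the normalization rules (lowercasing, collapsing whitespace) are assumptions made for this example; real checkers are usually task-specific.

```python
def exact_match_reward(response: str, reference: str) -> float:
    """Return 1.0 if the normalized response equals the reference, else 0.0."""

    def normalize(text: str) -> str:
        # Lowercase and collapse runs of whitespace (an illustrative choice).
        return " ".join(text.strip().lower().split())

    return 1.0 if normalize(response) == normalize(reference) else 0.0


reward_correct = exact_match_reward("  Paris ", "paris")
reward_wrong = exact_match_reward("Lyon", "Paris")
```

A preference reward, by contrast, cannot be checked with a rule like this, since it depends on human judgments about the response.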

Reward Models and Preferences

Reward Models: These models predict the reward for a