The Impact of Human Raters on Data Quality and Model Training

The Engineer

12 Feb 2024 · 3 min read

Human annotators play a crucial role in refining machine learning datasets, but their methods and biases can introduce flaws that affect model accuracy and fairness. This article explores how to mitigate these issues.

In machine learning, high-quality data is often called the "fuel" for training deep learning models. This is especially true of task-specific labeled data, most of which comes from human annotation. Whether the task is classification or reinforcement learning from human feedback (RLHF) for language model alignment, the quality of this data significantly affects the performance and reliability of the resulting models.

Human Raters ↔ Data Quality

Collecting high-quality human data involves a series of operational steps, each contributing to the overall data quality. Let's break down these steps:

Task Design: The first step is designing the task workflow to ensure clarity and reduce complexity. Detailed guidelines are essential, but they need to be concise enough to be useful. Very long and complicated guidelines can overwhelm raters and require extensive training to be effective.
Select and Train Raters: Choosing annotators with the right skill set and ensuring consistency is crucial. Training sessions are a must, and regular feedback and calibration sessions help maintain high standards. This ongoing process ensures that raters stay aligned with the task requirements and can adapt to any changes in the guidelines.
Data Collection and Aggregation: Once labels start coming in, they need to be cleaned, filtered, and aggregated. Outlier detection can flag raters or labels that deviate sharply from the rest, and consensus-based methods such as majority voting combine several raters' labels for the same item into one.
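As a rough illustration of the outlier-detection idea in the last step, the sketch below scores each rater by how often their label matches the per-item majority and flags raters who fall under a threshold. The function names, the tuple layout, and the 0.6 cutoff are assumptions made for this example, not part of any specific annotation pipeline.

```python
from collections import Counter, defaultdict


def majority_label(labels):
    """Most common label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def rater_agreement_with_majority(annotations):
    """annotations: list of (item_id, rater_id, label) tuples.

    Returns {rater_id: share of that rater's labels that match the item majority}.
    """
    labels_by_item = defaultdict(list)
    for item_id, _, label in annotations:
        labels_by_item[item_id].append(label)
    majority = {item: majority_label(labels) for item, labels in labels_by_item.items()}

    hits, totals = Counter(), Counter()
    for item_id, rater_id, label in annotations:
        totals[rater_id] += 1
        hits[rater_id] += int(label == majority[item_id])
    return {rater: hits[rater] / totals[rater] for rater in totals}


def flag_outlier_raters(annotations, min_agreement=0.6):
    """Raters whose agreement with the majority is below the (assumed) cutoff."""
    scores = rater_agreement_with_majority(annotations)
    return sorted(rater for rater, score in scores.items() if score < min_agreement)


# Hypothetical toy data: three raters, three items.
annotations = [
    ("q1", "ann_a", "pos"), ("q1", "ann_b", "pos"), ("q1", "ann_c", "neg"),
    ("q2", "ann_a", "neg"), ("q2", "ann_b", "neg"), ("q2", "ann_c", "pos"),
    ("q3", "ann_a", "pos"), ("q3", "ann_b", "pos"), ("q3", "ann_c", "pos"),
]
print(flag_outlier_raters(annotations))
```

Note that each rater's own vote counts toward the majority here, which inflates scores when only a few raters label each item. A flagged rater is a prompt for a calibration session, not proof of poor work.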

The Wisdom of the Crowd

The concept of "the wisdom of the crowd" plays a significant role in human data collection. When multiple raters are involved, their collective judgments often outperform individual ones. However, this is not always straightforward:

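Rater Agreement: High agreement among raters is generally taken as a sign of high data quality, provided it exceeds what chance alone would produce. Cohen's kappa measures chance-corrected agreement between two raters, while Fleiss' kappa extends the idea to any fixed number of raters per item.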
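For two raters, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed share of items on which the raters agree and p_e is the agreement expected by chance from each rater's label frequencies. A minimal pure-Python sketch follows; the function name and the toy labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if both raters labelled independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


rater_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
rater_b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]
print(cohens_kappa(rater_a, rater_b))  # 0.5: agreement 0.75, chance level 0.5
```

In the toy data the raters agree on 6 of 8 items, but because both use each label half the time, half of that agreement is what chance would produce.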

Rater Disagreement & Two Paradigms:
- Consensus-Based Aggregation: In this approach, the majority vote is taken as the ground truth. This works well when there is a clear consensus among raters.
- Disagreement Analysis: Sometimes, disagreements can provide valuable insights. Analyzing these discrepancies can help refine task guidelines and improve rater training.
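The two paradigms can be combined in one pass: take the majority label, but keep the agreement level so that contested items are routed to review instead of being silently resolved. The sketch below assumes a simple dictionary of item to labels and a 0.75 agreement threshold, both chosen only for illustration.

```python
from collections import Counter


def aggregate_item(labels, min_agreement=0.75):
    """Majority vote plus a disagreement flag for a single item."""
    top_label, top_count = Counter(labels).most_common(1)[0]
    agreement = top_count / len(labels)
    return {
        "label": top_label,
        "agreement": agreement,
        # Contested items go to review rather than straight into training data.
        "needs_review": agreement < min_agreement,
    }


# Hypothetical ratings: four raters per item.
ratings = {
    "item_1": ["toxic", "toxic", "toxic", "toxic"],
    "item_2": ["toxic", "safe", "toxic", "safe"],
    "item_3": ["safe", "safe", "safe", "toxic"],
}
results = {item: aggregate_item(labels) for item, labels in ratings.items()}
review_queue = [item for item, result in results.items() if result["needs_review"]]
print(review_queue)
```

Items that land in the review queue are also useful input for revising the guidelines, since they often mark cases the instructions do not cover.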

Data Quality ↔ Model Training

The quality of your data directly influences the performance of your models during training. Here are some key techniques to consider:

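Influence Functions: Influence functions estimate how a model's parameters or predictions would change if a training point were upweighted or removed, without retraining the model for every point. Training points whose estimated influence on held-out loss is large and harmful are candidates for noisy or misleading data and are worth inspecting before removal.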
Prediction Changes During Training: Tracking how the prediction for each training example changes from epoch to epoch gives insight into the learning process. Examples whose predictions keep flipping between correct and incorrect, or that are never learned at all, are more likely to be mislabeled or ambiguous and warrant further investigation.
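One simple way to operationalize this is to count, per example, how often a correct prediction at one epoch turns incorrect at the next. The sketch below assumes you have already logged a per-epoch correctness record for each training example; the record format, the example IDs, and the threshold of two flips are assumptions for illustration.

```python
def count_forgetting_events(correct_by_epoch):
    """Number of epoch-to-epoch transitions from correct to incorrect."""
    return sum(
        1
        for prev, curr in zip(correct_by_epoch, correct_by_epoch[1:])
        if prev and not curr
    )


def summarize(history):
    """history: {example_id: [bool per epoch]} with True meaning 'predicted correctly'."""
    return {
        example_id: {
            "forgetting_events": count_forgetting_events(record),
            "ever_learned": any(record),
        }
        for example_id, record in history.items()
    }


# Hypothetical correctness logs over five epochs.
history = {
    "ex_stable": [False, True, True, True, True],
    "ex_flaky": [True, False, True, False, True],
    "ex_never": [False, False, False, False, False],
}
summary = summarize(history)

# Shortlist examples that flip repeatedly or are never learned.
suspects = sorted(
    example_id
    for example_id, stats in summary.items()
    if stats["forgetting_events"] >= 2 or not stats["ever_learned"]
)
print(suspects)
```

A shortlist like this also contains genuinely hard examples, so it is better treated as a queue for relabeling or review than as a deletion list.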
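Noisy Cross-Validation: Standard cross-validation implicitly treats the given labels as correct, which rarely holds for human-labeled data. Noisy cross-validation (NCV) instead splits the dataset into two random halves, trains a model on one half, and marks a sample in the other half as "clean" when its given label matches the model's prediction. The samples that fail this check form a shortlist of likely label noise.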
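A toy version of the split-and-check idea is sketched below: train on one half, keep a sample from the other half as "clean" when its given label matches the prediction, then swap the halves. The nearest-centroid classifier on one-dimensional features is a stand-in chosen to keep the example dependency-free, and the function names and the `order` parameter are assumptions for this sketch.

```python
import random


def fit_centroids(samples):
    """Mean feature value per label; a stand-in for a real model."""
    sums, counts = {}, {}
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Label of the nearest centroid."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def noisy_cross_validation(data, order=None, seed=0):
    """Return (clean_indices, suspect_indices) for a list of (x, label) pairs.

    By default the data is split into two random halves. Passing `order`
    fixes the split, which is used below only to keep the example reproducible.
    """
    indices = list(range(len(data))) if order is None else list(order)
    if order is None:
        random.Random(seed).shuffle(indices)
    mid = len(indices) // 2
    halves = (indices[:mid], indices[mid:])

    clean, suspect = [], []
    # Train on one half, check the other, then swap roles.
    for train_idx, check_idx in (halves, halves[::-1]):
        centroids = fit_centroids([data[i] for i in train_idx])
        for i in check_idx:
            x, label = data[i]
            (clean if predict(centroids, x) == label else suspect).append(i)
    return sorted(clean), sorted(suspect)


# Hypothetical data: "low" values sit near 1, "high" values near 9.
# Index 6 has a flipped label.
data = [
    (1.0, "low"), (9.0, "high"), (1.2, "low"), (8.8, "high"),
    (0.9, "low"), (9.1, "high"), (1.1, "high"), (9.2, "high"),
]
clean, suspect = noisy_cross_validation(data, order=range(len(data)))
print(suspect)
```

With a random split, a half can end up missing a class or dominated by noisy labels, so in practice the procedure is usually repeated over several splits before a sample is discarded.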

Conclusion

While high-quality data is essential for training effective models, it often gets less attention than model development itself. The community recognizes the value of good data but sometimes overlooks the meticulous work required to collect and maintain it. By focusing on task design, rater selection and training, and smart data aggregation techniques, you can significantly improve the quality of your human-labeled data, leading to better-trained models.