
Human annotators play a crucial role in refining machine learning datasets, but their methods and biases can introduce flaws that affect model accuracy and fairness. This article explores how to mitigate these issues.
In machine learning, high-quality data is often called the "fuel" for training deep learning models. This is especially true of task-specific labeled data, most of which comes from human annotation. Whether the task is classification or reinforcement learning from human feedback (RLHF) for language model alignment, the quality of this data can significantly affect the performance and reliability of your models.
Collecting high-quality human data involves a series of operational steps, each contributing to the overall data quality. Let's break down these steps:
Task Design: The first step is designing the task workflow to ensure clarity and reduce complexity. Detailed guidelines are essential, but they need to be concise enough to be useful. Very long and complicated guidelines can overwhelm raters and require extensive training to be effective.
Select and Train Raters: Choosing annotators with the right skill set and ensuring consistency is crucial. Training sessions are a must, and regular feedback and calibration sessions help maintain high standards. This ongoing process ensures that raters stay aligned with the task requirements and can adapt to any changes in the guidelines.
Data Collection and Aggregation: Once the data starts coming in, various ML techniques can be applied to clean, filter, and aggregate it. Outlier detection can flag ratings or raters that deviate sharply from the rest, and consensus-based methods, such as majority voting across raters, can improve the quality of the aggregated labels.
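As a rough illustration of consensus-based aggregation, the sketch below takes a majority vote per item and reports the share of raters who agreed with it. The function name `aggregate_majority`, the `min_agreement` threshold of 0.6, and the sample votes are all hypothetical choices made for this example, not part of any particular annotation tool.

```python
from collections import Counter

def aggregate_majority(votes_by_item, min_agreement=0.6):
    """Return {item_id: (label, agreement)}.

    label is None when the top label's share of votes is below
    min_agreement (an illustrative threshold, not a recommended value).
    """
    results = {}
    for item_id, votes in votes_by_item.items():
        if not votes:
            # No ratings collected yet: nothing to aggregate.
            results[item_id] = (None, 0.0)
            continue
        label, top_count = Counter(votes).most_common(1)[0]
        agreement = top_count / len(votes)
        # Low agreement flags the item for review instead of forcing a label.
        results[item_id] = (label if agreement >= min_agreement else None, agreement)
    return results

# Hypothetical ratings from several annotators.
votes = {
    "item-1": ["spam", "spam", "ham"],
    "item-2": ["ham", "spam"],
    "item-3": ["ham", "ham", "ham"],
}
aggregated = aggregate_majority(votes)
for item_id, (label, agreement) in aggregated.items():
    print(item_id, label, round(agreement, 2))
```

Items that come back with no label are the ones worth routing to an extra rater or a calibration session, since a forced majority on a split vote hides the disagreement.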
The concept of "the wisdom of the crowd" plays a significant role in human data collection. When multiple raters are involved, their collective judgments often outperform individual ones. However, aggregation is not always straightforward: raters can disagree, and biases shared across raters do not average out.

The quality of your data directly influences the performance of your models during training. Here are some key techniques for finding data points that may be hurting it:
Influence Functions: Influence functions estimate how much a model's parameters or predictions would change if a given training point were removed or upweighted, without retraining from scratch. Ranking training points by influence helps surface noisy or misleading examples, which can then be reviewed and, where appropriate, removed.
Prediction Changes During Training: Monitoring how predictions change over the course of training can provide insights into the learning process. If certain data points consistently cause large changes in predictions, they might be problematic and need further investigation.
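Noisy Cross-Validation: Label noise is common in practice, so a model's disagreement with a given label can itself be a useful signal. Noisy cross-validation splits the dataset randomly into two halves, trains a model on one half, and treats a sample in the other half as likely clean when its given label matches the model's prediction; the halves are then swapped. Samples that fail this check are candidates for relabeling or removal.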
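To make the last technique concrete, here is a minimal sketch of one formulation of noisy cross-validation: split the data into two halves, train on one, and keep a held-out sample only if its given label matches the prediction, then swap the halves. The nearest-centroid classifier is a deliberately tiny stand-in for a real model, and the function names, the toy data, and the fixed even/odd split are assumptions made so the example is reproducible; in practice the split would be random, for instance via the hypothetical `random_halves` helper.

```python
import random

def fit_centroids(points, labels):
    """Stand-in model: one mean vector per class (nearest-centroid classifier)."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def random_halves(n, seed=0):
    """Randomly partition indices 0..n-1 into two halves."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[: n // 2], idx[n // 2 :]

def noisy_cross_validation(points, labels, folds):
    """Return sorted indices whose given label matches the held-out prediction."""
    clean = []
    # Train on one half, check the other, then swap roles.
    for train, held_out in (folds, folds[::-1]):
        model = fit_centroids([points[i] for i in train], [labels[i] for i in train])
        clean += [i for i in held_out if predict(model, points[i]) == labels[i]]
    return sorted(clean)

# Toy 1-D data: index 6 sits in the "a" cluster but carries the label "b".
points = [[0.0], [0.2], [0.1], [9.9], [10.0], [10.1], [0.3], [9.8]]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]

# Fixed even/odd split for reproducibility (an assumption of this example).
folds = ([0, 2, 4, 6], [1, 3, 5, 7])
clean = noisy_cross_validation(points, labels, folds)
suspect = sorted(set(range(len(points))) - set(clean))
print("likely clean:", clean)
print("suspect:", suspect)
```

A single split can be unlucky, for example when one half contains too few examples of a class, so repeating the procedure over several random splits and flagging only samples that fail consistently is a reasonable safeguard.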
While high-quality data is essential for training effective models, it often gets less attention than model development itself. The community recognizes the value of good data but sometimes overlooks the meticulous work required to collect and maintain it. By focusing on task design, rater selection and training, and smart data aggregation techniques, you can significantly improve the quality of your human-labeled data, leading to better-trained models.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
12 February 2024