
Researchers Danial Dervovic and Michael Cashmore unveil a novel method to assess machine learning models when labels are missing not at random, addressing a critical gap in model evaluation techniques.
In a recent paper, "Model Evaluation in the Dark: Robust Classifier Metrics with Missing Labels," Danial Dervovic and Michael Cashmore tackle an often-overlooked issue in machine learning: how to evaluate classifiers when some labels are missing at evaluation time. The problem is hardest when data is Missing Not At Random (MNAR), a scenario where the probability of a label being missing depends on unobserved data, such as the value of the missing label itself.
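The MNAR condition can be stated more precisely using the standard missing-data taxonomy. The notation below is introduced here for illustration and is not taken from the paper: M = 1 means a sample's label is missing, X is the observed features, and Y is the (possibly unobserved) label.

```latex
\begin{align*}
\text{MCAR:}\quad & P(M = 1 \mid X, Y) = P(M = 1) \\
\text{MAR:}\quad  & P(M = 1 \mid X, Y) = P(M = 1 \mid X) \\
\text{MNAR:}\quad & P(M = 1 \mid X, Y) \text{ depends on } Y \text{ even after conditioning on } X
\end{align*}
```

Under the first two conditions the observed labels can stand in for the missing ones, at least after adjusting for X. Under MNAR they cannot, because the labelled subset is systematically different from the unlabelled one.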
The authors introduce a multiple imputation technique for evaluating classifier metrics such as precision, recall, and ROC-AUC in the presence of missing labels. The method provides not only point estimates but also predictive distributions for these metrics, which is crucial for understanding the uncertainty associated with the evaluation.
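To make the idea concrete, here is a minimal sketch of generic multiple imputation applied to precision. It is not the authors' procedure. The function names (`precision`, `multiple_imputation_metric`), the simulated data, and the missingness rates are all assumptions made for illustration. In particular, the sketch computes the imputation probabilities from the simulated missingness mechanism, which is known here by construction; in practice those probabilities would have to be estimated, and that is the hard part.

```python
import numpy as np


def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    predicted_pos = y_pred == 1
    if predicted_pos.sum() == 0:
        return float("nan")
    return float((y_true[predicted_pos] == 1).mean())


def multiple_imputation_metric(y_obs, y_pred, p_pos_if_missing, metric,
                               n_imputations=200, seed=0):
    """Return one metric value per imputed copy of the labels.

    y_obs            : float array, 0/1 where the label is known, NaN where missing
    p_pos_if_missing : assumed P(y = 1 | label is missing) for every sample
                       (hypothetical input; only used at the NaN positions)
    """
    rng = np.random.default_rng(seed)
    missing = np.isnan(y_obs)
    draws = np.empty(n_imputations)
    for m in range(n_imputations):
        y_filled = y_obs.copy()
        # Draw a plausible label for each missing entry.
        y_filled[missing] = rng.random(missing.sum()) < p_pos_if_missing[missing]
        draws[m] = metric(y_filled, y_pred)
    return draws


# --- Simulated example (all numbers below are illustrative assumptions) ---
rng = np.random.default_rng(42)
n = 5000
x = rng.normal(0.0, 1.5, size=n)
score = 1.0 / (1.0 + np.exp(-x))            # classifier score, calibrated by construction
y = (rng.random(n) < score).astype(float)   # true labels, normally not fully observed
y_pred = (score > 0.5).astype(int)

# MNAR: positives lose their label far more often than negatives.
miss_rate_pos, miss_rate_neg = 0.6, 0.1
p_miss = np.where(y == 1, miss_rate_pos, miss_rate_neg)
y_obs = y.copy()
y_obs[rng.random(n) < p_miss] = np.nan

# Bayes' rule with the simulated mechanism (an assumption; unknown in practice).
p_pos_if_missing = (score * miss_rate_pos) / (
    score * miss_rate_pos + (1.0 - score) * miss_rate_neg
)

observed = ~np.isnan(y_obs)
true_precision = precision(y, y_pred)
complete_case = precision(y_obs[observed], y_pred[observed])

draws = multiple_imputation_metric(y_obs, y_pred, p_pos_if_missing, precision)
mi_estimate = draws.mean()
mi_interval = np.percentile(draws, [2.5, 97.5])

print(f"precision with all labels : {true_precision:.3f}")
print(f"complete-case precision   : {complete_case:.3f}")
print(f"multiple-imputation mean  : {mi_estimate:.3f}")
print(f"2.5-97.5 percentile range : {mi_interval[0]:.3f} to {mi_interval[1]:.3f}")
```

In this simulation the complete-case figure, which drops unlabelled samples, falls well below the precision computed with all labels, because positives are the ones most often missing. The imputed draws give both a point estimate (their mean) and a spread. Note that the spread here reflects only uncertainty about the imputed labels, not sampling variability or error in the imputation model.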
Evaluating models with incomplete labels is a common challenge in real-world applications. Ignoring samples with missing labels can lead to biased metrics, especially under MNAR conditions. This paper offers a robust solution that:
The multiple imputation technique involves several steps:

The paper includes several implementation details that are useful for practitioners:
This technique can be particularly useful in industries where data collection is challenging or where labels are expensive to obtain. For example:
By using multiple imputation for model evaluation, practitioners can make more informed decisions about their models' performance and reliability, even when faced with incomplete data.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
29 April 2025