
What is the optimal evaluation metric for multilabel classification models?

(Version 1.18 – January 07, 2026)


Ms. Yumi HEO


What is multilabel classification? Imagine classifying the colour of a single skin mole not just as light brown or dark brown but as both light brown and dark brown simultaneously. This is the core task of multilabel classification: training a machine learning model to assign multiple relevant labels to a single input.


However, evaluating such a model accurately is often harder than evaluating a standard single-label classifier. The truth is that there isn't a single optimal metric for every scenario. The best choice depends heavily on your specific objective and the inherent characteristics of your dataset.


Through the development of the AITIS Skin Cancer app, a tool designed to help you self-monitor skin health, we have found that selecting the appropriate evaluation metric is critical for the accuracy of our AI models. To accurately determine if a skin mole is cancerous, a model must consider multiple clinical variables simultaneously. Most diagnoses are not based on a single symptom but rather a combination of several positive indicators. We would like to introduce two essential metrics for these types of multilabel classification models: Example-based and Label-based metrics [1].


Label-Based vs. Example-Based

When evaluating a multi-label classifier, you can look at performance from two distinct views: focusing on how well the model handles each individual label across the whole dataset (Label-based), or focusing on how well the model handles all labels for a single data point (Example-based).


Let's use a skin mole colour classification example to see the difference.


The Scenario:

Imagine this model is tasked with classifying a single skin mole based on three possible colours: light brown (L), dark brown (D), black (B).

| Metric Type   | Focus                                  | Key Question                                           |
|---------------|----------------------------------------|--------------------------------------------------------|
| Label-based   | A single colour across all skin moles. | How well did the model find light-brown moles overall? |
| Example-based | All colours for one skin mole.         | How well did the model classify this specific mole?    |

Label-Based Metrics

Label-based metrics look at the performance of the model on one specific label (e.g., 'light brown') and compare the model's predictions to the truth for every skin mole in the test set.


How it Works:


To calculate a label-based metric like Precision or Recall for the 'light brown' colour, the model temporarily treats the problem as a simple binary classification: Is it light brown, or is it NOT light brown?

| Example skin mole | True Labels | Predicted Labels | Binary Result for light brown (L)            |
|-------------------|-------------|------------------|----------------------------------------------|
| Skin mole 1       | {L, D}      | {L, B}           | True Positive (TP): correctly predicted L    |
| Skin mole 2       | {B, D}      | {L, D}           | False Positive (FP): incorrectly predicted L |
| Skin mole 3       | {L}         | {D, B}           | False Negative (FN): missed L                |

By aggregating the True Positives (TP), False Positives (FP), and False Negatives (FN) across all skin moles, we can calculate the performance metrics for the 'light brown' label alone. This process is then repeated for the 'dark brown' and 'black' labels. For a detailed evaluation, the most effective visualisation is a per-label confusion matrix, which lets you see each label's True Positives, False Positives, and False Negatives clearly.
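The per-label tallying described above can be sketched in plain Python, using the three moles from the table (the `label_counts` helper is a hypothetical name introduced here for illustration):

```python
# Each mole: (true label set, predicted label set). Labels: L, D, B.
moles = [
    ({"L", "D"}, {"L", "B"}),   # skin mole 1: TP for L
    ({"B", "D"}, {"L", "D"}),   # skin mole 2: FP for L
    ({"L"},      {"D", "B"}),   # skin mole 3: FN for L
]

def label_counts(pairs, label):
    """Binary TP/FP/FN counts for one label across all examples."""
    tp = sum(label in t and label in p for t, p in pairs)
    fp = sum(label not in t and label in p for t, p in pairs)
    fn = sum(label in t and label not in p for t, p in pairs)
    return tp, fp, fn

tp, fp, fn = label_counts(moles, "L")   # 1, 1, 1 for 'light brown'
precision = tp / (tp + fp)              # 1 / 2 = 0.5
recall    = tp / (tp + fn)              # 1 / 2 = 0.5
```

Calling `label_counts(moles, "D")` or `label_counts(moles, "B")` repeats the same binary reduction for the other two colours.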


Key Metrics in this Category:


·       Precision, Recall, F1-Score (Macro and Micro Averages) [2]

·       Area Under the ROC Curve (AUC) for each label [2]
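As an illustration of the macro and micro averages, the three-mole scenario above could be scored as follows (a sketch, assuming scikit-learn is available; columns encode L, D, B in that order):

```python
import numpy as np
from sklearn.metrics import f1_score

# Rows = skin moles 1-3, columns = labels (L, D, B), 1 = label present.
y_true = np.array([[1, 1, 0],   # mole 1: {L, D}
                   [0, 1, 1],   # mole 2: {B, D}
                   [1, 0, 0]])  # mole 3: {L}
y_pred = np.array([[1, 0, 1],   # mole 1: {L, B}
                   [1, 1, 0],   # mole 2: {L, D}
                   [0, 1, 1]])  # mole 3: {D, B}

# Macro: compute F1 per label, then average (every label counts equally,
# so rare labels weigh as much as common ones).
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Micro: pool TP/FP/FN over all labels first, then compute a single F1
# (frequent labels dominate the score).
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
```

The gap between the two averages is itself informative: a macro score well below the micro score suggests the model struggles on the rarer labels.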


Example-Based Metrics


Example-based metrics evaluate the prediction for a single data point (a single skin mole) as a complete set, then average those scores across the entire dataset. This is where the concept of partial correctness comes into play.


How it Works:


Consider skin mole 1 from the table above:

·       True Labels: {light brown, dark brown}

·       Predicted Labels: {light brown, black}


The model was not perfect, but it wasn't a total failure either.


Jaccard Index (Intersection over Union, IoU) measures the overlap between the two sets.


·       Intersection: {light brown} (1 correct label)

·       Union: {light brown, dark brown, black} (3 unique labels)

·       Score: 1 / 3 (approx. 0.33)

·       Interpretation: The prediction was 33% correct for this skin mole.


Subset Accuracy (Exact Match) is the strictest example-based metric [1].


·       Is the predicted set exactly equal to the true set? No.

·       Score: 0
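Both scores for skin mole 1 can be checked with plain Python set operations (a minimal sketch using the labels from the example):

```python
true_labels = {"light brown", "dark brown"}
pred_labels = {"light brown", "black"}

# Jaccard index: size of the intersection over size of the union.
jaccard = len(true_labels & pred_labels) / len(true_labels | pred_labels)  # 1/3

# Subset accuracy: 1 only when the predicted set matches exactly, else 0.
exact_match = int(true_labels == pred_labels)  # 0
```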


Key Metrics in this Category:


·       Jaccard Index (or Sample Accuracy)

·       Subset Accuracy (Exact Match Ratio)


Choosing Your Evaluation Metric


The choice between Label-Based and Example-Based metrics should directly reflect your objective.


If you need to ensure the model is robust enough to correctly identify every single colour, especially the ones that appear less often, the label-based metric should be considered.


If you want to measure how often the model gets the entire set of colours perfect, or close to it, the example-based metric should be considered.


In practice, the optimal evaluation method is rarely a single metric. A comprehensive strategy monitors both views: review the label-specific confusion matrices to check results for individual labels that are high-priority or problematic, and track the Jaccard Index to see how often the model makes a "useful" prediction on a per-example basis.
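A monitoring setup along these lines could look as follows (a sketch, assuming scikit-learn; the arrays encode labels L, D, B for the three moles from the earlier table):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, jaccard_score,
                             multilabel_confusion_matrix)

y_true = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 0]])  # moles 1-3, labels L/D/B
y_pred = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1]])

# Label-based view: one 2x2 confusion matrix per label,
# laid out as [[TN, FP], [FN, TP]].
per_label_cm = multilabel_confusion_matrix(y_true, y_pred)

# Example-based view: mean Jaccard index over samples, plus the
# strict exact-match rate (subset accuracy).
sample_jaccard = jaccard_score(y_true, y_pred, average="samples")
exact_match_rate = accuracy_score(y_true, y_pred)  # subset accuracy
```

Inspecting `per_label_cm` per colour answers the label-based question, while the two sample-level scores summarise how useful the predictions are mole by mole.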


In conclusion, there is no one-size-fits-all metric. By understanding whether your project demands high fidelity across all possible labels (Label-Based) or high accuracy for each item (Example-Based), you can select the right tools to evaluate and improve your multilabel classification model.


References



[1] M.-L. Zhang and Z.-H. Zhou, "A Review on Multi-Label Learning Algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 8, pp. 1819–1837, Aug. 2014, doi: https://doi.org/10.1109/tkde.2013.39.

[2] X.-Z. Wu and Z.-H. Zhou, "A Unified View of Multi-Label Performance Measures," in Proceedings of the 34th International Conference on Machine Learning, PMLR, vol. 70, pp. 3780–3788, Jul. 2017. Accessed: Dec. 15, 2025. [Online]. Available: https://proceedings.mlr.press/v70/wu17a.html

 
 