Handling Class Imbalance in Image Datasets
(Version 1.00 – March 30th, 2026)
Ms. Yumi HEO
The problem with imbalanced data
Imagine training a model to detect a rare disease from medical scans. Your dataset has 950 healthy scans and only 50 diseased ones. If you train naively, the model quickly learns to simply predict "healthy" every single time: it achieves 95% accuracy and is completely useless.
This is the class imbalance problem. It is everywhere in the real world, and image datasets are no exception. Collecting more samples for the minority class is one of the most reliable ways to improve performance on that class [1], but extra minority data is often expensive to obtain or simply does not exist.
General approaches: oversampling & undersampling
Before reaching for any fancy technique, two simple but often effective strategies exist at the data level [2].
Oversampling
Duplicate minority-class samples until the class distribution is more balanced. Simple, but risks overfitting. The model may memorise the repeated images rather than generalise from them.
Undersampling
Randomly discard majority-class samples to reduce the imbalance. This reduces training time but throws away potentially useful information.
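As a concrete illustration, here is a minimal sketch of both strategies at the index level, assuming a binary problem where labels is a 1-D array of class ids (0 = majority, 1 = minority); the resampled index arrays would then drive whatever dataset wrapper you use.

import numpy as np

# Assumed input: labels is a 1-D array of class ids aligned with your image list
labels = np.asarray(labels)
rng = np.random.default_rng(42)

maj_idx = np.where(labels == 0)[0]  # majority class, e.g. 950 healthy scans
min_idx = np.where(labels == 1)[0]  # minority class, e.g. 50 diseased scans

# Oversampling: draw minority indices with replacement until classes match
over_idx = np.concatenate([maj_idx, rng.choice(min_idx, size=len(maj_idx), replace=True)])

# Undersampling: keep only as many majority samples as there are minority ones
under_idx = np.concatenate([rng.choice(maj_idx, size=len(min_idx), replace=False), min_idx])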
Image-specific techniques: augmentation & synthesis
For image data specifically, the richest lever you have is data augmentation applied selectively to the minority class. Instead of just duplicating images, you create new variations — each one slightly different, each one teaching the model something new. Albumentations [3] is a useful library for applying data augmentation.
Geometric transformations
Flipping, rotating, cropping, and resizing are the classics. A cat is still a cat when mirrored. A tumour is still a tumour when slightly rotated. These operations are label-preserving and cheap.
Colour & photometric transforms
Adjusting brightness, contrast, saturation, and hue. Particularly useful when the minority class was captured under different lighting conditions.
Advanced augmentations
Techniques like CutMix [4], Mixup [5], and GridDistortion [6] can be applied to minority classes specifically, creating harder and more varied training examples. Libraries like Albumentations make this straightforward.
Heavy augmentation applied only to the minority class is another way of oversampling. You are not just copying. You are generating new views of the same underlying phenomenon.
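As a rough sketch of how this might look with Albumentations, the pipeline below applies heavier transforms only when a sample belongs to the minority class; the class id and the specific transform choices are illustrative assumptions, not a prescription.

import albumentations as A

# Heavier pipeline reserved for minority-class images
minority_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=20, p=0.5),
    A.RandomBrightnessContrast(p=0.5),
    A.GridDistortion(p=0.3),
])

# Lighter pipeline for everything else
default_transform = A.Compose([A.HorizontalFlip(p=0.5)])

def augment(image, label, minority_class=1):
    # minority_class id is an assumption; image is a NumPy array (H, W, C)
    transform = minority_transform if label == minority_class else default_transform
    return transform(image=image)["image"]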

Figure 1. Data-augmented skin mole images
Generative models (GANs, Diffusion)
At the frontier, GANs or diffusion models [7] can synthesise entirely new minority-class images. This is expensive to set up but can dramatically expand a small dataset. However, there is a risk that the generated images contain artefacts that the model learns to rely on rather than the true underlying features.
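As an illustration, a sketch using the Diffusers StableDiffusionImg2ImgPipeline (the pipeline behind Figure 2) might look like the following; the model id, prompt, and strength value are illustrative assumptions, and init_image is an existing minority-class image loaded as a PIL image.

import torch
from diffusers import StableDiffusionImg2ImgPipeline

# Model id is an assumption; any img2img-capable checkpoint works
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# init_image: an existing minority-class image (PIL.Image), assumed loaded elsewhere.
# strength controls how far the output may drift from the source image.
synthetic = pipe(
    prompt="a dermatoscopic photograph of a skin mole",
    image=init_image,
    strength=0.6,
).images[0]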

Figure 2. Synthetic skin mole image using the StableDiffusionImg2ImgPipeline [8]
SMOTE — and why it's not for CNNs
SMOTE (Synthetic Minority Over-sampling Technique) [9] is a well-known algorithm that creates synthetic minority samples by interpolating between existing ones in feature space. Pick a minority sample, find its k nearest minority neighbours, and create a new point somewhere along the line between the sample and one of those neighbours.
SMOTE works well when your feature space is a flat, low-dimensional vector, as is typical for classical machine learning models such as SVMs. An SVM operating on a feature vector of tens to low hundreds of values finds SMOTE-generated points reasonable neighbours in that space.
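A minimal sketch with imbalanced-learn's SMOTE, using synthetic data from scikit-learn as a stand-in for real hand-crafted image descriptors:

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Stand-in for flat image descriptors (e.g. HOG vectors) with a ~90/10 imbalance
X, y = make_classification(n_samples=1000, n_features=64,
                           weights=[0.9, 0.1], random_state=0)

# Interpolate new minority samples between nearest minority neighbours
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

print(np.bincount(y), "->", np.bincount(y_res))  # minority count now matches majority
clf = SVC().fit(X_res, y_res)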
For CNNs and other deep image models, applying SMOTE directly on raw pixel arrays destroys the spatial structure that convolutional layers depend on [10]. Interpolating between two images in pixel space produces a blurry, unrealistic hybrid that offers no useful signal about the minority class.
Use SMOTE for
- SVMs & classical ML
- HOG descriptors, SIFT vectors, or other hand-crafted image features encoded as flat vectors.
Avoid SMOTE for
- CNNs & Vision Transformers
- Raw pixels or feature maps retain spatial structure that pixel-space interpolation corrupts. Use augmentation instead.
Neural network tools: class weights & pos_weight
When augmentation is not enough, or when you cannot change the data itself, you can instead tell the loss function to care more about the minority class. This is handled directly inside the training loop.
Class weights [11]
Most deep learning frameworks support passing per-class weights to the loss function or the training call. A simple rule is that the weight for a class should be inversely proportional to its frequency.
You can compute class_weight automatically from your dataset using scikit-learn's 'balanced' formula, which sets each weight inversely proportional to the class frequency:

import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

# labels: 1-D array of integer class labels for the training set
weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(labels),
    y=labels,
)
class_weights = torch.tensor(weights, dtype=torch.float)

In PyTorch, you pass this to nn.CrossEntropyLoss(weight=...); in Keras, it goes into model.fit(class_weight=...). The effect is that every gradient update from a minority-class sample is scaled up, giving it more influence on the model's weights. The example below computes inverse-frequency weights by hand and applies them with PyTorch's CrossEntropyLoss.
import torch
import torch.nn as nn

# Your class counts — e.g. 3 classes: 900 / 80 / 20 samples
class_counts = torch.tensor([900, 80, 20], dtype=torch.float)

# Inverse-frequency weighting: rarer classes get higher weight
class_weights = 1.0 / class_counts                   # → tensor([0.0011, 0.0125, 0.0500])
class_weights = class_weights / class_weights.sum()  # normalise (optional) → tensor([0.0175, 0.1965, 0.7860])

criterion = nn.CrossEntropyLoss(weight=class_weights)

pos_weight (for binary classification)
When your problem is binary (present vs absent, healthy vs non-healthy), PyTorch's nn.BCEWithLogitsLoss has a pos_weight parameter. Set it to the ratio of negative samples to positive samples, and the loss will automatically penalise false negatives more heavily than false positives. The example below computes and applies pos_weight with PyTorch's BCEWithLogitsLoss.
import torch
import torch.nn as nn

# Your dataset: 900 negatives, 100 positives
num_negatives = 900
num_positives = 100

# pos_weight = neg / pos — tells the loss to penalise missed positives more
pos_weight = torch.tensor([num_negatives / num_positives])  # → tensor([9.])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

[Differences between class_weight and pos_weight]
 | CrossEntropyLoss(weight=...) | BCEWithLogitsLoss(pos_weight=...) |
Use case | Multi-class (≥2 classes) | Binary only |
What it scales | Each class's contribution to the loss | False negatives specifically |
Input shape | [batch, num_classes] | [batch] or [batch, 1] |
Weight shape | [num_classes] | [1] |
Class weights and pos_weight do not create any new data; they simply re-weight the gradient signal. This makes them a clean, low-risk intervention that is always worth trying before more invasive techniques.
Putting it all together
In practice, the best results come from layering these techniques thoughtfully.
Step 1 · Data level - Sampling
Oversample or undersample
Use oversampling with augmentation, or mild undersampling on the majority class.
Step 2 · Data level - Augmentation
Augment minority class
Apply aggressive augmentation only to minority-class images during training.
Step 3 · Loss function - Re-weighting
Re-weight the loss
Add class_weight or pos_weight so minority-class errors contribute more to each gradient update, as in the sketch below.
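A compact sketch of how the three steps might be wired together in PyTorch; train_dataset and labels are assumed to exist as in the earlier sketches. Note that stacking a balancing sampler on top of class weights can over-correct, so it is worth tuning one lever at a time.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = torch.tensor(labels)            # assumed 1-D tensor of class ids
class_counts = torch.bincount(labels).float()

# Step 1 - sampling: draw rarer classes more often (oversampling with replacement)
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

# Step 2 - augmentation: applied per sample inside the Dataset's __getitem__
# (e.g. the minority-only Albumentations pipeline sketched earlier)

# Step 3 - loss re-weighting: scikit-learn's 'balanced' formula by hand
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)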
For SVMs specifically, SMOTE on feature vectors is a strong addition to this pipeline. For neural networks, skip the original SMOTE and lean into augmentation plus re-weighting the loss.
Class imbalance is not an edge case. It is the natural state of most real-world datasets. Address it deliberately, evaluate with the right metrics, and your model will learn what actually matters.
References
[1] “Tools for handling class imbalance,” Ultralytics, Jun. 17, 2025. https://community.ultralytics.com/t/tools-for-handling-class-imbalance/1149
[2] M. Salmi, D. Atif, D. Oliva, A. Abraham, and S. Ventura, “Handling imbalanced medical datasets: review of a decade of research,” Artificial Intelligence Review, vol. 57, no. 10, Sep. 2024, doi: 10.1007/s10462-024-10884-2.
[3] “Albumentations Documentation.” https://albumentations.ai/docs/
[4] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features,” arXiv:1905.04899 [cs], Aug. 2019, Available: https://arxiv.org/abs/1905.04899
[5] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond Empirical Risk Minimization,” arXiv:1710.09412 [cs, stat], Apr. 2018, Available: https://arxiv.org/abs/1710.09412
[6] M. Hägglund, M. Mørreaunet, M.-B. Moser, and E. I. Moser, “Grid-Cell Distortion along Geometric Borders,” Current Biology, vol. 29, no. 6, pp. 1047-1054.e3, Mar. 2019, doi: 10.1016/j.cub.2019.01.074.
[7] T. Ø. Eliassen and Y. Ma, “Data Synthesis with Stable Diffusion for Dataset Imbalance,” Stanford CS230 project report. [Online]. Available: https://cs230.stanford.edu/projects_fall_2022/reports/17.pdf
[8] “StableDiffusionImg2ImgPipeline,” Hugging Face Diffusers documentation. https://huggingface.co/docs/diffusers/v0.37.0/en/api/pipelines/stable_diffusion/img2img#diffusers.StableDiffusionImg2ImgPipeline (accessed Mar. 17, 2026).
[9] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, Jun. 2002, doi: 10.1613/jair.953.
[10] G. Szlobodnyik and L. Farkas, “Data augmentation by guided deep interpolation,” Applied Soft Computing, vol. 111, p. 107680, Jul. 2021, doi: 10.1016/j.asoc.2021.107680.
[11] W. Huang, G. Song, M. Li, W. Hu, and K. Xie, “Adaptive weight optimization for classification of imbalanced data,” in Lecture notes in computer science, 2013, pp. 546–553. doi: 10.1007/978-3-642-42057-3_69.

