AI & Technology | June 10, 2025 | 6 min read

How Skin Lesion Classifiers Are Built: Architecture & Training

A technical primer on how skin lesion classification CNNs work — from the HAM10000 benchmark dataset to architecture tradeoffs, class imbalance solutions, and published validation results from peer-reviewed research.

As of June 10, 2025.

AI-assisted dermatology tools help patients get faster, more consistent access to specialist review. A core component of these tools is a skin lesion classifier: a CNN (Convolutional Neural Network) that takes a dermoscopy image as input and outputs a probability distribution across lesion categories. This article explains how such classifiers are typically built, covering data, architecture, training decisions, and validation — drawing on published research.
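
The probability distribution described above is usually produced by applying a softmax to the network's raw output scores (logits). The following is a minimal sketch in plain Python. The logit values are invented for illustration and the class order is an arbitrary assumption; only the HAM10000 class abbreviations come from the dataset.

```python
import math

# Class abbreviations follow HAM10000; the order here is an arbitrary assumption.
CLASSES = ["nv", "mel", "bkl", "bcc", "akiec", "vasc", "df"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one image (not real model output).
logits = [2.1, 1.4, 0.3, -0.5, -1.0, -2.2, -1.8]
probs = softmax(logits)
predicted = CLASSES[probs.index(max(probs))]

for name, p in zip(CLASSES, probs):
    print(f"{name:>5}: {p:.3f}")
print("predicted class:", predicted)
```

The largest logit always maps to the largest probability, so the predicted class is unchanged by the softmax; what it adds is a score that can be shown to a reviewer as a confidence value.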

What is the HAM10000 dataset?

HAM10000 (Human Against Machine with 10,000 training images) is a publicly available benchmark of 10,015 dermoscopy images spanning seven skin lesion categories, assembled by Tschandl et al. and published in the Nature Portfolio journal Scientific Data in 2018. It remains the most widely used training corpus for skin lesion classification research as of mid-2025.

The seven classes are: melanocytic nevi (nv), melanoma (mel), benign keratosis-like lesions (bkl), basal cell carcinoma (bcc), actinic keratoses and intraepithelial carcinoma (akiec), vascular lesions (vasc), and dermatofibroma (df). Images were collected over two decades at two sites, the Department of Dermatology at the Medical University of Vienna, Austria, and a skin cancer practice in Queensland, Australia, covering a range of acquisition devices. The class distribution is heavily skewed, with approximately 67% of images being benign nevi, which is a well-known challenge for any team training a classifier on this data. The dataset is available through the ISIC (International Skin Imaging Collaboration) archive.
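
To make the skew concrete, the sketch below computes each class's share of the dataset. The per-class counts are the figures commonly reported for HAM10000; treat them as an assumption and verify them against the dataset metadata before relying on them.

```python
# Per-class image counts as commonly reported for HAM10000
# (an assumption here; verify against the dataset metadata).
COUNTS = {
    "nv": 6705, "mel": 1113, "bkl": 1099, "bcc": 514,
    "akiec": 327, "vasc": 142, "df": 115,
}

total = sum(COUNTS.values())
shares = {name: n / total for name, n in COUNTS.items()}

for name, n in sorted(COUNTS.items(), key=lambda kv: -kv[1]):
    print(f"{name:>5}: {n:5d} images ({shares[name]:.1%})")

# Ratio between the largest and smallest class.
imbalance_ratio = max(COUNTS.values()) / min(COUNTS.values())
print(f"total = {total}, largest/smallest class = {imbalance_ratio:.1f}x")
```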

Why do researchers use a CNN instead of a standard neural network?

A CNN (Convolutional Neural Network) applies learned spatial filters that slide across an image, detecting texture, colour gradient, and structural features like asymmetric borders regardless of where they appear in the frame. A fully connected network would need to learn these spatial relationships separately for every pixel position, requiring far more parameters and training data to achieve comparable results.
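
The parameter gap can be illustrated with simple arithmetic. The sketch below compares one fully connected layer with one convolutional layer on the same input. The layer sizes (64 units, 64 filters, 3x3 kernels, a 450x600 RGB input) are assumptions chosen for illustration, not values from any specific published model.

```python
def dense_params(height, width, channels, units):
    """Weights plus biases for a fully connected layer on a flattened image."""
    return height * width * channels * units + units

def conv_params(kernel, in_channels, filters):
    """Weights plus biases for a conv layer; independent of image size."""
    return kernel * kernel * in_channels * filters + filters

# Assumed sizes for illustration only.
dense = dense_params(450, 600, 3, 64)
conv = conv_params(3, 3, 64)

print(f"dense layer: {dense:,} parameters")
print(f"conv layer:  {conv:,} parameters")
```

Because the convolutional filters are shared across every position, their count does not grow with image size, whereas the dense layer's count grows with every added pixel.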

For dermoscopy images, this matters in practice. A suspicious pigment network pattern in the upper-left corner of a 450x600 image should produce the same feature activations as the same pattern centred in the frame. Convolution is translation-equivariant by design: shifting the pattern shifts the feature map by the same amount, and the pooling layers that follow turn this into approximate translation invariance. With a global (adaptive) pooling layer before the classifier head, the same trained weights can also be applied to images of different resolutions, which is useful when input images come from different dermatoscope models.
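
A toy demonstration of this property, written in plain Python rather than a deep learning framework: the same 3x3 pattern is pasted at two positions in an otherwise empty 8x8 image and passed through the same filter. The pattern, the filter, and the image size are all invented for illustration.

```python
def correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation (what deep learning libraries call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def place(pattern, top, left, size=8):
    """Return a size x size zero image with `pattern` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image

def argmax2d(grid):
    """Row and column of the largest value in a 2D list."""
    return max(((r, c) for r in range(len(grid)) for c in range(len(grid[0]))),
               key=lambda rc: grid[rc[0]][rc[1]])

# A toy pattern and a filter that matches it (both invented for illustration).
pattern = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
kernel = pattern

corner = correlate2d(place(pattern, 0, 0), kernel)
centre = correlate2d(place(pattern, 3, 2), kernel)

peak_corner = argmax2d(corner)   # where the filter responds most strongly
peak_centre = argmax2d(centre)
global_max_corner = max(max(row) for row in corner)  # global max pooling
global_max_centre = max(max(row) for row in centre)

print("peak positions:", peak_corner, peak_centre)
print("pooled responses:", global_max_corner, global_max_centre)
```

The peak response moves with the pattern, while the globally pooled value is identical in both cases. Real networks only approximate this behaviour, since padding, strides, and image borders all introduce small position-dependent effects.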

Published comparisons consistently show that CNN backbones outperform simple fully connected networks on HAM10000 and similar dermoscopy datasets by a substantial margin on weighted F1 before any hyperparameter tuning.

A deeper primer on how computer vision models process skin images is available in our article on how computer vision models detect skin conditions.

Which architectures are used in published skin lesion classification research?

Research teams have evaluated a range of CNN architectures pretrained on ImageNet and fine-tuned on HAM10000. Published results show EfficientNet variants and ResNet-based models among the best-performing options on the ISIC benchmarks, balancing accuracy with parameter efficiency.

Architectures commonly reported in the literature include:

Model            Year  Params (M)  Notes
ResNet-50        2015  25.6        Strong baseline, widely cited in dermatology ML research
Inception-V3     2016  23.8        Good at multi-scale features
EfficientNet-B3  2019  12.2        Favourable accuracy-to-size ratio in published results
ViT-B/16         2020  86.6        High accuracy on large datasets; parameter-heavy relative to HAM10000 scale

The Vision Transformer (ViT-B/16) typically requires more training data to generalise well. At 10,015 images, HAM10000 is modest by computer vision standards, which is why CNN architectures with compound scaling, such as EfficientNet, appear frequently in published skin lesion classification work. The ISIC annual challenge leaderboard provides a public record of model performance across these architectures.
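
One rough way to see why model size matters at this dataset scale is to divide each model's parameter count by the number of training images. This ratio is a crude heuristic, not a formal capacity measure, and pretraining changes the picture considerably. The parameter counts are taken from the table above.

```python
# Parameter counts in millions, taken from the table above.
PARAMS_M = {
    "ResNet-50": 25.6,
    "Inception-V3": 23.8,
    "EfficientNet-B3": 12.2,
    "ViT-B/16": 86.6,
}
HAM10000_IMAGES = 10_015

# Parameters available per training image (crude heuristic only).
params_per_image = {
    name: millions * 1_000_000 / HAM10000_IMAGES
    for name, millions in PARAMS_M.items()
}

for name, ratio in sorted(params_per_image.items(), key=lambda kv: kv[1]):
    print(f"{name:<16} {ratio:8,.0f} parameters per training image")
```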

For context on how architecture choices interact with Fitzpatrick skin tone generalisation, see our article on training AI on diverse skin tones.

How do researchers handle class imbalance during training?

Class imbalance is the central practical challenge in skin lesion classification CNN work. When approximately 67% of training images belong to a single class (benign nevi), a naive model learns to predict that class most of the time and still achieves a passable accuracy number — while failing badly on the clinically critical minority classes like melanoma.
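
The failure mode described above can be shown with a degenerate "model" that always predicts the most frequent class. The class counts are the commonly reported HAM10000 figures and should be treated as an assumption.

```python
# Commonly reported HAM10000 class counts (assumption; verify against the metadata).
COUNTS = {
    "nv": 6705, "mel": 1113, "bkl": 1099, "bcc": 514,
    "akiec": 327, "vasc": 142, "df": 115,
}

def majority_baseline(counts):
    """Metrics for a 'model' that always predicts the most frequent class."""
    total = sum(counts.values())
    majority = max(counts, key=counts.get)
    accuracy = counts[majority] / total
    # Every sample of the majority class is found; no other class ever is.
    recall = {name: (1.0 if name == majority else 0.0) for name in counts}
    return majority, accuracy, recall

majority, accuracy, recall = majority_baseline(COUNTS)

print(f"always predicting '{majority}' gives accuracy {accuracy:.1%}")
print(f"melanoma recall: {recall['mel']:.0%}")
```

An accuracy near 67% with zero melanoma recall is why plain accuracy is a poor headline metric for this task.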

Two complementary techniques are widely reported in the literature: weighted cross-entropy loss (giving rare classes proportionally higher gradient signal during backpropagation) and augmentation-based oversampling for minority classes. Augmentation typically includes random flip, rotation, colour jitter, and random crop with resize to generate synthetic variants, bringing minority classes to a higher effective training count per epoch.
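
Both techniques can be sketched in a few lines of plain Python. The inverse-frequency weighting formula below is one common choice among several, the target of 2,000 images per class is a hypothetical value, and the class counts are the commonly reported HAM10000 figures (an assumption). A real training pipeline would pass such weights to the framework's loss function rather than compute the loss by hand.

```python
import math

# Commonly reported HAM10000 class counts (assumption; verify against the metadata).
COUNTS = {
    "nv": 6705, "mel": 1113, "bkl": 1099, "bcc": 514,
    "akiec": 327, "vasc": 142, "df": 115,
}

def inverse_frequency_weights(counts):
    """w_c = N / (K * n_c); the weights average to 1 over the training set."""
    total = sum(counts.values())
    k = len(counts)
    return {name: total / (k * n) for name, n in counts.items()}

def weighted_cross_entropy(p_true, weight):
    """Loss for one sample: -w_c * log(p_c), with p_c the true-class probability."""
    return -weight * math.log(p_true)

def augmentation_factors(counts, target):
    """Augmented variants needed per original image to reach `target` per class."""
    return {name: max(1, math.ceil(target / n)) for name, n in counts.items()}

weights = inverse_frequency_weights(COUNTS)

# The same confidence (0.5 on the true class) is penalised more for a rare class.
loss_nv = weighted_cross_entropy(0.5, weights["nv"])
loss_mel = weighted_cross_entropy(0.5, weights["mel"])

# Hypothetical target of 2,000 effective images per class per epoch.
factors = augmentation_factors(COUNTS, target=2000)

print({name: round(w, 2) for name, w in weights.items()})
print(f"loss at p=0.5: nv={loss_nv:.3f}, mel={loss_mel:.3f}")
print("augmentation factors:", factors)
```

Very large weights on tiny classes can destabilise training, so some practitioners clip or soften them, for example by taking a square root.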

Published results consistently show that these adjustments improve both overall weighted F1 and per-class recall for melanoma — reducing false-negative rates meaningfully compared to training on raw class proportions. In a clinical context, missing a melanoma is far more consequential than a false-positive that prompts a follow-up, which is why recall on the melanoma class is treated as the primary optimisation target in most published work.
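
The two metrics named above can be computed directly from a confusion matrix. The sketch below uses a three-class matrix with invented numbers purely for illustration; it is not taken from any published result.

```python
def per_class_metrics(confusion, labels):
    """Precision, recall, F1 and support per class; rows = true, columns = predicted."""
    metrics = {}
    for i, name in enumerate(labels):
        tp = confusion[i][i]
        support = sum(confusion[i])                   # all true members of the class
        predicted = sum(row[i] for row in confusion)  # all predictions of the class
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[name] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": support}
    return metrics

def weighted_f1(metrics):
    """Per-class F1 averaged with each class weighted by its support."""
    total = sum(m["support"] for m in metrics.values())
    return sum(m["f1"] * m["support"] for m in metrics.values()) / total

# Invented confusion matrix for illustration only.
LABELS = ["nv", "mel", "bkl"]
CONFUSION = [
    [90, 4, 6],   # true nv
    [5, 12, 3],   # true mel
    [4, 2, 14],   # true bkl
]

metrics = per_class_metrics(CONFUSION, LABELS)
mel_recall = metrics["mel"]["recall"]
mel_false_negative_rate = 1.0 - mel_recall
score = weighted_f1(metrics)

print(f"melanoma recall: {mel_recall:.2f}")
print(f"melanoma false-negative rate: {mel_false_negative_rate:.2f}")
print(f"weighted F1: {score:.3f}")
```

Weighted F1 is dominated by the largest class, which is why it is usually reported alongside melanoma recall rather than instead of it.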

What training setups are typical in published research?

Published skin lesion classification papers typically use PyTorch or TensorFlow with pretrained ImageNet weights, AdamW or SGD optimisers, cosine learning rate schedules, and batch sizes between 16 and 64. Input images are commonly resized to between 224x224 and 300x300 pixels and normalised to ImageNet mean and standard deviation.
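
Two of these ingredients are simple enough to write out by hand: ImageNet normalisation and the cosine learning rate schedule. The mean and standard deviation are the standard ImageNet channel statistics. The peak learning rate of 3e-4 and the 30-epoch horizon are hypothetical settings, not values from a specific paper.

```python
import math

# Standard ImageNet channel statistics (RGB order, for inputs scaled to [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalise_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardise per channel."""
    return tuple((value / 255.0 - mean) / std
                 for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing from lr_max at step 0 down to lr_min at total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings: peak learning rate 3e-4 annealed over 30 epochs.
schedule = [cosine_lr(epoch, 30, 3e-4) for epoch in range(31)]

print("white pixel after normalisation:", normalise_pixel((255, 255, 255)))
print(f"lr at epoch 0/15/30: {schedule[0]:.2e}, {schedule[15]:.2e}, {schedule[30]:.2e}")
```

In practice both steps are handled by the framework's built-in transforms and schedulers; the hand-written versions only show what those utilities compute.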

A standard data split is 80% train, 10% validation, and 10% test, stratified by class. Early stopping on validation weighted-F1 is common, and ensemble averaging across multiple training seeds is a well-established technique for reducing variance on small medical imaging datasets, consistent with findings in the dermatologist-level classification paper by Esteva et al. FP16 mixed-precision training is now standard practice for reducing GPU memory consumption without degrading accuracy.
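
A stratified 80/10/10 split can be sketched with the standard library alone. The function name `stratified_split` and the toy label counts are illustrative assumptions; most projects would use an existing utility from their ML toolkit instead.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into train/val/test while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before slicing
        n_train = int(round(fractions[0] * len(indices)))
        n_val = int(round(fractions[1] * len(indices)))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy labels with a skewed distribution (counts invented for illustration).
labels = ["nv"] * 670 + ["mel"] * 110 + ["bkl"] * 110 + ["bcc"] * 50 + ["df"] * 60
train, val, test = stratified_split(labels)

print(len(train), len(val), len(test))
print("test set classes:", Counter(labels[i] for i in test))
```

Splitting per class guarantees that even the rarest category appears in the validation and test sets, which a purely random split cannot promise.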

What do a CNN's feature maps reveal about skin lesion patterns?

In early convolutional layers, a CNN responds to low-level edge contrasts and colour boundaries. Deeper layers activate on lesion-specific patterns: irregular pigment networks, atypical vascular structures, and asymmetric colour distributions that correlate with malignancy in clinical dermoscopy criteria.

Grad-CAM (Gradient-weighted Class Activation Mapping) is a widely used interpretability technique that generates heatmaps showing which image regions most influenced each classification decision. For melanoma predictions, activation typically concentrates on areas of irregular pigment network and blue-white veil — consistent with the dermoscopic criteria described in American Academy of Dermatology (AAD) clinical guidelines at aad.org. Surfacing these heatmaps alongside confidence scores in clinical review interfaces allows dermatologists to verify which features the model weighted most heavily before confirming or overriding the classification.
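
The core of Grad-CAM is a short computation: average each feature map's gradient to get a channel weight, take the weighted sum of the feature maps, and apply a ReLU. The sketch below shows only that combination step on two toy 2x2 feature maps with invented values. In a real model the activations and gradients come from a forward and backward pass through the last convolutional layer, and the heatmap is then upsampled to the input image size.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a class-specific heatmap.

    activations[k][i][j]: value of feature map k at position (i, j)
    gradients[k][i][j]:   d(class score) / d(activations[k][i][j])
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = gradient averaged over all spatial positions.
    alphas = [sum(sum(row) for row in grad) / (height * width)
              for grad in gradients]
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    return [
        [max(0.0, sum(a * fmap[i][j] for a, fmap in zip(alphas, activations)))
         for j in range(width)]
        for i in range(height)
    ]

# Toy inputs, invented for illustration.
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],      # map 0 supports the class (weight 0.5)
    [[-1.0, -1.0], [-1.0, -1.0]],  # map 1 argues against it (weight -1.0)
]

heatmap = grad_cam(activations, gradients)
for row in heatmap:
    print(row)
```

Positions where only the opposing feature map fires are zeroed by the ReLU, so the heatmap highlights regions that raised the class score rather than every region the network attended to.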

The foundational work showing CNN accuracy at dermatologist level was published by Esteva et al. in Nature in 2017 at nature.com, and remains a key benchmark for dermatology ML (Machine Learning) research.

How are skin lesion classifiers validated against clinical benchmarks?

The ISIC challenge test sets provide consistent, blinded evaluation that prevents overfitting to a single institution's data distribution. In the ISIC 2019 challenge (25,331 training images, 8 classes), top academic groups reported normalised multi-class accuracy in the 0.60–0.65 range. The leaderboard is publicly accessible, so researchers can compare approaches on the same held-out data.
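
Normalised multi-class accuracy is the mean of per-class recall, often called balanced accuracy, so every class counts equally regardless of how many images it has. The sketch below contrasts it with plain accuracy on an invented three-class confusion matrix; the numbers are for illustration only and are not ISIC results.

```python
def balanced_accuracy(confusion):
    """Mean of per-class recall; rows = true class, columns = predicted class."""
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion) if sum(row)]
    return sum(recalls) / len(recalls)

def plain_accuracy(confusion):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    correct = sum(row[i] for i, row in enumerate(confusion))
    return correct / sum(sum(row) for row in confusion)

# Invented confusion matrix: one common class and two rare ones.
CONFUSION = [
    [950, 30, 20],  # common class, recall 0.95
    [20, 25, 5],    # rare class A, recall 0.50
    [15, 5, 30],    # rare class B, recall 0.60
]

plain = plain_accuracy(CONFUSION)
balanced = balanced_accuracy(CONFUSION)

print(f"plain accuracy:    {plain:.3f}")
print(f"balanced accuracy: {balanced:.3f}")
```

The gap between the two numbers explains why scores in the 0.60 to 0.65 range on this metric are not comparable to the much higher plain accuracy figures sometimes quoted for imbalanced datasets.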

Dermatologist-to-dermatologist agreement on dermoscopy classification studies typically falls between 70% and 85% depending on lesion type and image quality, consistent with findings reported in the HAM10000 dataset paper. This inter-rater variability provides important context for evaluating AI model agreement rates.

AI-assisted skin lesion tools are not diagnostic devices and are not designed to replace clinician judgment. The clinical value lies in prioritisation and pre-classification — helping certified dermatologists allocate review time to the cases most likely to require urgent attention. Health Canada's regulatory position on AI-assisted medical devices is evolving; any deployed tool must operate within applicable Canadian federal guidelines at canada.ca.
