AI & Technology | September 10, 2024 | 7 min read

Why Skin Tone Matters in AI Dermatology Models

Artificial intelligence (AI) dermatology tools trained mostly on lighter skin tones perform worse for patients with darker complexions. Understanding the Fitzpatrick scale and dataset diversity gaps is the first step toward fixing the problem.

As of September 10, 2024.

Artificial intelligence (AI) tools for skin assessment are increasingly present in clinical workflows, consumer apps, and telemedicine platforms. But most of these tools learned to recognize conditions from photographs that skew heavily toward lighter complexions. For patients with medium to dark skin, that means a diagnostic AI may simply not work as well. This article explains why skin tone representation matters in dermatology AI, how the Fitzpatrick scale provides a measurement framework, and what needs to change to build fair models.

What is the Fitzpatrick scale and why does it matter for AI?

Short answer: The Fitzpatrick scale classifies human skin into six types based on how it responds to UV exposure, that is, its tendency to burn or tan, which broadly tracks melanin content. In AI development, it gives researchers a consistent vocabulary for measuring whether training datasets include enough variation across the full tonal spectrum. Without a shared taxonomy, model builders have no standard way to show that their data is representative.

Dermatologist Thomas Fitzpatrick introduced the scale in 1975 to predict how patients would respond to phototherapy. The original version covered only four types of lighter skin; Types V and VI were added later. Since then it has become the standard taxonomy for describing skin tone in clinical and AI research. For machine learning (ML) model builders, the Fitzpatrick scale acts as a checklist: if the training dataset contains mostly Types I-III images, the convolutional neural network (CNN) behind the model is likely to develop systematic blind spots for Types IV-VI, and those blind spots can translate directly into missed diagnoses.

Fitzpatrick type | Description | UV / sun response | Melanoma risk
I | Very fair, often freckled | Always burns, never tans | Highest
II | Fair | Burns easily, tans minimally | Very high
III | Medium, olive | Burns moderately, tans gradually | High
IV | Olive to light brown | Burns minimally, tans easily | Moderate
V | Brown | Rarely burns, tans darkly | Lower
VI | Deep brown to black | Almost never burns | Lowest (but late-stage diagnoses more common)

Types IV-VI account for the majority of the global population yet remain underrepresented in the image libraries that most AI teams train on.
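One way to make the "checklist" idea concrete is a simple dataset audit. The sketch below assumes each training image already carries a Fitzpatrick label; the function names, the 10% threshold, and the label counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter

FITZPATRICK_TYPES = ["I", "II", "III", "IV", "V", "VI"]


def fitzpatrick_shares(labels):
    """Return each Fitzpatrick type's share of the labelled images."""
    counts = Counter(label for label in labels if label in FITZPATRICK_TYPES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no images with a Fitzpatrick label")
    return {t: counts[t] / total for t in FITZPATRICK_TYPES}


def underrepresented(shares, min_share=0.10):
    """List types below min_share (an illustrative threshold, not a standard)."""
    return [t for t in FITZPATRICK_TYPES if shares[t] < min_share]


# Made-up label counts, for illustration only.
labels = (["I"] * 200 + ["II"] * 330 + ["III"] * 300
          + ["IV"] * 120 + ["V"] * 35 + ["VI"] * 15)
shares = fitzpatrick_shares(labels)
print(underrepresented(shares))  # prints ['V', 'VI']
```

In practice the threshold would be set against the population the tool is meant to serve rather than a fixed number.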

How underrepresented are darker skin tones in dermatology datasets?

Short answer: Severely. Peer-reviewed audits show that fewer than 5% of the images in standard dermatology image libraries show skin of color. Because CNNs learn from pattern frequency, this imbalance pushes model confidence toward conditions as they appear on lighter skin, making errors on darker complexions statistically predictable rather than random.

A widely cited 2018 viewpoint in JAMA Dermatology by Adamson and Smith, on machine learning and health care disparities in dermatology, argued that darker skin tones are severely underrepresented in the images used to train dermatology AI (pubmed.ncbi.nlm.nih.gov). The reasons are structural: academic medical centres that digitized image archives decades ago served predominantly white patient populations, and those archives were later scraped to build AI training sets. By the time a research team begins training a skin lesion classifier, the data problem is already baked in.

The practical effect is visible in accuracy benchmarks. Models that achieve 90-plus percent sensitivity for melanoma on Fitzpatrick Types I-III can drop to 70% or below on Types V-VI. For a condition where early detection changes outcomes, that gap is clinically significant. A 2024 analysis indexed on PubMed confirmed these disparities persist in the latest generation of AI-generated dermatology images (pubmed.ncbi.nlm.nih.gov).
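To see how a gap like this can sit behind a healthy-looking headline number, here is a minimal sketch of stratified sensitivity. The record fields, the grouping, and the case counts are assumptions made up for illustration; they are not taken from any cited study.

```python
def sensitivity(records):
    """True positives / all truly malignant cases, or None if there are none."""
    positives = [r for r in records if r["malignant"]]
    if not positives:
        return None
    hits = sum(1 for r in positives if r["predicted_malignant"])
    return hits / len(positives)


def sensitivity_by_group(records, groups):
    """Compute sensitivity separately for each named set of Fitzpatrick types."""
    return {
        name: sensitivity([r for r in records if r["fitzpatrick"] in types])
        for name, types in groups.items()
    }


def make_cases(ftype, detected, missed):
    """Build made-up malignant cases: `detected` caught, `missed` not caught."""
    hit = {"fitzpatrick": ftype, "malignant": True, "predicted_malignant": True}
    miss = {"fitzpatrick": ftype, "malignant": True, "predicted_malignant": False}
    return [hit] * detected + [miss] * missed


# Hypothetical evaluation set: 90 lighter-skin cases, 10 darker-skin cases.
records = make_cases("II", 83, 7) + make_cases("VI", 7, 3)
GROUPS = {"I-III": {"I", "II", "III"}, "V-VI": {"V", "VI"}}

overall = sensitivity(records)
by_group = sensitivity_by_group(records, GROUPS)
print(f"overall: {overall:.2f}")
for name, value in by_group.items():
    print(f"{name}: {value:.2f}")
```

With these invented counts the overall figure is 90%, while the V-VI group sits at 70%. Because that group contributes only a tenth of the cases, its misses barely move the aggregate.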

Can AI accurately tell skin undertone for diagnostic purposes?

Short answer: Consumer AI tools can estimate warm, cool, or neutral undertones with reasonable accuracy for lighter complexions, but performance drops for darker skin where high melanin concentrations absorb light in ways that compress the undertone signal. For clinical diagnosis, the challenge is harder: a model must distinguish subtle morphological differences across all tones, not just broad color categories.

Skin undertone analysis requires detecting how the dermis reflects light below the epidermis. Melanin absorbs light differently at higher concentrations, which compresses the signal those tools rely on. This is one reason cosmetic AI tools built on lighter-skin data fail for darker complexions. Clinical diagnostic AI faces a related but more demanding version of this problem: conditions like basal cell carcinoma appear translucent and pearly on lighter skin, but can present as pigmented plaques on darker skin. Psoriasis plaques are often violet-brown on Fitzpatrick Types V-VI rather than the salmon-pink shown in most textbook images. A model trained only on the lighter-skin presentation will consistently miss the darker-skin version, not because the pathology differs, but because the visual pattern it learned does not match what it sees.

Responsible AI dermatology tools should benchmark every diagnostic module against Fitzpatrick-stratified accuracy targets before deployment, so clinicians know the accuracy profile across the full scale. This practice is increasingly required under Health Canada's Software as a Medical Device (SaMD) framework.
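A pre-deployment check of this kind could look something like the sketch below. It is not drawn from any regulatory text: the 0.85 target, the 30-case minimum, the function name, and the metrics are hypothetical placeholders. The one design choice worth noting is that a type with too few evaluation cases fails the gate instead of passing silently.

```python
FITZPATRICK_TYPES = ("I", "II", "III", "IV", "V", "VI")


def release_gate(metrics, target=0.85, min_cases=30):
    """Return {type: reason} for every Fitzpatrick type that blocks release.

    `metrics` maps a type to {"n": case count, "sensitivity": value}.
    The target and min_cases defaults are illustrative, not regulatory values.
    """
    failures = {}
    for t in FITZPATRICK_TYPES:
        entry = metrics.get(t)
        if entry is None or entry["n"] < min_cases:
            failures[t] = "insufficient evaluation cases"
        elif entry["sensitivity"] < target:
            failures[t] = "below target"
    return failures


# Made-up benchmark results; Type VI was never evaluated.
metrics = {
    "I": {"n": 120, "sensitivity": 0.93},
    "II": {"n": 150, "sensitivity": 0.92},
    "III": {"n": 110, "sensitivity": 0.90},
    "IV": {"n": 45, "sensitivity": 0.86},
    "V": {"n": 32, "sensitivity": 0.74},
}
print(release_gate(metrics))
```

An empty result would mean every type met the target with enough cases behind it; here Types V and VI both block release, for different reasons.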

How reliable is AI dermatology overall?

Short answer: For conditions well-represented in training data, some AI dermatology tools approach the diagnostic accuracy of general practitioners. Reliability drops significantly for populations underrepresented in training data, particularly patients with Fitzpatrick Types IV-VI. Aggregate accuracy figures routinely hide these subgroup failures, so a single headline number is not enough to judge a tool.

A landmark 2017 study by Esteva et al. demonstrated that a CNN trained on 129,450 clinical images could classify skin cancer at a level comparable to board-certified dermatologists (nature.com). The critical qualifier is that the benchmark dataset carried the same demographic skew as most clinical archives. Reliability also varies by condition: melanoma detection, acne severity grading, and psoriasis identification have attracted extensive research attention, while rare conditions and those that present differently across skin types remain under-studied.

The World Health Organization (WHO) notes that access to specialist care varies dramatically by geography. For Canadian patients, that means telemedicine AI tools marketed as access solutions could deepen existing disparities if they perform poorly on the populations with the least specialist access (who.int). The National Institutes of Health (NIH) has published guidance calling for disaggregated performance reporting by demographic subgroup in clinical AI (nih.gov).

What does dermatology AI fairness actually require?

Short answer: Fair AI in dermatology requires three things: diverse training data collected from representative populations, disaggregated benchmarks that break accuracy down by Fitzpatrick type, and post-deployment monitoring that catches performance drift before it causes harm. None of these is optional if the goal is equitable diagnostic accuracy.

Diverse datasets are the starting point. This means actively recruiting image contributors from regions with higher concentrations of darker-skinned patients, partnering with dermatology clinics in sub-Saharan Africa, South Asia, and Latin America, and building data-sharing agreements that comply with applicable privacy legislation. In Canada, any patient data used in AI training must satisfy provincial privacy statutes and applicable federal privacy law.

Disaggregated benchmarks require model developers to report accuracy not as a single aggregate number, but broken down by Fitzpatrick type, age, sex, and geography, because aggregate scores hide subgroup failures. The U.S. Food and Drug Administration (FDA) has begun calling for this kind of subgroup reporting in its guidance on AI and ML in medical devices.

Post-deployment monitoring closes the loop. A model that performed adequately at launch may drift as patient populations shift, and only ongoing Fitzpatrick-stratified auditing can detect those gaps before they compound.
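As a rough illustration of what ongoing stratified auditing might involve, the sketch below keeps a rolling window of confirmed-positive cases per Fitzpatrick type and flags any type whose recent detection rate falls below a floor. The class name, window size, floor, and simulated outcomes are all assumptions; a production system would also need confirmed ground truth, which usually arrives with a delay.

```python
from collections import defaultdict, deque


class StratifiedMonitor:
    """Rolling detection-rate monitor, kept separately per Fitzpatrick type."""

    def __init__(self, window=200, floor=0.80, min_cases=20):
        self.floor = floor          # illustrative alert threshold
        self.min_cases = min_cases  # do not alert on tiny samples
        self._recent = defaultdict(lambda: deque(maxlen=window))

    def record(self, fitzpatrick, detected):
        """Log one confirmed-positive case and whether the model caught it."""
        self._recent[fitzpatrick].append(bool(detected))

    def alerts(self):
        """Return {type: recent detection rate} for types below the floor."""
        flagged = {}
        for ftype, outcomes in self._recent.items():
            if len(outcomes) >= self.min_cases:
                rate = sum(outcomes) / len(outcomes)
                if rate < self.floor:
                    flagged[ftype] = rate
        return flagged


# Simulated outcomes, for illustration only.
monitor = StratifiedMonitor(window=50, floor=0.80, min_cases=20)
for i in range(50):
    monitor.record("II", i % 10 != 0)   # misses 1 case in 10
for i in range(30):
    monitor.record("VI", i % 3 != 0)    # misses 1 case in 3
alerts = monitor.alerts()
print(alerts)
```

Only Type VI is flagged in this simulation. A single pooled window over all 80 cases would have shown a detection rate of about 81%, just above the floor.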

Learn more about how DermaDex approaches these questions in the technical primer on how computer vision models detect skin conditions.

Does AI confirm or amplify existing biases in dermatology?

Short answer: AI trained on biased data does not merely reflect existing bias; it amplifies bias at scale. A physician who misses a diagnosis may refer the patient elsewhere. An AI system in a triage workflow applies its skewed confidence scores to every case without that corrective instinct, compounding disparities across thousands of consultations.

This feedback mechanism is why data diversity is so urgent. Systematic under-diagnosis of darker skin tones appears at population level as health disparities that can look like biological differences when they are actually data artifacts. Researchers have found that AI image generation models reproduce the same biases present in training data, generating synthetic images that skew toward lighter skin. This matters because synthetic augmentation is increasingly used to expand training sets: if the generator is biased, the augmented dataset inherits that bias at volume.
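The arithmetic behind that last point is simple enough to sketch. The function and every number below are hypothetical: the example assumes a real dataset in which 5% of images show darker skin, then adds synthetic images from a generator that produces darker skin at a given rate.

```python
def darker_share_after_augmentation(real_total, real_darker,
                                    synth_total, synth_darker_rate):
    """Share of darker-skin images once synthetic images are mixed in."""
    if real_total + synth_total <= 0:
        raise ValueError("dataset is empty")
    synth_darker = synth_total * synth_darker_rate
    return (real_darker + synth_darker) / (real_total + synth_total)


# Made-up sizes: 10,000 real images (500 darker skin) plus 40,000 synthetic.
same_bias = darker_share_after_augmentation(10_000, 500, 40_000, 0.05)
worse_bias = darker_share_after_augmentation(10_000, 500, 40_000, 0.03)
print(f"generator matches the data: {same_bias:.3f}")
print(f"generator skews lighter:    {worse_bias:.3f}")
```

In the first case the dataset is five times larger but the share of darker-skin images is unchanged at 5%. In the second it falls to 3.4%, so augmentation has made the imbalance worse.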

Addressing this requires structural change: not just better algorithms, but different data collection practices, mandatory fairness audits, and clinical governance frameworks that treat Fitzpatrick-stratified accuracy as a minimum standard rather than an optional metric. You can read more about DermaDex's approach to transparent AI development on our about page.
