Can AI tell me my skin undertone?

Consumer AI tools can estimate warm, cool, or neutral undertones with reasonable accuracy for lighter complexions, but performance drops for darker skin tones where high melanin concentrations absorb light in ways that compress the undertone signal these tools rely on. Current AI models are better suited to identifying broad tonal categories than precise undertone classification. For clinical diagnostic purposes, skin undertone detection is a much harder problem because conditions present differently across the Fitzpatrick scale, and most models were trained on datasets that underrepresent Types IV-VI. Until training data diversity improves substantially, AI undertone detection remains more reliable for cosmetic guidance than clinical use.

Does AI confirm your biases?

AI does not simply confirm existing biases; it can amplify them at scale. A physician who makes a diagnostic error may refer the patient elsewhere. An AI system embedded in a clinical workflow applies its confidence scores to every case without that corrective mechanism. When training data underrepresents darker skin tones, the model develops systematic blind spots for those populations. Every missed diagnosis that goes uncorrected can feed back into evaluation data that is then used to validate the model. At population level, these amplified errors appear as health disparities that look like biological differences but are actually the product of skewed data collection and inadequate validation.

Is there an AI that can change skin color?

Generative AI image tools can simulate different skin tones in photographs, and these tools are used in cosmetic and fashion applications. In a clinical context, researchers have explored using synthetic image generation to diversify training datasets by producing images of conditions on skin tones underrepresented in real-world archives. However, studies have found that many generative models reproduce the same biases present in their own training data, generating images that still skew toward lighter tones. Synthetic augmentation can help at the margins, but it is not a substitute for collecting real, diverse clinical images. Building genuinely representative training datasets requires direct partnerships with dermatology clinics serving diverse patient populations.

Skin Tone Bias in AI Dermatology Models | DermaDex

How reliable is AI dermatology?

For conditions well-represented in training data and on patients similar to those training populations, some AI dermatology tools match or exceed the diagnostic accuracy of general practitioners. A landmark 2017 Nature study (Esteva et al.) showed a CNN classifying skin cancer at a level comparable to board-certified dermatologists. Reliability falls off significantly for patients with darker skin tones, rare conditions, and presentations that differ from the dominant patterns in training data. In practical terms, an AI tool with 90%+ sensitivity for melanoma on lighter skin may perform at 70% or below on darker skin, a gap large enough to affect clinical outcomes. Reliability figures should always be interpreted alongside the demographic profile of the validation dataset.

As of September 10, 2024.

Artificial intelligence (AI) tools for skin assessment are increasingly present in clinical workflows, consumer apps, and telemedicine platforms. But most of these tools learned to recognize conditions from photographs that skew heavily toward lighter complexions. For patients with medium to dark skin, that means a diagnostic AI may simply not work as well. This article explains why skin tone representation matters in dermatology AI, how the Fitzpatrick scale provides a measurement framework, and what needs to change to build fair models.

What is the Fitzpatrick scale and why does it matter for AI?

Short answer: The Fitzpatrick scale classifies human skin into six types based on melanin content and UV response. In AI development, it gives researchers a consistent vocabulary for measuring whether training datasets include enough variation across the full tonal spectrum. Without a shared taxonomy, model builders have no standard way to prove their data is representative.

Dermatologist Thomas Fitzpatrick introduced the scale in 1975 to predict how patients would respond to phototherapy. Since then it has become the standard taxonomy for describing skin tone in clinical and AI research. For machine learning (ML) model builders, the Fitzpatrick scale acts as a checklist: if the training dataset contains mostly Types I-III images, the convolutional neural network (CNN) behind the model will develop systematic blind spots for Types IV-VI, and those blind spots translate directly into missed diagnoses.

Fitzpatrick Type	Description	UV / Sun response	Melanoma risk
I	Very fair, often freckled	Always burns, never tans	Highest
II	Fair	Burns easily, tans minimally	Very high
III	Medium, olive	Burns moderately, tans gradually	High
IV	Olive to light brown	Burns minimally, tans easily	Moderate
V	Brown	Rarely burns, tans darkly	Lower
VI	Deep brown to black	Almost never burns	Lowest (but late-stage diagnoses more common)

Types IV-VI account for the majority of the global population yet remain underrepresented in the image libraries that most AI teams train on.
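The checklist idea can be made concrete with a short dataset audit. The following is a minimal sketch, assuming each training image already carries a Fitzpatrick label; the function names, the 10% threshold, and the label counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter

FITZPATRICK_TYPES = ("I", "II", "III", "IV", "V", "VI")

def fitzpatrick_shares(labels):
    """Return the share of images per Fitzpatrick type (0.0 for absent types)."""
    counts = Counter(labels)
    total = sum(counts[t] for t in FITZPATRICK_TYPES)
    if total == 0:
        raise ValueError("no images with a recognised Fitzpatrick label")
    return {t: counts[t] / total for t in FITZPATRICK_TYPES}

def underrepresented(shares, min_share=0.10):
    """List the types whose share falls below min_share (an assumed threshold)."""
    return [t for t in FITZPATRICK_TYPES if shares[t] < min_share]

# Hypothetical label counts for illustration only, not real dataset figures.
labels = (["I"] * 190 + ["II"] * 350 + ["III"] * 300
          + ["IV"] * 110 + ["V"] * 35 + ["VI"] * 15)
shares = fitzpatrick_shares(labels)
print(underrepresented(shares))  # ['V', 'VI']
```

A real audit would choose the threshold against the patient population the tool is meant to serve rather than a fixed cut-off.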

How underrepresented are darker skin tones in dermatology datasets?

Short answer: Severely. Peer-reviewed audits show that fewer than 5% of the images in standard dermatology image libraries show skin of color. Because CNNs learn from pattern frequency, this imbalance pushes model confidence toward conditions as they appear on lighter skin, making errors on darker complexions statistically predictable rather than random.

A widely cited 2018 article in JAMA Dermatology by Adamson and Smith warned that darker skin tones are severely underrepresented in the images used for dermatology education and AI training, in both textbooks and clinical datasets (pubmed.ncbi.nlm.nih.gov). The reasons are structural: academic medical centres that digitized image archives decades ago served predominantly white patient populations, and those archives were later scraped to build AI training sets. By the time a research team begins training a skin lesion classifier, the data problem is already baked in.

The practical effect is visible in accuracy benchmarks. Models that achieve 90%+ sensitivity for melanoma on Fitzpatrick Types I-III can drop to 70% or below on Types V-VI. For a condition where early detection changes outcomes, that gap is clinically significant. The data problem also persists in newer tools: a 2024 study by Joerg et al. found that AI-generated dermatologic images show the same deficit in skin tone diversity (pubmed.ncbi.nlm.nih.gov).
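Sensitivity here means the share of true melanoma cases the model catches, TP / (TP + FN), computed separately for each skin-type group. A minimal sketch of that calculation follows; the function name, the group labels, and the toy predictions are assumptions for illustration, with the counts picked to mirror the 90% and 70% figures above.

```python
from collections import defaultdict

def sensitivity_by_group(y_true, y_pred, groups):
    """Sensitivity (true positive rate) per group: TP / (TP + FN)."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:  # only confirmed positive cases count toward sensitivity
            if pred == 1:
                tp[group] += 1
            else:
                fn[group] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}

# Hypothetical toy data: 10 melanoma cases and 5 benign cases per group.
y_true = [1] * 10 + [0] * 5 + [1] * 10 + [0] * 5
y_pred = [1] * 9 + [0] * 1 + [0] * 5 + [1] * 7 + [0] * 3 + [0] * 5
groups = ["I-III"] * 15 + ["V-VI"] * 15

for group, value in sorted(sensitivity_by_group(y_true, y_pred, groups).items()):
    print(f"Types {group}: sensitivity {value:.0%}")
```

Pooled across both groups, the same toy data gives 16 of 20 cases detected (80%), a single figure that describes neither group accurately.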

Can AI accurately tell skin undertone for diagnostic purposes?

Short answer: Consumer AI tools can estimate warm, cool, or neutral undertones with reasonable accuracy for lighter complexions, but performance drops for darker skin where high melanin concentrations absorb light in ways that compress the undertone signal. For clinical diagnosis, the challenge is harder: a model must distinguish subtle morphological differences across all tones, not just broad color categories.

Skin undertone analysis requires detecting how the dermis reflects light below the epidermis. Melanin absorbs light differently at higher concentrations, which compresses the signal that undertone tools rely on. This is one reason cosmetic AI tools built on lighter-skin data fail for darker complexions. Clinical diagnostic AI faces a related but more demanding version of this problem: conditions like basal cell carcinoma appear translucent and pearly on lighter skin, but can present as pigmented plaques on darker skin. Psoriasis plaques are often violet-brown on Fitzpatrick Types V-VI rather than the salmon-pink shown in most textbook images. A model trained only on the lighter-skin presentation will consistently miss the darker-skin version, not because the pathology differs, but because the visual pattern it learned does not match what it sees.

Responsible AI dermatology tools should benchmark every diagnostic module against Fitzpatrick-stratified accuracy targets before deployment, so clinicians know the accuracy profile across the full scale. This practice is increasingly required under Health Canada's Software as a Medical Device (SaMD) framework.
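One way to picture such a pre-deployment check is as a gate that each Fitzpatrick type must pass. The sketch below is an assumption-laden illustration, not a regulatory method: it compares the lower end of a Wilson score interval, rather than the raw sensitivity, against a hypothetical 85% target, so that a small subgroup sample cannot pass on a lucky point estimate. The function names, the target, and the counts are invented for the example.

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a proportion (z=1.96 ~ 95%)."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

def failing_types(results, target=0.85):
    """Return the types whose lower bound on sensitivity misses the target.

    results maps Fitzpatrick type -> (true_positives, total_positives).
    The 0.85 target is an assumed value, not one taken from any regulator.
    """
    return [t for t, (tp, n) in results.items()
            if wilson_lower_bound(tp, n) < target]

# Hypothetical validation counts: both types show 95% raw sensitivity,
# but Type VI has only 20 confirmed cases behind that number.
results = {"I": (190, 200), "VI": (19, 20)}
print(failing_types(results))  # ['VI']
```

In this toy case the gate flags Type VI even though its raw sensitivity matches Type I, because 20 cases are too few to support the claim with confidence.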

How reliable is AI dermatology overall?

Short answer: For conditions well-represented in training data, some AI dermatology tools approach the diagnostic accuracy of general practitioners. Reliability drops significantly for populations underrepresented in training data, particularly patients with Fitzpatrick Types IV-VI. Aggregate accuracy figures routinely hide these subgroup failures, so a single headline number is not enough to judge a tool.

A landmark study by Esteva et al. demonstrated that a CNN trained on 129,450 clinical images could classify skin cancer at a level comparable to board-certified dermatologists (nature.com). The critical qualifier is that the benchmark dataset carried the same demographic skew as most clinical archives. Reliability also varies by condition: melanoma detection, acne severity grading, and psoriasis identification have attracted extensive research attention, while rare conditions and those that present differently across skin types remain under-studied.

The World Health Organization (WHO) notes that access to specialist care varies dramatically by geography (who.int). For Canadian patients, telemedicine AI tools marketed as access solutions could deepen existing disparities if they perform poorly on the populations with the least specialist access. The National Institutes of Health (NIH) has published guidance calling for disaggregated performance reporting by demographic subgroup in clinical AI (nih.gov).

What does dermatology AI fairness actually require?

Short answer: Fair AI in dermatology requires three things: diverse training data collected from representative populations, disaggregated benchmarks that break accuracy down by Fitzpatrick type, and post-deployment monitoring that catches performance drift before it causes harm. None of these is optional if the goal is equitable diagnostic accuracy.

Diverse datasets are the starting point. This means actively recruiting image contributors from regions with higher concentrations of darker-skinned patients, partnering with dermatology clinics in sub-Saharan Africa, South Asia, and Latin America, and building data-sharing agreements that comply with applicable privacy legislation. In Canada, any patient data used in AI training must satisfy provincial privacy statutes and applicable federal law.

Disaggregated benchmarks require model developers to report accuracy not as a single aggregate number, but broken down by Fitzpatrick type, age, sex, and geography. Aggregate scores hide subgroup failures. The U.S. Food and Drug Administration (FDA) has begun requiring this kind of subgroup reporting in its guidance on AI and ML in medical devices. Post-deployment monitoring closes the loop: a model that performed adequately at launch may drift as patient populations shift, and only ongoing Fitzpatrick-stratified auditing can detect those gaps before they compound.
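Post-deployment monitoring of this kind can be sketched as a rolling check per subgroup. The example below is a minimal illustration under stated assumptions: it supposes that confirmed outcomes (for instance biopsy results) eventually arrive for each flagged or missed case, and the class name, window size, and 5-point tolerance are hypothetical choices rather than an established standard.

```python
from collections import deque

class SensitivityMonitor:
    """Rolling sensitivity for one subgroup over its most recent confirmed positives."""

    def __init__(self, baseline, window=200, tolerance=0.05):
        self.baseline = baseline            # sensitivity measured at launch
        self.tolerance = tolerance          # allowed drop before flagging drift
        self.outcomes = deque(maxlen=window)  # 1 = detected, 0 = missed

    def record(self, detected):
        """Log one confirmed positive case and whether the model caught it."""
        self.outcomes.append(1 if detected else 0)

    def rolling_sensitivity(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self):
        current = self.rolling_sensitivity()
        return current is not None and current < self.baseline - self.tolerance

# Hypothetical usage: one monitor per Fitzpatrick type, tiny window for the demo.
monitor = SensitivityMonitor(baseline=0.90, window=10, tolerance=0.05)
for detected in [True] * 9 + [False]:
    monitor.record(detected)
print(monitor.rolling_sensitivity(), monitor.drifted())
for _ in range(3):
    monitor.record(False)
print(monitor.rolling_sensitivity(), monitor.drifted())
```

A production version would likely also require a minimum number of recorded cases before raising a flag, since a near-empty window is noisy.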

Learn more about how DermaDex approaches these questions in the technical primer on how computer vision models detect skin conditions.

Does AI confirm or amplify existing biases in dermatology?

Short answer: AI trained on biased data does not merely reflect existing bias; it amplifies bias at scale. A physician who misses a diagnosis may refer the patient elsewhere. An AI system in a triage workflow applies its skewed confidence scores to every case without that corrective instinct, compounding disparities across thousands of consultations.

This feedback mechanism is why data diversity is so urgent. Systematic under-diagnosis of darker skin tones appears at population level as health disparities that can look like biological differences when they are actually data artifacts. Researchers have found that AI image generation models reproduce the same biases present in training data, generating synthetic images that skew toward lighter skin. This matters because synthetic augmentation is increasingly used to expand training sets: if the generator is biased, the augmented dataset inherits that bias at volume.
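The arithmetic behind that inheritance is a weighted average: the mixed dataset's share of darker-skin images always lies between the real set's share and the generator's share. A small sketch follows, with a hypothetical function name and invented numbers used purely for illustration.

```python
def combined_share(real_n, real_share, synth_n, synth_share):
    """Share of darker-skin images after mixing a real and a synthetic set."""
    return (real_n * real_share + synth_n * synth_share) / (real_n + synth_n)

# Hypothetical numbers: 10,000 real images, 5% of them showing darker skin.
# A generator that itself produces only 8% darker-skin images barely helps,
# even when it doubles the dataset.
print(f"{combined_share(10_000, 0.05, 10_000, 0.08):.1%}")

# A generator deliberately steered to 50% darker-skin output moves the mix
# much further with half as many synthetic images.
print(f"{combined_share(10_000, 0.05, 5_000, 0.50):.1%}")
```

The share is only part of the picture: the sketch says nothing about whether the synthetic images depict conditions realistically on darker skin, which is a separate validation question.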

Addressing this requires structural change: not just better algorithms, but different data collection practices, mandatory fairness audits, and clinical governance frameworks that treat Fitzpatrick-stratified accuracy as a minimum standard rather than an optional metric. You can read more about DermaDex's approach to transparent AI development on our about page.

Sources

Adamson AS, Smith A. Machine learning and health care disparities in dermatology. JAMA Dermatology. 2018. https://pubmed.ncbi.nlm.nih.gov/30073261/
Esteva A, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017. https://www.nature.com/articles/nature21056
Joerg L, et al. AI-generated dermatologic images show deficient skin tone diversity. PubMed/NCBI. 2024. https://pubmed.ncbi.nlm.nih.gov/38431553/
World Health Organization. Universal health coverage. https://www.who.int/health-topics/universal-health-coverage
National Institutes of Health. NIH Almanac. https://www.nih.gov/about-nih/what-we-do/nih-almanac
