AI & Technology | April 8, 2025 | 6 min read

What Confidence Scores Mean in AI Skin Diagnosis

When an AI skin analysis tool reports 87% confidence, that number has a specific statistical meaning with several important caveats. This article explains how machine learning confidence scores work, how they are calibrated, and how clinicians should interpret them.

As of April 8, 2025.

When an AI dermatology tool returns a confidence score alongside a skin-condition flag, patients and clinicians sometimes treat that number as a simple pass/fail grade. It is not. A confidence score is a probability estimate produced by a machine learning (ML) model, and it is only as trustworthy as the model's calibration. Reading it correctly changes how you act on the result.

This article explains what those numbers actually represent, where they break down, and how threshold choices affect the tradeoff between catching serious conditions and avoiding unnecessary referrals.

What is an AI confidence score?

Short answer: A confidence score is the model's estimated probability that its top-ranked prediction is correct for a given input image. A score of 0.87 means the model assigns 87% probability to its first-choice label, based on patterns learned during training. That estimate is only as reliable as the model's calibration, specifically how well its stated probabilities match actual outcomes in the real world.

In a convolutional neural network (CNN) used for skin image classification, the final layer applies a softmax function to convert raw logit values into a probability distribution across all possible output classes. The highest value in that distribution becomes the confidence score reported to the user. A model trained to distinguish eczema, psoriasis, and contact dermatitis outputs three probabilities that sum to 1.0; the top value is the confidence score. This is distinct from accuracy: accuracy measures performance across many cases, while a single confidence score describes the model's certainty about one specific image. The model's average confidence matches its observed accuracy only when the model is well calibrated on a large and representative test set.
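The softmax step can be sketched in a few lines of Python. This is a minimal illustration and not any vendor's implementation; the logit values are made up, and the names `softmax`, `labels`, and `logits` are assumptions introduced for the example.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1.0."""
    # Subtracting the max logit keeps exp() numerically stable
    # and does not change the resulting probabilities.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class labels and logits for one image (illustrative values only).
labels = ["eczema", "psoriasis", "contact dermatitis"]
logits = [2.0, 0.5, -1.0]

probs = softmax(logits)
top_index = max(range(len(probs)), key=probs.__getitem__)
predicted_label = labels[top_index]
confidence = probs[top_index]  # the number a tool would report as its confidence score

print(predicted_label, round(confidence, 2))
```

With these illustrative logits the top class is eczema with a confidence of about 0.79. Nothing in this calculation checks whether 0.79 is a trustworthy probability; that depends on calibration.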

What is a good confidence score?

Short answer: There is no universal threshold that qualifies as good. The appropriate cutoff depends on what you are screening for, the cost of a missed diagnosis versus a false alarm, and how well the model is calibrated on populations similar to the patient in front of you. For low-risk conditions a 0.75 threshold may be acceptable; for suspected melanoma, many clinicians prefer to flag anything below 0.90 for human review, even though this adds to the review workload.

Context matters more than the raw number. A 0.82 score for a clearly benign sebaceous cyst carries different clinical weight than a 0.82 score for a suspicious pigmented lesion. The artificial intelligence (AI) model does not know which question is more consequential; that judgment stays with the clinician. Research published in JAMA Dermatology has reported Area Under the Curve (AUC) values above 0.90 for AI melanoma detection, but AUC is a population-level metric: it summarizes how well the model ranks cases across a whole test set and says nothing about whether one particular score is reliable. At the individual scan level, confidence scores can still mislead, particularly on skin tones underrepresented in training data. Our technical primer on computer vision models covers how training data composition shapes output distributions.

Are AI models well-calibrated for medical use?

Short answer: Modern deep learning models are often poorly calibrated out of the box. A widely cited 2017 paper by Guo et al. demonstrated that deep neural networks tend to be overconfident: a model reporting 0.95 confidence may be correct only around 80% of the time on held-out data. Post-hoc calibration methods such as temperature scaling reduce this gap, but calibration degrades whenever the deployment population differs from the training population.
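Temperature scaling can be sketched as follows. This is a simplified illustration under stated assumptions: the validation logits and labels are invented, the set is far too small for real use, and the temperature is fitted with a coarse grid search where a production system would normally use a numerical optimizer. The function names `mean_nll` and `fit_temperature` are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, true_indices, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(softmax(row, temperature)[y])
        for row, y in zip(logit_rows, true_indices)
    ]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, true_indices):
    """Grid-search T in [0.5, 5.0] for the lowest validation NLL (an assumed, simple fit)."""
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: mean_nll(logit_rows, true_indices, t))

# Hypothetical validation logits (3 classes) and true class indices.
# The model is confidently wrong on the last two rows, i.e. overconfident.
val_logits = [
    [4.0, 0.0, -1.0],
    [3.5, 0.5, 0.0],
    [0.0, 4.0, 0.5],
    [4.0, 1.0, 0.0],
    [0.5, 0.0, 3.5],
]
val_labels = [0, 0, 1, 1, 0]

T = fit_temperature(val_logits, val_labels)
before = max(softmax(val_logits[0]))
after = max(softmax(val_logits[0], T))
print(f"T={T:.2f}, confidence {before:.2f} -> {after:.2f}")
```

Because the fitted temperature is greater than 1 on this overconfident toy data, every reported confidence shrinks toward a more modest value. The predicted class does not change, since dividing all logits by the same positive number preserves their ranking.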

For skin diagnosis, calibration drift is a consistent practical concern. A model trained predominantly on lighter Fitzpatrick skin types (I through III) may output high confidence scores on darker skin tones (IV through VI) while making more errors, because the softmax output reflects pattern similarity to training examples regardless of whether those examples were representative. This is one reason why skin tone diversity in training data matters. A 2018 study by Haenssle et al. comparing a CNN against 58 dermatologists on dermoscopic melanoma recognition found that the CNN achieved higher specificity than most dermatologists at a matched sensitivity, and that the dermatologists' own performance improved when they were given close-up images and clinical information in addition to dermoscopy. Health Canada has not yet issued specific calibration standards for AI-assisted dermatology tools, but the FDA's guidance on Software as a Medical Device (SaMD) references performance across demographic subgroups as a pre-market evaluation requirement.
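One simple way to look for this kind of subgroup drift is to compare mean confidence with observed accuracy within each group. The sketch below assumes a hypothetical audit log of (group, confidence, correct) records; the data are invented to illustrate the pattern and are not results from any study.

```python
from collections import defaultdict

def calibration_gap_by_group(records):
    """Return {group: (mean_confidence, accuracy, gap)}, with gap = confidence - accuracy."""
    buckets = defaultdict(list)
    for group, confidence, correct in records:
        buckets[group].append((confidence, correct))
    summary = {}
    for group, rows in buckets.items():
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        summary[group] = (mean_conf, accuracy, mean_conf - accuracy)
    return summary

# Hypothetical audit records: (Fitzpatrick group, top-class confidence, prediction correct?)
records = [
    ("I-III", 0.92, True), ("I-III", 0.88, True), ("I-III", 0.90, True),
    ("I-III", 0.85, False), ("I-III", 0.95, True),
    ("IV-VI", 0.91, True), ("IV-VI", 0.89, False), ("IV-VI", 0.93, False),
    ("IV-VI", 0.87, True), ("IV-VI", 0.90, False),
]

summary = calibration_gap_by_group(records)
for group, (mean_conf, accuracy, gap) in summary.items():
    print(f"{group}: confidence {mean_conf:.2f}, accuracy {accuracy:.2f}, gap {gap:+.2f}")
```

In this toy data both groups receive the same average confidence (0.90), but accuracy is 0.80 for one group and 0.40 for the other. A large positive gap is the signature of overconfidence, and it is invisible if you only look at the scores themselves.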

How do confidence thresholds affect false positives and false negatives?

Short answer: Setting a higher confidence threshold, for example 0.90 instead of 0.75, reduces the number of low-probability cases the model accepts as definitive. This lowers the false-positive rate but raises the false-negative rate. Every threshold choice is a tradeoff, and the right setting depends on the clinical consequence of each error type in your specific workflow.
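The tradeoff is easy to see by counting errors at a few thresholds. This sketch assumes a binary setup in which the score is the model's probability for the condition of interest and any score at or above the threshold is treated as a positive flag; the scores and labels are hypothetical.

```python
def error_counts(scores, is_positive, threshold):
    """Count false positives and false negatives when score >= threshold means 'flag'."""
    false_pos = sum(1 for s, y in zip(scores, is_positive) if s >= threshold and not y)
    false_neg = sum(1 for s, y in zip(scores, is_positive) if s < threshold and y)
    return false_pos, false_neg

# Hypothetical positive-class scores and ground-truth labels for ten scans.
scores      = [0.95, 0.91, 0.88, 0.82, 0.79, 0.76, 0.70, 0.65, 0.55, 0.40]
is_positive = [True, True, False, True, True, False, False, True, False, False]

results = {t: error_counts(scores, is_positive, t) for t in (0.60, 0.75, 0.90)}
for t, (fp, fn) in results.items():
    print(f"threshold {t:.2f}: false positives={fp}, false negatives={fn}")
```

On this toy data, raising the threshold from 0.60 to 0.90 takes false positives from 3 to 0 while false negatives climb from 0 to 3. No threshold removes both error types at once.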

The Receiver Operating Characteristic (ROC) curve visualizes this tradeoff across all possible thresholds. For a screening tool the goal is high sensitivity (catching most true positives), which usually means accepting a higher false-positive rate. For a tool used to confirm a diagnosis before a procedure, high specificity matters more. The table below maps common threshold ranges to recommended actions and expected error profiles for a general-purpose skin lesion classifier:

| Confidence threshold | Recommended clinical action | False-positive risk | False-negative risk |
|---|---|---|---|
| < 0.60 | Flag for mandatory human review; do not act on AI output | Low | High |
| 0.60 to 0.74 | Present to clinician alongside differential; treat as suggestive only | Moderate | Moderate-high |
| 0.75 to 0.84 | Use as supporting evidence; document in chart; consider follow-up | Moderate | Moderate |
| 0.85 to 0.89 | Provisionally act on finding; schedule follow-up to confirm | Moderate-low | Moderate-low |
| 0.90 and above | High confidence; suitable as primary factor in triage decision | Low | Low |

These ranges assume a well-calibrated model evaluated on a demographically diverse dataset. Use more conservative cutoffs (that is, require a higher score before acting on the AI output) for any subpopulation underrepresented in the model's training set, and revisit thresholds whenever a new model version is deployed.
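The bands in the table above could be encoded as a small lookup so that the recommended action is applied consistently. This is a sketch, not a clinical rule engine: the function name `recommended_action` is hypothetical, and the bands are treated as half-open intervals (for example, 0.745 falls in the 0.60 to 0.74 band), which is an assumption the table itself does not spell out.

```python
def recommended_action(confidence):
    """Map a confidence score in [0, 1] to the action band from the table."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Assumed half-open bands: each band includes its lower bound only.
    if confidence < 0.60:
        return "Flag for mandatory human review; do not act on AI output"
    if confidence < 0.75:
        return "Present to clinician alongside differential; treat as suggestive only"
    if confidence < 0.85:
        return "Use as supporting evidence; document in chart; consider follow-up"
    if confidence < 0.90:
        return "Provisionally act on finding; schedule follow-up to confirm"
    return "High confidence; suitable as primary factor in triage decision"

for score in (0.55, 0.78, 0.91):
    print(f"{score:.2f}: {recommended_action(score)}")
```

Keeping the cutoffs in one function also makes it straightforward to swap in stricter values for a subgroup or a new model version without touching the rest of the workflow.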

How should a probability score change patient or clinician behaviour?

Short answer: A high confidence score should raise the probability you assign to a diagnosis, not replace clinical judgment. Treat it as one structured input among several, alongside symptom history, lesion morphology, and patient context. A score below your threshold for unilateral action should trigger a referral or second look, not dismissal of the AI output entirely.

In practice this means documenting the score alongside the clinical note. If a CNN's output was 0.78 for atopic dermatitis and the clinician agreed with that read, the chart should record both. If the score was 0.91 for a benign seborrheic keratosis but the lesion has an atypical border, the clinician's physical finding overrides the probability estimate. This is consistent with how the Canadian Medical Association (CMA) frames physician responsibility for AI-assisted decisions: the physician remains accountable for the final decision regardless of what the algorithm outputs. The Personal Health Information Protection Act (PHIPA) in Ontario similarly places data stewardship and care accountability on the regulated health professional, not the software vendor.

For patients, the score is less useful as a standalone figure. A plain-language statement such as "the scan strongly suggests eczema and a dermatologist will review" communicates more than a raw percentage and prevents over-reliance on a number that carries technical assumptions most patients cannot evaluate on their own. Clinicians who explain confidence scores in plain terms typically see higher patient engagement with recommended follow-up steps.

Sources

  • Esteva, A., et al. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 115-118.
  • Haenssle, H.A., et al. (2018). Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology.
  • National Institutes of Health (NIH). Artificial intelligence in health and biomedical research.
