What is a good AI GPT score?

This question refers to language model confidence scoring, which works differently from image-based diagnostic AI. Large language models like GPT do not output traditional confidence scores alongside responses; instead they generate token probabilities that some interfaces surface as a proxy for certainty. In a medical context, these token-level probabilities are not calibrated for clinical use and should not be treated as equivalent to validated diagnostic confidence scores from purpose-built medical AI tools. For skin diagnosis, rely on tools that report AUC (Area Under the Curve) values and calibration metrics measured on demographically diverse dermatology datasets, rather than general-purpose language model outputs.

Is Gemini or ChatGPT better for medical questions?

Neither Gemini nor ChatGPT is a validated medical diagnostic tool. Both are general-purpose language models that can summarize medical literature and describe conditions, but neither produces clinically calibrated diagnostic confidence scores for skin conditions. They were not trained or validated as classifiers on labeled dermatology image datasets, and they have not been evaluated under Health Canada or FDA frameworks for Software as a Medical Device (SaMD). For AI-assisted skin diagnosis, purpose-built tools trained and evaluated on dermatology image datasets with known sensitivity and specificity metrics are the appropriate choice. Always consult a certified dermatologist for any diagnosis or treatment decision.

AI Confidence Scores in Skin Diagnosis | DermaDex

Q: What is an AI confidence score?

An AI confidence score is the probability value a machine learning model assigns to its top prediction for a given input. In skin diagnosis, a score of 0.87 means the model places 87% probability on its first-choice diagnosis based on patterns in the image. The score comes from the final softmax layer of a convolutional neural network (CNN), which converts raw output values into a probability distribution. A higher score indicates the model found strong visual evidence matching a known pattern, but it does not guarantee correctness. Calibration, meaning how well stated probabilities match real-world accuracy, determines whether the number is trustworthy. Well-calibrated models produce roughly 87% accuracy at the 0.87 confidence level; poorly calibrated ones may be systematically over- or underconfident.

Q: What is a good confidence score?

A good confidence score is one that is high relative to the model's own calibrated performance on cases similar to the one in question. There is no single universal cutoff. For general-purpose skin screening, many clinical workflows treat scores at or above 0.85 as reliable enough to inform triage decisions, while routing anything below 0.75 to mandatory human review. The appropriate threshold shifts based on clinical stakes: suspected melanoma warrants a higher bar than a suspected benign sebaceous cyst. The most useful question is not whether the score exceeds a fixed number but whether the model was calibrated on data that includes patients like the one being assessed, covering skin tone, age range, and lesion type.

As of April 8, 2025.

When an AI dermatology tool returns a confidence score alongside a skin-condition flag, patients and clinicians sometimes treat that number as a simple pass/fail grade. It is not. A confidence score is a probability estimate produced by a machine learning (ML) model, and it is only as trustworthy as the model's calibration. Reading it correctly changes how you act on the result.

This article explains what those numbers actually represent, where they break down, and how threshold choices affect the tradeoff between catching serious conditions and avoiding unnecessary referrals.

What is an AI confidence score?

Short answer: A confidence score is the model's estimated probability that its top-ranked prediction is correct for a given input image. A score of 0.87 means the model assigns 87% probability to its first-choice label, based on patterns learned during training. That estimate is only as reliable as the model's calibration, specifically how well its stated probabilities match actual outcomes in the real world.

In a convolutional neural network (CNN) used for skin image classification, the final layer applies a softmax function to convert raw logit values into a probability distribution across all possible output classes. The highest value in that distribution becomes the confidence score reported to the user. A model trained to distinguish eczema, psoriasis, and contact dermatitis outputs three probabilities that sum to 1.0; the top value is the confidence score. This is distinct from accuracy: accuracy measures performance across many cases, while a single confidence score describes the model's certainty about one specific image. The two numbers converge only when the model is well-calibrated across a large and representative test set.
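The softmax step described above can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation: the logit values are made up, and `top_prediction` is a hypothetical helper name.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1.0."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits, labels):
    """Return the highest-probability label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

labels = ["eczema", "psoriasis", "contact dermatitis"]
logits = [2.9, 0.6, 0.4]  # made-up raw outputs for one image

label, confidence = top_prediction(logits, labels)
print(label, round(confidence, 2))
```

The printed value is the number a user would see as the confidence score. It says nothing by itself about how often predictions at that level turn out to be right.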

What is a good confidence score?

Short answer: There is no universal threshold that qualifies as good. The appropriate cutoff depends on what you are screening for, the cost of a missed diagnosis versus a false alarm, and how well the model is calibrated on populations similar to the patient in front of you. For low-risk conditions a 0.75 threshold may be acceptable; for suspected melanoma, many clinicians prefer to flag anything below 0.90 for human review, even though that adds to the review workload.

Context matters more than the raw number. A 0.82 score for a clearly benign sebaceous cyst carries different clinical weight than a 0.82 score for a suspicious pigmented lesion. The artificial intelligence (AI) model does not know which question is more consequential; that judgment stays with the clinician. Research published in JAMA Dermatology has reported that AI models can reach Area Under the Curve (AUC) values above 0.90 for melanoma detection, but AUC is a population-level metric. At the individual scan level, confidence scores can still mislead, particularly on skin tones underrepresented in training data. Our technical primer on computer vision models covers how training data composition shapes output distributions.

Are AI models well-calibrated for medical use?

Short answer: Modern deep learning models are often poorly calibrated out of the box. A widely cited 2017 paper by Guo et al. demonstrated that deep neural networks tend to be overconfident: a model reporting 0.95 confidence may be correct only around 80% of the time on held-out data. Post-hoc calibration methods such as temperature scaling reduce this gap, but calibration degrades whenever the deployment population differs from the training population.
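To make temperature scaling concrete, here is a minimal sketch under stated assumptions: the validation logits and labels are invented, and `fit_temperature` uses a simple grid search where a real pipeline would normally use a numerical optimizer. The idea is to divide every logit by one scalar T, chosen to minimize negative log-likelihood on held-out data.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(val_logits, val_labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(z, temperature)[y])
              for z, y in zip(val_logits, val_labels)]
    return sum(losses) / len(losses)

def fit_temperature(val_logits, val_labels):
    """Hypothetical helper: grid-search T over 0.50 .. 5.00 in steps of 0.05."""
    candidates = [0.5 + 0.05 * i for i in range(91)]
    return min(candidates, key=lambda t: nll(val_logits, val_labels, t))

# Made-up validation set: four very confident predictions, one of them wrong.
val_logits = [[6.0, 0.0, 0.0], [0.0, 6.0, 0.0], [0.0, 0.0, 6.0], [6.0, 0.0, 0.0]]
val_labels = [0, 1, 2, 1]

T = fit_temperature(val_logits, val_labels)
before = max(softmax(val_logits[0]))
after = max(softmax(val_logits[0], T))
print(f"T={T:.2f}  confidence before={before:.3f}  after={after:.3f}")
```

On this toy set the raw confidence is above 0.99 although only three of four predictions are correct; after scaling, the top confidence falls to roughly the observed accuracy. Dividing by T does not change which class ranks first, so accuracy is unaffected.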

For skin diagnosis, calibration drift is a consistent practical concern. A model trained predominantly on lighter Fitzpatrick skin types (I through III) may output high confidence scores on darker skin tones (IV through VI) while making more errors, because the softmax output reflects pattern similarity to training examples regardless of whether those examples were representative. This is one reason why skin tone diversity in training data matters. A 2018 study by Haenssle et al. comparing a CNN against 58 dermatologists on melanoma detection found that the CNN achieved higher specificity than most of the dermatologists at comparable sensitivity, and that the dermatologists' performance improved when they were given additional clinical information and close-up images. Health Canada has not yet issued specific calibration standards for AI-assisted dermatology tools, but the FDA's guidance on Software as a Medical Device (SaMD) references performance across demographic subgroups as a pre-market evaluation requirement.
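One simple way to look for this kind of drift is to compare mean stated confidence with observed accuracy inside each subgroup. The sketch below is illustrative only: the records are entirely synthetic, and `calibration_gap_by_group` is a hypothetical helper, not part of any named tool.

```python
from collections import defaultdict

def calibration_gap_by_group(records):
    """Return {group: mean confidence - observed accuracy}.

    A positive gap means the model is overconfident for that group.
    """
    sums = defaultdict(lambda: [0.0, 0, 0])  # confidence total, correct count, n
    for group, confidence, correct in records:
        s = sums[group]
        s[0] += confidence
        s[1] += int(correct)
        s[2] += 1
    return {g: s[0] / s[2] - s[1] / s[2] for g, s in sums.items()}

# Synthetic records: (Fitzpatrick group, stated confidence, prediction correct?)
records = (
    [("I-III", 0.90, True)] * 4 + [("I-III", 0.90, False)] * 1
    + [("IV-VI", 0.90, True)] * 3 + [("IV-VI", 0.90, False)] * 2
)

gaps = calibration_gap_by_group(records)
for group, gap in gaps.items():
    print(f"{group}: overconfidence gap = {gap:+.2f}")
```

In this made-up example both groups receive the same 0.90 score, yet the gap is three times larger for the second group. A real audit would need far more cases per group and would bin scores rather than pool them.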

How do confidence thresholds affect false positives and false negatives?

Short answer: Setting a higher confidence threshold, for example 0.90 instead of 0.75, reduces the number of low-probability cases the model accepts as definitive. This lowers the false-positive rate but raises the false-negative rate. Every threshold choice is a tradeoff, and the right setting depends on the clinical consequence of each error type in your specific workflow.

The Receiver Operating Characteristic (ROC) curve visualizes this tradeoff across all possible thresholds. For a screening tool the goal is high sensitivity (catching most true positives), which usually means accepting a higher false-positive rate. For a tool used to confirm a diagnosis before a procedure, high specificity matters more. The table below maps common threshold ranges to recommended actions and expected error profiles for a general-purpose skin lesion classifier:
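The tradeoff can be seen by counting errors at two thresholds on a toy dataset. In this sketch the scores are assumed to be the model's probability for the positive class (for example, "suspicious lesion"), a case is flagged when its score meets the threshold, and all values are invented for illustration.

```python
def confusion_at(scores, labels, threshold):
    """Count TP, FP, TN, FN when every score >= threshold is flagged positive."""
    tp = fp = tn = fn = 0
    for score, is_positive in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_positive:
            tp += 1
        elif flagged and not is_positive:
            fp += 1
        elif not flagged and is_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def roc_point(scores, labels, threshold):
    """Return (sensitivity, false-positive rate): one point on the ROC curve."""
    tp, fp, tn, fn = confusion_at(scores, labels, threshold)
    return tp / (tp + fn), fp / (fp + tn)

# Made-up scores: the first four cases are true positives, the rest are negatives.
scores = [0.95, 0.91, 0.80, 0.62, 0.88, 0.77, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

for threshold in (0.75, 0.90):
    sensitivity, fpr = roc_point(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sensitivity:.2f}  "
          f"false-positive rate={fpr:.2f}")
```

Raising the threshold from 0.75 to 0.90 removes both false positives in this toy set but also misses one more true case. Sweeping the threshold across its full range traces out the ROC curve.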

Confidence score range	Recommended clinical action	False-positive risk	False-negative risk
< 0.60	Flag for mandatory human review; do not act on AI output	Low	High
0.60 to 0.74	Present to clinician alongside differential; treat as suggestive only	Moderate	Moderate-high
0.75 to 0.84	Use as supporting evidence; document in chart; consider follow-up	Moderate	Moderate
0.85 to 0.89	Provisionally act on finding; schedule follow-up to confirm	Moderate-low	Moderate-low
0.90 and above	High confidence; suitable as primary factor in triage decision	Low	Low

These ranges assume a well-calibrated model evaluated on a demographically diverse dataset. For any subpopulation underrepresented in the model's training set, raise the cutoffs, that is, require a higher score before acting without human review, because the model is more likely to be overconfident there. Revisit thresholds whenever a new model version is deployed.
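In software, the table above amounts to a small lookup. The sketch below is one possible encoding, assuming the bands are applied by their lower bounds so that a score such as 0.745 falls into the 0.60 band; `recommended_action` is a hypothetical function name.

```python
# Lower bound of each band, highest first, paired with the action from the table.
BANDS = [
    (0.90, "High confidence; suitable as primary factor in triage decision"),
    (0.85, "Provisionally act on finding; schedule follow-up to confirm"),
    (0.75, "Use as supporting evidence; document in chart; consider follow-up"),
    (0.60, "Present to clinician alongside differential; treat as suggestive only"),
]
BELOW_ALL_BANDS = "Flag for mandatory human review; do not act on AI output"

def recommended_action(score):
    """Map a confidence score in [0, 1] to the table's recommended action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    for lower_bound, action in BANDS:
        if score >= lower_bound:
            return action
    return BELOW_ALL_BANDS

for score in (0.55, 0.78, 0.87, 0.93):
    print(f"{score:.2f} -> {recommended_action(score)}")
```

Keeping the bounds in one list makes it straightforward to review or change them when a new model version is deployed.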

How should a probability score change patient or clinician behaviour?

Short answer: A high confidence score should raise the probability you assign to a diagnosis, not replace clinical judgment. Treat it as one structured input among several, alongside symptom history, lesion morphology, and patient context. A score below your threshold for acting without further review should trigger a referral or second look, not dismissal of the AI output entirely.

In practice this means documenting the score alongside the clinical note. If a CNN's output was 0.78 for atopic dermatitis and the clinician agreed with that read, the chart should record both. If the score was 0.91 for a benign seborrheic keratosis but the lesion has an atypical border, the clinician's physical finding overrides the probability estimate. This is consistent with how the Canadian Medical Association (CMA) frames physician responsibility for AI-assisted decisions: the physician remains accountable for the final decision regardless of what the algorithm outputs. The Personal Health Information Protection Act (PHIPA) in Ontario similarly places responsibility for safeguarding personal health information on the health information custodian, not the software vendor.

For patients, the score is less useful as a standalone figure. A plain-language statement such as "the scan strongly suggests eczema and a dermatologist will review" communicates more than a raw percentage and prevents over-reliance on a number that carries technical assumptions most patients cannot evaluate on their own. Explaining confidence scores in plain terms may also make patients more likely to follow through on recommended follow-up steps.

Sources

Esteva, A., et al. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 115-118.
Haenssle, H.A., et al. (2018). Man against machine: deep learning convolutional neural network for dermoscopic melanoma recognition versus 58 dermatologists. Annals of Oncology.
National Institutes of Health (NIH). Artificial intelligence in health and biomedical research.
