Prediction-model Calibration: Hosmer-Lemeshow Test + Calibration Curve
Assess a prediction model's calibration, that is, whether its predicted probabilities match the observed event rates. Enter predicted probabilities and actual outcomes to get the Hosmer-Lemeshow goodness-of-fit test, the calibration curve, the calibration slope and intercept (calibration-in-the-large, CITL), and the Brier score. Discrimination (AUC) describes ranking ability; calibration describes how accurate the probabilities themselves are. The two are complementary. Everything is computed locally in your browser; your data are not uploaded.
① Input data
One subject per row, with two values separated by a space or a comma: the predicted probability (between 0 and 1), then the actual outcome (1 = event occurred, 0 = event did not occur). Predicted probabilities usually come from a logistic regression model or a nomogram.
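As a rough illustration of the input format described above, the following Python sketch parses such rows into two parallel lists. The function name `parse_rows` and its validation rules are assumptions made for this example, not the tool's actual code.

```python
def parse_rows(text):
    """Parse 'probability outcome' rows (space- or comma-separated).

    Hypothetical helper: returns (probs, outcomes) as parallel lists.
    """
    probs, outcomes = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Treat commas as whitespace so both separators are accepted.
        fields = line.replace(",", " ").split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 values, got {len(fields)}")
        prob = float(fields[0])
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"line {line_no}: probability {prob} is outside [0, 1]")
        if fields[1] not in ("0", "1"):
            raise ValueError(f"line {line_no}: outcome must be 0 or 1")
        probs.append(prob)
        outcomes.append(int(fields[1]))
    return probs, outcomes


probs, outcomes = parse_rows("0.12 0\n0.85,1\n\n0.40, 1\n")
```

Rejecting malformed rows early, rather than silently skipping them, is a design choice of this sketch; a mistyped outcome would otherwise distort every statistic computed afterwards.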
How to use & methodology
How does calibration differ from discrimination (AUC)?
Discrimination (AUC) measures how well the model ranks subjects who have the event above those who do not; calibration measures how closely the predicted probabilities match the actual event rate numerically. A model can have a high AUC yet be poorly calibrated (probabilities systematically too high or too low), so both should be assessed.
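The point that a high AUC does not guarantee good calibration can be illustrated with a small sketch. It uses a made-up eight-subject data set and two hypothetical helpers, `auc` and `brier`; cubing the probabilities is just one arbitrary order-preserving distortion chosen for the example.

```python
def auc(probs, outcomes):
    """Probability that a random event case outranks a random non-event case."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Made-up illustrative data (not from any real study).
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

# Cubing keeps the ranking intact but pushes every probability downward.
distorted = [p ** 3 for p in probs]

print("AUC:  ", auc(probs, outcomes), auc(distorted, outcomes))
print("Brier:", brier(probs, outcomes), brier(distorted, outcomes))
```

Because the distortion preserves the ordering, the AUC is identical for both sets of predictions, while the Brier score gets worse: the distorted probabilities now underestimate risk across the board.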
How do I read the Hosmer-Lemeshow P value?
The test looks for a mismatch between predicted and observed event counts. P > 0.05 means there is no evidence of mismatch, which is usually read as acceptable calibration; P < 0.05 suggests poor calibration. Note that the interpretation is the reverse of most hypothesis tests: here a non-significant result is the desired one. The test is also sensitive to sample size: it easily reaches significance in very large samples and is underpowered in small ones, so judge it alongside the calibration curve.
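For readers who want to see what sits behind the P value, here is a minimal sketch of the Hosmer-Lemeshow statistic in pure Python. It assumes equal-size groups after sorting by predicted probability, 10 groups by default, and g - 2 degrees of freedom; how this tool forms its groups is not stated on this page, so treat those details as assumptions. The helper names are hypothetical, and the chi-square tail uses a closed form that only holds for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(probs, outcomes, groups=10):
    """Return (statistic, p_value) using equal-size groups of sorted predictions."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)  # expected number of events
        observed = sum(y for _, y in chunk)  # observed number of events
        if not 0 < expected < size:
            raise ValueError("a group has an expected count of 0 or of its full size")
        # Contribution of events plus contribution of non-events.
        stat += (observed - expected) ** 2 / expected
        stat += (observed - expected) ** 2 / (size - expected)
    return stat, chi2_sf_even_df(stat, groups - 2)


# Made-up data: four risk levels, 10 subjects each, events matching the risk.
levels = [0.2, 0.4, 0.6, 0.8]
probs = [p for p in levels for _ in range(10)]
good = [1 if i < round(10 * p) else 0 for p in levels for i in range(10)]
bad = [1 if i < 5 else 0 for _ in levels for i in range(10)]  # 50% events everywhere

print(hosmer_lemeshow(probs, good, groups=4))
print(hosmer_lemeshow(probs, bad, groups=4))
```

In the first call observed and expected counts agree in every group, so the statistic is essentially zero and the P value is essentially 1; in the second, every risk level has a 50% event rate and the statistic is large, giving a small P value.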
What do the calibration slope and intercept mean?
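The ideal calibration slope is 1. A slope below 1 is typical of overfitting or over-confidence: predictions are too extreme, with high risks predicted too high and low risks too low. A slope above 1 means predictions are too conservative, that is, not spread out enough. The ideal calibration intercept (calibration-in-the-large, CITL) is 0: a positive value means risk is underestimated overall, a negative value means it is overestimated overall.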
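One common way to obtain these two quantities is logistic recalibration on the logit of the predicted probability: the slope is the coefficient b in logit(P(y=1)) = a + b·logit(p), and CITL is the intercept a when the slope is fixed at 1. The sketch below fits both by Newton-Raphson in pure Python. Whether this tool uses exactly this procedure is an assumption, the function names are hypothetical, and the code requires probabilities strictly between 0 and 1.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def citl(probs, outcomes, iters=50):
    """Intercept a in logit(P(y=1)) = a + logit(p), with the slope fixed at 1."""
    lp = [logit(p) for p in probs]
    a = 0.0
    for _ in range(iters):
        mu = [expit(a + z) for z in lp]
        grad = sum(y - m for y, m in zip(outcomes, mu))
        hess = sum(m * (1.0 - m) for m in mu)
        step = grad / hess
        a += step
        if abs(step) < 1e-10:
            break
    return a


def calibration_slope(probs, outcomes, iters=50):
    """Return (a, b) from the fit logit(P(y=1)) = a + b * logit(p)."""
    lp = [logit(p) for p in probs]
    a, b = 0.0, 1.0
    for _ in range(iters):
        mu = [expit(a + b * z) for z in lp]
        w = [m * (1.0 - m) for m in mu]
        g0 = sum(y - m for y, m in zip(outcomes, mu))
        g1 = sum((y - m) * z for y, m, z in zip(outcomes, mu, lp))
        h00 = sum(w)
        h01 = sum(wi * z for wi, z in zip(w, lp))
        h11 = sum(wi * z * z for wi, z in zip(w, lp))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 Newton system by hand.
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if max(abs(da), abs(db)) < 1e-10:
            break
    return a, b


# Made-up data: true event rates of 20/40/60/80%, 10 subjects per level.
rates = [0.2, 0.4, 0.6, 0.8]
outcomes = [1 if i < round(10 * r) else 0 for r in rates for i in range(10)]

# Over-confident predictions: logits twice as extreme as the true rates.
extreme = [expit(2.0 * logit(r)) for r in rates for _ in range(10)]
# Predictions that are too low overall: logits shifted down by 0.5.
too_low = [expit(logit(r) - 0.5) for r in rates for _ in range(10)]

print(calibration_slope(extreme, outcomes))
print(citl(too_low, outcomes))
```

In this constructed example the over-confident predictions yield a slope of about 0.5 (below 1), and the predictions that are too low yield a CITL of about +0.5 (positive, indicating underestimation), matching the reading rules given above.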
Where do the predicted probabilities come from?
Usually they are the event probabilities (between 0 and 1) that a logistic regression model or a nomogram computes for each subject. Assess calibration on an independent validation set rather than only on the development set; external validation gives a better picture of real-world performance.