
Prediction-model Calibration: Hosmer-Lemeshow Test + Calibration Curve

Assess whether a prediction model's predicted probabilities match the observed event rate (calibration). Enter predicted probabilities and actual outcomes to get the Hosmer-Lemeshow goodness-of-fit test, the calibration curve, the calibration slope and intercept (calibration-in-the-large, CITL), and the Brier score. Discrimination (AUC) describes how well the model ranks subjects; calibration describes how accurate the probabilities themselves are. The two are complementary. Everything is computed locally in your browser; your data are not uploaded.

① Input data

One subject per row: the predicted probability (0 to 1) followed by the actual outcome (1 = event occurred, 0 = did not), separated by a space or a comma. Predicted probabilities usually come from logistic regression or a nomogram.
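To make the input format concrete, here is a minimal Python sketch of how such rows could be parsed and validated. The function name `parse_rows` and its error messages are hypothetical; this is not the tool's own parser, only an illustration of the format described above.

```python
import re

def parse_rows(text):
    """Parse rows of 'probability outcome' separated by spaces or commas."""
    probs, outcomes = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 values, got {len(fields)}")
        prob = float(fields[0])
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"line {line_no}: probability {prob} is outside [0, 1]")
        if fields[1] not in ("0", "1"):
            raise ValueError(f"line {line_no}: outcome must be 0 or 1")
        probs.append(prob)
        outcomes.append(int(fields[1]))
    return probs, outcomes

# Mixed separators and a blank line are both accepted.
probs, outcomes = parse_rows("0.12 0\n0.80,1\n\n0.55, 1")
```

Rejecting out-of-range probabilities early matters because the later calculations take the logit of each probability and assume a binary outcome.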

Number of groups:

How to use & methodology

How does calibration differ from discrimination (AUC)?

Discrimination (AUC) measures how well the model ranks subjects who have the event above those who do not; calibration measures how closely the predicted probabilities match the actual event rate numerically. A model can have a high AUC yet be poorly calibrated, with probabilities that are systematically too high or too low. Both should be assessed.
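The difference can be illustrated with a small Python sketch on made-up data. Raising every probability to the power 0.25 inflates all predictions but keeps their order, so the AUC is unchanged while the Brier score (mean squared difference between prediction and outcome) gets worse. The helper names `auc` and `brier` and the eight data points are illustrative assumptions only.

```python
def auc(probs, outcomes):
    """Probability that a random event case is ranked above a random non-event case."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def brier(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Made-up example data (not from any real model).
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

# A monotone transform: every risk is pushed upward, ranking is preserved.
inflated = [p ** 0.25 for p in probs]

print("AUC original:  ", auc(probs, outcomes))
print("AUC inflated:  ", auc(inflated, outcomes))    # same ranking, same AUC
print("Brier original:", brier(probs, outcomes))
print("Brier inflated:", brier(inflated, outcomes))  # larger, i.e. worse
```

The pairwise AUC loop is quadratic in the number of subjects, which is fine for a demonstration but not for large datasets.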

How do I read the Hosmer-Lemeshow P value?

The test looks for a mismatch between predicted and observed event counts across risk groups. P > 0.05 means no evidence of mismatch was detected, which is conventionally read as acceptable calibration; P < 0.05 suggests poor calibration. Note that the direction is the opposite of most tests: here a non-significant result is the desired one. The test is also sensitive to sample size: it easily reaches significance with very large samples and is underpowered with small ones, so interpret it alongside the calibration curve.
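As a rough illustration of what the test computes, the following Python sketch sorts subjects by predicted risk, splits them into equal-sized groups, and compares observed with expected event counts. It rests on several assumptions that may differ from this tool: equal-size grouping, degrees of freedom equal to the number of groups minus 2, and a hand-rolled chi-square tail function (`chi2_sf`, a hypothetical helper) so that only the standard library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability via the series for the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def hosmer_lemeshow(probs, outcomes, groups=10):
    """Return (statistic, df, p_value, curve); curve holds one
    (mean predicted, observed rate) point per group."""
    if groups < 3:
        raise ValueError("need at least 3 groups")
    pairs = sorted(zip(probs, outcomes))  # order subjects by predicted risk
    n = len(pairs)
    stat, curve = 0.0, []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        if size == 0:
            raise ValueError("more groups than subjects")
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        if not 0.0 < expected < size:
            raise ValueError("a group has an expected count of 0 or its full size")
        # Sum of the event and non-event chi-square terms for this group.
        stat += (observed - expected) ** 2 * (1 / expected + 1 / (size - expected))
        curve.append((expected / size, observed / size))
    df = groups - 2  # assumed convention
    return stat, df, chi2_sf(stat, df), curve

# Made-up data: four risk levels, ten subjects each.
probs = [0.1] * 10 + [0.3] * 10 + [0.6] * 10 + [0.9] * 10
good = [1] * 1 + [0] * 9 + [1] * 3 + [0] * 7 + [1] * 6 + [0] * 4 + [1] * 9 + [0] * 1
bad = ([1] * 5 + [0] * 5) * 4  # every group has a 50% event rate

stat_good, df, p_good, curve_good = hosmer_lemeshow(probs, good, groups=4)
stat_bad, _, p_bad, curve_bad = hosmer_lemeshow(probs, bad, groups=4)
print(f"matched:    chi2={stat_good:.3f}, df={df}, P={p_good:.4f}")
print(f"mismatched: chi2={stat_bad:.3f}, df={df}, P={p_bad:.2e}")
```

The `curve` points are the same per-group pairs one would plot as a grouped calibration curve, with the diagonal representing perfect agreement.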

What do the calibration slope and intercept mean?

The ideal calibration slope is 1. A slope below 1 is common with overfitting or over-confidence (predictions are too extreme: high risks too high, low risks too low); a slope above 1 means predictions are too conservative (not spread out enough). The ideal calibration intercept (CITL) is 0: a positive value means risk is underestimated overall, and a negative value means it is overestimated overall.
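One common way to estimate these two quantities is logistic recalibration: regress the outcome on the logit of the predicted probability to get the slope, and refit with the slope fixed at 1 (the logit as an offset) to get CITL. The Python sketch below does this with plain Newton-Raphson steps. It is an assumption that this tool uses the same approach, and the function names, clipping constant, and iteration limits are illustrative.

```python
import math

def logit(p, eps=1e-10):
    p = min(max(p, eps), 1.0 - eps)  # clip so 0 and 1 stay finite
    return math.log(p / (1.0 - p))

def expit(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def calibration_in_the_large(probs, outcomes, iters=50, tol=1e-10):
    """Intercept a in logit(P(y=1)) = a + logit(p), with the slope fixed at 1."""
    x = [logit(p) for p in probs]
    a = 0.0
    for _ in range(iters):
        mu = [expit(a + xi) for xi in x]
        grad = sum(y - m for y, m in zip(outcomes, mu))
        hess = sum(m * (1.0 - m) for m in mu)
        step = grad / hess
        a += step
        if abs(step) < tol:
            break
    return a

def calibration_slope(probs, outcomes, iters=50, tol=1e-10):
    """Fit logit(P(y=1)) = a + b * logit(p) and return (a, b).
    Fails if all predictions are identical or the data are perfectly separated."""
    x = [logit(p) for p in probs]
    a, b = 0.0, 1.0
    for _ in range(iters):
        mu = [expit(a + b * xi) for xi in x]
        w = [m * (1.0 - m) for m in mu]
        g0 = sum(y - m for y, m in zip(outcomes, mu))
        g1 = sum((y - m) * xi for y, m, xi in zip(outcomes, mu, x))
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det  # solve the 2x2 Newton system
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if max(abs(da), abs(db)) < tol:
            break
    return a, b

# Made-up over-confident model: reported logits are twice the true logits.
true_risk = [0.2] * 10 + [0.4] * 10 + [0.6] * 10 + [0.8] * 10
outcomes = [1] * 2 + [0] * 8 + [1] * 4 + [0] * 6 + [1] * 6 + [0] * 4 + [1] * 8 + [0] * 2
probs = [expit(2.0 * logit(q)) for q in true_risk]
intercept, slope = calibration_slope(probs, outcomes)
print(f"slope={slope:.3f}")  # close to 0.5: predictions are too extreme

# Made-up under-estimating model: predicts 20% where half the subjects have the event.
citl = calibration_in_the_large([0.2] * 10, [1] * 5 + [0] * 5)
print(f"CITL={citl:.3f}")  # positive: risk is underestimated overall
```

In the first example the slope comes out near 0.5 because the true logits are exactly half the reported ones; in the second, CITL is positive because the observed event rate exceeds the average prediction.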

Where do the predicted probabilities come from?

Usually they are the event probabilities (0 to 1) computed for each subject by logistic regression or a nomogram. Assess calibration on an independent validation set rather than only on the development set; external validation better reflects real-world performance.