Probability Calibration in Machine Learning

In a typical machine learning classifier, the model outputs a number between 0 and 1 that's labeled "probability." For most models, this number is a confidence score, not a probability — and the difference matters enormously when those numbers drive real decisions.

This article walks through why most machine learning models produce uncalibrated probabilities by default, the standard methods for calibrating them (Platt scaling, isotonic regression, and temperature scaling), and the discipline that separates real calibration from cosmetic calibration.

We covered the bettor's-eye view of calibration in Probability Calibration for Bettors. This article is the practitioner's-eye view — the technical version of the same concept for readers who want to understand the machinery.

What "calibrated" means, formally

A binary classifier is calibrated when, among predictions it assigns probability p, the actual outcome rate is also p.

If you bucket all predictions where the model said 60% and check how often the predicted event actually occurred, you should get approximately 60%. If instead the event occurred 50% of the time, the model is overconfident at the 60% bucket. If it occurred 70% of the time, the model is underconfident.

For use cases where the probability itself drives the decision, calibration matters more than raw accuracy. A model that predicts 55% on bets that hit 55% of the time is more useful than a model that predicts 70% on bets that hit 55% of the time, even though both have identical accuracy. The first model lets you size bets correctly. The second will lead you to overbet your edge.

The standard visualization is the calibration plot (also called a reliability diagram): predicted probability on the X axis, observed frequency on the Y axis. A perfectly calibrated model produces points on the diagonal. Deviations show where the model is systematically over- or underconfident.
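As a rough illustration of the bucketing idea, the sketch below builds the data behind a reliability diagram with scikit-learn's calibration_curve. The data is synthetic and the "overconfident model" is a made-up transformation of the true probabilities, so the numbers only show the mechanics, not any real model.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Synthetic ground truth: each event has a known probability of occurring.
true_prob = rng.uniform(0.05, 0.95, size=20_000)
outcomes = rng.binomial(1, true_prob)

# A hypothetical overconfident model: doubling the log-odds pushes
# every prediction away from 0.5 and toward 0 or 1.
log_odds = np.log(true_prob / (1 - true_prob))
overconfident = 1 / (1 + np.exp(-2.0 * log_odds))

# calibration_curve returns (observed frequency, mean prediction) per bin.
obs_good, pred_good = calibration_curve(outcomes, true_prob, n_bins=10)
obs_over, pred_over = calibration_curve(outcomes, overconfident, n_bins=10)

# Largest distance from the diagonal across the ten buckets.
gap_good = np.max(np.abs(obs_good - pred_good))
gap_over = np.max(np.abs(obs_over - pred_over))
print(f"largest gap, calibrated model:    {gap_good:.3f}")
print(f"largest gap, overconfident model: {gap_over:.3f}")
```

Plotting pred_* on the X axis against obs_* on the Y axis gives the reliability diagram described above.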

Why most classifiers aren't naturally calibrated

The specific reason depends on the model family.

Naive Bayes assumes feature independence. When features are actually correlated (which they almost always are), the model multiplies similar evidence multiple times and pushes probabilities toward the extremes — predictions cluster near 0 and 1 rather than spreading across the range.
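A small synthetic experiment can make the double-counting effect concrete. In this sketch one weakly informative feature is copied five times, which is the extreme case of correlated features. The data and the 5%/95% thresholds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 5_000
y = rng.integers(0, 2, size=n)

# One weakly informative feature: class means differ by one standard deviation.
x = rng.normal(loc=y.astype(float), scale=1.0).reshape(-1, 1)

# The same evidence repeated five times (perfectly correlated features).
x_dup = np.repeat(x, 5, axis=1)

p_single = GaussianNB().fit(x, y).predict_proba(x)[:, 1]
p_dup = GaussianNB().fit(x_dup, y).predict_proba(x_dup)[:, 1]

# Share of predictions below 5% or above 95%.
extreme_single = np.mean((p_single < 0.05) | (p_single > 0.95))
extreme_dup = np.mean((p_dup < 0.05) | (p_dup > 0.95))
print(f"extreme predictions, one feature:   {extreme_single:.1%}")
print(f"extreme predictions, five copies:   {extreme_dup:.1%}")
```

The duplicated features add no new information, yet the model treats each copy as independent evidence and becomes far more certain.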

Support vector machines don't output probabilities natively. The standard approach (Platt scaling) fits a logistic regression on top of SVM scores to produce probability-like outputs, but those outputs are calibrated only as well as the Platt fit allows.

Random forests average the predictions of their trees, while gradient-boosted trees sum the outputs of successive trees and pass the total through a link function. Neither procedure is guaranteed to produce calibrated probabilities. A common pattern is bias toward 0.5 at the extremes (the model rarely produces 95% predictions even when warranted) combined with overconfidence in the middle (50-70% predictions are often closer to 50-55% in reality). The bias direction depends on the specific algorithm and hyperparameters.

Neural networks are widely known to produce overconfident probabilities, especially deep networks trained to convergence. The Guo et al. 2017 paper "On Calibration of Modern Neural Networks" demonstrated that modern deep learning models routinely produce confidences far above their actual accuracy — a 99% prediction might be right only 90% of the time.

Logistic regression is the exception. Its loss function (cross-entropy with the logit link) produces calibrated probabilities by construction when the model is well-specified. This is why logistic regression is sometimes used as a baseline calibration target — its outputs are inherently meaningful as probabilities.

For sports modeling specifically, the dominant architecture (gradient-boosted trees) is not naturally calibrated. Calibration is a separate post-processing step, not an automatic property of the model.

Method 1: Platt scaling (sigmoid calibration)

Platt scaling fits a logistic regression on the model's outputs. The output of the base model becomes the single input to a logistic regression that maps it to a calibrated probability.

The math: given base model output s, calibrated probability is

p_cal = 1 / (1 + exp(A × s + B))

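where A and B are parameters learned during calibration from a held-out validation set. With this sign convention A is typically negative, so that higher base scores map to higher calibrated probabilities.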

This is essentially fitting a sigmoid curve through the calibration data. It works well when the relationship between base predictions and true probability is roughly sigmoid-shaped — which is often the case for SVMs and some neural network architectures.
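To connect the formula to code, here is a minimal sketch that fits A and B by hand, assuming a linear SVM as the base model and synthetic data. It uses a nearly unregularized LogisticRegression on the held-out scores. Note that scikit-learn parameterizes the sigmoid as 1 / (1 + exp(-(w × s + b))), so A = -w and B = -b. Platt's original method also smooths the target labels, which this sketch omits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic data, split into a training half and a calibration half.
X, y = make_classification(n_samples=4_000, n_features=20, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Base model: outputs an uncalibrated score s, not a probability.
svm = LinearSVC(max_iter=5000, random_state=0).fit(X_train, y_train)
s_cal = svm.decision_function(X_cal).reshape(-1, 1)

# Large C makes the regularization negligible, approximating a plain fit.
platt = LogisticRegression(C=1e6).fit(s_cal, y_cal)
A = -platt.coef_[0, 0]
B = -platt.intercept_[0]

def platt_probability(s):
    """Calibrated probability for base score s, in the article's notation."""
    return 1.0 / (1.0 + np.exp(A * s + B))

print(f"A = {A:.3f}, B = {B:.3f}")
```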

Implementation in scikit-learn:

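from sklearn.calibration import CalibratedClassifierCV

# SomeBaseClassifier is a placeholder for any scikit-learn classifier.
model = SomeBaseClassifier()
# With cv=5, each calibrator is fit on a fold the base model was not trained on.
calibrated = CalibratedClassifierCV(model, cv=5, method='sigmoid')
calibrated.fit(X_train, y_train)
# Returns one column per class; column 1 is the positive-class probability.
calibrated_probs = calibrated.predict_proba(X_test)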

Strengths. Platt scaling has only two parameters (A and B). This makes it data-efficient — you can fit a reasonable Platt scaler with relatively few calibration examples (a few hundred to a few thousand). It's also smooth, which produces well-behaved probabilities at the extremes.

Weaknesses. Platt scaling assumes the calibration mapping is sigmoid-shaped. When the actual relationship between base predictions and true probability isn't sigmoid, Platt scaling can't capture the full miscalibration pattern. It also tends to perform poorly when the base model produces probabilities that are biased in non-symmetric ways across the probability range.

Method 2: Isotonic regression

Isotonic regression fits a piecewise-constant, monotonically non-decreasing function to the calibration data. Unlike Platt scaling, it doesn't assume any specific functional form for the calibration mapping. It only assumes the mapping is monotonic: if the base model scores one prediction higher than another, the calibrated probability for the first should be at least as high as for the second, which holds for any reasonable classifier.

Implementation in scikit-learn:

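from sklearn.isotonic import IsotonicRegression

# Inputs are 1-D arrays: the base model's predicted probabilities and the 0/1 outcomes.
# out_of_bounds='clip' maps test predictions outside the fitted range to the nearest endpoint.
iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(model_predictions_val, true_outcomes_val)
calibrated_probs = iso.predict(model_predictions_test)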

Or as part of the standard calibration class:

from sklearn.calibration import CalibratedClassifierCV
calibrated = CalibratedClassifierCV(model, cv=5, method='isotonic')
# Fit and predict exactly as in the sigmoid example.
calibrated.fit(X_train, y_train)

Strengths. Isotonic regression is non-parametric and can correct any monotonic miscalibration pattern, including asymmetric ones that Platt scaling can't handle. It typically produces better calibration than Platt scaling on larger calibration sets, particularly for tree-based models whose miscalibration tends to be more complex than a sigmoid.
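As a sketch of what "any monotonic pattern" means in practice, the example below invents a base model whose true hit rate is the square of its raw prediction (a mapping that is monotonic but not sigmoid-shaped) and checks that isotonic regression recovers it. The data and the squared relationship are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Hypothetical base model: raw predictions are uniform on [0, 1], but the
# event actually occurs with probability raw**2 (monotonic, not a sigmoid).
raw = rng.uniform(0, 1, size=5_000)
outcomes = rng.binomial(1, raw ** 2)

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw, outcomes)

# Compare the learned mapping with the true one on an evenly spaced grid.
grid = np.linspace(0, 1, 101)
learned = iso.predict(grid)
mean_abs_error = np.mean(np.abs(learned - grid ** 2))
print(f"mean absolute error vs. true mapping: {mean_abs_error:.3f}")
```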

Weaknesses. Isotonic regression can overfit when the calibration set is small. As a rough heuristic, isotonic regression needs at least 1,000 calibration examples to outperform Platt scaling reliably. Below that, Platt scaling's smaller parameter count makes it more robust. Isotonic regression also produces a step function, which can cause unstable predictions near step boundaries.

For sports modeling specifically, isotonic regression is typically the right choice once you have enough calibration data. The miscalibration patterns from gradient-boosted trees are usually more complex than a sigmoid can capture, and a season's worth of predictions is far more than the 1,000-example threshold.

Method 3: Temperature scaling (for neural networks)

For deep learning models, temperature scaling is a single-parameter variant of Platt scaling that's particularly well-suited to neural network logits.

The idea: divide the model's pre-softmax logits by a learned temperature parameter T before applying softmax. T > 1 reduces confidence (flattens the probability distribution); T < 1 increases confidence (sharpens the distribution).

Temperature scaling has just one parameter to learn (T itself), which makes it extremely data-efficient. It's the standard go-to for calibrating modern neural networks and is what many production deep learning systems use.
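A minimal NumPy sketch of temperature scaling follows, assuming you already have validation-set logits and integer labels. Here both are simulated: well-calibrated logits are multiplied by 3 to mimic an overconfident network, so the fitted T should land near 3. The helper names (softmax, nll, fit_temperature) are illustrative, not from any particular library.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Average negative log-likelihood after dividing logits by T."""
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels):
    """Find the T that minimizes validation NLL (one-dimensional search)."""
    result = minimize_scalar(
        lambda t: nll(logits, labels, t), bounds=(0.05, 20.0), method="bounded"
    )
    return result.x

rng = np.random.default_rng(0)
n, k = 5_000, 3

# Simulated calibrated logits, with labels drawn from their softmax.
base_logits = rng.normal(size=(n, k))
cum = np.cumsum(softmax(base_logits), axis=1)
u = rng.uniform(size=n)
labels = np.minimum((u[:, None] > cum).sum(axis=1), k - 1)

# Simulated overconfident network: same ranking, logits three times larger.
overconfident_logits = 3.0 * base_logits

T = fit_temperature(overconfident_logits, labels)
print(f"fitted temperature: {T:.2f}")
```

Because dividing every logit by the same T never changes which class has the highest score, temperature scaling leaves accuracy untouched and only adjusts confidence.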

This is less relevant for tree-based sports prediction models but worth knowing about — if your system uses a neural component, temperature scaling is usually the right calibration approach for that component specifically.

How to validate calibration

Three standard metrics for evaluating calibration quality:

Brier score. The mean squared error between predicted probabilities and actual outcomes. Lower is better. Brier score combines calibration and resolution (how well the model discriminates between classes) into a single number. A perfect model has Brier score 0; a model that always predicts 0.5 scores 0.25.

Log loss (cross-entropy loss). The negative log probability the model assigned to the correct outcome, averaged across predictions. Lower is better. Log loss penalizes confidently wrong predictions more harshly than Brier score does, which makes it sensitive to calibration in the tails.
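The difference in how the two metrics treat a confidently wrong prediction is easy to see on a toy example. The sketch below uses five hand-picked predictions (purely illustrative) and changes only the last, wrong one from 90% to 99%. scikit-learn offers brier_score_loss and log_loss for real use; plain NumPy is used here to keep the arithmetic visible.

```python
import numpy as np

def brier_score(y, p):
    return float(np.mean((p - y) ** 2))

def log_loss(y, p):
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1, 0, 1, 1, 0])

# Always predicting 0.5 gives the 0.25 Brier baseline.
coin_flip = np.full(5, 0.5)
print("Brier, always 0.5:", brier_score(y, coin_flip))

# Four good predictions plus one wrong one (the last event did not occur).
wrong_at_90 = np.array([0.9, 0.1, 0.9, 0.9, 0.90])
wrong_at_99 = np.array([0.9, 0.1, 0.9, 0.9, 0.99])

brier_ratio = brier_score(y, wrong_at_99) / brier_score(y, wrong_at_90)
log_loss_ratio = log_loss(y, wrong_at_99) / log_loss(y, wrong_at_90)
print(f"Brier score grows by a factor of {brier_ratio:.2f}")
print(f"Log loss grows by a factor of    {log_loss_ratio:.2f}")
```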

Expected calibration error (ECE). Buckets predictions by probability and measures the average absolute difference between predicted and observed rates per bucket, weighted by bucket size. Directly measures calibration without the resolution component. Lower is better; perfect calibration produces ECE of 0.
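ECE is not built into every metrics library, so it is often written by hand. The following is one possible implementation with equal-width buckets. The function name and the default of 10 buckets are choices made for this sketch, and results do depend on the bucketing scheme.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average of |mean prediction - observed rate| over equal-width bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only, so a prediction of exactly 1.0 lands in the last bin.
    bin_index = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_index == b
        if in_bin.any():
            gap = abs(y_prob[in_bin].mean() - y_true[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by share of predictions in the bin
    return float(ece)

# Four predictions of 75% where three events occurred: perfectly calibrated.
print(expected_calibration_error([1, 1, 1, 0], [0.75, 0.75, 0.75, 0.75]))

# Four predictions of 90% where only two events occurred: overconfident.
print(expected_calibration_error([1, 1, 0, 0], [0.9, 0.9, 0.9, 0.9]))
```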

The standard validation protocol is:

1. Split data into train, validation, and test sets chronologically (no random shuffling for time-series data like sports).
2. Train the base model on the train set.
3. Fit the calibration method (Platt or isotonic) on the validation set, using the base model's predictions on validation data and the actual validation outcomes.
4. Evaluate Brier score, log loss, and ECE on the test set, comparing pre-calibration and post-calibration metrics.
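The protocol above can be sketched end to end as follows. This is an outline under stated assumptions, not a production pipeline: the data is synthetic, row order stands in for time order, the base model is scikit-learn's GradientBoostingClassifier, and only Brier score and log loss are reported (ECE can be added with a helper of your own).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss, log_loss

# Synthetic stand-in for time-ordered data: earlier rows are "older".
X, y = make_classification(
    n_samples=6_000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

# 1. Chronological split: no shuffling, oldest 60% train, next 20% validation.
n = len(y)
train_end, val_end = int(0.6 * n), int(0.8 * n)
X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]

# 2. Train the base model on the train set only.
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
raw_val = model.predict_proba(X_val)[:, 1]
raw_test = model.predict_proba(X_test)[:, 1]

# 3. Fit the calibrator on validation predictions and outcomes.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_val, y_val)
cal_test = iso.predict(raw_test)

# 4. Compare metrics on the untouched test set.
def evaluate(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)  # isotonic can output exact 0 or 1
    return {
        "brier": brier_score_loss(y_test, p),
        "log_loss": log_loss(y_test, p, labels=[0, 1]),
    }

before, after = evaluate(raw_test), evaluate(cal_test)
for name in before:
    print(f"{name}: before={before[name]:.4f} after={after[name]:.4f}")
```

On any given dataset the calibrated metrics may or may not improve; the point of the protocol is that the comparison is made on data neither the model nor the calibrator has seen.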

If calibration improves all three metrics on the test set, the calibration is working. If it improves some metrics but degrades others, the calibration may be overfit to the validation set. If it degrades metrics on the test set entirely, the calibration set was too small or the calibration method is wrong for the data.

Common pitfalls

A few specific mistakes that produce calibration that looks fine but isn't:

Calibrating on training data. This means fitting the calibrator to the model's predictions on its own training data. It is circular: the training predictions are already overfit to the training labels, so the calibration mapping inherits that overfit. Always use a separate calibration set.

Calibration set leakage. If features in your validation set contain information from the future relative to the prediction time (e.g., final game stats used to predict that game's outcome), the calibration mapping will look better than it actually is. This is the same lookahead bias problem we cover in Why Backtests Overstate ROI.

Stale calibration. Sports environments shift. A calibrator fit on 2024 data may not apply correctly to 2026 predictions because the underlying distribution of features and outcomes has drifted. Production systems re-fit calibrators periodically (monthly or quarterly is typical for sports).
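One simple way to implement periodic re-fitting is to fit the calibrator on only the most recent window of prediction-outcome pairs. The sketch below simulates a drift (the true hit rate drops to half the raw prediction) and compares a stale calibrator with a freshly fit one. The helper fit_recent_calibrator, the window size, and the drift itself are all hypothetical, chosen to illustrate the idea rather than describe any real system.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_recent_calibrator(raw_probs, outcomes, window):
    """Fit an isotonic calibrator on the most recent `window` pairs."""
    recent_raw = np.asarray(raw_probs, dtype=float)[-window:]
    recent_y = np.asarray(outcomes, dtype=float)[-window:]
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    return iso.fit(recent_raw, recent_y)

def brier(y, p):
    return float(np.mean((np.asarray(p) - np.asarray(y)) ** 2))

rng = np.random.default_rng(0)
n_old, n_new, n_eval = 3_000, 3_000, 2_000

# Old regime: raw predictions are already calibrated.
raw_old = rng.uniform(0, 1, n_old)
y_old = rng.binomial(1, raw_old)

# New regime after drift: events occur only half as often as predicted.
raw_new = rng.uniform(0, 1, n_new + n_eval)
y_new = rng.binomial(1, 0.5 * raw_new)

# Stale calibrator: fit once on the old regime and never updated.
stale = fit_recent_calibrator(raw_old, y_old, window=n_old)

# Fresh calibrator: re-fit on the latest window of the full history.
history_raw = np.concatenate([raw_old, raw_new[:n_new]])
history_y = np.concatenate([y_old, y_new[:n_new]])
fresh = fit_recent_calibrator(history_raw, history_y, window=n_new)

# Evaluate both on the newest, unseen predictions.
raw_eval, y_eval = raw_new[n_new:], y_new[n_new:]
brier_stale = brier(y_eval, stale.predict(raw_eval))
brier_fresh = brier(y_eval, fresh.predict(raw_eval))
print(f"Brier with stale calibrator: {brier_stale:.4f}")
print(f"Brier with fresh calibrator: {brier_fresh:.4f}")
```

The window length is a trade-off: a shorter window tracks drift faster but leaves fewer examples for the calibrator, which matters most for isotonic regression.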

Calibrating on insufficient data. Isotonic regression with 200 calibration examples will produce a wildly overfit step function. Platt scaling with 50 examples will produce unreliable A and B parameters. Calibration needs adequate sample size — at least 500 examples for Platt scaling, ideally 1,000+ for isotonic regression.

Cosmetic calibration. Some products advertise calibrated probabilities while only doing the post-processing step on cherry-picked validation data, or only on the predictions that ended up being correct. Real calibration uses all predictions, including the wrong ones, and is validated on a held-out set the calibrator hasn't seen.

ParlayX's approach specifically

Per the calibration page, ParlayX's NBA models use isotonic regression calibrators fit on per-prop validation data. The calibrators are re-fit on a rolling basis as new prediction-outcome pairs accumulate, with a separate calibrator per prop type (points, rebounds, assists, threes).

We chose isotonic over Platt because the miscalibration patterns from our gradient-boosted base models are not well-captured by sigmoid curves, and we have more than enough calibration data per prop to support isotonic regression's higher data requirement.

The calibration page publishes the post-calibration reliability curves themselves: the calibrator's outputs compared against observed outcomes on a held-out test period. This is the test of whether the calibration is actually working, not just whether it was attempted.

The summary

Probability calibration is the step that turns a machine learning model's outputs from confidence scores into actual probabilities. It's required for any application where the probability needs to be interpretable — including sports betting where bet sizing depends on the probability estimate.

The standard methods (Platt scaling for small calibration sets, isotonic regression for larger ones, temperature scaling for neural networks) handle most cases. The discipline that matters is the validation protocol: separate calibration set, no leakage, adequate sample size, and re-fitting as the environment shifts.

A sports analytics product that publishes its calibration data and methodology is one that's done the work. A product that doesn't either hasn't calibrated, hasn't validated, or has something to hide.

ParlayX provides analytics tools and educational content, not betting advice. Sports betting involves financial risk and is intended for adults only. If you or someone you know has a gambling problem, call 1-800-GAMBLER for confidential help, 24 hours a day.