Methodology
Probabilistic · calibrated
A transparent account of how the forecast is produced: the model, the data it learns from, how it scored on past World Cups, and exactly where it stops being certain. The brand says “Prediction Engine”; the claims here say probability. Both are true.
What this model is — and is not
An honest frame before any numbers. A disclosed limitation is a feature in a forecast, not a flaw.
It is
- A statistical model of international football: strength ratings → a goal model → simulated tournaments.
- Calibrated and backtested on the 2018 and 2022 World Cups before being trusted on 2026.
- Fully reproducible — fixed data hash, fixed RNG seed, versioned code.
- Independent. Built from public match results only.
It is not
- An oracle. A 7% champion probability means roughly a 1-in-14 shot, not a prediction the team wins.
- A betting product. It uses no market or odds data, by design.
- Injury/lineup/transfer-aware in real time — it sees results, not team news.
- Affiliated with FIFA or any official body.
Data sources
One public dataset of international results, cleaned and tournament-weighted. No market data, no scraped odds.
Every full international since the 19th century. After cleaning (dropping unplayed and unparseable rows), 49,291 matches feed the model. Each match carries a competition tier — a World Cup final counts for more than a friendly — and the exact tier weights were tuned on the backtest (below).
Source pinned by SHA-256 37e5ce3b82849279… — see the reproducibility note.
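Pinning a source by hash can be checked with a few lines of standard-library Python. This is a minimal sketch, not the project's own tooling: the function names are hypothetical, and because the hash above is shown truncated, the comparison is against a prefix only.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so a large results CSV never has to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_prefix(path, pinned_prefix):
    """True if the file's digest starts with the (possibly truncated) pinned hash."""
    return sha256_of_file(path).startswith(pinned_prefix.lower())
```

A build step could call `matches_pinned_prefix(csv_path, "37e5ce3b82849279")` and refuse to fit the model if it returns `False`; comparing the full 64-character digest is stricter when it is available.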
The pipeline
Three stages turn historical results into a champion probability for all 48 teams.
Elo ratings
Standard international-football Elo (the eloratings.net tradition), with two domain-specific refinements: competition-tier weighting and an altitude correction.
The update per match scales with the competition tier and the goal margin. An altitude correction stops high-altitude home wins (Quito, La Paz, Bogotá) from inflating ratings: a sea-level side visiting altitude is already expected to do worse, so beating them earns a smaller rating gain. The tier weights were tuned on the backtest; the tuned values, from tier 1 at 1.10 down to tier 5 at 0.30, are listed step by step in the 2022 tuning trajectory.
Final-refit top of the table: Spain · Argentina · France · Brazil · England · Netherlands · … — Ecuador sits #16 and Bolivia #84, a sign the altitude fix holds.
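The rating update described above can be sketched as follows. The expected-score curve and goal-margin multiplier follow the published eloratings.net convention; `base_k`, `tier_weight`, `home_bonus` and `altitude_bonus` are hypothetical parameter names with placeholder values, not the model's fitted settings. The altitude term is modelled here as extra rating points added to the home side's expectation, which is one simple way to get the "smaller gain for an expected win" behaviour.

```python
def expected_score(rating_diff):
    """Expected result (0..1) for the side that is `rating_diff` points ahead."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def margin_multiplier(goal_diff):
    """eloratings.net-style goal-margin factor: 1, 1.5, then (11 + n) / 8."""
    n = abs(goal_diff)
    if n <= 1:
        return 1.0
    if n == 2:
        return 1.5
    return (11.0 + n) / 8.0


def elo_update(r_home, r_away, home_goals, away_goals, *,
               base_k=40.0, tier_weight=1.0,
               home_bonus=100.0, altitude_bonus=0.0):
    """Return the new (home, away) ratings after one match.

    All keyword defaults are illustrative assumptions.
    """
    diff = r_home + home_bonus + altitude_bonus - r_away
    expected_home = expected_score(diff)
    if home_goals > away_goals:
        actual_home = 1.0
    elif home_goals == away_goals:
        actual_home = 0.5
    else:
        actual_home = 0.0
    delta = (base_k * tier_weight
             * margin_multiplier(home_goals - away_goals)
             * (actual_home - expected_home))
    # Zero-sum: whatever the home side gains, the away side loses.
    return r_home + delta, r_away - delta
```

With equal ratings and no bonuses, a 1-0 home win moves 20 points; giving the home side an altitude bonus raises its expected score, so the same win moves fewer points.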
Dixon–Coles goal model + host advantage
Elo says who is stronger; Dixon–Coles turns that into goals. It models each side's expected goals as Poisson rates with a low-score correction, and fits one shared host-advantage term.
For a match between teams i and j, team i's log goal rate is log λᵢ = c + attackᵢ − defenseⱼ + h·homeᵢ − altitude·burdenᵢ, and symmetrically for team j. The Dixon–Coles τ correction adjusts the probabilities of the 0-0, 1-0, 0-1 and 1-1 scorelines, where independent Poisson rates are known to fit poorly (in particular, they under-count low-scoring draws). An Elo prior pulls each team’s attack/defense toward its rating-implied strength, and the fit is time-weighted (recent matches count more).
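The low-score correction can be sketched as a scoreline matrix. The τ function below is the standard Dixon–Coles form; the dependence parameter `rho`, the goal cap `max_goals` and the example rates are illustrative assumptions, not fitted values from this model.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals at the given Poisson rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def tau(x, y, lam, mu, rho):
    """Dixon-Coles adjustment; only the four low scorelines are touched."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def scoreline_matrix(lam, mu, rho, max_goals=10):
    """matrix[x][y] = P(team i scores x, team j scores y), renormalised after truncation."""
    raw = [[tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
            for y in range(max_goals + 1)]
           for x in range(max_goals + 1)]
    total = sum(sum(row) for row in raw)
    return [[p / total for p in row] for row in raw]


def outcome_probs(matrix):
    """Collapse the scoreline matrix into (win, draw, loss) for team i."""
    win = sum(p for x, row in enumerate(matrix) for y, p in enumerate(row) if x > y)
    draw = sum(row[x] for x, row in enumerate(matrix))
    return win, draw, 1.0 - win - draw
```

With a negative `rho`, the matrix puts more mass on 0-0 and 1-1 and less on 1-0 and 0-1 than independent Poisson rates would, which raises the draw probability slightly.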
Host advantage. A single coefficient (h = +0.29) lifts the home/host side’s goal rate. For 2026 it applies to the three hosts (USA, Mexico, Canada) on their genuine home fixtures; it is discounted at neutral World Cup venues, and the altitude term handles Mexico City’s elevation separately.
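Because h acts on the log scale, its effect on goals is multiplicative. The sketch below evaluates the goal-rate formula with and without the host term; treating `home` as a fraction between 0 and 1 to represent a discounted venue is an assumption made for illustration, as are the parameter names and example inputs.

```python
import math

H = 0.29  # host-advantage coefficient quoted above


def goal_rate(c, attack, opp_defense, *, home=0.0,
              altitude_coef=0.0, burden=0.0, h=H):
    """Expected goals: exp(c + attack - defense + h*home - altitude*burden).

    `home` is 1.0 for a genuine home fixture and 0.0 for a neutral one;
    fractional values (a discounted host advantage) are an assumption here.
    """
    return math.exp(c + attack - opp_defense + h * home
                    - altitude_coef * burden)


neutral = goal_rate(0.1, 0.2, 0.05)
hosted = goal_rate(0.1, 0.2, 0.05, home=1.0)
lift = hosted / neutral  # equals exp(0.29), roughly a one-third increase
```

Whatever the team strengths, a full host advantage multiplies the goal rate by e^0.29 ≈ 1.34.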
Fit on data from 2008-01-01, 4-year half-life (1460 days), Elo-prior strength 20. Training cutoff 2026-06-03.
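A half-life implies exponential decay of each match's weight in the fit. A minimal sketch, assuming the weight is 0.5 raised to the match's age in half-lives (the function name is hypothetical):

```python
from datetime import date, timedelta


def time_weight(match_date, cutoff, half_life_days=1460.0):
    """Weight of a match in the likelihood: halves every `half_life_days`."""
    age_days = (cutoff - match_date).days
    if age_days < 0:
        raise ValueError("match is after the training cutoff")
    return 0.5 ** (age_days / half_life_days)


cutoff = date(2026, 6, 3)
four_years_old = time_weight(cutoff - timedelta(days=1460), cutoff)
oldest_in_window = time_weight(date(2008, 1, 1), cutoff)
```

A match played exactly 1460 days before the cutoff counts half as much as one played on the cutoff day, and matches from the start of the 2008 window carry only a few percent of full weight.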
Calibration
Calibration was assessed, not assumed. We tested whether a post-hoc map would improve the probabilities before deciding what the published run carries.
| Method | 2018 ΔBrier | 2022 ΔBrier | Log loss | Verdict |
|---|---|---|---|---|
| platt | +0.5% | -0.3% | worse both years | net-neutral on Brier, no calibration gain |
| isotonic | +0.3% | +2.3% | much worse 2022 | rejected — >1% Brier regression (small-window overfit) |
Assessed finding: adequate as-is, no map applied. Isotonic regression overfits the small World Cup window (+2.3% Brier on 2022); Platt scaling is net-neutral on Brier and degrades log loss. The raw model is acceptably calibrated — visible in the reliability diagram below — so the published run records calibration_method = none. This is a checked-and-adequate result, recorded with its evidence, not an omission.
Backtest results
The model was refit strictly before each past World Cup opener — no leakage — then scored on the matches that followed. Tuning happened on 2022; 2018 was held out.
| World Cup | Brier | Log loss | Baseline Brier | ΔBrier | Gate |
|---|---|---|---|---|---|
| 2018 (held-out) | 0.5976 | 0.9985 | 0.5856 | +0.0120 (0.48σ) | within 1σ ✓ |
| 2022 (tuned) | 0.6123 | 1.0286 | 0.6018 | +0.0105 (0.35σ) | within 1σ ✓ |
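Both years beat a uniform (⅓, ⅓, ⅓) model comfortably (Brier 0.667 / log loss 1.099). The Baseline Brier column is an Elo-logistic baseline on the same information. ΔBrier is the model's Brier minus the baseline's, so the positive values mean the model scored marginally worse than the baseline in both years, within the 1σ gate.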
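The uniform reference figures can be reproduced directly. The sketch below uses the three-outcome Brier score summed over classes (the form that gives 0.667 for a uniform forecast) and the natural-log loss; the function names are illustrative.

```python
import math


def brier(probs, outcome):
    """Multi-class Brier score for one match: squared error summed over outcomes."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))


def log_loss(probs, outcome):
    """Negative log of the probability given to what actually happened."""
    return -math.log(probs[outcome])


uniform = (1 / 3, 1 / 3, 1 / 3)  # (win, draw, loss)
uniform_brier = brier(uniform, 0)        # 4/9 + 1/9 + 1/9 = 2/3
uniform_log_loss = log_loss(uniform, 0)  # ln 3

# ΔBrier in the table is model minus baseline, e.g. for 2018:
delta_2018 = 0.5976 - 0.5856
```

A tournament score is the mean of these per-match values, so lower is better on both metrics.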
2022 tuning trajectory (8 steps)
Coordinate descent on a combined Brier/log-loss objective normalised to the default config. The data preferred a longer half-life and a stronger Elo prior — the hypothesis for why traditional powers were under-rated.
| Step (knob moved) | Loss | Brier | Log loss |
|---|---|---|---|
| default (730 / 5 / default tiers) | 1.0000 | 0.6250 | 1.0453 |
| half-life → 1460 d | 0.9946 | 0.6212 | 1.0405 |
| λ prior → 20 | 0.9905 | 0.6186 | 1.0360 |
| tier 1 → 1.10 | 0.9898 | 0.6182 | 1.0355 |
| tier 2 → 0.75 | 0.9891 | 0.6177 | 1.0348 |
| tier 3 → 0.60 | 0.9874 | 0.6165 | 1.0333 |
| tier 4 → 0.45 | 0.9856 | 0.6151 | 1.0317 |
| tier 5 → 0.30 | 0.9819 | 0.6123 | 1.0286 |
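The tuning loop can be sketched as greedy coordinate descent over a grid. Two assumptions are labelled here: the combined objective is taken to be the equal-weight mean of Brier and log loss, each divided by its default-config value (this agrees with the Loss column above to about 1e-4, but the actual weighting is not stated), and `toy_evaluate` is a stand-in for a real refit-and-score step.

```python
def combined_loss(brier, log_loss, brier0=0.6250, log_loss0=1.0453):
    """Assumed objective: mean of the two metrics, each normalised to the default config."""
    return 0.5 * (brier / brier0 + log_loss / log_loss0)


def coordinate_descent(evaluate, grids, start):
    """Move one knob at a time, keeping any value that lowers the loss."""
    config = dict(start)
    best = evaluate(config)
    improved = True
    while improved:
        improved = False
        for knob, values in grids.items():
            for value in values:
                trial = {**config, knob: value}
                loss = evaluate(trial)
                if loss < best:
                    config, best, improved = trial, loss, True
    return config, best


def toy_evaluate(cfg):
    """Hypothetical loss surface whose minimum sits at the top of both grids."""
    return (1.0 + (cfg["half_life"] - 1460) ** 2 / 1e6
            + (cfg["prior"] - 20) ** 2 / 100)


grids = {"half_life": [365, 730, 1095, 1460], "prior": [5, 10, 20]}
tuned, tuned_loss = coordinate_descent(
    toy_evaluate, grids, {"half_life": 730, "prior": 5})
```

When the winning value is the largest one in its grid, as in this toy run, the search cannot tell whether a still larger value would do better.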
Reliability diagram
Calibration, drawn. For each predicted-probability bin, where does the model land versus how often the outcome actually happened? Points on the dashed diagonal are perfectly calibrated.
With only ~64 matches per World Cup the bins are sparse — large bubbles carry the signal, small ones are noise-dominated, so read the diagram by bubble size. The points hug the diagonal without a systematic over- or under-confidence bias, which is why no calibration map was applied.
Rendered from reliability_2018.csv / reliability_2022.csv.
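The points on a reliability diagram come from binning forecasts and comparing the mean predicted probability in each bin with the observed frequency. A minimal sketch, assuming equal-width bins (the bin count and function name are illustrative, and the layout of the CSV files is not assumed):

```python
def reliability_bins(probs, outcomes, n_bins=10):
    """Return (mean predicted, observed frequency, count) for each non-empty bin.

    `probs` are predicted probabilities of an event; `outcomes` are 1 if the
    event happened and 0 otherwise. The count is what sets the bubble size.
    """
    counts = [0] * n_bins
    pred_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    for p, hit in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[idx] += 1
        pred_sum[idx] += p
        hit_sum[idx] += int(hit)
    return [(pred_sum[i] / counts[i], hit_sum[i] / counts[i], counts[i])
            for i in range(n_bins) if counts[i]]


points = reliability_bins([0.05, 0.05, 0.55, 0.55, 0.55, 0.55],
                          [0, 0, 1, 1, 0, 1])
```

In this toy input the 0.55 bin holds four forecasts of which three came true, so its point sits above the diagonal; with so few matches per bin, one result is enough to move it.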
Monte Carlo simulation
To get tournament odds, the engine plays the whole World Cup tens of thousands of times — group stage through final — sampling each match from its Dixon–Coles scoreline distribution.
Each simulated tournament resolves group standings (with the full tie-breaker chain below), selects the eight best third-placed teams, fills the knockout bracket via the official slotting matrix, and plays to a champion. Aggregating across runs gives every team’s probability of reaching each stage. A convergence check comparing 10k and 50k simulations found the largest champion-share drift was 0.634pp (on Spain) — just above the 0.5pp target, driven by noise in the 10k arm. The published forecast therefore runs at 50,000 simulations.
- Spain 10.4%
- Argentina 9.5%
- Brazil 7.3%
- France 7.1%
- England 5.3%
- Portugal 4.5%
- Germany 4.2%
- Netherlands 4.2%
- Mexico 3.8%
- Colombia 3.7%
- Belgium 3.2%
- Japan 3.2%
Snapshot of a backtest-acceptance run — the live numbers are on the forecast page.
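The convergence check can be put in context with the binomial standard error of a simulated share. The sketch below uses Spain's snapshot share of 10.4% and assumes the 10k and 50k arms are independent runs; both are illustrative inputs rather than a reconstruction of the actual check.

```python
import math


def share_se(p, n_sims):
    """Standard error of a simulated probability p estimated from n_sims runs."""
    return math.sqrt(p * (1.0 - p) / n_sims)


p_spain = 0.104
se_10k = share_se(p_spain, 10_000)
se_50k = share_se(p_spain, 50_000)

# Noise on the difference between two independent arms adds in quadrature.
se_diff = math.hypot(se_10k, se_50k)

# Express everything in percentage points for comparison with the 0.634pp drift.
se_10k_pp, se_50k_pp, se_diff_pp = (100 * v for v in (se_10k, se_50k, se_diff))
drift_in_sigmas = 0.634 / se_diff_pp
```

Under these assumptions the 10k arm carries roughly 0.3pp of noise against roughly 0.14pp at 50k, so a drift of this size on the largest share is within about two standard errors of the difference.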
Group tie-breakers — FIFA Article 13
The simulator implements the official 2026 tie-breaker chain in full. Getting the order right matters: it changes which third-placed teams qualify, and therefore the knockout bracket.
- 1a–1c. Head-to-head first, among the tied teams only: points, then goal difference, then goals scored.
- Re-apply. If head-to-head separates some but not all, re-apply 1a–1c to the teams still tied (the step does not restart from scratch).
- 2d–2e. Overall goal difference, then overall goals scored across all group matches.
- 3g. Most recent FIFA / Coca-Cola Men's World Ranking, FIFA's official terminal step.
For the best-8 third-placed ranking across groups the chain is identical except head-to-head is skipped (those teams have not played each other): points → overall GD → overall GS → FIFA Ranking.
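The chain above can be sketched as a recursive ranking. This is an illustration of the steps as listed in this section, not the simulator's code: function names, the `(home, away, home_goals, away_goals)` match tuples and the `fifa_rank` mapping (lower number = better ranked) are all assumptions.

```python
from collections import defaultdict


def mini_table(teams, matches):
    """(points, goal difference, goals scored) using only matches among `teams`."""
    stats = {t: [0, 0, 0] for t in teams}
    for a, b, ga, gb in matches:
        if a in stats and b in stats:
            stats[a][0] += 3 if ga > gb else 1 if ga == gb else 0
            stats[b][0] += 3 if gb > ga else 1 if ga == gb else 0
            stats[a][1] += ga - gb
            stats[b][1] += gb - ga
            stats[a][2] += ga
            stats[b][2] += gb
    return {t: tuple(v) for t, v in stats.items()}


def break_tie(tied, matches, overall, fifa_rank):
    """Order teams that are level on points."""
    if len(tied) == 1:
        return list(tied)
    h2h = mini_table(tied, matches)  # steps 1a-1c
    buckets = defaultdict(list)
    for team in tied:
        buckets[h2h[team]].append(team)
    ordered = []
    for key in sorted(buckets, reverse=True):
        group = buckets[key]
        if len(group) < len(tied):
            # Head-to-head separated some teams: re-apply it to the rest.
            ordered += break_tie(group, matches, overall, fifa_rank)
        else:
            # Head-to-head separated nobody: overall GD, overall GS, ranking.
            ordered += sorted(group, key=lambda t: (-overall[t][1],
                                                    -overall[t][2],
                                                    fifa_rank[t]))
    return ordered


def rank_group(teams, matches, fifa_rank):
    """Full group ranking: points first, then the tie-breaker chain."""
    overall = mini_table(teams, matches)
    by_points = defaultdict(list)
    for team in teams:
        by_points[overall[team][0]].append(team)
    ranking = []
    for pts in sorted(by_points, reverse=True):
        ranking += break_tie(by_points[pts], matches, overall, fifa_rank)
    return ranking


def best_thirds(third_rows, fifa_rank, n=8):
    """Rank third-placed (team, points, gd, gs) rows; head-to-head is skipped."""
    ranked = sorted(third_rows,
                    key=lambda r: (-r[1], -r[2], -r[3], fifa_rank[r[0]]))
    return [row[0] for row in ranked[:n]]
```

The recursion only descends when head-to-head has split the tied set into smaller groups, which mirrors the "re-apply to the teams still tied" step.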
Reproducibility
Every prediction run records a 10-field provenance block, so any published forecast can be traced to the exact model, code, data, and random seed that produced it.
forecast__elo-dc-v1.0__seed42__cond0__rerun20260610live

| Field | Value | What it pins |
|---|---|---|
| model_version | elo-dc-v1.0 | Identifies the committed Elo + Dixon–Coles fit. |
| simulator_version | shell-v0.1 | Identifies the Monte Carlo engine + tie-breaker logic. |
| code_version / git_commit | 251b05fc78ce | Repository commit the run was produced from. |
| data_version | 2026-06-10 | Snapshot date of the cleaned results dataset. |
| data_hash | ee6953cd4cd6d943… (SHA-256) | Hash of the source results CSV (martj42/international_results). |
| rng_seed | 42 | Monte Carlo seed — fixed, so a re-run reproduces the same draws. |
| training_cutoff_date | 2026-06-10 | Matches on or before this date were used to fit; nothing after leaks in. |
| data_cutoff_time | — (pre-tournament) | Null pre-tournament; set to the as-of time once live results condition the run. |
| calibration_method | none | Calibration assessed; no map applied (see calibration section). |
| run_type | model_forecast | Distinguishes a forecast run from a data build or backtest. |
Read live from the published forecast run (the current_run view) — the same source the forecast page uses, so it refreshes at every re-run.
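A provenance block like the one above maps naturally onto a small immutable record, and a fixed seed is what makes the simulated draws repeatable. The sketch below is illustrative only: the class and function names are hypothetical, the field values are copied from the table (the truncated hash is left truncated), and Python's standard `random` module stands in for whatever generator the simulator actually uses.

```python
import random
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """The ten fields recorded with every prediction run."""
    model_version: str
    simulator_version: str
    git_commit: str
    data_version: str
    data_hash: str
    rng_seed: int
    training_cutoff_date: str
    data_cutoff_time: Optional[str]  # None before the tournament starts
    calibration_method: str
    run_type: str


def sample_draws(seed, n=5):
    """Draw n uniform numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


block = Provenance(
    model_version="elo-dc-v1.0",
    simulator_version="shell-v0.1",
    git_commit="251b05fc78ce",
    data_version="2026-06-10",
    data_hash="ee6953cd4cd6d943…",  # truncated in the source; kept as shown
    rng_seed=42,
    training_cutoff_date="2026-06-10",
    data_cutoff_time=None,
    calibration_method="none",
    run_type="model_forecast",
)
first_run = sample_draws(block.rng_seed)
second_run = sample_draws(block.rng_seed)
```

Two runs seeded identically produce identical draws, so with the same code, data and seed a re-run can reproduce a published forecast exactly.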
Limitations
Where the model is weakest, stated plainly.
- No team news. Injuries, suspensions, lineups and form within a camp are invisible to it — it sees match results, not who is on the pitch.
- No market signal. By design it ignores betting odds, so its champion shares are more compressed than the market’s (favourites land near 7–10%, not 12–15%).
- Sparse World Cup evidence. Only two tournaments (128 matches) back the calibration; reliability bins are noisy and a single tournament cannot fully validate tail behaviour.
- Grid-boundary tuning. Half-life and the Elo prior both landed at the top of their search grids — the true optimum may sit beyond, suggesting even more weight on long-run strength.
- Static within a day. The forecast updates when results are entered, not continuously; between updates it does not react to news.
Update cadence
When the numbers move, and why.
- Pre-tournament. A final model re-run on the June 10 data freeze sets the published opening forecast — superseding any earlier provisional run.
- During the tournament. As each result is entered, the simulator re-conditions on completed matches and re-runs, so standings and championship odds shift to reflect what has actually happened.
- Provenance preserved. Each re-run writes a fresh reproducibility block (above), so any past state of the forecast remains traceable.
Independent, calibrated, and probabilistic — not affiliated with FIFA, and not betting advice. Model figures are transcribed from the committed artifacts (backtest report, Dixon–Coles parameters, reliability CSVs); the reproducibility block and calibration status are read live from the published forecast run.