Methodology

Probabilistic · calibrated

A transparent account of how the forecast is produced — the model, the data it learns from, how it scored on past World Cups, and exactly where it stops being certain. The brand says “Prediction Engine”; the claims here say probability. Both are true.

00

What this model is — and is not

An honest frame before any numbers. A disclosed limitation is a feature in a forecast, not a flaw.

It is

  • A statistical model of international football: strength ratings → a goal model → simulated tournaments.
  • Calibrated and backtested on the 2018 and 2022 World Cups before being trusted on 2026.
  • Fully reproducible — fixed data hash, fixed RNG seed, versioned code.
  • Independent. Built from public match results only.

It is not

  • An oracle. A 7% champion probability means roughly a 1-in-14 shot, not a prediction the team wins.
  • A betting product. It uses no market or odds data, by design.
  • Injury/lineup/transfer-aware in real time — it sees results, not team news.
  • Affiliated with FIFA or any official body.
01

Data sources

One public dataset of international results, cleaned and tournament-weighted. No market data, no scraped odds.

martj42 / international_results · latest match 2026-06-01

Every full international since the 19th century. After cleaning (dropping unplayed and unparseable rows), 49,291 matches feed the model. Each match carries a competition tier — a World Cup final counts for more than a friendly — and the exact tier weights were tuned on the backtest (below).

Matches per tier: Tier 1: 964 · Tier 2: 3,391 · Tier 3: 8,771 · Tier 4: 8,231 · Tier 5: 27,934

Source pinned by SHA-256 37e5ce3b82849279 — see the reproducibility note.

02

The pipeline

Three stages turn historical results into a champion probability for each of the 48 teams: Elo ratings, a Dixon–Coles goal model, and a Monte Carlo tournament simulation.

03

Elo ratings

Standard international-football Elo (the eloratings.net tradition), with two domain-specific refinements: competition-tier weighting and an altitude correction.

  • Base K-factor: 32
  • Home advantage: +65 Elo points (non-neutral venues)
  • Margin multiplier: 1 + ln(1 + |GD|)
  • Initial rating: 1500

The update per match scales with the competition tier and the goal margin. An altitude correction stops high-altitude home wins (Quito, La Paz, Bogotá) from inflating ratings: a sea-level side visiting altitude is already expected to do worse, so beating them earns a smaller rating gain. Tier weights, tuned on the backtest:

Tier 1 ×1.1 · Tier 2 ×0.75 · Tier 3 ×0.6 · Tier 4 ×0.45 · Tier 5 ×0.3

Final-refit top of the table: Spain · Argentina · France · Brazil · England · Netherlands · … — Ecuador sits #16 and Bolivia #84, a sign the altitude fix holds.
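A minimal sketch of one rating update using the parameters listed above. Several details are assumptions rather than the committed implementation: the base-10, 400-point logistic expectation (the usual Elo convention), the choice to multiply the tier weight and margin multiplier into K, and the function names. The altitude correction is deliberately left out because its exact form is not given here.

```python
import math

K_BASE = 32.0      # base K-factor
HOME_ADV = 65.0    # Elo points added to the home side at non-neutral venues
TIER_WEIGHT = {1: 1.1, 2: 0.75, 3: 0.6, 4: 0.45, 5: 0.3}


def expected_score(r_home, r_away, neutral):
    """Expected result for the home side (assumed 400-point logistic scale)."""
    diff = r_home - r_away + (0.0 if neutral else HOME_ADV)
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def margin_multiplier(goal_diff):
    """1 + ln(1 + |GD|): equals 1 for a draw and grows slowly with the margin."""
    return 1.0 + math.log(1.0 + abs(goal_diff))


def elo_update(r_home, r_away, goals_home, goals_away, tier, neutral=False):
    """Return the updated (home, away) ratings after one match."""
    if goals_home > goals_away:
        result = 1.0
    elif goals_home == goals_away:
        result = 0.5
    else:
        result = 0.0
    # Assumption: tier weight and margin multiplier both scale K multiplicatively.
    k = K_BASE * TIER_WEIGHT[tier] * margin_multiplier(goals_home - goals_away)
    delta = k * (result - expected_score(r_home, r_away, neutral))
    return r_home + delta, r_away - delta
```

Under these assumptions the update is zero-sum: whatever one side gains the other loses, so the average rating stays at the initial 1500.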

04

Dixon–Coles goal model + host advantage

Elo says who is stronger; Dixon–Coles turns that into goals. It models each side's expected goals as Poisson rates with a low-score correction, and fits one shared host-advantage term.

For a match between teams i and j, team i’s log goal rate is $\log \lambda_i = c + \mathrm{attack}_i - \mathrm{defense}_j + h \cdot \mathrm{home}_i - \mathrm{altitude} \cdot \mathrm{burden}_i$. The Dixon–Coles τ correction adjusts the probabilities of the 0-0, 1-0, 0-1 and 1-1 scorelines, which independent Poisson rates are known to misestimate. An Elo prior pulls each team’s attack/defense toward its rating-implied strength, and the fit is time-weighted (recent matches count more).

  • Intercept c: 0.31 (league goal rate)
  • Host adv. h: +0.29 (log-goal coefficient)
  • Altitude coef: 0.168 (visitor goal penalty)
  • ρ (low-score): -0.03 (Dixon–Coles τ)

Host advantage. A single coefficient (h = +0.29) lifts the home/host side’s goal rate. For 2026 it applies to the three hosts (USA, Mexico, Canada) on their genuine home fixtures; it is discounted at neutral World Cup venues, and the altitude term handles Mexico City’s elevation separately.

Fit on data from 2008-01-01, 4-year half-life (1460 days), Elo-prior strength 20. Training cutoff 2026-06-03.
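The goal model described in this section can be sketched as follows, using the fitted coefficients quoted above. This is an illustration under assumptions, not the committed code: the τ function follows the standard Dixon–Coles parameterisation, the sign convention for ρ is assumed to match it, the scoreline grid is truncated at 10 goals and renormalised, and all function names are hypothetical.

```python
import math

C = 0.31       # intercept
H = 0.29       # host advantage (log-goal coefficient)
ALT = 0.168    # altitude coefficient (visitor goal penalty)
RHO = -0.03    # low-score dependence parameter


def goal_rate(attack, opp_defense, is_host=0, altitude_burden=0.0):
    """Poisson rate: exp(c + attack - defense + h*home - altitude*burden)."""
    return math.exp(C + attack - opp_defense + H * is_host - ALT * altitude_burden)


def tau(x, y, lam, mu, rho=RHO):
    """Standard Dixon-Coles adjustment; only the four low scores are changed."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def poisson_pmf(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)


def score_matrix(lam, mu, max_goals=10):
    """P(home scores x, away scores y) on a truncated, renormalised grid."""
    grid = [[tau(x, y, lam, mu) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
             for y in range(max_goals + 1)]
            for x in range(max_goals + 1)]
    total = sum(sum(row) for row in grid)
    return [[p / total for p in row] for row in grid]


def outcome_probs(matrix):
    """Collapse the scoreline grid into (home win, draw, away win)."""
    home = sum(p for x, row in enumerate(matrix) for y, p in enumerate(row) if x > y)
    draw = sum(p for x, row in enumerate(matrix) for y, p in enumerate(row) if x == y)
    away = sum(p for x, row in enumerate(matrix) for y, p in enumerate(row) if x < y)
    return home, draw, away
```

For two otherwise equal sides, `goal_rate(0.0, 0.0, is_host=1)` against `goal_rate(0.0, 0.0)` shows the host term in isolation. Sampling a scoreline for the simulator then amounts to drawing one cell from `score_matrix`.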

05

Calibration

Calibration was assessed, not assumed. We tested whether a post-hoc map would improve the probabilities before deciding what the published run carries.

Published run · calibration method: none (checked, adequate)

Method | 2018 ΔBrier | 2022 ΔBrier | Log loss | Verdict
platt | +0.5% | -0.3% | worse both years | net-neutral on Brier, no calibration gain
isotonic | +0.3% | +2.3% | much worse 2022 | rejected: >1% Brier regression (small-window overfit)

Assessed finding: adequate as-is, no map applied. Isotonic regression overfits the small World Cup window (+2.3% Brier on 2022); Platt scaling is net-neutral on Brier and degrades log loss. The raw model is acceptably calibrated — visible in the reliability diagram below — so the published run records calibration_method = none. This is a checked-and-adequate result, recorded with its evidence, not an omission.

06

Backtest results

The model was refit strictly before each past World Cup opener — no leakage — then scored on the matches that followed. Tuning happened on 2022; 2018 was held out.

World Cup | Brier | Log loss | Baseline Brier | ΔBrier | Gate
2018 (held-out) | 0.5976 | 0.9985 | 0.5856 | +0.0120 (0.48σ) | within 1σ ✓
2022 (tuned) | 0.6123 | 1.0286 | 0.6018 | +0.0105 (0.35σ) | within 1σ ✓

Both years beat a uniform (⅓, ⅓, ⅓) model comfortably (Brier 0.667 / log loss 1.099). The comparison column is an Elo-logistic baseline on the same information.
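The uniform reference figures can be reproduced with a short sketch. It assumes the multi-class Brier score is the sum of squared errors over the three outcomes (not divided by the number of classes) and that log loss uses the natural logarithm; both choices are inferred from the 0.667 and 1.099 values quoted above, and the function names are illustrative.

```python
import math


def brier(probs, outcome):
    """Multi-class Brier score for one match: squared error summed over outcomes."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))


def log_loss(probs, outcome):
    """Negative natural log of the probability given to what actually happened."""
    return -math.log(probs[outcome])


uniform = (1 / 3, 1 / 3, 1 / 3)  # (home win, draw, away win)
print(round(brier(uniform, 0), 3), round(log_loss(uniform, 0), 3))  # 0.667 1.099
```

A tournament score is then the mean of these per-match values, so any model scoring below 0.667 on Brier is doing better than knowing nothing.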

On the gate. A calibrated logistic on the same Elo signal is a very strong win/draw/loss baseline, but the simulator needs Dixon–Coles’s scoreline structure (exact scores, extra time, penalties, goal-difference tie-breaks) that a W/D/L classifier simply cannot produce. So the gate is set to detect brokenness, not to pick a winner for the same job: Dixon–Coles must land within 1σ of the baseline on Brier each year. Both clear it comfortably (0.48σ and 0.35σ), and the tuned config improved held-out 2018.

2022 tuning trajectory (8 steps)

Coordinate descent on a combined Brier/log-loss objective normalised to the default config. The data preferred a longer half-life and a stronger Elo prior, consistent with the hypothesis that the default settings under-rated traditional powers.

Step (knob moved) | Loss | Brier | Log loss
default (730 / 5 / default tiers) | 1.0000 | 0.6250 | 1.0453
half-life → 1460 d | 0.9946 | 0.6212 | 1.0405
λ prior → 20 | 0.9905 | 0.6186 | 1.0360
tier 1 → 1.10 | 0.9898 | 0.6182 | 1.0355
tier 2 → 0.75 | 0.9891 | 0.6177 | 1.0348
tier 3 → 0.60 | 0.9874 | 0.6165 | 1.0333
tier 4 → 0.45 | 0.9856 | 0.6151 | 1.0317
tier 5 → 0.30 | 0.9819 | 0.6123 | 1.0286
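
The Loss column in the trajectory above is consistent, to within rounding, with an equal-weight average of the Brier and log-loss ratios against the default config. The sketch below reproduces it under that assumption. The 50/50 weighting is an inference from the published numbers, not a documented formula, and `combined_loss` is a hypothetical name.

```python
def combined_loss(brier, log_loss, brier_ref=0.6250, log_loss_ref=1.0453):
    """Mean of the two metrics, each divided by its default-config value.

    Assumption: equal weights. The default config scores exactly 1.0.
    """
    return 0.5 * (brier / brier_ref + log_loss / log_loss_ref)


# (Brier, log loss, published Loss) for each step of the 2022 trajectory.
trajectory = [
    (0.6250, 1.0453, 1.0000),
    (0.6212, 1.0405, 0.9946),
    (0.6186, 1.0360, 0.9905),
    (0.6182, 1.0355, 0.9898),
    (0.6177, 1.0348, 0.9891),
    (0.6165, 1.0333, 0.9874),
    (0.6151, 1.0317, 0.9856),
    (0.6123, 1.0286, 0.9819),
]

for brier, log_loss, published in trajectory:
    print(f"{combined_loss(brier, log_loss):.4f} vs published {published:.4f}")
```

Coordinate descent then means changing one knob at a time and keeping the change only if this number falls.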
07

Reliability diagram

Calibration, drawn. For each predicted-probability bin, where does the model land versus how often the outcome actually happened? Points on the dashed diagonal are perfectly calibrated.

Reliability diagram, 192 match-outcomes. Axes: predicted probability (x) against observed frequency (y), both from 0 to 1. Bubble size ∝ bin sample count; dashed line = perfect calibration.

Outcome | Predicted | Observed | n
Home win | 10% | 100% | 1
Home win | 18% | 0% | 4
Home win | 26% | 31% | 16
Home win | 36% | 46% | 13
Home win | 45% | 45% | 20
Home win | 55% | 67% | 9
Home win | 64% | 100% | 1
Draw | 27% | 23% | 47
Draw | 31% | 24% | 17
Away win | 16% | 17% | 6
Away win | 24% | 21% | 19
Away win | 34% | 39% | 18
Away win | 45% | 42% | 12
Away win | 52% | 57% | 7
Away win | 64% | 0% | 2

With only ~64 matches per World Cup the bins are sparse — large bubbles carry the signal, small ones are noise-dominated, so read the diagram by bubble size. The points hug the diagonal without a systematic over- or under-confidence bias, which is why no calibration map was applied.

Rendered from reliability_2018.csv / reliability_2022.csv.
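
For readers who want to rebuild a diagram like this from their own predictions, a minimal binning sketch follows. It assumes equal-width probability bins and one outcome class at a time. The bin count and function name are illustrative, and the layout of the reliability CSVs is not assumed here.

```python
def reliability_bins(predicted, observed, n_bins=10):
    """Group (probability, 0/1 outcome) pairs into equal-width bins.

    Returns one (mean predicted, observed frequency, count) tuple per
    non-empty bin, ordered by probability.
    """
    sums = {}
    for p, y in zip(predicted, observed):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        acc = sums.setdefault(idx, [0.0, 0, 0])
        acc[0] += p   # running sum of predicted probabilities
        acc[1] += y   # number of times the outcome happened
        acc[2] += 1   # bin sample count (the bubble size)
    return [(acc[0] / acc[2], acc[1] / acc[2], acc[2])
            for _, acc in sorted(sums.items())]
```

Each returned tuple is one bubble: x is the mean predicted probability, y is the observed frequency, and the count sets the size. Bins with a count of one or two can only land on 0% or 100% (or 50%), which is why the extreme points above should not be over-read.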

08

Monte Carlo simulation

To get tournament odds, the engine plays the whole World Cup tens of thousands of times — group stage through final — sampling each match from its Dixon–Coles scoreline distribution.

  • RNG seed: 42 (fixed → reproducible)
  • Published run: 50k simulations
  • Convergence check: 10k ↔ 50k (champion-share drift)
  • Max drift: 0.634pp (on Spain, vs 50k)

Each simulated tournament resolves group standings (with the full tie-breaker chain below), selects the eight best third-placed teams, fills the knockout bracket via the official slotting matrix, and plays to a champion. Aggregating across runs gives every team’s probability of reaching each stage. A convergence check comparing 10k and 50k simulations found the largest champion-share drift was 0.634pp (on Spain) — just above the 0.5pp target, driven by noise in the 10k arm. The published forecast therefore runs at 50,000 simulations.
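
A back-of-envelope check on that drift: a champion share estimated from n simulated tournaments is a binomial proportion, so its sampling noise can be approximated directly. The sketch below uses the illustrative 10.4% share for Spain and assumes the 10k and 50k arms are independent draws, which may not match how the actual check was run.

```python
import math


def share_std_error_pp(p, n_sims):
    """Binomial standard error of a simulated share, in percentage points."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n_sims)


p_spain = 0.104  # illustrative champion share
se_10k = share_std_error_pp(p_spain, 10_000)
se_50k = share_std_error_pp(p_spain, 50_000)
# Standard error of the difference between two independent estimates.
se_diff = math.hypot(se_10k, se_50k)

print(f"10k: ±{se_10k:.2f}pp, 50k: ±{se_50k:.2f}pp, difference: ±{se_diff:.2f}pp")
```

Under these assumptions the 10k arm carries roughly 0.31pp of noise on a 10% share against roughly 0.14pp at 50k, so a 0.634pp gap is a little under two standard errors of the difference. That fits the reading that the drift comes mostly from the smaller run.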

Champion probability — top 12 (illustrative 10k run)
  • Spain 10.4%
  • Argentina 9.5%
  • Brazil 7.3%
  • France 7.1%
  • England 5.3%
  • Portugal 4.5%
  • Germany 4.2%
  • Netherlands 4.2%
  • Mexico 3.8%
  • Colombia 3.7%
  • Belgium 3.2%
  • Japan 3.2%

Snapshot of a backtest-acceptance run — the live numbers are on the forecast page.

09

Group tie-breakers — FIFA Article 13

The simulator implements the official 2026 tie-breaker chain in full. Getting the order right matters: it changes which third-placed teams qualify, and therefore the knockout bracket.

  1. 1a–1c: Head-to-head first, among the tied teams only: points, then goal difference, then goals scored.
  2. Re-apply: If head-to-head separates some but not all, re-apply 1a–1c to the teams still tied (the step does not restart from scratch).
  3. 2d–2e: Overall goal difference, then overall goals scored across all group matches.
  4. 3g: Most recent FIFA / Coca-Cola Men's World Ranking, FIFA's official terminal step.

Disclosure. The simulator implements FIFA Article 13 in full through overall goals scored. Beyond that, since it does not model card/disciplinary events, ties are broken using the team’s pre-tournament FIFA Ranking: the 1 April 2026 edition, the locked launch ranking (FIFA’s official terminal step). Drawing of lots is not used, as FIFA does not use it for 2026. In practice, ties surviving past overall goals scored occur in well under 1% of simulated group instances and have a negligible effect on aggregate probabilities. The pre-tournament FIFA Ranking is a deterministic, reproducible stand-in for the seeded-draw approximation an earlier plan had used.

For the best-8 third-placed ranking across groups the chain is identical except head-to-head is skipped (those teams have not played each other): points → overall GD → overall GS → FIFA Ranking.
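
The third-placed ranking just described reduces to a single sort key. A minimal sketch follows. The `ThirdPlaced` record and its field names are hypothetical, and a lower `fifa_rank` is taken to mean a better-ranked team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThirdPlaced:
    team: str
    points: int
    goal_diff: int
    goals_scored: int
    fifa_rank: int  # pre-tournament FIFA Ranking position; 1 is best


def best_third_placed(teams, n_qualify=8):
    """Rank third-placed teams: points, overall GD, overall GS, FIFA Ranking."""
    ranked = sorted(
        teams,
        key=lambda t: (-t.points, -t.goal_diff, -t.goals_scored, t.fifa_rank),
    )
    return ranked[:n_qualify]
```

Because the FIFA Ranking position is unique per team, the key never ties, so the same simulated group results always produce the same eight qualifiers.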

10

Reproducibility

Every prediction run records a 10-field provenance block, so any published forecast can be traced to the exact model, code, data, and random seed that produced it.

Current run: forecast__elo-dc-v1.0__seed42__cond0__rerun20260610live

Field | Value | What it pins
model_version | elo-dc-v1.0 | Identifies the committed Elo + Dixon–Coles fit.
simulator_version | shell-v0.1 | Identifies the Monte Carlo engine + tie-breaker logic.
code_version / git_commit | 251b05fc78ce | Repository commit the run was produced from.
data_version | 2026-06-10 | Snapshot date of the cleaned results dataset.
data_hash | ee6953cd4cd6d943… (SHA-256) | Hash of the source results CSV (martj42/international_results).
rng_seed | 42 | Monte Carlo seed, fixed so a re-run reproduces the same draws.
training_cutoff_date | 2026-06-10 | Matches on or before this date were used to fit; nothing after leaks in.
data_cutoff_time | null (pre-tournament) | Null pre-tournament; set to the as-of time once live results condition the run.
calibration_method | none | Calibration assessed; no map applied (see calibration section).
run_type | model_forecast | Distinguishes a forecast run from a data build or backtest.

Read live from the published forecast run (the current_run view) — the same source the forecast page uses, so it refreshes at every re-run.
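
Anyone holding a copy of the source CSV can check it against the pinned hash with a few lines. This sketch assumes the published value is a SHA-256 over the raw file bytes and that the displayed string is a truncated prefix of the full digest. The function names and any file path passed in are placeholders.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks so large CSVs need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_hash(path, pinned_prefix):
    """True if the file's digest starts with the (possibly truncated) pinned value."""
    return sha256_of_file(path).startswith(pinned_prefix.lower())
```

A call such as `matches_pinned_hash("results.csv", "ee6953cd4cd6d943")` would return True only for a byte-identical copy of the pinned file. Re-saving the CSV with different line endings changes the hash even when the match data is the same.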

11

Limitations

Where the model is weakest, stated plainly.

  • No team news. Injuries, suspensions, lineups and form within a camp are invisible to it — it sees match results, not who is on the pitch.
  • No market signal. By design it ignores betting odds, so its champion shares are more compressed than the market’s (favourites land near 7–10%, not 12–15%).
  • Sparse World Cup evidence. Only two tournaments (128 matches) back the calibration; reliability bins are noisy and a single tournament cannot fully validate tail behaviour.
  • Grid-boundary tuning. Half-life and the Elo prior both landed at the top of their search grids — the true optimum may sit beyond, suggesting even more weight on long-run strength.
  • Static within a day. The forecast updates when results are entered, not continuously; between updates it does not react to news.
12

Update cadence

When the numbers move, and why.

  • Pre-tournament. A final model re-run on the June 10 data freeze sets the published opening forecast — superseding any earlier provisional run.
  • During the tournament. As each result is entered, the simulator re-conditions on completed matches and re-runs, so standings and championship odds shift to reflect what has actually happened.
  • Provenance preserved. Each re-run writes a fresh reproducibility block (above), so any past state of the forecast remains traceable.

Independent, calibrated, and probabilistic — not affiliated with FIFA, and not betting advice. Model figures are transcribed from the committed artifacts (backtest report, Dixon–Coles parameters, reliability CSVs); the reproducibility block and calibration status are read live from the published forecast run.