Methodology

Probabilistic · calibrated

A transparent account of how the forecast is produced: the model, the data it learns from, how it scored on past World Cups, and exactly where it stops being certain. The brand says “Prediction Engine”; the claims here say probability. Both are true.

What this model is, and is not

An honest frame before any numbers. A disclosed limitation is a feature in a forecast, not a flaw.

It is

A statistical model of international football: strength ratings → a goal model → simulated tournaments.
Calibrated and backtested on the 2018 and 2022 World Cups before being trusted on 2026.
Fully reproducible: fixed data hash, fixed RNG seed, versioned code.
Independent. Built from public match results only.

It is not

An oracle. A 7% champion probability means roughly a 1-in-14 shot, not a prediction the team wins.
A betting product. It uses no market or odds data, by design.
Injury/lineup/transfer-aware in real time: it sees results, not team news.
Affiliated with FIFA or any official body.

Data sources

One public dataset of international results, cleaned and tournament-weighted. No market data, no scraped odds.

martj42 / international_results · latest match 2026-06-01

Every full international since the 19th century. After cleaning (dropping unplayed and unparseable rows), 49,291 matches feed the model. Each match carries a competition tier (a World Cup final counts for more than a friendly), and the exact tier weights were tuned on the backtest (below).

Tier	Matches
Tier 1	964
Tier 2	3,391
Tier 3	8,771
Tier 4	8,231
Tier 5	27,934

Source pinned by SHA-256 37e5ce3b82849279…. See the reproducibility note.

The pipeline

Three stages turn historical results into a champion probability for all 48 teams.

1 · Elo

Tournament-weighted strength rating per team, updated match by match.

2 · Dixon–Coles

A goal model: turns ratings into a scoreline distribution for any fixture.

3 · Monte Carlo

Simulates the whole tournament tens of thousands of times.

Elo ratings

Standard international-football Elo (the eloratings.net tradition), with two domain-specific refinements: competition-tier weighting and an altitude correction.

Base K-factor

Home advantage: +65 Elo points (non-neutral venues)
Margin multiplier: 1 + ln(1 + |GD|)
Initial rating: 1500

The update per match scales with the competition tier and the goal margin. An altitude correction stops high-altitude home wins (Quito, La Paz, Bogotá) from inflating ratings: a sea-level side visiting altitude is already expected to do worse, so beating them earns a smaller rating gain. Tier weights, tuned on the backtest:

Tier 1 ×1.1 · Tier 2 ×0.75 · Tier 3 ×0.6 · Tier 4 ×0.45 · Tier 5 ×0.3
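A minimal sketch of a single rating update may help make the scaling concrete. It assumes the conventional Elo win expectancy on a 400-point scale and assumes the tier weight and margin multiplier combine multiplicatively. The base K-factor is not stated in this section, so `base_k` is a hypothetical argument. The altitude correction is left out, and the function names are illustrative only.

```python
import math

# Tier weights and home advantage as quoted in the text.
TIER_WEIGHT = {1: 1.10, 2: 0.75, 3: 0.60, 4: 0.45, 5: 0.30}
HOME_ADVANTAGE = 65.0  # Elo points, applied at non-neutral venues only


def expected_score(rating_a: float, rating_b: float) -> float:
    """Conventional Elo win expectancy on a 400-point scale (an assumption)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(home, away, home_goals, away_goals, tier, base_k, neutral=False):
    """Return the new (home, away) ratings after one match.

    base_k is hypothetical: the section does not state the base K-factor.
    The altitude correction described in the text is omitted here.
    """
    bonus = 0.0 if neutral else HOME_ADVANTAGE
    expected_home = expected_score(home + bonus, away)
    if home_goals > away_goals:
        actual_home = 1.0
    elif home_goals == away_goals:
        actual_home = 0.5
    else:
        actual_home = 0.0
    # Margin multiplier from the text: 1 + ln(1 + |GD|), equal to 1 for a draw.
    margin = 1.0 + math.log(1.0 + abs(home_goals - away_goals))
    delta = base_k * TIER_WEIGHT[tier] * margin * (actual_home - expected_home)
    return home + delta, away - delta
```

Under these assumptions the update is zero-sum, and the same result in a Tier 5 friendly moves ratings by less than a third of what it would in a Tier 1 match.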

In the final refit, Ecuador sits #16 and Bolivia #84, a sign the altitude fix holds.

Dixon–Coles goal model + host advantage

Elo says who is stronger; Dixon–Coles turns that into goals. It models each side's expected goals as Poisson rates with a low-score correction, and fits one shared host-advantage term.

For a match between teams i and j, the log goal rate of team i is log λᵢ = c + attackᵢ − defenseⱼ + h·homeᵢ − altitude·burdenᵢ. The Dixon–Coles τ correction adjusts the probabilities of the four low scorelines (0-0, 1-0, 0-1 and 1-1), which independent Poisson rates are known to misestimate. An Elo prior pulls each team’s attack/defense toward its rating-implied strength, and the fit is time-weighted (recent matches count more).
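The low-score correction can be sketched as follows. This uses the standard Dixon–Coles τ form and the ρ value quoted in this section. The goal rates `lam` and `mu` are taken as inputs rather than derived from the fitted attack/defense terms, and the 10-goal truncation is an assumption made only to keep the grid finite.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k goals under a Poisson rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def tau(x: int, y: int, lam: float, mu: float, rho: float) -> float:
    """Standard Dixon-Coles adjustment; only the four low scorelines change."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def scoreline_matrix(lam: float, mu: float, rho: float = -0.03, max_goals: int = 10):
    """grid[x][y] = P(side one scores x, side two scores y).

    The grid is truncated at max_goals (an assumption) and renormalised.
    """
    grid = [
        [tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
         for y in range(max_goals + 1)]
        for x in range(max_goals + 1)
    ]
    total = sum(sum(row) for row in grid)
    return [[p / total for p in row] for row in grid]
```

With a negative ρ, as here, the correction shifts a little probability toward 0-0 and 1-1 and away from 1-0 and 0-1, while leaving the total over those four cells unchanged.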

Intercept c: 0.31 (league goal rate)
Host adv. h: +0.29 (log-goal coefficient)
Altitude coef: 0.168 (visitor goal penalty)
ρ (low-score): -0.03 (Dixon–Coles τ)

Host advantage. A single coefficient (h = +0.29) lifts the home/host side’s goal rate. For 2026 it applies to the three hosts (USA, Mexico, Canada) on their genuine home fixtures; it is discounted at neutral World Cup venues, and the altitude term handles Mexico City’s elevation separately.

Fit on data from 2008-01-01, 4-year half-life (1460 days), Elo-prior strength 20. Training cutoff 2026-06-03.
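As a rough illustration of the time weighting, a half-life is usually applied as exponential decay, so a match 1460 days before the cutoff counts half as much as one played on the cutoff date. The exponential form and the function name are assumptions; the dates and half-life are the ones quoted above.

```python
from datetime import date, timedelta

HALF_LIFE_DAYS = 1460          # 4-year half-life from the text
FIT_START = date(2008, 1, 1)   # earliest match used in the fit
CUTOFF = date(2026, 6, 3)      # training cutoff


def time_weight(match_date: date, cutoff: date = CUTOFF) -> float:
    """Exponential-decay weight (assumed form): halves every HALF_LIFE_DAYS."""
    if match_date < FIT_START or match_date > cutoff:
        return 0.0  # outside the fitting window
    age_days = (cutoff - match_date).days
    return 0.5 ** (age_days / HALF_LIFE_DAYS)


# Example: a match exactly one half-life before the cutoff.
one_half_life_ago = CUTOFF - timedelta(days=HALF_LIFE_DAYS)
weight = time_weight(one_half_life_ago)
```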

Calibration

Calibration was assessed, not assumed. We tested whether a post-hoc map would improve the probabilities before deciding what the published run carries.

Published run · calibration method: none (checked, adequate)

Method	2018 ΔBrier	2022 ΔBrier	Log loss	Verdict
platt	+0.5%	-0.3%	worse both years	net-neutral on Brier, no calibration gain
isotonic	+0.3%	+2.3%	much worse 2022	rejected: >1% Brier regression (small-window overfit)

Assessed finding: adequate as-is, no map applied. Isotonic regression overfits the small World Cup window (+2.3% Brier on 2022); Platt scaling is net-neutral on Brier and degrades log loss. The raw model is acceptably calibrated (visible in the reliability diagram below), so the published run records calibration_method = none. This is a checked-and-adequate result, recorded with its evidence, not an omission.

Backtest results

The model was refit strictly before each past World Cup opener (no leakage), then scored on the matches that followed. Tuning happened on 2022; 2018 was held out.

World Cup	Brier	Log loss	Baseline Brier	ΔBrier	Gate
2018 (held-out)	0.5976	0.9985	0.5856	+0.0120 (0.48σ)	within 1σ ✓
2022 (tuned)	0.6123	1.0286	0.6018	+0.0105 (0.35σ)	within 1σ ✓

Both years beat a uniform (⅓, ⅓, ⅓) model comfortably (Brier 0.667 / log loss 1.099). The comparison column is an Elo-logistic baseline on the same information.
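The uniform reference figures can be reproduced with a short sketch. It assumes the multi-class Brier score summed over the three outcomes (the form consistent with the 0.667 quoted above) and the natural-log loss; the function names are illustrative.

```python
import math


def brier(probs, outcome: int) -> float:
    """Multi-class Brier score: squared error summed over all outcomes."""
    return sum(
        (p - (1.0 if i == outcome else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def log_loss(probs, outcome: int) -> float:
    """Negative natural log of the probability given to what happened."""
    return -math.log(probs[outcome])


# Uniform forecast over (home win, draw, away win).
uniform = (1 / 3, 1 / 3, 1 / 3)
uniform_brier = brier(uniform, 0)        # 2/3, whichever outcome occurs
uniform_log_loss = log_loss(uniform, 0)  # ln 3
```

Per-match scores are averaged over a tournament to give the table values, so a model scoring below 0.667 and 1.099 is adding information beyond a coin-flip over three outcomes.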

On the gate. A calibrated logistic on the same Elo signal is a very strong win/draw/loss baseline, but the simulator needs Dixon–Coles’s scoreline structure (exact scores, extra time, penalties, goal-difference tie-breaks) that a W/D/L classifier simply cannot produce. So the gate is set to detect a material shortfall against that baseline, not to pick a winner for the same job: Dixon–Coles must land within 1σ of the baseline Brier each year. Both years clear it comfortably (0.48σ and 0.35σ), and the configuration tuned on 2022 also clears it on the held-out 2018.

Reliability diagram

Calibration, drawn. For each predicted-probability bin, where does the model land versus how often the outcome actually happened? Points on the dashed diagonal are perfectly calibrated.

192 match-outcomes

Legend: Home win · Draw · Away win. Bubble size ∝ bin sample count; dashed line = perfect calibration.

With only ~64 matches per World Cup the bins are sparse: large bubbles carry the signal, small ones are noise-dominated, so read the diagram by bubble size. The points hug the diagonal without a systematic over- or under-confidence bias, which is why no calibration map was applied.

Rendered from reliability_2018.csv / reliability_2022.csv.
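The binning behind a diagram like this can be sketched as below. Equal-width bins and a bin count of 10 are assumptions, and the input is a plain list of (probability, happened) pairs because the column layout of the reliability CSVs is not described here.

```python
def reliability_bins(predictions, n_bins: int = 10):
    """Group (probability, happened) pairs into equal-width probability bins.

    Returns one row per non-empty bin with the mean predicted probability,
    the observed frequency, and the sample count (the bubble size).
    """
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # prob sum, hits, count
    for prob, happened in predictions:
        idx = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in the top bin
        sums[idx][0] += prob
        sums[idx][1] += int(happened)
        sums[idx][2] += 1
    return [
        {
            "bin": i,
            "mean_predicted": prob_sum / count,
            "observed": hits / count,
            "count": count,
        }
        for i, (prob_sum, hits, count) in enumerate(sums)
        if count > 0
    ]


# Illustrative input only, not model output.
rows = reliability_bins(
    [(0.25, True), (0.25, False), (0.25, False), (0.25, False), (0.95, True)]
)
```

A bin sits on the diagonal when `mean_predicted` equals `observed`; bins with a small `count` are the noise-dominated bubbles mentioned above.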

Validation: did it pick the winner?

The match scores tell you the model is calibrated; this asks the harder question. Refit strictly before each past World Cup, how did it rate the team that actually won, and did it converge on that team as the bracket played out? All figures are bound to the committed backtest artifacts.

(a) Pre-tournament champion odds: both winners were rated, neither was the favourite

Before a ball was kicked, the model refit strictly to the pre-opener data and simulated each 32-team bracket for champion odds. In both backtest years the team that went on to win was rated inside the model’s top five of 32, yet in both years the single favourite was a different team that did not win.

World Cup	Eventual champion	Model rank	Champion odds	Model favourite
2022	Argentina	#2 of 32	10.2%	Brazil (15.9%)
2018	France	#5 of 32	5.8%	Brazil (14.3%)

Argentina was rated #2 at 10.2% in 2022 and France #5 at 5.8% in 2018, while the model’s single favourite, Brazil, topped the board in both years and won neither. That is the calibrated, honest shape of a champion forecast: the pre-tournament favourite usually does not win, and a top-five rating for the team that did is the right kind of hit.

Monte Carlo simulation

To get tournament odds, the engine plays the whole World Cup tens of thousands of times (group stage through final), sampling each match from its Dixon–Coles scoreline distribution.

RNG seed: fixed → reproducible
Published run: 50k simulations
Convergence check: 10k ↔ 50k (champion-share drift)
Max drift: 0.634pp (on Spain, vs 50k)

Each simulated tournament resolves group standings (with the full tie-breaker chain below), selects the eight best third-placed teams, fills the knockout bracket via the official slotting matrix, and plays to a champion. Aggregating across runs gives every team’s probability of reaching each stage. A convergence check comparing 10k and 50k simulations found the largest champion-share drift was 0.634pp (on Spain), just above the 0.5pp target, driven by noise in the 10k arm. The published forecast therefore runs at 50,000 simulations.
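A convergence check of this kind can be sketched in a few lines. The helper names are hypothetical and the shares in the example are placeholders, not model output. The binomial standard error is included as a rough guide to how much a share can move from sampling noise alone.

```python
import math


def max_champion_drift(shares_a: dict, shares_b: dict):
    """Return (team, drift in percentage points) for the largest gap
    in champion share between two simulation runs."""
    teams = sorted(set(shares_a) | set(shares_b))  # sorted for determinism
    def gap(team):
        return abs(shares_a.get(team, 0.0) - shares_b.get(team, 0.0))
    worst = max(teams, key=gap)
    return worst, 100.0 * gap(worst)


def share_standard_error(share: float, n_sims: int) -> float:
    """Binomial standard error of a simulated share."""
    return math.sqrt(share * (1.0 - share) / n_sims)


# Placeholder shares for two hypothetical runs.
run_10k = {"Team A": 0.100, "Team B": 0.050}
run_50k = {"Team A": 0.104, "Team B": 0.051}
team, drift_pp = max_champion_drift(run_10k, run_50k)
```

For scale, a share near 10% estimated from 10,000 simulations has a standard error of about 0.3pp, so a drift of a few tenths of a point between a 10k and a 50k run is within what sampling noise in the smaller run would produce.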

[Chart: Champion probability, top 12 (illustrative 10k run)]

Group tie-breakers: FIFA Article 13

The simulator implements the official 2026 tie-breaker chain in full. Getting the order right matters: it changes which third-placed teams qualify, and therefore the knockout bracket.

1a–1c: Head-to-head first, among the tied teams only: points, then goal difference, then goals scored.
re-apply: If head-to-head separates some but not all, re-apply 1a–1c to the teams still tied (the step does not restart from scratch).
2d–2e: Overall goal difference, then overall goals scored across all group matches.
3g: Most recent FIFA / Coca-Cola Men's World Ranking (FIFA's official terminal step).

Disclosure. The simulator implements FIFA Article 13 in full through overall goals scored. Beyond that, since it does not model card/disciplinary events, ties are broken using the team’s pre-tournament FIFA Ranking: the 1 April 2026 edition, the locked launch ranking (FIFA’s official terminal step). Drawing of lots is not used; FIFA does not use it for 2026. In practice, ties surviving past overall goals scored occur in well under 1% of simulated group instances and have a negligible effect on aggregate probabilities. The pre-tournament FIFA Ranking is a deterministic, reproducible replacement for the seeded-draw approximation an earlier plan had used.

For the best-8 third-placed ranking across groups the chain is identical except head-to-head is skipped (those teams have not played each other): points → overall GD → overall GS → FIFA Ranking.
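The cross-group ranking of third-placed teams reduces to a single sort key, sketched below. The `ThirdPlaced` record and function name are hypothetical; the key order follows the chain just described, with a lower FIFA rank number meaning a better-ranked team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThirdPlaced:
    team: str
    points: int
    goal_diff: int
    goals_scored: int
    fifa_rank: int  # 1 is the best-ranked team


def best_third_placed(teams, n_qualify: int = 8):
    """Order by points, overall GD, overall GS, then FIFA ranking.

    Head-to-head is skipped because these teams have not met.
    """
    ordered = sorted(
        teams,
        key=lambda t: (-t.points, -t.goal_diff, -t.goals_scored, t.fifa_rank),
    )
    return ordered[:n_qualify]


# Illustrative records only.
field = [
    ThirdPlaced("A", points=4, goal_diff=1, goals_scored=3, fifa_rank=20),
    ThirdPlaced("B", points=4, goal_diff=1, goals_scored=3, fifa_rank=10),
    ThirdPlaced("C", points=4, goal_diff=2, goals_scored=2, fifa_rank=50),
]
qualifiers = [t.team for t in best_third_placed(field, n_qualify=2)]
```

Here C goes through on goal difference, and B edges A only at the final FIFA Ranking step.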

Reading the forecast

What each number on the forecast, groups and team pages actually represents, and how the honest insight reads are derived. These definitions live here so the pages themselves can stay to the numbers.

Stage probabilities

On the forecast page, on each team profile and in the title race, every figure (champion, finalist, semi-finalist, quarter-finalist, group qualification) is the share of Monte Carlo simulations in which the team reaches that stage. Bars on the forecast chart are scaled to the leader for comparison; the percentage shown is the model’s absolute title odds, not a relative figure.

Title-race movement

The home leaderboard ranks teams by title probability and shows movement against the previous run (the same pre-matchday run the daily briefing recaps against): rank up reads green, rank down reads red, and an unchanged rank still carries a green or red chip when the odds moved without the rank changing. Before any prior run exists, rows render without deltas.

Dark horses and overvalued teams

Both cards first split the field on a frozen elite tier: the pre-tournament top 12 teams by the model’s frozen champion odds, ordered by a deterministic total order (champion odds, then finalist odds, then FIFA rank, then team name, so no two teams share a rank; a team with no champion odds sorts to the bottom and is never elite). The tier is fixed before kickoff and never flexes as teams are eliminated.
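The total order described above maps naturally onto a tuple sort key. This is a sketch under assumed field names (`name`, `champion`, `finalist`, `fifa_rank`), with a missing champion probability represented as `None`.

```python
def elite_tier(teams, size: int = 12):
    """Return the names of the frozen elite tier.

    Order: champion odds (desc), finalist odds (desc), FIFA rank (asc),
    team name (asc). Teams without champion odds sort last and are
    never counted as elite.
    """
    def key(team):
        missing = team["champion"] is None
        return (
            missing,                       # False sorts before True
            -(team["champion"] or 0.0),
            -(team["finalist"] or 0.0),
            team["fifa_rank"],
            team["name"],
        )

    ordered = sorted(teams, key=key)
    return [t["name"] for t in ordered[:size] if t["champion"] is not None]


# Illustrative rows only.
sample = [
    {"name": "X", "champion": 0.10, "finalist": 0.20, "fifa_rank": 5},
    {"name": "Y", "champion": 0.10, "finalist": 0.20, "fifa_rank": 3},
    {"name": "Z", "champion": None, "finalist": None, "fifa_rank": 1},
    {"name": "W", "champion": 0.12, "finalist": 0.18, "fifa_rank": 9},
]
tier = elite_tier(sample, size=3)
```

Because the team name is the last key, no two teams can tie, which is what makes the tier reproducible from run to run.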

Dark horses are the teams from outside that top 12 whose actual run most beat their frozen odds, ranked by the surprise index. The index blends a stage term (the negative log-probability the frozen baseline gave the deepest stage the team really reached, from results, not a live probability) with a match term (the mean, over the team’s played matches, of points banked minus the points the last pre-kickoff run expected), weighted 60/40 and normalized across the field. Restricting the pool to teams outside the top 12 is what keeps a pre-tournament favourite off the board no matter how deep it runs; a genuine unfancied over-performer takes its place. Elimination never removes a team from the board, so an over-performer that goes out still stands. Hosts are not excluded, because the frozen baseline already prices in home advantage, so a host that over-performs is beating its own elevated bar.
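One way to read the surprise index is sketched below. The 60/40 weights and the two terms come from the description above; the normalization method is not specified there, so min-max scaling of each term across the field is an assumption, as are the input layout and function names.

```python
import math


def min_max(values):
    """Scale values to [0, 1] across the field (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def surprise_index(field, stage_weight: float = 0.6, match_weight: float = 0.4):
    """field maps team -> (baseline probability of the deepest stage reached,
    list of (points banked, points expected) per played match).

    Every team is assumed to have played at least one match.
    """
    names = list(field)
    stage = [-math.log(field[n][0]) for n in names]
    match = [
        sum(got - expected for got, expected in field[n][1]) / len(field[n][1])
        for n in names
    ]
    stage_n, match_n = min_max(stage), min_max(match)
    return {
        n: stage_weight * stage_n[i] + match_weight * match_n[i]
        for i, n in enumerate(names)
    }


# Illustrative inputs only, not model output.
index = surprise_index({
    "A": (0.05, [(3, 1.0), (3, 1.2)]),   # unlikely deep run, beat expectations
    "B": (0.60, [(1, 2.0), (0, 1.8)]),   # expected run, under-performed
})
```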

Player award boards

The four award races on the players page. Two are honest current leaderboards of what has actually happened; two are the model's pick. Each board's reasoning lives here, not on the cards.

Golden Boot

An honest current leaderboard, ranked by actual goals scored so far, with the model’s expected total shown alongside. The board metric is expected goals, so a cross-role scorer (for example a midfielder who has scored) shows their expected goals here, not their role headline, and keeps their value inside its own expected-goals band.

Playmaker

The assist equivalent: a current leaderboard ranked by actual assists so far, with the model’s expected total alongside. Only players with at least one real assist appear. As with the boot, the board metric is expected assists, so a cross-role assister (a defender or forward) shows their expected assists rather than their role headline.

Golden Glove

The model’s keeper pick, ranked on individual shot-stopping quality: save percentage and goals-prevented percentiles within keepers, with the keeper’s team run only a secondary tilt.

Player of the Tournament

The model’s pick, a cross-role contribution percentile weighted by the player’s team’s title and deep-run odds.

A caveat. Player of the Tournament is the noisiest, least reliable board: it blends contributions across very different roles and leans on team odds, so it is the model’s current pick, not a prophecy. Read it as a talking point, not a forecast.

Player projections

How the per-player numbers on the players index and profile pages are built, and what the goal split and percentile context mean.

Player projections are probabilistic expectations, not predictions of certainty. They come from a pre-tournament statistical baseline, scaled by the team’s expected tournament run, and shift as results are recorded and the model re-runs. Actuals accrue per played match. A player tracked for live output but without a pre-tournament baseline shows actuals only; no projection is fabricated.

The goal split

For forwards and midfielders the expected goals headline is split into open-play and penalty contributions. Open-play goals come from non-penalty xG scaled by a finishing factor (an open-play goals-versus-xG conversion skill): above 1.0 the player historically out-finishes their xG, around 1.0 is neutral, below 1.0 they under-finish. Where no separate finishing skill is projected, open-play goals are derived directly from non-penalty xG.
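The split can be written out as a small helper. This is a sketch only: the parameter names are hypothetical, and the penalty contribution is taken as a given input because its derivation is not described here.

```python
def expected_goals_split(non_penalty_xg: float, penalty_goals: float,
                         finishing_factor=None) -> dict:
    """Split an expected-goals headline into open-play and penalty parts.

    finishing_factor > 1.0 means the player historically out-finishes xG;
    None means no separate finishing skill is projected, so open-play goals
    come straight from non-penalty xG.
    """
    factor = 1.0 if finishing_factor is None else finishing_factor
    open_play = non_penalty_xg * factor
    return {
        "open_play": open_play,
        "penalty": penalty_goals,
        "total": open_play + penalty_goals,
    }


# Illustrative numbers only.
clinical = expected_goals_split(2.0, 0.5, finishing_factor=1.1)
neutral = expected_goals_split(2.0, 0.5)
```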

Index display and percentiles

The players index shows each player’s strongest actual contribution (goals for a scorer, assists for a creator, clean sheets for a goalless keeper) against the matching expected, so a scorer is never hidden behind a zero role headline; the colour tier always compares the shown actual to its matching expected, never goals to expected assists. The per-position boards rank by a role rating that blends each player’s projected output with how they grade per-90 against peers in their position. On a profile, positional percentiles show where a player ranks among others of the same role on the baseline metrics (higher is better).

Eliminated players: the frozen now

On the pre-tournament expectations card, the “now” column is a full-tournament projection that re-projects as results come in. Once a player’s team is eliminated he plays no more matches, so a correct full-tournament projection collapses to what he actually produced. We enforce that directly: an eliminated player’s now is frozen to his as-played value, the sum of his played-match expected goals and assists. Because that value is built from recorded actuals rather than a model output, it cannot move on a later run and can never rise, so a knocked-out player’s number stops drifting upward the moment he is out.

Reproducibility

Every prediction run records a 10-field provenance block, so any published forecast can be traced to the exact model, code, data, and random seed that produced it.

Current run: live__20260719T221057Z__cond104__seed42 (live)

Field	Value	What it pins
model_version	elo-dc-v1.0	Identifies the committed Elo + Dixon–Coles fit.
simulator_version	shell-v0.1	Identifies the Monte Carlo engine + tie-breaker logic.
code_version / git_commit	9c3864a46fe0	Repository commit the run was produced from.
data_version	2026-07-19	Snapshot date of the cleaned results dataset.
data_hash	ee6953cd4cd6d943… (SHA-256)	Hash of the source results CSV (martj42/international_results).
rng_seed	42
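Checking a local copy of the results CSV against the pinned `data_hash` could look like the sketch below, using Python's standard `hashlib`. The helper names are hypothetical. Because the published hash is shown truncated, the comparison is on the visible prefix only.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large CSVs stay cheap."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_hash(path: str, pinned_prefix: str) -> bool:
    """True if the file's hash starts with the published (truncated) prefix."""
    return file_sha256(path).startswith(pinned_prefix.lower())
```

A prefix match is weaker evidence than a full-hash match, so a full comparison is preferable wherever the complete hash is available in the run's provenance block.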

Limitations

Where the model is weakest, stated plainly.

No team news. Injuries, suspensions, lineups and form within a camp are invisible to it: it sees match results, not who is on the pitch.
No market signal. By design it ignores betting odds, so its champion shares are more compressed than the market’s (favourites land near 7–10%, not 12–15%).
Sparse World Cup evidence. Only two tournaments (128 matches) back the calibration; reliability bins are noisy and a single tournament cannot fully validate tail behaviour.
Grid-boundary tuning. Half-life and the Elo prior both landed at the top of their search grids: the true optimum may sit beyond, suggesting even more weight on long-run strength.
Static within a day. The forecast updates when results are entered, not continuously; between updates it does not react to news.

Update cadence

When the numbers move, and why.

Pre-tournament. A final model re-run on the June 10 data freeze sets the published opening forecast, superseding any earlier provisional run.
During the tournament. As each result is entered, the simulator re-conditions on completed matches and re-runs, so standings and championship odds shift to reflect what has actually happened.
Provenance preserved. Each re-run writes a fresh reproducibility block (above), so any past state of the forecast remains traceable.

Independent, calibrated, and probabilistic: not affiliated with FIFA, and not betting advice. Model figures are transcribed from the committed artifacts (backtest report, Dixon–Coles parameters, reliability CSVs); the reproducibility block and calibration status are read live from the published forecast run.

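Tuning path on the 2022 backtest, one knob moved per step (loss is relative to the default configuration):
Step (knob moved)	Loss	Brier	Log loss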
default (730 / 5 / default tiers)	1.0000	0.6250	1.0453
half-life → 1460 d	0.9946	0.6212	1.0405
λ prior → 20	0.9905	0.6186	1.0360
tier 1 → 1.10	0.9898	0.6182	1.0355
tier 2 → 0.75	0.9891	0.6177	1.0348
tier 3 → 0.60	0.9874	0.6165	1.0333
tier 4 → 0.45	0.9856	0.6151	1.0317
tier 5 → 0.30	0.9819	0.6123	1.0286

Static versus walk-forward fit on the 2022 backtest:
Fit	Brier	Log loss	MAE	Top-1 / Top-2
Walk-forward (updating)	0.6205	1.0422	0.4175	0.55 / 0.75
Static (pre-tournament)	0.6123	1.0286	n/a	n/a

Stage	Field	Rank	Odds
Round of 16	16	#5	6.6%
Quarter-finals	8	#3	14.7%
Semi-finals	4	#1	29.1%
Final	2	#1	57.3%