Methodology

Full math, modelling choices, and limitations. This is the long page — the rest of the study is non-technical; this is where the equations, formal definitions, and citations live. The five public tabs all sit on top of the pipeline described here.

1. Notation & state

The tracking transformer's input is a single frame: a snapshot of all 22 players plus the ball at a single 5 Hz sample. We represent each frame as a sequence of 23 tokens (22 players + 1 ball). Each token is a 7-dimensional feature vector.

1.1 Per-token features

Feature	Range	Definition
`x_norm`	\([-1, 1]\)	Pitch x-coordinate (along the length of the pitch), centred on midfield, scaled so each goal line sits at \(\pm 1\).
`y_norm`	\([-1, 1]\)	Pitch y-coordinate (across the width), centred on midfield, scaled so each touchline sits at \(\pm 1\).
`vx`	m/s	x-velocity from finite differences, clamped to \([-25, 25]\).
`vy`	m/s	y-velocity from finite differences, clamped to \([-25, 25]\).
`is_attacking_side`	\(\{0,1\}\)	1 iff the token belongs to the team currently in possession (ball token: 0).
`is_goalkeeper`	\(\{0,1\}\)	1 iff the player was identified as GK via the extreme-x heuristic over a calibration window.
`has_possession`	\(\{0,1\}\)	1 iff this token is the ball, or the player currently nearest the ball as a possession proxy.

Player ordering within the player block is arbitrary; the transformer is permutation-equivariant over tokens (no role one-hots, no positional encoding on the player axis). This is a deliberate inductive bias — identity is what the model has to infer from geometry + velocity, not what it is told.

1.2 Sampling & coordinates

PFF tracking is delivered at 30 Hz; we sample every 6th frame (stride 6) to land on a common 5 Hz grid. \(\Delta t = 6/30 = 0.2\) s for the finite-difference velocity.
Velocities are clamped at \(\pm 25\) m/s per component, which absorbs substitution teleports and tracker glitches without distorting realistic player speeds (human sprint speed \(\le 12\) m/s). Ball speeds can reach ~30 m/s, so only the hardest-struck passes and shots are clipped.
Coordinates are centred so the pitch midpoint is the origin and each side stretches to \(\pm 1\). The attacking direction is normalised so the team in possession always advances toward \(+x\); this lets the model learn a single notion of "forward".
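
As a rough illustration of the sampling and normalisation steps above, the following NumPy sketch downsamples 30 Hz positions with stride 6, takes finite-difference velocities with \(\Delta t = 0.2\) s, and clamps them to \(\pm 25\) m/s. The function names, the 105 m × 68 m pitch size, the backward-difference choice, the zero velocity on the first frame, and flipping only the x-axis are assumptions made for the example, not details confirmed by the pipeline.

```python
import numpy as np

RAW_HZ = 30            # provider frame rate
STRIDE = 6             # keep every 6th frame -> 5 Hz
DT = STRIDE / RAW_HZ   # 0.2 s between retained frames
V_CLAMP = 25.0         # m/s, applied per velocity component


def downsample_and_velocity(xy_m):
    """xy_m: (T_raw, N, 2) positions in metres at 30 Hz.

    Returns 5 Hz positions and clamped velocities, both (T, N, 2).
    """
    xy = np.asarray(xy_m, dtype=float)[::STRIDE]
    vel = np.zeros_like(xy)
    # Backward difference (assumption); the first frame has no predecessor, so it stays 0.
    vel[1:] = (xy[1:] - xy[:-1]) / DT
    return xy, np.clip(vel, -V_CLAMP, V_CLAMP)


def normalise(xy_m, attack_sign, half_length_m=52.5, half_width_m=34.0):
    """Map centre-origin metres to [-1, 1] and point the possessing team at +x.

    attack_sign: (T,) array of +1 / -1 per frame (hypothetical input).
    """
    out = np.asarray(xy_m, dtype=float) / np.array([half_length_m, half_width_m])
    out[..., 0] *= np.asarray(attack_sign, dtype=float)[:, None]  # flip x only (assumption)
    return out
```

A player running at a steady 6 m/s comes out with `vx = 6.0` on every frame after the first, while a 50 m tracker jump between two retained frames (250 m/s raw) is clipped to 25 m/s.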

2. VAEP framework

VAEP (Valuing Actions by Estimating Probabilities) is Decroos et al.'s framework for assigning a continuous expected-goal value to every on-ball action. The headline idea: an action's value is the change in expected goal difference it produces, where "expected goal difference" is the probability the possessing team scores in the next \(K\) actions minus the probability they concede in the same window.

\[ V(a) = \big[P_{\mathrm{score}}(s_a) - P_{\mathrm{score}}(s_{a-1})\big] - \big[P_{\mathrm{concede}}(s_a) - P_{\mathrm{concede}}(s_{a-1})\big] \]

where \(s_a\) is the post-action game state, \(s_{a-1}\) the pre-action state, and \(P_{\mathrm{score}}(s)\), \(P_{\mathrm{concede}}(s)\) are estimated by gradient-boosted classifiers conditioned on the previous three actions (SPADL encoding). The standard look-ahead is \(K = 10\) actions.

Equation derivation & implementation note

If the previous action \(a-1\) was performed by the opposing team, \(P_{\mathrm{score}}(s_{a-1})\) and \(P_{\mathrm{concede}}(s_{a-1})\) must be swapped before taking the difference, so that the prior state is expressed from the perspective of the team performing \(a\):

# research/src/chemistry/vaep/model.py
same = (teams == prev_team)
prev_value_self_persp = np.where(same, prev_pS - prev_pC, prev_pC - prev_pS)
df["vaep_value"] = (pS - pC) - prev_value_self_persp
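
The excerpt above relies on arrays defined elsewhere in the repository. A self-contained sketch of the same perspective swap, written in terms of the net value \(P_{\mathrm{score}} - P_{\mathrm{concede}}\), might look like the following. The function name `vaep_values`, its argument names, and the choice to treat the first action's prior state as zero are assumptions for illustration.

```python
import numpy as np


def vaep_values(teams, p_score, p_concede):
    """One entry per action, in match order. Returns V(a) for every action."""
    teams = np.asarray(teams)
    net = np.asarray(p_score, dtype=float) - np.asarray(p_concede, dtype=float)

    prev_net = np.zeros_like(net)          # assumption: no prior state before the first action
    same_team = teams[1:] == teams[:-1]
    # If possession changed hands, the previous net value is negated so that it is
    # expressed from the perspective of the team performing the current action.
    prev_net[1:] = np.where(same_team, net[:-1], -net[:-1])
    return net - prev_net


values = vaep_values(
    teams=["A", "A", "B"],
    p_score=[0.02, 0.05, 0.01],
    p_concede=[0.01, 0.01, 0.04],
)
```

In this toy sequence the third action is by team B after two actions by team A, so A's prior net value of 0.04 enters as -0.04 and B's action is valued at \(-0.03 - (-0.04) = 0.01\).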

\(P_{\mathrm{score}}\) and \(P_{\mathrm{concede}}\) are trained independently via XGBoost on SPADL-encoded action sequences. Out-of-fold AUCs typically land in the 0.85–0.92 range for \(P_{\mathrm{score}}\) and 0.65–0.75 for \(P_{\mathrm{concede}}\).

Reference: Decroos T., Bransen L., Van Haaren J., Davis J. (2019). Actions Speak Louder than Goals: Valuing Player Actions in Soccer. KDD 2019. See §9 for citation.

3. JOI / JDI (event-based baseline)

Bransen & Van Haaren (MIT Sloan, 2020) layer pair chemistry on top of VAEP. Joint Offensive Impact (JOI) sums VAEP across consecutive-action pairs of teammates; Joint Defensive Impact (JDI) assigns credit to defending pairs when opponents under-perform their expected offensive impact.

3.1 Joint Offensive Impact (JOI)

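For an ordered pair of consecutive actions \((a_i, a_{i+1})\) by teammates \(p\) and \(q\), both VAEP values contribute to \(\mathrm{JOI}(p, q)\). Writing \(S_{p,q}\) for the set of all such consecutive-action pairs involving \(p\) and \(q\):

\[ \mathrm{JOI}(p, q) = \sum_{(a_i,\, a_{i+1}) \in S_{p,q}} \big[ V(a_i) + V(a_{i+1}) \big] \]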

Normalised per 90 shared minutes: \( \mathrm{JOI90}(p, q) = 90 \cdot \mathrm{JOI}(p,q) / m_{p,q} \).
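
A minimal sketch of the JOI sum and its per-90 normalisation, assuming actions arrive as `(team, player, vaep_value)` tuples in match order. The tuple layout, the function names, and the decision to key pairs without regard to which player acted first are assumptions for the example.

```python
from collections import defaultdict


def joi(actions):
    """actions: list of (team, player, vaep_value) in match order.

    Returns {(p, q): JOI} with each pair key sorted alphabetically.
    """
    totals = defaultdict(float)
    for (team1, p1, v1), (team2, p2, v2) in zip(actions, actions[1:]):
        # Only consecutive actions by two different teammates form a pair.
        if team1 == team2 and p1 != p2:
            totals[tuple(sorted((p1, p2)))] += v1 + v2
    return dict(totals)


def joi90(joi_value, shared_minutes):
    """Normalise a JOI total to 90 shared minutes."""
    return 90.0 * joi_value / shared_minutes


actions = [
    ("A", "p", 0.01), ("A", "q", 0.03),   # p -> q contributes 0.04
    ("B", "x", 0.02),                     # opponent action breaks the chain
    ("A", "q", 0.00), ("A", "p", 0.02),   # q -> p contributes 0.02
]
pair_totals = joi(actions)
```

Here `pair_totals[("p", "q")]` comes to 0.06, and over 45 shared minutes that is a JOI90 of 0.12. Note that an action in the middle of a chain (p, then q, then a third teammate) is counted once in each adjacent pair, which follows directly from the sum above.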

3.2 Joint Defensive Impact (JDI) — and what event data cannot see

JDI is the event-based baseline's hard half. Defenders' value lives mostly off-ball, but event data only records on-ball touches. The authors are explicit:

"The match event data only describes the actions that actually happened in the match but not the actions that players prevented from happening, for instance, by smart runs or clever positioning."
— Bransen & Van Haaren (2020), §3.2 (Joint Defensive Impact).

Their workaround: for every opposing player \(o\) the defending pair faced, compute the gap between \(o\)'s expected and actual offensive impact (OI) and split it across same-team pairs by a responsibility share \(R_m(p, q, o)\) derived from an inverse-distance heuristic on a 5×5 pitch grid.

\[ \mathrm{JDI}_m(p, q) \;=\; \sum_o \big(\mathbb{E}[\mathrm{OI}_m(o)] - \mathrm{OI}_m(o)\big) \cdot R_m(p, q, o) \cdot \frac{\mathrm{mins}_m(p, q, o)}{90} \]

OI, expected OI, responsibility share — full definitions

Per-match OI: \( \mathrm{OI}_m(o) = \sum_a V(a) \) over passes, crosses, dribbles, take-ons, and shots by player \(o\) in match \(m\).

Expected OI: \(o\)'s per-90 mean across prior matches in the dataset, Bayesian-shrunk toward a positional prior (GK / DEF / MID / FWD) when fewer than \(M_0 = 700\) minutes have been played.

Responsibility share: \( r_p(o) = (1/d(p, o)) / \sum_{k} (1/d(k, o)) \), where the sum runs over the defending team's players \(k\) and \(d\) is the Euclidean distance in the 5×5 grid after mirroring \(o\)'s cell into the defending coordinate frame. The pair share is the simple mean \(R_m(p, q, o) = \tfrac{1}{2}(r_p(o) + r_q(o))\).

Aggregated across matches and normalised per 90 shared minutes gives JDI90.
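
The definitions above can be sketched in a few lines of plain Python. The function names and the dict-based inputs are hypothetical. Expected OI is taken as a given input because the exact shrinkage formula is not spelled out here, and the sketch assumes every grid distance is strictly positive (how a same-cell distance of zero is handled is not specified).

```python
def responsibility_shares(dist_to_opponent):
    """dist_to_opponent: {defender: grid distance d(defender, o)}, all > 0.

    Returns inverse-distance shares that sum to 1 across the defenders.
    """
    inverse = {p: 1.0 / d for p, d in dist_to_opponent.items()}
    total = sum(inverse.values())
    return {p: w / total for p, w in inverse.items()}


def pair_share(shares, p, q):
    """Pair responsibility: simple mean of the two individual shares."""
    return 0.5 * (shares[p] + shares[q])


def jdi_match(expected_oi, actual_oi, share, shared_minutes):
    """Per-match JDI for one pair; every argument is a dict keyed by opponent."""
    return sum(
        (expected_oi[o] - actual_oi[o]) * share[o] * shared_minutes[o] / 90.0
        for o in expected_oi
    )


shares = responsibility_shares({"p": 1.0, "q": 2.0, "k": 2.0})
r_pq = pair_share(shares, "p", "q")                       # 0.5 * (0.5 + 0.25)
jdi = jdi_match({"o": 0.30}, {"o": 0.10}, {"o": r_pq}, {"o": 90})
```

With one opponent who under-performs expectation by 0.20 over a full 90 shared minutes, the pair is credited with \(0.20 \times 0.375 = 0.075\).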

The JDI quote above is the foil for §4–5: the tracking transformer does see those off-ball actions, frame by frame, for all 22 players.

4. Our tracking transformer

Architecture adapted from Sumer Sports's open-source SportsTrackingTransformer (originally an NFL model). Three changes for soccer: 23-token state instead of 22, two BCE heads (frame-VAEP) instead of an NFL-specific target, and an encode_with_attention path that is a first-class output rather than a debug hook.

4.1 Architecture

\[ \text{Input}\ (B, 23, 7) \;\to\; \mathrm{BatchNorm_{features}} \;\to\; \mathrm{Linear}\ (7 \to d) \;\to\; \big[\mathrm{TransformerEncoderLayer}(d, h)\big]^{L} \;\to\; \mathrm{Heads} \]

BatchNorm is applied over the feature axis, so each of the 7 features is normalised across the (batch × tokens) population — permutation-equivariant.
Linear embedding projects each token independently from 7-dim to the model dimension \(d\) (64 in shipped checkpoints).
Encoder layers use pre-norm ordering with GELU activations. Shipped: \(L = 2\) layers, \(h = 4\) heads, \(d = 64\), dropout 0.1.
No positional encoding on the player axis. Player identity is whatever the model can infer from \(x, y, v_x, v_y\) and the three flag features.
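
To make the permutation-equivariance claim concrete, here is a toy NumPy check using a single self-attention head with random weights. It is a sketch of the property only, not the shipped model: there is no BatchNorm, feed-forward block, multi-head split, or trained checkpoint, and all weight matrices and names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, D = 23, 7, 64   # tokens, features per token, model dimension

W_embed = rng.normal(size=(F, D)) / np.sqrt(F)
W_q, W_k, W_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(frame):
    """frame: (T, F). One head, residual connection, no positional encoding."""
    h = frame @ W_embed                       # each token embedded independently
    q, k, v = h @ W_q, h @ W_k, h @ W_v
    attn = softmax(q @ k.T / np.sqrt(D))      # (T, T), each row sums to 1
    return h + attn @ v, attn


frame = rng.normal(size=(T, F))
perm = rng.permutation(T)

out, attn = self_attention(frame)
out_p, attn_p = self_attention(frame[perm])

# Shuffling the input tokens shuffles the outputs in exactly the same way.
equivariant = np.allclose(out_p, out[perm])
```

Because nothing in the layer depends on a token's index, reordering the players only reorders the rows of the output, and the attention matrix is permuted on both axes.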

4.2 Heads & targets

Three heads have been trained on this backbone. The two relevant to chemistry are the frame-VAEP specialists; the xT-regression head is the original supervised target and is retained for evaluation.

Train / validation split. 36 PFF WC22 matches used for training (629,634 frames at 5 Hz); 8 held-out matches (138,793 frames) used for the val metrics reported below. The val set is match-disjoint from train — the model never sees any frame from those 8 matches during training. This is a single train/val split, not k-fold cross-validation; the headline numbers are best read as "held-out on 8 of 44 matches," not "out-of-fold across folds."

\(P_{\mathrm{score}}^{\mathrm{frame}}\) specialist: single sigmoid head, BCE loss with class weighting. Target = 1 iff the possessing team scores within the next 10 seconds. Val AUC 0.816, Brier 0.234 (from training_metrics_score_only.json).
\(P_{\mathrm{concede}}^{\mathrm{frame}}\) specialist: same architecture, target = 1 iff the possessing team concedes within 10 s. Val AUC 0.799, Brier 0.054 (from training_metrics_concede_only.json).
Shared two-head backbone (legacy, retained for comparison): score AUC 0.801, concede AUC 0.792 (training_metrics_frame_vaep.json).
xT-regression head: Huber regression on \(\max_{t' \in [t, t+K]} \mathrm{xT}(t')\), the Karun-Singh-grid xT value at the ball's future-window peak. Val Spearman 0.714 vs. an xT-lookup baseline at 0.616 (\(+0.098\) lift; training_metrics_xt.json).
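
The frame-level targets for the two specialists can be illustrated as follows. At 5 Hz a 10-second horizon is 50 frames. The function name, the input layout, the convention that the frame of the goal itself is not labelled, and the assumption that exactly one of the two teams is in possession on every frame are all choices made for this sketch rather than confirmed details of the training scripts.

```python
import numpy as np

HZ = 5
HORIZON_FRAMES = 10 * HZ   # 10 s look-ahead = 50 frames


def frame_targets(possession_team, goal_frames, goal_teams):
    """possession_team: (T,) team id in possession at each 5 Hz frame.
    goal_frames / goal_teams: frame index and scoring team of each goal.

    Returns (y_score, y_concede), each a (T,) array of 0/1 labels.
    """
    possession_team = np.asarray(possession_team)
    n = len(possession_team)
    y_score = np.zeros(n, dtype=bool)
    y_concede = np.zeros(n, dtype=bool)
    for g, team in zip(goal_frames, goal_teams):
        # Frames t with t < g <= t + HORIZON_FRAMES have the goal inside their window.
        window = slice(max(0, g - HORIZON_FRAMES), g)
        in_possession = possession_team[window] == team
        y_score[window] |= in_possession      # possessing team goes on to score
        y_concede[window] |= ~in_possession   # possessing team goes on to concede
    return y_score.astype(np.int8), y_concede.astype(np.int8)


possession = ["A"] * 60 + ["B"] * 40
y_score, y_concede = frame_targets(possession, goal_frames=[80], goal_teams=["A"])
```

In the toy sequence team A scores at frame 80. Frames 30 to 59 (A in possession) get a positive score label, and frames 60 to 79 (B in possession) get a positive concede label.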

The chemistry pipeline (§5) uses the two single-head specialists, because the shared backbone attends disproportionately to GKs and defenders (P(concede) carries tighter spatial signal than P(score) and dominates the gradient). The score specialist's top-10 off-off pairs overlap the shared model's top-10 by 7/10. Notable differences include Dembélé + Mbappé (#7 on the shared model) and Brazil's Vinícius + Raphinha (absent from the shared top-10 entirely).

4.3 Attention extraction

The transformer exposes encode_with_attention which returns the pooled encoding alongside the full per-layer per-head attention tensor:

\[ \mathrm{Attn} \in \mathbb{R}^{B \times L \times H \times T \times T}, \quad T = 23 \]

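For chemistry we use the ball token as the query: the row \(\mathrm{Attn}[\,b,\,:,\,:,\,\text{ball},\,0{:}P\,]\), with \(P = 22\) player slots (matching the slice used in extract_aw_joi.py), averaged across the \(L\) layers and \(H\) heads, gives a length-22 vector of non-negative weights over the player tokens. Its entries sum to at most 1 unless renormalised, because the ball token's attention to itself is excluded. It answers the question: "given the ball token's query, how much weight does the model place on each player when forming its prediction at frame \(t\)?".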

Python implementation (excerpt)

# research/scripts/extract_aw_joi.py
ball_attn = attn[:, :, :, BALL_TOKEN, :NUM_PLAYER_SLOTS]   # (B, L, H, 22)
ball_attn = ball_attn.mean(dim=(1, 2))                      # (B, 22)

5. From attention to AW-JOI / AW-JDI

Attention-Weighted JOI / JDI are the pair-level aggregates that feed the leaderboard, the Whiteboard, and the team chemistry density. The construction is the tracking-data analogue of Bransen's JOI / JDI: instead of summing VAEP across consecutive on-ball actions of two teammates, we sum frame-level prediction-deltas weighted by how much the model was jointly attending to both players at that frame.

5.1 Frame-level building blocks

Per frame \(t\), per same-team pair \((p, q)\):

Pair coupling. \( c(p, q, t) = \alpha_t(p) \cdot \alpha_t(q) \), where \(\alpha_t(\cdot)\) is the ball-as-query attention over players, mean-aggregated across layers and heads. The product (not sum) is what makes this a joint signal — both players have to be attended at the same instant for the pair to score.
Score delta. \( \Delta P_{\mathrm{score}}(t) = P_{\mathrm{score}}^{\mathrm{frame}}(t+1) - P_{\mathrm{score}}^{\mathrm{frame}}(t) \) — the forward finite difference of the score-specialist's prediction.
Concede delta. \( \Delta P_{\mathrm{concede}}(t) = P_{\mathrm{concede}}^{\mathrm{frame}}(t+1) - P_{\mathrm{concede}}^{\mathrm{frame}}(t) \) — same construction on the concede specialist.

5.2 Pair sums

AW-JOI sums over offence-side moves (positive score-delta) using the score-specialist's attention; AW-JDI sums over defence-side moves (positive concede-delta — i.e. concede risk grew, and the defending pair gets credit for what attention the model placed on them while it grew) using the concede-specialist's attention:

\[ \mathrm{AW\text{-}JOI}(p, q) \;=\; \sum_{t} c_{\mathrm{score}}(p, q, t) \cdot \max\!\big(\Delta P_{\mathrm{score}}(t),\ 0\big) \]
\[ \mathrm{AW\text{-}JDI}(p, q) \;=\; \sum_{t} c_{\mathrm{concede}}(p, q, t) \cdot \max\!\big(\Delta P_{\mathrm{concede}}(t),\ 0\big) \]

Both are normalised per 90 shared on-pitch minutes for the pair: \(\mathrm{AW\text{-}JOI90}(p, q) = 90 \cdot \mathrm{AW\text{-}JOI}(p, q) / m_{p, q}\), analogously for AW-JDI90.

Source — verified against research/scripts/extract_aw_joi.py

# research/scripts/extract_aw_joi.py (excerpt; aggregation kernel)
# Forward differences -- last frame's dv = 0
dv_score[:-1]   = p_score[1:]   - p_score[:-1]
dv_concede[:-1] = p_concede[1:] - p_concede[:-1]
w_joi = np.clip(dv_score,   0.0, None)
w_jdi = np.clip(dv_concede, 0.0, None)

# c[p,q,t] = a_p(t) * a_q(t); contributions = c * weight
a_i = ball_attn_nz[:, iu]                  # (M, K) over upper-tri pairs
a_j = ball_attn_nz[:, ju]
c   = a_i * a_j
contribs = c * weight_nz[:, None]          # (M, K)
# masked to same-team pairs, accumulated into pair_sums[(p,q)]
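
For readers who want to run the aggregation end to end, the following is a self-contained sketch of the same idea on toy inputs. It is not the repository code: the function name `aw_pair_sums`, the fixed slot-to-team mapping (substitutions are ignored), and returning zeros for mixed-team pairs are simplifications assumed for the example.

```python
import numpy as np


def aw_pair_sums(ball_attn, p_frame, team_of_slot):
    """ball_attn: (M, P) ball-query attention over player slots, per frame.
    p_frame: (M,) specialist prediction per frame.
    team_of_slot: (P,) team id of each player slot.

    Returns (iu, ju, sums) over upper-triangle slot pairs; mixed-team pairs get 0.
    """
    ball_attn = np.asarray(ball_attn, dtype=float)
    p_frame = np.asarray(p_frame, dtype=float)
    team_of_slot = np.asarray(team_of_slot)
    n_frames, n_slots = ball_attn.shape

    dv = np.zeros(n_frames)
    dv[:-1] = p_frame[1:] - p_frame[:-1]      # forward difference; last frame stays 0
    weight = np.clip(dv, 0.0, None)           # keep only frames where the prediction rose

    iu, ju = np.triu_indices(n_slots, k=1)    # every unordered slot pair once
    coupling = ball_attn[:, iu] * ball_attn[:, ju]        # c(p, q, t), shape (M, K)
    sums = (coupling * weight[:, None]).sum(axis=0)       # sum over frames

    same_team = team_of_slot[iu] == team_of_slot[ju]
    return iu, ju, np.where(same_team, sums, 0.0)


attn = [[0.50, 0.50, 0.00, 0.00],
        [0.20, 0.20, 0.30, 0.30],
        [0.25, 0.25, 0.25, 0.25]]
iu, ju, sums = aw_pair_sums(attn, p_frame=[0.1, 0.3, 0.2], team_of_slot=[0, 0, 1, 1])
```

Only the first frame has a positive delta (0.2), and only slots 0 and 1 are attended there, so the pair (0, 1) accumulates \(0.5 \times 0.5 \times 0.2 = 0.05\) and every other pair stays at zero. Dividing by shared minutes and multiplying by 90 would give the per-90 figure.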

Note on the JDI sign convention. Bransen's JDI rewards defenders when opponents under-perform expected offence. Our AW-JDI is the tracking-side analogue: positive \(\Delta P_{\mathrm{concede}}\) means concede risk grew over the next 0.2 s, so the attention paid to a defensive pair in that instant is credited to them as "they were in the picture while the team was being broken down." High AW-JDI90 thus flags defensive engagement, not defensive success per se — the leaderboard treats it as the "defensive co-watched-ness" score.

6. Team Chemistry Density (TCD)

TCD is the team-level scalar that summarises "how many of this squad's pairs are above the league baseline?". It is the metric most pages sort by. Definition:

6.1 Pool medians

We compute two pool-level reference values across all 31 squads (one team excluded for data completeness):

\(\tilde{m}_{\mathrm{off}}\) = median AW-JOI90 across all off–off pairs across all teams.
\(\tilde{m}_{\mathrm{def}}\) = median AW-JDI90 across all def–def pairs across all teams.

Empirical pool medians on the WC22 corpus: \(\tilde{m}_{\mathrm{off}} \approx 0.467\), \(\tilde{m}_{\mathrm{def}} \approx 0.322\).

6.2 Per-team counts

For each team:

\[ n_{\mathrm{off}} = \#\{(p,q) \in \mathrm{off}\text{-}\mathrm{off} : \mathrm{AW\text{-}JOI90}(p,q) > \tilde{m}_{\mathrm{off}}\} \]
\[ n_{\mathrm{def}} = \#\{(p,q) \in \mathrm{def}\text{-}\mathrm{def} : \mathrm{AW\text{-}JDI90}(p,q) > \tilde{m}_{\mathrm{def}}\} \]
\[ n_{\mathrm{cross}} = \#\{(p,q) \in \mathrm{cross} : \Delta_{p,q} > 0\} - \#\{(p,q) \in \mathrm{cross} : \Delta_{p,q} < 0\} \]

where \(\Delta_{p,q} = \mathrm{AW\text{-}JOI90}(p,q) - \mathrm{AW\text{-}JDI90}(p,q)\) — the "off-vs-def net" of a cross-line pair (an attacker–defender pair contributes positively when its offensive co-watched-ness exceeds its defensive co-watched-ness).

\[ \mathrm{TCD} \;=\; n_{\mathrm{off}} + n_{\mathrm{def}} + n_{\mathrm{cross}} \]
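
The counting rule can be sketched as a short function. The input layout (one `(bucket, aw_joi90, aw_jdi90)` tuple per within-team pair) and the function name are hypothetical, and the numbers in the toy example are invented purely to exercise each branch.

```python
def tcd(pairs, m_off, m_def):
    """pairs: iterable of (bucket, aw_joi90, aw_jdi90), bucket in {"off", "def", "cross"}.

    Returns (n_off, n_def, n_cross, TCD).
    """
    pairs = list(pairs)
    n_off = sum(1 for bucket, joi, _ in pairs if bucket == "off" and joi > m_off)
    n_def = sum(1 for bucket, _, jdi in pairs if bucket == "def" and jdi > m_def)
    # Net count: +1 if the pair is offence-leaning, -1 if defence-leaning, 0 on a tie.
    n_cross = sum(
        (joi > jdi) - (joi < jdi) for bucket, joi, jdi in pairs if bucket == "cross"
    )
    return n_off, n_def, n_cross, n_off + n_def + n_cross


toy_pairs = [
    ("off", 0.50, 0.10), ("off", 0.40, 0.10),       # one above the off median
    ("def", 0.10, 0.40), ("def", 0.10, 0.30),       # one above the def median
    ("cross", 0.60, 0.20), ("cross", 0.10, 0.30), ("cross", 0.50, 0.10),  # net +1
]
result = tcd(toy_pairs, m_off=0.467, m_def=0.322)
```

Because \(n_{\mathrm{cross}}\) is a net count it can be negative for a team whose cross-line pairs are mostly defence-leaning, so TCD is not strictly a count of pairs.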

6.3 Worked example — Argentina (winner, TCD rank 7)

Values taken directly from team_chemistry_vs_paper.json:

Component	Value
\(n_{\mathrm{off}}\) (off–off pairs above \(\tilde{m}_{\mathrm{off}}\))	19
\(n_{\mathrm{def}}\) (def–def pairs above \(\tilde{m}_{\mathrm{def}}\))	12
\(n_{\mathrm{cross}}\) (cross-line net, off-positive − def-positive)	67
TCD	98
TCD rank (of 31 teams)	7

Sanity check: Argentina has 168 within-team pairs (squad 26, three role buckets), so 19 + 12 + 67 = 98 strong-or-net pairs is roughly 58% of the within-team pair pool. The Brazil row at TCD 121 (rank 2) is denser still; Qatar at TCD 54 (rank 27) is the floor among full data rows. Each tab's "strong pair" counts on the study derive from these same three numbers.

7. Reconciliation with earlier drafts

Earlier drafts reported a chemistry-vs-finish Spearman ρ ≈ 0.78 using a pre-TCD strong-pair count with a hard 0.4 threshold. The unified TCD metric (pool-median threshold per role + cross-net) gives ρ = 0.704 on the same 31 teams (p < 0.001). The new definition is the one shipped on every public page; the older 0.78 number is retained only in the headline archive for traceability.

8. Limitations

Tournament-only baseline. All "expected" values — the Bayesian-shrunk \(\mathbb{E}[\mathrm{OI}]\) used inside JDI, and the per-player Δ OI/90 shown on the Overview — are computed from this 64-match corpus alone. Most players have 2–4 matches here, so the per-player prior collapses to a positional average. Our "form" delta is therefore a tournament-relative gauge, not a club-vs-national comparison.
Provider calibration. WC22 chemistry numbers come from our PFF event + tracking pipeline; cross-context data is StatsBomb event data passed through the same VAEP model. Action-coverage and tagging conventions differ between providers, so treat the delta as directional, not absolute. The PFF WC22 vs StatsBomb WC22 row pair is the calibration check — if those agree for a player, we trust the other rows too.
Small sample. 64 matches across 32 teams means many pairs share well under 90 minutes. We apply a minimum-minutes filter on Top Pairs tables to mitigate, but JDI90 in particular is noisy at low sample. AW-JOI90 / AW-JDI90 inherit the same gating.
Defensive responsibility is a heuristic. Bransen approximates defensive involvement from on-ball recovery / tackle / interception actions in each zone. Off-ball positioning is invisible to event data — the tracking transformer in §4–5 is the structured complement.
Attention ≠ causation. A high AW-JOI90 means the frame-VAEP model attended to both players at once in frames where its scoring prediction rose. That is strong evidence of a coordinated tactical pattern, but it is not a counterfactual claim that swapping one player would change the outcome by a specific number of expected goals. The Whiteboard is the place for explicit counterfactuals.
VAEP credit-assignment is action-level. A pass that "creates space" but doesn't directly trigger threat is undervalued by VAEP, and JOI/JDI inherit this. Attention chemistry partially escapes this — the model can credit off-ball movement — but the underlying BCE targets are still derived from event outcomes.
National-team novelty effect. The very thing we want to study — teams whose players don't play together often — is also the smallest-sample regime. Treat absolute numbers as suggestive, comparisons within a team as more robust.
External corroboration of the Morocco finding. One mitigation against the "model is fooling itself" worry is convergence with independent analyses. Benhida et al. (2025, Applied Sciences; study summary) run PCA + K-means on FIFA post-match KPIs and identify the same fast-transition attacking axis our score-specialist surfaces in the off-off pair leaderboard (Ziyech / Ounahi / En-Nesyri / Hamdallah / Aboukhlal / Sabiri). Different data, different model class, same conclusion — evidence that the score-frame attention pattern is picking up a real tactical signature rather than a tracking-data artifact.

9. References

Decroos, T., Bransen, L., Van Haaren, J., & Davis, J. (2019). Actions Speak Louder than Goals: Valuing Player Actions in Soccer. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '19), pp. 1851–1861. DOI: 10.1145/3292500.3330758.
Bransen, L., & Van Haaren, J. (2020). Player Chemistry: Striving for a Perfectly Balanced Soccer Team. MIT Sloan Sports Analytics Conference 2020. arXiv: 2003.01712.
Sumer Sports. SportsTrackingTransformer (open-source implementation of a permutation-equivariant transformer over tracking data, originally NFL). GitHub: github.com/SumerSports/SportsTrackingTransformer.
Singh, K. (2019). Introducing Expected Threat (xT). Blog post: karun.in/blog/expected-threat.html.
Benhida, M., et al. (2025). Tactical analysis of Morocco's 2022 World Cup performance via PCA and K-means clustering of FIFA match KPIs. Applied Sciences. DOI: 10.3390/app15189994.

Citation note: a peer-reviewed Jordet/Aksum-style paper on visual scanning in soccer was on the candidate list for this section but is omitted because the specific DOI/URL could not be verified at write time; we would rather drop than misattribute. The Sumer Sports reference cites the open-source repository directly because no associated peer-reviewed paper is known to exist.

10. How to reproduce

All intermediate parquet files are exported as CSV on the Downloads page, along with the trained checkpoints. Python 3.12+; use uv (not pip) for everything.

Bash recipe (event pipeline + transformer + AW-JOI/JDI)

# 1. Env setup
uv sync --extra dev

# 2. Tests (skips real-data tests if data dirs are empty)
uv run pytest -q

# 3. Wiring smoke test on synthetic data
uv run python -m wc2026_tracking_transformer.train fit \
  --config configs/local_cpu.yaml

# 4. Train the frame-VAEP specialists
uv run python scripts/train_score_only.py   --pff-n 44 --epochs 6
uv run python scripts/train_concede_only.py --pff-n 44 --epochs 6

# 5. Extract per-pair AW-JOI / AW-JDI from both specialists
uv run python research/scripts/extract_aw_joi.py --combine-after

# 6. Render a clip
uv run python scripts/render_pff_gif.py --match 10502
