What is chemistry, and can we measure it?

Messi finally broke through. Mbappé scored a hat-trick in the final and still lost. Morocco went in ranked eleventh on FIFA's talent index and still knocked out Spain and Portugal on the way to a semifinal. Croatia ground two knockout games to extra time and walked off with bronze. Four deep runs, four very different rosters, one shared trait: every one of them played like a unit that already knew the answer. Our hypothesis for this study: chemistry is real, measurable, and predicts who survives a tournament. Sometimes it matters more than raw talent does.

Chemistry as a metaphor

Pundits borrow the word from chemistry for a reason: it implies bonds between specific elements, not a property any one of them carries alone. Our job on this study is to turn that metaphor into a number.

What chemistry looks like on a pitch

You already know what it looks like. The through-ball that meets a runner's stride without breaking it. The wall pass into space the defender thought he'd covered. The press trap where the foot-on-the-ball bait pulls a striker forward by a single yard. The decoy that pins a centre-back so the lane behind him opens for someone else. The question this study asks is the harder one: how would you measure them?

The event-based baseline: JOI and JDI in plain English

Bransen & Van Haaren (MIT Sloan, 2020) propose two pair-level metrics on top of VAEP (Valuing Actions by Estimating Probabilities). VAEP turns every on-ball action (a pass, cross, dribble, take-on, or shot) into an expected-goal-difference delta. JOI and JDI pool that VAEP currency by pair instead of by player.

JOI: Joint Offensive Impact

When two teammates act in consecutive on-ball events (P passes and Q shoots, or P carries and Q lays it off), both VAEP values land in the pair's JOI bucket. JOI asks: when these two play together, does the ball flow productively through them? Through-balls into a shooter, wall passes, carry-and-finish combinations all show up here. Normalised per 90 shared minutes (JOI90).
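The pooling described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `(team, player, vaep)` tuple layout, the `joi90` function name, and the toy values are all our own assumptions, and we assume each action already carries a VAEP value.

```python
from collections import defaultdict

def joi90(actions, shared_minutes):
    """Pool VAEP values of consecutive teammate actions by player pair.

    actions: list of (team, player, vaep) in match order (hypothetical layout).
    shared_minutes: dict mapping frozenset({p, q}) -> minutes both were on the pitch.
    """
    totals = defaultdict(float)
    for (team_a, p, v_a), (team_b, q, v_b) in zip(actions, actions[1:]):
        # Only consecutive actions by two different players of the same team count.
        if team_a == team_b and p != q:
            totals[frozenset((p, q))] += v_a + v_b
    # Normalise to 90 shared minutes; pairs without shared minutes are dropped.
    return {
        pair: 90.0 * total / shared_minutes[pair]
        for pair, total in totals.items()
        if shared_minutes.get(pair, 0) > 0
    }

# Toy action stream (made-up VAEP values).
actions = [
    ("ARG", "P", 0.02),  # P passes
    ("ARG", "Q", 0.10),  # Q shoots: pair (P, Q) gets 0.12
    ("FRA", "X", 0.01),  # opponent action breaks the chain
    ("ARG", "Q", 0.03),
    ("ARG", "P", 0.05),  # pair (P, Q) gets another 0.08
]
minutes = {frozenset(("P", "Q")): 45.0}
result = joi90(actions, minutes)
```

With 0.20 of pooled VAEP over 45 shared minutes, the toy pair lands at a JOI90 of 0.4.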

JDI: Joint Defensive Impact

Events can't see who marked whom, so JDI back-computes. When the opponents a pair shares responsibility for under-perform their expected offensive impact, the pair gets credit, weighted by a 5×5 pitch-grid heuristic for proximity. Per 90 shared minutes (JDI90). It's a clever workaround for the fact that event data never sees prevention directly.

Where event-based chemistry goes blind

"The match event data only describes the actions that actually happened in the match but not the actions that players prevented from happening, for instance, by smart runs or clever positioning." (Bransen & Van Haaren, JDI section)

Event-based JDI is explicit about its own blind spot. JDI is forced to back-compute defensive credit from opponents under-performing, because event data can't see prevention. JOI has the exact same blind spot on offence, and event-based JOI doesn't say it out loud: JOI only fires on consecutive on-ball touches. The decoy run that opened the lane never touches the ball. The pin that froze the centre-back never touches the ball. The 40-metre off-ball width-hold that bent the defensive shape never touches the ball. None of it enters JOI. The single most important thing two teammates do together is coordinate their off-ball geometry, and that is exactly what stays invisible to the framework meant to measure their chemistry.

What we add: frame data + transformer attention

We train a transformer to predict P(score) and P(concede) over the next ten seconds, feeding it PFF World Cup 2022 tracking data for all 22 players plus the ball, sampled at 5 Hz (every 200 ms). The architecture is adapted from SumerSports's SportsTrackingTransformer (originally NFL). As a byproduct of that prediction, each layer's attention heads light up which pairs of players the model needed to look at together to make its call. On-ball moments and off-ball ones alike. A decoy run, a pin, a defender stepping up to hold the line: when one of them matters for the next ten seconds of threat, attention rises on that pair. Event data has to infer prevention after the fact. We watch it happen.

What the model sees that event-based chemistry can't

	Original paper (event-based)	Our model (frame + attention)
Features (what the model sees)	Discrete on-ball events (pass / cross / dribble / shot) plus a few context features (location, time, score state).	(x, y) + velocity for all 22 players + the ball, every 200 ms (5 Hz), plus three per-token flags (attacking side, GK, possession). No event labels, no outfield role one-hots, no hand-crafted features.
Sees off-ball runs?	No. Only the player who touched the ball generates a row.	Yes, on every player in every frame.
Who matters in this moment?	Hand-specified (the consecutive-touch pair).	Learned (the transformer's attention weights, end-to-end from outcomes).

What P(score) is based on. Every 200 ms the model sees the geometry of all 22 players plus the ball and predicts: how often does a configuration this dangerous lead to a goal in the next ten seconds? That's it. Pure geometry plus motion plus learned outcomes: no event labels, no human-coded notion of "the play."
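To make the training target concrete: at 5 Hz, a ten-second look-ahead is a 50-frame window. The sketch below labels frames that way. It is an illustration under our own assumptions: the function name is hypothetical, the frame containing the goal itself is left unlabelled, and stoppages are ignored. One call per perspective would give the P(score) and P(concede) labels.

```python
FPS = 5         # tracking sample rate stated in the text (5 Hz)
HORIZON_S = 10  # look-ahead window stated in the text (ten seconds)

def goal_labels(n_frames, goal_frames, fps=FPS, horizon_s=HORIZON_S):
    """Return a 0/1 label per frame: 1 if a goal falls within the next horizon."""
    horizon = fps * horizon_s  # 50 frames at 5 Hz
    labels = [0] * n_frames
    for g in goal_frames:
        # Every frame in [g - horizon, g) has the goal inside its look-ahead window.
        for i in range(max(0, g - horizon), min(g, n_frames)):
            labels[i] = 1
    return labels

# Toy clip: 200 frames (40 seconds) with one goal at frame 120.
labels = goal_labels(n_frames=200, goal_frames=[120])
```

In the toy clip, frames 70 through 119 are positives and everything else is negative, which also shows how rare positive frames are in a full match.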

What attention is. A set of learned weights, recomputed every frame, that say how much each player's token draws on every other token. Nobody tells it who matters; it figures that out from the data. Sometimes it stares at the ball-carrier. More often it stares at the off-ball runners and defenders who are shaping what's possible, whether that is the late-arriving eight, the centre-back stepping up, or the wide forward pinning a full-back.

How it works out who matters on its own. Gradient descent. Picture training a kid to predict when a fight is about to break out in a crowded bar. At first they look at everyone equally. They guess. They're usually wrong. They watch what actually happens. Each time they're wrong, they adjust which faces they pay attention to next time. After a few hundred bars, clenched fists and raised voices light up; people on their phones fade out. Nobody handed them a rulebook. The outcomes did. Our transformer learns the same way across millions of soccer frames: predict P(score), look at the next ten seconds, adjust which players to attend to, repeat. After enough training, on a dangerous configuration it has learned to focus on the striker making the diagonal run and the centre-back stepping up, rather than on the fullback standing still on the far side. We never told it the words striker, fullback, or diagonal run. The right weighting emerged because it was the only way to get the probability right.

How attention becomes chemistry. AW-JOI stands for Attention-Weighted Joint Offensive Impact. The "AW" is the part event-based JOI can't compute. Across every frame, how often is the model attending to both of these two teammates at the same time, weighted by how much that frame moves the goal probability? If the model keeps co-attending to Messi and Mac Allister during dangerous configurations, even when only one of them is on the ball, those two are coupled in its eyes. Sum that across the tournament and you have chemistry by our measure.
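A rough sketch of that aggregation, under loudly stated assumptions: we symmetrise the attention between the two players, weight each frame by the absolute change in P(score) from the previous frame, and sum. The frame dictionary layout and the `aw_joi` name are hypothetical, and the ball-distance correction and per-90 normalisation used elsewhere on this site are left out.

```python
def aw_joi(frames, i, j):
    """Sum pair attention between tokens i and j, weighted per frame.

    frames: list of dicts with "attn" (square list of lists, rows are query
            tokens) and "p_score" (float). This layout is hypothetical.
    """
    total = 0.0
    for prev, cur in zip(frames, frames[1:]):
        # Assumption: average the two directions of attention for the pair.
        pair_attn = 0.5 * (cur["attn"][i][j] + cur["attn"][j][i])
        # Assumption: a frame's weight is how far it moves P(score).
        weight = abs(cur["p_score"] - prev["p_score"])
        total += pair_attn * weight
    return total

# Three toy frames over three tokens (made-up numbers; each row sums to 1).
frames = [
    {"p_score": 0.02, "attn": [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]]},
    {"p_score": 0.10, "attn": [[0.2, 0.7, 0.1], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]]},
    {"p_score": 0.12, "attn": [[0.3, 0.3, 0.4], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]},
]
score_01 = aw_joi(frames, 0, 1)
score_02 = aw_joi(frames, 0, 2)
```

Tokens 0 and 1 attend to each other heavily in the frame where P(score) jumps, so their pair score comes out well above the 0 and 2 pair.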

Event-based JOI/JDI can't even put off-ball runners in their input. Ours can, and its attention layer tells us which off-ball players the model thinks matter right now. That's the difference.

The contrast, made concrete

Two buckets, four mechanisms. Left: things JOI can see (with caveats). Right: the things only the tracking attention can see, and the heart of this thesis.

Event-based JOI sees these · on-ball flow

The Third-Man Triangle

A → B, B's first-time pass releases C. Defenders track A and B's body; C moves into space behind the eye-line. Xavi → Iniesta → Messi · Pep's half-spaces.

Event data: ⚠ partial. JOI catches A-B and B-C, but it can't tell whether B knew C was arriving before the pass. Attention shows the anticipation.

The Press Trap

Foot on the ball, dare the striker forward by a yard, slip past him. De Zerbi's Brighton, the "bait."

Event data: ⚠ partial. The pass after the bait is captured; the bait itself (the deliberate delay) is an off-ball intention. Attention picks up the coordinated wait.

What is a transformer?

It is the model family behind tools like ChatGPT, pointed here at bodies on a pitch instead of words in a sentence. The one idea that makes it a transformer is the attention we just described: every token, which for us is every player and the ball, looks at every other token, decides how much each one matters to it right now, and updates its own picture of the moment accordingly. A language model uses this to let a pronoun reach back and find the noun it belongs to. Ours uses it to let a striker weigh the centre-back stepping up against the runner arriving late. Stack a few of these attention layers and the model builds a richer and richer read of the whole configuration before it commits to a number.

One frame through the model: tokens in, attention between players, two probabilities out.

In our instance, every frame is a 23-token sequence: 22 players + the ball. Each token carries seven features (position, velocity, attacking side, GK flag, possession flag). Player ordering inside the block carries no meaning. The transformer is permutation-equivariant, which matters because we never have to hand-engineer "who is the left-back" or "who is the holding mid". The model figures out roles from the geometry. Stacked transformer encoder layers with pre-norm and GELU activations feed two sigmoid heads (P-score, P-concede), with the attention matrix exposed as a first-class output instead of a debug hook. The full architecture and training recipe live on Methodology.
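The permutation-equivariance claim can be checked on a toy case. The sketch below is not our architecture: it is a single attention head with identity projections (no learned weights, no pre-norm, no GELU), run on four made-up tokens rather than 23. It only demonstrates that shuffling the token order shuffles the outputs in the same way and changes nothing else.

```python
import math

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = len(tokens[0])
    out, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)                       # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        row = [e / z for e in exps]           # attention weights for this query token
        weights.append(row)
        out.append([sum(w * v[c] for w, v in zip(row, tokens)) for c in range(d)])
    return out, weights

# Seven features per token: x, y, vx, vy, attacking side, GK flag, possession flag.
# Values are invented and assumed to be normalised.
tokens = [
    [0.10, 0.50, 0.0, 0.0, 1.0, 1.0, 0.0],   # goalkeeper
    [0.45, 0.30, 0.3, 0.1, 1.0, 0.0, 1.0],   # ball-carrier
    [0.60, 0.70, 0.5, -0.2, 1.0, 0.0, 0.0],  # off-ball runner
    [0.55, 0.65, 0.1, 0.0, 0.0, 0.0, 0.0],   # defender
]
perm = [2, 0, 3, 1]
out, attn = self_attention(tokens)
out_perm, _ = self_attention([tokens[p] for p in perm])

# Row k of the permuted output should equal row perm[k] of the original output.
equivariant = all(
    math.isclose(a, b, abs_tol=1e-12)
    for p, row in zip(perm, out_perm)
    for a, b in zip(out[p], row)
)
```

Because no positional encoding is added, nothing in the computation depends on where a player sits in the sequence, which is why role labels never need to be hand-assigned.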

Our model beats the event-based VAEP baseline

Both pipelines predict the same two probabilities, P(score) and P(concede), just from different inputs. The event-based VAEP classifier sees only the on-ball action stream; ours sees every frame of tracking. Numbers below are held-out: our transformer was trained on 36 PFF WC22 matches and the AUC is measured on the 8 matches it never saw during training. Same task framing:

What's AUC?

Take a random frame that did lead to a goal in the next ten seconds and a random frame that didn't. AUC is the fraction of the time the model assigns a higher P(score) to the goal-bound one. 0.5 = a coin flip (useless); 1.0 = perfect; 0.80 = right 80% of the time. Threshold-free, so you don't have to pick a cutoff to evaluate it.
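The pairwise definition above translates directly into code. This is a brute-force sketch for intuition, with a hypothetical function name and toy numbers; ties are counted as half a win, which is the usual convention. Production code would use a library routine instead.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

# Toy frames: predicted P(score) and whether a goal followed within ten seconds.
scores = [0.9, 0.4, 0.35, 0.8, 0.1]
labels = [1, 0, 1, 0, 0]
auc = pairwise_auc(scores, labels)
```

Here four of the six positive-negative pairs are ranked correctly, so the toy AUC is about 0.67.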

Metric	Event-based VAEP (out-of-fold)	Ours (frame transformer, held-out)
P(score) AUC	0.681	0.801
P(concede) AUC	0.671	0.792

Scope of comparison: the event-based classifier is action-level and predicts "scores within next 10 actions"; ours is frame-level and predicts "scores within next 10 seconds". The outcome being predicted is the same (a goal for or against in the near future), but the horizon unit and the input modality differ, so the two columns are not a strictly like-for-like benchmark. The different input modality is the point: our reading is that the lift comes from seeing the 21 players the event stream is blind to. Single-head specialists push score AUC to 0.816 and concede AUC to 0.799 when trained separately on the same backbone.

Two pair-level metrics + a team aggregate

Every subsequent page is built from two pair-level scores derived from the attention matrix, plus a team-level aggregate rolled up from them:

AW-JOI (attention-weighted Joint Offensive Impact, per pair) is JOI's continuous-time analogue. Sums ball-distance-corrected attention on a same-team pair during attacking frames; per 90 shared minutes.
AW-JDI (attention-weighted Joint Defensive Impact, per pair) is the prevention-aware analogue. Same recipe during defensive frames; captures the pins, rest-defence anchors, and swarms that JDI's responsibility-grid heuristic can only approximate.
Team Chemistry Density (TCD) is the team aggregate. Count of strong off-off pairs (above pool-median AW-JOI/90) + strong def-def pairs (above pool-median AW-JDI/90) + net cross-team strong pairs. One number per squad.

How TCD adds up

Three counts, one squad-level number. Each piece captures a different slice of how the model wires the team together.

Off-off strong pairs

Count of same-team attacker–attacker pairs whose AW-JOI/90 is above the tournament pool median. The on-ball creation network.

Def-def strong pairs

Count of same-team defender–defender pairs whose AW-JDI/90 is above the pool median. Off-ball pinning, cover, and rest-defence anchors event data can't see.

Cross-team net

(# cross-role pairs where AW-JOI − AW-JDI > 0) − (# where AW-JOI − AW-JDI < 0). Despite the "cross-team" label, these are pairs inside one squad that span role groups: the midfield-to-attack and defence-to-midfield connections that tilt offensive when they work, defensive when they don't.

TCD = n_off-strong + n_def-strong + n_cross-net

A worked example, France (TCD rank 1): 43 off-off strong pairs + 15 def-def strong pairs + a cross-team net of +81 = 139. One number summarising how "wired together" the squad is, capturing both on-ball and off-ball pair influence at once.
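The three counts can be sketched as follows. The helper names and the toy pool are hypothetical; only the France figures (43, 15, +81) come from the text above.

```python
from statistics import median

def count_strong(pair_scores, pool_median):
    """Number of pairs strictly above the tournament pool median."""
    return sum(1 for s in pair_scores if s > pool_median)

def cross_net(cross_pairs):
    """cross_pairs: list of (aw_joi, aw_jdi) per cross-role pair."""
    pos = sum(1 for joi, jdi in cross_pairs if joi - jdi > 0)
    neg = sum(1 for joi, jdi in cross_pairs if joi - jdi < 0)
    return pos - neg  # pairs with joi == jdi count on neither side

def tcd(n_off_strong, n_def_strong, n_cross_net):
    return n_off_strong + n_def_strong + n_cross_net

# France, using the counts quoted in the worked example.
france = tcd(43, 15, 81)

# Toy squad (made-up numbers) to exercise the helpers.
pool_median = median([0.1, 0.2, 0.3, 0.4, 0.5])           # 0.3
toy_off = count_strong([0.35, 0.5, 0.2], pool_median)      # two pairs above 0.3
toy_cross = cross_net([(0.4, 0.1), (0.2, 0.3), (0.3, 0.3)])
```

Note that the cross-role term can be negative, so a squad whose cross-role pairs tilt defensive loses TCD even if its off-off and def-def counts are healthy.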

TCD's predictive payoff against the World Cup itself: Spearman ρ = +0.704 with WC22 final finish across 31 teams (p < 0.001). Against a club-overlap History Index the same metric runs ρ = +0.348, which is directional and suggestive (p ≈ 0.055) but not significant on a 31-row sample. For context: FIFA-23 Overall vs WC22 finish runs ρ = +0.548, so on this tournament chemistry edges raw talent at the team level. Semifinalist TCD ranks: France 1, Croatia 3, Morocco 4, Argentina 7. All four deepest sides in the top seven.
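As a sanity check on those significance statements, the usual t-approximation for a rank correlation can be applied to the quoted values. This is a back-of-envelope sketch: we do not claim it is how the p-values above were computed, and the critical value is a standard table figure for 29 degrees of freedom.

```python
import math

def spearman_t(rho, n):
    """t-statistic approximation for a Spearman correlation with n observations."""
    return rho * math.sqrt((n - 2) / (1.0 - rho * rho))

# Two-sided 5% critical value of Student's t with 29 degrees of freedom (table value).
T_CRIT_DF29 = 2.045

t_tcd = spearman_t(0.704, 31)      # TCD vs WC22 finish
t_history = spearman_t(0.348, 31)  # club-overlap History Index vs WC22 finish
```

The TCD correlation gives a t of roughly 5.3, far past the threshold, while the History Index lands at about 2.0, just short of 2.045, which matches the "suggestive but not significant" reading.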

Assumptions and limitations

Small tournament sample. 64 matches, 31–32 teams. Spearman ρ on a sample of 31 is volatile; treat ρ = 0.70 as suggestive of a real signal, not as a deployment-ready effect size.
National-team novelty. The teams whose players rehearse together least are also the ones we most want to compare. The signal that survives despite that, in France, Croatia, and Morocco, is the interesting part.
Attention ≠ causation. High AW-JOI/AW-JDI says the model needed to look at both players to make its prediction. That's strong evidence of a coordinated tactical pattern; it isn't a counterfactual claim about goal difference. For counterfactuals, use the Digital Whiteboard.
Ball-token dominance is real. Naive per-pair attention is dominated by GKs and CBs near the ball during build-up. We subtract a ball-distance baseline before aggregating; see Methodology §2–3.
Frame-VAEP labels still come from outcomes. P(score) / P(concede) labels are derived from what actually happened in the next ten seconds. The attention map gets to see off-ball setup, but the training signal is still event-anchored.
One tournament, one provider. All numbers here are PFF WC22. Cross-provider calibration is an open task.