Working Ideas

We Borrowed From the Wrong Sport

Andrew Marritt — Wed, 24 Jun 2026 04:31:25 GMT

The deck that never left the room

In March 2013 I chaired HR Tech Europe’s “Spring Warm-Up - Big Data in HR” in London - a conference held at the high-water mark of the early optimism, when the field was convinced the numbers were finally coming to HR. The standout session came from a Google people analyst, Caitlin Hogan, who told us that every people decision at Google was made on data, and walked us through how the company had used it to test whether managers even mattered - the study that became known as Project Oxygen, which ended by naming the eight behaviours that separated Google’s best managers from its worst. It was genuinely impressive work. It was also, of all the sessions that day, the one whose deck never quite came to rest in front of anyone outside Google: kept on its own machine, watched over, no copy left behind.

I have thought about that afternoon often, because the findings became famous - those eight behaviours have been retold a hundred times since - while the thing I actually wanted, the data and the methods underneath them, stayed in the room. That gap is the point. The polished story travels; the analytical machinery that produced it does not. It is as true in football as in HR, and worth holding onto before we admire the published material too much.

I have followed Liverpool for most of my life - some of my earliest memories are of watching 1980s European nights on the Kop (and not telling my mother the language I heard there). For the past fifteen years I have followed something else alongside the football: the people doing the analytics behind it. I read Ian Graham’s account of building the club’s research department. I notice the recruitment-model pieces in the Financial Times. And when Google DeepMind’s work on football turned up in my machine-learning reading feed, my first instinct - correct, as it turned out - was that they were generating and testing alternatives, not merely describing what had happened.

I have no interest in the version of this that fills the commentary: the expected-goals figure flashed at half-time, the pundit’s “the data says”, the heat map deployed to relitigate a result. That is the pop-analytics layer and it is not where the serious work is. The serious work is in university labs, at the business-academic conferences, and inside a few clubs that, like that Google team, do not publish. That is the work I want to hold against my own field. Because for most of its short life, People Analytics chose a different sport to learn from - and I think it chose wrong.

The Moneyball decade

When People Analytics found its feet around 2010, Moneyball was the founding story it told about itself. Michael Lewis’s book - and the Billy Beane legend behind it - handed the field a ready-made narrative: ignore the scouts’ gut feel, find the undervalued attributes the market has mispriced, win with evidence. Almost every keynote reached for it. Every vendor deck had the slide.

It was not a foolish borrowing. Recruitment really does suffer from mispriced signals; we really do over-pay for the wrong proxies - the prestigious degree, the brand-name former employer - much as baseball once over-paid for runs batted in. The Moneyball frame gave a young discipline licence to challenge instinct with numbers, which was the fight it needed to pick.

The timing was specific. Lewis’s book came in 2003, but it was the 2011 film that put the idea in the HR air, and the field embraced it for the best part of a decade after: Billy Beane became a fixture on the workforce and recruiting conference circuit from around 2012, and through the mid-2010s the professional bodies and the analytics vendors ran a steady stream of “Moneyball for HR” explainers (SHRM; Visier; Fast Company). For a long stretch, “Moneyball for talent” was the single most reached-for analogy in the discipline.

But the analogy always seemed slightly wrong to me, and it took the football work to show me why.

Why baseball was the wrong sport

Here is the thing about baseball that makes it the analyst’s favourite - and it is the very thing that makes it useless as a model for the workplace. Baseball is the one team sport that is not really a team-production problem. It decomposes, almost cleanly, into a sequence of independent contests: pitcher against batter, one discrete, well-defined event after another, repeated thousands of times under near-identical conditions. A batter’s contribution can be measured largely in isolation from his team-mates because, for that moment, it more or less is isolated.

That is why sabermetrics worked so spectacularly. As one survey of the field puts it, a baseball game is “an aggregation of hundreds of repeated, discrete events”, and that discreteness is the main reason the sport is so amenable to measurement. Large samples, stable conditions, separable individual value - the statistician’s dream.

Football is none of those things. It is continuous, not discrete. The thing you care about - a goal - is rare, and it is scored by one player at the end of a move that many built. A defender’s position three passes earlier, a run that drags a marker out of the way, a pass declined: these shape the outcome and none of them appears in a tally of goals. Value is smeared across players and across time. Football is irreducibly a problem of interdependence.

And that is exactly what knowledge work is.

Team sports sit on a spectrum. Baseball is at the separable, discrete end, where an individual’s value can be lifted out and counted; football and knowledge work are at the interdependent, continuous end, where it cannot. People Analytics went shopping at the wrong end.

When we borrowed from baseball, we borrowed from the one sport whose structure let us sidestep our hardest problem. Baseball let People Analytics behave as though a person’s contribution could be lifted cleanly out of its context and counted - the individual rating, the nine-box, the high-potential list. The Moneyball analogy was comforting precisely because it let us keep measuring individuals when the value was never in the individuals alone. Football offers no such escape. Neither does the work most of us actually do.

A problem named in 1972

None of this is new, and the deepest statement of it predates the sports analytics industry by decades. In 1972 the economists Armen Alchian and Harold Demsetz published Production, Information Costs, and Economic Organization, and put their finger on the difficulty precisely. “In team production,” they wrote, “marginal products of cooperative team members are not so directly and separably ... observable.” What a team sells to the market is the marginal product of the team - not of its members. You cannot cheaply meter who contributed what.

Their proposed solution is the part worth dwelling on. Since output cannot be attributed to individuals, they argued, you install a monitor - someone who watches the team and apportions reward by input rather than output. That monitor is the manager. The modern firm, in their telling, exists partly because individual contribution is unmeasurable and somebody has to be paid to judge it instead.

Read that and look at what HR actually does. The performance rating, the calibration meeting, the nine-box grid, the individual objective: this is the 1972 workaround, automated and dashboarded. It is a response to the metering problem - a way of guessing at individual contribution when you cannot measure it - that we have mistaken for the natural order of things. The discipline did not solve the credit-assignment problem; it inherited a fifty-year-old way of working around it, and then built software on top. Game theory does offer a more principled answer - Lloyd Shapley’s values (1953), which divide credit among collaborators according to what each adds across every possible combination of the others, and which now underpin the SHAP method used to explain machine-learning models. It is a genuinely promising route for People Analytics, and one I will explore properly in a forthcoming issue. The same unsolved problem is the hidden engine under two further arguments I take up elsewhere - how we rate individual performance, and how we design bonuses - where we insist that work is collective and then reward it as though it were not.
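
As a rough illustration of the Shapley idea, the sketch below divides a team's output among three collaborators by averaging each person's marginal contribution over every order in which the team could have been assembled. The roles and output figures are invented for the example, and the brute-force enumeration only suits very small teams.

```python
from itertools import permutations


def shapley_values(players, value):
    """Average each player's marginal contribution over every join order."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Credit p with what the group gains when p joins it.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}


# Hypothetical output for every possible sub-team (made-up numbers).
# The coordinator produces nothing alone but lifts everyone else.
TEAM_OUTPUT = {
    frozenset(): 0,
    frozenset({"analyst"}): 10,
    frozenset({"engineer"}): 10,
    frozenset({"coordinator"}): 0,
    frozenset({"analyst", "engineer"}): 20,
    frozenset({"analyst", "coordinator"}): 20,
    frozenset({"engineer", "coordinator"}): 20,
    frozenset({"analyst", "engineer", "coordinator"}): 60,
}

credit = shapley_values(
    ["analyst", "engineer", "coordinator"], TEAM_OUTPUT.__getitem__
)
for person, share in credit.items():
    print(f"{person}: {share:.2f}")
```

The shares always add up to the full team's output, and in this toy case the coordinator, who would score zero on any individual tally, still receives a substantial share because the others produce more when that person is present.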

The interesting question, then, is who has done better than the workaround. The answer is several fields at once - and football is the best-funded of them.

What the football academics actually work on

Strip away the commentary and the academic football programme turns out to be a sustained assault on exactly the questions People Analytics finds hardest. Three are worth exploring, and each has a partner outside sport.

The first is how to value a contribution that is not the goal. The flagship work comes from the sports-analytics lab at KU Leuven, where Tom Decroos, Lotte Bransen, Jan Van Haaren and Jesse Davis built VAEP - a framework whose founding paper is pointedly titled Actions Speak Louder than Goals. Instead of crediting the scorer, it values every action by how much it shifts the probability of scoring or conceding, given the state of play. A related idea, Karun Singh’s “expected threat”, treats the pitch as a set of states and credits a player for moving the ball into a more dangerous one - a Markov chain, in plain terms, laid over a football pitch. This is the credit-assignment problem of 1972, attacked head-on with probability rather than dodged with a monitor.
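
To make the Markov-chain idea concrete, here is a minimal sketch in the spirit of expected threat, not a reproduction of Singh's model. It assumes a pitch cut into just three zones, with invented probabilities of shooting, scoring and moving the ball between zones; any probability missing from a row of the movement table is treated as losing possession.

```python
def expected_threat(shoot_prob, goal_prob, transition, iterations=100):
    """Iterate the value of holding the ball in each zone until it settles.

    Value of a zone = chance of shooting and scoring from it, plus the
    chance of moving the ball on, weighted by the value of where it lands.
    """
    n = len(shoot_prob)
    xt = [0.0] * n
    for _ in range(iterations):
        xt = [
            shoot_prob[z] * goal_prob[z]
            + (1 - shoot_prob[z])
            * sum(transition[z][k] * xt[k] for k in range(n))
            for z in range(n)
        ]
    return xt


# Three hypothetical zones: 0 = own half, 1 = midfield, 2 = final third.
SHOOT = [0.0, 0.05, 0.30]   # chance a player in the zone shoots
GOAL = [0.0, 0.02, 0.12]    # chance such a shot is scored
MOVE = [                    # chance a move from row-zone ends in column-zone
    [0.5, 0.3, 0.0],
    [0.2, 0.4, 0.2],
    [0.0, 0.3, 0.4],
]

xt = expected_threat(SHOOT, GOAL, MOVE)
# Credit for a pass from midfield into the final third.
pass_value = xt[2] - xt[1]
print([round(v, 4) for v in xt], round(pass_value, 4))
```

The credit for an action is simply the difference in value between the state it started from and the state it produced, which is why a pass that never leads directly to a goal can still carry a positive number.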

The second is how to value the work that never touches the ball. William Spearman’s “pitch control” models and the space-creation work Javier Fernández and Luke Bornn presented at the MIT Sloan conference use tracking data to quantify the decoy run that drags a defender away and opens a gap for someone else. The contribution is decisive and registers in no conventional statistic. Economists have the matching evidence from the workplace: in Peers at Work, Alexandre Mas and Enrico Moretti found that a supermarket cashier’s productivity rose with the productivity of nearby colleagues - but only colleagues who could actually see them. Value was spatial and relational, exactly as pitch control implies. The “glue” colleague who makes everyone around them better is the off-the-ball runner our performance systems are worst at seeing.

The third is the network itself. Passing-network research finds team-level structure that predicts performance and is invisible at the individual level; at ETH Zurich, Ulrik Brandes’s lab treats tracking data as a flow rather than a snapshot. Here the bridge to HR is not even a metaphor: Brandes wrote the standard algorithm for computing betweenness centrality - the very calculation that sits inside the organisational-network-analysis tools sold to HR departments, and the network position that Ronald Burt showed wins people better ideas, better evaluations and faster promotions. The same mathematics, pointed at a pitch instead of an email log.
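
For readers who have not met betweenness centrality, the sketch below computes it for a small, made-up collaboration graph. It follows the general approach of Brandes's algorithm (one breadth-first search per person, then a backward pass that accumulates credit), restricted here to unweighted, undirected ties. The names and the graph are hypothetical.

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality for an unweighted, undirected graph."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search, counting shortest paths from the source.
        order = []
        preds = {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, passing credit to predecessors.
        dep = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                dep[v] += paths[v] / paths[w] * (1 + dep[w])
            if w != source:
                score[w] += dep[w]
    # Each undirected pair was counted once from each end.
    return {v: s / 2 for v, s in score.items()}


# Two tight groups joined only through cara and dev.
GRAPH = {
    "ana": ["ben", "cara"],
    "ben": ["ana", "cara"],
    "cara": ["ana", "ben", "dev"],
    "dev": ["cara", "eli", "fay"],
    "eli": ["dev", "fay"],
    "fay": ["dev", "eli"],
}

scores = betweenness(GRAPH)
print(scores)
```

In this toy graph the two people who bridge the groups carry all of the betweenness, while everyone else scores zero, whatever their individual activity.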

Figure: The same England side, two ways. Left: a conventional passing network, thick where passes were frequent - it measures activity. Right: the same passes weighted by Expected Threat, by how much each raised the chance of scoring - it measures contribution. The network stops counting what players did and starts valuing what each pass was worth. Read “pass” as any unit of collaborative work and the lesson transfers intact. Source: Ma et al. (2026), CC BY 4.0.

Sitting on top is the generative turn my reading feed had flagged. DeepMind’s TacticAI, built with Liverpool, does not just describe a corner - it generates alternative setups a coach can compare with the real one, and Liverpool’s analysts preferred its suggestions to the genuine article nine times out of ten. The move is from what happened to what if. People Analytics is mostly still on what happened.

What I would actually steal

I do not read this work hoping to import it wholesale - the data is too rich and too invasive to copy, and I will come to that. I read it the way I read personnel economics or computational social science: hunting for an analytical move I can lift out, strip to its principle, and rebuild for the workforce. Three are worth the effort.

The first is Bayesian updating. The strongest thing in Graham’s book, to my mind, is not any single model but a habit: belief is something you revise as evidence arrives, not a number you publish once a year. Football modelling has done this since the Dixon-Coles model of 1997 and its later Bayesian forms. People Analytics barely does it at all. A flight-risk score, an engagement reading, a high-potential label - each is produced on a cycle and then frozen until the next one. The refinement is to treat these as living estimates that move with the evidence, and to state the uncertainty around them honestly.
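
A minimal sketch of what a living estimate could look like, assuming the quantity of interest is a simple rate (say, the share of a team's people who leave in a quarter) and using a Beta-Binomial model. The prior and the quarterly counts are invented; this is one convenient model, not the method used in Graham's book or in Dixon-Coles.

```python
from math import sqrt


def update(belief, events, non_events):
    """Fold new evidence into a Beta(a, b) belief about a rate."""
    a, b = belief
    return (a + events, b + non_events)


def summarise(belief):
    """Return the mean of the belief and its standard deviation."""
    a, b = belief
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd


# Hypothetical prior: about 20%, held loosely.
belief = (2.0, 8.0)
history = [summarise(belief)]

# Invented quarterly evidence: (people who left, people who stayed).
for left, stayed in [(1, 3), (0, 5), (0, 6)]:
    belief = update(belief, left, stayed)
    history.append(summarise(belief))

for mean, sd in history:
    print(f"estimate {mean:.3f} +/- {sd:.3f}")
```

The point of the exercise is the second number: the estimate is reported with its uncertainty, and both move each time evidence arrives rather than once a year.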

The second is shrinkage, and it is the one technique here that solves a problem we actually have. In 1977 Bradley Efron and Carl Morris used early-season baseball batting averages to show that a player’s noisy individual figure is a worse predictor of their true ability than that same figure pulled partway toward the group average. Small samples lie; partial pooling corrects them. People Analytics is almost always a small-sample problem - a handful of people per manager, per team, per role - and we routinely report raw per-unit numbers that are mostly noise. Empirical-Bayes shrinkage should be the default, not an exotic. It needs none of football’s data riches; it just needs us to stop trusting small averages.
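
The sketch below shows one crude way to do this for rates, such as the share of favourable survey responses per team. It fits a Beta prior to the spread of the raw team rates by matching moments, then pulls each team toward the overall average in proportion to how little data it has. The team names and counts are made up, and a production analysis would want a more careful estimator than this.

```python
def shrink_rates(counts):
    """counts maps unit -> (successes, trials); returns unit -> shrunk rate."""
    total_k = sum(k for k, n in counts.values())
    total_n = sum(n for k, n in counts.values())
    grand = total_k / total_n
    raw = {u: k / n for u, (k, n) in counts.items()}

    # Spread of raw rates = real differences + sampling noise.
    observed_var = sum((r - grand) ** 2 for r in raw.values()) / len(raw)
    noise_var = sum(
        grand * (1 - grand) / n for k, n in counts.values()
    ) / len(counts)
    signal_var = observed_var - noise_var
    if signal_var <= 0:
        # No evidence of real differences: pool completely.
        return {u: grand for u in counts}

    # Prior strength, expressed as a number of pseudo-observations.
    strength = max(grand * (1 - grand) / signal_var - 1, 0.0)
    return {
        u: (k + strength * grand) / (n + strength)
        for u, (k, n) in counts.items()
    }


# Hypothetical teams: (favourable responses, respondents).
TEAMS = {
    "team_a": (4, 4),
    "team_b": (1, 5),
    "team_c": (30, 50),
    "team_d": (55, 100),
    "team_e": (2, 6),
}

shrunk = shrink_rates(TEAMS)
for team, (k, n) in TEAMS.items():
    print(f"{team}: raw {k / n:.2f} -> shrunk {shrunk[team]:.2f} (n={n})")
```

The four-person team with a perfect score moves a long way toward the average, while the hundred-person team barely moves at all, which is the behaviour the raw per-unit league table lacks.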

The third is simulation - testing an intervention in a model before committing to it in the organisation, and using generative “what-if” to explore options rather than only explaining the past. This is the family my own system-dynamics work already belongs to, and the one the football frontier has pushed furthest.

What does not transfer - and should not

Honesty requires the other side of the ledger, because the temptation is to want the methods, and the methods will not come.

Football analytics runs on data of a richness the workplace will never - and should never - have. Every event is logged; players are tracked twenty-five times a second; the objective is singular and unambiguous (win the match) with a clean proxy (goals); feedback arrives every few days across thousands of comparable situations. Workforce data is the mirror image: sparse, infrequent, often self-reported, with outcomes that are contested, multi-dimensional, and play out over years. A good hire reveals itself over a decade and is never repeated under the same conditions.

There is also a line I would not cross even if I could. Tracking a footballer at twenty-five hertz is contractual and consented. The data football depends on is, in the workplace, the data we should refuse to gather. A field that mistakes football’s surveillance for something to aspire to has misread the lesson badly - and the most powerful tools here, the sociometric badges and the communication-metadata networks, are exactly the ones that sit on that line.

And a note of humility, courtesy of football’s own mistakes. The acute-chronic workload ratio swept the sport as an injury predictor, was adopted everywhere, and was then substantially dismantled - the evidence simply did not support the claim. That is the lifecycle of more than one People Analytics metric. Import the humility, not the ratio.

So the methods do not transfer. The questions do - and on the most important one, the evidence from the world of work has already returned a verdict, quite independently of any football lab.

Why buying the star rarely works

Here the football evidence is unusually blunt, and it lands on recruitment - the part of the game Liverpool built its reputation on. Stefan Szymanski’s long-running analysis, the backbone of Soccernomics, found that across forty English clubs over two decades, transfer spending explained only about 16% of the variation in league position - while the wage bill explained around 92%. What a club pays to acquire a player barely moves the needle; what moves it is the standing quality of the squad it assembles and keeps. Buying the expensive name is close to the least reliable thing a club can do, and the work on transfer fees agrees - fees track age, position, reputation and market demand at least as much as they track what a player will actually contribute. This is precisely why marquee signings so often disappoint: the fee prices a reputation, not a portable performance, and the performance turns out to depend on a team and a system the buyer cannot purchase along with the player. Ian Graham’s account of Liverpool’s recruitment is, at bottom, a long argument for valuing how a specific player would contribute to this side, rather than how good he looks in the abstract.

I had a glimpse of the same logic, from the other side, around the year 2000. I came at it with one foot in executive search and the other in banking - I had recently run European campus recruitment for J.P. Morgan’s markets business - when one of the other big Wall Street banks interviewed me for a role with an unusual brief: map the teams of their major competitors, not just the star names. Tellingly, the initiative was run by the business, not by HR. Their reasoning was coldly clear. If a rival desk became too strong, they would not poach its best individual; they would lift the whole unit, because the capability lived in the group, not in any one person you could hire away from it. I did not find it chilling - I found it exciting. I had got into this work believing that changing a business’s people was one of the fastest ways to change its results, and here was that belief priced, commercially, into a strategy - a decade before the research would catch up and prove it.

That research is Boris Groysberg’s Chasing Stars, which tracked more than a thousand star Wall Street equity analysts and asked whether their performance travelled when they changed firm. It did not: stars who moved suffered an immediate and lasting decline, because their excellence had been embedded in their old firm’s resources, colleagues and networks rather than carried in their own heads. The clean exceptions prove the mechanism - those who moved with their team kept performing, and women, whose networks tended to be more external and portable, travelled better than men.

Set the findings side by side and the convergence is hard to miss. VAEP says the goalscorer is not the whole story. Pitch control and Mas-Moretti say the decisive contribution may be invisible. Anita Woolley’s Science paper found a measurable “collective intelligence” in groups that is driven not by members’ average ability but by how evenly they take turns and how socially attuned they are. Groysberg says the star is not portable because the performance was never only theirs. Several fields, opposite data, the same finding: in interdependent work, value lives in the relationships, not the individual.

Which brings me back to that guarded Google deck. The best work in People Analytics, like the best work in football - Liverpool’s recruitment models, Brighton’s and Brentford’s, kept under some of the strictest non-disclosure agreements in sport - is precisely the work nobody lets you keep. In both fields the people with the best data have the least reason to share what they have learned. We are reasoning, always, from the visible half.


The Working Idea

Stop measuring the nodes. Start measuring the edges.

The borrowing from baseball was not wrong because baseball is unserious - it is the most serious analytics in sport. It was wrong because baseball is the one game where individual value is genuinely separable, and our work is not like that. Football, and the evidence from our own world, both say the same thing the economists said in 1972: contribution is produced by a chain, value sits between people as much as in them, and the most important work often leaves no individual trace.

In practice that means a few things. Treat a “high performer” as a claim about a person in a context, not a portable property - and ask what the surrounding network loses if they leave, or you move them. When you assess a contribution, look for the pass before the pass: the enabling, unblocking and sponsoring work the systems render invisible. Hold your estimates as beliefs that update with evidence, not labels fixed once a year, and shrink the small-sample numbers that are mostly noise. Be sceptical of any metric that scores people as though they played baseball. And resist the data envy: the answer to thin workforce data is better questions and better priors, not heavier surveillance.

The honest reading is not that football hands People Analytics a toolkit. It is that football has spent a decade and a fortune confirming, with data we will never have, the thing our field has spent that same decade arranging not to see. We did not need a better model. We needed to stop borrowing from the only sport that let us look away.

Sources

Alchian, A. & Demsetz, H. (1972) - Production, Information Costs, and Economic Organization. American Economic Review
Mas, A. & Moretti, E. (2009) - Peers at Work. American Economic Review
Woolley, A.W. et al. (2010) - Evidence for a Collective Intelligence Factor in the Performance of Human Groups. Science
Decroos, T., Bransen, L., Van Haaren, J. & Davis, J. (2019) - Actions Speak Louder than Goals: Valuing Player Actions in Soccer (VAEP). KU Leuven / KDD
Singh, K. (2018) - Introducing Expected Threat (xT). karun.in
Fernández, J. & Bornn, L. (2018) - Wide Open Spaces. MIT Sloan Sports Analytics Conference
Szymanski, S. - Wages, transfers and the variation of team performance in the English Premier League. The Soccernomics finding (wages ~92% vs transfers ~16%)
On Moneyball in HR: SHRM, Visier, Fast Company
Brandes, U. - Social Networks Lab, ETH Zurich. Football network analysis; betweenness-centrality algorithm
Ma, R., Bischofberger, J., da Silva Torres, R., Baca, A. & Exel, J. (2026) - A contribution-based valued passing network for quantitative evaluation of player performance and coordination in football. Quality & Quantity (CC BY 4.0)
Google DeepMind & Liverpool FC (2024) - TacticAI: an AI assistant for football tactics. Nature Communications
Dixon, M. & Coles, S. (1997) - Modelling Association Football Scores and Inefficiencies in the Football Betting Market. Journal of the Royal Statistical Society: Series C
Efron, B. & Morris, C. (1977) - Stein’s Paradox in Statistics. Scientific American
Groysberg, B. (2010) - Chasing Stars: The Myth of Talent and the Portability of Performance. Princeton University Press
Graham, I. (2024) - How to Win the Premier League. Century / Penguin
Big Ideas in Sports Analytics and Statistical Tools for their Investigation. arXiv (on the discrete structure of baseball)
Lundberg, S. & Lee, S. (2017) - A Unified Approach to Interpreting Model Predictions (SHAP). arXiv; the machine-learning descendant of Lloyd Shapley’s 1953 value
Couzins, M. (2013) - HR Tech Europe: Big data will enable HR to make better decisions. Personnel Today (the 2013 conference / Google Project Oxygen session)

Every hire is a bet made in the dark

Andrew Marritt — Wed, 17 Jun 2026 04:31:06 GMT

When I sat on hiring panels, I used to notice something that was rarely articulated. The candidates across the table were, quite reasonably, presenting the best versions of themselves: answering questions in a way that emphasised strengths, managed away weaknesses, and projected a degree of confidence that may or may not have reflected how they actually felt. This is not dishonest - it is rational. They did not know whether this was an organisation worth joining. They were constructing a signal.

But I was doing something similar. The culture I was describing was the culture we aspired to, not quite the culture we had. The career trajectory I outlined was the optimistic one. The challenges I mentioned were the manageable ones. I was also constructing a signal.

Neither of us was lying. We were both managing uncertainty - they were uncertain about us, we were uncertain about them - and we were both behaving exactly as people in that situation rationally should. The uncomfortable truth is that this mutual performance is not a failure of the process. It is the structural condition of every hire that has ever been made. And there is a body of economic theory, developed largely in the 1970s and 1980s, that explains with considerable precision why it happens, what its consequences are, and what firms can actually do about it.

That theory has been almost entirely ignored in mainstream HR. This is a significant mistake.

The problem has a name

In 1970, George Akerlof published “The Market for ‘Lemons’: Quality Uncertainty and the Market Mechanism” in the Quarterly Journal of Economics. It was rejected by three journals before acceptance. Thirty-one years later, Akerlof received the Nobel Prize in Economics for it, jointly with Michael Spence and Joseph Stiglitz.

The paper’s immediate subject is used cars. Sellers know whether their car is a lemon (defective) or a peach (good). Buyers cannot tell them apart before purchase. Buyers therefore only offer a price that reflects average expected quality across the pool. But owners of genuinely good cars believe their cars are worth more than average - so they withdraw them from the market. As peaches leave, average quality falls further, the price buyers will pay falls further, and the cycle continues. The market degrades not through dishonesty but through the structure of information: one party knows something the other does not, and this asymmetry reshapes the equilibrium.
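
The unravelling can be traced step by step with a toy market. The sketch assumes twenty cars of evenly spaced quality, sellers who value a car at exactly its quality, and buyers who will pay a fixed premium over the average quality of whatever is still on offer. The figures and the premium are invented; Akerlof's paper works with continuous distributions rather than a list like this.

```python
def unravel(qualities, buyer_premium=1.25, max_rounds=100):
    """Return (price, cars on offer) for each round until the market settles."""
    offered = sorted(qualities)
    history = []
    for _ in range(max_rounds):
        # Buyers cannot tell cars apart, so they price the average.
        price = buyer_premium * sum(offered) / len(offered)
        history.append((price, len(offered)))
        # Owners of cars worth more than the price withdraw them.
        staying = [q for q in offered if q <= price]
        if len(staying) == len(offered) or not staying:
            break
        offered = staying
    return history


# Hypothetical market: cars worth 1,000 to 20,000 to their owners.
CARS = list(range(1000, 20001, 1000))

rounds = unravel(CARS)
for price, on_offer in rounds:
    print(f"price {price:8.0f}  cars on offer {on_offer}")
```

Even though every buyer here would happily pay more than any car is worth to its owner, the market shrinks round after round until only the worst car is left on offer. Nobody has lied at any point; the collapse comes entirely from buyers being unable to tell the cars apart.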

Akerlof observed, in passing, that labour markets have exactly this structure. Employers cannot directly observe worker quality before hiring. I will use “ability” and “quality” as single-dimensional shorthand throughout, because that is how the economics models it, but the reader should hold the simplification lightly: what an employer is really trying to infer is a bundle - cognitive ability, conscientiousness, and the fit between a particular person and a particular role and team, much of which only resolves through experience. Workers know their own abilities, motivations, and work habits considerably better than any interviewer can. Without mechanisms to bridge this gap, labour markets systematically underpay talent and overpay mediocrity - not through any individual malice or carelessness, but as the predictable consequence of the information asymmetry.

The entire apparatus of recruitment - credentials, references, structured interviews, assessment centres, probation periods - is, viewed through Akerlof’s lens, a set of mechanisms that exist to prevent the market from degrading to its lemons equilibrium. This reframes the evaluative question. Rather than asking “are our recruitment processes well-designed?” the right question is “does each element of this process actually bridge the information gap, or does it create the appearance of selection without the substance?”

These are different questions, and they produce different answers.

Signals and the arms race they create

In 1973, Michael Spence published “Job Market Signaling” in the same journal, and addressed a specific puzzle. Why do wages correlate with educational credentials even when the education is not directly relevant to the job being filled? The human capital answer - education increases productivity, therefore wages - cannot account for cases where the content of the education is plainly irrelevant to the role. Spence’s answer is more interesting, and more troubling.

Education, he argued, can function as a signal of ability even if it adds nothing to productivity - provided its acquisition is differentially costly. If high-ability workers can obtain a qualification more easily than low-ability workers (because they find it less arduous, not because they pay less for it), then in equilibrium only high-ability workers will bother acquiring it. Employers, observing this pattern over time, correctly infer that educated candidates are, on average, more able. The credential separates types without necessarily producing them.

The crucial mechanism is differential cost. The signal does not need to generate productivity. It needs to be expensive enough that only the types worth signalling will invest in it. This has an uncomfortable implication: a meaningful fraction of the credential economy may exist not because credentials build skills but because they function as costly hurdles that distinguish ability types. Workers over-invest in credentials relative to what pure skill-building would justify, and this over-investment is individually rational even though it is collectively wasteful.

Barış Kaymak at the Federal Reserve Bank of Cleveland quantified this in a January 2025 working paper. Using longitudinal data and variation in the speed of employer learning across occupations, he estimated that signalling accounts for roughly a quarter of education’s wage premium, with human capital explaining the rest. The aggregate efficiency cost of the information problem - the output lost both to over-investment in credentials and to the occupational misallocation that accompanies it - runs to around 7.6% of average lifetime earnings. That is a substantial number for a distortion that, in economic terms, resolves an information problem rather than building capability.

Spence’s framework also predicts what happens when signalling costs fall. When producing a credential becomes easier for everyone, the differential cost collapses, the credential loses its separating power, and employers respond by requiring higher or rarer credentials - driving further credential acquisition. This is the arms race that the model predicts as the natural equilibrium response to declining signal costs.
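
The logic of differential cost, and of its collapse, fits in a few lines. The sketch below is a deliberately stripped-down version of the signalling condition, not Spence's full model: it assumes two worker types, a fixed wage gap between credentialed and uncredentialed hires, and a single cost of acquiring the credential for each type. All of the figures are hypothetical.

```python
def credential_outcome(wage_with, wage_without, cost_high_type, cost_low_type):
    """Say who acquires a credential, given wages and each type's cost."""
    premium = wage_with - wage_without
    high_acquires = premium >= cost_high_type
    low_acquires = premium >= cost_low_type
    if high_acquires and not low_acquires:
        return "separates"
    if high_acquires and low_acquires:
        return "pools (everyone acquires it)"
    return "pools (no one acquires it)"


# Credential is cheap for the able and expensive for the rest: it separates.
before = credential_outcome(60, 40, cost_high_type=10, cost_low_type=30)

# Costs fall for everyone, as with a machine-written CV: it stops separating.
after = credential_outcome(60, 40, cost_high_type=2, cost_low_type=3)

# Employers respond with a costlier hurdle that restores the gap.
escalated = credential_outcome(60, 40, cost_high_type=15, cost_low_type=45)

print(before, "|", after, "|", escalated)
```

The credential carries information only while the wage premium sits between the two costs. Once both costs drop below the premium, holding the credential says nothing about type, and the only way to recover the information is a hurdle expensive enough to push the less able type's cost back above the premium.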

Grade inflation is the same mechanism running in slow motion. A degree classification is a signal, and as more students cluster at the top of the scale - firsts and high 2:1s that were once scarce - the grade carries less and less information about which graduate is which. The signal pools. Employers respond exactly as the model predicts: they discount the grade and reach for something that still separates, whether that is the selectivity of the institution, a standardised test, or a personal referral. Why grades drift upward in the first place is a separate question, and a well-studied one - it turns on the incentives of the people awarding the grades rather than the students receiving them - so I will leave it to one side. What matters for this argument is the consequence, and the consequence is general: once a signal stops separating, the market does not absorb the loss quietly. It migrates to a costlier signal, and that migration is the arms race.

The same dynamic is now playing out, far faster, in job applications themselves. When a capable language model can produce a polished, tailored CV and cover letter in seconds for any candidate regardless of their ability, the application document ceases to function as a signal. The cost differential that gave it separating power has been eliminated. Employers who screen application documents without accounting for this are evaluating a signal whose informational content has been substantially degraded.

Not all signals erode equally, however. When one collapses to a pooling equilibrium, candidates and employers shift toward the signals that still separate - and the ones that survive are those whose cost comes not from production but from the signaller’s reputation. A credible referral works precisely here: an established professional who recommends a candidate is putting their own standing on the line, and that is a cost no language model can absorb on the candidate’s behalf. The signals that endure are the ones someone has staked something to send.

The German doctorate as a worked example

An argument like this is easier to see in a particular country than in the abstract, and Germany is the obvious case. It produces one of the highest rates of doctorates per head in the OECD, and the “Dr.” title carries a social weight that has no real equivalent in Britain or the United States - historically recorded on identity documents, used in everyday address, and formally rewarded by corporate grading systems that place a doctorate a rung or two above a master’s on entry. The conventional explanation for all this is cultural. The signalling model offers a complementary one, and I find the two together more convincing than either alone.

The structural point turns on a feature of the German system that has only recently changed. Until the Bologna reforms of the 2000s, the standard German degree was the one-cycle Diplom or Magister: everyone who finished left with the same terminal qualification, and there was no intermediate bachelor’s/master’s split at which the abler students could separate themselves from the rest. Entry to university, then as now, was gated mainly by the Abitur rather than by selective competition between institutions (if you had the Abitur you had the right to enter a university of your choice). Put those two facts together and Spence’s logic does the rest. In a system that separates ability neither at entry nor at the master’s stage, the doctorate becomes the load-bearing signal by default - the next costly hurdle available once the cheaper ones have stopped discriminating. Anke Mertens and Heinke Röbken, analysing the returns to German doctorates, make exactly this argument: in a one-cycle system the doctorate is the graduate’s only remaining way to distinguish themselves from less productive peers, whereas in a two-cycle system the master’s has already done part of that work. The model predicts that such a system will over-produce doctorates relative to one where the lower credentials separate cleanly, which is roughly what we observe.

I would not push this further than the evidence allows, because the German case is a good illustration of the mechanism rather than a clean test of it. The doctoral premium in the data behaves less tidily than a pure signalling story would imply - Mertens and Röbken find it concentrated in some disciplines and largely absent in others, and in fields such as engineering and chemistry the doctorate plausibly builds genuine research capability rather than merely sorting it. The status story is real and partly independent of the labour market. But the single most striking finding in their data is a gender asymmetry that the signalling frame explains rather well: the doctorate raises women’s wages more than men’s, because - on the authors’ reading - it signals a career orientation that employers already assume of men by default. A credential earns its return precisely where there is an inference to be corrected, and there is more to correct about the woman than the man. That is the signalling model visible in a single coefficient, sitting inside a picture that human capital and social status also help to paint.

The hidden bias in your external hire pool

The most practically important finding in this literature - and the one I encounter least often in HR discussions - comes from Bruce Greenwald’s 1986 paper “Adverse Selection in the Labour Market,” published in the Review of Economic Studies. I drew on Greenwald’s model in Issue 5 from the worker’s side, to explain why someone who puts themselves on the market is treated as a suspect signal; here the same model matters from the employer’s side, and the consequence is more pronounced.

Greenwald’s model involves three parties: workers, their current employers, and outside firms. The key observation is that current employers learn about worker ability considerably faster than the external market can. Within a year or two of hire, a firm typically knows far more about a person’s actual performance than any outside recruiter can determine from a CV and a few hours of interview. This information advantage has a structural consequence that most hiring processes ignore entirely.

Employers who identify high-ability workers respond rationally: they retain them. They offer raises, more interesting work, clearer development paths. Workers who receive none of these signals update their beliefs about their own standing and become more likely to look elsewhere (something I modelled in the simulation in Issue 3). Outside firms observe the resulting pool of job-changers - but cannot observe why each person is changing jobs.

The external job market is therefore structurally skewed toward workers whose current employer was willing to let them go. Not because all job-changers are weak performers - many are leaving for reasons entirely unrelated to their performance - but because strong performers are, on average, systematically less available. Their employers are working harder to keep them. The stream that reaches the external market is not a random sample of the workforce.
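A toy simulation makes the selection effect concrete. It is a sketch under stated assumptions, not Greenwald's model itself: the ability distribution, the 10% baseline exit rate, and the extra 20% exit rate for below-average performers are all hypothetical numbers picked for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical workforce: ability drawn from a standard normal distribution.
abilities = [rng.gauss(0.0, 1.0) for _ in range(20_000)]


def leaves(ability):
    """Decide whether a worker joins the external job-changer pool.

    Assumption: every worker has a 10% chance of leaving for reasons unrelated
    to performance; below-average performers, who receive fewer retention
    offers, have an extra 20% chance of looking elsewhere.
    """
    p = 0.10 + (0.20 if ability < 0.0 else 0.0)
    return rng.random() < p


job_changers = [a for a in abilities if leaves(a)]

workforce_mean = mean(abilities)
pool_mean = mean(job_changers)
share_above_average = sum(a > 0.0 for a in job_changers) / len(job_changers)

print(f"workforce mean ability: {workforce_mean:+.2f}")
print(f"job-changer pool mean:  {pool_mean:+.2f}")
print(f"share of pool above average: {share_above_average:.0%}")
```

Under these assumptions the pool still contains plenty of strong performers, about a quarter of it in expectation, yet its average sits well below the workforce average. No individual job-changer has to be weak for the pool as a whole to be skewed.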

Robert Gibbons and Lawrence Katz confirmed this empirically in their 1991 paper “Layoffs and Lemons” in the Journal of Labor Economics. They compared the post-displacement wages and unemployment durations of workers laid off individually against workers displaced in plant closings. The logic is clean: an individual layoff carries information about the firm’s assessment of that specific worker; a plant closing does not, because the firm is letting everyone go regardless of individual performance. Gibbons and Katz found exactly what Greenwald’s model predicts. Individually laid-off workers had substantially lower post-displacement wages and longer unemployment spells than plant-closing displaced workers with equivalent pre-displacement salaries. The market is drawing an inference from the cause of displacement - and that inference is broadly correct.

This finding has a practical consequence. Every external hire drawn from the active job-changer pool carries a Greenwald penalty: a structural prior, correct on average, that the person’s current employer was less reluctant to lose them than they were to lose their high performers. This is not a reason to stop hiring externally. It is a reason to treat rigorous assessment as non-optional, because rigorous assessment is the only available mechanism for overcoming the adverse selection bias that the structure of the market produces.

One important caveat that Gibbons and Katz make: candidates who were displaced through organisational restructuring - mass redundancies, plant closings, sector-wide contractions - are not carrying this adverse selection penalty. The market’s inference does not apply when the departure was structural rather than performance-related. Recruiting processes that penalise candidates for having been made redundant are acting on an inference that does not hold in their situation.

What employers signal, and to whom

The economics literature focuses mostly on candidates as signallers and employers as receivers. But the information asymmetry runs in both directions, and employers signal too - often without being aware of what those signals communicate.

Michael Waldman identified a distortion in this direction in his 1984 paper “Job Assignments, Signalling, and Efficiency,” published in the RAND Journal of Economics. When a firm promotes a worker, outside firms observe the promotion and update their beliefs about that worker’s ability upward. Promotion is, to the external market, a credible endorsement: the employing firm has assessed this person as high-ability and rewarded them accordingly. This creates an externality that most talent management practice ignores entirely.

Firms have a systematic incentive to under-promote high-ability workers relative to what pure productivity-maximisation would dictate. Promotion makes valuable people more visible to competitors, raises their external market value, and increases the probability of losing them. The firm captures the productivity benefit of having a high performer but bears the risk of financing their market premium. The rational response is to delay promotion - which is why promotion timelines in many organisations seem slower than performance warrants, and why strong performers sometimes leave for what appears to be a marginal salary improvement at a competitor: they are, in effect, receiving at a new employer the market rate that their current employer was suppressing.

Pablo Kurlat and Florian Scheuer extended this logic in their 2021 paper “Signalling to Experts,” published in the Review of Economic Studies. They showed that firms differ in their ability to assess candidate quality directly - some can effectively evaluate ability through rigorous assessment, others cannot and must rely on credential proxies. In equilibrium, high-ability candidates who can access expert firms forgo costly credential signalling: there is no need to invest in credentials when the employer can assess you directly. The practical implication is significant. A firm that develops genuine assessment capability is not just improving its process - it is positioning itself to hire from a pool of high-ability candidates who are invisible to credential-reliant competitors. Assessment expertise is a competitive advantage in talent markets, not a process improvement.

On the candidate side of the information asymmetry, the employer has a choice that is usually framed as a values question but is equally an economic one. A firm can manage its presentation tightly throughout recruitment - communicate an aspirational version of the culture, smooth the edges, leave candidates with impressions that the work experience will be better than it is. Or it can invest in honest candidate communication, accepting lower application volumes in exchange for better self-selection quality. The economics are straightforwardly in favour of the latter. Over-promising in recruitment creates predictable post-hire disengagement: Denise Rousseau’s research on psychological contracts - the implicit promises that form between employer and candidate during recruitment - demonstrates that when experienced reality violates those expectations, the consequences for commitment, performance, and attrition are measurable and persistent. None of this is unfamiliar territory to organisational psychologists: John Wanous spent a career on the realistic job preview - the practice of giving candidates a deliberately balanced, accurate account of the job before they join - and the meta-analytic record shows such previews reduce early voluntary attrition, apparently less by lowering expectations than by signalling the employer’s honesty, which raises commitment before the first day. The economics I am describing and the I/O practice are two accounts of the same phenomenon. The cost of replacing a new hire who leaves within eighteen months because reality differed from the recruitment story is substantially higher than the cost of a smaller, better-matched initial pool.

The problem doesn’t end at the offer

Everything so far has turned on a single question: who is this person? That is the question selection exists to answer, and it is genuinely hard. But it is worth noticing that the information problem does not close when the contract is signed. It changes shape. Before the hire, the thing the employer cannot see is the candidate’s type - their underlying ability. After the hire, ability slowly becomes visible, and a different unobservable takes its place: effort. The employer still cannot see, from the outside, how hard someone is actually working, or whether the output in front of them reflects diligence, luck, or the quiet help of a good team. Economists call the first problem adverse selection and the second moral hazard, and a great deal of what we call management is a response to the second, just as recruitment is a response to the first.

Because effort is hidden, firms reach for mechanisms that make people want to supply it without being watched. Incentive pay is the obvious one: tie reward to output and the worker has a reason to exert effort the employer could never compel directly. A subtler mechanism was identified by Carl Shapiro and Joseph Stiglitz in 1984. If a firm pays somewhat above the going market rate, then losing the job becomes genuinely costly, and the fear of losing it disciplines effort more cheaply than any monitoring system could. The striking implication of their model is that some unemployment is not a malfunction of the labour market but a feature of it - wages sit above the level that would clear the market precisely so that the threat of dismissal has teeth.

This needs handling with care, because it sits right next to a claim I made in the previous issue that looks like its mirror image. There I argued that firms tend to underpay their high performers, capturing the surplus the outside market cannot yet see. Both things are true, and they stop being in tension the moment you notice they are measured against different yardsticks. The efficiency-wage premium is paid above the wage that would clear the market - the floor. The underpayment in Issue 5 is relative to the worker’s true marginal product - the ceiling. A firm can comfortably do both at once: lift everyone above the floor so the threat of dismissal stays real, while still paying its best people well below what they are actually worth, because the information that would justify a higher wage is locked inside the firm. The same asymmetry produces both effects.

The post-hire period is also when the asymmetry that recruitment could not resolve quietly begins to resolve itself. As a worker accumulates tenure, the weight the market places on their credentials falls and the weight it places on demonstrated productivity rises - the lemons problem dissolving in slow motion as evidence accumulates. But the resolution is uneven: the employing firm sees the evidence first, and the outside market sees it late, or not at all. That gap is the engine of the surplus-capture story I traced in Issue 5, and it is the same gap that makes the external hiring pool adversely selected. It is worth seeing that the adverse selection at the gate, the surplus a firm captures from its stars, and the bias in the external market are not three separate phenomena. They are one informational fact - the employer learns faster than the market - viewed from three different angles.

And signalling does not stop at the door. Bengt Holmström’s work on career concerns makes the point that workers keep signalling throughout their careers, because the market is continuously updating its estimate of their ability, and that estimate sets their future wages. This is part of why early-career professionals often work harder than their current pay alone would rationally justify: the return is not this year’s salary but the belief they are building in the minds of the people who will set next year’s. The candidate constructing a careful signal across the interview table never stops constructing it. They simply change audience.

The useful reframe is this. It is tempting to treat hiring as the decision and everything after it as execution - to imagine that once the offer is accepted, the uncertainty has been spent. The economics says the opposite. The information problem is not a gate you pass through once. It is the climate the whole relationship is conducted in. Recruitment is simply the moment it is most acute, most expensive, and least resolvable.

What this framework changes

Information economics does not replace the practitioner’s toolkit. It reframes what the toolkit is for, and in doing so changes which questions are worth asking.

Most recruitment is evaluated on process metrics: time to hire, cost per hire, offer acceptance rates, hiring manager satisfaction scores. These measure how smoothly the machine runs, not whether the machine is solving the right problem. The information economics framing asks a harder question: does each element of our recruitment process actually resolve the information gap, or does it produce the appearance of selection without the substance?

Take that question seriously and a few things follow. Credentials are a signal whose differential cost is eroding fastest at the application stage - and knowledge work leans on that signal most heavily, precisely because ability there is the hardest thing to observe directly. Using them as a primary screen without a clear theory of what they still separate, in your specific context, is optimism rather than selection - and on Kaymak’s evidence, roughly a quarter of the credential premium is pure signalling cost with no productive return, a sizeable deadweight loss to build a process around. The external pool carries an adverse-selection bias that only rigorous assessment can correct, which is why the firms that have genuinely invested in structured, validated assessment are not merely running a tidier process: they are stepping out of an equilibrium that quietly penalises everyone relying on credentials and first impressions. And the candidate-side problem has real costs of its own - the employer that communicates accurately rather than attractively is not indulging an HR virtue but investing in pool quality that compounds as its reputation accumulates.

None of this settles what assessment methods are worth investing in. The validity evidence for selection tools is real, but considerably more uncertain than its conventional presentation suggests. The headline figures from Frank Schmidt and John Hunter’s meta-analyses are frequently cited without the credible intervals that surround them, and they have structural limitations for knowledge-work roles that the field has not fully reckoned with. Nor does the economics frame have any vocabulary for a tension that selection practitioners cannot avoid: the highest-validity tools also tend to produce the largest subgroup differences, so a decision to lean on them carries adverse-impact consequences - ethical and legal - that have to be weighed alongside the validity rather than read off it. That is where this series goes next.

Next in the series: The validity evidence - what the research on selection tools actually shows, and where it runs out.

Sources

Akerlof, G.A. (1970) - The Market for “Lemons”: Quality Uncertainty and the Market Mechanism. Quarterly Journal of Economics
Spence, M. (1973) - Job Market Signaling. Quarterly Journal of Economics
Kaymak, B. (2025) - Quantifying the Signaling Role of Education. Federal Reserve Bank of Cleveland Working Paper No. 25–02. (SSRN)
Greenwald, B.C. (1986) - Adverse Selection in the Labour Market. Review of Economic Studies
Gibbons, R. & Katz, L.F. (1991) - Layoffs and Lemons. Journal of Labor Economics
Waldman, M. (1984) - Job Assignments, Signalling, and Efficiency. RAND Journal of Economics
Kurlat, P. & Scheuer, F. (2021) - Signalling to Experts. Review of Economic Studies
Mertens, A. & Röbken, H. (2013) - Does a Doctoral Degree Pay Off? An Empirical Analysis of Rates of Return of German Doctorate Holders. Higher Education
Rousseau, D.M. (1989) - Psychological and Implied Contracts in Organizations. Employee Responsibilities and Rights Journal
Premack, S.L. & Wanous, J.P. (1985) - A Meta-Analysis of Realistic Job Preview Experiments. Journal of Applied Psychology
Shapiro, C. & Stiglitz, J.E. (1984) - Equilibrium Unemployment as a Worker Discipline Device. American Economic Review
Holmström, B. (1999) - Managerial Incentive Problems: A Dynamic Perspective. Review of Economic Studies
Schmidt, F.L. & Hunter, J.E. (1998) - The Validity and Utility of Selection Methods in Personnel Psychology. Psychological Bulletin

The output gap between your best and worst performers is probably larger than a year’s salary - and most organisations act as if it doesn’t exist.

Andrew Marritt — Wed, 03 Jun 2026 04:30:38 GMT

Almost every standard practice in people management - pay bands, headcount planning, treating attrition as a simple replacement cost - quietly assumes that individual output clusters tightly around the mean. One number, now thirty-five years old, says that assumption is wrong by a wide margin.

In a high-complexity knowledge job - strategy, analytics, research, software engineering, senior consulting - the standard deviation of individual output is approximately 48% of mean output. That figure comes from John E. Hunter, Frank L. Schmidt, and Michael K. Judiesch, who published it in the Journal of Applied Psychology in 1990, and it has held up well in the decades since. Throughout this article I treat the monetary value of mean output as roughly equal to mean salary - a conservative simplification, since a viable firm must on average produce more value per employee than it pays in wages. So a team averaging £80,000 has an output standard deviation of about £38,400 per year, and if anything the true figure is larger.

Pull out a normal distribution - an assumption I complicate two sections below - and think about what that means. The worker at the 84th percentile - one standard deviation above the mean - produces approximately £38,400 more per year than the median worker. The worker at the 16th percentile produces £38,400 less than the median. Measured from the centre of the distribution, each of those gaps is roughly half an annual salary. The gap between the 84th and 16th percentile workers themselves - both drawing broadly similar salaries - is therefore close to a full annual salary’s worth of output. The gap between the 95th and 5th percentile is larger still: roughly one and a half times an annual salary. If the true distribution has a heavier right tail than the normal - as the next-but-one section argues - these figures understate the gap at the top, not overstate it.

That is not a rounding error. That is the single most important empirical fact in people management, and most organisations treat it as if it does not exist.
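The arithmetic behind those gaps can be checked directly. The sketch below assumes, as the text does, a normal distribution of output with mean output valued at the mean salary. The variable and function names are illustrative.

```python
from statistics import NormalDist

mean_salary = 80_000   # team average from the text
sd_share = 0.48        # Hunter, Schmidt and Judiesch figure for high-complexity jobs
sd_output = sd_share * mean_salary

# Output modelled as normal, with mean output valued at mean salary
# (the simplification used throughout the article).
output = NormalDist(mu=mean_salary, sigma=sd_output)


def gap(upper_pct, lower_pct):
    """Output gap between two percentiles, as a multiple of mean salary."""
    return (output.inv_cdf(upper_pct) - output.inv_cdf(lower_pct)) / mean_salary


print(f"SD of output: £{sd_output:,.0f}")
print(f"84th vs 16th percentile: {gap(0.84, 0.16):.2f} x mean salary")
print(f"95th vs 5th percentile:  {gap(0.95, 0.05):.2f} x mean salary")
```

Because the gaps are expressed as multiples of mean salary, they depend only on the 48% ratio and not on the £80,000 figure itself.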

The SDy figure - standard deviation of productivity, expressed in dollars or pounds - has been around since the 1940s, when Herbert Brogden first formalised it as the key variable in selection utility analysis. Hunter and Schmidt spent decades refining the estimates. The 1990 paper is the definitive version: they corrected for measurement error and range restriction in the underlying datasets, producing figures that are conservative rather than inflated. The progression is 19% for low-complexity jobs, 32% for medium-complexity, and 48% for high-complexity non-sales roles. For sales positions, the figures are considerably larger still.

The Brogden-Cronbach-Gleser utility formula - developed across a series of papers between the 1940s and the 1970s - converts these SDy figures into a decision-relevant number: the expected monetary gain from any improvement in selection validity. The formula is

ΔU = N × T × SDy × Δr × Zs − Costs,

where:

ΔU is the expected gain in the monetary value of output;
N is the number of workers selected;
T is their average tenure in years;
SDy is the standard deviation of productivity in monetary terms;
Δr is the improvement in selection validity (the increase in correlation between the tool’s scores and actual job performance);
Zs is the mean standardised score of selected applicants, which rises as the selection ratio falls (the more candidates you screen per hire, the better your average pick); and
Costs is the total expenditure on the selection programme.

Every term in that equation is estimable. The point is that SDy is the multiplier that makes the rest of the equation matter: a small improvement in how well you select for a high-complexity role produces a large absolute gain precisely because the underlying distribution is so wide. I will return to this in the recruitment series that follows. For now, the key point is simpler: the distribution is wide enough that who you hire matters enormously, and wide enough that the gap between your best and average employee is worth actively managing.
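To make the formula concrete, here is a minimal sketch with hypothetical inputs. It assumes normally distributed assessment scores and strict top-down selection, in which case Zs equals the normal density at the cut-off divided by the selection ratio. The function names and every number in the example programme are illustrative assumptions, not estimates.

```python
from statistics import NormalDist


def mean_z_of_selected(selection_ratio):
    """Mean standardised predictor score of those hired under top-down selection.

    Assumes normally distributed scores: the mean of the top slice equals the
    normal density at the cut-off divided by the selection ratio.
    """
    std_normal = NormalDist()
    cutoff = std_normal.inv_cdf(1.0 - selection_ratio)
    return std_normal.pdf(cutoff) / selection_ratio


def utility_gain(n_hired, tenure_years, sd_y, delta_r, selection_ratio, costs):
    """Brogden-Cronbach-Gleser gain: N x T x SDy x delta-r x Zs - Costs."""
    z_s = mean_z_of_selected(selection_ratio)
    return n_hired * tenure_years * sd_y * delta_r * z_s - costs


# Hypothetical programme: every input below is an illustrative assumption.
gain = utility_gain(
    n_hired=10,
    tenure_years=3,
    sd_y=38_400,           # 48% of an £80,000 mean salary
    delta_r=0.10,          # validity improves by 0.10
    selection_ratio=0.20,  # one hire per five candidates assessed
    costs=50_000,
)
print(f"Expected gain: £{gain:,.0f}")
```

Halving `sd_y` halves the gain before costs, so the same validity improvement is worth far more in a high-complexity role than in a low-complexity one.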

What shape does the distribution take?

In Issue 3, I looked at this question in some depth and ran a simulation showing the effect of measurement noise on the observed distribution. I will not repeat that analysis here, but one headline finding is worth flagging for context.

The default assumption in HR practice - that performance is normally distributed, symmetrical around the mean - has been empirically challenged. Herman Aguinis and Ernest O’Boyle examined 198 samples covering 633,263 individuals across research, entertainment, politics, and professional sport, and published their findings in Personnel Psychology in 2012. They concluded that individual performance follows a Paretian - power law - distribution rather than a normal one: a small number of people account for a disproportionate share of total output, and the upper tail is much heavier than the bell curve predicts.

Jason W. Beck, Allen S. Beatty, and Paul R. Sackett published a methodological critique in 2014, arguing that some of the apparent skew in the Aguinis and O’Boyle data reflected measurement truncation rather than the true underlying distribution. The debate is unresolved. The pragmatic position - which I hold and demonstrated in Issue 3 - is that even if the true distribution is closer to normal than Paretian, the Hunter et al. standard deviation estimates make the economic case without requiring the power-law claim. The power-law finding strengthens it substantially; the Beck et al. critique qualifies the confidence with which we should hold it, but does not overturn it.

Whether the distribution is normal or Paretian, it is wide. That is the point.
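One way to see how much the shape matters at the top is to compare the share of total output produced by the top tenth of workers under each assumption. The sketch below uses the 48% figure for the normal case and a hypothetical Pareto shape parameter of 1.5 for the power-law case. That parameter is an assumption for illustration, not an estimate from any of the studies cited.

```python
from statistics import NormalDist


def top_share_normal(top_fraction, sd_over_mean):
    """Share of total output produced by the top slice of a normal distribution."""
    std_normal = NormalDist()
    cutoff = std_normal.inv_cdf(1.0 - top_fraction)
    # Mean of the top slice is mean + sd * pdf(cutoff) / top_fraction, so its
    # share of the total is top_fraction + (sd / mean) * pdf(cutoff).
    return top_fraction + sd_over_mean * std_normal.pdf(cutoff)


def top_share_pareto(top_fraction, alpha):
    """Share of total output produced by the top slice of a Pareto distribution.

    Uses the standard closed form, valid for shape alpha > 1.
    """
    return top_fraction ** (1.0 - 1.0 / alpha)


normal_share = top_share_normal(0.10, sd_over_mean=0.48)
pareto_share = top_share_pareto(0.10, alpha=1.5)  # alpha is a hypothetical choice

print(f"Top 10% share of output, normal: {normal_share:.0%}")
print(f"Top 10% share of output, Pareto: {pareto_share:.0%}")
```

Even in the normal case the top tenth produces well over a tenth of total output; under a heavy-tailed distribution the concentration is much greater.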

Three things drive the gap

The personnel economics literature identifies three distinct mechanisms that produce the observed dispersion. They are separable in theory and entangled in practice.

The first is ability. General cognitive ability - what psychologists call g, or general mental ability - is the strongest single predictor of performance in complex work. Frank L. Schmidt and John E. Hunter’s 1998 meta-analysis in Psychological Bulletin, drawing on 85 years of selection research, estimated the operational validity of GMA at r = 0.51 for complex jobs. Paul R. Sackett, Charlene Zhang, Christopher M. Berry, and Filip Lievens revised these estimates downward somewhat in their 2022 Journal of Applied Psychology reanalysis, after correcting for indirect range restriction, but the rank order held. GMA remains the best single predictor. The relevant implication for the dispersion question: individual differences in GMA are large, stable across careers, and directly implicated in performance differences in cognitively demanding work.

The Big Five personality trait of conscientiousness adds further explanatory power - particularly in combination with GMA, where the two traits interact. A high-ability, high-conscientiousness worker does not merely start with advantages; they compound them through sustained investment in their own skills.

The second is effort, shaped by incentive design. Edward Lazear’s research programme in personnel economics established that compensation structure directly affects the distribution of output - not merely the mean but the variance. His 2000 American Economic Review study of Safelite Glass, which shifted its windshield installers from hourly wages to piece rates, found a 44% increase in average productivity. Crucially, he decomposed this gain into two components of roughly equal size: an incentive effect (existing workers worked harder under output-linked pay) and a sorting effect (high-ability workers who had previously been earning below their marginal product stayed or were attracted; low-ability workers who could not hit the target income left). The distribution did not just shift upward - it also widened and then re-stratified as the workforce composition changed. The implication is that a substantial portion of the within-organisation productivity variance in any firm reflects its compensation structure, not just its hiring decisions.

The third is match quality. Boyan Jovanovic’s 1979 Journal of Political Economy paper, “Job Matching and the Theory of Turnover,” established the insight that a worker’s productivity in a given job is not fully known at hire - it is revealed over time as both worker and firm learn about the quality of the match. Jovanovic modelled the resulting equilibrium: workers leave when the revealed match quality falls below their outside option (the best alternative available to them elsewhere in the labour market), and the labour market gradually sorts people toward their best-fit roles.

Some of the productivity dispersion we observe in a cross-section within a job category is not about differences in raw ability or effort - it is about imperfect sorting. Some workers are better matched to this type of work in this type of organisation at this moment than others, and the market has not yet fully reallocated them. Match quality is a mechanism that selection and people analytics can address directly, through better job design, clearer role specifications, and more rigorous assessment of fit - not just ability.

The popular intuition is that these gaps are mostly a matter of effort and practice - the story K. Anders Ericsson, Ralf Krampe, and Clemens Tesch-Römer’s 1993 Psychological Review paper set off, and that Malcolm Gladwell later popularised as the ten-thousand-hour rule. The subsequent meta-analytic record does not support that intuition for the settings people analytics operates in. Brooke N. Macnamara, David Z. Hambrick, and Frederick L. Oswald’s 2014 Psychological Science meta-analysis found that deliberate practice explained only about 1% of performance variance in professional and occupational domains (the figure is higher - around a quarter - in tightly structured fields such as music, as Macnamara and Maitra’s 2019 reanalysis confirmed). Whatever sustains the gap in professional work, it is not primarily accumulated practice hours.

These three mechanisms are not independent. High-ability workers who are well-matched to their role and working under incentive structures that reward output tend to pull dramatically away from the rest of the distribution over time. Which brings me to the persistence question.

Why does the gap persist?

The empirical puzzle is not simply that the gap exists. It is that it is so stable. If labour markets were efficient and employers were learning quickly, you might expect productivity differences to be transient - high performers would be promoted, poached, or given new roles, while low performers would leave or improve. What we actually observe is that productivity distributions within organisations are remarkably persistent across time. Why?

Robert K. Merton gave the foundational sociological answer in a 1968 Science paper, coining the phrase “the Matthew Effect” - drawn from the Biblical verse that to him who has, more shall be given. Merton was writing about the sociology of science, where eminent researchers receive disproportionate credit even when producing work equivalent to lesser-known colleagues. But the mechanism is general: small initial advantages compound. The worker placed in the more stimulating team in year one builds more skills. The worker given the more visible project builds more network connections. The worker with the slightly better manager receives better feedback and develops faster. Cross-sectional productivity differences that look like fixed ability differences are partly the accumulated consequence of these compounding advantages and disadvantages over careers.

Herman Aguinis and colleagues adapted this mechanism for the workplace directly in their 2016 Personnel Psychology paper, “Cumulative Advantage: Conductors and Insulators of Heavy-Tailed Productivity Distributions and Productivity Stars.” They identified the organisational features that conduct cumulative advantage - high autonomy, complex tasks, disproportionate access to resources and networks - and the features that insulate against it, including equal resource allocation and tight performance monitoring. The practical implication is that the shape of the productivity distribution is partly an organisational design choice, not just a talent endowment.

James Heckman adds a longer time horizon still. His research on human capital formation - summarised in a 2006 Science paper on investing in disadvantaged children and developed across several decades of work - shows that skill formation exhibits complementarity: skill acquired at one stage raises the productivity of investment at later stages. Early advantages are amplified by the productivity of subsequent learning; early deficits compound in the opposite direction. By the time workers enter the labour market, much of the cross-sectional variance in their human capital reflects investment differences that are twenty years old. The employer learning period that follows is not correcting a random allocation - it is uncovering the outcome of a highly unequal prior process.

The thread running through all three is that the gap is not a transient artefact of imperfect measurement that better data would wash out. It is structurally produced and structurally maintained: ability differences that are large to begin with, compounded through cumulative advantage, entrenched by organisational design choices, and rooted in human-capital investment that long predates the hire. That is why it persists - and why it is something to be managed rather than waited out.

The knowledge work problem

The Hunter et al. estimates are derived largely from jobs where output is relatively countable: units assembled, windshields installed, sales achieved. Knowledge work is different. The output of a strategy analyst, a people scientist, a software architect, or a senior HR business partner is multi-dimensional, often collaborative, frequently delayed between production and effect, and largely intangible. There is no validated universal measure of knowledge worker productivity - Peter Drucker called this “the great management challenge of the 21st century” in 1999, and the characterisation remains accurate.

This does not make the SDy estimates irrelevant to knowledge work. It makes the uncertainty around them explicit. The 48% figure for high-complexity jobs is probably the right order of magnitude - Chad Syverson’s 2011 Journal of Economic Literature review of firm-level productivity dispersion found that companies at the 90th percentile of within-industry productivity produce two to four times more output per unit of input than those at the 10th percentile, and the same general scale of dispersion appears across individual and firm-level analyses. The consistency is more convincing than any single dataset.

The honest caveat is that the subsequent utility calculations - the Brogden-Cronbach-Gleser estimates of what better selection is worth - depend on being able to estimate SDy for the specific role in question. For knowledge work, this is harder than the neat benchmarks suggest. The 48% figure is a starting point for reasoning, not a number to plug directly into a spreadsheet. I will return to this challenge in more depth in Issue 12 when I work through the utility analysis for the full recruitment series.

Who captures the value?

Here is the economic argument that the SDy figures imply. The mechanism itself - that high-ability workers are underpaid early in their tenure while the firm captures the surplus - is the central prediction of the employer-learning literature. What I have not seen stated quite this way is the synthesis: tying the size of that surplus to SDy, and showing how the headhunter channel that would otherwise correct it is selectively unavailable to exactly the workers it most concerns.

At the point of hire, employers cannot observe a candidate’s true productivity. They set wages based on what they can observe: credentials, prior job titles, interview performance, references. These are imperfect proxies for true ability. The result is that wages at entry are compressed relative to the true productivity distribution - set more or less as if ability were normally distributed around the market mean, because that is the rational thing for an uncertain employer to do.

High-ability workers are therefore paid below their marginal revenue product at hire. Low-ability workers are paid above theirs. The surplus from the high performers accrues to the firm during the period before true productivity is revealed. This is not a small effect.
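One way to see why entry wages are compressed is the textbook signal-extraction formula: an employer who sees only a noisy signal rationally shrinks its estimate toward the market mean. The sketch below is a generic illustration under that assumption, not a model taken from the papers cited here. The `entry_wage` function, the `reliability` parameter, and all the numbers are hypothetical.

```python
def entry_wage(signal, market_mean, reliability):
    """Expected productivity given a noisy signal (standard shrinkage formula).

    reliability = var(ability) / (var(ability) + var(noise)), between 0 and 1.
    A reliability of 1 means the signal is perfect; 0 means it is uninformative.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return market_mean + reliability * (signal - market_mean)


# Hypothetical numbers, in arbitrary productivity units.
market_mean = 100.0
reliability = 0.25  # assumed: a quarter of the variance in hiring signals reflects true ability

strong = entry_wage(signal=140.0, market_mean=market_mean, reliability=reliability)
weak = entry_wage(signal=60.0, market_mean=market_mean, reliability=reliability)
print(strong, weak)  # an 80-point gap in signals becomes a 20-point gap in wages
```

The lower the reliability of what can be observed at hire, the more both wages collapse toward the mean, and the larger the gap between what a strong hire produces and what they are paid.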

The evidence for this mechanism comes directly from Joseph Altonji and Charles Pierret’s 2001 Quarterly Journal of Economics paper, “Employer Learning and Statistical Discrimination.” Using longitudinal data from the NLSY linked to AFQT cognitive test scores - a measure of true ability that employers generally do not have access to at hire - they showed that the wage return to true ability rises with labour market experience, while the wage return to observable credentials (particularly education) falls. Employers initially use credentials as ability proxies and gradually update toward observed performance. This is symmetric learning: the worker’s current employer and the wider market learn at roughly the same rate. The harder and more consequential case is asymmetric learning, formalised by Bruce Greenwald (1986) and tested by Uta Schönberg (2007) and Lisa Kahn (2013): the incumbent employer learns about a worker faster than the outside market can, and that private information is precisely what makes outside bids cautious. The evidence here is mixed - Schönberg finds learning is mostly symmetric except among college graduates - but that exception is exactly the knowledge-work population this article concerns, which is where the asymmetry, and therefore the trapped surplus, would be expected to concentrate. Hani Mansour’s 2012 extension in the Journal of Labor Economics found that the learning rate varies significantly by occupation: it is slower in occupations where productivity is harder to observe, which maps precisely onto the knowledge work distinction.

During the employer learning period, the gap between an individual’s marginal revenue product and their wage is the information surplus that accrues to the firm. The SDy figure gives a rough upper bound on the size of that surplus for a given role. For a high-complexity knowledge job at £80,000 average salary, the maximum annual surplus from a worker at the 84th percentile relative to the median is roughly £38,400 - before any correction for the gradual learning process that erodes it.
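The arithmetic behind that upper bound, and the effect of the learning correction, can be sketched as follows. The 48% SDy fraction and the £80,000 salary come from the text, and the 84th percentile is treated as one standard deviation above the median. The linear erosion over a five-year learning period is purely an illustrative assumption, as are the function names.

```python
def max_annual_surplus(avg_salary, sdy_fraction=0.48, z_score=1.0):
    """Upper bound: output above the median, assuming pay stays at the median wage."""
    return avg_salary * sdy_fraction * z_score


def surplus_path(upper_bound, learning_years):
    """Yearly surplus if the wage gap closes linearly over learning_years.

    Linear erosion is an illustrative assumption, not an empirical estimate.
    """
    return [upper_bound * (learning_years - t) / learning_years
            for t in range(learning_years)]


upper = max_annual_surplus(80_000)            # roughly 38,400 for a +1 SD worker
path = surplus_path(upper, learning_years=5)  # hypothetical five-year learning period
print(round(upper), [round(x) for x in path], round(sum(path)))
```

Under these assumptions the cumulative surplus is about three years' worth of the upper bound rather than five. A slower learning rate, which is what Mansour's occupational results suggest for hard-to-observe work, would push the total back toward the ceiling.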

In observable-output roles - sales, production, certain professional services - compensation design typically addresses this problem through performance-contingent pay. Lazear’s Safelite analysis identified both mechanisms at work: the bonus allows high performers to capture their surplus immediately rather than waiting for the employer learning process, and it simultaneously acts as a sorting mechanism, encouraging low performers who cannot earn the target income to self-select out. The sorting effect was as large as the incentive effect. Lazear and Sherwin Rosen formalised this logic more generally in their 1981 Journal of Political Economy paper on rank-order tournaments: compensation structures that tie pay to output rank both induce effort and reveal ability, two functions neither of which a flat pay structure performs.

Knowledge work mostly lacks this mechanism. Fixed salaries, annual pay rounds calibrated to job level rather than output, and a general reluctance to introduce strongly performance-contingent pay in cognitive roles mean that the information surplus in knowledge roles is both larger - because SDy is higher - and more persistent - because the learning period is longer and the performance pay sorting mechanism is absent.

The two routes to market - and why one is blocked

You might expect competition to erode this surplus. If firms are consistently capturing a rent from their high-ability workers, those workers will eventually find a way to signal their true value, other employers will bid for them, and wages will converge toward true productivity. In practice, this correction is real but slow and uneven.

The labour market offers two routes by which a worker might extract their surplus.

The first is worker-initiated search: the employee decides to look for another job and enters the market. The problem here is the one Bruce Greenwald identified in his 1986 Review of Economic Studies paper on adverse selection in the labour market. When current employers have better information about worker quality than potential alternative employers, a worker who self-nominates as available is treated as a suspect signal. Outside employers cannot distinguish someone who is leaving because they are under-compensated from someone who is leaving because they were a poor fit. The adverse selection discount limits the outside bid. The Greenwald mechanism is self-defeating for the very workers the argument concerns: the high performers who are earning the least relative to their true productivity are penalised in the market precisely because other buyers cannot observe what makes them valuable.

The second is firm-initiated approach via executive search. Here the economics is quite different. The outside firm independently identifies the worker as high-quality - typically through an executive search firm - and approaches them without the worker self-nominating. The Greenwald adverse selection discount does not apply, because the worker is not seeking to move: the outside firm is seeking to hire them. When an outside bid arrives by this route, the current employer can counter-offer. This is the bidding mechanism that Fabien Postel-Vinay and Jean-Marc Robin modelled in their 2002 Econometrica paper: firms Bertrand-compete for the worker, bidding wages toward the worker’s true marginal revenue product. Alexey Gorn’s 2021 Review of Economic Dynamics paper quantified how much of the top of the wage distribution is produced by this mechanism: he attributes 35% of the increase in the US top 1% wage share from 1970 to 2010 to the headhunter channel. Executive search specifically generates the fat right tail of the wage distribution - high-ability workers accumulate successive outside bids that trigger bidding wars.

The important nuance is that this second route is selectively available - and the selectivity is exactly wrong for the workers my hypothesis concerns. Executive search firms target workers who are already externally visible: those who have been formally promoted, who work in client-facing or public-output roles, who have strong professional networks, who have measurable outputs. Michael Waldman’s 1984 RAND Journal of Economics paper on job assignments and signalling showed that promotions serve as ability signals to outside employers - promoted workers attract outside bids precisely because outside firms treat promotion as evidence of high ability. Waldman also observed a perverse consequence: firms sometimes withhold deserved promotions to avoid triggering the outside bids that might cost them the worker.

The workers whose productivity surplus is most compressed are also the workers least visible to headhunters. If the current employer cannot observe your performance clearly because it is multi-dimensional, collaborative, and delayed, then neither can an executive search firm. You will not be approached. The Greenwald discount applies if you try to move yourself. The sequential auction correction does not reach you. The surplus persists - not because the market is failing, but because the information asymmetry that drives the surplus also blocks the mechanism that would correct it.

This effect is most acute in knowledge roles with flat hierarchies or slow promotion cycles. The worker who would be promoted - who would receive the Waldman signal that attracts outside interest - never gets the signal because there is no promotion to give. The information surplus is structurally entrenched.

What AI is changing

There is one significant recent development that complicates this picture: the arrival of LLM-based tools that can be deployed at scale across knowledge work. Three well-designed experiments are worth examining together.

Erik Brynjolfsson, Danielle Li, and Lindsey Raymond studied 5,179 customer support agents following the introduction of a generative AI conversational assistant, publishing their findings in the Quarterly Journal of Economics in 2025. Average productivity rose 14%. But the distribution of gains was anything but average: novice and low-skilled workers improved by 34%, while experienced and highly skilled workers showed minimal improvement - and some experienced small quality declines. The AI was diffusing the best practices of senior workers to junior ones, compressing the experience curve.

Shakked Noy and Whitney Zhang published a preregistered experiment in Science in 2023, studying 444 college-educated professionals on writing tasks. ChatGPT reduced task completion time by 40% and raised output quality by 18%. Inequality between workers - their phrase - decreased. Low-ability workers benefited proportionally more. The mechanism was substitution: the AI replaced rough-drafting effort, which is the stage where low-ability workers struggle most, and shifted the work toward idea generation and editing, where the ability differential is smaller.

In a field experiment with 758 BCG consultants, Fabrizio Dell’Acqua and colleagues (published in Organization Science in 2026) found that workers below the average performance threshold improved by 43% on AI-supported tasks, while above-average performers improved by 17%. The distribution compressed. There was an additional finding worth noting: elite consultants using GPT-4 produced less variable ideas than those working without AI. The AI compressed not just performance levels but the range of approaches - a short-run gain in average quality at some cost to the diversity of thinking that drives long-run innovation.

The pattern across these studies points to the same mechanism, one that should feel familiar from an earlier part of this article. The AI is functioning as a scalable tutor - precisely the mechanism that Benjamin Bloom identified in his 1984 Educational Researcher paper on the two-sigma problem. Bloom reported that one-to-one tutoring with mastery learning raised average performance by two standard deviations relative to conventional classroom instruction - a striking figure that has proved hard to replicate at full magnitude, but whose direction is not in dispute. The AI is doing something analogous: making expert knowledge accessible to less experienced workers, compressing the lower end of the productivity distribution by raising the floor.

This is a meaningful and well-evidenced finding for knowledge work. It suggests that the SDy gap - at least at current AI capability levels - is partially compressible through tool deployment. The implication for the argument about firm-level information surplus is interesting: if the AI raises the productivity of lower-ability workers substantially while doing relatively little for top performers, the gap between high performers’ wages and their marginal revenue product narrows from the other end. The firm’s information surplus from top performers becomes less valuable not because those workers get paid more, but because the workers around them become more productive.

A caveat is important here, and it is partly a caveat about evidence. Francesco Bisardi’s 2025 paper, “The LLM Productivity Cliff,” is not a controlled experiment in the way the three studies above are - it is a non-peer-reviewed synthesis by an independent researcher, compiling 2025 results across software, support, and labour markets. Read as a hypothesis rather than a finding, its argument is worth taking seriously: that current experiments capture only early-phase adoption, and that workers who cross a threshold he calls architectural literacy - decomposing problems for models, orchestrating multi-step workflows, binding models to tools and data, validating outputs systematically - pull away from those who merely prompt. If that is right, the compression we see now is the first act. The interesting implication for this article is sharper than “the gap might return.” Architectural literacy would be a new high-complexity skill - which means it would carry its own wide SDy. The AI story would then not be the flattening of the old distribution but the manufacture of a new one, on top of it, with the same Hunter–Schmidt logic applying to a capability that did not exist five years ago. Whether that new dispersion gets captured in wages or in firm surplus is exactly the open question this article has been circling.

What this means

Three implications follow from the evidence, each pointing in a different direction from where most HR investment currently sits.

The first concerns selection. If the standard deviation of output in a high-complexity knowledge role is 48% of salary, then the expected return from improving selection validity is large - predictably, calculably large, in a way that most HR leaders have not communicated to their finance directors. The Brogden-Cronbach-Gleser utility formula converts any improvement in predictive validity directly into expected output gain, scaled by SDy. A validity improvement from 0.38 to 0.51 - the gap between an unstructured interview and a structured one, from Frank L. Schmidt and John E. Hunter’s 1998 meta-analysis - produces an expected annual gain of roughly 0.13 × SDy per hire, multiplied across the tenure of the hire. For a high-complexity role at £80,000 average salary, this is approximately £5,000 per hire per year. Over a five-year tenure, over fifty hires, the cumulative value of the selection improvement is around £1.25 million. These numbers are estimates, not guarantees - but they are estimates grounded in the best available meta-analytic evidence, not consultant projections.
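The selection arithmetic can be reproduced in a few lines. This is a simplified sketch of the Brogden-Cronbach-Gleser calculation: the full formula also multiplies by the mean standardised predictor score of those hired and subtracts the extra assessment cost. The figures in the text appear to correspond to setting the first of those to 1 and the second to 0, and both defaults below are labelled assumptions rather than estimates. The function and parameter names are illustrative.

```python
def selection_utility_gain(n_hires, tenure_years, validity_old, validity_new,
                           sdy, mean_z_selected=1.0, extra_cost_per_hire=0.0):
    """Simplified Brogden-Cronbach-Gleser gain from a more valid selection method.

    mean_z_selected: average standardised predictor score of those hired
        (assumed 1.0 here to match the article's per-hire figure).
    extra_cost_per_hire: additional assessment cost (assumed 0.0 here).
    """
    delta_validity = validity_new - validity_old
    gain_per_hire = tenure_years * delta_validity * sdy * mean_z_selected
    return n_hires * (gain_per_hire - extra_cost_per_hire)


sdy = 0.48 * 80_000  # SDy for a high-complexity role at the article's salary
per_hire_per_year = selection_utility_gain(1, 1, 0.38, 0.51, sdy)
total = selection_utility_gain(50, 5, 0.38, 0.51, sdy)
print(round(per_hire_per_year), round(total))  # roughly 5,000 and 1.25 million
```

The result is sensitive to the selection ratio: a less selective employer has a lower `mean_z_selected`, and the gain scales down in proportion, which is one reason to present the figure as a range rather than a point estimate.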

The second concerns development. The Bloom evidence, and its AI-mediated analogue in the Brynjolfsson et al. findings, suggests that a substantial portion of the productivity gap reflects developmental environment rather than fixed ability. Workers with access to better feedback, better managers, better knowledge transfer, and - increasingly - better AI tools are not simply revealing a pre-existing advantage; they are accumulating one. The current literature that I’ve seen has focused heavily on the selection problem and given relatively little analytical attention to the post-hire developmental problem. The evidence from education and from the AI experiments suggests that these two levers are comparably powerful, and that the current balance of analytical investment is wrong.

The third concerns compensation design. Lazear’s sorting evidence is underused in knowledge work. The absence of strongly performance-contingent pay in most knowledge roles is usually justified on the grounds that outputs are hard to measure - which is true. But the consequence of that absence is a system in which high performers are systematically underpaid relative to their marginal revenue product, low performers are overpaid, and the firm captures the surplus from the former during the employer learning period. This is not neutral. It is a specific economic arrangement that benefits the firm at the expense of its most productive workers, and that is sustained by the same information asymmetry that makes the measurement problem hard in the first place. There is no simple fix - you cannot pay knowledge workers on piece rate when the “piece” cannot be reliably counted. But the design space between fully flat pay and fully output-linked pay is large, and most organisations have not seriously explored it.

The productivity gap between workers is not a curiosity in the tail of an I/O psychology dataset. It is the central economic fact of people management. Who you hire, how you develop them, and how you structure their pay are all decisions made in the shadow of a distribution that is wider than most HR leaders have ever been asked to account for.

The next two articles in this series examine the information problem at the heart of recruitment: why bilateral ignorance between candidates and employers makes every hire an investment under radical uncertainty, and what the validity evidence tells us about which tools for resolving that uncertainty actually work. The SDy figures are the stakes. The information economics of hiring is the game being played for them.

Issue 3 covered the performance distribution shape in more depth and included a simulation of the effect of measurement noise on the observed distribution. If you missed it, it is worth reading alongside this one.

Next up - Issue 6: The Information Problem - why recruitment is a bilateral asymmetry problem and what information economics says about it.

Sources

Hunter, Schmidt & Judiesch (1990) — Individual differences in output variability as a function of job complexity. Journal of Applied Psychology
O’Boyle & Aguinis (2012) — The Best and the Rest: Revisiting the Norm of Normality of Individual Performance. Personnel Psychology
Beck, Beatty & Sackett (2014) — On the Distribution of Job Performance. Personnel Psychology
Schmidt & Hunter (1998) — The Validity and Utility of Selection Methods in Personnel Psychology. Psychological Bulletin
Sackett, Zhang, Berry & Lievens (2022) — Revisiting Meta-Analytic Estimates of Validity in Personnel Selection. Journal of Applied Psychology
Lazear (2000) — Performance Pay and Productivity. American Economic Review
Jovanovic (1979) — Job Matching and the Theory of Turnover. Journal of Political Economy
Merton (1968) — The Matthew Effect in Science. Science
Aguinis, O’Boyle, Gonzalez-Mulé & Joo (2016) — Cumulative Advantage: Conductors and Insulators of Heavy-Tailed Productivity Distributions and Productivity Stars. Personnel Psychology
Heckman (2006) — Skill Formation and the Economics of Investing in Disadvantaged Children. Science
Ericsson, Krampe & Tesch-Römer (1993) — The Role of Deliberate Practice in the Acquisition of Expert Performance. Psychological Review
Macnamara, Hambrick & Oswald (2014) — Deliberate Practice and Performance in Music, Games, Sports, Education, and Professions: A Meta-Analysis. Psychological Science. (Source of the <1% figure for professions.)
Macnamara & Maitra (2019) — The Role of Deliberate Practice in Expert Performance: Revisiting Ericsson et al. (1993). Royal Society Open Science. (Music-specific reanalysis; ~25% figure.)
Syverson (2011) — What Determines Productivity?. Journal of Economic Literature
Altonji & Pierret (2001) — Employer Learning and Statistical Discrimination. Quarterly Journal of Economics
Mansour (2012) — Does Employer Learning Vary by Occupation?. Journal of Labor Economics
Greenwald (1986) — Adverse Selection in the Labour Market. Review of Economic Studies
Schönberg (2007) — Testing for Asymmetric Employer Learning. Journal of Labor Economics
Kahn (2013) — Asymmetric Information Between Employers. American Economic Journal: Applied Economics
Postel-Vinay & Robin (2002) — Equilibrium Wage Dispersion with Worker and Employer Heterogeneity. Econometrica
Gorn (2021) — The Role of Headhunters in Wage Inequality. Review of Economic Dynamics
Waldman (1984) — Job Assignments, Signalling, and Efficiency. RAND Journal of Economics
Lazear & Rosen (1981) — Rank-Order Tournaments as Optimum Labor Contracts. Journal of Political Economy
Brynjolfsson, Li & Raymond (2025) — Generative AI at Work. Quarterly Journal of Economics
Noy & Zhang (2023) — Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence. Science
Dell’Acqua et al. (2026) — Navigating the Jagged Technological Frontier. Organization Science
Bloom (1984) — The 2 Sigma Problem. Educational Researcher
Bisardi (2025) — The LLM Productivity Cliff. Preprint (not peer-reviewed).

The cost of silence: a hard-headed case for psychological safety

Andrew Marritt — Wed, 20 May 2026 04:31:01 GMT

Amy Edmondson, the Harvard Business School professor whose 1999 paper established psychological safety as a serious research subject, has spent the better part of two decades describing a particular reaction she encounters when she introduces the concept to senior managers. A version of it will be familiar to anyone who has sat through a corporate leadership development programme: the arms fold, the expression closes, and someone - usually someone whose self-image is of a demanding, performance-focused operator - says some variant of “isn’t this just an excuse for people not to be held accountable?”

I have heard the same thing. The response is understandable. The term “psychological safety” sounds, to a certain kind of ear, like the soft end of psychology - a desire to make people feel comfortable, to protect feelings at the expense of standards, to elevate niceness over results.

A McKinsey Organisation Practice survey conducted in May 2020 asked 1,574 employees about their team leaders’ behaviours; 1,223 of those respondents were team members rather than leaders themselves. It found that only 26% of leaders regularly demonstrate the behaviours that create psychological safety (PS) in their teams, and that authoritative leadership - the style most associated with the hard-driving performance culture - was specifically detrimental to it. Only 43% of employees reported a positive team climate, which the research identifies as the single strongest predictor of whether PS exists. One reading of these numbers is that most managers lack the skills or training to create PS. My reading is different: most managers have not been given a reason to believe it is worth their effort. The concept has been framed as a cultural aspiration rather than an economic problem, and the framing has undermined the case.

This article is an attempt to make the economic case. The argument is that psychological safety is not a culture deficit. It is a monitoring problem with a predictable cost structure - and that cost is substantially larger, and more systematically underestimated, than the standard account allows.

Two concepts that have drifted from their original meanings

I want to open with a parallel that I think is instructive.

Utility - the building block of every economic model of why people do what they do - is what economists are actually referring to when they talk about incentives: change what someone values, or change what they expect an action to cost them, and you change what they do. The concept originates with Jeremy Bentham. In his 1789 Introduction to the Principles of Morals and Legislation, Bentham listed 14 simple pleasures - among them social recognition, good name, amity, skill, and power over one’s own situation - and 12 corresponding pains. Utility, for Bentham, was roughly what we might today call wellbeing broadly defined. It was the later marginalist mathematicians - Jevons, Walras, Menger in the 1870s, then Pareto and Samuelson through the early twentieth century - who reduced it to something measurable in money. The stripping-out was a move of convenience, not accuracy. The richer account of human motivation was left behind because it was difficult to formalise, not because it was wrong. People outside the profession now typically assume that economists treat money as the only motivator.

Psychological safety has undergone a similar drift. Edmondson’s original definition was precise: PS is “a shared belief held by members of a team that the team is safe for interpersonal risk-taking.” Not safe to fail. Not comfortable. Not protected from consequences. Safe to take interpersonal risks - to offer dissenting information, flag a concern, challenge a decision - acts that expose the speaker to potential loss of status or standing in the group.

The counterintuitive finding from her original 1999 research made the meaning concrete. Edmondson was studying hospital teams and expected to find that the best-performing teams reported the fewest errors. The opposite was true. Higher-performing teams reported more errors than their lower-performing counterparts. The explanation: in psychologically safe teams, nurses were willing to surface and discuss mistakes. In less safe teams, errors existed but were suppressed - invisible to measurement, visible in outcomes. PS was functioning as a signal-amplifier for information that was already present in the system. The teams that appeared to have fewer problems had more of them; they were simply better at hiding them.

The popular understanding of PS has since drifted toward “a nice environment where people feel comfortable.” The practical consequence of that drift is the reaction Edmondson keeps encountering - and that the McKinsey numbers reflect. If the concept sounds like niceness, the hard-nosed manager is right to be suspicious of it. The original concept does not sound like niceness. It sounds like an information infrastructure problem.

The 2x2 that should have been the headline

Edmondson’s own response to the soft-equals-niceness conflation - developed in The Fearless Organization (2018) - is her two-by-two framework, which maps PS against what she calls the “drive to perform” (see diagram). The quadrant that is usually illustrated in training materials is the Learning Zone: high psychological safety and high performance standards. The quadrant that explains the hard-nosed manager’s concern is the Comfort Zone: high PS, low standards - a culture that is pleasant to work in and systematically underperforming.

These are not opposite ends of a spectrum. They are independent axes. A team can have both, either, or neither. The goal is not to choose between psychological safety and accountability. It is to understand that one without the other produces a specific, identifiable failure mode.

The Anxiety Zone - high standards, low PS - is the most common state of the teams I have worked with in large, highly competitive organisations. People know what is expected of them and are afraid to surface the reasons why it will not be delivered on time, within budget, or at the quality promised. The Anxiety Zone is not a high-performance culture. It is a culture that optimises for the appearance of performance until the gap between appearance and reality becomes too large to contain, by which time the manager hopes to be running a different team.

Why the standard account of silence is incomplete

The standard economic treatment of why people stay silent runs like this: speaking up carries career risk. The rational employee weighs expected cost against expected benefit and, when cost exceeds benefit, stays silent. To fix the problem, reduce the cost - add whistleblower protections, provide anonymity, make dissent financially safe.

This captures something real, but it frames silence exclusively as the suppression of bad news. It treats psychological safety as a mechanism for preventing harmful things from remaining hidden: the safety incident, the project that is behind schedule, the compliance failure.

This framing misses half the problem. A significant portion of the cost of low PS is not the bad things that happen because warnings were not heeded. It is the good things that never happen because improvement ideas were never offered.

When I was running OrganizationView and analysing large volumes of open-text employee feedback - typically responses to a question along the lines of “what could we do to improve working here?” - the pattern that emerged across multiple clients was consistent and initially surprising to the HR teams we were working with. Roughly half of the feedback comments concerned operational improvements. Not engagement issues, not management style, not the topics HR had expected to find at the top of the list - but process inefficiencies, resource misalignments, bottlenecks that frontline employees could see and executives could not. (Often HR teams tried to hide these comments. I’ll discuss that later.)

A screenshot from an OrganizationView Interactive Starburst employee survey analysis where operational issues accounted for about 70% of the topics. In this instance the HR team focussed on the other 30%.

The economics of these suggested improvements are structurally different from the economics of suppressed bad news. When an employee stays silent about a safety risk, the cost is borne by the firm in the form of an incident that could have been prevented. When an employee stays silent about an improvement idea, the cost is more diffuse: a potential gain that never materialises, never attributable to any individual decision. The cost of suppressed bad news is visible after the fact. The cost of suppressed good ideas is invisible by construction.

This is a public goods problem. The information an employee holds - about a better process, a customer need going unmet, a resource being wasted - has value that, if surfaced, would be captured largely by the firm and its shareholders. The employee who raises it bears the full cost of doing so (status risk, time, the possibility of being wrong in public) while receiving only a fraction of the benefit. In the absence of mechanisms that change this asymmetry, the individually rational decision is silence - even when the collective outcome of universal silence is severely suboptimal.
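The asymmetry can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than client data: `value_to_firm` is what the idea is worth if adopted, `employee_share` is the fraction of that value the employee personally captures, and `personal_cost` is the status risk and time they bear in full.

```python
def speaks_up(value_to_firm, employee_share, personal_cost):
    """True if raising the idea is rational for the individual."""
    private_benefit = employee_share * value_to_firm
    return private_benefit > personal_cost


def worth_raising(value_to_firm, personal_cost):
    """True if raising the idea is worthwhile for firm and employee combined."""
    return value_to_firm > personal_cost


# Hypothetical idea: worth 50,000 to the firm, the employee captures 2% of it,
# and speaking up costs them the equivalent of 2,000.
idea = dict(value_to_firm=50_000, employee_share=0.02, personal_cost=2_000)

individually_rational = speaks_up(**idea)
collectively_rational = worth_raising(idea["value_to_firm"], idea["personal_cost"])
print(individually_rational, collectively_rational)
```

With these made-up numbers the idea is clearly worth raising for the organisation and clearly not worth raising for the person holding it. Either lever closes the gap: raise the share of the benefit the employee sees, or lower the personal cost of speaking.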

The incentive analysis therefore needs to address two distinct questions: why do people not surface bad news, and why do people not contribute good ideas? Both are PS problems. Both have the same underlying structure of misaligned individual and collective incentives. But the second is substantially harder to see, because nobody ever knows what was not said.

What the fuller utility function looks like

The standard model focuses on career and financial risk. But the cost of speaking up is considerably broader - which is why interventions that address only career and financial risk consistently underperform.

Edward Deci and Richard Ryan’s self-determination theory, developed through the 1980s and 1990s, identifies three basic psychological needs whose satisfaction constitutes a major source of human wellbeing: autonomy (acting from one’s own values), competence (feeling effective and capable), and relatedness (meaningful connection to others). Bruno Frey and Reto Jegen, in a 2001 paper in the Journal of Economic Surveys, added the critical practical corollary: external incentives can crowd out intrinsic motivation. When an activity someone undertakes out of genuine commitment is subjected to monitoring, scoring, and formal routing systems, it is reframed - by the person themselves - as a compliance act rather than a discretionary contribution. The intrinsic motivation disappears.

This has a direct implication for voice mechanism design. An employee who raises concerns because they genuinely care about the organisation is exercising autonomous, intrinsically motivated behaviour. An organisation that responds by tracking, categorising, and routing that input through formal systems may inadvertently signal that voice is a compliance task rather than a valued contribution - crowding out the discretionary behaviour that made it valuable.

But the most significant extension of the standard utility model comes from George Akerlof and Rachel Kranton’s “Economics and Identity,” published in the Quarterly Journal of Economics in 2000 and expanded in their later book. Their argument is that people derive utility not only from consumption but from acting in accordance with their social identity - their sense of who they are in relation to a group.

The formal structure is straightforward. Each person occupies a social category - team member, professional, insider - and that category carries prescribed behaviours. Conforming to the norms of that category generates identity utility. Deviating generates identity anxiety: a utility loss distinct from any financial consequence, enforced by the group through exclusion, disapproval, and the withdrawal of respect.

In a team where the implicit norm is “support the manager’s position,” dissent incurs identity costs - it marks the speaker as someone who doesn’t fit, who has broken the unwritten rules of group membership - even if the career consequences are zero. Akerlof and Kranton also show that identity costs are largest for those whose sense of belonging is most fragile: new joiners, relative outsiders, junior staff. Voice behaviour will be lowest in exactly the populations where speaking up is potentially most valuable.
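A crude sketch shows how an identity term changes the voice decision even when career risk is zero. The linear form of `identity_cost` is my own simplifying assumption, not the Akerlof-Kranton specification, and the parameter names and values are hypothetical.

```python
def net_utility_of_voice(private_benefit, career_cost, norm_strength, belonging_security):
    """Net utility of dissenting, in arbitrary units.

    identity_cost is an assumed functional form: it grows with the strength
    of the "support the manager" norm and shrinks as the speaker's sense of
    belonging (0 = fragile, 1 = fully secure) becomes more secure.
    """
    identity_cost = norm_strength * (1.0 - belonging_security)
    return private_benefit - career_cost - identity_cost


# Same idea, same team norm, zero career risk. Only belonging differs.
veteran = net_utility_of_voice(1.0, 0.0, norm_strength=3.0, belonging_security=0.9)
new_joiner = net_utility_of_voice(1.0, 0.0, norm_strength=3.0, belonging_security=0.3)
print(veteran > 0, new_joiner > 0)
```

Under these assumptions the veteran speaks and the new joiner stays silent, although neither faces any career penalty. Removing career risk alone leaves the second decision unchanged.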

A manager who responds to a dissenting view by implicitly marking the speaker as “not a team player” is not merely applying a career sanction. They are threatening the speaker’s social category. That is a more powerful and more lasting intervention than a bad performance review, and it operates without the manager necessarily being aware of it.

The manager problem is a principal-agent problem

The firm wants information to surface. The manager whose decision is being questioned often does not. This is not a character flaw. It is the predictable consequence of misaligned incentives - what Michael Jensen and William Meckling formalised in 1976 as the core of the principal-agent problem.

Bengt Holmström’s 1979 analysis of moral hazard adds the key practical complication: when the principal cannot observe the agent’s behaviour, the agent can underperform without detection. Suppressing bad news is an unobservable act. The manager who discourages dissent - through tone, through the implicit signals that Detert and Edmondson document in their 2011 paper on “implicit voice theories” - does not do so in writing. The firm cannot easily measure the silence that results.

Holmström and Paul Milgrom’s 1991 multitask analysis completes the picture. When managers are evaluated primarily on measurable team outputs, they will rationally underinvest in the harder-to-measure task of creating an environment where information travels upward. The outcome-focused incentive structure most organisations apply to managers is, by its design, structurally hostile to psychological safety. No individual manager needs to intend this.

This is probably the more charitable reading of the McKinsey finding. The 74% of leaders who do not consistently demonstrate PS-fostering behaviours are not simply lacking skills or training - they are responding rationally to an incentive structure that does not reward those behaviours. The question the McKinsey data raises is not “how do we train managers better?” It is “what are we evaluating managers on that makes creating PS not worth their effort?”

Silence routes into exit - and what exits with it

Albert Hirschman’s 1970 framework in Exit, Voice, and Loyalty offers the clearest account of the organisational consequences of suppressed voice. When quality deteriorates, members have two responses: exit (leave) or voice (articulate dissatisfaction in the hope of change). Which response is chosen depends largely on loyalty - attachment to the organisation.

Suppressing voice does not eliminate the problem it was signalling. It routes the problem into exit. The firm loses the information twice: once when the employee stays silent, and again when they leave and take what they know with them.

The loyalty dimension produces a specific adverse selection dynamic. Employees most willing to persist with voice attempts despite the costs are those most deeply attached to the organisation - the long-tenured, the highly engaged, those with the strongest sense of the firm’s potential. Employees with high ability and good external options exit quickly when voice is suppressed. The firm systematically retains those with the fewest perceived alternatives.

Hirschman’s framework also illuminates what exits along with those people. The employees most likely to persist with voice - the loyal, the engaged, the long-tenured - are also those with the deepest operational knowledge. They have observed what works and what does not across multiple managers, multiple cycles, multiple initiatives. Their voice, when they choose to use it, is institutional intelligence.

The unconditional question - “what could we do better?” - directed at everyone, including those most satisfied, produces a richer data set than the conditional design used in Net Promoter Score surveys, where “why did you give us this score?” signals to the promoter that the organisation wants to understand their positive experience rather than receive improvement ideas. The most valuable operational intelligence often comes from employees who are engaged enough to have thought carefully about what could work better - not from those who are dissatisfied. Conditional questioning by satisfaction level systematically under-samples exactly this group.

The investment question

Reframing PS as an incentive problem rather than a culture problem changes what investment in it looks like.

The costs of low PS are not uniformly distributed. They are highest where the value of information is highest: in high-stakes decision environments, in roles where the individual holds knowledge the organisation cannot easily observe, in teams working on non-routine problems where error is costly and course-correction is time-sensitive. The decision to invest in PS should be conditioned on the cost of silence in a specific context - not applied uniformly as a culture initiative.

High-cost-of-silence environments warrant structural investment: pre-mortem processes that institutionalise dissent so it is no longer a personal act; manager accountability metrics that include information quality rather than just output quality; separation of information-surfacing contexts from evaluation contexts. These change the incentive structure. Training managers in vulnerability changes their attitudes while leaving their incentives intact.

This connects back to what I argued in Issue 1 of this newsletter. The Decision Quality chain requires meaningful information as its third link. If people who hold relevant information are staying silent because the incentive structure makes silence rational, all the investment in analytical quality downstream of that link is building on a foundation that isn’t there.

Psychological safety is not the soft alternative to accountability. It is the upstream condition for the information quality that makes accountability meaningful. An organisation that measures performance against targets but suppresses the information that would reveal which targets are set wrong, which processes are failing, and which decisions are heading in the wrong direction is not running a high-accountability culture. It is running a high-compliance culture that mistakes the appearance of performance for its substance.

The hard-headed case is this: the cost of silence is large, it is systematically underestimated because it is unobservable, it includes both the bad things that happen and the good things that never do, and it routes into exit in a way that selectively removes the people most worth keeping. Fixing it requires structural changes to incentives. And the first step is recognising that it is an economics problem, not a feelings one.

A postscript: when the listening function doesn’t listen

I said earlier that in my experience HR teams often tried to suppress the operational improvement comments from employee surveys. I want to return to that, because it is not a minor observation. It is an illustration of the same dynamic playing out at a different level of the organisation.

The setup is this: an organisation commissions an employee survey. The question is designed to be open, to invite all kinds of feedback. Employees respond. A significant portion of what they say concerns operational problems - process failures, resource waste, things their managers could fix if they knew about them. The HR team receives the full dataset. And then, in a significant number of cases, those operational comments are quietly set aside. The report that reaches executives covers engagement, wellbeing, and manager effectiveness. The operational content - which represents roughly half the signal the employees sent - disappears.

I am not describing a conspiracy. The HR team members involved were not acting in bad faith. They were responding rationally to the incentive structure they were operating within.

Consider what the Holmström multitask analysis predicts for an employee listening team. They are evaluated on delivering a clean, credible, on-time report covering the topics their function owns: engagement scores, eNPS, action plan commitments. Routing operational feedback to the relevant department heads is an unmeasured task. It requires relationships, influence, and the willingness to raise issues that belong to someone else’s territory. It may generate conflict. And if it goes well, the credit accrues to the department that fixes the problem, not to the HR team that surfaced it. The rational response to this incentive structure is to stay in your lane.

The Akerlof-Kranton identity analysis adds a further dimension. HR practitioners have a professional identity that is built, for understandable historical reasons, around people topics: talent, culture, performance, wellbeing. Operational feedback sits outside that identity. Surfacing it feels like overreach - and is sometimes experienced as such by the operational teams on the receiving end. The identity cost of stepping outside the HR category is real, even if the cost of silence to the organisation is larger.

The consequence completes the circuit. Employees who contributed operational ideas see no change. In terms of procedural utility, as Bruno Frey and his colleagues described it - the utility people derive from being heard, independently of whether the outcome is favourable - the experience registers as a failed exchange. Their voice reached the survey form, but it did not reach a decision-maker. The implicit voice theory - Detert and Edmondson’s term for the unconscious model each of us carries about when speaking up is worth the effort - updates accordingly. Next survey, the engagement questionnaire gets filled in. The operational idea that has been forming for six months does not get written down.

What makes this particularly pointed is that the function responsible for creating the conditions for employee voice is, in this scenario, itself a voice suppressor. Not through malice, and not in a way that is visible to anyone measuring engagement or survey participation. But through the entirely predictable operation of the same misaligned incentives, unobservable silence, and identity constraints that govern every other layer of the organisation.

The design implication is structural. Operational improvement data from employee surveys should have an explicit owner and routing mechanism that does not depend on HR discretion. The survey question should carry an implicit promise - that all categories of response will reach someone who can act on them - that the organisation is actually equipped to honour. Without that infrastructure, the unconditional question is not unconditional. It is a promise the organisation does not keep.

Working Ideas is a newsletter on People Analytics, the nature of work, and multi-disciplinary thinking by Andrew Marritt. If this piece was useful, consider sharing it with someone who dismissed psychological safety as soft.

Notes and sources

Psychological safety

Edmondson, A.C. — “Psychological Safety and Learning Behavior in Work Teams,” Administrative Science Quarterly 44(4), 1999 — the foundational paper. Free PDF via MIT.

Edmondson, A.C. — The Fearless Organization, Wiley, 2018 — source of the Learning Zone / Comfort Zone / Anxiety Zone framework.

Detert, J.R. and Edmondson, A.C. — “Implicit Voice Theories: Taken-for-Granted Rules of Self-Censorship at Work,” Academy of Management Journal 54(3), 2011 — on the unconscious beliefs that govern when people speak up. HBS faculty page; PDF available via ResearchGate.

McKinsey Organisation Practice — “Psychological Safety and the Critical Role of Leadership Development,” 2021 — source of the 26% and 43% figures. Survey conducted May 2020, n=1,574.

Utility and motivation

Bentham, J. — An Introduction to the Principles of Morals and Legislation, 1789 — free via the Library of Economics and Liberty. The 14 simple pleasures are listed in Chapter V.

Ryan, R.M. and Deci, E.L. — “Self-Determination Theory and the Facilitation of Intrinsic Motivation, Social Development, and Well-Being,” American Psychologist 55(1), 2000 — free PDF from the SDT research group.

Frey, B.S. and Jegen, R. — “Motivation Crowding Theory: A Survey of Empirical Evidence,” Journal of Economic Surveys 15(5), 2001 — working paper version free via SSRN.

Akerlof, G.A. and Kranton, R.E. — “Economics and Identity,” Quarterly Journal of Economics 115(3), 2000 — free PDF from Rachel Kranton’s Duke University page.

Agency theory

Jensen, M.C. and Meckling, W.H. — “Theory of the Firm: Managerial Behavior, Agency Costs and Ownership Structure,” Journal of Financial Economics 3(4), 1976 — free PDF via Simon Fraser University.

Holmström, B. — “Moral Hazard and Observability,” Bell Journal of Economics 10(1), 1979 — free PDF via UT Dallas.

Holmström, B. and Milgrom, P. — “Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design,” Journal of Law, Economics, and Organization 7, 1991 — free PDF via Duke University.

Voice and exit

Hirschman, A.O. — Exit, Voice, and Loyalty, Harvard University Press, 1970 — the complete book is freely available via the Internet Archive.

The performance distribution in your organisation is not the one you think it is

Andrew Marritt — Wed, 06 May 2026 04:31:02 GMT

A few weeks ago I read a Substack article by Colby Kennedy Nesbitt titled “Is Performance a Power Law?” which I recommend. She makes a careful and important distinction: the claim that job performance follows a power law - that the top 20% of employees do 80% of the work - rests on a confusion between performance outputs and the underlying capability that produces them. The outputs of performance (home runs, sales figures, publications) can follow a power law for structural reasons that have nothing to do with the shape of human capability. Count metrics are bounded at zero and unbounded above. They are shaped by exposure time, opportunity, and attribution, not just ability. When you normalise for those factors - when you move from home runs to slugging percentage, say - the distribution tightens towards something closer to normal.

“Performance is not a power law. Some performance outputs are. Confusing the two warps how we measure, reward, and develop.”
“The shape of the distribution shifts with the structure of the metric, not the nature of performance itself.”
Colby Kennedy Nesbitt, Is Performance a Power Law?

I should confess here that I know almost nothing about baseball. I encountered the game analytically, through David Robinson’s excellent book on Empirical Bayes estimation, not through watching it. Slugging percentages entered my vocabulary as illustrative examples rather than as statistics I track on weekends. This is perhaps fitting for an article about European HR practitioners importing ideas from American I/O psychology with imperfect understanding of their original context.

Nesbitt’s argument is convincing as far as it goes. But I found myself stopping at a particular line and thinking: this is where the interesting problem begins, not where it ends. Even if we accept - as I do - that the underlying distribution of human performance capability is approximately normal, the distribution of performance we observe inside any organisation tells us almost nothing about that underlying distribution. The two are separated by a set of processes that are systematic, cumulative, and poorly understood. And worse, the effect of those processes is to make the observed distribution look more normal than it should - which means the standard assumption appears to be confirmed precisely because the measurement system is producing the artefact.

That is what this article is about.

What are we measuring? Two disciplines, two answers

I came to People Analytics as an economist, which shapes how I think about performance in ways I have only gradually become conscious of. When economists study performance, they are studying output - the value a worker contributes to the firm, net of the cost of employing them. Productivity, in the economic sense, is the marginal revenue product of labour. It is an economic construct, defined against a market, not against a behaviour taxonomy. When economists treat labour as substitutable by capital, freelancers, management consultants or, now, AI, it is this view of performance that lets them do so.

I/O psychology, which is the academic discipline that has shaped HR practice, starts from a completely different place. John Campbell, in a foundational paper in 1990, defined job performance as behaviour under an individual’s control that contributes to organisational goals - not outputs, but the behaviours that produce outputs. His taxonomy identifies eight dimensions including task proficiency, demonstrating effort, facilitating peer performance, and maintaining personal discipline. By this definition, performance is multi-dimensional, hierarchical, and largely behavioural. The downstream result of that performance - revenue, publications, home runs - is not performance; it is a consequence of performance.

This is not academic hair-splitting. The distinction matters for how organisations measure, reward, and manage. When HR collapses eight dimensions of performance into a single annual rating, it is doing something that neither the economic framework nor the I/O psychology framework endorses. The economist wants to know the marginal revenue product; the rating doesn’t tell them that. The I/O psychologist wants to measure behavioural proficiency on multiple independently weighted dimensions; the rating collapses those into a spurious single number.

George Baker, in a 1992 paper in the Journal of Political Economy, put this with characteristic precision. Any measurable proxy for performance will systematically diverge from the underlying construct the firm actually cares about. He called this distortion. If you pay mechanics based on completed repairs, you create an incentive to recommend unnecessary ones - because the thing you can count is not the same as the thing you value. Bengt Holmström and Paul Milgrom showed in 1991 that this problem is structurally insoluble when workers have multiple tasks and only some are measurable: performance pay on the measurable tasks will divert effort away from the unmeasurable ones, which may be exactly the ones that matter most.

I think about this often in relation to certain financial roles. A derivatives trader who is paid on the mark-to-market value of her book at point of sale has every incentive to maximise initial pricing at the cost of longer-term performance on the instrument. Her current performance metric is excellent. Her expected future contribution to the firm - the present discounted value of all the outcomes her decisions will eventually produce - may be negative. Performance management systems that are backward-looking are not measuring the thing that matters; they are measuring a time-limited proxy for it, one that rational agents will optimise at the expense of the real thing.

The distribution debate: what we actually know

Against this backdrop, the argument about whether performance follows a normal or power-law distribution takes on a slightly different character. The debate is real, and it matters practically. But it is also somewhat harder to resolve than either camp acknowledges.

Ernest O’Boyle and Herman Aguinis made the power-law case in a 2012 paper in Personnel Psychology, drawing on 198 samples and over 633,000 individuals across research, entertainment, politics, and sport. They found that individual performance consistently fitted a Paretian distribution better than a normal one. The implication they drew - and that subsequently escaped into HR practice as a management principle - was that the top performers are so much better than the rest that entire compensation philosophies should be reoriented around them.

James Beck, Adam Beatty, and Paul Sackett responded in 2014 with a careful methodological critique. Their key demonstration was that sampling from only the upper tail of a normal distribution - which is precisely what populations of elite researchers, athletes, and entertainers are - produces highly skewed data even when the underlying distribution is perfectly normal. The apparent power law may be a property of sample selection rather than of human capability.
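The upper-tail effect is easy to reproduce. The sketch below is my own illustration rather than Beck and colleagues’ analysis: it draws a perfectly normal population, keeps the top 10% as a stand-in for an elite sample (the 10% cut-off and the seed are arbitrary choices), and compares skewness before and after.

```python
import random
import statistics


def skewness(xs):
    """Population-form skewness: the mean cubed z-score."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


rng = random.Random(42)
population = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Keep only the top 10%: a stand-in for an "elite" sample.
cutoff = sorted(population)[int(0.9 * len(population))]
elite = [x for x in population if x >= cutoff]

skew_population = skewness(population)
skew_elite = skewness(elite)
print(f"population skew: {skew_population:.2f}, elite skew: {skew_elite:.2f}")
```

The full population shows skewness close to zero. The elite subsample is strongly right-skewed, with most people bunched just above the cut-off and a long tail beyond, even though nothing but selection has been applied.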

Nesbitt adds the metric structure argument I described in the opening. Count statistics (publications, home runs) have structural properties - bounded at zero, unbounded above, shaped by opportunity - that generate skew regardless of the underlying ability distribution. When you normalise for those structural features, the skew largely disappears.

I find both the Beck et al. and Nesbitt arguments convincing. But I want to add a third layer, which neither addresses: the sample that organisational performance management is observing is not the kind of sample from which distributional conclusions can reliably be drawn.

The sample you cannot see

Here is the problem in its simplest form. Any claim about the distribution of performance within a firm implicitly assumes that the employees you are observing constitute a meaningful sample from the population you care about. They do not. They are the survivors of at least four sequential selection processes, each applied with substantial measurement error, each systematically biasing who remains.

Stage 1: Selection at entry. The firm applies a selection procedure to an applicant pool. Even a good one - a procedure with validity around 0.40, which is high by real-world standards - leaves substantial error. The hired cohort is range-restricted: those below the selection threshold are absent. That threshold is itself applied to a noisy score, so some of the people below the threshold in terms of true capability were hired anyway, and some above it were not. From the first day of employment, the in-firm distribution is not a random draw from the labour market. But it is also not a pure ‘sampling from the upper tail’ because you have the measurement error of selection.
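A minimal sketch of Stage 1, under assumptions of my own: capability is a standardised score, the selection score correlates 0.40 with it, and the firm hires the top 20% of applicants by score (the hire rate and seed are arbitrary).

```python
import math
import random
import statistics

VALIDITY = 0.40    # correlation between selection score and true capability
HIRE_RATE = 0.20   # assumed: top 20% of applicants by score are hired

rng = random.Random(7)
applicants = []
for _ in range(100_000):
    capability = rng.gauss(0.0, 1.0)   # true, unobserved (z-score)
    noise = rng.gauss(0.0, 1.0)
    # Score built so that corr(score, capability) equals VALIDITY.
    score = VALIDITY * capability + math.sqrt(1 - VALIDITY**2) * noise
    applicants.append((score, capability))

applicants.sort(reverse=True)          # highest score first
n_hired = int(HIRE_RATE * len(applicants))
hired = [cap for _, cap in applicants[:n_hired]]

mean_hired = statistics.fmean(hired)
sd_hired = statistics.pstdev(hired)
share_below_average = sum(c < 0 for c in hired) / len(hired)
print(f"hired mean: {mean_hired:.2f} SD above applicants, "
      f"spread: {sd_hired:.2f}, "
      f"below applicant average: {share_below_average:.0%}")
```

On these assumptions the hired cohort sits roughly half a standard deviation above the applicant pool and is somewhat compressed, but a sizeable minority of hires are below the applicant average in true capability. It is a shifted, noisy sample, not a clean upper tail.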

Stage 2: Differential voluntary attrition. Workers leave at different rates correlated with their ability. High-capability employees have more outside options - their ability is visible to other employers through the labour market signalling process - and so they leave at higher rates. But there is a second, subtler mechanism at work that I have observed consistently in organisations, especially those with structured bonus programmes. Employees join with expectations about their career trajectory: not just their starting salary but the expected value of their development over time, discounted for risk. When the firm’s annual signals - the rating, the bonus allocation - fall short of those expectations, the employee updates their belief about whether this firm will help them develop as they hoped. If the gap is large enough, they start looking elsewhere.

This second mechanism has a particularly perverse interaction with measurement error. Under the first mechanism, the people who leave are genuinely high-ability - the firm loses people it wants to keep, but at least the cause is structural (the market). Under the second mechanism, the people who leave are those who received signals below their expectations. With imperfect measurement, that includes high-ability employees who were rated low through inter-rater noise. The firm does not know these were its best performers, because the rating that prompted them to leave was itself wrong. Meanwhile, lower-ability employees who were fortunate enough to receive inflated ratings stay longer than they should, because the firm’s signal confirmed their self-assessment.

Stage 3: Performance management out. Firms explicitly remove workers at the lower end of their performance distribution. Probationary dismissals, performance improvement plans, and redundancy selection all systematically truncate the lower tail. The criterion used is measured performance - which, as I will show in the simulation, carries a noise component that is roughly as large as the true signal. Some genuinely low-performing employees are removed correctly. Some higher-performing employees who received a bad rating in a bad year are removed incorrectly. The resulting sample is not “everyone above the true performance threshold.” It is everyone above a noisy, error-prone estimate of that threshold.
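A quick, self-contained sketch of this stage (separate from, and much cruder than, the simulation referred to above) shows how often a noisy rating removes the wrong people. The assumptions are mine: ratings have a reliability of 0.52, the bottom 10% by rating are removed, and for simplicity the workforce is an unselected normal population.

```python
import math
import random

RELIABILITY = 0.52   # assumed share of rating variance that is true signal
CUT = 0.10           # assumed: the bottom 10% by rating are managed out

rng = random.Random(11)
n = 100_000
staff = []
for _ in range(n):
    true_perf = rng.gauss(0.0, 1.0)
    # Classical test theory: rating = signal + independent noise, unit variance.
    rating = (math.sqrt(RELIABILITY) * true_perf
              + math.sqrt(1 - RELIABILITY) * rng.gauss(0.0, 1.0))
    staff.append((true_perf, rating))

k = int(CUT * n)
true_cutoff = sorted(t for t, _ in staff)[k]      # true bottom-decile boundary
removed = sorted(staff, key=lambda p: p[1])[:k]   # the k lowest ratings

wrongly_removed = sum(t >= true_cutoff for t, _ in removed) / k
print(f"removed but not truly in the bottom decile: {wrongly_removed:.0%}")
```

Under these assumptions a large share of the people removed were not in the true bottom decile at all, which is the sense in which the lower tail is truncated by an estimate of performance rather than by performance itself.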

Stage 4: Retention interventions. Firms apply retention tools - above-market pay, bonus guarantees, high-visibility assignments - disproportionately to employees they believe are high performers. Those beliefs are, again, based on measured performance. So the upper tail of the distribution is partially rebuilt through a process that has the same measurement error as everything else.

After five years of this, the distribution of performance ratings in a firm reflects something that is genuinely complex: the joint operation of market dynamics, firm policies, and measurement error on whatever the true underlying distribution is. James Heckman won the Nobel Prize in Economics in 2000 partly for developing formal methods to correct for this kind of endogenous sample selection. His point was that estimates from selected samples are biased in ways that are not visible from within the sample itself - you cannot identify the selection bias from the selected data alone. That applies directly to performance distributions.

The insight from psychometrics is adjacent: range restriction - the well-documented consequence of selecting a non-random subset of the population - attenuates the observed variance and distorts correlations. Corrections for indirect range restriction (where selection happens on a variable correlated with, but not identical to, the criterion of interest) are standard in meta-analyses of selection validity. They are rarely applied when interpreting performance distributions.
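To give a sense of the size of these corrections, here is the simplest member of the family: the Thorndike Case II formula for direct range restriction. It is not the indirect correction described above, which needs more inputs. The numbers in the example are hypothetical.

```python
import math


def correct_direct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case II: estimate the unrestricted correlation from the
    correlation observed in a directly range-restricted sample."""
    u = sd_unrestricted / sd_restricted
    return (r_restricted * u) / math.sqrt(1 + r_restricted**2 * (u**2 - 1))


# Hypothetical: a correlation of 0.25 observed among hires, whose predictor
# spread (SD 10) is two-thirds of the applicant pool's (SD 15).
corrected = correct_direct_range_restriction(0.25, sd_unrestricted=15.0, sd_restricted=10.0)
print(round(corrected, 3))
```

Even this mild restriction moves the estimate from 0.25 to about 0.36. The same logic runs in reverse for variance: what is observed inside the firm understates the spread in the population it was drawn from.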

Noise and bias: the measurement problem HR gets backwards

There is a further dimension to the measurement problem that I think the HR profession has consistently misunderstood, and it requires a brief detour into Daniel Kahneman, Olivier Sibony, and Cass Sunstein’s 2021 book Noise: A Flaw in Human Judgment.

The way HR talks about measurement error in performance ratings is almost entirely a conversation about bias. Leniency bias - managers who rate too generously. Severity bias - managers who rate too harshly. Halo effects, recency bias, demographic biases. Vast effort goes into calibration sessions designed to align rating distributions across managers and reduce systematic distortions. The underlying assumption is that the core problem with performance ratings is that they are systematically skewed in a predictable direction.

Kahneman, Sibony, and Sunstein argue that this focus is importantly incomplete. Bias is systematic error - it shifts ratings in a consistent direction and can, in principle, be identified and corrected. Noise is something different: it is the random, unsystematic variability in judgements that has nothing to do with the person being rated. The same manager assessing the same employee on a different day, or two different managers assessing the same employee, produce different ratings for reasons that are effectively random. Their core finding, demonstrated across multiple domains of professional judgement including medical diagnosis, legal sentencing, and credit assessment, is that noise is typically larger than bias in absolute magnitude.

For performance ratings, this matters in a specific way. Calibration sessions address bias - they bring systematic leniency and severity into alignment. They do almost nothing about noise. The manager who rates consistently harshly can be corrected. The manager who rates differently on different days, or whose assessment varies unpredictably with their mood, the order in which they reviewed employees, or the ambient light in the room, cannot be corrected without having multiple raters assess the same employee multiple times and averaging the results. That is not how annual performance reviews work.
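How much would averaging help? A minimal sketch using the Spearman-Brown formula from classical test theory, taking 0.52 as the single-rater reliability (the meta-analytic figure used in this article). It assumes raters' errors are independent and the ratings are parallel measures, which real rating panels only approximate; the function name is my own.

```python
def averaged_reliability(single_rater_reliability: float, n_raters: int) -> float:
    """Spearman-Brown formula: reliability of the mean of n parallel ratings."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


for k in (1, 2, 3, 5, 10):
    print(f"{k:>2} raters -> reliability {averaged_reliability(0.52, k):.2f}")
```

Under these assumptions reliability climbs only gradually with each added rater, which is one way to see why a single annual rating from a single manager leaves so much noise in place.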

When Viswesvaran, Ones, and Schmidt established that inter-rater reliability for performance ratings is approximately 0.52 across the meta-analytic evidence base, they were quantifying noise as much as bias. At that level of reliability, roughly half the variance in any performance rating is random error. The HR profession’s focus on bias over noise has a practical consequence: it addresses the smaller part of the measurement problem while leaving the larger part untouched. The noise is not a detail. It is half the data.

Introducing PQ: a simulation device

To explore these dynamics, I want to introduce a modelling tool. I am going to call it PQ - Performance Quotient. Like IQ, PQ is assumed to be normally distributed in the general working population: mean 100, standard deviation 15. It represents the latent capability underlying performance behaviour. I am not claiming PQ exists as a real, measurable construct. I am using it as a device - a clean, understandable baseline - to ask: what happens to a normal distribution when you run it through an organisation’s HR processes?

The device also helps me be explicit about what I am assuming. Starting with normally distributed PQ is a deliberate charitable premise: even granting the normality assumption, I want to show that what the organisation observes will not be normal, and more importantly, that the observed distribution cannot be used to infer the true one.

Before showing the results, I want to explain why I think simulation is the right tool for this problem, because I believe it is underused as a thinking practice in HR and People Analytics - and in business more broadly.

The argument I have been making in the preceding sections is hard to hold in your head all at once. Four selection stages, each with its own error rate, operating on an unknown true distribution, compounding over five years - the verbal argument can gesture at this, but it cannot make you feel the scale of the distortion. The standard intuition that “a big enough sample reveals the true distribution” fails here precisely because the sample is itself the product of the processes we are trying to understand. More data does not help; the data is the problem.

Simulation forces a different kind of clarity. You must make your assumptions explicit. You cannot write “differential attrition” as a vague phrase and then proceed; you have to specify a quit probability function, decide whether it responds to true PQ or measured PQ or the gap between career expectations and the firm’s signals, and commit to numbers. That process of commitment is epistemically valuable - it reveals where your argument depends on assumptions you have not examined, and where it holds across a wide range of plausible values.

In my experience, the standard research process in People Analytics goes: question → data collection → statistical analysis → conclusion. The simulation step - which would sit between question and data collection, asking “what would we expect to find if the mechanism we are hypothesising were actually operating?” - is almost always skipped. The result is that practitioners often lack a strong prior about what the data should look like before they examine it, which makes interpretation harder and increases the risk of finding patterns that confirm prior beliefs rather than testing them. A simulation built before looking at the data is one of the few tools that can genuinely discipline that process.

What the simulation shows

I ran a simulation in R across five annual cycles, hiring from a pool of 1,000 applicants each year at a 20% selection rate, applying a realistic selection procedure (validity 0.40), differential attrition by PQ rank, performance management of the bottom 5% annually, and targeted retention bonuses for high performers.

Figure 1 shows Stage 1: no measurement error, policies applied with perfect information apart from the selection procedure. Even under these idealised conditions, the in-firm PQ distribution diverges substantially from the population within two years. At hire, selection has already truncated the lower tail. By Year 5, the combined effect of performance management, differential attrition, and retention interventions has produced a distribution that is visibly non-normal: a steep left wall, a sharpened and compressed peak, an asymmetric shape bearing little resemblance to the underlying population. With perfect measurement, the distortion is apparent to anyone who looks. The uncomfortable implication follows in Figure 2 - with realistic measurement error, that distorted shape becomes much harder to see. I’d argue we might want a distribution like Year 5, but it is certainly not a normal distribution. It is close to, but not the same as, Beck, Beatty, and Sackett’s sample.
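For readers who want to experiment, here is a minimal Python sketch of a Stage 1 style simulation. It is not the R code behind the figures. The population parameters, applicant pool, selection rate, validity, and bottom-5% rule come from the description above; the attrition function (BASE_QUIT, QUIT_SLOPE) and the form of the retention bonus are hypothetical choices of mine, made only so the sketch runs.

```python
import numpy as np

rng = np.random.default_rng(7)

POP_MEAN, POP_SD = 100.0, 15.0
APPLICANTS, SELECTION_RATE = 1_000, 0.20
VALIDITY = 0.40                      # correlation of selection score with true PQ
YEARS = 5
BASE_QUIT, QUIT_SLOPE = 0.10, 0.10   # hypothetical attrition parameters


def hire_cohort() -> np.ndarray:
    """Return true PQ of the top 20% of applicants on a noisy selection score."""
    z = rng.standard_normal(APPLICANTS)
    noise = rng.standard_normal(APPLICANTS)
    score = VALIDITY * z + np.sqrt(1 - VALIDITY**2) * noise
    n_hired = int(APPLICANTS * SELECTION_RATE)
    hired = np.argsort(score)[-n_hired:]
    return POP_MEAN + POP_SD * z[hired]


def percentile_rank(x: np.ndarray) -> np.ndarray:
    """Rank within the current workforce, scaled to the interval [0, 1]."""
    return np.argsort(np.argsort(x)) / max(len(x) - 1, 1)


workforce = np.array([])
for year in range(1, YEARS + 1):
    workforce = np.concatenate([workforce, hire_cohort()])

    # Performance management: remove the bottom 5% (perfect measurement here).
    workforce = workforce[workforce > np.quantile(workforce, 0.05)]

    # Differential attrition (assumed form): quit probability rises with PQ rank.
    rank = percentile_rank(workforce)
    quit_prob = BASE_QUIT + QUIT_SLOPE * (rank - 0.5)
    # Retention bonus (assumed form): halve the quit probability of the top 10%.
    quit_prob = np.where(rank >= 0.90, quit_prob / 2, quit_prob)
    workforce = workforce[rng.random(len(workforce)) > quit_prob]

    print(f"year {year}: n={len(workforce)}, mean={workforce.mean():.1f}, "
          f"sd={workforce.std():.1f}, min={workforce.min():.1f}")
```

Plotting a histogram of workforce after each year is the natural next step. The point of the sketch is the commitment it forces: each policy has to become a specific rule with specific numbers before the loop will run.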

Figure 2 adds measurement error at the meta-analytic inter-rater reliability of 0.52, the figure reported by Viswesvaran, Ones, and Schmidt (1996). At this level of reliability, the standard deviation of measurement error is approximately 14.4 PQ points - almost as large as the true standard deviation of 15. Half the variance in any performance rating is noise.
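The 14.4 figure follows from the classical test theory definition of reliability as the share of observed variance that is true variance. A short sketch of the arithmetic, assuming that definition is applied to the unrestricted population with a true SD of 15:

```python
import math

true_sd = 15.0       # SD of true PQ in the population, as defined in the article
reliability = 0.52   # share of observed variance that is true variance

# reliability = true_var / (true_var + error_var), solved for error_var
true_var = true_sd ** 2
error_var = true_var * (1 - reliability) / reliability
error_sd = math.sqrt(error_var)
observed_sd = math.sqrt(true_var + error_var)

print(f"error SD:    {error_sd:.1f}")
print(f"observed SD: {observed_sd:.1f}")
```

The observed ratings are therefore considerably more spread out than true PQ, which is the raw material for the thicker tails discussed below.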

The comparison across three panels is the core of the article’s argument. The left panel shows the Stage 1 Year 5 distribution - the best possible world, with perfect measurement. The centre panel shows what Stage 2 does to the true PQ distribution: measurement error causes false exits (high-PQ employees misrated low and managed out, or misrated low and therefore disappointed in their career signal, who then leave) and false retentions (low-PQ employees misrated high who stay). Actual workforce quality is worse than it would be with better measurement.

The right panel is the one that I find most striking. It shows not the true PQ distribution but the observed performance rating distribution - what HR actually sees when it looks at its data. The noise has recreated a distribution with fatter tails than the normal. The truncations and compressions in the true distribution have been smoothed over by the random error in the ratings. The bell curve in the data is not evidence that the true distribution is normal. It may be evidence that the measurement error is large enough to obscure a distribution that is not normal.

Figure 3 separates the two attrition mechanisms. The left panel shows a world where only Mechanism A operates - high-PQ employees leave because they have better options. The right panel adds Mechanism B: employees in bonus-structured firms who receive signals below their career expectations leave regardless of their absolute performance level. The combined effect is more severe, and its character is different. Mechanism A depletes the upper tail of the true PQ distribution in a relatively predictable way. Mechanism B creates a compound adverse selection problem: the firm loses some of its best people because its own measurement told them they were not valued, while retaining mediocre employees whose inflated ratings kept them satisfied.

Figure 4 shows what happens across a range of reliability values, from 0.30 at the bottom to near-perfect measurement at the top. The two panels tell different stories about the same simulation.

The left panel shows the observed measured-PQ (mPQ) distribution - what HR actually sees in its rating data at each reliability level. The striking result is how quickly the distribution collapses toward something that looks approximately normal and then past it. At reliability of 0.52, the clean left-truncation visible in Figure 1 has largely disappeared. But this is not evidence that the organisation has become a random sample of the population. It is evidence that the cumulative noise in annual ratings has grown large enough to overwrite the signal. Five years of selection, performance management, and differential attrition have shaped the true distribution substantially - but the observed rating distribution makes it appear as if almost none of that has happened. The observed mPQ distribution at realistic reliability levels most closely resembles the at-hire distribution from Year 1, before any of those processes have run. The HR processes are doing work; the measurement system is hiding it.

There is a second, subtler effect visible in the left panel. HR does not just see a distribution that looks more normal than it should. It sees one with somewhat thicker tails - more apparent high performers and more apparent low performers than actually exist, at the expense of the proportion near the centre. This is noise doing what noise does: redistributing people away from their true score in both directions. The employee rated in the bottom 5% is not reliably a low performer. The employee rated in the top 10% is not reliably exceptional. At an inter-rater reliability of 0.52, a substantial fraction of both groups are there by measurement accident rather than genuine performance.

The right panel shows the true PQ distribution at the same reliability levels - what HR cannot see. As reliability falls, actual workforce quality deteriorates. Fewer high-PQ employees survive the combined effect of false low ratings, which push some of the best out through performance management or career disappointment, and false high ratings, which protect some of the weakest from scrutiny. The organisation believes, from its rating data, that it is managing performance actively. It is - but with a large fraction of its effort directed at the wrong people.

What follows from this

I want to be careful about what I am and am not claiming. The simulation is illustrative - it uses parameters chosen to be plausible and instructive, not to reproduce any specific organisation. Different organisations will have different selection validity, different attrition patterns, different reliability in their rating processes. The purpose of the simulation is to show the mechanism - to demonstrate that even under generous assumptions, the four-stage HR process systematically reshapes the in-firm distribution in ways that make it uninformative about the true underlying distribution.

The practical implications are uncomfortable but clear. Organisations that use forced distribution curves - requiring managers to rate a fixed percentage of employees as top, middle, and bottom performers - are imposing a distributional assumption on a sample that has already been shaped by that same assumption in previous years. The policy creates a feedback loop: managing out low performers changes the distribution, which changes who the “low performers” are in the next cycle, which changes the distribution again. This is not a bell curve naturally occurring in the workforce. It is a bell curve that the organisation’s own processes are manufacturing and then observing with satisfaction.

There is a further problem that becomes visible once you take noise seriously. The employees rated in the bottom 5% of the observed distribution are not all genuinely low performers. At inter-rater reliability of 0.52, with measurement error of roughly the same magnitude as the true signal, the bottom of the rated distribution is populated by a mixture: some genuinely low-PQ employees, and some average or above-average employees who happened to receive an unfavourable rating in that review cycle - a difficult project, a new manager, a period of personal pressure, random variation in assessor judgement. Managing out the bottom 5% does not cleanly remove the weakest performers. It imposes what is effectively a random tax on the workforce, removing people distributed across the true ability range, weighted only loosely towards the lower end. The forced distribution is not selecting against poor performance; it is partly selecting against bad luck.
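A minimal sketch of how mixed the rated bottom 5% is. For simplicity it uses the unrestricted population (true PQ with mean 100 and SD 15) and a single rating at reliability 0.52, not the in-firm distribution from the simulation; the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 200_000
reliability = 0.52

true_pq = rng.normal(100, 15, n)
error_sd = 15 * np.sqrt((1 - reliability) / reliability)
rating = true_pq + rng.normal(0, error_sd, n)

rated_bottom = rating <= np.quantile(rating, 0.05)
truly_bottom = true_pq <= np.quantile(true_pq, 0.05)

# Of those rated in the bottom 5%, how many are truly in the bottom 5%?
share_truly_bottom = (rated_bottom & truly_bottom).sum() / rated_bottom.sum()
# And how many have a true PQ at or above the population mean?
share_at_or_above_mean = (rated_bottom & (true_pq >= 100)).sum() / rated_bottom.sum()
mean_true_of_rated_bottom = true_pq[rated_bottom].mean()

print(f"rated bottom 5% who are truly bottom 5%: {share_truly_bottom:.0%}")
print(f"rated bottom 5% with true PQ >= 100:     {share_at_or_above_mean:.0%}")
print(f"mean true PQ of rated bottom 5%:         {mean_true_of_rated_bottom:.1f}")
```

Under these assumptions fewer than half of the people rated in the bottom 5% belong there on true PQ. Inside a firm, where selection has already compressed the range of true PQ while the error SD stays the same, effective reliability is lower and the mixing is stronger.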

More broadly, any claim about the shape of performance in your organisation - whether it is normal, Paretian, log-normal, or anything else - needs to account for the selection history that produced the sample before that claim can be evaluated. This is not a statistical nicety. It is the precondition for any serious analysis of performance management.

Nesbitt is right that the power law claim overreaches. Beck, Beatty, and Sackett are right that elite samples produce apparent skew regardless of the true distribution. But the deepest problem is not which distribution fits the data. It is that the data itself is the output of processes we have not modelled - and until we model them, we are reading tea leaves dressed up as statistics.

The question is not what shape performance takes. The question is what shape your hiring, retention, and management processes have created in the sample you can observe, and how far that sample is from the thing you actually care about.

References

Nesbitt, C.K. (2026). “Is Performance a Power Law?” Variance, Explained (Substack).
Campbell, J.P. (1990). “Modeling the Performance Prediction Problem in Industrial and Organizational Psychology.” In Dunnette, M.D. & Hough, L.M. (Eds.), Handbook of Industrial and Organizational Psychology (2nd ed., Vol. 1, pp. 687–732). Consulting Psychologists Press. (NOTE: I didn’t have direct access to this so used multiple descriptions to cross-check the original claims taken from secondary sources.)
O’Boyle, E. & Aguinis, H. (2012). “The Best and the Rest: Revisiting the Norm of Normality of Individual Performance.” Personnel Psychology, 65(1), 79–119.
Beck, J.W., Beatty, A.S. & Sackett, P.R. (2014). “On the Distribution of Job Performance: The Role of Measurement Characteristics in Observed Departures from Normality.” Personnel Psychology, 67(2), 531–566.
Baker, G.P. (1992). “Incentive Contracts and Performance Measurement.” Journal of Political Economy, 100(3), 598–614.
Holmström, B. & Milgrom, P. (1991). “Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design.” Journal of Law, Economics, and Organization, 7 (Special Issue), 24–52.
Viswesvaran, C., Ones, D.S. & Schmidt, F.L. (1996). “Comparative Analysis of the Reliability of Job Performance Ratings.” Journal of Applied Psychology, 81(5), 557–574.
Heckman, J.J. (1979). “Sample Selection Bias as a Specification Error.” Econometrica, 47(1), 153–161.
Kahneman, D., Sibony, O. & Sunstein, C.R. (2021). Noise: A Flaw in Human Judgment. Little, Brown Spark.

The Analyst Who Changed Nothing

Andrew Marritt — Wed, 22 Apr 2026 04:31:01 GMT

Picture this: an analyst has spent two months building a model. The analysis is solid - properly specified, cross-validated, defensible. They present to the leadership team. The model identifies the drivers of performance decline in the business unit. Clear findings. Actionable levers identified. A recommendation follows.

The leadership team thanks them. Nods appreciatively. Someone says “good work.” Then they put the analysis in a presentation deck and do what they were going to do anyway. Nothing changes.

The analyst is frustrated. The technical work was sound. But something went wrong. Not in the regression. Not in the feature engineering. Something earlier - something upstream of the analysis itself.

Here is what happened: the analyst was asked to explain why performance was declining. They accepted that frame without interrogation. But they never asked the leadership team what they were actually going to do with the answer. They never found out what variables were actionable. They never involved the decision-makers in the process of shaping the question or considering alternatives. By the time the findings landed on the table, the room had already decided whether to care about them.

The analyst had answered a question nobody had agreed to ask, with variables nobody could act on, to a room that was never committed to acting on the answer. They had addressed one link in the decision chain and left five others untouched.

This is not uncommon. In fact, it is the default pattern in most analytics teams.

The conventional approach: the analyst as information producer

The standard model of how analysis works is straightforward. A business leader or manager identifies a problem. The people analytics team is asked to investigate. The team retrieves data, builds a model, and delivers findings. If qualitative work happens at all, it is a separate workstream - conducted by a different team, reported separately, in a different meeting, to a different audience.

The implicit assumption is clear: the analyst’s job is to produce information. Someone else handles what happens next - framing the problem, generating alternatives, building commitment.

This has a name. In 2018, McKinsey formalised it as the “Data Translator” role. The translator works in between two worlds: the technical analysts on one side and the business decision-makers on the other. The translator’s job is to carry information across the gap - to help the technical team understand what the business needs, and to help the business team understand what the data can tell them. Useful. But notice what is missing: the translator carries information across, but does not shape the decision process itself.

The convention was set early. In data science training, analysts are taught to retrieve data, clean it, build models, and report results. The process is linear: brief → data retrieval → analysis → report → done. Iteration within the analytical step is valued. Stepping back to interrogate the frame is not part of the process. The assumption - often unstated, therefore more powerful - is that the brief is right and the analyst’s job is to answer it.

This model has persisted even as the stakes have risen. Analytics teams are now making recommendations on who gets hired, who gets promoted, who gets managed out of the organisation. The decision quality these teams support matters. And yet the process by which those decisions are framed, shaped, and acted upon remains untouched.

The problem: why information alone does not improve decisions

There are three ways this breaks down.

First, the framing problem. By the time the analyst receives the brief, the decision frame is already set. Usually badly. A business leader says “we’re worried about attrition,” and the analytics team mobilises to investigate why people are leaving. But that is a framing choice - and possibly the wrong one.

The real decision might be different. It might be “should we be worried about attrition at all, or is the real problem that we’re retaining the wrong people?” It might be “attrition is fine at junior levels, but we’re losing critical senior people - should we focus specifically there?” It might be “the cost of attrition is high because we’re bad at hiring for culture - would a recruitment intervention solve the problem faster than a retention intervention?”

These are different questions. They require different analyses. They lead to entirely different actions. The analyst who only starts at the data, who accepts the frame as handed over, has already compromised the decision. This is not speculation. Paul Nutt, who spent twenty years studying more than four hundred management decisions, found that the most common cause of decision failure was not poor analysis. It was premature commitment to the wrong problem - accepting a given framing without questioning whether it described the real decision at stake.

Second, the mechanism problem. Quantitative analysis finds associations. A regression model shows that manager support correlates strongly with engagement. That is useful information.

But is it causal? Are good managers creating engagement, or are engaged employees rating their managers more highly, or is there a third factor driving both? Scale data cannot settle this. The correlation coefficient tells you the relationship is there; it does not tell you how - the mechanism, the causal pathway, the step-by-step sequence from cause to effect. Without this, recommendations target symptoms. You might invest in manager development, expecting engagement to rise, only to discover you are polishing the output of a system that was never broken.

Macey and Schneider’s research in 2008 showed that effect sizes in engagement models halve when you move from cross-sectional data to properly controlled longitudinal designs. The reason is not measurement error. It is confounding - the quantitative relationship is real, but you do not know what is driving it. More recently, Richard E. Lucas in his 2023 paper showed convincingly that the cross-lagged panel model - the standard approach for longitudinal research in organisational psychology - conflates stable trait differences with time-varying within-person processes, making causal inference unreliable even with repeated waves of data.

The mechanism matters because different mechanisms imply different interventions. If high performance drives high engagement (because successful people feel good about work), the lever is different than if high engagement drives high performance (because motivated people work harder). They cannot be distinguished with quantitative data alone.

Third, the commitment problem. A decision has no value if nobody acts on it. And people only act on decisions they understand, feel ownership of, and believe they helped to shape.

The analyst who delivers a report to a room that had no involvement in shaping the question, defining the alternatives, or understanding the evidence has no basis for expecting action. The room may nod appreciatively. But the analysis becomes information that confirmed or denied a pre-existing view - not a guide to decision-making.

This is not psychology. It is structural. If you were not part of interrogating the frame, you have no investment in whether the frame is right. If you were not part of considering alternatives, you have no commitment to the alternative the analysis recommends. Involvement in the analytical process is how decisions become real.

These three problems are not failures of individual analysts. They are structural failures of the conventional model - a model that treats analysis as an end in itself rather than as part of a decision process.

A better model: the analyst as decision coach

The alternative does not replace quantitative analysis. It places it within a larger cycle - a process that constantly seeks and integrates both qualitative and quantitative data, and in doing so naturally engages every link in the Decision Quality chain that we identified in Issue 1.

It looks like this:

Phase 1: Stakeholder interviews - before any modelling.

The analyst begins not with data retrieval but with conversations. They interview the leadership team, the managers, and a sample of employees involved in the situation being analysed. But this is not requirements gathering, where the brief is handed down and the analyst listens passively. It is decision coaching - the first act of shaping how a decision will be made.

These interviews achieve multiple things simultaneously.

Framing: What question is the organisation actually trying to answer? Is the brief right, or is the real problem different? An analytics team tasked with “investigate why attrition is rising” can come back and say: “We spoke to ten managers. They all perceive attrition as a problem, but what they are really worried about is whether we are losing specialists we cannot replace. The question should be about critical skill attrition, not total attrition.”

Actionable variables: What levers are actually available? This is a core idea. Donald Rubin formalised a principle that is simple but revolutionary: “No causation without manipulation.” Age predicts attrition well. So does tenure. Neither is actionable - the organisation cannot make employees younger or older, more or less tenured. The analyst who spends time with stakeholders discovers what is actionable in this specific context. Flexible work arrangements might be possible; mandatory office tenure is not. Career development conversations are feasible; relocation packages are not. The model should be built around what can be changed.

Generating hypotheses: What relationships do practitioners believe exist? An experienced manager’s intuition about the drivers of performance in their team is informative - not as final evidence, but as a source of hypotheses. “What does the data show about whether manager transitions affect team stability?” is a better question when you know the manager has observed transition effects. The interview generates the feature engineering.

Testing intuitions against the data: There is something more subtle happening here. When a senior leader tells you “I think the problem is concentrated in the teams that grew fastest last year,” they are giving you a testable expectation - an informed prior about where the effect will be strongest and what shape it will take. The analyst can then run the analysis and show how that intuition actually plays out in the data. Sometimes the data confirms it. Sometimes it partially confirms it - the effect is there, but it is smaller than expected, or it operates differently than assumed. Sometimes it contradicts it entirely.

This matters for two reasons. First, it makes the analysis more powerful. An analyst who incorporates informed expectations about likely patterns - who tests whether the effect is sensitive to different starting assumptions - produces more robust findings than one who approaches the data with no expectations at all. Second, and perhaps more importantly, it tells the stakeholder: I listened to your perspective. I took it seriously enough to test it. When the results come back, they are not a surprise delivered to a room that had no involvement. They are an answer to a question the stakeholder helped to shape. The leader whose intuition was confirmed learns that the data supports their judgement. The leader whose intuition was contradicted learns something genuinely new - and because they were involved in framing the test, they are more likely to update their view than if the finding had arrived unannounced in a report.

Considering alternatives: What interventions are actually on the table? Not what the analyst thinks should be on the table, but what leaders are genuinely considering. A retention package, a career development program, a change to the performance management system, doing nothing and improving hiring speed instead. The analysis should be designed to inform a choice between these options, not just to characterise a problem.

Seeding commitment: The leader who has spent time shaping the question, who has seen their ideas reflected in the framing of the analysis, who has been involved in defining what it means to act on the findings - that person is already invested in the answer. They are not waiting for the report to decide whether they care. They are waiting to see which of the alternatives the data supports.

Phase 2: Quantitative analysis.

Now the analyst models. With better features. Better hypotheses. A question that someone actually wants answered. AI tools have compressed this step dramatically. What once took a skilled analyst a week - data extraction, cleaning, exploratory analysis, model building - can now happen in an afternoon with Claude Code or a similar LLM-augmented workflow. The barrier is no longer the regression; it is knowing what to ask. The technical work has become the cheaper part.

The analysis tests the hypotheses that came from the interviews. It measures the scale of the associations. It identifies which of the actionable levers has the strongest relationship with the outcome. It surfaces the patterns that would not be visible to observation alone.

Phase 3: Process tracing - back to qualitative data.

Does the pattern that the quantitative data reveals make mechanistic sense? The model says that manager transitions are associated with a subsequent performance decline. The analyst goes back to qualitative data - back to the interviews, to open-text survey responses, to transcripts, to observational notes. They trace the mechanism step by step.

Is performance declining because the new manager loses institutional knowledge that the old manager held? Because the team’s routines get disrupted? Because the new manager has a different style and people need time to adjust? Because the new manager cannot form relationships at the same speed? The specific mechanism matters - each one suggests a different intervention.

This is process tracing, a method formalised by political scientist David Collier in 2011 specifically for causal investigation. The idea is simple: to understand causation, trace the causal pathway step by step. Where quantitative analysis asks “is there an association?”, process tracing asks “how does X cause Y?” It incorporates both quantitative and qualitative evidence and is designed precisely for the kind of investigation described here.

Phase 4: Formal decision analysis - bridging the statistical model and the business question.

The analyst now knows what is associated with what and has traced the mechanism. But there is one more step, and it sits squarely within the analyst’s skillset - more squarely, usually, than within anyone else’s in the room.

Statistical models produce outputs: probabilities, effect sizes, classifications. Business decisions require choices between alternatives with different costs and consequences. The gap between these is where formal decision analysis lives. Techniques like expected monetary value calculations, decision trees, and the assignment of monetary values to the cells of a confusion matrix turn a statistical output into a recommendation that can actually be weighed against alternatives.

Consider an attrition risk model. The model classifies employees as high, medium, or low risk of leaving in the next twelve months. Useful - but the manager looking at this output faces a concrete question: given the cost of an intervention (a retention conversation, a compensation adjustment, a development plan), the probability that the intervention works, and the cost of the person actually leaving, which employees should receive which intervention? A confusion matrix with monetary values assigned to true positives (retention achieved), false positives (intervention wasted on someone who was not going to leave), and false negatives (no intervention for someone who then left) turns the classifier into a decision tool. Expected value calculations across a decision tree of possible interventions turn “what does the model say?” into “what should we do?”
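A minimal sketch of that expected-value comparison. Every figure is hypothetical (intervention cost, replacement cost, success rate), and the class and function names are mine, not taken from any library. It treats the choice for one employee as a two-branch decision: intervene or do nothing.

```python
from dataclasses import dataclass


@dataclass
class Economics:
    intervention_cost: float  # cost of one retention intervention
    replacement_cost: float   # cost incurred if the employee leaves
    success_rate: float       # P(intervention prevents a departure that would occur)


def expected_cost_intervene(p_leave: float, e: Economics) -> float:
    """Pay for the intervention; the person still leaves if it fails."""
    p_still_leaves = p_leave * (1 - e.success_rate)
    return e.intervention_cost + p_still_leaves * e.replacement_cost


def expected_cost_no_action(p_leave: float, e: Economics) -> float:
    """No spend now; bear the replacement cost with probability p_leave."""
    return p_leave * e.replacement_cost


def should_intervene(p_leave: float, e: Economics) -> bool:
    return expected_cost_intervene(p_leave, e) < expected_cost_no_action(p_leave, e)


def break_even_probability(e: Economics) -> float:
    """Leave probability above which intervening has the lower expected cost."""
    return e.intervention_cost / (e.success_rate * e.replacement_cost)


example = Economics(intervention_cost=5_000, replacement_cost=50_000, success_rate=0.30)

print(f"break-even leave probability: {break_even_probability(example):.2f}")
for p in (0.10, 0.35, 0.60):
    print(f"p_leave={p:.2f} -> intervene: {should_intervene(p, example)}")
```

The useful output is the break-even probability: it turns the classifier's risk score into a threshold that depends on costs the business can argue about, not on an arbitrary "high risk" label.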

These are numerate techniques, and they sit naturally with the analyst’s skillset. In many people analytics decisions the analyst is the most numerate person involved. This is exactly where that numeracy matters. It is probably the bridge between a traditional statistical model and the business question the analyst is being asked to help answer. Skipping it leaves the statistical output stranded on one side of that bridge - and leaves the decision-maker to translate probabilities into actions using intuition, which is rarely their strongest suit.

Phase 5: Iterate.

The mechanism suggests a new hypothesis. Back to the data. Then back to the people. The cycle continues. The analyst tests the new hypothesis quantitatively. The results suggest a different mechanism than was initially suspected. Back to the interviews. A manager explains something unexpected. The model needs to be refined. Rapid iteration between data types, guided by the goal of improving a specific decision.

This is not the pre-designed mixed methods study that appears in research textbooks - with a fixed sequence of qualitative then quantitative, or parallel workstreams, with a single synthesis at the end. This is iterative, continuous, back-and-forth. The analyst cycles between data types multiple times within the same engagement.

And because stakeholders have been involved throughout, the iteration is not just analytical - it is decision shaping. Each cycle refines not just the model but the frame itself. Each conversation with a leader surfaces something the data did not show. Each quantitative result prompts a new question. By the time the analyst moves toward action, the stakeholders are not seeing the results for the first time. They have been seeing the results emerge, piece by piece, over weeks of conversation.

What makes this feasible now

This kind of iterative integration sounds time-consuming and expensive. For most of the history of People Analytics, it would have been. Two developments have changed the calculus in a way that is counterintuitive.

First: qualitative data at scale. Text analytics and computational natural language processing have made it possible to analyse thousands of open-text responses, exit interview transcripts, and employee comments with analytical rigour. The capability to work with qualitative data quantitatively - what practitioners sometimes call “qual-at-scale” - was the core intellectual contribution of companies like my old firm - OrganizationView - over the past decade. It is now increasingly accessible through general-purpose AI tools.

This matters because the chief objection to qualitative research has always been: it doesn’t scale. You can conduct meaningful interviews with fifty people. Can you do the same with five thousand? Not in the old way. But if you have five hundred hours of interview and focus group transcripts, you can now extract patterns from them with computational text analysis. You can identify which themes are common, which are rare, which predict outcomes. The “qual doesn’t scale” objection is no longer defensible.
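As a rough illustration of what extracting patterns can mean in practice, the Python sketch below tags comments with themes and compares how often each theme appears with how often its authors later left. The theme lexicon and the comments are invented, and plain keyword matching stands in for the trained language models a real project would use.

```python
import re
from collections import Counter

# Hypothetical theme lexicon; a real project would use a trained text model.
THEMES = {
    "workload": {"workload", "overtime", "burnout", "hours"},
    "manager": {"manager", "boss", "supervisor"},
    "pay": {"pay", "salary", "bonus"},
}

# Invented comments, each paired with whether the author later left.
COMMENTS = [
    ("The workload is relentless and overtime is expected", True),
    ("My manager never gives feedback", True),
    ("Salary is fine but the hours are too long", True),
    ("Great team, supportive manager", False),
    ("Pay could be better", False),
    ("I like the flexible hours", False),
    ("Burnout is a real risk on this project", True),
    ("My boss is supportive and the pay is fair", False),
]


def themes_in(text: str) -> set:
    """Return every theme whose keywords appear in the comment."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {theme for theme, keywords in THEMES.items() if words & keywords}


def theme_summary(comments):
    """For each theme: how common it is, and the leaver rate among its authors."""
    mentions = Counter()
    leavers = Counter()
    for text, left in comments:
        for theme in themes_in(text):
            mentions[theme] += 1
            leavers[theme] += int(left)
    return {
        theme: {
            "share_of_comments": mentions[theme] / len(comments),
            "leaver_rate": leavers[theme] / mentions[theme],
        }
        for theme in mentions
    }


summary = theme_summary(COMMENTS)
for theme, stats in sorted(summary.items()):
    print(theme, stats)
```

Note that the comment about flexible hours is tagged as workload even though it is positive. That is the kind of error keyword matching makes and better text models reduce, and it is one reason the themes still need a human reading alongside them.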

Second: AI-assisted quantitative analysis. Claude Code and similar tools have compressed the quantitative analysis cycle from days to hours. What once required a skilled analyst a week - data retrieval, cleaning, exploratory analysis, model building - can now be done in an afternoon. The technical barrier has fallen dramatically.

Here is the counterintuitive part: as quantitative analysis gets cheaper and faster, the analyst’s work becomes more qualitative, not less.

If a quantitative analysis cycle takes five days, an analyst can fit only a few cycles into an engagement. You are limited. In practice you do one round of interviews, build one model, present findings, and move to the next project.

If a quantitative analysis cycle takes hours rather than days, the iterative mixed methods process described above becomes practically feasible for the first time. The analyst can run a model at 10 a.m., discuss the findings with stakeholders at 11 a.m., surface a new hypothesis at noon, test it quantitatively by 3 p.m., and begin process tracing before the day is out. The analyst can cycle between data types multiple times within the same engagement. They can iterate deeply. They can trace mechanisms. They can talk to people more, not less.

The introduction of AI tools that are “good at analysis” increases the proportion of the analyst’s time spent on qualitative work - interviews, conversations, sense-making, mechanism tracing, commitment building. The mechanics of analysis accelerate. The human work expands to fill the space.

This is not a small change. It is a reorientation of the analyst’s role. McKinsey’s Data Translator concept was useful but incomplete. The translator bridges two worlds. The iterative analyst is a decision coach - someone who shapes how a decision will be made by being involved in every stage of the analytical process.

What should change

If this is right, four things should shift in how analytics teams - and the HR functions they serve - are structured.

First, the stakeholder interview is analytical work. It is not requirements gathering, not soft skill, not something that happens before “real” analysis begins. It is the first, most consequential act of decision coaching - the moment when the frame is interrogated, actionable variables are identified, hypotheses are generated, and commitment to the eventual findings is seeded. An analyst who skips this has already compromised the decision, no matter how good their model is.

Second, structure teams around decisions, not methods. The conventional split between quantitative and qualitative analysis mirrors a disciplinary boundary in academia. That boundary has no basis in decision science. A question like “what is driving performance decline in this business unit?” requires both data types. One analyst (or a tightly integrated team) cycling between data types produces better decisions than parallel tracks that run separately and report to the same meeting.

Third, the analyst’s distinctive contribution is increasingly the steering, not the analysis itself. As AI makes the mechanics of analysis cheaper, the analyst’s value shifts to what cannot be automated: interrogating the frame, identifying what is actionable, tracing mechanisms step by step, applying formal decision analysis to turn statistical outputs into recommendations, and building the human commitment to act. The analyst’s numeracy is not only useful for the model - it is useful at the point where the model’s output meets the decision. In most people analytics engagements the analyst is the most numerate person in the conversation, and techniques like expected value calculations and decision-tree reasoning belong to them more naturally than to anyone else involved. This is not the McKinsey translator. That role bridges two worlds and carries information across. The decision coach guides the decision process itself - by being involved in shaping the question, defining the alternatives, making the evidence legible, reasoning formally about the trade-offs implied by the evidence, and earning commitment to action.

A warning about self-service analytics. There is a trend running directly against this argument. Many People Analytics platforms now promise “Insight” through self-service dashboards - giving HR business partners and line managers direct access to pre-built analyses, correlations, and benchmarks. The pitch is appealing: democratise access to data, reduce the bottleneck of the central analytics team, let managers explore the data themselves.

But consider what this does in the terms of this article. Self-service analytics accelerates the quantitative step - the one step that is already becoming cheaper. It does so by removing the analyst from the process entirely. The manager who opens a dashboard and sees that engagement is correlated with manager tenure has received information. They have not had the frame interrogated. They have not been asked what interventions are actually on the table. Nobody has traced the mechanism. Nobody has tested whether the association is causal or confounded. And nobody has built commitment to a specific course of action.
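One simple check that the dashboard skips is to split the data by a suspected common cause and see whether the association survives. The Python sketch below uses invented teams at two hypothetical sites. It is a single stratification check under those assumptions, not a full causal analysis.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented teams: (site, manager tenure in years, engagement score).
# The established site has both longer-tenured managers and higher engagement.
TEAMS = [
    ("head office", 6, 7), ("head office", 7, 8),
    ("head office", 8, 8), ("head office", 9, 7),
    ("new plant", 1, 4), ("new plant", 2, 5),
    ("new plant", 3, 5), ("new plant", 4, 4),
]


def correlation_for(teams):
    tenure = [t for _, t, _ in teams]
    engagement = [e for _, _, e in teams]
    return pearson(tenure, engagement)


# What the dashboard shows: all teams pooled together.
pooled = correlation_for(TEAMS)

# What a stratified check shows: the same correlation within each site.
within_site = {
    site: correlation_for([row for row in TEAMS if row[0] == site])
    for site in {row[0] for row in TEAMS}
}

print(f"pooled correlation: {pooled:.2f}")
for site, r in sorted(within_site.items()):
    print(f"within {site}: {r:.2f}")
```

In this constructed example the pooled correlation is strong, yet inside each site tenure and engagement are unrelated. The site explains both. A manager reading only the pooled figure would reasonably, and wrongly, conclude that keeping managers in post longer raises engagement.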

Self-service analytics, done badly, is the purest version of the problem this article describes: it optimises the information link while disconnecting the user from every other link in the decision chain. The dashboard delivers the data faster. It does not make the decision better. In some cases it makes it worse, because a manager who has “seen the data” feels more confident in a conclusion that may be entirely artefactual.

This does not mean self-service has no place. It means it should be designed with the decision coach in mind - as a tool the analyst uses with the stakeholder, not a replacement for the analyst’s involvement.

Fourth, the capability gap in HR is decision literacy, not data literacy. For the past decade, People Analytics teams have campaigned - with considerable energy and some success - to develop data literacy across HR. The logic was sound: if HR professionals could read charts, interpret statistics, and interrogate data, they would make better decisions. But if the argument of this article is right, data literacy addresses the wrong bottleneck. Teaching someone to read a regression output helps with the information link. It does nothing for framing, alternatives, or commitment.

What HR needs instead is decision literacy - an understanding of how to structure a complex decision. Most people decisions involve multiple factors that need to be optimised simultaneously, many of which are difficult to measure. Should we invest in retention or accept higher attrition and improve hiring speed? Should we centralise the analytics function or embed analysts in business units? Should we redesign the performance management system or focus on manager capability? These are not questions that more data will resolve. They are structurally complex decisions where the frame matters more than the analysis, where trade-offs between competing values must be made explicit, and where the quality of the reasoning process determines the outcome more than the quality of the data.

The key lessons of decision analysis - how to frame a decision, how to generate genuine alternatives, how to distinguish what is measurable from what matters, how to reason under uncertainty, how to build commitment - these are the skills that HR professionals need now. Data literacy was the right campaign for an era when the problem was that HR did not look at data at all. Decision literacy is the right campaign for an era when data is abundant, AI can analyse it cheaply, and the binding constraint is the quality of the decisions the data is supposed to inform.

The diagram below shows how this works. Issue 1 established that conventional analytics operates at one link in the Decision Quality chain - the information link. The decision coach engages five of the six: framing, alternatives, information, reasoning, and commitment. The one link the analyst does not own is values and trade-offs - deciding what the organisation actually wants to optimise for. That belongs to the decision-maker. But everything else, including the formal reasoning that turns model outputs into weighed choices, is within the analyst’s scope.

The Working Idea

You have probably noticed something in this argument: it does not require any particular analytical technique. You could build the model with classical statistics or machine learning or causal forests. You could do the qualitative investigation with interviews or observational research or process tracing. The method is not the point.

What matters is the structure of the process. An analyst who interviews stakeholders before modelling, who tests mechanisms through qualitative investigation, who cycles between data types guided by a decision goal, and who brings stakeholders along the way - that analyst is doing decision coaching whether they know it or not.

But here is the question: how many analytics teams are structured this way?

Most are not. Most follow the conventional model - brief, analyse, report. The stakeholder involvement happens at the beginning (to understand the question) and the end (to present findings). The work in the middle is done in isolation.

Changing this is not a tool problem. It is not a data quality problem. It is not about getting a closer seat at the executive table. It is about reorienting the entire framing of what an analyst does. From analyst-as-information-producer to analyst-as-decision-coach.

The best analysis your team ever produces will change nothing if nobody was committed to acting on it before they saw the results. Start with the interview, not the dataset.

Notes:

In prior projects where I’ve used the EMV analysis approach it has taken about as much time to build the expected monetary value model as the statistical model, but the effect on the quality of the overall decision, for optimising on the right thing, and for generating acceptance of the decision, was enormous.
The arguments in this and the preceding article are why I’m including a module on Decision Theory in the Master’s in Data Science that I’m currently teaching. We’re not alone in doing this - others include Cornell, Barcelona School of Economics and UConn, to name just a few.
I’m currently working with a few HR teams to develop decision skills education for their teams. Do get in touch if you’d like to discuss.

Sources:
Macey, W.H. & Schneider, B. (2008). “The Meaning of Employee Engagement.” Industrial and Organizational Psychology, 1(1), 3–30.
Lucas, R.E. (2023). “Why the Cross-Lagged Panel Model Is Almost Never the Right Choice.” Advances in Methods and Practices in Psychological Science, 6(1).
Collier, D. (2011). “Understanding Process Tracing.” PS: Political Science & Politics, 44(4), 823–830.
Nutt, P. (1999). “Surprising but true: Half the decisions in organisations fail.” Academy of Management Executive, 13(4), 75–90.
Imbens, G.W. & Rubin, D.B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press.

Analysis Is Not the Point

Andrew Marritt — Thu, 11 Dec 2025 12:26:41 GMT

This is the first of a 2-part mini-series on why People Analytics needs to shift from analysis to decisions. The second part is here.

The field’s uncomfortable moment

Speaking to practitioners at People Analytics conferences over the past couple of years, I’ve noticed a change in the atmosphere. The optimism of the early 2010s — when building the team, landing the technology, and proving that HR could do numbers felt like the whole job — has given way to something more unsettled. The dashboards are built. The tools are funded. And yet.

The data bears this out. According to HR.com’s State of People Analytics 2024–25, around three-quarters of organisations now invest in people analytics teams and technology. Just one in ten reports consistently achieving the highest level of impact with them. Decisions about who gets promoted, who gets hired, who gets managed out are still being made largely on instinct. Engagement scores are tracked and reported; little changes. Months of careful analysis sit in presentations that informed no decision in particular.

The field’s response has been consistent: better data, more AI, improved data literacy among managers, a closer seat at the executive table. These diagnoses were being offered in 2015. Progress has been modest.

I think the problem is upstream of all of them.

Treating the symptoms

The standard diagnoses are not wrong. Data quality problems are real. Many managers genuinely lack the confidence to work with quantitative evidence. Integration between people data and operational or financial data remains poor in most organisations. These are genuine barriers and they are worth addressing.

But here is a useful test: imagine you fixed every one of them. Your data is clean and fully integrated. Your CHRO presents at every board meeting. Your managers are fluent with numbers. You have still not guaranteed that a single important decision gets made better as a result of your work.

Technical and political improvements are necessary but not sufficient. The failure mode is not technical and it is not political. It is conceptual — a confusion, running deep through the discipline, about what analysis is actually for.

The means and the end

The purpose of analysis is to improve decisions. Analysis that does not change a decision has no value.

This sounds obvious. Its implications are radical.

It means the quality of an analytical project cannot be judged by the quality of the analysis. A technically excellent model, properly validated, beautifully presented, is worthless if it does not change how a decision is made. The question “was this good analysis?” is almost entirely the wrong question. The question that matters is “did this help someone make a better decision?”

And it means that everything upstream of the analysis — how the problem is framed, what decision is being supported, who makes the decision and how — is at least as important as the analytical work itself.

The field has invested heavily in getting better at producing analysis. It has invested almost nothing in the conditions that would make that analysis matter.

Asking the wrong question

Consider a scenario that will be familiar to most practitioners. An organisation is concerned about employee turnover. The people analytics team is asked to investigate why people are leaving. They build a solid predictive model: a proper piece of work, well-specified and appropriately validated. The top drivers of attrition are identified. Recommendations follow. Among the most tractable drivers, it turns out, is stress caused by a feeling of too much work: if you reduce the pressure on borderline performers, total attrition falls. The project is declared a success.

Except: the attrition that falls is largely among poor performers, who were inexpensive or actively beneficial to lose. High performer attrition — costly, often irreversible — has actually increased as they struggle to differentiate themselves from the average. The firm is now measurably better at retaining people it would prefer to lose. Objectively, the decision made things worse. The analysis, throughout, was technically sound.

The error was made before the first line of code was written. The frame was wrong.

The real decision facing the organisation was not “how do we reduce turnover?” It was “how do we manage the cost and impact of turnover on the business?” These are not the same question. They require different analyses, yield different findings, and lead to entirely different actions. The modelling technique could be identical; what changes is the decision it serves.
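The difference between the two frames is easy to see in numbers. The Python sketch below scores the same before-and-after situation twice, once by leavers and once by cost. The counts and the cost per leaver in each performance group are hypothetical figures of mine, chosen to mirror the scenario above.

```python
# Hypothetical cost to the business of losing one person in each group.
COST_PER_LEAVER = {"high": 60_000, "average": 20_000, "poor": 5_000}

# Hypothetical annual leavers before and after the workload intervention.
leavers_before = {"high": 10, "average": 60, "poor": 40}
leavers_after = {"high": 14, "average": 60, "poor": 20}


def total_leavers(leavers: dict) -> int:
    """The metric under the frame: how do we reduce turnover?"""
    return sum(leavers.values())


def turnover_cost(leavers: dict) -> int:
    """The metric under the frame: how do we manage the cost of turnover?"""
    return sum(count * COST_PER_LEAVER[group] for group, count in leavers.items())


change_in_leavers = total_leavers(leavers_after) - total_leavers(leavers_before)
change_in_cost = turnover_cost(leavers_after) - turnover_cost(leavers_before)

print(f"change in leavers: {change_in_leavers:+d}")
print(f"change in cost:    {change_in_cost:+,d}")
```

With these figures the intervention removes sixteen leavers and adds 140,000 to the cost of turnover. Both numbers are correct. Which one counts as the result depends entirely on the frame chosen before the analysis began.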

Ronald Howard, who coined the term “decision analysis” at Stanford in 1966, spent much of his career arguing that framing is the most consequential and least discussed step in any analytical process: “The frame is the most important thing, and it’s the one that’s talked about the least — otherwise, you’re going to get the right answer to the wrong problem.” Russell Ackoff, who built the discipline of Operations Research at Wharton across the same decades, put it with characteristic sharpness: “The righter we do the wrong thing, the wronger we become.”

The turnover case is not a horror story about analytical malpractice. It is the default pattern when an analytics team accepts the question it has been handed rather than examining the decision it is supposed to serve.

One link in a longer chain

Even when the frame is right, there is a further problem. Most analytics work treats information as the product — and stops there.

Carl Spetzler, Hannah Winter, and Jennifer Meyer, in Decision Quality (2016), codified what Ronald Howard’s group had been teaching at Stanford for decades: a good decision requires six things to be in place simultaneously, each of which can cause the whole enterprise to fail if it is weak. They are rendered as links in a chain — appropriate framing, creative alternatives, meaningful and reliable information, clear values and trade-offs, sound reasoning, and commitment to action.

Figure: Analysts typically only consider one part of a decision.

People Analytics, even when it is working well, almost exclusively provides one of these: information. A predictive attrition model tells you that a given employee has a high probability of leaving in the next six months. It does not tell you what you could actually do about it — the alternatives link. It does not encode whether retention is worth the cost in this particular case — the values link. It does not tell you how the decision will be made or by whom — the commitment link.

The chain is only as strong as its weakest link. A brilliant model that informs a poorly framed decision, with no good alternatives considered and no clear owner of the final call, is not a successful piece of analytics work. It is a technically impressive contribution to a failed decision process.

The discipline has optimised for one link and treated the rest as someone else’s problem.

A gap in training — and what fills it

Understanding why this pattern persists requires an uncomfortable observation about how the people analytics profession is formed.

Decision theory has a substantial intellectual history. Howard’s decision analysis work in the 1960s built on earlier foundations in economics and mathematics: the expected utility theory developed by John von Neumann and Oskar Morgenstern, the subjective probability framework of Leonard Savage, the Bayesian statistical tradition. This body of work — which deals explicitly with how to frame decisions, how to reason under uncertainty, and how to connect analytical findings to choices — is standard in economics and Operations Research programmes. It is the intellectual infrastructure of those disciplines.

It is almost entirely absent from the training routes that feed People Analytics. Data scientists are taught to build models. I/O psychologists are taught to design studies and interpret results. HR professionals are taught organisational behaviour and employment law. Almost none of these routes touch, even briefly, on decision theory. Practitioners arrive with genuine technical competence and no conceptual framework for what their work is meant to produce.

This is not a criticism of individuals. It is a structural observation about how disciplines are bounded, and about what happens when a field draws on multiple technical traditions without inheriting any of their foundational thinking about purpose. People Analytics is highly capable, technically. It simply was not taught what it is for.

There are other structural forces that reinforce the pattern. Analytics teams are typically evaluated on outputs — models delivered, dashboards built, reports produced — not on outcomes. Analytics technology is sold as an “insight generation” platform, positioning analysis as the destination rather than the route. And much of the push for evidence-based HR has been, at least in part, a campaign to prove HR’s numerical credibility to sceptical business audiences — which rewards analytical display more than decision facilitation.

A brief note on AI

Generative AI is changing the economics of analysis production rapidly, compressing into minutes work that previously took weeks. If the purpose of analytics is to produce analysis, this is an unambiguous improvement. If the purpose is to improve decisions — and the framing problem described above goes unaddressed — the field can now produce faster and more elaborate answers to the wrong question. Ackoff’s observation applies with some force.

What it looks like when done properly

Return to the turnover scenario, but start differently.

Before any data is pulled, the question is examined: what decision are we actually supporting? Not turnover reduction as an end in itself, but the management of turnover costs and impact — which means the relevant population is not all leavers but costly leavers, primarily high performers and people in hard-to-fill roles. This reframing changes what the model needs to do.

Alternatives are scoped before the model is built: what levers actually exist? Targeted retention packages, career development conversations, changes to management practice, or doing nothing and improving hiring speed instead. The analysis is then designed to inform a choice between these options, not just to characterise a problem. And it is built around variables the organisation can actually change, a point the statisticians Paul Holland and Donald Rubin captured in the motto “no causation without manipulation.” Age and tenure may predict attrition well; neither is a lever. The decision owner is identified before the findings are presented. The values and trade-offs (what is a retained high performer worth, relative to the cost of a retention effort?) are made explicit.
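A choice between scoped alternatives can be laid out as a very small expected-cost comparison. The Python sketch below does this for a single at-risk employee. The alternatives echo the levers listed above, but every cost, probability and loss figure is a hypothetical assumption, and in practice the value placed on the loss is the decision owner’s call, not the analyst’s.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alternative:
    name: str
    cost: float         # upfront spend per person (hypothetical)
    p_leave: float      # chance the person still leaves under this option
    loss_factor: float  # share of the full loss still incurred if they leave


# Hypothetical alternatives for one employee with a 30% baseline risk of leaving.
ALTERNATIVES = [
    Alternative("do nothing", 0.0, 0.30, 1.00),
    Alternative("retention package", 8_000.0, 0.15, 1.00),
    Alternative("career conversation", 1_000.0, 0.24, 1.00),
    # Faster hiring does not keep the person; it makes losing them cheaper.
    Alternative("faster hiring", 3_000.0, 0.30, 0.75),
]


def expected_cost(alt: Alternative, loss_if_leaves: float) -> float:
    """Upfront cost plus the probability-weighted loss if the person leaves."""
    return alt.cost + alt.p_leave * alt.loss_factor * loss_if_leaves


def best_alternative(loss_if_leaves: float) -> Alternative:
    """Lowest expected cost for a given valuation of the loss."""
    return min(ALTERNATIVES, key=lambda alt: expected_cost(alt, loss_if_leaves))


# The valuation of the loss is the values link: it belongs to the decision owner.
for loss in (60_000.0, 120_000.0):
    print(f"if losing this person costs {loss:,.0f}:")
    for alt in ALTERNATIVES:
        print(f"  {alt.name:<20} {expected_cost(alt, loss):>10,.0f}")
    print(f"  -> {best_alternative(loss).name}")
```

Under these assumptions the cheap career conversation wins when the loss is valued at 60,000, and the retention package wins when the role is hard enough to fill that the loss doubles. The model output, a 30% risk of leaving, is the same in both cases. What changes the recommendation is the value the decision owner places on keeping the person.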

The analytical work may be similar in technical terms. The decision process it sits within is entirely different. And the probability that it changes something is vastly higher.

The real problem

The angst in People Analytics right now is real and appropriate. Paul Nutt, who spent twenty years studying more than four hundred management decisions for his book Why Decisions Fail (2002), found that the most common cause of failure was not poor analysis. It was premature commitment to the wrong problem — accepting a given framing without questioning whether it described the real decision at stake.

People Analytics has been committing, at scale, to the wrong problem. It has treated analysis quality as the thing to improve when the thing that matters is decision quality. Better dashboards, more AI, and improved data literacy will not fix this. They will produce righter answers to the wrong question.

The discipline does not need to be more analytical. It needs to understand what analysis is for.

Part II - How the profession should change

Sources: HR.com State of People Analytics 2024–25; Carl Spetzler, Hannah Winter & Jennifer Meyer, Decision Quality (Wiley, 2016); Paul Nutt, Why Decisions Fail (Berrett-Koehler, 2002); Ronald A. Howard, “Decision Analysis: Applied Decision Theory” (1966); Russell L. Ackoff, quoted in The Systems Thinker.
