What Education Research Is For
Here’s the part Kelsey Piper can’t see
Before I began my PhD, I thought I knew what I was going to do: give a questionnaire to thousands of kids to show that their curiosity declined during schooling.
Research suggested this happens; nobody had proved it yet. Job done.
I didn’t do it, partly because educational research is hard.
You have to get ethical approval from your university, which takes months. You have to get children to sign up and take a questionnaire. Given the difficulty of getting them to complete a worksheet when they’re right in front of me and I can threaten them with detention, that’s a challenge. You’ve also got to get their parents to agree (and you’re not even allowed to trade their signature for a promise to look favourably on their kids’ coursework).
So I didn’t do it.
Which is lucky, because the OECD had already asked tens of thousands of kids these questions and found out – in a much more robust way than I would have done – what happens to kids’ curiosity when they get older. (I’ve written it up here.) The OECD also runs PISA, the international comparative study that governments treat as the gold standard of educational testing.
It’s not the sort of thing I do, though. I watch lessons. I speak to the children themselves. It takes a different set of skills from analysing 60,000 kids’ questionnaires.
There are lots of ways to do educational research – something I don’t think Kelsey Piper realises.
Before I get to that, we need to talk about Jo.
On Jo Boaler
Piper opens with Jo Boaler. The methodological criticisms are serious – the 2012 Bishop, Clopton and Milgram paper on the Railside misrepresentations; Tom Loveless’s 2023 review; a 2024 anonymous complaint to Stanford alleging 52 citation misrepresentations, which Stanford declined to investigate as ‘scholarly disagreement.’
In summary: Boaler disputes the allegations; that hasn’t swayed her critics; her work helped shape a controversial California maths framework; she’s still employed by Stanford.
But of course there’s a larger story here, and it doesn’t reflect well on education research as a discipline.
Peer review eventually did its job, but ‘eventually’ was after California had built its new curriculum based on her work. And look at where the pushback did come from: mathematicians, statisticians and policy experts. Where were the education researchers in all this?
And then there’s the Nature reproducibility paper Piper cites: 342 education papers examined; 69 contained quantitative claims that were eligible for reproducibility analysis (admittedly a small slice of the total quant work published in education over the decade); only a handful made it through the process; zero findings reproduced exactly. We can agree: quant education research can and should share code and data. Again, PISA’s the gold standard here.
Piper’s made a strong case, but here’s where it weakens.
The solutions she proposes – sharing analysis plans beforehand, conducting large-scale trials, hiring trained economists and ensuring methods are transparent – are not the discipline-changing blue-skies thinking she believes they are.
In fact, a whole group of organisations already does some variation on what Piper proposes – evidence‑mobilisation and ‘what works’ bodies in education. You can find them in the UK, Australia, the Netherlands, Chile and Spain (and, to some extent, the US).
I interned in one for six months. Let me tell you more about how they work – and when they don’t.
Inside the what-works project
When I began teaching, I wondered why we didn’t just run randomised controlled trials, like medicine does. Take two groups of kids, try something with one group, try something different with the other, see who gets the best grades. As Piper puts it: test curriculum A against curriculum B and see which one gets the best results, making sure your methods are transparent. Why bother with anything else?
While they have branched out, this was the founding idea behind the UK’s Education Endowment Foundation. Piper argues ‘bring in the economists, apply the standards of their field’. We’re already doing it. When I interned in the EEF’s evaluation team, most of the team had an economics background. (I’m not an economist, but it doesn’t take much digging to see that economics has its own replication problems – as Drew Bailey, cited by Piper, pointed out in the comments.)
The EEF asks how we can improve attainment across the board and close the gap between the most disadvantaged and the most affluent students. It runs RCTs. It publishes findings. It demands the kind of methodological transparency Piper wants.
And the results are sobering.
A 2021 review commissioned by the EEF itself found that across seven interventions, the average effect size dropped from 0.25 to 0.01 when they were scaled up. Using the EEF’s own conversion, kids appeared to make three extra months of progress with the intervention before scaling; after, zero months. Another study, reviewing 141 large-scale RCTs commissioned by the EEF and its US equivalent and involving over 1.2 million students, found the interventions gave kids about one additional month of progress on average.
This isn’t bad educational research. These are well-designed, well-funded, well-conducted trials. The results – when successful – are used in thousands of classrooms across the country. But modelling educational research on what goes on in medicine has its limitations.
Medicine tests interventions on individual bodies where the mechanism is relatively stable across patients. Across a population, a drug that lowers cholesterol in Boston will lower cholesterol in Birmingham (although even in medicine, focus is moving towards personalised interventions). Also, as Greg Ashman points out (see the comments), a doctor can’t tell a placebo from the real article; a teacher knows when they’re trying out a shiny new intervention versus doing business as usual – and that awareness is going to affect how they teach it.
A classroom isn’t a doctor’s surgery. Most teachers will tell you they’re already trying to maximise their students’ exam grades. Maybe we’re now at the point of diminishing returns, given constraints like class size, funding, mixed ability grouping and timetabling.
Neither is a classroom a laboratory, where painstakingly designed experiments can be faithfully reproduced again and again. Step into one (or, god forbid, stand near the interactive whiteboard) and you realise they’re a seething mass of interacting factors: the curriculum, teacher, students (and the dynamics between them), lesson material, weather and whether the kids have eaten lunch yet. A reading programme that raises scores in Manchester won’t necessarily raise them in Marseille.
It’s not about bad methodology; it’s the naivety of thinking we can do the same thing across every classroom and expect the same results.
Scale it up, surely, and the effect of these other variables dies away?
Wrong.
Scale up, as we saw, and the effect of the intervention often shrinks.
This is where a different type of educational research comes in, and it’s one the EEF already does. They don’t just run the trial; they also send researchers into schools to check whether teachers are running the programme as they should.
This is qualitative research. It doesn’t produce shareable code or replicable regression coefficients, because it’s asking other kinds of questions: what does this experience actually feel like to the people inside it? What makes a teacher abandon a programme in week three? What does a child do when she can’t keep up?
These are real questions. They have real answers. But the answers don’t fit the format the Nature reproducibility paper measures. You can’t just run the same code again across fifty interviews in R.
If you reduce education research to ‘how can we use methods from economics to transparently work out the best curriculum to maximise kids’ grades’, you miss something.
You miss the collateral damage when those results are all you care about.
What does it feel like to be labelled ‘low attaining’?
I wrote about Saffa’s experiences in a previous post. If you realise that school isn’t for you – if you struggle in class, get kept in at break and lunch to catch up and attend intervention groups – what does that do to your sense of who you are?
You can survey a thousand kids – or ten thousand – but you won’t get the rich details that Eleanore Hargreaves, Laura Quick and Denise Buchanan at UCL did when they followed 23 London children from the age of seven who had been labelled ‘low attaining’.
Take Saffa. She loved art. Learning about pointillism was her favourite thing at school.
Did that mean she’d rather do art than maths each morning?
No – art would become ‘quite meaningless,’ she said. ‘Because you have to do plus and take away and division and stuff.’
At seven, she’d already learnt which subjects counted and which didn’t. The curriculum hierarchy had overridden her passions. She also knew the stakes. If you didn’t listen in class: ‘You’ll just be a McDonald’s cooker, just flip patties. You will be unsuccessful.’ And when she had to leave her friends to join a younger class for maths, she called it ‘the walk of shame.’
What if we got Saffa a couple of extra points in her GCSE exams? Wouldn’t that transform her life chances? Maybe. But she’s seven. She’s already doing the walk of shame. Will those points be enough to transform how she feels about herself?
What about Chrystal, too, who feels low attaining kids like her end up alone: ‘no-one cares’ because ‘they have no friends to stand up for them’? Or Neymar, hiding in the toilets to avoid a maths test, despite how disgusting he finds the smell?
Seven years old. With nobody to tell their story except a group of researchers who may or may not have economics-level statistics knowledge, but who do have the deep training required to do qualitative research properly.
The problems with qualitative research
Qualitative research brings its own challenges, and transparency is one of them. The UCL team could upload their transcripts and observation notes, but that brings logistical challenges (my PhD interview transcripts total 150,000 words) and ethical ones – something in there might identify a participant. Name the school – as Piper suggests, and the comments underneath push back on – and you often have enough to name the participants.
These challenges aren’t insurmountable, though, and qualitative researchers should keep pushing for the highest standards (we just need to agree on what these are).
There’s also the issue of interpretation. Which of Saffa’s words do the researchers choose? How do they connect them with other students’ experiences? The UCL team are experts at this. Read their papers and you can see the depth required to do the work. If you think it’s easy, try turning 150,000 words of transcript into a coherent and representative narrative.
In choosing to interview students and observe lessons, I found something much more interesting than I could have got from a survey alone. I caught the moments where a student asked a question and their classmate commented under their breath, or a teacher cut it off. And I saw the moments where children’s faces lit up during a practical, or a teacher took a question and ran with it, bringing the lesson to life.
These studies are not better than quantitative research. They sit alongside it. Numbers can tell you how children’s curiosity changes between ten and fifteen. But unless you watch the lessons and speak to the students and the teachers, you cannot fully understand why those changes happen.
The case for education research
A couple of days ago, my daughter saw a book on the shelf – Guy Claxton’s What’s the point of school?
‘Isn’t it obvious?’ she said. ‘Learning.’
Even when I started teaching, I would have agreed. But the more time you spend in schools, the more children you speak to, the more you wonder: is this the best we can do for them?
Schools are happy to tell us what they’re for. They proudly display banners with their grand mission statements. Do they do what they say? Who holds them to account?
Inspections are a form of research. We use exam results to compare schools. Accountability is essential. But someone needs to hold the education system to account on behalf of the students.
There’s a place for large-scale and smaller-scale studies that investigate learning. The science of learning is complex and multidisciplinary, and there’s a lot we’re only beginning to understand about how the brain processes information. Piper’s prescription – better methods, more transparency, larger trials – would improve some of that work.
But it can’t tell Saffa’s story.
What Piper is calling for is a narrower field, not a stronger one. A field that produces only the kind of evidence her preferred standards can measure. Yes, it would be cleaner and easier to audit. But it would also be silent on most of what matters about what schools do to children.
If we jettisoned the rest of education research, we’d lose Saffa. And Chrystal. And Neymar. They’d disappear back into exam stats, where the only question we’d be allowed to ask is whether some intervention raised their scores by 0.06 standard deviations.
That’s not a stronger field. That’s a smaller one.