The ongoing debate about the usefulness of effect sizes.

Last weekend saw researchED Durham – #rEDDurham18 – provoke a debate on Twitter about the usefulness of effect sizes within education. This debate involved many individuals, including, amongst others, @dylanwiliam, @Kris_Boulton, @HuntingEnglish, @SGorard, @profbeckyallen, @dodiscimus, and @tpltd.  Now, for the purposes of transparency, I need to be upfront about my own role in provoking this debate, as I had co-presented a session with Professor Adrian Simpson of Durham University at researchED Durham, where we argued that there are all sorts of difficulties in using effect sizes as a measure of the effectiveness of an educational intervention, which may have made a small contribution to the discussion on Twitter.

However, even when you are part of an online discussion or thread – especially on Twitter – the flow and complexity of the discussion can be hard to follow.  As a result, the discussion can sometimes become a bit disjointed and go off in various directions, resulting in the repeated articulation of individuals' competing claims at the expense of the articulation of the various elements of the whole argument.  So, with this in mind, I'm going to try to outline an argument – using Toulmin's structure of arguments – about the use of effect sizes in making decisions about which educational interventions to pursue.  This will hopefully allow you to identify the key issues in the effect size debate and help you make up your own mind on the issue.

Toulmin and effect sizes.

The philosopher Stephen Toulmin – Toulmin (2003) – identifies six components in the layout of an argument:

·      The claim (C) or conclusion, i.e. the proposition at which we arrive as a result of our reasoning

·      The grounds or data (D), i.e. the specific facts we appeal to as a foundation for C; in other words, the basis from which we argue

·      The warrant (W), the general rule that allows us to infer a claim; it is the proposition that provides justification for, and a licence to, the inference from D to C

·      The backing (B) standing behind our warrant, which is the body of experience and evidence that supports the warrant

·      The qualifier, a word or phrase (e.g. presumably, possibly, probably) that indicates the strength conferred by the warrant

·      Rebuttals, which are extraordinary or exceptional circumstances that undermine the force of the supporting grounds

Figure 1 provides a diagrammatic representation of the Toulmin structure of arguments, derived from Jenicek and Hitchcock (2005).

[Figure 1: the Toulmin structure of arguments]

Next, we need to articulate the argument for the use of effect sizes within the Toulmin structure, and we get something like this:

[Figure 2: the argument for the use of effect sizes, laid out in the Toulmin structure]
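Toulmin's six components lend themselves to being written down as a simple data structure. Here is a minimal sketch in Python; the wording of each component is my paraphrase of the effect-size argument as described in this post, not the exact text of the figure:

```python
from dataclasses import dataclass, field

@dataclass
class Argument:
    """Toulmin's six components of an argument."""
    claim: str          # (C) the proposition we arrive at
    data: str           # (D) the specific facts relied on to support the claim
    warrant: str        # (W) the general rule licensing the inference from D to C
    backing: str        # (B) the experience and evidence supporting the warrant
    qualifier: str      # strength conferred by the warrant
    rebuttals: list = field(default_factory=list)  # circumstances undermining the grounds

# The effect-size argument, paraphrased from this post
effect_size_argument = Argument(
    claim="Adopt intervention X in preference to intervention Y",
    data="A study of X reported a larger effect size than a study of Y",
    warrant="Effect size measures the effectiveness of an intervention",
    backing="Published studies and meta-analyses reporting effect sizes",
    qualifier="presumably",
    rebuttals=[
        "The samples trialled were not equivalent",
        "The comparison treatments were not equivalent",
        "The outcome measures differed",
        "The effect size measures differed",
    ],
)
```

Laying the argument out this way makes the next point easy to see: the claim stands or falls with the single warrant field.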

As should be readily apparent, this all comes down to the warrant – i.e. that effect size measures the effectiveness of an intervention – and whether that warrant is justified.  However, Adrian Simpson states (private correspondence):

(a)   … a larger effect size from a given study on intervention X than from another given study on intervention Y only indicates that X is more effective than Y if:

  1. The samples on which the interventions are trialled are equivalent

  2. The alternative treatments against which X and Y are compared are equivalent

  3. The outcome measure is the same

  4. The effect size measure is the same

And even then, one can only conclude “X is on average more effective than Y on that sample, compared to that alternative on that measure”
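Conditions 3 and 4 can be made concrete with a small calculation. Cohen's d, one common standardised effect size, divides the raw mean difference by the pooled standard deviation, so exactly the same raw gain yields a very different effect size when the spread of scores on the outcome measure changes – as happens when a narrow, intervention-aligned test replaces a broad standardised one. The numbers below are invented purely for illustration:

```python
import statistics

def cohens_d(treatment, control):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = statistics.stdev(treatment), statistics.stdev(control)
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

# Both comparisons show the same 5-point raw gain...
broad_test = cohens_d([40, 55, 70], [35, 50, 65])    # wide spread of scores (SD 15)
narrow_test = cohens_d([50, 55, 60], [45, 50, 55])   # narrow spread of scores (SD 5)
```

Here the broad measure gives d ≈ 0.33 while the narrow measure gives d = 1.0 for an identical raw improvement, which is why comparing effect sizes across studies with different outcome measures is so hazardous.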

So where does this leave those colleagues who are interested in effect sizes and their usefulness in making decisions about which interventions to adopt or abandon within their schools?

1.     Under certain conditions, it might be possible to conclude that, on average, intervention X is more effective than intervention Y.

2.     However, that judgment will depend very much on the quality and trustworthiness of how the research was carried out and whether it was suitable for the questions under investigation.   See Gorard, See and Siddiqui (2017) for a discussion of scale, attrition, data quality and other threats to validity.

3.     If you are using a single study to explore whether a particular intervention might work within your context, there is a whole set of questions that you need to ask before coming to a decision to proceed; see, for example, Kvernbekk (2016):

a.     Can the intervention play the same causal role here as it did there?

b.     What were the support factors necessary for the intervention to work in other settings?

c.     Are the support factors available in your setting?

Effect sizes and meta-analyses

At this stage, we have yet to examine the place of effect sizes within meta-analyses, which involves another set of issues, more than ably articulated by Wiliam (2016). However, of particular interest are the words of Gene Glass – the creator of meta-analysis – who argues that 'the most important lessons that meta-analysis has taught us is that the impact of interventions is significantly smaller than their variability.'  Glass goes on to state: 'Meta-analysis has not lived up to its promises to produce incontrovertible facts that would lead education policy. What it has done is demonstrate that average impacts of interventions are relatively small and the variability of impacts is great.' (Glass, 2016). As such, context matters in significant ways, yet there is little understanding of these contextual influences, with perhaps as much as two-thirds of the variance between studies being unexplained.  In other words, meta-analysis tells you much less than you might want it to, and you need to go back to the original studies to examine the role of both causal mechanisms and support factors.
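Glass's point – that the variability of impacts can rival or exceed their average – is easy to illustrate. The sketch below uses ten invented effect sizes standing in for ten studies of the "same" intervention (hypothetical numbers, not real data):

```python
import statistics

# Hypothetical effect sizes from ten studies of the "same" intervention --
# invented numbers for illustration only
effects = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.55, 0.70, 0.90]

mean_effect = statistics.mean(effects)   # the headline number a meta-analysis reports
spread = statistics.stdev(effects)       # the between-study variability it can obscure
```

In this illustration, the headline mean is modest (about 0.36 of a standard deviation) while the between-study spread is nearly as large (about 0.28): the single averaged number conceals the fact that the intervention appeared close to useless in some studies and highly effective in others.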

And finally

Evidence-based practitioners face a major challenge in knowing when to trust the experts, or so-called experts. For me, how effect sizes are discussed in presentations, reports or research papers is an indicator of the trustworthiness and expertise of the authors or presenters. If effect sizes are discussed with no recognition or acknowledgement of their limitations, and of the circumstances under which they might allow you to draw a tentative conclusion, then this should be a warning sign about whether the material should be trusted. Be careful out there.


Glass, G. V. (2016). One Hundred Years of Research: Prudent Aspirations. Educational Researcher, 45(2), 69-72.

Gorard, S., See, B. and Siddiqui, N. (2017). The Trials of Evidence-Based Education. London: Routledge.

Jenicek, M. and Hitchcock, D. (2005). Evidence-Based Practice: Logic and Critical Thinking in Medicine. United States of America: American Medical Association Press.

Kvernbekk, T. (2016). Evidence-Based Practice in Education: Functions of Evidence and Causal Presuppositions. London: Routledge.

Toulmin, S. E. (2003). The Uses of Argument. Cambridge: Cambridge University Press.

Wiliam, D. (2016). Leadership for Teacher Learning. West Palm Beach: Learning Sciences International.