Thousands of days of INSET and a deluge of references to effect sizes - Evidence-based school leadership and management: A practical guide

Please note: since this blogpost was published, I have come across an updated version of Professor Kraft’s paper, which can be found here.

I’ll be posting an update to this blogpost in the coming weeks.

In England there are approximately 24,000 schools, which means that next week will see thousands of INSET/CPD days taking place. In all likelihood, in a great number of those sessions, someone leading the session will make some reference to effect sizes to compare different educational interventions. However, as appealing as it might be to use effect sizes to compare interventions, simply comparing effect sizes does not tell you anything useful about the importance of an intervention for you and your school. So to help you get a better understanding of how to use effect sizes, I’m going to draw upon the work of Kraft (2018), who has devised a range of questions that will help you interpret the effect size associated with a particular intervention. But as it is the start of the academic year, it might be useful to first revisit how an effect size is calculated and some existing benchmarks for interpreting effect sizes.

Calculating an effect size

In simple terms, an effect size is a ‘way of quantifying the difference between two groups’ (Coe, 2017, p. 339) and can be calculated using the following formula:

Effect size = ((Mean of experimental group) – (Mean of control group))/Pooled standard deviation

To illustrate how an effect size is calculated, Coe refers to the work of Dowson (2002), who attempted to demonstrate time-of-day effects on children’s learning. In other words, do children learn better in the morning or the afternoon?

Thirty-eight children were included in the intervention, with half (19) randomly allocated to listen to a story and respond to questions at 9.00 am, whereas the remaining 19 listened to the same story and answered the same questions at 3.00 pm.
The children’s understanding of the story was assessed using a test in which the number of correct answers was scored out of twenty.
The morning group had an average score of 15.2, whereas the afternoon group had an average score of 17.9, a difference of 2.7.
The effect size of the intervention can now be calculated, taking 3.3 as the pooled standard deviation: (17.9 - 15.2)/3.3 = 0.8 SD (to one decimal place).
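For readers who like to check the arithmetic, here is a minimal Python sketch of the calculation. The function names are my own, and the group standard deviations passed to `pooled_sd` are hypothetical: only the two means and the pooled standard deviation of 3.3 are quoted above, so the equal group SDs are an assumption made purely to show how pooling works.

```python
import math


def pooled_sd(sd_a, n_a, sd_b, n_b):
    """Pooled standard deviation of two groups, weighted by degrees of freedom."""
    return math.sqrt(
        ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    )


def effect_size(mean_experimental, mean_control, sd_pooled):
    """Standardised mean difference: (experimental mean - control mean) / pooled SD."""
    return (mean_experimental - mean_control) / sd_pooled


# Means and pooled SD as quoted in the example above
# (afternoon group treated as the experimental group).
d = effect_size(17.9, 15.2, 3.3)
print(round(d, 2))  # 0.82

# Hypothetical group SDs (not reported above): if both groups of 19 had
# an SD of 3.3, the pooled SD would also be 3.3.
print(round(pooled_sd(3.3, 19, 3.3, 19), 2))  # 3.3
```

Note that the same 2.7-mark difference would produce a much smaller effect size if the pupils’ scores were more spread out, which is one reason the raw difference and the effect size should not be confused.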

Benchmarks for interpreting effect sizes

A major challenge when interpreting effect sizes is that there is no generally agreed scale. One of the most widely used sets of benchmarks comes from the work of Jacob Cohen (Cohen, 1992), who defines small, medium and large effect sizes as 0.2, 0.5 and 0.8 SD respectively. However, this scale was derived to identify the sample size required to give yourself a reasonable chance of detecting an effect of that size, if it existed. As Cohen (1988) notes: “The terms ‘small,’ ‘medium,’ and ‘large’ are relative, not only to each other, but to the area of behavioural science, or even more particularly to the specific content of the research method being employed in any given investigation” (p. 25). As such, Cohen’s benchmarks should not be used to interpret the magnitude of an effect size.

Alternatively, you could use the ‘hinge-point’ of 0.4 SD put forward by John Hattie (Hattie, 2008), who reviewed over 800 meta-analyses and argues that the average effect size of all possible educational influences is 0.4 SD. Unfortunately, as Kraft (2018) notes, Hattie’s meta-analysis includes studies with small samples, weak research designs and proximal measurements, all of which result in larger effect sizes. As such, the 0.4 SD hinge point is in all likelihood an overestimate of the average effect size. Indeed, Lipsey et al. (2012) argue that, based on empirical distributions of effect sizes from comparable studies, an effect size of 0.25 SD in education research should be interpreted as large. Elsewhere, Cheung and Slavin (2015) found that the average effect sizes for interventions ranged from 0.11 to 0.32 SD, depending upon sample size and the comparison group.

You could also look at the work of Higgins et al. (2013) and the Education Endowment Foundation’s Teaching and Learning Toolkit, which suggest that low, moderate, high and very high effect sizes are -0.01 to 0.18, 0.19 to 0.44, 0.45 to 0.69, and 0.7 SD or more respectively. However, it’s important to note that this scale is based on the assumption that the effect size of a year’s worth of learning at elementary school is 1 SD, yet Bloom et al. (2008) found that for six-year-olds a year’s worth of growth is approximately 1.5 standard deviations, whereas for twelve-year-olds a year’s worth of growth is approximately 0.2 standard deviations.
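To see why the year’s-worth-of-learning assumption matters, here is a rough Python sketch that expresses the same effect size as a fraction of a year’s growth under each of the three growth figures mentioned above. The 0.2 SD effect size is purely illustrative, and dividing an effect size by annual growth is my own simplification rather than the Toolkit’s published conversion method.

```python
# Annual growth in standard deviations, as quoted in the paragraph above.
ANNUAL_GROWTH_SD = {
    "Toolkit assumption (1 SD per year)": 1.0,
    "six-year-olds (Bloom et al., 2008)": 1.5,
    "twelve-year-olds (Bloom et al., 2008)": 0.2,
}


def years_of_growth(effect_size, annual_growth_sd):
    """Express an effect size as a fraction of one year's typical growth."""
    return effect_size / annual_growth_sd


effect = 0.2  # illustrative effect size, not taken from any study

for group, growth in ANNUAL_GROWTH_SD.items():
    print(f"{group}: {years_of_growth(effect, growth):.2f} years")
```

On these figures, the same 0.2 SD effect is worth about an eighth of a year’s growth for six-year-olds but a whole year’s growth for twelve-year-olds, so a single scale applied across all ages can be badly misleading.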

Finally, you could refer to the work of Kraft (2018), who undertook an analysis of 481 effect sizes from 242 RCTs of education interventions with achievement outcomes and came up with the following effect size benchmarks for school pupils: less than 0.05 is small, 0.05 to less than 0.20 is medium, and 0.20 or greater is large. Nevertheless, as Kraft himself notes, ‘these are subjective but not arbitrary benchmarks. They are easy heuristics to remember that reflect the findings of recent meta-analyses’ (p. 18).
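To make the disagreement between these scales concrete, here is a small Python sketch that labels the same effect size under three of the benchmark sets described above. The thresholds are those quoted in this post. Treating each threshold as the lower bound of a band, and labelling anything below 0.05 (including negative effects) as ‘small’ under Kraft’s scheme, are my own simplifying assumptions, and the names `BENCHMARKS` and `label` are hypothetical.

```python
# Each scheme is a list of (lower bound, label) pairs, highest band first.
BENCHMARKS = {
    "Cohen (1992)": [(0.8, "large"), (0.5, "medium"), (0.2, "small")],
    "EEF Toolkit (Higgins et al., 2013)": [
        (0.7, "very high"),
        (0.45, "high"),
        (0.19, "moderate"),
        (-0.01, "low"),
    ],
    "Kraft (2018)": [
        (0.20, "large"),
        (0.05, "medium"),
        (float("-inf"), "small"),  # assumption: everything below 0.05
    ],
}


def label(effect_size, bands):
    """Return the label of the highest band whose lower bound is met."""
    for lower_bound, name in bands:
        if effect_size >= lower_bound:
            return name
    return "below the lowest band"


for es in (0.10, 0.25, 0.80):
    labels = {scheme: label(es, bands) for scheme, bands in BENCHMARKS.items()}
    print(es, labels)
```

An effect size of 0.25 SD comes out as ‘small’ on Cohen’s scale, ‘moderate’ on the Toolkit’s and ‘large’ on Kraft’s, which is exactly why a bare label such as ‘large’ tells you very little unless you know which scale is being used.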

Kraft’s Guidelines for interpreting effect sizes

The above would suggest that attempting to interpret effect sizes by the use of standardised benchmarks is not an easy task: depending on the scale, the threshold for a large effect size ranges from 0.2 to 0.8 SD. So if we go back to the 0.8 SD effect size Dowson found when looking at time-of-day effects on pupils’ learning, does this mean we have found an intervention with a large effect size, which you and your school should look to implement? Unfortunately, as much as you might like to think so, it’s not that straightforward. Effect sizes are determined not just by the effectiveness of the intervention but by a range of other factors; see Simpson (2017) for a detailed discussion. Fortunately, Kraft (2018) has identified a number of questions that you can ask to help you interpret effect sizes, and in Table 1 we will now apply these questions to Dowson’s findings.

As such, given the nature of the intervention and in particular given both the relatively short period of time between intervention and the measurement of the outcomes and the outcomes being closely aligned to the intervention, we should not be overly surprised that we have an effect size which might ‘at first-blush’ be interpreted as large.

Implications for teachers, school research leads and school leaders

One, it is necessary to be extremely careful to avoid simplistic interpretations of effect sizes. In particular, where you see Cohen’s benchmarks being used, this should set off alarm bells about the quality of the work you are reading.

Two, when interpreting the effect size of an intervention, particularly in single studies where the effect size is greater than 0.20 SD, it’s worth spending a little time applying Kraft’s set of questions to see if there are any factors contributing to upward pressure on the resulting effect size.

Three, when making judgments about an intervention, and whether it should be introduced into your school, the effect size is only one piece of the jigsaw. Even if an intervention has a relatively small effect size, it may still be worth implementing if the costs are relatively small, the benefits are quickly realised, and it does not require a substantial change in teachers’ behaviour.

Last but not least, no matter how large the effect size of an intervention, what matters are the problems that you face in your classroom, department or school. Large effect sizes for interventions that will not solve a problem you are faced with are, for you, largely irrelevant.

References

Bloom, H. S. et al. (2008) ‘Performance trajectories and performance gaps as achievement effect-size benchmarks for educational interventions’, Journal of Research on Educational Effectiveness, 1(4), pp. 289–328.

Cheung, A. and Slavin, R. E. (2015) ‘How methodological features affect effect sizes in education’, Best Evidence Encyclopedia, Johns Hopkins University, Baltimore, MD. Available online at: http://www.bestevidence.org/word/methodological_Sept_21_2015.pdf (accessed 18 February 2016).

Coe, R. (2017) ‘Effect size’, in Coe, R. et al. (eds) Research Methods and Methodologies in Education (2nd edition). London: SAGE.

Cohen, J. (1988) Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.

Cohen, J. (1992) ‘A power primer’, Psychological Bulletin, 112(1), pp. 155–159.

Dowson, V. (2002) Time of day effects in schoolchildren’s immediate and delayed recall of meaningful material, TERSE Report. CEM, University of Durham. Available at: http://www.cem.dur.ac.uk/ebeuk/research/terse/library.htm.

Hattie, J. (2008) Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London: Routledge.

Higgins, S. et al. (2013) The Sutton Trust-Education Endowment Foundation Teaching and Learning Toolkit Manual. London: Education Endowment Foundation.

Kraft, M. A. (2018) Interpreting effect sizes of education interventions. Brown University Working Paper. Downloaded Tuesday, April 16, 2019, from ….

Lipsey, M. W. et al. (2012) ‘Translating the Statistical Representation of the Effects of Education Interventions into More Readily Interpretable Forms’, National Center for Special Education Research. ERIC.

Simpson, A. (2017) ‘The misdirection of public policy: comparing and combining standardised effect sizes’, Journal of Education Policy. Routledge, pp. 1–17. doi: 10.1080/02680939.2017.1280183.