A major challenge for aspiring evidence-informed teachers is knowing when to
trust the experts. It would be easy to assume that just because you have come across a particular interpretation of a concept or idea in a number of different places – a book, a peer-reviewed article or a blog – it must be correct. Unfortunately, if you did this, you could well be making a mistake. For example, in recent weeks I have come across three examples – Churches and Dommett (2016), Firth (2018) and Ashman (2018) – where the meaning of p-values and statistical significance would appear to have been misinterpreted. Furthermore, as Gorard et al (2017) state, these mistakes are not uncommon. So to help aspiring school research leads and evidence-informed teachers spot where p-values and statistical significance have been misinterpreted, I will:

- Explain what is meant by the terms p-values and statistical significance
- Identify a number of common misconceptions about p-values and statistical significance
- Show how the work of Churches and Dommett, Firth and Ashman all fall foul of some of these misconceptions and misinterpretations
- Examine some of the implications for evidence-informed teachers.

**P values and statistical significance**

When seeking to understand these terms there are a number of major problems. As Greenland et al. (2016) state:

*'There are no interpretations of these concepts, which are at once simple, intuitive, correct, and foolproof'* (p337). Greenland et al. go on to illustrate their point by providing twenty-five examples of common misconceptions and misinterpretations of these terms, to which even professional academics are prone. Nevertheless, the American Statistical Association informally defines a p-value as: *the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.*
The smaller the p-value, the more surprising our results are if the null hypothesis (and the test assumptions) hold true; the larger the p-value, the less surprising our results are, given that the null hypothesis (and the test assumptions) hold true.
In other words, as Greenland et al. state:

*'The P value simply indicates the degree to which the data conform to the pattern predicted by the test hypothesis and all the other assumptions used in the test (the underlying statistical model). Thus P = 0.01 would indicate that the data are not very close to what the statistical model (including the null hypothesis) predicted they should be, while P = 0.40 would indicate that the data are much closer to the model prediction, allowing for chance variation'* (p340).

**Statistical Significance**

Put very simply, a result is often deemed to be statistically significant if the p-value is less than or equal to 0.05, although the level of statistical significance can be set at lower levels, for example, p less than or equal to 0.01.
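The ASA's informal definition can be made concrete with a small simulation. The sketch below (illustrative only; the pupil test scores are made-up numbers) computes a p-value by permutation: the proportion of random relabellings of the data, under the null model of no difference between groups, that produce a mean difference at least as extreme as the one observed.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10000, seed=1):
    """Two-sided permutation test for a difference in means.

    The p-value is the proportion of label shufflings (the null model:
    no real difference between groups) that give a mean difference at
    least as extreme as the observed one.
    """
    random.seed(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Hypothetical test scores for two small classes (made-up numbers)
p = permutation_p_value([72, 85, 78, 90, 66, 81], [70, 75, 68, 74, 72, 69])
print(p)
```

Note what this does and does not tell us: the p-value measures how unusual the observed summary would be *under the specified null model*, nothing more.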

**Interpreting p values and statistical significance – guidance from the American Statistical Association**

Given the difficulties in interpreting p-values and statistical significance, the American Statistical Association – Wasserstein and Lazar (2016) – has provided some guidance on how to avoid some common mistakes. This guidance is summarised in six principles:

- P-values can indicate how incompatible the data are with a specified statistical model.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- Proper inference requires full reporting and transparency.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a p-value does not provide a good
measure of evidence regarding a model or hypothesis.
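The fifth principle – that a p-value does not measure the size of an effect – can be illustrated with a little arithmetic. The sketch below uses a simple one-sample z-test with made-up numbers (an assumed known standard deviation, hypothetical effect sizes and sample sizes): a tiny effect reaches "significance" simply because the sample is large, while a much larger effect in a small sample does not.

```python
import math

def z_test_p(mean_diff, sd, n):
    """Two-sided p-value for a one-sample z-test of mean_diff against 0.

    Illustrative only: assumes a known standard deviation and a normal
    sampling distribution for the mean.
    """
    z = mean_diff / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|) under the null

# Tiny effect, huge sample: comfortably below the 0.05 threshold
p_small_effect = z_test_p(mean_diff=0.1, sd=1.0, n=2000)

# Eight times the effect, tiny sample: not "significant"
p_big_effect = z_test_p(mean_diff=0.8, sd=1.0, n=4)

print(p_small_effect, p_big_effect)
```

In other words, the same p-value threshold can be crossed by a trivial effect in a large study and missed by a substantial effect in a small one, which is why a p-value is no substitute for an effect size.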

**Some common misinterpretations – Churches, Dommett, Firth and Ashman**

I will now look at how Churches and Dommett, Firth and Ashman have – in my view – all misinterpreted either p-values or statistical significance.

**Richard Churches and Eleanor Dommett** – In their book *Teacher-Led Research: Designing and implementing randomised controlled trials and other forms of experimental research* they include the following definitions within their glossary of terms:

*p-value – Probability value – that is the probability that the result may have occurred by chance (e.g p = 0.001 – a 1 in 1000 probability that the result may have happened by chance) Also known as the significance level.*

*Significance – The probability that a change in score may have occurred by chance. A threshold for significance (alpha) is set at the start of piece of research. This is never less stringent than 0.05 ….*

Unfortunately, according to the ASA, both of these statements are incorrect. First, the p-value is a measure of the consistency of the results with a particular statistical model – with all the assumptions behind the model being maintained. Second, the p-value is not the probability that the data were produced by random chance alone, as it also depends on the accuracy of the assumptions underpinning the statistical model. Third, the definition of significance conflates scientific significance with statistical significance.

**Jonathan Firth** – Firth, J. (2018). *The Application of Spacing and Interleaving Approaches in the Classroom*. Impact. 1. 2.

In a recent edition of Impact, Jonathan Firth applies p-values and statistical significance to the use of spacing and interleaving in the classroom, in a study where an opportunity sample of 31 school pupils between 16 and 17 years of age was used.

*The mean percentages of correct answers on the end-of-task test for the interleaved and blocked conditions are shown in Figure 4. A between-subjects ANOVA was carried out. This analysis revealed a significant main effect of spacing (performance in the spaced condition being worse than the massed condition, with mean scores of 12.25 vs 9.45, p = .002), while interleaving did not have a significant main effect. Importantly, there was also a significant (p = .009) interaction between the two variables (spacing vs interleaving), indicating that interleaving had a mediating or protective effect against the difficulties caused by spacing (see Figure 5).*

*The findings demonstrated that spacing had a harmful effect on the immediate test, while the main effect of interleaving was neutral. The results fit with the idea that these are ‘desirable difficulties’, with the potential to impede learning in the short term.*

Again, according to the ASA, there are errors in both paragraphs. Statistical significance does not demonstrate whether a scientifically or substantively important relationship has been detected. Neither is statistical significance a property of the phenomenon being studied; it is a product of the consistency between the data and what would have been expected using the specified statistical model. In other words, the map is not the territory.

**Greg Ashman** – Ashman (2018). *The Article That England’s Chartered College Will Not Print*. Filling the Pail.

In a blogpost which criticises the EEF’s approach to both meta-cognition and meta-analysis, Greg also falls foul of the problems of interpreting p-values and statistical significance:

*If we focus only on the randomised controlled trials conducted by the EEF, the case for meta-cognition and self-regulation seems weak at best. Of the seven studies, only two appear to have statistically significant results. In three of the other studies, the results are not significant and in two more, significance was not even calculated. This matters because a test of statistical significance tells us how likely we would be to collect this particular set of data if there really was no effect from the intervention. If results are not statistically significant then they could well have arisen by chance*.

Again, using the ASA’s guidance, there are a number of errors in this statement. First, statistical significance – or rather the lack of it – does not tell us that there was no effect from the intervention; it tells us only how consistent the data are with the specified statistical model. Second, whether or not the results are statistically significant, the p-value does not tell us that the results have arisen by chance. It is a statement about the data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself. In other words, it is a statement about the results of the study relative to a particular statistical model.
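The point that "not significant" does not mean "no effect" can be shown with a simulation. In the hypothetical scenario below (a made-up true effect of 0.3 standard deviations, small trials of 20 pupils, a simple z-test), the intervention genuinely works, yet most trials still fail to reach p < 0.05 because the studies are underpowered.

```python
import math
import random

def trial_p_value(true_effect, n, sd=1.0):
    """Simulate one small trial in which the intervention truly works
    (true_effect > 0) and return a two-sided z-test p-value."""
    sample = [random.gauss(true_effect, sd) for _ in range(n)]
    mean = sum(sample) / n
    z = mean / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(42)
# A genuine 0.3 SD effect, measured in 1000 trials of only 20 pupils each
misses = sum(trial_p_value(0.3, 20) > 0.05 for _ in range(1000))
print(f"Roughly {misses / 10:.0f}% of trials with a real effect were 'not significant'")
```

Here a nonsignificant result plainly does not mean the results "could well have arisen by chance": the effect is real in every simulated trial, and the failures to reach significance are a property of the small samples, not of the phenomenon.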

**Where does this leave us?**

First, p-values and statistical significance are slippery concepts, which take time and effort even to begin to understand, never mind master. Indeed, you may need to forget what you have already learnt at university on undergraduate or postgraduate courses.

Second, misuse of p-values and statistical significance is not uncommon, so it is something you have to watch out for when reading quantitative research reports. So keep the ASA principles to hand to see if they are being misapplied in research reports. You don’t have to understand something and how it works (though it helps) to be able to spot its misuse.

Third, just because you come across something in a variety of formats – book, peer-reviewed article or blog – and from a variety of authors – university researchers or school teachers – does not mean it is correct.

Fourth, I am not making comments about the personal integrity of any of the authors I have criticised. These comments should be seen as ‘business not personal’ and are a genuine attempt to increase the research literacy of teachers and school leaders. Being an evidence-informed teacher or school leader is hard enough when you are using the right tools, never mind the wrong ones.

**And finally**, it’s worth remembering the words of Greenland, et al. (2016) who state:

*‘In closing, we note that no statistical method is immune to misinterpretation and misuse, but prudent users of statistics will avoid approaches especially prone to serious abuse. In this regard, we join others in singling out the degradation of P values into ‘‘significant’’ and ‘‘nonsignificant’’ as an especially pernicious statistical practice.’ p348.*

**References**

Ashman, G. (2018). *The Article That England’s Chartered College Will Not Print*. Filling the Pail. https://gregashman.wordpress.com/2018/04/17/the-article-that-englands-chartered-college-will-not-print/. 21 April, 2018.

Churches, R. and Dommett, E. (2016). *Teacher-Led Research: Designing and Implementing Randomised Controlled Trials and Other Forms of Experimental Research*. London. Crown House Publishing.

Firth, J. (2018). *The Application of Spacing and Interleaving Approaches in the Classroom*. Impact. 1. 2.

Gorard, S., See, B. and Siddiqui, N. (2017). *The Trials of Evidence-Based Education*. London. Routledge.

Greenland, S., Senn, S., Rothman, K., Carlin, J., Poole, C., Goodman, S. and Altman, D. (2016). *Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations*. European Journal of Epidemiology. 31. 4. 337-350.

Wasserstein, R. and Lazar, N. (2016). *The ASA's Statement on P-Values: Context, Process, and Purpose*. The American Statistician. 70. 2. 129-133.