The Myth of Significance Testing

When I decided to leave work and go to university to study psychology, I did so because of a genuine fascination with the study of human behaviour, thought, and emotion. Like many, I was drawn to the discipline not by the allure of science but by the writings of Freud, Jung, Maslow, and Fromm. I believed at the time that the discipline was as much philosophy as it was science, and I had the romantic notion of sitting in the quad talking theory with my classmates.

Unfortunately, from day one I was introduced not to the theory of psychology but to the maths of psychology. This, I was told, was the heart of the discipline, and supporting evidence came not from the strength of the theory but from the numbers. It did not matter that as an 18-year-old male I was supremely conscious of the power of the libido. Unless it could be demonstrated on a Likert scale, it did not exist. The gold standard of supporting evidence was significance testing.

I always struggled with the notion that the significance test (ST) was as significant as my professors would have me believe. However, it was not until I completed my postgraduate diploma in applied statistics that the folly of ST truly came home to me. There, for the first time, I was introduced to the concept of fishing for results and to techniques such as the Bonferroni correction. Moreover, I came to understand how paltry many findings in psychology were, and that trying to establish the robustness of such findings through a significance test was somewhat oxymoronic.

In 2012 a seminal paper on this topic came out, and I would encourage everyone who works in our field to be aware of it. This is indeed the myth for this month, the myth of significance testing:

Lambdin, C. (2012). Significance tests as sorcery: Science is empirical – significance tests are not. Theory & Psychology, 22(1), 67-90.

Abstract

Since the 1930s, many of our top methodologists have argued that significance tests are not conducive to science. Bakan (1966) believed that “everyone knows this” and that we slavishly lean on the crutch of significance testing because, if we didn’t, much of psychology would simply fall apart. If he was right, then significance testing is tantamount to psychology’s “dirty little secret.” This paper will revisit and summarize the arguments of those who have been trying to tell us— for more than 70 years—that p values are not empirical. If these arguments are sound, then the continuing popularity of significance tests in our peer-reviewed journals is at best embarrassing and at worst intellectually dishonest.

The paper is a relatively easy read and the arguments are simple to understand:

“… Lykken (1968), who argues that many correlations in psychology have effect sizes so small that it is questionable whether they constitute actual relationships above the “ambient correlation noise” that is always present in the real world. Blinkhorn and Johnson (1990) persuasively argue, for instance, that a shift away from “culling tabular asterisks” in psychology would likely cause the entire field of personality testing to disappear altogether. Looking at a table of results and highlighting which ones are significant is, after all, akin to throwing quarters in the air and noting which ones land heads.” (In other words, fishing for results.)
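To put a rough number on the quarter-tossing analogy, here is a minimal sketch of my own (it is not from Lambdin's paper). It assumes every test in a table is independent and that no real effect exists anywhere, then asks how likely it is that at least one result earns an asterisk by chance alone. The function name and the 0.05 threshold are my own choices for illustration.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests,
    assuming every null hypothesis is actually true."""
    # Each test avoids a false positive with probability (1 - alpha);
    # all of them must avoid it for the table to come up clean.
    return 1 - (1 - alpha) ** num_tests


for m in (1, 5, 20, 100):
    rate = familywise_error_rate(m)
    print(f"{m:>3} tests: {rate:.0%} chance of at least one spurious asterisk")
```

Under these assumptions, a table of 20 unrelated tests has roughly a 64% chance of containing at least one "significant" result when nothing real is going on. Real tests are rarely fully independent, so treat the figure as indicative only.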

The impact of this paper for so much of the discipline cannot be overstated. In an attempt to claim a level of credibility beyond its station, the psychological literature has bordered on the downright fraudulent in making sweeping claims from weak but significant results. The risk is that our discipline becomes the laughing stock of future generations, who will see that the emperor currently parading as science has no clothes.

“ … The most unfortunate consequence of psychology’s obsession with NHST is nothing less than the sad state of our entire body of literature. Our morbid overreliance on significance testing has left in its wake a body of literature so rife with contradictions that peer-reviewed “findings” can quite easily be culled to back almost any position, no matter how absurd or fantastic. Such positions, which all taken together are contradictory, typically yield embarrassingly little predictive power, and fail to gel into any sort of cohesive picture of reality, are nevertheless separately propped up by their own individual lists of supportive references. All this is foolhardily done while blissfully ignoring the fact that the tallying of supportive references—a practice which Taleb (2007) calls “naïve empiricism”—is not actually scientific. It is the quality of the evidence and the validity and soundness of the arguments that matters, not how many authors are in agreement. Science is not a democracy.

It would be difficult to overstress this point. Card sharps can stack decks so that arranged sequences of cards appear randomly shuffled. Researchers can stack data so that random numbers seem to be convincing patterns of evidence, and often end up doing just that wholly without intention. The bitter irony of it all is that our peer-reviewed journals, our hallmark of what counts as scientific writing, are partly to blame. They do, after all, help keep the tyranny of NHST alive, and “[t]he end result is that our literature is comprised mainly of uncorroborated, one-shot studies whose value is questionable for academics and practitioners alike” (Hubbard & Armstrong, 2006, p. 115).” (p. 82)

Is there a solution to this madness? Using the psychometric testing industry as a case in point, I believe the solution is multi-pronged. STs will continue to be part of our supporting literature, as they are a requirement of the marketplace and without them test publishers will not be viewed as credible. However, through education such as training for test users, this can be balanced so that the reality of STs is better understood. This will include understanding the true variance that is accounted for in tests of correlation, so that the true significance of the significance test is understood! This will need to be matched with an understanding of the importance of theory building when testing a hypothesis, and of required adjustments such as the Bonferroni correction when conducting multiple tests with one set of data. Finally, in keeping with the theme of this series of blogs, the key is to treat the discipline as a craft, not a science. Building theory, applying results in unique and meaningful ways, and being focussed on practical outcomes is more important and more reflective of sound practice than militant adherence to a significance test.
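For test users who like to see the arithmetic, the two ideas above (variance accounted for, and the Bonferroni correction) can be sketched in a few lines. This is an illustrative sketch only: the function names, the correlation of 0.2, and the p values are all hypothetical and not drawn from any real test manual.

```python
def variance_explained(r):
    """Proportion of variance two measures share, given their correlation r."""
    return r ** 2


def bonferroni_alpha(alpha, num_tests):
    """Per-test threshold that keeps the overall error rate at or below alpha."""
    return alpha / num_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which p values survive a Bonferroni-adjusted threshold."""
    if not p_values:
        return []
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Hypothetical example: a "significant" correlation of 0.2 shares only
# about 4% of variance with the outcome it is meant to predict.
print(f"r = 0.2 explains {variance_explained(0.2):.0%} of the variance")

# Hypothetical p values from four tests run on one data set.
p_values = [0.001, 0.02, 0.04, 0.30]
print(significant_after_bonferroni(p_values))
```

With four tests, the adjusted threshold is 0.05 / 4 = 0.0125, so only the first of these hypothetical results would still count as significant, although three of the four would have earned an asterisk at the usual 0.05 level.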

For those interested in understanding how to use statistics as a craft to formulate applied solutions I strongly recommend this book.

This article is just out. It seems that there may be hope for the discipline yet.