Stats: We’re Doing It Wrong

It's quite common for HCI or computer science education researchers to use attitude questionnaires to examine people's opinions of new software or teaching interventions. These often use a Likert-type scale running from "strongly agree" to "strongly disagree." And the sad truth is that researchers typically use the wrong statistical techniques to analyse them. Kaptein, Nass, & Markopoulos (2010) published a paper at CHI last year which found that in the previous year's CHI proceedings, 45% of the papers reported on Likert-type data but only 8% used non-parametric stats to do the analysis. 95% reported on small sample sizes (under 50 people). This is statistically problematic even if it gets past reviewers! Here's why.

Likert scales give ordinal data. That is, the data is ranked: "strongly agree" is usually better than "agree." However, it's not interval data. You can't say that the distance between "strongly agree" and "agree" is the same as the distance between "neutral" and "disagree," for example. People tend to think there is a bigger difference between items at the extremes of the scale than in the middle (there is some evidence cited in Kaptein's paper that this is the case). For ordinal data, one should use non-parametric statistical tests, which do not assume a normal distribution of the data (so 92% of the CHI papers got that wrong!). Furthermore, because the distances between scale points are not equal, it makes no sense to report means of Likert scale data; you should report the mode (the entry which occurs most frequently in the data set).
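To make the point about summary statistics concrete, here is a minimal Python sketch. The responses are made up purely for illustration, and the 1 to 5 coding is an assumption about how the questionnaire was scored. It reports the mode, and also the median, which likewise relies only on the rank order of the categories. The mean is printed only to show how it lands between categories.

```python
from statistics import mean, median_low, multimode

# Hypothetical responses to a single Likert item (illustrative data only).
# The integers are labels for ranked categories, not measurements.
LABELS = {
    1: "strongly disagree",
    2: "disagree",
    3: "neutral",
    4: "agree",
    5: "strongly agree",
}
responses = [5, 5, 4, 5, 1, 2, 5, 4, 1, 5, 3]

# multimode returns every most-frequent category, so ties are not hidden.
modes = multimode(responses)

# median_low always returns an actual category, even for an even-sized sample.
middle = median_low(responses)

print("mode(s):", [LABELS[m] for m in modes])
print("median: ", LABELS[middle])

# The mean treats the gaps between categories as equal, which ordinal data
# does not justify; the result falls between two labels.
print("mean:    %.2f (not a category on the scale)" % mean(responses))
```

With these made-up responses the mode is "strongly agree" and the median is "agree", while the mean falls somewhere between "neutral" and "agree" and corresponds to nothing a participant could actually have answered.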

Which classic non-parametric tests should you use? I strongly recommend the flow chart on p. 274 of "How to Design and Report Experiments" by Field and Hole. This textbook is also pretty good at explaining how to do the tests in SPSS and how to report the results. It also mentions how to calculate effect sizes (see later).

Why is it so common to use parametric tests such as the t-test or ANOVA instead of their non-parametric counterparts? Kaptein, Nass, & Markopoulos (2010) suggest it is because HCI researchers know that non-parametric tests lack power. This means they are worried that the non-parametric tests will fail to find an effect where one exists. They also suggest it is because there aren't handy non-parametric tests which let you analyse factorial designs. So what's a researcher to do?

Robust Modern Statistical Methods

It turns out that statisticians have been busy in the last 40 years inventing improved tests that are not vulnerable to the various problems that classic parametric tests stumble across with real world data, and which are also at least as powerful as classic parametric tests (Erceg-Hurn & Mirosevich, 2008). Why this is not mentioned in psychology textbooks is not clear to me. It must be quite annoying for statisticians to have their research ignored!

A catch with modern robust statistical methods is that you can't use SPSS to do them. You have to start messing around with extra packages in R or SAS, which are slightly more frightening than SPSS, which itself is not a model of usability. Erceg-Hurn & Mirosevich (2008) and Kaptein, Nass, & Markopoulos (2010) both describe the ANOVA-type statistic, which is powerful, usable in factorial designs, and works for non-parametric data.

By the way, a lot of interval data from behavioral research (such as reaction times) does not have a normal distribution or is heteroscedastic (groups have unequal variance), and so shouldn't be analysed with classic parametric tests either. To make matters worse, the tests which people typically use to check the normality or heteroscedasticity of data are not reliable when both problems are present. So, basically, you should always run modern robust tests in preference to the classic ones.

I have come to the sad conclusion that I am going to have to learn R. However, at least it is free and there is a package called nparLD which does ANOVA-type statistics. Kaptein's paper gives an example of such an analysis, which I am currently practicing with. I must say he is a very helpful author indeed!

Effect Size

You might think that this is the end of the statistical jiggery pokery required to publish some seemingly simple results correctly. Uh-uh. It gets more complicated. The APA style guidelines require authors to publish effect size as well as significance results. What's the difference? Significance testing checks to see if differences in the means could have occurred by chance alone. Effect size tells you how big the difference was between the groups.

Randolph, Julnes, & Lehman (2008), in what amounts to a giant complaint about the reporting practices of researchers in computer science education, pointed out that the way stats are reported by CSE folk doesn't contain enough information, and that missing effect sizes are one problem. Apparently it isn't just us: Paul Ellis reports similar results with psychologists. And they ought to know better. Ellis also comments that there is a viewpoint that not reporting effect size is tantamount to withholding evidence. Yikes!

Robert Coe has a useful article on what effect size is, why it matters, and which measures one can use. Researchers often use Cohen's d or the correlation coefficient r as a measure of effect size. For Cohen's d, there is even a handy way of saying whether the effect size is small, medium or big. Unfortunately, if you have non-parametric data, effect size reporting seems to get more tricky, and Cohen's way of interpreting the size of an effect no longer makes sense (indeed, some people question whether it makes sense at all). Also, it is hard for non-experts to understand.

Common Language Effect Sizes, or Probability of Superiority statistics, can solve this problem (Grissom, 1994). The probability of superiority is "the probability that a randomly sampled member of a population given one treatment will have a score (y1) that is higher on the dependent variable than that of a randomly sampled member of a population given another treatment (y2)" (Grissom, 1994, p. 314). As an example from Robert Coe, consider a common language effect size of 0.92 in a comparison of heights of males and females. In other words, "in 92 out of 100 blind dates among young adults, the male will be taller than the female."

If you have Likert data with an independent design and you want to report an effect size, it's quite easy. SPSS won't do it for you, but you can do it with Excel: PS = U/(mn), where U is the Mann-Whitney U result, m is the number of people in condition 1, and n is the number of people in condition 2 (Grissom, 1994). If you have a repeated measures design, refer to Grissom and Kim (2006, p. 115): PSdep = w/n, where n is the number of participants and w is the number of "wins," that is, cases where the score was higher in the second measure than in the first. Grissom (1994) has a handy table for converting between probability of superiority and Cohen's d, as well as a way of interpreting the size of the effect.
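The two formulas, PS = U/(mn) and PSdep = w/n, are simple enough to sketch in a few lines of Python rather than Excel. This is only my own illustration under stated assumptions: the function names and the scores are made up, U is computed by counting pairs directly, ties between groups are counted as half a "win" (the usual convention for U, which I am assuming carries over here), and tied pairs in the repeated measures case are simply not counted as wins.

```python
def mann_whitney_u(group1, group2):
    """U for group1: number of (x, y) pairs where x beats y.

    Assumption: a tied pair contributes 0.5, the usual convention for U.
    """
    u = 0.0
    for x in group1:
        for y in group2:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def ps_independent(group1, group2):
    """Probability of superiority for an independent design: PS = U / (m * n)."""
    m, n = len(group1), len(group2)
    return mann_whitney_u(group1, group2) / (m * n)


def ps_dependent(first, second):
    """Repeated measures version: PSdep = w / n.

    w counts participants whose second score is higher than their first;
    n is the number of participants. Ties are not counted as wins here.
    """
    if len(first) != len(second):
        raise ValueError("each participant needs a first and a second score")
    wins = sum(1 for a, b in zip(first, second) if b > a)
    return wins / len(first)


# Hypothetical Likert scores (1-5) for two separate groups of people.
group_a = [4, 5, 3, 5, 4]
group_b = [2, 3, 3, 1, 4]
print(f"PS    = {ps_independent(group_a, group_b):.2f}")  # prints PS    = 0.88

# Hypothetical scores from the same six people, measured twice.
before = [2, 3, 3, 4, 2, 5]
after = [4, 3, 5, 5, 1, 5]
print(f"PSdep = {ps_dependent(before, after):.2f}")  # prints PSdep = 0.50
```

If you take U from a stats package instead of computing it yourself, check which group the reported U refers to. As far as I can tell, some packages report the smaller of the two possible U values, which would give you 1 - PS rather than PS.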

I'm not a stats expert in any way. This is just my current understanding of the topic from recent reading, although I have one or two remaining questions. I welcome any corrections from stats geniuses! I hope it is useful but I suspect colleagues will hate me for bringing it up. I hate myself for reading any of this in the first place. It's much easier to do things incorrectly.

References

Erceg-Hurn, D. M., & Mirosevich, V. M. (2008). Modern robust statistical methods: an easy way to maximize the accuracy and power of your research. American Psychologist, 63(7), 591-601. doi: 10.1037/0003-066X.63.7.591.

Grissom, R. J. (1994). Probability of the superior outcome of one treatment over another. Journal of Applied Psychology, 79(2), 314-316. doi: 10.1037/0021-9010.79.2.314.

Kaptein, M. C., Nass, C., & Markopoulos, P. (2010). Powerful and consistent analysis of Likert-type rating scales. Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI ’10) (p. 2391). New York, New York, USA: ACM Press. doi: 10.1145/1753326.1753686.