# Stats: We're Doing It Wrong

It's quite common for HCI or computer science education researchers to use attitude questionnaires to examine people's opinions of new software or teaching interventions. These are often on a Likert-type scale of strongly agree to strongly disagree. And the sad truth is that researchers typically use the wrong statistical techniques to analyse them. Kaptein, Nass, & Markopoulos (2010) published a paper at CHI last year which found that in the previous year's CHI proceedings, 45% of the papers reported on Likert-type data but only 8% used non-parametric stats to do the analysis. 95% reported on small sample sizes (under 50 people). This is statistically problematic even if it gets past reviewers! Here's why.

Likert scales give ordinal data. That is, the data is ranked: "strongly agree" is usually better than "agree." However, it's not interval data. You can't say the distance between "strongly agree" and "agree" is the same as the distance between "neutral" and "disagree," for example. People tend to think there is a bigger difference between items at the extremes of the scale than in the middle (there is some evidence cited in Kaptein's paper that this is the case). **For ordinal data, one should use non-parametric statistical tests** (so 92% of the CHI papers got that wrong!) which do not assume a normal distribution of the data. **Furthermore, because of this it makes no sense to report means of Likert scale data: you should report the mode** (the entry which occurs most frequently in the data set).
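To make that concrete, here is a minimal Python sketch of summarising one Likert item without computing a mean. The responses and the 1 to 5 coding are made up for illustration. Reporting the median alongside the mode is my own addition, since the median only relies on the ordering of the categories.

```python
from collections import Counter
from statistics import median_low, multimode

# Hypothetical coding: 1 = strongly disagree ... 5 = strongly agree.
LABELS = {
    1: "strongly disagree",
    2: "disagree",
    3: "neutral",
    4: "agree",
    5: "strongly agree",
}

# Made-up responses to a single questionnaire item.
responses = [5, 4, 4, 2, 4, 5, 3, 4, 1, 4, 5, 2]


def summarise_likert(values):
    """Summarise ordinal responses with counts, mode(s) and median, not a mean."""
    counts = Counter(values)
    return {
        # Frequency of every category, including ones nobody chose.
        "counts": {LABELS[k]: counts.get(k, 0) for k in sorted(LABELS)},
        # multimode returns every most-frequent value, so ties are visible.
        "modes": [LABELS[m] for m in multimode(values)],
        # median_low always returns an actual category, never a value in between.
        "median": LABELS[median_low(values)],
    }


summary = summarise_likert(responses)
print(summary["counts"])
print("mode(s):", summary["modes"], "median:", summary["median"])
```

Showing the full table of counts is often the most honest summary for a small sample, because the reader can see the shape of the responses directly.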

Which classic non-parametric tests should you use? I strongly recommend the flow chart on p. 274 of "How to Design and Report Experiments" by Field and Hole. This text book is also pretty good for explaining how to do the tests in SPSS and how to report the results. It also mentions how to calculate effect sizes (see later).
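As an illustration of one classic non-parametric test, here is a rough Python sketch of the Mann-Whitney U test for two independent groups. It uses the normal approximation with a correction for ties and no continuity correction, which is a simplification on my part: with samples this small an exact test from a proper stats package is preferable. The group names and the data are invented.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U for x, z, two-sided p) using the tie-corrected normal approximation."""
    m, n = len(x), len(y)
    total = m + n
    tie_counts = Counter(x + y)

    # Give each distinct value the average of the ranks it occupies (midrank).
    midrank = {}
    position = 0
    for value in sorted(tie_counts):
        t = tie_counts[value]
        midrank[value] = position + (t + 1) / 2
        position += t

    # U for x: how often an x score beats a y score, with ties counted as half.
    rank_sum_x = sum(midrank[v] for v in x)
    u_x = rank_sum_x - m * (m + 1) / 2

    # Standard deviation of U under the null hypothesis, corrected for ties.
    tie_term = sum(t**3 - t for t in tie_counts.values()) / (total * (total - 1))
    sigma = sqrt(m * n / 12 * ((total + 1) - tie_term))
    if sigma == 0:
        # Every response is identical, so there is nothing to test.
        return u_x, 0.0, 1.0

    z = (u_x - m * n / 2) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, z, p


# Made-up ratings (1 to 5) from two independent groups of eight participants.
old_ui = [2, 3, 3, 4, 2, 3, 1, 3]
new_ui = [4, 5, 3, 4, 5, 4, 3, 5]

u, z, p = mann_whitney_u(new_ui, old_ui)
print(f"U = {u}, z = {z:.2f}, p = {p:.4f}")
```

The tie correction matters for Likert-type data because five response options inevitably produce many tied scores.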

Why is it so common to use parametric tests such as the t-test or ANOVA instead of their non-parametric counterparts? Kaptein, Nass, & Markopoulos (2010) suggest it is because HCI researchers know that non-parametric tests lack power. This means they are worried that the non-parametric tests will fail to find an effect where one exists. They also suggest it is because there aren't handy non-parametric tests which let you do analysis of factorial designs. So what's a researcher to do?

**Robust Modern Statistical Methods**

It turns out that statisticians have been busy in the last 40 years inventing improved tests that are not vulnerable to various problems that classic parametric tests stumble across with real world data and which are also at least as powerful as classic parametric tests (Erceg-Hurn & Mirosevich, 2008). Why this is not mentioned in psychology text books is not clear to me. It must be quite annoying for statisticians to have their research ignored! A catch about modern robust statistical methods is that you can't use SPSS to do them. You have to start messing around with extra packages in R or SAS, which are slightly more frightening than SPSS, which itself is not a model of usability. Erceg-Hurn & Mirosevich (2008) and Kaptein, Nass, & Markopoulos (2010) both describe the ANOVA-type statistic, which is powerful, usable in factorial designs, and works for non-parametric data. By the way, a lot of interval data from behavioral research (such as reaction times) does not have a normal distribution or is heteroscedastic (groups have unequal variance), and so shouldn't be analysed with classic parametric tests either. To make matters worse, the tests which people typically use to check the normality or heteroscedasticity of data are not reliable when non-normality and heteroscedasticity are both present. **So, basically, you should always run modern robust tests in preference to the classic ones**. I have come to the sad conclusion that I am going to have to learn R. However, at least it is free and there is a package called nparLD which computes ANOVA-type statistics. Kaptein's paper gives an example of such analysis, which I am currently practicing with. I must say he is a very helpful author indeed!

**Effect Size**

You might think that this is the end of the statistical jiggery pokery required to publish some seemingly simple results correctly. Uh-uh. It gets more complicated. **The APA style guidelines require authors to publish effect size as well as significance results**. What's the difference? Significance testing checks to see if differences in the means could have occurred by chance alone. Effect size tells you how big the difference was between the groups. Randolph, Julnes, & Lehman (2008), in what amounts to a giant complaint about the reporting practices of researchers in computer science education, pointed out that the way stats are reported by CSE folk doesn't contain enough information, and missing effect sizes is one problem. Apparently it isn't just us: Paul Ellis reports similar results with psychologists. And they ought to know better. Ellis also comments that there is a viewpoint that not reporting effect size is tantamount to withholding evidence. Yikes! Robert Coe has a useful article on what effect size is, why it matters, and which measures one can use. Researchers often use Cohen's d or the correlation coefficient r as a measure of effect size. For Cohen's d, there is even a handy way of saying whether the effect size is small, medium or big. Unfortunately, if you have non-parametric data, effect size reporting seems to get more tricky, and Cohen's way of interpreting the size of effect no longer makes sense (indeed, some people question whether it makes sense at all). Also, it is hard for non-experts to understand. **Common Language Effect Sizes, or Probability of Superiority statistics** can solve this problem (Grissom, 1994). The probability of superiority is "the probability that a randomly sampled member of a population given one treatment will have a score (y1) that is higher on the dependent variable than that of a randomly sampled member of a population given another treatment (y2)" (Grissom, 1994, p. 314).
As an example from Robert Coe, consider a common language effect size of 0.92 in a comparison of heights of males and females. In other words "in 92 out of 100 blind dates among young adults, the male will be taller than the female." If you have Likert data with an independent design and you want to report an effect size, it's quite easy. SPSS won't do it for you, but you can do it with Excel: PS = U/(mn), where U is the Mann-Whitney U result, m is the number of people in condition 1, and n is the number of people in condition 2 (Grissom, 1994). If you have a repeated measures design, refer to Grissom and Kim (2006, p. 115): PSdep = w/n, where n is the number of participants and w refers to "wins" where the score was higher in the second measure compared to the first. Grissom (1994) has a handy table for converting between probability of superiority and Cohen's d, as well as a way of interpreting the size of the effect.
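If you'd rather not use Excel, the two formulas above can be sketched in a few lines of Python. The function names and data are made up for illustration. For the independent design, the sketch gets U by counting pairwise "wins", with ties counted as half a win, which matches the rank-based U for the first group. For the repeated measures design, the formula doesn't say how to treat ties, so counting them as non-wins here is my assumption.

```python
def probability_of_superiority(group1, group2):
    """PS = U / (m * n): chance a random group1 score beats a random group2 score."""
    # U for group1: one point per pair that group1 wins, half a point per tie.
    u = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group1
        for b in group2
    )
    return u / (len(group1) * len(group2))


def probability_of_superiority_dependent(first, second):
    """PSdep = w / n for paired scores; w counts participants who scored higher the second time."""
    if len(first) != len(second):
        raise ValueError("need one score per participant in each condition")
    # Assumption: ties are counted as non-wins.
    wins = sum(1 for before, after in zip(first, second) if after > before)
    return wins / len(first)


# Made-up ratings (1 to 5); treat them as two independent groups first.
old_ui = [2, 3, 3, 4, 2, 3, 1, 3]
new_ui = [4, 5, 3, 4, 5, 4, 3, 5]
ps = probability_of_superiority(new_ui, old_ui)

# The same numbers read as before/after scores from eight participants.
ps_dep = probability_of_superiority_dependent(old_ui, new_ui)

print(f"PS = {ps:.2f}, PSdep = {ps_dep:.2f}")
```

If you take U from a stats package instead, check which group it was computed for: the U for the other group gives 1 - PS.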

I'm not a stats expert in any way. This is just my current understanding of the topic from recent reading, although I have one or two remaining questions. I welcome any corrections from stats geniuses! I hope it is useful but I suspect colleagues will hate me for bringing it up. I hate myself for reading any of this in the first place. It's much easier to do things incorrectly.

**References**

Erceg-Hurn, D. M., & Mirosevich, V. M. (2008). Modern robust statistical methods: an easy way to maximize the accuracy and power of your research. *American Psychologist*, *63*(7), 591-601. doi: 10.1037/0003-066X.63.7.591.

Grissom, R. J. (1994). Probability of the superior outcome of one treatment over another. *Journal of Applied Psychology*, *79*(2), 314-316. doi: 10.1037/0021-9010.79.2.314.

Kaptein, M. C., Nass, C., & Markopoulos, P. (2010). Powerful and consistent analysis of Likert-type rating scales. *Proceedings of the 28th international conference on Human factors in computing systems - CHI ’10* (p. 2391). New York, New York, USA: ACM Press. doi: 10.1145/1753326.1753686.

Yay! I hear you! Superficial and just plain completely wrong statistical analyses are the bane of the field and will only serve to deteriorate the credibility of our research. By providing a useful and friendly discussion of this very important topic, you'll surely improve the field significantly and with large effect sizes :-)

By replacing ANOVA by non-parametric or robust statistics we risk ending up in another local maximum.

Robust statistics are just another way to squeeze your data into a shape appropriate for the infamous "minimizing sum of squares" statistics.

Those had their rise in the 20th century because they were computable by pure brain power (or the ridiculously slow computers in those times).

If HCI researchers and psychologists would just learn their tools and acknowledge the progress achieved in econometrics or biostatistics, for example:

Linear regression and model selection strategies are there to replace the one-by-one null hypothesis testing with subsequent adjustment of alpha-level.

With maximum likelihood estimation one no longer needs to worry about Gaussian error terms. Just use Poisson regression for counts and logistic regression for binary outcomes.

The latter can also model Likert scales appropriately - with meaningful parameters and in multifactorial designs.

Once you start discovering this world of modern regression techniques you start seeing more in your data than just a number of means and their differences.

You start seeing its shape and begin reasoning about the underlying processes. This can truly be a source of inspiration.

-- Martin Schmettow

Yeah! You are right Martin!

Cédric Bach from Limassol, Cyprus (TwinTide meeting).

Thanks for this Martin. Please could you recommend a text book or introductory paper for readers who might want to look into this further.

Cheers, Judy

I think a bigger problem is that we expect to be able to measure psychological constructs with single questions. In fact, it is much better practice to ask several questions and perform a factor analysis on the results. This produces more robust and invariant measures.

- Bart Knijnenburg

Department of Informatics, UC Irvine

Another pet peeve: the name "Likert scale" is actually a misnomer. Dr. Likert's original scale was very different from the 5-point rating scales we currently use.

- Bart

I'm glad to see more people raising significant issues about significance testing.

A couple of other references, of potential interest:

A statistical test, statistical significance, gets its closeup, by Carl Bialik (The Numbers Guy) at the Wall Street Journal

Dean Eckles offers some perspectives on the application and misapplication of statistics in social psychology in a recent post on his blog.

And on a meta-level, regarding significance and effects, I found a quote by Bill Lipscomb in a remembrance I heard on NPR this past weekend, Nobel Prize-Winning Chemist Dies At 91, to be inspiring:

"It's not a disgrace in science to publish something that's wrong. What is bad is to publish something that's not very interesting"

Great article.

I also agree with Martin, and a great reference for follow up on that is the book: "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Gelman and Hill.

However, not everyone explores these issues in depth, and the CHI paper was aimed at making researchers aware of the variability in outcomes from the (extremely) common 2x2 ANOVA when violating assumptions.

Although I would like to see otherwise, the practice of hypothesis testing is still omnipresent, and I think it's good for people to be aware of new developments there and to be aware of the possible errors one can make running inappropriate tests on small sample sizes.

To Bart: You are totally right, that is why we coined them "Likert-type rating scales" in the CHI paper. Everyone seems to know Likert scales, but they are indeed nothing like what Dr. Likert himself proposed.

Great discussion here!

I would recommend this article which develops a very different view from yours, in particular that Likert scales produce interval data that are relevant to ANOVA:

Carifio and Perla, 2007, Ten Common Misunderstandings, Misconceptions, Persistent Myths and Urban Legends about Likert Scales and Likert Response Formats and their Antidotes. Journal of Social Sciences 3 (3): 106-116.

Stéphanie Buisine, Arts et Métiers ParisTech, France

Dear Judy and everyone else,

Unfortunately, I know almost nothing about statistics. Fortunately, a colleague across the hall (David Fortus) knows quite a lot. I showed him your article and his response was that you identified the problem correctly but gave the wrong solution. The non-parametric methods are too heavy for this. The current "best practice" for dealing with the ordinal / interval problem is to use Rasch analysis which processes ordinal data so that they become intervals and simple parametric methods can be used. A reference relevant for science education is given below.

Moti

Boone, William J., Townsend, J. Scott, Staver, John. Using Rasch theory to guide the practice of survey development and survey data analysis in science education and to inform science reform efforts: An exemplar utilizing STEBI self-efficacy data. Science Education 95, 2011, 258-280. http://dx.doi.org/10.1002/sce.20413.
