http://cacm.acm.org/blogs/blog-cacm/107125

April 4, 2011

It is quite common for HCI or computer science education researchers to use attitude questionnaires to examine people’s opinions of new software or teaching interventions. These are often on a Likert-type scale of "strongly agree" to "strongly disagree." And the sad truth is that researchers typically use the wrong statistical techniques to analyze them. Kaptein, Nass, and Markopoulos^{3} published a paper in CHI last year that found that in the previous year’s CHI proceedings, 45% of the papers reported on Likert-type data, but only 8% used nonparametric stats to do the analysis. Ninety-five percent reported on small sample sizes (under 50 people). This is statistically problematic even if it gets past reviewers! Here’s why.

Likert-type scales give ordinal data. That is, the data is ranked: "strongly agree" is usually better than "agree." However, it is not interval data. You cannot say that the distance between "strongly agree" and "agree" is the same as the distance between "neutral" and "disagree," for example. People tend to think there is a bigger difference between items at the extremes of the scale than in the middle (there is some evidence cited in Kaptein et al.’s paper that this is the case). For ordinal data, one should use nonparametric statistical tests (so 92% of the CHI papers got that wrong!), which do not assume a normal distribution of the data. Furthermore, because the distances between scale points are unknown, it makes no sense to report means of Likert-scale data; you should report the mode (the entry that occurs most frequently in the dataset).
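
As a rough illustration of reporting the mode rather than the mean, here is a minimal Python sketch. The responses, the 1-to-5 coding, and the function name `summarize_likert` are all made up for the example; they do not come from any of the papers mentioned here.

```python
from collections import Counter
from statistics import mode

# Hypothetical coding of a five-point scale (an assumption for this sketch).
LABELS = {
    1: "strongly disagree",
    2: "disagree",
    3: "neutral",
    4: "agree",
    5: "strongly agree",
}

# Made-up responses from 12 participants.
responses = [5, 4, 4, 2, 5, 4, 1, 4, 3, 5, 4, 2]


def summarize_likert(scores):
    """Return a frequency table keyed by label, plus the modal score."""
    counts = Counter(scores)
    frequencies = {LABELS[code]: counts.get(code, 0) for code in sorted(LABELS)}
    return frequencies, mode(scores)


frequencies, most_common = summarize_likert(responses)
print(frequencies)
print("mode:", LABELS[most_common])
```

The frequency table keeps the full shape of the responses visible, which a single mean would hide.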

Which classic nonparametric tests should you use? I strongly recommend the flow chart on p. 274 of *How to Design and Report Experiments* by Field and Hole. This textbook is also pretty good for explaining how to do the tests in SPSS and how to report the results. It also mentions how to calculate effect sizes (see later).

Why is it so common to use parametric tests such as the t-test or ANOVA instead of nonparametric counterparts? Kaptein, Nass, and Markopoulos^{3} suggest it is because HCI researchers know that nonparametric tests lack power. This means they are worried the nonparametric tests will fail to find an effect where one exists. They also suggest it is because there aren’t handy nonparametric tests that let you do analysis of factorial designs. So what’s a researcher to do?

### Robust Modern Statistical Methods

It turns out that statisticians have been busy in the last 40 years inventing improved tests that are not vulnerable to the various problems that classic parametric tests run into with real-world data, and that are also at least as powerful as classic parametric tests (Erceg-Hurn and Mirosevich^{1}). Why this is not mentioned in psychology textbooks is not clear to me. It must be quite annoying for statisticians to have their research ignored! A catch about modern robust statistical methods is that you cannot use SPSS to do them. You have to start messing around with extra packages in R or SAS, which are slightly more frightening than SPSS, which itself is not a model of usability. Erceg-Hurn and Mirosevich^{1} and Kaptein, Nass, and Markopoulos^{3} both describe the ANOVA-type statistic, which is powerful, usable in factorial designs, and works for nonparametric data.

A lot of interval data from behavioral research, such as reaction times, does not have a normal distribution or is heteroscedastic (groups have unequal variance), and so should not be analyzed with classic parametric tests either. To make matters worse, the tests that people typically use to check the normality or heteroscedasticity of data are not reliable when both are present. So, basically, you should always run modern robust tests in preference to the classic ones. I have come to the sad conclusion that I am going to have to learn R. However, at least it is free and a package called nparLD does ANOVA-type statistics. Kaptein et al.’s paper gives an example of such analysis, which I am currently practicing with.

### Effect Sizes

You might think this is the end of the statistical jiggery pokery required to publish some seemingly simple results correctly. Uh-uh, it gets more complicated. The APA style guidelines require authors to publish effect size as well as significance results. What is the difference? Significance testing checks to see if differences in the means could have occurred by chance alone. Effect size tells you how big the difference was between the groups. Randolph, Julnes, Sutinen, and Lehman^{4}, in what amounts to a giant complaint about the reporting practices of researchers in computer science education, pointed out that the way stats are reported by computer science education folk does not contain enough information, and missing effect sizes is one problem. Apparently it is not just us: Paul Ellis reports similar results with psychologists in *The Essential Guide to Effect Sizes*.

Ellis also comments that there is a viewpoint that not reporting effect size is tantamount to withholding evidence. Yikes! Robert Cole has a useful article, "It’s the Effect Size, Stupid," on what effect size is, why it matters, and which measures one can use. Researchers often use Cohen’s *d* or the correlation coefficient *r* as a measure of effect size. For Cohen’s *d*, there is even a handy way of saying whether the effect size is small, medium, or big. Unfortunately, if you have nonparametric data, effect size reporting seems to get trickier, and Cohen’s way of interpreting the size of an effect no longer makes sense (indeed, some people question whether it makes sense at all). Also, Cohen’s *d* is difficult for nonexperts to understand.
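
For interval data where Cohen’s *d* does apply, the calculation can be sketched in a few lines of Python. This version assumes the common pooled-standard-deviation form of *d* and Cohen’s conventional cut-offs of 0.2, 0.5, and 0.8 for small, medium, and large; the function names and the scores are invented for the example.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


def describe(d):
    """Label an effect using Cohen's conventional thresholds."""
    size = abs(d)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Made-up scores for two independent groups.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d, describe(d))
```

The labels are only a convention, which is part of why some people question whether they make sense.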

Common language effect sizes, or probability of superiority (PS) statistics, can solve this problem (Grissom^{2}). The probability of superiority is "the probability that a randomly sampled member of a population given one treatment will have a score (y1) that is higher on the dependent variable than that of a randomly sampled member of a population given another treatment (y2)" (Grissom^{2}). An example from Robert Cole: Consider a common language effect size of 0.92 in a comparison of heights of males and females. In other words, "in 92 out of 100 blind dates among young adults, the male will be taller than the female."

If you have Likert-type data with an independent design and you want to report an effect size, it is quite easy. SPSS won’t do it for you, but you can do it with Excel: *PS* = *U*/(*mn*), where *U* is the Mann-Whitney U result, *m* is the number of people in condition 1, and *n* is the number of people in condition 2 (Grissom^{2}). If you have a repeated measures design, refer to Grissom and Kim’s *Effect Sizes for Research* (2006, p. 115): *PSdep* = *w*/*n*, where *n* is the number of participants and *w* is the number of "wins," that is, cases where the score was higher in the second measure than in the first. Grissom^{2} has a handy table for converting between probability of superiority and Cohen’s *d*, as well as a way of interpreting the size of the effect.
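
The two formulas above are simple enough to sketch in Python instead of Excel. This is only an illustration under stated assumptions: *U* is computed here by directly counting the pairs in which a condition 1 score beats a condition 2 score, with ties counted as half a win (a tie-handling choice made for this sketch). Statistics packages sometimes report the smaller of the two possible *U* values, so check which one you have before dividing. The function names and the data are made up.

```python
def mann_whitney_u(group_1, group_2):
    """U for group_1: pairs where group_1 scores higher, ties count 0.5."""
    u = 0.0
    for a in group_1:
        for b in group_2:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def probability_of_superiority(group_1, group_2):
    """PS = U / (m * n) for an independent (between-subjects) design."""
    m, n = len(group_1), len(group_2)
    return mann_whitney_u(group_1, group_2) / (m * n)


def ps_dependent(first, second):
    """PSdep = w / n, where w counts participants scoring higher second time."""
    if len(first) != len(second):
        raise ValueError("repeated measures need one pair per participant")
    wins = sum(1 for before, after in zip(first, second) if after > before)
    return wins / len(first)


# Made-up Likert scores (1-5) for two independent conditions.
print(probability_of_superiority([4, 5, 3, 5], [2, 3, 4, 1]))  # 0.875

# Made-up before/after scores for five participants.
print(ps_dependent([2, 3, 4, 1, 3], [4, 3, 5, 2, 2]))  # 0.6
```

Read in common language terms, the first result says that a randomly chosen participant from condition 1 would outscore a randomly chosen participant from condition 2 about 88 times out of 100.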

I am not a stats expert in any way. This is just my current understanding of the topic from recent reading, although I have one or two remaining questions. If you want to read more, you could consult a forthcoming paper by Maurits Kaptein and myself in this year’s CHI conference (Kaptein and Robertson^{5}). I welcome any corrections from stats geniuses! I hope it is useful but I suspect colleagues will hate me for bringing it up. I hate myself for reading any of this in the first place. It is much easier to do things incorrectly.

### Reader’s comment

*By replacing ANOVA by nonparametric or robust statistics we risk ending up in another local maximum. Robust statistics are just another way to squeeze your data into a shape appropriate for the infamous "minimizing sum of squares" statistics. Those had their rise in the 20th century because they were computable by pure brainpower (or the ridiculously slow computers in those times)*.

*If only HCI researchers and psychologists would just learn their tools and acknowledge the progress achieved in econometrics or biostatistics. For example, linear regression and model selection strategies are there to replace the one-by-one null hypothesis testing with subsequent adjustment of alpha-level. With maximum likelihood estimation, a person no longer needs to worry about Gaussian error terms. Just use Poisson regression for counts and logistic regression for binary outcomes. The latter can also model Likert-type scales appropriately with meaningful parameters and in multifactorial designs*.

*Once you start discovering this world of modern regression techniques, you start seeing more in your data than just a number of means and their differences. You start seeing its shape and begin reasoning about the underlying processes. This can truly be a source of inspiration*.

—*Martin Schmettow*
