Following up on blahedo's comment and the questions about what the heck their p-values mean --

This is a nice example of how you can get statistical significance for small effects if your sample is big enough. Their p-values are explained very badly, so I did my own analysis by transcribing their data from those plots. Let's take their weighting scheme for granted. I agree with some other commenters that the sums and counts are misleading, so I instead took the average score per font and computed confidence intervals for those means. The means are indeed a little different, and for some pairs the difference is statistically significant.

http://brenocon.com/Screen%20shot%202012-08-10%20at%202.03.0...
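For anyone who wants to reproduce that kind of interval, here's a minimal sketch of a per-font mean with a 95% confidence interval. The counts are made up for illustration (they are not the transcribed data), and the normal approximation is an assumption; the code in the gist may do it differently.

```python
import math

def mean_ci(counts, z=1.96):
    """Mean and normal-approximation CI from a {score: count} table.

    z=1.96 gives roughly a 95% interval; this is an assumed method,
    not necessarily the one used for the plot.
    """
    n = sum(counts.values())
    mean = sum(score * c for score, c in counts.items()) / n
    # Sample variance (n - 1 in the denominator).
    var = sum(c * (score - mean) ** 2 for score, c in counts.items()) / (n - 1)
    se = math.sqrt(var / n)  # standard error of the mean
    return mean, mean - z * se, mean + z * se

# Hypothetical weighted-score counts for one font (NOT the real data).
example_counts = {-5: 12, -2: 20, 2: 30, 5: 38}
mean, lower, upper = mean_ci(example_counts)
print(f"mean={mean:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

With thousands of responses per font the standard error gets tiny, which is how small gaps between means end up significant.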

But does it matter much? Take the pair with the largest gap, Baskerville vs. Comic Sans, at 0.95 versus 0.79: a difference of 0.16. This is on a 10-point scale (ranging from -5 to +5).

In fact, the standard deviation for the entire dataset is 3.6 -- so the gap is only about 0.04 standard deviations (0.16 / 3.6).
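As a quick check of that arithmetic, using only the rounded figures quoted above (the unrounded transcribed values could shift the last digit):

```python
# Rounded figures from the comment above.
mean_baskerville = 0.95
mean_comic_sans = 0.79
sd_all = 3.6  # standard deviation over the entire dataset

diff = mean_baskerville - mean_comic_sans
effect_size = diff / sd_all  # difference in units of standard deviations

print(f"difference = {diff:.2f}")
print(f"standardized difference = {effect_size:.3f}")
```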

Or here's another way to think about it. If a person saw the Comic Sans version when they could instead have seen the Baskerville version, how often would they have scored higher with Baskerville? (This ignores the weightings; it's a purely ordinal comparison. I think this is related to the Wilcoxon-Mann-Whitney test statistic or something, I forget.) Assuming independence (which proper randomization should give you, hopefully), just sample independently from the two fonts' score distributions many times and compare pairs of simulated outcomes. 22% of the time it's a tie, 40.3% of the time Baskerville scores higher, and 37.8% of the time Comic Sans scores higher. So I guess the difference is better than nothing.
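The pairwise comparison can be sketched roughly like this. The two count tables are invented placeholders (not the transcribed data), and the function names are hypothetical. Since both distributions are discrete, the same probabilities can also be computed exactly without sampling, so the sketch shows both.

```python
import random
from itertools import product

def exact_pairwise(counts_a, counts_b):
    """Exact P(A > B), P(tie), P(A < B) for independent draws
    from two {score: count} tables."""
    win = tie = lose = 0
    for (sa, ca), (sb, cb) in product(counts_a.items(), counts_b.items()):
        weight = ca * cb  # number of (A, B) response pairs with these scores
        if sa > sb:
            win += weight
        elif sa == sb:
            tie += weight
        else:
            lose += weight
    total = sum(counts_a.values()) * sum(counts_b.values())
    return win / total, tie / total, lose / total

def simulated_pairwise(counts_a, counts_b, n_draws=100_000, seed=0):
    """Monte Carlo version: sample each font independently, compare pairs."""
    rng = random.Random(seed)
    draws_a = rng.choices(list(counts_a), weights=list(counts_a.values()), k=n_draws)
    draws_b = rng.choices(list(counts_b), weights=list(counts_b.values()), k=n_draws)
    win = sum(a > b for a, b in zip(draws_a, draws_b))
    tie = sum(a == b for a, b in zip(draws_a, draws_b))
    return win / n_draws, tie / n_draws, (n_draws - win - tie) / n_draws

# Hypothetical score counts for two fonts (NOT the real data).
font_a = {-5: 10, -1: 20, 1: 30, 5: 40}
font_b = {-5: 15, -1: 20, 1: 30, 5: 35}

print("exact:    ", exact_pairwise(font_a, font_b))
print("simulated:", simulated_pairwise(font_a, font_b))
```

Only the ordering of the scores matters here, so any monotone relabeling of the score values gives the same three probabilities.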

Not sure what's a good and fair way to think about the substantive size of the effect. I wanted to take the quantile positions of the means, but realized you can't exactly do that with ordinal data like this (zillions of values share the same quantile value).

I probably missed something, so here's the transcribed data and R/Python code probably with errors: https://gist.github.com/3311340

Now that I'm thinking about it more, averaging the agreement scores seems weird. Maybe it's clearer to use the simple binary agree/disagree outcome.