Sunday, August 17, 2008

Desperately seeking significance

A couple of items from today's A section suggest that the Nation's Newspaper of Record has trouble -- fairly serious trouble, actually -- understanding a concept that's going to be of some importance as the campaign season wears on. The concept is statistical significance. It's not all that hard, but if you don't get it, you're at risk of sounding the way the Times sounded today: authoritatively clueless at best, and it goes downhill from there. Let's let the Times introduce the term and see where its wanderings take us. Here's a tale from North Carolina:

The average of the latest polls here, as of Wednesday, showed Mr. McCain with a lead of 4.3 percentage points over Mr. Obama, a difference that is barely significant statistically.

Looks as if the Times, with its usual close regard for crediting other people's work, is borrowing the Real Clear Politics "average," which journalists in their right minds shun. Biggest point first: the RCP average can't be tested for statistical significance, because it's a meaningless number. It might or might not have some relation to something going on in the empirical world.* There is a way to average polls with different sample sizes, but RCP doesn't use it, and that's over and above the fallacy of averaging polls that sample adults, registered voters and "likely" voters.**
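For what it's worth, weighting by sample size is trivial. Here's a minimal sketch in Python, with invented poll numbers standing in for the real North Carolina ones -- and note that even this wouldn't cure the adults-vs-RV-vs-LV problem:

    # Averaging polls weighted by sample size. The figures below are
    # invented for illustration, not the actual North Carolina polls.
    polls = [
        {"n": 600,  "mccain": 0.48, "obama": 0.43},
        {"n": 1100, "mccain": 0.46, "obama": 0.44},
        {"n": 800,  "mccain": 0.47, "obama": 0.45},
    ]

    total_n = sum(p["n"] for p in polls)
    for cand in ("mccain", "obama"):
        avg = sum(p["n"] * p[cand] for p in polls) / total_n
        print(f"{cand}: {avg:.3f}")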

But our point is the term the Times invokes: "barely significant statistically." And the trouble there -- the thing suggesting that Times writers and editors don't know what they're talking about -- is that "barely" isn't an issue in significance testing. A difference is either significant or it ain't; "barely significant" is like "kind of pregnant" or "barely dead." So what does (or should) "significant" mean? Listen and attend.

When you're generalizing from samples to a population, a difference between samples can arise one of two ways: Either there's a real difference in the population or you got the result by accident ("chance"). To test for significance, you pick a confidence level, or a likelihood that you have a real result, rather than a chance one. Conventionally, that's 95%, meaning a "significant" result is one that has one chance (or less) in 20 of having happened by accident:*** Candidate A isn't really ahead of Candidate B, aspirin doesn't work better than placebo, kids who saw the violent cartoon don't act more violently than kids who saw the happy cartoon. Good so far?
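If you'd rather check the arithmetic than take anybody's word for it, here's a back-of-the-envelope sketch in Python. It uses the textbook normal approximation for the difference between two candidates' shares in a single multinomial sample; the function name is our own invention:

    import math

    def two_tailed_p(lead, share_sum, n):
        """Two-tailed p-value for a lead of (pA - pB) in one poll of size n.

        lead      -- pA - pB, e.g. 0.06 for a six-point lead
        share_sum -- pA + pB (undecideds and third candidates excluded)
        n         -- sample size
        """
        # standard error of (pA_hat - pB_hat) under multinomial sampling:
        # Var = [pA + pB - (pA - pB)**2] / n
        se = math.sqrt((share_sum - lead ** 2) / n)
        z = lead / se
        # two-tailed p from the standard normal CDF
        return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))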

OK, so a "significant" result is one that's unlikely to have happened by chance, given your chosen confidence level. What does that say about a 4.3-point McCain lead? Well, the difference isn't significant at the conventional 95%, although it'd be close in a poll the size of the one the Times and CBS ran last month. Since chance can't be ruled out, it's possible that Obama is leading in the whole population. Is that likely?

No, which suggests that survey differences that don't reach "significance" should be handled with care. In a poll as small as 600 people, that nonsignificant difference would be significant -- again, meaning unlikely to have come about by chance -- if we set the confidence level to 2 out of 3 rather than 19 out of 20 (there's a quick check on the arithmetic after the two examples below). So the better way to talk about recent surveys in North Carolina isn't to say that a meaningless number did or didn't pass an arbitrary threshold that the reporter doesn't quite understand, but to place "nonsignificant" results on a scale that runs roughly between these points:

"The results are unlikely to represent a real difference in the population."
"The results suggest that Candidate A would be leading if the whole population had been surveyed, but the difference is not conclusive by traditional polling standards."

Got an idea where our next example might take us? Onward:

A New York Times/CBS News poll last month found the race between Mr. Obama and Mr. McCain to be a statistical dead heat, not unlike where Senator John Kerry and Mr. Bush stood in a Times/CBS News poll in July 2004.

Uh, wow. Since even the AP acknowledges that there is no such thing as a "statistical dead heat," it'd be nice if the Newspaper of Record played along. If there are 9 chances in 10, or even 2 in 3, that the sample difference reflects a population difference, we have a difference that falls short of the conventional significance levels of the social sciences -- not a "dead heat."

But that's not the real problem here. The problem is that this writer either guessed at the results, read the wrong poll, or read the right poll and decided to lie about the whole thing. Because if the "New York Times/CBS News poll last month" is the one that the Times and CBS conducted, um, last month, Obama's lead (45-39, n = 1,534 RVs) isn't just significant at 95%. It's bloody near significant at 99% -- which would be less than 1 chance in 100 that the difference came about by accident.
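Don't take our word for it; feed the poll's own numbers into the sketch above (nothing assumed here but the normal approximation):

    # Obama 45, McCain 39, n = 1,534 registered voters
    print(two_tailed_p(0.06, 0.84, 1534))  # ~0.010 -- significant at 95%,
                                           # a whisker from 99%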

So in a single section, the Times botches the data in a way that benefits the Democrat and in a way that benefits the Republican. You could, on some strange planet with an ammonia-based atmosphere and lots of suns, call that "balanced" and claim it as evidence of your impartiality. A more practical approach in this world might be adopting a few simple rules and applying them without fear or favor -- by looking at the numbers without the names.

Granted, it's a free country. You can claim, as Howard Kurtz does, that the numbers don't matter; it's your journalistic intuition that counts. That has the unfortunate outcome of making you sound like a bozo, whether you work for the Times or the Post or any of the lesser breeds. Significance testing, to borrow a phrase from old Lasker, is set about with the sort of merciless facts that tend to culminate in checkmate and thus contradict the hypocrite.

* You baseball fans can try this multiple-choice question. Suppose you hit .400 your first year in the show and .196 your second. Your career batting average is:
a) .292
b) .298
c) .200
d) How should I know?
If you chose (d), congratulations! Now recalculate on these stats: First season, 4 hits in 10 at-bats; second season, 96 for 490.
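If the pencil's out of reach, the recalculation runs like this:

    hits = 4 + 96          # 100 career hits
    at_bats = 10 + 490     # 500 career at-bats
    print(hits / at_bats)  # 0.2 -- answer (c), nowhere near the .298 simple mean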
** Calculate the career batting average again, but this time, bear in mind that one year's at-bats include walks, HBPs and SFs. Heh heh.
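And a sketch of that one too, inventing a walk-HBP-SF total, since the footnote doesn't supply one:

    # Suppose (our invention) 45 of the second season's 490 "at-bats" were
    # really walks, HBPs and SFs, which don't belong in the denominator.
    true_ab = 10 + (490 - 45)
    career = (4 + 96) / true_ab
    print(round(career, 3))  # 0.22 -- the mixed-denominator .200 was junk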
*** Maybe the Times would be happier with Wikipedia's definition.
