headsup: the blog: Shortest off-season yet

It seems as though the Silly World Series was just last week, and here we are in the peak of silly season again. Where does the time go?

In other words, even though it's way too early in the season to do so, it's already time to complain about inept reporting of public opinion polls. Step forward, The State of Columbia!

Clinton, Giuliani lead in S.C., poll says
Hillary Clinton and Rudy Giuliani lead the races for the 2008 S.C. Democratic and Republican primaries, according to a new poll from Winthrop University and ETV.

Problematic conclusions for a bunch of reasons, only one of which the story bothers to mention, and it gets that one wrong. So let's reiterate the basic rules of reporting on polls:
1) The mainbar must include the minimum details needed for a marginally smart reader to judge the poll's reliability: Population sampled, sample size, dates poll was conducted, confidence interval (aka "margin of sampling error") and confidence level. If you kick the story back to cityside for these details and the assigning editor says "What's a confidence level?" -- well, what would you do with a sports editor who couldn't tell you what an earned-run average was?
2) Never draw conclusions that go beyond your data. A survey of registered voters eight months before primary season starts tells you what registered voters say eight months before primary season starts. It doesn't, and can't, address questions of who has cause to worry or who's in a tougher fight than whom.
3) Don't use language that goes beyond your data.
4) Double-check all the arithmetic. Then double-check it again.

OK, how about some details?

Clinton, a U.S. senator from New York, has an almost 10-percentage-point lead over her closest Democratic competitor, U.S. Sen. Barack Obama, D-Ill. However, former New York Mayor Giuliani is in a much closer fight among Republicans.

While you're at it, particularly when the story holds to the front and reefers to details inside: Put all the relevant numbers in the mainbar too. Don't make me go hunting for the chart. According to which, by the way, Clinton is less than 8.5 points ahead of Obama.

Still, many Democratic and Republican voters remain uncommitted seven months before the first 2008 ballot is cast in South Carolina. The poll found that about 30 percent of voters in each party - and a similar percentage of independents - remain undecided.

I'm hunting for the damn chart again, and it says the poll ended May 27 -- eight months before the S.C. Democratic primary (assuming the paper got it right last Sunday). And the proportion of undecided voters is a hint of how risky it is to talk about who's leading anything at this point. (Yes, we're also still waiting to be told it's a poll of registered voters.)

The poll also showed Giuliani has more reason for concern. His lead over U.S. Sen. John McCaincq of Arizona - 18.6 percent to 14.4 percent - is within the poll’s margin of error.

One, the poll doesn't show anything about who has reason for concern. That isn't what it measures. Two, let's just go ahead and retire the phrase "within the poll's margin of error" (which, ahem, sends me hunting for the chart again). It's irrelevant here. Three -- gotta love that cq, huh?

...As for the front-runners, the poll shows Clinton dominates Edwards’ native state, a state where her campaign organization is dwarfed by those of candidates with a fraction of her polling percentage.

"Dominates" is an opinion, not a fact, and a poorly based one at that. Clinton is at about the same level as "undecided" among registered Democrats eight months before the primary.

And Giuliani, whether he’s in first or behind McCain - uncertain given the margin of error - remains in shouting distance of victory in a state where many Republican voters would fall opposite him on social issues.

Everybody talks about this "margin of error," but nobody bothers to tell us what it is or why it might matter. Let's review for a moment, then. The margin of sampling error* describes the band in which nonchance cases can be expected to fall at a given confidence level. Using the general standard of 95% confidence, 19 samples out of 20 will land within plus or minus the margin (3.8 percentage points for the whole sample here) of the real population figure. So there's one chance in 20 that your sample is outside this range and the real population figure -- which we're trying to estimate -- is something way different from what our poll predicts.

Why insist on the confidence level? Because the margin is meaningless without it. You can lower the margin of sampling error to 1.9 points with a flick of your wrist. All you need to do is accept a one-in-three chance that you're wrong.
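
For anyone who wants to check that at home, here's a minimal sketch in Python. It assumes a simple random sample and the worst-case 50/50 split, which is the usual basis for a published margin. The function names are made up for illustration, and this isn't necessarily how the pollsters ran their own numbers.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_sampling_error(n, confidence=0.95, p=0.5):
    """Half-width of the interval, as a proportion (assumes simple random sampling)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


def confidence_for_margin(n, margin, p=0.5):
    """Confidence level you are implicitly accepting if you quote this margin."""
    z = margin / sqrt(p * (1 - p) / n)
    return 2 * NormalDist().cdf(z) - 1


# Whole sample of 670 at the standard 95% confidence level.
full_margin = margin_of_sampling_error(670)
print(f"95% margin: {100 * full_margin:.2f} points")  # about 3.79

# Shrink the margin to 1.9 points and see what confidence is left.
shrunk_confidence = confidence_for_margin(670, 0.019)
print(f"confidence at 1.9 points: {shrunk_confidence:.2f}")
```

The first figure lands on the 3.8 points quoted for the whole sample. The second comes out at roughly two-thirds, which is where the one-in-three chance of being wrong comes from.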

Here's how it's explained (err...) in the methodology blurb:

The margin of error ranges from plus or minus 3.79 percent to 6.01 percent.

Why the range? The size of the group involved in answering each question.

Consider this example: Of all S.C. voters who responded, 53.4 percent said the war in Iraq is the most important issue facing the country. That result has a margin of error of only 3.79 percent because of the 670 people surveyed, meaning as few as 49.61 percent of the sample or as much as 57.19 percent could believe the war is the most important issue.

Uh, sort of. Those are differences of about 7%, even though it's 3.8 percentage points either way. And we know exactly how much of the sample (53.4%) believes the war is the most important problem; it's the figure for the population we don't know. And there's still a 5% chance that the population figure lies outside those bounds. (See what we meant about confidence level?)
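
The percent-versus-percentage-point distinction is easy to check with the story's own numbers. A quick sketch (the variable names are mine, not the pollsters'):

```python
share = 53.4   # percent of the sample naming the war, from the story
margin = 3.79  # percentage points, from the methodology blurb

# The band is the share plus or minus the margin, in percentage points.
low = share - margin
high = share + margin

# The same margin expressed as a percent of the share itself.
relative = 100 * margin / share

print(f"band: {low:.2f} to {high:.2f} percent")
print(f"margin as a share of the result: {relative:.1f}%")
```

The band runs from 49.61 to 57.19, as the story says, but the swing is about 7 percent of the 53.4 figure. Percentage points and percent are different units.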

That's why "within the margin of error" is meaningless for describing a candidate's lead. Sampling error applies to each candidate's result, meaning you have to double the margin -- so candidate A's lower bound is above candidate B's upper bound -- to be sure at your confidence level that the difference is real. (How likely a difference is to reflect a real lead, though not necessarily the one the poll indicates, is a different matter.)
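
Here's that doubling rule as a rough sketch, run on the Giuliani-McCain numbers from the story. The function name is hypothetical, and since the story never gives the margin for Republicans, the sketch tries both ends of the 3.79-to-6.01 range in the methodology blurb. Treat it as a conservative rule of thumb, not a formal significance test.

```python
def lead_clears_doubled_margin(leader, runner_up, margin):
    """True if the leader's lower bound sits above the runner-up's upper bound."""
    return (leader - margin) > (runner_up + margin)


giuliani, mccain = 18.6, 14.4  # percent, from the story

# Assumption: the Republican margin falls somewhere in the published range.
for margin in (3.79, 6.01):
    clear = lead_clears_doubled_margin(giuliani, mccain, margin)
    print(f"margin {margin}: {giuliani - mccain:.1f}-point lead clear? {clear}")
```

Both checks come back False: a 4.2-point gap doesn't clear even the smallest margin once it's applied to both candidates.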

However, the survey also found 61.9 percent of Democrats and 50.8 percent of Republicans named the war as most important. But each of those results came from groups smaller than the 670 surveyed, giving a higher margin of error — 6.01 percent, meaning as few as 55.89 percent or as many as 67.91 percent of Democrats believe the war is the most important issue.

Now we're screwing up the arithmetic. These samples are different sizes: The poll had about 25% Democrats and about 40% Republicans. The maximum margin of sampling error at 95% confidence for the Democratic subsample is about 7.6 percentage points. It's slightly smaller for this question and for the presidential preference question (about 6.9 points, given the gap between Hillary and not-Hillary votes), but it gives you an idea of how quickly a shrinking sample size can change the band in which your nonchance results operate.
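
To see how fast the band widens, here's a sketch that reruns the standard formula on the subgroups. The subgroup sizes are assumptions worked out from "about 25%" and "about 40%" of 670, not counts reported by the poll, and the helper name is mine.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96 for 95% confidence


def margin_points(n, p=0.5):
    """Margin of sampling error in percentage points (simple random sample)."""
    return 100 * Z95 * sqrt(p * (1 - p) / n)


total = 670
democrats = 0.25 * total    # assumed from "about 25%"
republicans = 0.40 * total  # assumed from "about 40%"

dem_max = margin_points(democrats)           # worst case, 50/50 split
dem_war = margin_points(democrats, p=0.619)  # the 61.9% war answer
rep_max = margin_points(republicans)         # worst case for Republicans

print(f"Democrats, maximum:      {dem_max:.1f}")
print(f"Democrats, war question: {dem_war:.1f}")
print(f"Republicans, maximum:    {rep_max:.1f}")
```

On those assumptions, the Democratic group tops out around 7.6 points, while a Republican-sized group lands near 6, which is about what the story applied to the Democrats.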

This is an interesting poll, but that doesn't mean it's exciting, which is a large part of the problem. The writer's going for drama where there isn't any. One effect of that (well, that and saying 8.4 points is "almost 10") is an impression that the paper is taking sides -- that it's plumping for Clinton on the news pages. That's not a good impression to give.

Let's pretend it's spring training and practice doing things right. Hit your cutoffs now and you're more likely to hit them in the pennant race.

* Please use its full name on first reference. Surveys are subject to a number of different types of error, and sampling is only one of them.

Friday, June 01, 2007
