"One in 10 web pages laced with malware - Google", screamed The Register, in one of its characteristic gutter-press-style headlines.
Now, you've always got to take El Reg with a pinch of salt. God bless them in there: sometimes they write stuff worth reading, but other times you're left scratching your head and asking yourself why in bejaysus they write such a load of complete bollox. Tabloid headline syndrome or something? Sure looks like it to me.
But this time they've really outdone themselves: a tiny amount of background research would have told them that their article was hopelessly wrong. It concerned a recent paper presented at the Usenix HotBots '07 workshop by Niels Provos from Google, about malware infection rates on web pages. 450,000 out of 4.5 million web pages analysed turned out to be able to infect an instance of Internet Explorer. 10%, you say?
No. The 4.5 million figure was culled from an initial pool of "several billion URLs", by using a programming model called MapReduce to discover correlations between large-scale data sets. In this case, the Google team looked at potential exploit URLs and suspicious JavaScript code, and were able to distil the initial data set of billions of web pages down to the quoted 4.5 million.
Critically, Provos and his team did not provide the exact number of URLs initially fed into their analysis. Was it Google's entire web page cache? Was it a randomly selected subset? For that matter, does Google publish the number of pages indexed these days? They used to, but maybe that's commercially sensitive information and we the public don't need to know any more. The only thing we can say about "several billion" is that it's at least 2 billion. Divide that by the 450,000 dangerous pages and you find that at most one in 4,444 randomly selected pages could be categorically labelled dangerous to at least some version of Internet Explorer using Google's methods.
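For anyone who wants to check the sums, here is a rough sketch in Python. The 450,000 and 4.5 million figures are the ones quoted above. The 2 billion floor is my own assumption about what "several billion" means, since the paper gives no exact number, and the function and variable names are made up for illustration.

```python
# Figures quoted from the paper, as discussed above.
flagged_pages = 450_000        # pages able to infect an instance of IE
analysed_pages = 4_500_000     # pages that survived the initial filtering

# Assumption: "several billion" is at least 2 billion.
# The paper does not state the real size of the initial pool.
min_initial_pool = 2_000_000_000


def one_in(hits, total):
    """Return N such that hits/total is a rate of 'one in N'."""
    return total / hits


headline_rate = one_in(flagged_pages, analysed_pages)
upper_bound_rate = one_in(flagged_pages, min_initial_pool)

print(f"Headline figure: one in {headline_rate:.0f}")
print(f"Against the full pool: at most one in {upper_bound_rate:.0f}")
```

If the real pool was bigger than 2 billion, the second figure only gets larger, so the true rate is lower still.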
Another notable omission from Google's study is that they make no mention of whether their IE installation was patched or not. Until otherwise informed, I'm going to read between the lines of:
"Our analysis is different as we do not care about specific vulnerabilities but rather about how many URLs on the Internet are capable of compromising users."
... and note the critical phrase: "capable of compromising users". If an exploit targets a known IE vulnerability, even one that has long since been patched, it's still capable of compromising users who haven't applied the fix. Their paper should have explicitly mentioned whether they were using a patched IE, because it's an important issue. Lots of people run Windows Update, and despite the horrifying numbers of bots and botnets on the 'net, the numbers would be much larger if they didn't.
El Reg, your punishment is to read the original Google paper and write out 1000 times "I promise not to write such crap in future". No scripts, please.