As posted on Technology news, comment and analysis | guardian.co.uk
Internet measurement techniques need a total overhaul – new methods make it hard for incumbent players to stay in the game
The web user is the most watched consumer ever. For tracking purposes, every large site drops literally dozens of cookies in the visitor’s browser. In the most comprehensive investigation on the matter, The Wall Street Journal found that each of the 50 largest websites in the United Sates, weighing 40% of the US page views, installed an average of 64 files on a user device. (See the WSJ’s What They Know series and a Monday Note about tracking issues.) As for server logs, they record every page sent to the user and they tell with great accuracy which parts of a page collect most of the reader’s attention.
But when it comes to measuring a digital viewer’s commercial value, sites rely on old-fashioned panels, that is limited user population samples. Why?
Panels are inherited. They go back to the old days of broadcast radio when, in order to better sell advertising, dominant networks wanted to know which station listeners tuned in to during the day. In the late thirties, Nielsen Company made a clever decision: it installed a monitoring box in 1,000 American homes. Twenty years later, Nielsen did the same, on a much larger scale, with broadcast television. The advertising world was happy to be fed with plenty of data — mostly unchallenged as Nielsen dominated the field. (For a detailed history, you can read Rating the Audience, written by two Australian media academics). As Nielsen expanded to other media (music, film, books and all sorts of polls), moving to the internet measurement sounded like a logical step. As of today, Nielsen only faces smaller competitors such as ComScore and others.
I have yet to meet a publisher who is happy with this situation. Fearing retribution, very few people talk openly about it (twisting the dials is so easy, you know…), but they all complain about inaccurate, unreliable data. In addition, the panel system is vulnerable to cheating on a massive scale. Smartypants outfits sell a vast array of measurement boosters, from fake users that will come in just once a month to be counted as “unique” (they are indeed), to more sophisticated tactics such as undetectable “pop under” sites that will rely on encrypted URLs to deceive the vigilance of panel operators. In France for instance, 20% to 30% of some audiences can be bogus — or largely inflated. To its credit, Mediametrie — the French Nielsen affiliate that produces the most watched measurements — is expending vast resources to counter the cheating, and to make the whole model more reliable. It works, but progress is slow. In August 2012, Mediametrie Net Ratings (MNR), launched a Hybrid Measure taking into account site centric analytics (server logs) to rectify panel numbers, but those corrections are still erratic. And it takes more than a month to get the data, which is not acceptable for the real-time-obsessed internet.
Publishers monitor the pulse of their digital properties on a permanent basis. In most newsrooms, Chartbeat (also imperfect, sometimes) displays the performance of every piece of content, and home pages get adjusted accordingly. More broadly, site-centric measures detail all possible metrics: page views, time spent, hourly peaks, engagement levels. This is based on server logs tracking dedicated tags inserted in each served page. But the site-centric measure is also flawed: If you use, say, four different devices — a smartphone, a PC at home, another at work, and a tablet — you will be incorrectly counted as four different users. And if you use several browsers you could be counted even more times. This inherent site-centric flaw is the best argument for panel vendors.
But, in the era of Big Data and user profiling, panels no longer have the upper hand.
The developing field of statistical pairing technology shows great promise. It is now possible to pinpoint a single user browsing the web with different devices in a very reliable manner. Say you use the four devices mentioned earlier: a tablet in the morning and the evening; a smartphone for occasional updates on the move, and two PCs (a desktop at the office and a laptop elsewhere). Now, each time you visit a new site, an audience analytics company drops a cookie that will record every move on every site, from each of your devices. Chances are your browsing patterns will be stable (basically your favorite media diet, plus or minus some services that are better fitted for a mobile device.) Not only your browsing profile is determined from your navigation on a given site, but it is also quite easy to know which sites you have been to before the one that is currently monitored, adding further precision to the measurement.
Over time, your digital fingerprint will become more and more precise. Until then, the set of four cookies is independent from each other. But the analytics firm compiles all the patterns in single place. By data-mining them, analysts will determine the probability that a cookie dropped in a mobile application, a desktop browser or a mobile web site belongs to the same individual. That’s how multiple pairing works. (To get more details on the technical and mathematical side of it, you can read this paper by the founder of Drawbridge Inc.) I recently discussed these techniques with several engineers both in France and in the US. All were quite confident that such fingerprinting is do-able and that it could be the best way to accurately measure internet usage across different platforms.
Obviously, Google is best positioned to perform this task on a large scale. First, its Google Analytics tool is deployed on more than 100 million websites. And the Google Ad Planner, even in its public version, already offers a precise view of the performance of many sites in the world. In addition, as one of the engineers pointed out, Google is already performing such pairing simply to avoid showing the same ad twice to a someone using several devices. Google is also most likely doing such ranking in order to feed the obscure “quality index” algorithmically assigned to each site. It even does such pairing on a nominative basis by using its half billion Gmail accounts (425 million in June 2012) and connecting its Chrome users. As for giving up another piece of internet knowledge to Google, it doesn’t sounds like a big deal to me. The search giant knows already much more about sites than most publishers do about their own properties. The only thing that could prevent Google from entering the market of public web rankings would be the prospect of another privacy outcry. But I don’t see why it won’t jump on it — eventually. When this happens, Nielsen will be in big trouble.