Mining the tar sands of big data
by Michael E. Driscoll | February 14, 2011
This post was co-authored by Roger Ehrenberg, founder and managing partner at IA Ventures. A variation of this post was published by the GigaOm Media network.
The tar sands of Alberta, Canada contain the largest reserves of oil on the planet. However, they remain largely untouched, and for one reason: economics. It costs as much as $40 to extract a barrel of oil from tar sand, and until recently, petroleum companies could not profitably mine these reserves.
In a similar vein, much of the world’s most valuable information is trapped in digital sand, siloed in servers scattered around the globe. These vast expanses of data — streaming from our smart phones, DVRs, and GPS-enabled cars — require mining and distillation before they can be useful.
Oil and information share another parallel: in recent years, technology has catalyzed dramatic drops in the cost of extracting each.
Unlike oil reserves, data is an abundant resource on our wired planet. Though much of it is noise, at scale and with the right mining algorithms, this data can yield information that predicts traffic jams, entertainment trends, even flu outbreaks.
These are hints of the promise of big data, which will mature in the coming decade, driven by advances in three principal areas: sensor networks, cloud computing, and machine learning.

The first, sensor networks, historically included devices ranging from NASA satellites and traffic monitors to grocery scanners and Nielsen rating boxes. Expensive to deploy and maintain, these were the exclusive province of governments and industry. But another, wider sensor network has emerged in the last decade: smart phones and web-connected consumer devices. These sensors — and the Tweets, check-ins, and digital pings they generate — form the tendrils of a global digital nervous system, pulsing with petabytes.
Just as these devices have multiplied, so have the data centers they communicate with. Housed in climate-controlled warehouses, these facilities consume an estimated 2 percent of the United States' energy budget, and they are its fastest-growing segment. They sit at the heart of cloud computing, the second driver of big data.
Cloud computing reframes compute power as a utility, like electricity or water. It offers large-scale computing to even the smallest start-ups: with a few keystrokes, one can lease 100 virtual machines from Amazon’s Elastic Compute Cloud for less than $10 per hour.
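To make that concrete, here is a minimal sketch of what those few keystrokes might look like, using the boto3 Python SDK to request a fleet of EC2 instances. The region, machine image, instance type, and instance count are placeholder assumptions, and the call presumes AWS credentials are already configured.

```python
# Sketch: leasing a fleet of virtual machines from Amazon EC2 via boto3.
# Assumes AWS credentials are configured; the AMI ID below is a placeholder.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Request 100 small instances in one call; billing starts when they boot.
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder machine image
    InstanceType="t3.micro",          # small, inexpensive instance type
    MinCount=100,
    MaxCount=100,
)

print(f"Launched {len(instances)} instances")

# Tear the fleet down when the job is done, so the meter stops running.
for instance in instances:
    instance.terminate()
```

The utility framing is the point: the same capacity that once required buying and racking servers is now rented by the hour and released the moment the work is finished.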
Yet this computing brawn is only valuable when combined with intelligence. Enter machine learning, the third principal component driving value in the industrial age of data.
Machine learning is a discipline that blends statistics with computer science to classify and predict patterns in data. Its algorithms lie at the heart of spam filters, self-driving cars, and movie recommendation systems, including the one to which Netflix awarded its million-dollar prize in 2009. While data storage and distributed computing technologies are being commoditized, machine learning is increasingly a source of competitive advantage among data-driven firms.
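For a flavor of what such an algorithm looks like in practice, here is a minimal sketch of a toy spam filter built with the scikit-learn Python library; the handful of training messages is invented purely for illustration.

```python
# Sketch: a toy spam filter, the kind of classifier described above.
# The training messages are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "win a free prize now",          # spam
    "limited offer, claim your cash",  # spam
    "lunch at noon tomorrow?",         # ham
    "draft of the report attached",    # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features feeding a naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["free cash prize", "see you at lunch"]))
# Expected output: ['spam' 'ham']
```

A real spam filter trains on millions of labeled messages rather than four, but the structure is the same: features extracted from raw data feed a statistical model that classifies what it has never seen before.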
Together, these three technology advances lead us to make several predictions for the coming decade:
1. A spike in demand for “data scientists.” Fueled by the oversupply of data, more firms will need individuals who are facile with manipulating and extracting meaning from large data sets. Until universities adapt their curricula to match these market realities, the battle for these scarce human resources will be intense.
2. A reassertion of control by data producers. Firms such as retailers, banks, and online publishers are recognizing that they have been giving away their most precious asset — customer data — to transaction processors and other third parties. We expect firms to spend more effort protecting, structuring, and monetizing their data assets.
3. The end of privacy as we know it. With devices tracking our every point and click, acceptable practice for personal data will shift from preventing disclosures towards policing uses. It’s not what our databases know that matters — for soon they will know everything — it’s how this data is used in advertising, consumer finance, and health care.
4. The rise of data start-ups. A class of companies is emerging whose supply chains consist of nothing but data. Their inputs are collected through partnerships or from publicly available sources, processed, and transformed into traffic predictions, news aggregations, or real estate valuations. Data start-ups are the wildcatters of the information age, searching for opportunities across a vast and virgin data landscape.
The consequence of sensor networks, cloud computing, and machine learning is that the data landscape is broadening: data is abundant, cheap, and more valuable than ever. It’s a rich, renewable resource that will shape how we live in the decades ahead, long after the last barrel has been squeezed from the tar sands of Athabasca.