For many years, companies collected data from various sources that often found its way to relational databases like Oracle and MySQL. However, the rise of the internet and Web 2.0, and recently social media began not only an enormous increase in the amount of data created, but also in the type of data. No longer was data relegated to types that easily fit into standard data fields – it now came in the form of photos, geographic information, chats, Twitter feeds and emails. The age of Big Data is upon us.
A study by IDC titled “The Digital Universe Decade” projects a 45-fold increase in annual data by 2020. In 2010, the amount of digital information was 1.2 zettabytes. 1 zettabyte equals 1 trillion gigabytes. To put that in perspective, the equivalent of 1.2 zettabytes is a full-length episode of “24” running continuously for 125 million years, according to IDC. That’s a lot of data. More importantly, this data has to go somewhere, and this report projects that by 2020, more than 1/3 of all digital information created annually will either live in or pass through the cloud. With all this data being created, the challenge will be to collect, store, and analyze what it all means.
Business intelligence (BI) systems have always had to deal with large data sets. Typically the strategy was to pull in “atomic” -level data at the lowest level of granularity, then aggregate the information to a consumable format for end users. In fact, it was preferable to have a lot of data since you could also “drill-down” from the aggregation layer to get at the more detailed information, as needed.
Large Data Sets and Sampling
Coming from a data background, I find that dealing with large data sets is both a blessing and a curse. One product that I managed analyzed share of wireless numbers. The number of wireless subscribers in 2011 according to CTIA was 322.9 million and growing. While that doesn’t seem like a lot of data at first, if each wireless number was a unique identifier, there could be any number of activities associated with each number. Therefore the amount of information generated from each number could be extensive, especially as the key element was seeing changes over time. For example, after 2003, mobile subscribers in the United States were able to port their numbers from one carrier to another. This is of great importance to market research since a shift from one carrier to another would indicate churn and also impact the market share of carriers in that Metropolitan Statistical Area (MSA).
Given that it would take a significant amount of resources to poll every household in the United States, market researchers often employ a technique called sampling. This is a statistical technique where a panel that represents the population is used to represent the activity of the overall population that you want to measure. This is a sound scientific technique if done correctly but its not without its perils. For example, it’s often possible to get +/- 1% error at 95% confidence for a large population but what happens once you start drilling down into more specific demographics and geographies? The risk is not only having enough sample (you can’t just have one subscriber represent the activity of a large group for example) but also ensuring that it is representative (is the subscriber that you are measuring representative of the population that you want to measure?). It’s a classic problem of using panelists that sampling errors do occur. It’s fairly difficult to be completely certain that your sample is representative unless you’ve actually measured the entire population already (using it as a baseline) but if you’ve already done that, why bother sampling?
One of the most famous examples of sampling error was the 1948 election where a Gallup Poll all but declared that Thomas Dewey had defeated Harry Truman. Although Gallup used scientific sampling (as opposed to a straw poll), it was with a quota sample that proved to be a deeply flawed measurement tool. Since it relied on human intervention to choose the sample, it was inherently biased. Even with modern techniques, it is important to always take into account the margin of error and the confidence interval, which is the indication of the reliability of the measurement.
Of course, the real luxury is the ability to be able to poll the entire population. While the main issue with polling the entire population is more in the data collection (which is why the census is only conducted one once a decade) and not in the data analysis, assuming that the data collection can be done, being able to analyze that large a data set quickly and efficiently would negate the need for using a sample. For example, while polling every single person in the United States is extremely expensive and difficult, collecting all the social network data regarding your brand should be fairly easy. The majority of social networks have an API and most people who use it are already referencing your brand and/or posting to your content pages. The issue is less of collection than of being able to analyze all that data in an efficient and timely manner.
As mentioned earlier, business intelligence has had to deal with this type of data problem and it was often solved by creating increasingly powerful proprietary hardware. Teradata was one of the early pioneers of this technique, selling large and powerful equipment that was used to process large amounts of data. A more modern incarnation, Netezza (now part of IBM), claimed to pull data at “physics speed,” which removes the database layer and interacts directly with the hardware to extract data as fast as data could be pulled from the spindle. It’s extremely fast, but still required expensive, proprietary hardware.
The Yellow Elephant
So large data sets have been around a long time. There have been attempts at trying to manage, wrangle, and tame the onslaught of data being generated from everywhere. But it was not until Jeffrey Dean and Sanjay Ghemawat of Google Labs wrote their influential paper on MapReduce in 2003 that Big Data really started to take shape. Google has had to deal with large amounts of raw data (such as crawled documents and web request logs) that needed to be analyzed in a timely manner. Creating MapReduce was their way to being able to abstract the compute parallelization, distribution of data, fault tolerance, and load balancing from the developers so that they can focus on expressing the computations necessary to analyze the data. This seminal paper reportedly inspired Doug Cutting to develop an open-source implementation of the MapReduce framework called “Hadoop,” which was named after his son’s toy elephant. Yahoo famously embraced this implementation after hiring Cutting in 2004. Yahoo continued to build upon this technology and first used Hadoop in production in 2008 for it’s search “webmap,” which was an index of all known webpages and all the metadata needed to search them.
One of the key characteristics of Hadoop was that it could run on commodity hardware and automatically distribute jobs. By its nature, it is designed to be fault tolerant so jobs are not impacted by the failure of a single node. According to an article in Wired Magazine about Yahoo’s use of Hadoop, “Hadoop could ‘map’ tasks across a cluster of machines, splitting them into tiny sub-tasks, before ‘reducing’ the results into one master calculation.” Soon after, companies like eBay and Facebook were adopting the technology and implementing it internally. Reportedly, Facebook has the largest Hadoop Cluster in the world, currently at 30 petabytes (PB).
While early adopters of Hadoop and other Big Data technologies tended to form around Internet, social media, and ad networks, Big Data is intended to be a general-purpose tool. With most companies now integrating social media into their offerings, the amount of data created internally combined with those extracted externally will only increase. This is an indication that companies from all industries will need to start investigating how to implement Big Data technologies to make use of all this data that they are collecting and creating.
In Part 2 of this Big Data series, I discuss how Big Data and the Cloud work together.