For many years, companies collected data from various sources that often found its way to relational databases like Oracle and MySQL. However, the rise of the internet and Web 2.0, and recently social media began not only an enormous increase in the amount of data created, but also in the type of data. No longer was data relegated to types that easily fit into standard data fields – it now came in the form of photos, geographic information, chats, Twitter feeds and emails. The age of Big Data is upon us.
A study by IDC titled “The Digital Universe Decade” projects a 45-fold increase in annual data by 2020. In 2010, the amount of digital information was 1.2 zettabytes. 1 zettabyte equals 1 trillion gigabytes. To put that in perspective, the equivalent of 1.2 zettabytes is a full-length episode of “24” running continuously for 125 million years, according to IDC. That’s a lot of data. More importantly, this data has to go somewhere, and this report projects that by 2020, more than 1/3 of all digital information created annually will either live in or pass through the cloud. With all this data being created, the challenge will be to collect, store, and analyze what it all means.
Business intelligence (BI) systems have always had to deal with large data sets. Typically the strategy was to pull in “atomic” -level data at the lowest level of granularity, then aggregate the information to a consumable format for end users. In fact, it was preferable to have a lot of data since you could also “drill-down” from the aggregation layer to get at the more detailed information, as needed.
Large Data Sets and Sampling
Coming from a data background, I find that dealing with large data sets is both a blessing and a curse. One product that I managed analyzed share of wireless numbers. The number of wireless subscribers in 2011 according to CTIA was 322.9 million and growing. While that doesn’t seem like a lot of data at first, if each wireless number was a unique identifier, there could be any number of activities associated with each number. Therefore the amount of information generated from each number could be extensive, especially as the key element was seeing changes over time. For example, after 2003, mobile subscribers in the United States were able to port their numbers from one carrier to another. This is of great importance to market research since a shift from one carrier to another would indicate churn and also impact the market share of carriers in that Metropolitan Statistical Area (MSA).
Given that it would take a significant amount of resources to poll every household in the United States, market researchers often employ a technique called sampling. This is a statistical technique where a panel that represents the population is used to represent the activity of the overall population that you want to measure. This is a sound scientific technique if done correctly but its not without its perils. For example, it’s often possible to get +/- 1% error at 95% confidence for a large population but what happens once you start drilling down into more specific demographics and geographies? The risk is not only having enough sample (you can’t just have one subscriber represent the activity of a large group for example) but also ensuring that it is representative (is the subscriber that you are measuring representative of the population that you want to measure?). It’s a classic problem of using panelists that sampling errors do occur. It’s fairly difficult to be completely certain that your sample is representative unless you’ve actually measured the entire population already (using it as a baseline) but if you’ve already done that, why bother sampling?