In Part 1 of this Big Data series, I provided a background on the origins of Big Data.
But What is Big Data?
The problem with using the term “Big Data” is that it’s used in a lot of different ways. One definition is that Big Data is any data set that is too large for on-hand data management tools. According to Martin Wattenberg, a scientist at IBM, “The real yardstick … is how it [Big Data] compares with a natural human limit, like the sum total of all the words that you’ll hear in your lifetime.” Collecting that data is a solvable problem, but making sense of it, (particularly in real time), is the challenge that technology tries to solve. This new type of technology is often listed under the title of “NoSQL” and includes distributed databases that are a departure from relational databases like Oracle and MySQL. These are systems that are specifically designed to be able to parallelize compute, distribute data, and create fault tolerance on a large cluster of servers. Some examples of NoSQL projects and software are: Hadoop, Cassandra, MongoDB, Riak and Membase.
The techniques vary, but there is a definite distinction between SQL relational databases and their NoSQL brethren. Most notably, NoSQL systems share the following characteristics:
- Do not use SQL as their primary query language
- May not require fixed table schemas
- May not give full ACID guarantees (Atomicity, Consistency, Isolation, Durability)
- Scale horizontally
Because of the lack of ACID, NoSQL is used when performance and real-time results are more important than consistency. For example, if a company wants to update their website in real time based on an analysis of the behaviors of a particular user interaction with the site, they will most likely turn to NoSQL to solve this use case.
However, this does not mean that relational databases are going away. In fact, it is likely that in larger implementations, NoSQL and SQL will function together. Just as NoSQL was designed to solve a particular use case, so do relational databases solve theirs. Relational databases excel at organizing structured data and is the standard for serving up ad-hoc analytics and business intelligence reporting. In fact, Apache Hadoop even has a separate project called Sqoop that is designed to link Hadoop with structured data stores. Most likely, those who implement NoSQL will maintain their relational databases for legacy systems and for reporting off of their NosQL clusters.
Big Data and the Cloud
The early adopters of Big Data were small web companies that grew to much larger companies with capital budgets that could be invested into dedicated data centers. However, with the incredible increase in the amount of data generated, collected, and analyzed, smaller companies can take advantage of the cloud and off-load the hardware management to those vendors. Two traits that many of these NoSQL solutions have in common make them a seemingly natural fit for the cloud: One is that the nodes are distributed, and the second is that they run on commodity hardware. The cloud is designed for horizontal scaling and often built on low-cost, commodity hardware, especially at the infrastructure-as-service (IaaS) layer, where customers simply need infrastructure and have the application expertise to build and configure their own Big Data application (whether it is with Hadoop, Cassandra, or any number of products).
Given what most users are trying to achieve with Big Data applications – large-scale data sets, large-scale analysis, often in real-time – performance is a key factor. Ideally, users will want a hybrid implementation that combines both virtual and dedicated servers. This gives maximum flexibility that balances the elastic, scalable nature of virtual machines with the single-tenancy of dedicated servers. Big Data projects don’t happen in a vacuum: while a NoSQL database can leverage dedicated servers, the app or web servers that present the results of the analysis to end users can easily be added to as many virtual machines as needed to meet demand. In addition, using the cloud means that users won’t need to invest in expensive equipment, pay for power and connectivity, or hire additional resources to maintain hardware. Users simply need to pay for the infrastructure that they need and have the ability to scale over time. The ability to scale up or down to match demand (and to only pay for the infrastructure that you use) is one of the values of using the cloud for Big Data.
With whatever solution that you select, you should also take into account the nature of the application and where you will want to house the processing and the output. The amount of data you collect, analyze and present will only increase over time. The advantage will go to companies that can collect and analyze this data quickly and efficiently, allowing them to react instantly to customer sentiment and to changing trends in the ever-quickening pace of business. Make sure to select the right infrastructure vendor who can match your performance criteria and has capacity to grow with you as your data and application needs increase to match the demands of your business.