In Part 1 of this Big Data series, I provided a background on the origins of Big Data.
But What Is Big Data?
The problem with the term “Big Data” is that it is used in many different ways. One definition is that Big Data is any data set too large for on-hand data management tools to handle. According to Martin Wattenberg, a scientist at IBM, “The real yardstick … is how it [Big Data] compares with a natural human limit, like the sum total of all the words that you’ll hear in your lifetime.” Collecting that data is a solvable problem; making sense of it, particularly in real time, is the challenge that technology tries to solve.
This new type of technology is often grouped under the label “NoSQL” and includes distributed databases that depart from relational databases like Oracle and MySQL. These systems are specifically designed to parallelize computation, distribute data, and provide fault tolerance across a large cluster of servers. Examples of NoSQL projects and software include Hadoop, Cassandra, MongoDB, Riak, and Membase.
The techniques vary, but there is a definite distinction between SQL relational databases and their NoSQL brethren. Most notably, NoSQL systems share the following characteristics:
- Do not use SQL as their primary query language
- May not require fixed table schemas
- May not give full ACID guarantees (Atomicity, Consistency, Isolation, Durability)
- Scale horizontally
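To make two of these characteristics concrete (no fixed schema, horizontal scaling), here is a minimal Python sketch. It is a toy, not how any of the products named above are implemented: the `ShardedStore` class, its method names, and the sample records are all hypothetical, and plain dictionaries stand in for servers.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads records across several 'nodes'.

    Hypothetical illustration only: each dict stands in for one server.
    """

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # Hash the key so the same key always lands on the same node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, document):
        # No schema check: any dictionary is accepted as a document.
        self._node_for(key)[key] = document

    def get(self, key):
        return self._node_for(key).get(key)


store = ShardedStore(node_count=3)

# Two records with different fields live side by side (no fixed schema).
store.put("user:1", {"name": "Ada", "plan": "free"})
store.put("user:2", {"name": "Lin", "last_page": "/pricing", "visits": 7})

print(store.get("user:2"))
```

Adding capacity here means adding another dictionary to `nodes`, which is the essence of scaling out rather than up. Note that simple modulo hashing like this reshuffles most keys when the node count changes; real distributed stores generally use more careful partitioning schemes to avoid that.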
Because many NoSQL systems relax ACID guarantees, they are typically used when performance and real-time results matter more than strict consistency. For example, if a company wants to update its website in real time based on an analysis of how a particular user interacts with the site, it will most likely turn to NoSQL.
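The trade-off between speed and consistency can be sketched in a few lines of Python. This is an assumption-laden toy, not the replication protocol of any particular database: the `EventuallyConsistentPair` class and the `cart:42` key are invented for illustration. Writes are acknowledged as soon as one copy is updated, and the second copy catches up later.

```python
from collections import deque


class EventuallyConsistentPair:
    """Toy model of a primary copy and a lagging replica (hypothetical)."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = deque()  # writes not yet copied to the replica

    def write(self, key, value):
        # The write returns immediately; replication is deferred.
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, key):
        # May return stale data if replication has not caught up.
        return self.replica.get(key)

    def replicate(self):
        # Apply queued writes to the replica in the order they arrived.
        while self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value


pair = EventuallyConsistentPair()
pair.write("cart:42", ["book"])

stale = pair.read_from_replica("cart:42")  # replica has not seen the write yet
pair.replicate()
fresh = pair.read_from_replica("cart:42")  # replica has now caught up

print(stale, fresh)
```

The write path never waits on the replica, which is where the performance comes from. The cost is the window in which a reader can see old data, and that is exactly the guarantee a fully ACID system would not give up.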
However, this does not mean that relational databases are going away. In fact, in larger implementations NoSQL and SQL will likely function together. Just as NoSQL was designed to solve a particular set of use cases, relational databases solve theirs. Relational databases excel at organizing structured data and are the standard for serving ad-hoc analytics and business intelligence reporting. Apache Hadoop even has a separate project called Sqoop that is designed to link Hadoop with structured data stores. Most likely, those who implement NoSQL will maintain their relational databases for legacy systems and for reporting off of their NoSQL clusters.
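As a rough illustration of the two worlds working together, the Python sketch below takes loosely structured records, loads them into a relational table, and runs an ad-hoc SQL report over them. It uses Python's built-in `sqlite3` module purely as a stand-in for a relational database; it is not Sqoop and does not reflect how Sqoop works. The `events` records and the `page_views` table are made up for the example.

```python
import sqlite3

# Hypothetical click events as they might come out of a schema-free store.
# Note that the records do not all carry the same fields.
events = [
    {"user_id": "ada", "page": "/pricing", "ms_on_page": 5400},
    {"user_id": "lin", "page": "/pricing"},
    {"user_id": "ada", "page": "/docs", "ms_on_page": 12000, "referrer": "search"},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_views (user_id TEXT, page TEXT, ms_on_page INTEGER)"
)

# Flatten each record into the fixed columns; missing fields become NULL.
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [(e["user_id"], e["page"], e.get("ms_on_page")) for e in events],
)

# An ad-hoc reporting query of the kind relational databases are good at.
report = conn.execute(
    "SELECT page, COUNT(*) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
conn.close()

print(report)
```

The interesting step is the flattening: the relational side demands a fixed set of columns, so fields that only some records carry (such as `referrer` here) are either dropped or need a column of their own.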