Implementing Big Data in the Cloud: 3 Pitfalls that Could Cost You Your Job

November 25th, 2013 by Kole Hicks

In IT departments around the globe, CTOs, CIOs, and CEOs are asking the same question: “How can we use Big Data technologies to improve our platform operations?” Depending on your role, you could be responsible for a wide variety of use cases, ranging from real-time monitoring and alerting to platform operations analysis to behavioral targeting and marketing operations. The solutions for these use cases vary just as widely. But no matter which Big Data solution you choose, make sure you avoid the following 3 pitfalls.

Pitfall #1: Assuming a single solution fits all use cases

In a recent post, Liam Eagle of 451 Research looked at GoGrid’s Big Data product set, which is purpose-built for handling different types of workloads. He noted that variety is the key here. There isn’t a single one-size-fits-all solution for all your use cases. At GoGrid, for example, many of our Big Data customers are using 3 to 5 solutions, depending on their use case, and their platform infrastructure typically spans a mix of cloud and dedicated servers running on a single VLAN. So when you’re evaluating solutions, it makes sense to try out a few, run some tests, and ensure you have the right solution for your particular workload. It’s easy for an executive to tell you, “I want to use Hadoop,” but it’s your job that’s on the line if Hadoop doesn’t meet your specific needs.


As I’m sure you already know, Big Data isn’t just about Hadoop. For starters, let’s talk about NoSQL solutions. The following breakdown lays out a few options, along with their common use cases, pros, and cons, to help illustrate the point.

Cassandra

Common use cases

  • Geographic distribution – Replicates data across multiple data centers out-of-the-box for applications that benefit from having the data near the user.
  • Social media – Analysis of high-velocity, large-volume social media data from Twitter or other channels.
  • Complex applications – Applications that deal with large volumes of multi-structured data and require real-time, highly available data interactivity, such as the ability to store and access data in columns, perform fast inserts, use distributed counters, or take advantage of solid-state drives (SSDs).

Pros

  • Awesome solution for high-write applications: writes in Cassandra can be much faster than reads (see the counter sketch after this list).
  • Proven ability to scale beyond most other NoSQL solutions.

Cons

  • Typically complex to set up and manage (although GoGrid’s 1-Button Deploy™ technology does make the setup dead simple).
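To make the high-write pattern concrete, here’s a minimal sketch using the open-source DataStax Python driver to bump a distributed counter. The node address, keyspace, and counter table are hypothetical and assumed to already exist.

```python
# Minimal sketch using the DataStax Python driver (pip install cassandra-driver).
# The node address, keyspace, and counter table below are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])        # a seed node in the cluster (assumed)
session = cluster.connect("metrics")   # keyspace (assumed to exist)

# Counter columns make high-velocity writes cheap: the increment happens
# server-side, so the client never does a read-modify-write round trip.
# Assumes: CREATE TABLE page_views (page_id text PRIMARY KEY, views counter);
session.execute(
    "UPDATE page_views SET views = views + 1 WHERE page_id = %s",
    ("home",),
)

cluster.shutdown()
```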
MongoDB

Common use cases

  • Storing log data – Servers generate a large number of events (i.e., “logging”) that contain useful information about their operation, including errors, warnings, and user behavior. By default, most servers store this data in plain-text log files on their local file systems. Although plain-text logs are accessible and human-readable, they are difficult to use, reference, and analyze without a holistic system for aggregating and storing the data. MongoDB acts as a persistent storage engine for log data from servers and other machine data (see the sketch after this list).
  • Product data storage – MongoDB’s flexible schema makes it particularly well suited to storing information for product data management and eCommerce websites and solutions. Product catalogs, for example, need the capacity to store many different types of objects with different attributes.

Pros

  • Easy to manage and maintain.
  • SQL-like, but shouldn’t be viewed as a replacement for a relational database management system (RDBMS) such as MySQL or PostgreSQL.
  • Works as both a caching database and a NoSQL data store.
  • Sharding is built in.

Cons

  • Not as fast at map/reduce operations as some of the other solutions.
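As a quick illustration of the log-storage use case, here’s a minimal sketch with PyMongo; the host, database, collection, and event fields are all hypothetical.

```python
# Minimal sketch using PyMongo (pip install pymongo); the host, database,
# collection, and event fields below are hypothetical.
from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://10.0.0.2:27017")
events = client["ops"]["log_events"]

# Each log event is just a document, so servers can start emitting new
# fields without any schema migration.
events.insert_one({
    "host": "web-01",
    "level": "ERROR",
    "message": "disk usage at 91%",
    "ts": datetime.now(timezone.utc),
})

# Index the timestamp so "most recent events" queries stay fast as data grows.
events.create_index([("ts", DESCENDING)])

for doc in events.find({"level": "ERROR"}).sort("ts", DESCENDING).limit(10):
    print(doc["host"], doc["message"])
```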
Riak

Common use cases

  • Session storage – Riak was originally created to serve as a highly scalable session store. This is an ideal use case for Riak, which is fundamentally a key-value store. Because user/session IDs are usually stored in cookies or otherwise known at lookup time, Riak is able to serve these requests with predictably low latency.
  • Complex session storage – The Bitcask storage backend supports automatic expiry of keys, which frees application developers from implementing manual session expiry. Riak’s MapReduce system can also be used to perform analysis on large bodies of session data, for example to compute the average number of active users. If sessions must be retrieved using multiple keys (e.g., a UUID or email address), Secondary Indexes (2i) provide an easy solution.
  • Serving ads quickly to many users and platforms – Riak’s tunable CAP controls can be set to favor fast read performance. By setting the r value to 1, only one of n replicas needs to respond to complete a read operation, yielding lower read latency than an r value equal to the number of replicas. This capability is ideal for advertising traffic, which primarily serves reads (see the sketch after this list).

Pros

  • No need to shard because it’s set up in a ring!
  • Simple to set up and maintain.
  • High availability and fault tolerance out-of-the-box.
  • Great fit when you need predictable latency.

Cons

  • Not a good choice if you’re going to run a production cluster with fewer than 5 servers.
  • Not a great fit if your data can’t be modeled as keys and values.
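Here’s what the session-store pattern might look like with the Riak Python client; the node address, bucket, and session contents are hypothetical, and the r=1 read mirrors the tunable-consistency point above.

```python
# Minimal sketch using the Riak Python client (pip install riak); the node
# address, bucket name, and session contents are hypothetical.
import riak

client = riak.RiakClient(host="10.0.0.3", pb_port=8087)
sessions = client.bucket("sessions")

# Store the session under the ID that's already in the user's cookie,
# so lookups are a single key-value get.
obj = sessions.new("session:4f2c9a", data={"user_id": 42, "cart": []})
obj.store()

# r=1: return as soon as one replica answers, trading a little consistency
# for predictably low read latency.
fetched = sessions.get("session:4f2c9a", r=1)
print(fetched.data)
```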
HBase

Common use cases

  • Serving data to many users or applications – Relational databases are not inherently distributed, so as the number of users reading and writing data grows, the storage, memory, and CPU requirements can quickly exceed what a single machine can serve. HBase is distributed by design: it’s architected to leverage the storage, memory, and CPU resources of any number of servers (or nodes) in a cluster to scale the database horizontally as load and performance demands increase.
  • Providing fast, random read/write access to users and applications – The Hadoop Distributed File System (HDFS) is a “write once, read many” (WORM) file system tuned for batch operations, emphasizing high throughput over low latency. HBase augments HDFS by providing record-based storage that lets users and applications perform fast, random reads and writes. Changes are cataloged in memory and eventually pushed down to HDFS for persistence, which lets the Hadoop system serve random reads and writes across big tables in real time (see the sketch after this list).

Pros

  • Near-real-time lookups (though not a replacement for an RDBMS).
  • Easy to set up as both an input and output repository for MapReduce jobs.

Cons

  • CPU- and memory-intensive.
  • Can suffer from unpredictable latencies.
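And a minimal sketch of the fast, random read/write pattern using HappyBase, a popular Python client that talks to HBase through its Thrift gateway. The host, table name, and column family are hypothetical and assumed to already exist.

```python
# Minimal sketch using HappyBase (pip install happybase), which talks to
# HBase via its Thrift gateway. Host, table, and column family are
# hypothetical and assumed to already exist.
import happybase

connection = happybase.Connection("10.0.0.4")
table = connection.table("user_actions")

# Random write: a single-cell put, visible to readers in near real time.
table.put(b"user42|2013-11-25T10:00:00", {b"a:action": b"checkout"})

# Random read: fetch one row directly by key, with no table scan.
row = table.row(b"user42|2013-11-25T10:00:00")
print(row[b"a:action"])

connection.close()
```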

And these are just a few examples. Managing Big Data goes way beyond NoSQL and Hadoop. In fact, it probably includes ingestion technology like Flume, plus NoSQL, MapReduce, an RDBMS, caching, and the associated hooks into various applications, depending on how you want to surface information. This leads us to the second pitfall you’ll want to avoid.

Pitfall #2: Blowing the IT budget by over-planning for scale

Big Data is, after all, “big,” right? The answer is: it depends. When you’re developing your platform, Big Data is actually not so big, and during the initial launch you probably don’t need as much capacity as you might think. Herein lies the conundrum. If you over-plan and blow your budget, you could lose your job. If you under-plan and the application goes down, you could lose your job. So what’s a smart person to do? The answer is pretty simple: Use the cloud, or at least a mix of cloud and dedicated infrastructure, so you can meet spikes in demand while avoiding over-provisioning.

At GoGrid, our most successful customers use a mix of dedicated and cloud infrastructure for various reasons: some want to develop in the cloud and move to dedicated infrastructure to accommodate specific configuration requirements, and some want to use our SSD Cloud Servers and Block Storage-optimized network fabric while keeping parts of their operation on single-tenant dedicated servers. By using a flexible mix of infrastructure, our customers are able to meet stringent HIPAA and PCI compliance standards and take advantage of the elastic nature of cloud infrastructure to ensure they always have the appropriate capacity to meet their needs. If you do the same, that unplanned spike of data coming into the system suddenly becomes no problem at all. And if your development cycles take longer than you expected, you haven’t blown the budget because you didn’t purchase thousands of servers’ worth of capacity up-front. By selecting the right partners, you’re able to create the flexible environment your business needs, which leads me to the third and final pitfall you’ll want to avoid.

Pitfall #3: Selecting a proprietary service provider rather than an Open Data Services provider

Evaluation of Big Data solutions typically goes something like this: line up 3 to 5 solutions that solve for a particular use case, come up with evaluation and testing criteria, run a proof of concept (POC) or two or three, and then select the highest-performing solution. One evaluation criterion that’s often overlooked, however, is potential lock-in to proprietary technology that runs on only one platform. Here are the top 3 disadvantages of lock-in cited by our customer base:

  1. The inability to scale across multiple service providers can bust your budget.
    Translation: You could lose your job because your service provider is too expensive and you’re locked into their platform.
  2. A lack of developers who understand and can support the solution makes hiring hard.
    Translation: The one developer you hired to support the solution left the company, so now you can’t manage the solution, the platform fails, and you lose your job.
  3. Inconsistent architecture across service providers and on-prem leads to operational overhead.
    Translation: You might not lose your job for creating additional operational overhead, but you’d probably like to keep more of your budget if you can.

On the flip side, selecting an Open Data Services (ODS) provider means you gain the following 3 benefits:

  1. You can easily scale across multiple service providers because the technology the ODS provider supports is open and can be ported to other providers. Being able to scale like this also makes high-availability operations like disaster recovery and failover much easier because you don’t have to re-invent the wheel when performing them.
  2. Your hiring challenge becomes a thing of the past because when technology is supported by thriving communities, there are lots of people out there you can target and hire as you need them.
  3. You enjoy greater operational efficiency, which means you can spend less time managing multiple platforms and more time focusing on the projects that will help you be more competitive and ultimately drive more revenue.

When you finally do decide you’re ready to evaluate and/or run your Big Data solution, just remember that GoGrid makes it easy. We offer the largest set of open data solutions and purpose-built infrastructure on the market today. But don’t just take my word for it: check out our solutions for yourself. And if you need help performing tests, setting up your evaluation criteria, or selecting a solution, feel free to chat with one of our team members. We’re happy to help.

For even more information on Big Data, you can also read our new “essentials” white paper.


Kole Hicks

Senior Director of Product Management at GoGrid
Kole Hicks is the Senior Director of Product Management for GoGrid, the leader in Open Data Services (ODS), committed to delivering purpose-built, non-opinionated Big Data solutions and services for the management and integration of open source, commercial, and proprietary technologies across multiple platforms.
