Posts Tagged ‘Hadoop’


Big Data Cloud Servers for Hadoop

Monday, January 13th, 2014

GoGrid just launched Raw Disk Cloud Servers, the perfect choice for your Hadoop data nodes. These purpose-built Cloud Servers run on a redundant 10-Gbps network fabric and the latest Intel Ivy Bridge processors. What sets these servers apart, however, is the massive amount of raw storage in a JBOD (Just a Bunch of Disks) configuration: you can deploy up to 45 x 4 TB SAS disks on a single Cloud Server.

These servers are designed to serve as Hadoop data nodes, which are typically deployed in a JBOD configuration. This setup maximizes the available storage space on the server and also aids performance, with roughly 2 cores allocated per spindle to give these servers additional MapReduce processing power. In addition, these disks aren’t a virtual allocation from a larger device. Each volume is a dedicated, physical 4 TB hard drive, so you get the full drive per volume with no initial write penalty.

Hadoop in the cloud

Most Hadoop distributions call for a name node supporting several data nodes. GoGrid offers a variety of SSD Cloud Servers that are well suited to the Hadoop name node. Because they sit on the same 10-Gbps high-performance fabric as the Raw Disk Cloud Servers, SSD servers provide low-latency private connectivity to your data nodes. I recommend at least the X-Large SSD Cloud Server (16 GB RAM), although you may need a larger server depending on the size of your Hadoop cluster: because Hadoop stores its metadata in memory, you’ll want more RAM if you have a lot of files to process. You can use any size Raw Disk Cloud Server, but you’ll want to deploy at least 3. Each Raw Disk Cloud Server size has a different allocation of raw disks, illustrated in the table below; the Cloud Server shown is the smallest size with multiple disks. Hadoop defaults to a replication factor of 3, so to protect your data from failure, you’ll want at least 3 data nodes across which to distribute it. Although Hadoop attempts to replicate data to different racks, there’s no guarantee that your Cloud Servers will be on different racks.
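The sizing math above can be sketched quickly. This is a back-of-the-envelope estimate only: the 4-TB disk size and default replication factor of 3 come from this post, while the ~150 bytes of name node heap per file or block object is a commonly cited rule of thumb, not an exact figure.

```python
# Back-of-the-envelope HDFS sizing sketch (illustrative assumptions noted above).

def usable_capacity_tb(disks_per_node, nodes, disk_tb=4, replication=3):
    """Raw JBOD capacity divided by the HDFS replication factor."""
    raw_tb = disks_per_node * nodes * disk_tb
    return raw_tb / replication

def namenode_heap_gb(files, blocks, bytes_per_object=150):
    """Rough name node heap: ~150 bytes per file or block metadata object."""
    return (files + blocks) * bytes_per_object / 1e9

# Three of the largest Raw Disk servers (45 disks each) hold 540 TB raw,
# but only 180 TB of usable HDFS space once every block is stored 3 times.
print(usable_capacity_tb(disks_per_node=45, nodes=3))  # 180.0

# 10 million files in ~20 million blocks would need roughly 4.5 GB of heap,
# which is why a RAM-heavy name node matters for file-heavy workloads.
print(namenode_heap_gb(files=10_000_000, blocks=20_000_000))
```

The point of the second function is the one made above: name node RAM scales with the number of files and blocks, not with total disk capacity.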

Note that the example below is for illustrative purposes only and is not representative of a typical Hadoop cluster; for example, most Cloudera and Hortonworks sizing guides start at 8 nodes. These configurations can differ greatly depending on whether you intend to use the cluster for development, production, or production with HBase added, including the RAM and disk sizes (less of both for development, most likely more for HBase). Plus, if you’re planning to use these nodes in production, you should consider adding a second name node.

(more…) «Big Data Cloud Servers for Hadoop»

The 2013 Hadoop Summit

Monday, July 29th, 2013


I recently attended the Hadoop Summit in San Jose. This is one of two major conferences organized around Hadoop, the other being Hadoop World. Nearly all the companies with Hadoop distributions were present, along with several big users of Hadoop like Netflix, Twitter, and LinkedIn.

Crossing The Chasm

If you’re not deeply involved with Hadoop, attending these conferences a year apart can be shocking: the advancements made in just the span of a year are amazing. The conference seemed notably larger this year, and I noticed more non-tech companies in the audience. I think it’s safe to say that Hadoop has crossed the chasm, at least for enterprise IT users.

Other than the type of attendees at the event, the other signal to me was the emergence of Hadoop 2.0. This second version of Hadoop focused on features that matter to users who want to run production-grade software for mission-critical systems. High availability finally arrived for the name node (in the open-source project, not just the version Cloudera released for its distribution), along with a new version of Hive with more SQL-friendly features and YARN, which allows users to run just about anything on top of the Hadoop Distributed File System (HDFS). These kinds of stability and availability features tend to show up when there is a critical mass of users who want to run the software in production.


Quite A YARN

(more…) «The 2013 Hadoop Summit»

Architecting in the Cloud & Hadoop as A Service – GoGrid CEO Panels – CloudCon Expo 2012

Monday, October 1st, 2012

This week, despite the unseasonably hot and sunny weather in San Francisco, there are plenty of clouds, specifically at the 2012 CloudCon Expo & Conference. CloudCon is billed as the “platform to learn, collaborate & network. Find out why Cloud Computing is necessary for your enterprise and what businesses and financial implications will it have on your day-to-day operations.” Register here.


For those looking to learn more about Cloud Computing, this conference is tailored for you. Learn how to leverage the various cloud models available and how to outsource SaaS (Software as a Service), PaaS (Platform as a Service), and IaaS (Infrastructure as a Service). GoGrid is a pure-play cloud infrastructure provider. We have a variety of cloud infrastructure solutions available for large and small businesses alike including:


GoGrid CEO in 2 CloudCon Panels

(more…) «Architecting in the Cloud & Hadoop as A Service – GoGrid CEO Panels – CloudCon Expo 2012»

Press Release & Case Study: Martini Media Delivers Prized Consumer to Advertisers Using GoGrid’s Big Data Solution

Wednesday, May 9th, 2012

Hitting the wires in the cloud this morning was our announcement of Martini Media’s customer success story. When we work with our customers, we discover a lot of innovation at work, and throughout the process we help craft the best possible cloud solution. Martini Media’s unique digital platform, which advertisers use to reach affluent consumers, is a fantastic example of how Big Data and cloud computing can be used to drive business success.


In case you missed the Press Release, it is available here as well as below. But I encourage you, especially if you are looking for a Big Data solution, to download the Martini Media case study and then talk with one of our Cloud Solutions Architects. Through the use of our Big Data solution, hosted within the GoGrid cloud, Martini Media has been able to:

  • Support 100 percent annual growth
  • Realize the performance benefits of Big Data and the cost advantages of cloud computing
  • Serve targeted ads in as little as 150 milliseconds
  • Reduce latency and increase throughput speed


And if you need a primer on Big Data, where it came from and where it can take you, I highly recommend these two articles by GoGrid’s Rupert Tagnipes: (more…) «Press Release & Case Study: Martini Media Delivers Prized Consumer to Advertisers Using GoGrid’s Big Data Solution»

The Big Data Revolution – Part 2 – Enter the Cloud

Wednesday, March 21st, 2012

In Part 1 of this Big Data series, I provided a background on the origins of Big Data.

But What is Big Data?


The problem with the term “Big Data” is that it’s used in a lot of different ways. One definition is that Big Data is any data set too large for on-hand data management tools. According to Martin Wattenberg, a scientist at IBM, “The real yardstick … is how it [Big Data] compares with a natural human limit, like the sum total of all the words that you’ll hear in your lifetime.” Collecting that data is a solvable problem; making sense of it (particularly in real time) is the challenge that this technology tries to solve. It is often grouped under the label “NoSQL” and includes distributed databases that are a departure from relational databases like Oracle and MySQL. These systems are specifically designed to parallelize compute, distribute data, and create fault tolerance across a large cluster of servers. Some examples of NoSQL projects and software are Hadoop, Cassandra, MongoDB, Riak, and Membase.

The techniques vary, but there is a definite distinction between SQL relational databases and their NoSQL brethren. Most notably, NoSQL systems share the following characteristics:

  • Do not use SQL as their primary query language
  • May not require fixed table schemas
  • May not give full ACID guarantees (Atomicity, Consistency, Isolation, Durability)
  • Scale horizontally
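The last characteristic, horizontal scaling, is often achieved by hashing keys onto a ring of nodes so that adding a node remaps only a fraction of the data. The toy consistent-hashing ring below is purely illustrative, not the code of any particular NoSQL product; node names and the virtual-node count are invented for the example.

```python
# Toy consistent-hashing ring: a common way NoSQL stores spread keys
# across nodes so the cluster can scale horizontally.
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` points on the ring for balance.
        self.ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}:{i}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # A key belongs to the first node clockwise from its hash.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # deterministic: same key, same node
```

Because placement depends only on the hash ring, no central lookup table or fixed schema is needed, which is exactly what lets these systems add nodes without rebalancing everything.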

Because of the lack of full ACID guarantees, NoSQL is used when performance and real-time results are more important than consistency. For example, if a company wants to update its website in real time based on an analysis of a particular user’s interactions with the site, it will most likely turn to NoSQL to solve this use case.

However, this does not mean that relational databases are going away. In fact, in larger implementations NoSQL and SQL will likely function together. Just as NoSQL was designed to solve a particular use case, relational databases solve theirs: they excel at organizing structured data and are the standard for serving up ad-hoc analytics and business intelligence reporting. In fact, Apache Hadoop even has a separate project, called Sqoop, that is designed to link Hadoop with structured data stores. Most likely, those who implement NoSQL will maintain their relational databases for legacy systems and for reporting off of their NoSQL clusters.
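Sqoop itself is a Java-based command-line tool, but the pattern it automates is simple enough to sketch: pull rows out of a relational store and land them as delimited flat files that a Hadoop job can consume. The sketch below uses an in-memory SQLite table with an invented `orders` schema purely for illustration; it is not Sqoop code.

```python
# Minimal sketch of the relational-to-flat-file pattern that Sqoop automates.
# The `orders` table and its columns are hypothetical example data.
import csv
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 24.50)])

# Dump the table to a delimited file, the shape Sqoop writes into HDFS.
with open("orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "amount"])
    writer.writerows(conn.execute("SELECT id, amount FROM orders"))
```

In a real deployment Sqoop would write these records directly into HDFS (and can also export them back), which is what makes the SQL-plus-NoSQL coexistence described above practical.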

(more…) «The Big Data Revolution – Part 2 – Enter the Cloud»