
HBase Made Simple

April 30th, 2014 by Rupert Tagnipes

GoGrid has just released its 1-Button Deploy™ of HBase, available to all customers in the US-West-1 data center. This technology makes it easy to deploy either a development or production HBase cluster on GoGrid’s high-performance infrastructure. GoGrid’s 1-Button Deploy™ technology combines the capabilities of one of the leading NoSQL databases with our expertise in building high-performance Cloud Servers.

HBase is a scalable, high-performance, open-source database. It is often called the Hadoop distributed database: it leverages the Hadoop framework but adds several capabilities, such as real-time queries and the ability to organize data into a table-like structure. GoGrid’s 1-Button Deploy™ of HBase takes advantage of our SSD and Raw Disk Cloud Servers while making it easy to deploy a fully configured cluster. GoGrid deploys the latest Hortonworks distribution of HBase on Hadoop 2.0. If you’ve ever tried to deploy HBase or Hadoop yourself, you know it can be challenging. GoGrid’s 1-Button Deploy™ does all the heavy lifting and applies all the recommended configurations to ensure a smooth path to deployment.

Why GoGrid Cloud Servers?

SSD Cloud Servers have several high-performance characteristics. They all come with attached SSD storage and large available RAM for the high I/O uses common to HBase. The Name Nodes benefit from the large RAM options available on SSD Cloud Servers and the Data Nodes use our Raw Disk Cloud Servers, which are configured as JBOD (Just a Bunch of Disks). This is the recommended disk configuration for Data Nodes, and GoGrid is one of the first providers to offer this configuration in a Cloud Server. Both SSD and Raw Disk Cloud Servers use a redundant 10-Gbps public and private network to ensure you have the maximum bandwidth to transfer your data. Plus, the cloud makes it easy to add more Data Nodes to your cluster as needed. You can use GoGrid’s 1-Button Deploy™ to provision either a 5-server development cluster or an 11-server production cluster with Firewall Service enabled.

Development Environments

The smallest recommended size for a development cluster is 5 servers. Although it’s possible to run HBase on a single server, you won’t be able to test failover or how data is replicated across nodes. You’ll most likely have a small database so you won’t need as much RAM, but will still benefit from SSD storage and a fast network. The Data Nodes use Raw Disk Cloud Servers and are configured with a replication factor of 3.

[Image: HBase-dev]
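To get a feel for what a replication factor of 3 means for storage, here is a minimal sketch of the arithmetic. The node count and disk sizes are made-up illustration values, not the actual specification of GoGrid’s development cluster, and the function name is hypothetical.

```python
# Rough HDFS capacity estimate under block replication.
# All numbers used here are hypothetical illustration values.

def usable_capacity_tb(data_nodes: int, raw_tb_per_node: float, replication: int = 3) -> float:
    """Approximate usable capacity when every block is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    if data_nodes < replication:
        # With fewer nodes than replicas, blocks would stay under-replicated;
        # this sketch simply treats that layout as an error.
        raise ValueError("need at least as many Data Nodes as the replication factor")
    return data_nodes * raw_tb_per_node / replication


if __name__ == "__main__":
    # Hypothetical: 3 Data Nodes with 6 TB of raw disk each.
    print(usable_capacity_tb(3, 6.0))  # 6.0
```

In other words, every terabyte you store consumes roughly three terabytes of raw disk across the Data Nodes, which is worth keeping in mind when you size a cluster.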

Production Environments

HBase production environments need to be configured for high availability. GoGrid’s configuration includes 2 Name Nodes (a primary and a backup) so the cluster continues to run in the event of a Name Node failure. There are also 5 Data Nodes, on the assumption that a larger data set will be processed in production. The HBase Master is on a separate node, and ZooKeeper is deployed on its own 3-node cluster. Other configurations are possible, but this is the one provisioned with 1-Button Deploy™. You’ll probably need to deploy more Data Nodes as your cluster grows.
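As a quick sanity check, the layout above can be tallied in a few lines. This is only a sketch: the dictionary and function names are hypothetical, and the quorum rule is the general ZooKeeper majority rule rather than anything specific to GoGrid’s deployment.

```python
from typing import Dict

# Production layout as described in the paragraph above.
PRODUCTION_LAYOUT: Dict[str, int] = {
    "Name Node": 2,      # primary and backup
    "Data Node": 5,
    "HBase Master": 1,
    "ZooKeeper": 3,
}


def total_servers(layout: Dict[str, int]) -> int:
    """Sum the servers across all roles."""
    return sum(layout.values())


def zookeeper_failures_tolerated(ensemble_size: int) -> int:
    """A ZooKeeper ensemble keeps working while a strict majority of nodes is up."""
    return (ensemble_size - 1) // 2


if __name__ == "__main__":
    print(total_servers(PRODUCTION_LAYOUT))   # 11
    print(zookeeper_failures_tolerated(3))    # 1
```

The total matches the 11-server production cluster, and the majority rule shows why ZooKeeper gets 3 nodes: a 3-node ensemble survives the loss of one node, while a 2-node ensemble would not survive any.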

The GoGrid Firewall Service is deployed to limit public access to the cluster. You’re able to modify the level of access through our management console. The cluster itself is configured to communicate completely on the private network. The base production configuration uses 11 public IPs. If you’re a new customer, you should have enough public IPs in your initial public subnet to deploy 1 cluster. However, if you’re an existing customer, you may need to request additional public IPs to deploy this cluster. You can make this request in the management console.

[Image: HBase-prod]

Hadoop 2.0

Hadoop is a very popular framework and is the analytic engine of choice for many companies dealing with Big Data. The first release of Hadoop is often described as MapReduce + HDFS (Hadoop Distributed File System). As a new technology, it lacked the flexibility and robustness enterprises required. Hadoop 2.0 is an attempt to prepare Hadoop for mainstream use.

With the initial release of Hadoop, the Name Node was a single point of failure. Hadoop 2.0 addresses this issue by supporting a truly highly available Name Node: if the active Name Node fails, the standby Name Node takes over. This process isn’t the same as Hadoop 1.x’s Secondary Name Node, which only stored checkpoints. Hadoop 2.0 also adds HDFS Federation, which supports multiple namespaces for managing HDFS.

The MapReduce + HDFS paradigm was also limiting because nearly every addition had to be expressed as a MapReduce job. In Hadoop 2.0, resource management is separated from MapReduce and handled by a new manager called YARN, which lets other processing engines (MapReduce, SQL, Storm, etc.) run on the same cluster against the same HDFS data. This approach greatly increases the flexibility of Hadoop; however, it’s a major change for those familiar with Hadoop 1.x. Gone are 1.x’s JobTracker and TaskTracker; they’ve been replaced by the ResourceManager, NodeManagers, and per-application ApplicationMasters. Although Hadoop 2.0 is now GA, it’s relatively new and constantly updated. I recommend looking through Hortonworks’ documentation to get a better understanding of all the changes.

[Image: Hadoop2]

Why HBase?

Hadoop is great if you want to run batch jobs. When you want to query and update your data interactively in real time, however, you’ll want to use HBase. Although it’s a separate Apache project, it’s built on top of Hadoop and leverages many of its components, notably HDFS. Customers can use HBase like a database: inserting data, creating tables, and running queries. It also supports real-time interactions on a large scale. HBase often competes with Cassandra because they have similar capabilities, but there is one major distinction. Cassandra is frequently described as an AP system, meaning that it favors Availability and Partition tolerance over Consistency. In comparison, HBase is a CP system, meaning that it favors Consistency and Partition tolerance over Availability.

Facebook is an early adopter of HBase, which it uses for its messaging service (and now for graph search). The messaging service is a perfect use case for why a company would want to use HBase. Messaging involves a large amount of unstructured data that users expect to interact with in real time. When you post a message, you expect a quick response (assuming the person on the other end is online). You also expect the message you enter to be the same message received by the recipient. This is where consistency is important. HBase will ensure strong consistency, meaning that regardless of which node responds, it will always return the same answer. However, it may not always respond. In the case of a messaging app, this is acceptable behavior because users are accustomed to recipients not always responding immediately.
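The difference between the two trade-offs is easier to see in a toy model. The sketch below is not how HBase or Cassandra is implemented; the class names are hypothetical and the "network" is just a boolean flag. It only illustrates the behavior described above: during a failure, a CP-style store refuses to answer, while an AP-style store answers but may return stale data.

```python
class Unavailable(Exception):
    """Raised when a store refuses to answer rather than risk a stale reply."""


class CPStore:
    """Toy CP store: a single owner serves every read and write for a key."""

    def __init__(self):
        self._data = {}
        self.owner_reachable = True

    def put(self, key, value):
        if not self.owner_reachable:
            raise Unavailable(key)
        self._data[key] = value

    def get(self, key):
        if not self.owner_reachable:
            raise Unavailable(key)
        return self._data.get(key)


class APStore:
    """Toy AP store: two replicas that each answer from their local copy."""

    def __init__(self):
        self.replicas = [{}, {}]
        self.partitioned = False

    def put(self, key, value, replica=0):
        self.replicas[replica][key] = value
        if not self.partitioned:
            # While the network is healthy, copy the write to every replica.
            for other in self.replicas:
                other[key] = value

    def get(self, key, replica=0):
        return self.replicas[replica].get(key)


if __name__ == "__main__":
    ap = APStore()
    ap.put("msg", "hello")
    ap.partitioned = True
    ap.put("msg", "hello again", replica=0)
    print(ap.get("msg", replica=1))  # hello (stale, but an answer)

    cp = CPStore()
    cp.put("msg", "hello")
    cp.owner_reachable = False
    try:
        cp.get("msg")
    except Unavailable:
        print("no answer")  # consistent, but unavailable for now
```

For a messaging service, the second behavior is usually the safer one: a short wait is easier to tolerate than two people seeing different versions of the same conversation.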

1-Button Deploy™ of HBase

Assuming you have a GoGrid account, log in to the management console. Once you’re in the Grid View, click on the “Add” (+) button to open the “Add New Infrastructure” window. Once you’re there, click on “Big Data Clusters” and select HBase.

[Image: Add-HBase-wizard]

You’re then presented with a screen where you can select either a production or development cluster. Once you click “Deploy,” all the servers, related infrastructure, and configurations will be deployed.

[Image: HBase-options-2]

You’ll then see the servers for each node as well as an icon representing the cluster itself.

Getting Started

When all the server status lights turn green, the cluster has been deployed successfully. However, the cluster may still be converging even if all the servers show green. I recommend waiting a few minutes longer before trying to run HBase. If there’s a problem building the cluster, then the process will roll back, deleting all the infrastructure. You’ll need to attempt to deploy the cluster again in that scenario.
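If you’d rather not watch the clock, one option is to poll a server until it accepts TCP connections. The helper below is a minimal sketch using only Python’s standard library; the function name, the host name in the usage comment, and the timeout values are all assumptions for illustration. Note that a reachable SSH port only tells you the server is up, not that HBase has finished converging, so it complements the waiting period rather than replacing it.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 300.0, interval_s: float = 5.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            time.sleep(min(interval_s, remaining))


# Example usage (hypothetical host name; port 22 is SSH):
#   if wait_for_port("hbase-master.example.internal", 22):
#       print("HBase Master is accepting SSH connections")
```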

Assuming everything is up and running, log in to the HBase Master. Use the console service or your favorite client to SSH into the server. If you’re unable to connect with your client, you may need to check your firewall rules, especially if you’ve deployed the production cluster. Once you’re logged into the server, type “hbase shell” at the prompt to open a command-line interface to the database. If you don’t get any errors, then HBase should be running successfully on that server. For more details on HBase, look through our wiki; for more advanced help, review the HBase online manual.
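Once the shell opens, a short smoke test confirms that the cluster can actually create a table, write to it, and read the data back. The sketch below simply generates the commands so you can paste them at the shell prompt; the function name, the scratch table name, and the column family name are hypothetical, while the commands themselves (status, create, put, get, scan, disable, drop) are standard HBase shell commands.

```python
from typing import List


def smoke_test_commands(table: str = "smoke_test", family: str = "cf") -> List[str]:
    """Build HBase shell commands that create, write, read, and drop a scratch table."""
    return [
        "status",                                                # cluster summary
        f"create '{table}', '{family}'",                         # table with one column family
        f"put '{table}', 'row1', '{family}:greeting', 'hello'",  # write one cell
        f"get '{table}', 'row1'",                                # read the row back
        f"scan '{table}'",                                       # list every row
        f"disable '{table}'",                                    # a table must be disabled before it is dropped
        f"drop '{table}'",
    ]


if __name__ == "__main__":
    print("\n".join(smoke_test_commands()))
```

If the get and scan commands return the value you wrote, the Master, the Data Nodes, and ZooKeeper are all cooperating, which is a stronger signal than the shell merely starting without errors.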

GoGrid is committed to releasing additional Big Data solutions using 1-Button Deploy™ technology. In our experience, customers typically deploy multiple Big Data solutions to test what works best for them and which solutions are better suited for specific parts of their business process. Try GoGrid’s 1-Button Deploy™ for HBase today and see if it’s the right solution for you!


Rupert Tagnipes

Director, Product Management at GoGrid
Rupert Tagnipes is Director of Product Management at GoGrid, where he is responsible for managing and expanding the company’s multiple product lines. His focus is on leveraging his technical background and industry knowledge to drive product innovation and increase adoption of the cloud. He has extensive software product experience at technology companies in Silicon Valley, solving data analytics and cloud infrastructure problems for customers across multiple industries.
