
Author Archive

I recently attended Under the Radar 2012, where GoGrid was a sponsor. Because there were several tracks, Michael Sheehan and I split them: I covered Infrastructure, Database Scalability and Big Data, while Michael covered Mobile Access, Infrastructure, Performance Monitoring and PaaS in Part 1. Overall, the presenting companies had some compelling ideas, and the event gives an indication of the new thinking happening in Silicon Valley. The trends that I noticed were a continued interest in private clouds, increasing adoption of OpenStack and the prevalence of Big Data integration.


If you have never attended Under the Radar, the format is to have four startups that already have a real product each present for 6 minutes and then be judged by a panel of experienced executives from more established companies. The presenters had to be actual startups with a unique value proposition and a real product that they are able to monetize. Alumni or companies that are already more established can also present as “Grad Circle” members, but they are not included in the awards presented at the end of the show. And like American Idol, the audience also gets a vote on its favorites in each category. I included the Judge’s choice and Audience choice for each category but also added my own choice, which reflects my own opinion and not that of GoGrid.

Infrastructure

This category focused on companies that are delivering infrastructure or infrastructure management products. So this would include services that could offer up infrastructure components (like compute, network, and storage) or even tools for managing configurations and deployments. Not surprisingly, nearly all of them focus on the cloud as the operating model of choice.

Cloudscaling – This company focuses on delivering an Amazon-like cloud using OpenStack. Their solution comprises Open Cloud OS, which is a production-grade version of OpenStack; Cloudblocks, a comprehensive architecture for cloud services; and Hardware Blueprints, which are templates for physical hardware. Customers can leverage this solution to deploy a public or private cloud in their own data center.

Nodejitsu – Sticking with the Japanese theme of cloud automation companies (a la Heroku), this company makes it easy for customers using Node.js to deploy and automate services on the cloud. While Heroku’s strength is Ruby, Nodejitsu focuses on JavaScript, which they believe to be faster and to have greater staying power than other higher-level languages.

Piston Cloud Computing – Its core product, Piston Enterprise OS, is a massively scalable private cloud operating system built on OpenStack. It is designed for any company tackling Big Data and is pitched at 1/4 the cost of VMware.

Zadara Storage – Focused on providing low-cost block storage as a service inside the cloud. Zadara provides an easy-to-use and flexible storage solution on multiple leading public clouds. The product operates as a virtual private storage array.

Puppet Labs (Grad Circle) – A former best in show company, Puppet Labs manages the open source Puppet configuration management tool, one of the leading products used at companies like Zynga and Citrix.

Judge’s Winner: Piston Cloud Computing
Audience Winner: Piston Cloud Computing
My Choice: Piston Cloud Computing

Providing competition for VMware is compelling, especially if it can be done by leveraging open source technologies, so Piston Cloud is also my choice in this category. Enterprises have great interest in private clouds, but providing one that is based on an open technology gives more flexibility for hybrid clouds and a path to an eventual migration to the public cloud.

Database Scalability

This category focused on companies that are building products designed to handle large-scale datasets. This would include enhancements to open source products to better handle scale or new designs for handling larger datasets that need to be delivered faster and more efficiently. I felt that there is some overlap between this category and the Big Data category since they are interrelated. For example, MongoLab can be used to solve Big Data problems, and the Big Data presenters can argue that they also offer some form of database scalability.

Drawn to Scale – Builds a database called Spire that leverages Hadoop, HBase and their own software to provide real-time analysis for Big Data. This helps with use cases where users already use Hadoop but need the ability to run real-time SQL queries against the data.

MemSQL – A Y Combinator startup that offers an OLTP database that lives in memory. In combination with MySQL, this gives users a way to have high-throughput transactions while also having persistence of data on disk.

MongoLab – Provides MongoDB as a service that is designed to better work with object-oriented development. It removes the operational and administration layer from developers and provides monitoring and backups for MongoDB.

ScaleArc – Sits in between the database and the app servers and helps with optimization and performance of MySQL-related databases. It operates like a load balancer for databases.

NuoDB (Grad Circle and formerly known as NimbusDB) – A new SQL and ACID compliant relational database that is designed to run on a distributed architecture like the cloud.

Judge’s Winner: ScaleArc
Audience Winner: MemSQL
My Choice: MongoLab

In this category, I would pick MongoLab. With MongoDB’s adoption growing in the marketplace, it’s important to have services that enhance the database, especially if they can be delivered via a public cloud.

Big Data

This category offered companies that are solving problems around Big Data. This could involve technologies that are used to better handle Big Data or services that ease the collection, transformation or analysis of Big Data.  It seemed that most presenters generated or provided large amounts of 3rd party data in addition to providing products and services for Big Data.

Chart.io – Makes charting and analytics easy for non-technical users. Able to connect to MySQL, Postgres, Google Analytics and Oracle (in the near future). Chart.io provides an alternative to heavyweight on-premise business intelligence products.

Datasift – Helps customers extract, analyze and gain value from social networks. Although it can pull from over 30 data sources, it is notable as one of two exclusive re-syndicators of Twitter data. Datasift operates in a SaaS model so companies can be up and running in minutes.

Infochimps – Provides a full Big Data platform for processing and analyzing data from their own data marketplace or anywhere on the web. This makes it easy to source data, leveraging Hadoop and providing services on top of that platform.

Metamarkets – Provides data science as a service, providing for data exploration, decision support and operational awareness. Using their own technology, the product is able to process large volumes of data at speed and scale.

Judge’s Winner: Metamarkets
Audience Winner: Metamarkets
My Choice: Infochimps

Having the ability to pull 3rd party data or any data from the web and analyze it without investing in your own infrastructure is valuable. Hadoop is difficult to wrangle, although it is currently one of the leading technologies for running MapReduce jobs. Providing the data and services on top of Hadoop makes Infochimps a key solution in my mind.

Top Winners

I would be remiss if I didn’t mention the top winners for the day:

VentureBeat People’s Choice Winner: AppFog
Best in Show – Judge’s Winner: Cloudability
Best in Show – Audience Winner: Piston Cloud Computing

Overall, all the companies that presented were very interesting, showing the kind of innovation and creativity that we have come to expect from early stage startups. While cloud is still a strong theme here, I think that the future is moving towards Big Data. I think we will see the two themes start to converge as users start to see the power of using Big Data solutions in the cloud for performance and cost-effective deployments. The enthusiasm and effort given by these startups bode well for the technology industry and are a harbinger of the great things to come.


In Part 1 of this Big Data series, I provided a background on the origins of Big Data.

But What is Big Data?


The problem with using the term “Big Data” is that it’s used in a lot of different ways. One definition is that Big Data is any data set that is too large for on-hand data management tools. According to Martin Wattenberg, a scientist at IBM, “The real yardstick … is how it [Big Data] compares with a natural human limit, like the sum total of all the words that you’ll hear in your lifetime.” Collecting that data is a solvable problem, but making sense of it (particularly in real time) is the challenge that technology tries to solve. This new type of technology is often listed under the title of “NoSQL” and includes distributed databases that are a departure from relational databases like Oracle and MySQL. These are systems that are specifically designed to parallelize compute, distribute data, and provide fault tolerance on a large cluster of servers. Some examples of NoSQL projects and software are Hadoop, Cassandra, MongoDB, Riak and Membase.

The techniques vary, but there is a definite distinction between SQL relational databases and their NoSQL brethren. Most notably, NoSQL systems share the following characteristics:

  • Do not use SQL as their primary query language
  • May not require fixed table schemas
  • May not give full ACID guarantees (Atomicity, Consistency, Isolation, Durability)
  • Scale horizontally

Because NoSQL systems may relax ACID guarantees, they are typically used when performance and real-time results are more important than strict consistency. For example, if a company wants to update its website in real time based on an analysis of how a particular user interacts with the site, it will most likely turn to NoSQL to solve this use case.
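To make the “scale horizontally” trait more concrete, here is a minimal Python sketch of how keys might be spread across a cluster by hashing. The node names and helper functions are hypothetical and are not taken from any particular NoSQL product; real systems typically use more elaborate schemes such as consistent hashing so that adding a node does not reshuffle most of the data.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick the node that owns `key` using simple modulo hashing.

    This is an illustrative scheme only, not the algorithm of any
    specific database.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement


if __name__ == "__main__":
    # Hypothetical three-node cluster and 1,000 hypothetical user keys.
    cluster = ["node-a", "node-b", "node-c"]
    layout = distribute([f"user:{i}" for i in range(1000)], cluster)
    for node, owned in layout.items():
        print(node, len(owned))
```

Adding capacity in this model means adding another entry to the node list rather than buying a bigger server, which is the essence of horizontal scaling.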

However, this does not mean that relational databases are going away. In fact, it is likely that in larger implementations, NoSQL and SQL will function together. Just as NoSQL was designed to solve a particular use case, so do relational databases solve theirs. Relational databases excel at organizing structured data and are the standard for serving up ad-hoc analytics and business intelligence reporting. In fact, Apache Hadoop even has a separate project called Sqoop that is designed to link Hadoop with structured data stores. Most likely, those who implement NoSQL will maintain their relational databases for legacy systems and for reporting off of their NoSQL clusters.


Big Data and the Cloud

The early adopters of Big Data were small web companies that grew into much larger companies with capital budgets that could be invested in dedicated data centers. However, with the incredible increase in the amount of data generated, collected, and analyzed, smaller companies can take advantage of the cloud and off-load the hardware management to cloud vendors. Two traits that many of these NoSQL solutions have in common make them a seemingly natural fit for the cloud: one is that the nodes are distributed, and the second is that they run on commodity hardware. The cloud is designed for horizontal scaling and is often built on low-cost, commodity hardware, especially at the infrastructure-as-a-service (IaaS) layer, where customers simply need infrastructure and have the application expertise to build and configure their own Big Data application (whether it is with Hadoop, Cassandra, or any number of products).

Given what most users are trying to achieve with Big Data applications – large-scale data sets, large-scale analysis, often in real-time – performance is a key factor. Ideally, users will want a hybrid implementation that combines both virtual and dedicated servers. This gives maximum flexibility that balances the elastic, scalable nature of virtual machines with the single-tenancy of dedicated servers. Big Data projects don’t happen in a vacuum: while a NoSQL database can leverage dedicated servers, the app or web servers that present the results of the analysis to end users can easily be added to as many virtual machines as needed to meet demand. In addition, using the cloud means that users won’t need to invest in expensive equipment, pay for power and connectivity, or hire additional resources to maintain hardware. Users simply need to pay for the infrastructure that they need and have the ability to scale over time. The ability to scale up or down to match demand (and to only pay for the infrastructure that you use) is one of the values of using the cloud for Big Data.

With whatever solution that you select, you should also take into account the nature of the application and where you will want to house the processing and the output. The amount of data you collect, analyze and present will only increase over time. The advantage will go to companies that can collect and analyze this data quickly and efficiently, allowing them to react instantly to customer sentiment and to changing trends in the ever-quickening pace of business. Make sure to select the right infrastructure vendor who can match your performance criteria and has capacity to grow with you as your data and application needs increase to match the demands of your business.



For many years, companies collected data from various sources that often found its way into relational databases like Oracle and MySQL. However, the rise of the internet, Web 2.0 and, more recently, social media brought an enormous increase not only in the amount of data created but also in the types of data. No longer was data relegated to types that easily fit into standard data fields – it now came in the form of photos, geographic information, chats, Twitter feeds and emails. The age of Big Data is upon us.

A study by IDC titled “The Digital Universe Decade” projects a 45-fold increase in annual data by 2020. In 2010, the amount of digital information was 1.2 zettabytes. 1 zettabyte equals 1 trillion gigabytes. To put that in perspective, the equivalent of 1.2 zettabytes is a full-length episode of “24” running continuously for 125 million years, according to IDC. That’s a lot of data. More importantly, this data has to go somewhere, and this report projects that by 2020, more than 1/3 of all digital information created annually will either live in or pass through the cloud. With all this data being created, the challenge will be to collect, store, and analyze what it all means.
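As a rough sanity check on those figures, the following Python sketch converts 1.2 zettabytes to gigabytes and works out the video bitrate implied by the “125 million years” comparison. It assumes decimal units (1 zettabyte = 10^21 bytes, matching the “1 trillion gigabytes” statement above) and a 365.25-day year; the function names are my own.

```python
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_ZETTABYTE = 10**21
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumed average year length


def zettabytes_to_gigabytes(zettabytes):
    """Convert zettabytes to gigabytes using decimal (SI) units."""
    return zettabytes * BYTES_PER_ZETTABYTE / BYTES_PER_GIGABYTE


def implied_video_bitrate_mbps(total_zettabytes, years):
    """Bitrate (megabits per second) of a video stream that would
    consume `total_zettabytes` if it ran continuously for `years`."""
    total_bits = total_zettabytes * BYTES_PER_ZETTABYTE * 8
    return total_bits / (years * SECONDS_PER_YEAR) / 1e6


if __name__ == "__main__":
    print(f"{zettabytes_to_gigabytes(1.2):.3g} GB")
    print(f"{implied_video_bitrate_mbps(1.2, 125e6):.2f} Mbit/s")
```

Under these assumptions the comparison works out to a stream of roughly 2.4 megabits per second, which is a plausible bitrate for standard-definition video, so the analogy is at least self-consistent.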

Business intelligence (BI) systems have always had to deal with large data sets. Typically the strategy was to pull in “atomic”-level data at the lowest level of granularity, then aggregate the information into a consumable format for end users. In fact, it was preferable to have a lot of data since you could also “drill down” from the aggregation layer to get at the more detailed information, as needed.

Large Data Sets and Sampling

Coming from a data background, I find that dealing with large data sets is both a blessing and a curse. One product that I managed analyzed share of wireless numbers. The number of wireless subscribers in 2011 according to CTIA was 322.9 million and growing. While that doesn’t seem like a lot of data at first, if each wireless number was a unique identifier, there could be any number of activities associated with each number. Therefore the amount of information generated from each number could be extensive, especially as the key element was seeing changes over time. For example, after 2003, mobile subscribers in the United States were able to port their numbers from one carrier to another. This is of great importance to market research since a shift from one carrier to another would indicate churn and also impact the market share of carriers in that Metropolitan Statistical Area (MSA).

Given that it would take a significant amount of resources to poll every household in the United States, market researchers often employ a technique called sampling. This is a statistical technique where a panel is used to represent the activity of the overall population that you want to measure. This is a sound scientific technique if done correctly, but it’s not without its perils. For example, it’s often possible to get +/- 1% error at 95% confidence for a large population, but what happens once you start drilling down into more specific demographics and geographies? The risk lies not only in having enough sample (you can’t just have one subscriber represent the activity of a large group, for example) but also in ensuring that it is representative (is the subscriber that you are measuring representative of the population that you want to measure?). It’s a classic problem of using panelists: sampling errors do occur. It’s fairly difficult to be completely certain that your sample is representative unless you’ve actually measured the entire population already (using it as a baseline), but if you’ve already done that, why bother sampling?
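To illustrate why drilling down is risky, here is a small Python sketch using the textbook margin-of-error formula for a proportion. It assumes a simple random sample from a large population, the worst-case proportion p = 0.5, and the usual z-value of 1.96 for 95% confidence; the function names and the 100-panelist example are hypothetical.

```python
import math

Z_95 = 1.96  # standard normal z-value for 95% confidence


def required_sample_size(margin, p=0.5, z=Z_95):
    """Sample size needed for a given margin of error,
    using n = z^2 * p * (1 - p) / margin^2."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)


def margin_of_error(sample_size, p=0.5, z=Z_95):
    """Margin of error for a given sample size,
    using e = z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / sample_size)


if __name__ == "__main__":
    # National level: how many panelists for +/- 1% at 95% confidence?
    print(required_sample_size(0.01))
    # Drill-down: what if only 100 panelists fall into one market segment?
    print(f"{margin_of_error(100):.1%}")
```

Under these assumptions a national +/- 1% estimate needs roughly 9,600 panelists, but a segment that contains only 100 of them carries an error of close to +/- 10%, which is why drill-down numbers from a panel deserve extra caution.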


One of the most famous examples of sampling error was the 1948 election where a Gallup Poll all but declared that Thomas Dewey had defeated Harry Truman. Although Gallup used scientific sampling (as opposed to a straw poll), it was with a quota sample that proved to be a deeply flawed measurement tool. Since it relied on human intervention to choose the sample, it was inherently biased. Even with modern techniques, it is important to always take into account the margin of error and the confidence interval, which is the indication of the reliability of the measurement.

Of course, the real luxury is being able to poll the entire population. The main issue with polling the entire population lies more in the data collection (which is why the census is conducted only once a decade) than in the data analysis. Assuming that the data collection can be done, being able to analyze that large a data set quickly and efficiently would negate the need for using a sample. For example, while polling every single person in the United States is extremely expensive and difficult, collecting all the social network data regarding your brand should be fairly easy. The majority of social networks have an API, and most people who use these networks are already referencing your brand and/or posting to your content pages. The issue is less one of collection than of being able to analyze all that data in an efficient and timely manner.

As mentioned earlier, business intelligence has had to deal with this type of data problem, and it was often solved by creating increasingly powerful proprietary hardware. Teradata was one of the early pioneers of this technique, selling large and powerful equipment that was used to process large amounts of data. A more modern incarnation, Netezza (now part of IBM), claimed to pull data at “physics speed,” removing the database layer and interacting directly with the hardware to extract data as fast as it could be pulled from the spindle. It was extremely fast but still required expensive, proprietary hardware.

The Yellow Elephant

So large data sets have been around a long time. There have been attempts at trying to manage, wrangle, and tame the onslaught of data being generated from everywhere. But it was not until Jeffrey Dean and Sanjay Ghemawat of Google Labs wrote their influential paper on MapReduce in 2004 that Big Data really started to take shape. Google has had to deal with large amounts of raw data (such as crawled documents and web request logs) that needed to be analyzed in a timely manner. Creating MapReduce was their way of abstracting the compute parallelization, distribution of data, fault tolerance, and load balancing away from developers so that they could focus on expressing the computations necessary to analyze the data. This seminal paper reportedly inspired Doug Cutting to develop an open-source implementation of the MapReduce framework called “Hadoop,” which was named after his son’s toy elephant. Yahoo famously embraced this implementation after hiring Cutting in 2006. Yahoo continued to build upon this technology and first used Hadoop in production in 2008 for its search “webmap,” which was an index of all known webpages and all the metadata needed to search them.

One of the key characteristics of Hadoop was that it could run on commodity hardware and automatically distribute jobs. By its nature, it is designed to be fault tolerant so jobs are not impacted by the failure of a single node. According to an article in Wired Magazine about Yahoo’s use of Hadoop, “Hadoop could ‘map’ tasks across a cluster of machines, splitting them into tiny sub-tasks, before ‘reducing’ the results into one master calculation.” Soon after, companies like eBay and Facebook were adopting the technology and implementing it internally. Reportedly, Facebook has the largest Hadoop Cluster in the world, currently at 30 petabytes (PB).
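The map and reduce steps described in that quote can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model using the classic word-count example; it is not Hadoop’s API, and the function names are my own. In a real cluster, the map and reduce calls would run in parallel on different machines and the framework would handle the shuffle between them.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data in the cloud", "big data at scale"]))
```

Because each map call looks at only one document and each reduce call at only one key, the work can be split into many small sub-tasks, which is what lets the framework spread it across commodity machines and re-run a sub-task if a node fails.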

While early adopters of Hadoop and other Big Data technologies tended to form around Internet, social media, and ad networks, Big Data is intended to be a general-purpose tool. With most companies now integrating social media into their offerings, the amount of data created internally combined with those extracted externally will only increase. This is an indication that companies from all industries will need to start investigating how to implement Big Data technologies to make use of all this data that they are collecting and creating.

In Part 2 of this Big Data series, I discuss how Big Data and the Cloud work together.



To celebrate the release of their API, Spotify sponsored a Hack-a-thon at SPiN Ping Pong Club in New York City from Friday February 24 until Sunday February 26. Spotify was joined by big brands like Doritos, CW, McDonald’s, Showtime, State Farm and Mountain Dew. Technology companies sponsoring the event included Facebook, Twilio, FourSquare, The Echo Nest and of course, GoGrid. GoGrid provided all the cloud servers for the event to support the developers as they created brand new apps using the Spotify API in conjunction with other APIs like Facebook’s Open Graph. GoGrid’s manager of cloud ecosystem, Paul Lancaster, and I were on hand to meet with developers and provide support for the event.


50 CentOS x64 cloud servers were provisioned to the hackers by GoGrid to build their applications free of charge with root level access for maximum flexibility. Hundreds of hackers showed up to build the next great apps and were treated to live performances by Blood Orange and MNDR. While hack-a-thons tend to have attrition over time, hackers stayed throughout the night and most for the entire weekend.

Museik App

There were roughly 30 projects worked on during the weekend. They ranged from an app called Museik (UI shown above), which extracts content from the internet related to the release date of a song on Spotify, to a project called Orbidal by the students of the VCU Brandcenter, which gathers the collective feelings of your Facebook feed and creates a Spotify playlist based on that mood.


GoGrid awarded a prize to JukeSpot, an app that combined the APIs of Spotify, Facebook, and Foursquare to allow users to play their songs and playlists at any bar/club/restaurant that has the JukeSpot app running on their system. It also allows users to compete against each other for points, share points socially, and buy points from marketers or retailers.

One app called Music Monster integrated Spotify with Foursquare and The Echo Nest. This app is designed to be used by a DJ to check-in to a location, set up a playlist and start tracking audience responses in real-time. The data comes from a client app that users at the venue use to check-in and then provide real-time feedback to the DJ (indicating that they want to hear more familiar tracks or something more mellow). This particular app won for best Echo Nest hack.

Another app called Swarm.fm, built by Peter Watts, inverts the paradigm of integrating Spotify activity into Facebook. Rather, activity from Facebook (and other sources) is integrated into Spotify, so that you can see your friends’ likes of particular bands and the activity of artists that you like, all tied to your music collection. It can also find similarities between friends and generate playlists based on artists, interests and brands you have in common. This clever app won the Spotify Grand Prize and $10,000.

The Grand Prize was judged on the following criteria:

  • Overall Product Viability
  • Level of innovation
  • Depth of integration with Spotify API(s)
  • User Experience
  • Utility
  • Integration of Music

Overall, this was a well-organized, fun and productive event. Lots of great, innovative apps were built in a short period of time. Spotify and OMD provided a great atmosphere with plenty of food (a seemingly unlimited supply of Doritos and Mountain Dew), live music and support from technology partners and brand executives.


As of today, GoGrid has released multiple images of the leading software load balancer, Riverbed Stingray! The following images are available on the GoGrid Partner Exchange in both San Francisco and Amsterdam:

  • Riverbed 7.4 Simple Load Balancer 10 Mbps
  • Riverbed 8.1 Load Balancer 10 Mbps
  • Riverbed 8.1 Load Balancer 200 Mbps
  • Riverbed 8.1 Load Balancer 200 Mbps WAF

Note that the Riverbed 7.4 image is still Zeus branded. We have made it available so that users have access to the Simple Load Balancer on GoGrid. It currently supports up to 10 Mbps of bandwidth and basic load balancing. It does not have clustering, SSL decryption, health checks or any advanced load balancing features.

The Riverbed 8.1 Load Balancer 10 Mbps image supports bandwidth up to 10 Mbps, clustering and basic load balancing, but not SSL.

The Riverbed 8.1 Load Balancer 200 Mbps image supports bandwidth up to 200 Mbps, clustering and basic load balancing, but not SSL.

The Riverbed 8.1 Load Balancer 200 Mbps WAF image supports bandwidth up to 200 Mbps, clustering, SSL, load balancing, health checks and an integrated Web Application Firewall.
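If it helps to compare the four images side by side, the feature list above can be captured in a small Python sketch. This is purely an illustration for choosing an image; the class and function names are hypothetical and are not part of any GoGrid or Riverbed API. Only bandwidth, clustering, SSL and WAF support are modeled, using the values described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadBalancerImage:
    name: str
    max_mbps: int
    clustering: bool
    ssl: bool
    waf: bool


# Ordered from least to most capable, as described in the post.
IMAGES = [
    LoadBalancerImage("Riverbed 7.4 Simple Load Balancer 10 Mbps", 10, False, False, False),
    LoadBalancerImage("Riverbed 8.1 Load Balancer 10 Mbps", 10, True, False, False),
    LoadBalancerImage("Riverbed 8.1 Load Balancer 200 Mbps", 200, True, False, False),
    LoadBalancerImage("Riverbed 8.1 Load Balancer 200 Mbps WAF", 200, True, True, True),
]


def pick_image(mbps, clustering=False, ssl=False, waf=False):
    """Return the first (least capable) image meeting the requirements,
    or None if no image qualifies."""
    for image in IMAGES:
        if (
            image.max_mbps >= mbps
            and (image.clustering or not clustering)
            and (image.ssl or not ssl)
            and (image.waf or not waf)
        ):
            return image
    return None


if __name__ == "__main__":
    choice = pick_image(mbps=50, clustering=True)
    print(choice.name if choice else "No image meets these requirements")
```

For example, a site that needs clustering and about 50 Mbps of throughput lands on the 200 Mbps image, while any requirement for SSL or the Web Application Firewall points to the WAF image.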

Finding the Images


The images are available via our image selector. In order to find and launch the Riverbed images, click on “Add Cloud Server” for the Data Center that you want to use. In the “Name” field type “Riverbed” and then hit enter. This will filter for just the Riverbed images.

The charges are monthly and you will be charged after you deploy the image. There is a special promotion occurring for Amsterdam regarding deployment of the Riverbed images. Please contact your GoGrid Sales Representative for more details.

Deploying the Load Balancer


The deployment of Stingray is similar to the setup for Zeus. The main difference is the setup is now automated and the license is automatically applied. Note that these instructions ONLY apply to the Riverbed 8.1 versions. These are the basic steps.

  1. Select the Load Balancer image based on your needs. For this example, I will select “Riverbed 8.1 Load Balancer 10 Mbps”. Click “Next” and then enter a Server Name, select an IP and choose the amount of RAM – I recommend using at least 1 GB of RAM on the server. After you click “Save”, this will generate a Virtual Machine with the software pre-deployed.
  2. All the Stingray Images run on Ubuntu x64 base images. You will need to access the server via SSH using the root login. Your logins can be found in the GoGrid web portal by clicking on the server icon, then Tools > Passwords.
  3. One of the main differences with this version is that the installer starts immediately upon login and applies the appropriate license. Type “accept” at the prompt to begin the installer or press “return” to abort. If you do not accept the license terms, please delete the server.
  4. The script will configure the Load Balancer for you and generate a temporary password. The password for the Load Balancer is displayed at the end of the script output, so look for it there. Make sure to take note of it since you will need it to log in to the GUI.
  5. You will be returned to the prompt – at this point I recommend changing the server password (note that this is NOT the password for the load balancer). This is the password that you will use to access the server again via SSH. In case you have forgotten, the command to enter a new password for Ubuntu is “passwd”.

Launching the UI


Launch your favorite browser and enter the IP address of the server with the port 9090. For example, you would enter something like:

https://190.10.1.1:9090

Since you are connecting via SSL with a self-signed certificate, your browser will give you a warning message. Since this is your own server, you can bypass the message (assuming that you entered the address correctly) and set an exception for this address.

Once you have cleared the warning page, you will be presented with the Riverbed Stingray GUI. At the login screen, enter the following:

Username: admin

Password: [the password generated for you by the system in the previous step]

Update the Admin Password


Go to the tab “System”.

Select Users > Local > Admin

Change your admin password on this screen. You can also create other accounts from the User tab.

All the Stingray 8.1 licenses in GoGrid allow for clustering and passive health checks. You can configure these in the GUI. The process is the same as for the Zeus Load Balancer, so you can refer to my previous blog post, “How to Configure Zeus’ New Load Balancer in the GoGrid Cloud”, for more details. You can scroll past the SSL Certificate graphic to skip the Zeus-specific instructions and go straight to the details on how to add servers to a pool and configure the load balancer.

You can also refer to the Riverbed Quick Start Guide on our wiki.

Since this is a partner image, all support will go through Riverbed. There is extensive documentation on the Riverbed support website as well.

With four different images to choose from, you now have the flexibility to select the features and price point that work best for you. From controlling traffic to a single web server to managing a large pool of servers across multiple data centers, GoGrid with Riverbed Load Balancers offers the right, scalable solutions for your unique Cloud Fingerprint.