
It’s always difficult to be self-congratulatory, but when two distinct organizations post articles putting GoGrid in the limelight, I feel that they have to be mentioned. The two sites, CRN and OnDemand, both posted articles that we, as a company, are proud of. If you are a current customer of GoGrid, you should be proud as well, as it is a recognition of your selection process in making GoGrid your cloud partner.


The first to appear was on CRN, a website providing news, analysis and perspectives for VARs and technology integrators. In his article, author Andrew Hickey lists “The 20 Coolest Cloud Infrastructure Vendors,” and GoGrid is one of the 20.


As Hickey describes:

If you want to be in the cloud business, these are some of the cloud infrastructure companies that will help you get there. These are the providers that will host your customers’ business applications and provision them on-demand as Software-as-a-Service. They will store customers’ data in the cloud and secure it there as well. Whether customers want to use a private cloud, a public cloud or a hybrid mixture of both, these companies can help make it happen. And they can even help your customers exchange their expensive legacy hardware for a simple monthly payment plan.

(The full list, alphabetically: Amazon Web Services, AT&T, Bluelock, Cisco, Dell, Eucalyptus Systems, Gale Technologies, GoGrid, HP, Nebula, NephoScale, OpenStack, Opscode, OpSource, Rackspace, Savvis, SoftLayer, SunGard Availability Services, Terremark, and Verizon. Congrats to the other companies on the list!)


The second article of recognition comes from AlwaysOn (“Networking the Global Silicon Valley”) OnDemand. OnDemand is an event “where top Internet companies disrupt the enterprise and square off with the incumbent players pioneering cloud computing and SaaS.” This month, they posted their “2012 OnDemand 100 Top Private Companies.”


The editors describe this list as such:

This year’s 100 top, private on-demand and SaaS companies-plus 20 to watch-are creating a complex world of interconnected business intelligence, merging valuable legacy data and systems with new, vital streams of information. AlwaysOn is proud introduce the third annual OnDemand 100-the top emerging Internet companies disrupting the established enterprise, reinventing legacy data streams, and pioneering cloud computing and SaaS.

This year’s OnDemand 100 companies are true leaders, developing game-changing approaches and technologies that are pushing outside the bounds of existing markets and away from entrenched institutions. Companies were selected based on a set of five criteria: innovation, market potential, commercialization, stakeholder value, and media buzz.

(Companies in the “Cloud-Infrastructure” space include: Actifio, AlienVault, Box.net, Centrify, CloudShare, Coraid, Delphix, Dropbox, Evolve IP, GoGrid, IntelePeer, LiveOps, Mu Dynamics, Opscode, Palantir Technologies, RainStor, RingCentral, Skytap, Sonian, Spiceworks, Syncplicity, Veracode and WhiteHat Security. Great job to all those companies!)

Again, thanks for being a GoGrid customer and making us even stronger and better as a company. We look forward to providing you with industry-leading technologies and services! For those who are not yet GoGrid customers, please be sure to contact us to find out why we are a recognized leader in the cloud industry.


In Part 1 of this Big Data series, I provided a background on the origins of Big Data.

But What is Big Data?


The problem with using the term “Big Data” is that it’s used in a lot of different ways. One definition is that Big Data is any data set that is too large for on-hand data management tools. According to Martin Wattenberg, a scientist at IBM, “The real yardstick … is how it [Big Data] compares with a natural human limit, like the sum total of all the words that you’ll hear in your lifetime.” Collecting that data is a solvable problem, but making sense of it, particularly in real time, is the challenge that technology tries to solve. This new type of technology is often grouped under the label “NoSQL” and includes distributed databases that depart from relational databases like Oracle and MySQL. These systems are specifically designed to parallelize compute, distribute data, and provide fault tolerance across a large cluster of servers. Some examples of NoSQL projects and software are: Hadoop, Cassandra, MongoDB, Riak and Membase.

The techniques vary, but there is a definite distinction between SQL relational databases and their NoSQL brethren. Most notably, NoSQL systems share the following characteristics:

  • Do not use SQL as their primary query language
  • May not require fixed table schemas
  • May not give full ACID guarantees (Atomicity, Consistency, Isolation, Durability)
  • Scale horizontally
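To make these traits a little more concrete, here is a minimal Python sketch of a toy key-value store. It is purely illustrative and is not how any of the products named above are implemented: the class name `TinyKeyValueStore`, its methods, and the sample records are all hypothetical. It shows lookups by key instead of SQL, records with differing fields (no fixed schema), and keys hashed across several "nodes" as a stand-in for horizontal scaling.

```python
import hashlib


class TinyKeyValueStore:
    """Toy in-memory store; each dict stands in for a separate server."""

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # Hash the key so that records spread across the available nodes.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, document):
        # No schema check: any dictionary shape is accepted.
        self._node_for(key)[key] = document

    def get(self, key):
        # Access is by key lookup rather than by a SQL query.
        return self._node_for(key).get(key)


store = TinyKeyValueStore(node_count=3)
store.put("user:1", {"name": "Ada", "city": "London"})
store.put("user:2", {"name": "Linus", "tweets": ["hello"], "geo": [60.17, 24.94]})
print(store.get("user:2")["tweets"])
```

Note that the two records carry different fields, which a fixed relational table would not allow without a schema change. A real system would also need replication and a way to rebalance data when nodes are added, which this sketch leaves out.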

Because NoSQL systems may not give full ACID guarantees, they are used when performance and real-time results are more important than consistency. For example, if a company wants to update its website in real time based on an analysis of how a particular user interacts with the site, it will most likely turn to NoSQL to solve this use case.

However, this does not mean that relational databases are going away. In fact, it is likely that in larger implementations, NoSQL and SQL will function together. Just as NoSQL was designed to solve a particular use case, so do relational databases solve theirs. Relational databases excel at organizing structured data and are the standard for serving up ad-hoc analytics and business intelligence reporting. In fact, Apache Hadoop even has a separate project called Sqoop that is designed to link Hadoop with structured data stores. Most likely, those who implement NoSQL will maintain their relational databases for legacy systems and for reporting off of their NoSQL clusters.


Big Data and the Cloud

The early adopters of Big Data were small web companies that grew to much larger companies with capital budgets that could be invested in dedicated data centers. However, with the incredible increase in the amount of data generated, collected, and analyzed, smaller companies can take advantage of the cloud and off-load the hardware management to those vendors. Two traits that many of these NoSQL solutions have in common make them a seemingly natural fit for the cloud: One is that the nodes are distributed, and the second is that they run on commodity hardware. The cloud is designed for horizontal scaling and often built on low-cost, commodity hardware, especially at the infrastructure-as-a-service (IaaS) layer, where customers simply need infrastructure and have the application expertise to build and configure their own Big Data application (whether it is with Hadoop, Cassandra, or any number of products).

Given what most users are trying to achieve with Big Data applications – large-scale data sets, large-scale analysis, often in real-time – performance is a key factor. Ideally, users will want a hybrid implementation that combines both virtual and dedicated servers. This gives maximum flexibility that balances the elastic, scalable nature of virtual machines with the single-tenancy of dedicated servers. Big Data projects don’t happen in a vacuum: while a NoSQL database can leverage dedicated servers, the app or web servers that present the results of the analysis to end users can easily be added to as many virtual machines as needed to meet demand. In addition, using the cloud means that users won’t need to invest in expensive equipment, pay for power and connectivity, or hire additional resources to maintain hardware. Users simply need to pay for the infrastructure that they need and have the ability to scale over time. The ability to scale up or down to match demand (and to only pay for the infrastructure that you use) is one of the values of using the cloud for Big Data.

Whatever solution you select, you should also take into account the nature of the application and where you will want to house the processing and the output. The amount of data you collect, analyze and present will only increase over time. The advantage will go to companies that can collect and analyze this data quickly and efficiently, allowing them to react instantly to customer sentiment and to changing trends in the ever-quickening pace of business. Make sure to select the right infrastructure vendor who can match your performance criteria and has capacity to grow with you as your data and application needs increase to match the demands of your business.



For many years, companies collected data from various sources that often found its way to relational databases like Oracle and MySQL. However, the rise of the Internet, Web 2.0, and more recently social media brought an enormous increase not only in the amount of data created but also in the types of data. No longer was data relegated to types that easily fit into standard data fields; it now came in the form of photos, geographic information, chats, Twitter feeds and emails. The age of Big Data is upon us.

A study by IDC titled “The Digital Universe Decade” projects a 45-fold increase in annual data by 2020. In 2010, the amount of digital information was 1.2 zettabytes. 1 zettabyte equals 1 trillion gigabytes. To put that in perspective, the equivalent of 1.2 zettabytes is a full-length episode of “24” running continuously for 125 million years, according to IDC. That’s a lot of data. More importantly, this data has to go somewhere, and this report projects that by 2020, more than 1/3 of all digital information created annually will either live in or pass through the cloud. With all this data being created, the challenge will be to collect, store, and analyze what it all means.
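For readers who want to check the unit arithmetic, here is a small Python sketch of the zettabyte-to-gigabyte conversion. It assumes decimal (SI) units, where a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes; the function name is hypothetical.

```python
BYTES_PER_GIGABYTE = 10**9    # decimal (SI) gigabyte, an assumption
BYTES_PER_ZETTABYTE = 10**21  # decimal (SI) zettabyte, an assumption

GIGABYTES_PER_ZETTABYTE = BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE


def zettabytes_to_gigabytes(zettabytes):
    """Convert a size in zettabytes to gigabytes."""
    return zettabytes * GIGABYTES_PER_ZETTABYTE


# One zettabyte is a trillion (10**12) gigabytes.
print(zettabytes_to_gigabytes(1))
# The 2010 figure quoted above, expressed in gigabytes.
print(zettabytes_to_gigabytes(1.2))
```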

Business intelligence (BI) systems have always had to deal with large data sets. Typically the strategy was to pull in “atomic”-level data at the lowest level of granularity, then aggregate the information to a consumable format for end users. In fact, it was preferable to have a lot of data since you could also “drill down” from the aggregation layer to get at the more detailed information, as needed.

Large Data Sets and Sampling

Coming from a data background, I find that dealing with large data sets is both a blessing and a curse. One product that I managed analyzed share of wireless numbers. The number of wireless subscribers in 2011, according to CTIA, was 322.9 million and growing. While that doesn’t seem like a lot of data at first, if each wireless number were a unique identifier, there could be any number of activities associated with each number. Therefore the amount of information generated from each number could be extensive, especially as the key element was seeing changes over time. For example, after 2003, mobile subscribers in the United States were able to port their numbers from one carrier to another. This is of great importance to market research since a shift from one carrier to another would indicate churn and also impact the market share of carriers in that Metropolitan Statistical Area (MSA).

Given that it would take a significant amount of resources to poll every household in the United States, market researchers often employ a technique called sampling. This is a statistical technique in which a panel that represents the population is used to estimate the activity of the overall population that you want to measure. This is a sound scientific technique if done correctly, but it’s not without its perils. For example, it’s often possible to get +/- 1% error at 95% confidence for a large population, but what happens once you start drilling down into more specific demographics and geographies? The risk lies not only in having a large enough sample (you can’t just have one subscriber represent the activity of a large group, for example) but also in ensuring that it is representative (is the subscriber that you are measuring representative of the population that you want to measure?). Sampling error is a classic problem of relying on panelists. It’s fairly difficult to be completely certain that your sample is representative unless you’ve actually measured the entire population already (using it as a baseline), but if you’ve already done that, why bother sampling?
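To see why drilling down is risky, it helps to put numbers on the margin of error. The Python sketch below uses the standard formula for the margin of error of a proportion. It assumes a simple random sample from a large population and the worst-case proportion of 0.5; real panels rarely meet those assumptions exactly, and the function names are hypothetical.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level


def margin_of_error(sample_size, proportion=0.5, z=Z_95):
    """Margin of error for a proportion estimated from a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


def required_sample_size(margin, proportion=0.5, z=Z_95):
    """Smallest sample size whose margin of error is at most `margin`."""
    return math.ceil(z**2 * proportion * (1 - proportion) / margin**2)


# Panel size needed for roughly +/- 1% at 95% confidence.
panel = required_sample_size(0.01)
print(panel)

# Margin of error for the whole panel versus a subgroup that is 1% of it.
print(round(margin_of_error(panel), 4))
print(round(margin_of_error(panel // 100), 4))
```

Under these assumptions, a panel of roughly 9,600 respondents gives about +/- 1%, but a demographic or geographic slice containing only 1% of that panel has a margin of error of about +/- 10%, which is often too wide to be useful.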


One of the most famous examples of sampling error was the 1948 election where a Gallup Poll all but declared that Thomas Dewey had defeated Harry Truman. Although Gallup used scientific sampling (as opposed to a straw poll), it was with a quota sample that proved to be a deeply flawed measurement tool. Since it relied on human intervention to choose the sample, it was inherently biased. Even with modern techniques, it is important to always take into account the margin of error and the confidence interval, which is the indication of the reliability of the measurement.

Of course, the real luxury is being able to poll the entire population. The main issue with polling the entire population lies in the data collection (which is why the census is conducted only once a decade) rather than in the data analysis. Assuming that the data collection can be done, being able to analyze that large a data set quickly and efficiently would negate the need for a sample. For example, while polling every single person in the United States is extremely expensive and difficult, collecting all the social network data regarding your brand should be fairly easy. The majority of social networks have an API, and most people who use them are already referencing your brand and/or posting to your content pages. The issue is less one of collection than of being able to analyze all that data in an efficient and timely manner.

As mentioned earlier, business intelligence has had to deal with this type of data problem, and it was often solved by creating increasingly powerful proprietary hardware. Teradata was one of the early pioneers of this technique, selling large and powerful equipment that was used to process large amounts of data. A more modern incarnation, Netezza (now part of IBM), claimed to pull data at “physics speed” by removing the database layer and interacting directly with the hardware to extract data as fast as it could be pulled from the spindle. It was extremely fast but still required expensive, proprietary hardware.

The Yellow Elephant

So large data sets have been around a long time. There have been attempts at trying to manage, wrangle, and tame the onslaught of data being generated from everywhere. But it was not until Jeffrey Dean and Sanjay Ghemawat of Google wrote their influential paper on MapReduce in 2004 that Big Data really started to take shape. Google has had to deal with large amounts of raw data (such as crawled documents and web request logs) that needed to be analyzed in a timely manner. Creating MapReduce was their way of abstracting the compute parallelization, distribution of data, fault tolerance, and load balancing away from developers so that they could focus on expressing the computations necessary to analyze the data. This seminal paper reportedly inspired Doug Cutting to develop an open-source implementation of the MapReduce framework called “Hadoop,” which was named after his son’s toy elephant. Yahoo famously embraced this implementation after hiring Cutting in 2006. Yahoo continued to build upon this technology and first used Hadoop in production in 2008 for its search “webmap,” which was an index of all known webpages and all the metadata needed to search them.

One of the key characteristics of Hadoop was that it could run on commodity hardware and automatically distribute jobs. By its nature, it is designed to be fault tolerant so jobs are not impacted by the failure of a single node. According to an article in Wired Magazine about Yahoo’s use of Hadoop, “Hadoop could ‘map’ tasks across a cluster of machines, splitting them into tiny sub-tasks, before ‘reducing’ the results into one master calculation.” Soon after, companies like eBay and Facebook were adopting the technology and implementing it internally. Reportedly, Facebook has the largest Hadoop Cluster in the world, currently at 30 petabytes (PB).
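The map-then-reduce pattern described above can be sketched in a few lines of plain Python. This is a single-process illustration of the idea, using the classic word-count example; it does not use Hadoop's actual API, and the function names are hypothetical. On a real cluster, each input split would be mapped on a different machine, and the framework would handle the grouping step and any node failures.

```python
from collections import defaultdict


def map_phase(document):
    # Emit a (word, 1) pair for every word in one input split.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Group all emitted values by key, as the framework would between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Combine the values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    mapped = []
    for document in documents:  # each document stands in for one split
        mapped.extend(map_phase(document))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data in the cloud"])
print(counts["big"], counts["data"])
```

Because each call to `map_phase` depends only on its own split, and each call to `reduce_phase` only on its own key, both phases can be spread across many machines, which is what makes the model a good fit for commodity clusters.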

While early adopters of Hadoop and other Big Data technologies tended to be Internet, social media, and ad network companies, Big Data is intended to be a general-purpose tool. With most companies now integrating social media into their offerings, the amount of data created internally, combined with that extracted externally, will only increase. This is an indication that companies from all industries will need to start investigating how to implement Big Data technologies to make use of all the data that they are collecting and creating.

In Part 2 of this Big Data series, I discuss how Big Data and the Cloud work together.


I recently celebrated my 5-year anniversary with GoGrid! That’s me holding the service award we hand out to employees who celebrate 3 and 5-year anniversaries. It was 5 years ago that I implemented the idea of handing out these gems, gems that GoGrid employees proudly display at their desks.

jeryn5year

I remember my first interview with what was then ServePath, in a very small office down the street from our current corporate office. The office was so small that my interview was conducted in the break area/lobby, surrounded by cubes. 5 years later, GoGrid’s corporate office occupies the second floor in the Hills Plaza complex that overlooks the Bay Bridge. Talk about an upgrade!

image

Image source: www.noehill.com

My first tour of our San Francisco Data Center was amazing. Back then the data center shared the floor with cubes and offices, housing our small engineering and support teams. Now the data center is so big, with only servers and more servers lining pretty much every inch of our space, that I get lost on the occasional tours I give to my vendors and employees.

Although I have witnessed many changes over the last 5 years, what hasn’t changed is the people who work for GoGrid. We are still that group of people I saw on the Craigslist posting for my job 5 years ago, but better. We still work hard and like to have fun. This is what makes GoGrid a great place to work and why there are quite a few of us who have been here for a very long time.

So now that I have been inducted into the unofficial (still working on internal marketing) “Old School” club, I join the ranks of the 14% of the company who have been here for 5 years or more, and another 12% will hit their 5 years sometime this year! This means that, as an Old School member, I get one week to Take a Break this year. That’s GoGrid’s way of saying thanks for your continued years of service, and it’s in addition to the 25 other days I get for PTO. Thoughts on where I should vacation would be appreciated!

I’m looking forward to more years to come with GoGrid!


In January, we announced the opening of GoGrid’s new EMEA Headquarters and deployment of our cloud infrastructure in Equinix’s International Business Exchange (IBX) data center in Amsterdam, The Netherlands. To mark this collaboration, we’re pleased to announce an upcoming event jointly hosted with our friends at Equinix. The “Heads in the Cloud” happy hour will take place Thursday, 15 March, 2012, from 16:30–19:00 at De Goudfazant, Amsterdam (Noord). If you’re in the area, we’d be delighted to have you join us for the celebration.

The highlight of the evening will definitely be the comedy of Greg Shapiro, our guest speaker. If you haven’t heard Shapiro perform, check out his website and learn how this Chicago native ended up in Amsterdam and why his humor helps make everyday business bearable. We expect it to be a terrific event, so make sure to block the date out in your calendar.


After Shapiro’s performance, relax over drinks and chat with your peers (partners, suppliers, and customers) as well as the GoGrid and Equinix teams. We’ll be happy to answer your questions about our plans for the region moving forward. And for those who’ve never visited De Goudfazant before, “Heads in the Cloud” provides the perfect complement to the stylish waterfront location and unique atmosphere.

If you’re interested in attending, we encourage you to register ahead of time via email to Geesje.Duis-Bakker[AT]eu.equinix.com. We look forward to seeing you on the 15th!