
Archive for the ‘Industry’ Category

In Part 1 of this Big Data series, I provided a background on the origins of Big Data.

But What is Big Data?


The problem with using the term “Big Data” is that it’s used in a lot of different ways. One definition is that Big Data is any data set that is too large for on-hand data management tools. According to Martin Wattenberg, a scientist at IBM, “The real yardstick … is how it [Big Data] compares with a natural human limit, like the sum total of all the words that you’ll hear in your lifetime.” Collecting that data is a solvable problem, but making sense of it (particularly in real time) is the challenge that technology tries to solve. This new type of technology is often listed under the title of “NoSQL” and includes distributed databases that are a departure from relational databases like Oracle and MySQL. These systems are specifically designed to parallelize computation, distribute data, and provide fault tolerance on a large cluster of servers. Some examples of NoSQL projects and software are Hadoop, Cassandra, MongoDB, Riak and Membase.

The techniques vary, but there is a definite distinction between SQL relational databases and their NoSQL brethren. Most notably, NoSQL systems share the following characteristics:

  • Do not use SQL as their primary query language
  • May not require fixed table schemas
  • May not give full ACID guarantees (Atomicity, Consistency, Isolation, Durability)
  • Scale horizontally
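To make the schema and horizontal-scaling points concrete, here is a minimal Python sketch that is not tied to any particular NoSQL product. The `shard_for` helper, the node names, and the sample records are all hypothetical, and the simple modulo partitioning shown here is an assumption made for brevity; production systems such as Cassandra and Riak use more sophisticated schemes like consistent hashing.

```python
import hashlib

# Hypothetical cluster of three nodes; the names are made up for illustration.
NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str, nodes=NODES) -> str:
    """Pick the node that owns a key by hashing it (simple modulo partitioning)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Records need not share a fixed schema: each is just a key plus a document.
records = {
    "user:1": {"name": "Ana", "email": "ana@example.com"},
    "user:2": {"name": "Raj", "last_checkin": {"lat": 37.35, "lon": -121.95}},
    "user:3": {"name": "Mei", "tags": ["music", "cloud"]},
}

# Distribute the documents across the nodes.
cluster = {node: {} for node in NODES}
for key, doc in records.items():
    cluster[shard_for(key)][key] = doc

def get(key: str):
    """Look up a document by going straight to the node that owns its key."""
    return cluster[shard_for(key)].get(key)
```

Adding capacity in this model means adding entries to `NODES` rather than buying a bigger server, which is what scaling horizontally refers to. Note that with plain modulo partitioning most keys would move when the node count changes, which is one reason real systems prefer consistent hashing.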

Because NoSQL systems may not offer full ACID guarantees, they are typically used when performance and real-time results are more important than strict consistency. For example, if a company wants to update its website in real time based on an analysis of how a particular user interacts with the site, it will most likely turn to NoSQL for this use case.

However, this does not mean that relational databases are going away. In fact, it is likely that in larger implementations, NoSQL and SQL will function together. Just as NoSQL was designed to solve a particular use case, so do relational databases solve theirs. Relational databases excel at organizing structured data and are the standard for serving up ad-hoc analytics and business intelligence reporting. In fact, Apache Hadoop even has a separate project called Sqoop that is designed to link Hadoop with structured data stores. Most likely, those who implement NoSQL will maintain their relational databases for legacy systems and for reporting off of their NoSQL clusters.


Big Data and the Cloud

The early adopters of Big Data were small web companies that grew to much larger companies with capital budgets that could be invested into dedicated data centers. However, with the incredible increase in the amount of data generated, collected, and analyzed, smaller companies can take advantage of the cloud and off-load the hardware management to those vendors. Two traits that many of these NoSQL solutions have in common make them a seemingly natural fit for the cloud: one is that the nodes are distributed, and the second is that they run on commodity hardware. The cloud is designed for horizontal scaling and is often built on low-cost, commodity hardware, especially at the infrastructure-as-a-service (IaaS) layer, where customers simply need infrastructure and have the application expertise to build and configure their own Big Data application (whether it is with Hadoop, Cassandra, or any number of products).

Given what most users are trying to achieve with Big Data applications – large-scale data sets, large-scale analysis, often in real-time – performance is a key factor. Ideally, users will want a hybrid implementation that combines both virtual and dedicated servers. This gives maximum flexibility that balances the elastic, scalable nature of virtual machines with the single-tenancy of dedicated servers. Big Data projects don’t happen in a vacuum: while a NoSQL database can leverage dedicated servers, the app or web servers that present the results of the analysis to end users can easily be added to as many virtual machines as needed to meet demand. In addition, using the cloud means that users won’t need to invest in expensive equipment, pay for power and connectivity, or hire additional resources to maintain hardware. Users simply need to pay for the infrastructure that they need and have the ability to scale over time. The ability to scale up or down to match demand (and to only pay for the infrastructure that you use) is one of the values of using the cloud for Big Data.

With whatever solution that you select, you should also take into account the nature of the application and where you will want to house the processing and the output. The amount of data you collect, analyze and present will only increase over time. The advantage will go to companies that can collect and analyze this data quickly and efficiently, allowing them to react instantly to customer sentiment and to changing trends in the ever-quickening pace of business. Make sure to select the right infrastructure vendor who can match your performance criteria and has capacity to grow with you as your data and application needs increase to match the demands of your business.



For many years, companies collected data from various sources that often found its way to relational databases like Oracle and MySQL. However, the rise of the internet, Web 2.0, and more recently social media brought an enormous increase not only in the amount of data created but also in the types of data. No longer was data relegated to types that easily fit into standard data fields – it now came in the form of photos, geographic information, chats, Twitter feeds and emails. The age of Big Data is upon us.

A study by IDC titled “The Digital Universe Decade” projects a 45-fold increase in annual data by 2020. In 2010, the amount of digital information was 1.2 zettabytes. 1 zettabyte equals 1 trillion gigabytes. To put that in perspective, the equivalent of 1.2 zettabytes is a full-length episode of “24” running continuously for 125 million years, according to IDC. That’s a lot of data. More importantly, this data has to go somewhere, and this report projects that by 2020, more than 1/3 of all digital information created annually will either live in or pass through the cloud. With all this data being created, the challenge will be to collect, store, and analyze what it all means.
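The unit conversion in the paragraph above is easy to check. The short sketch below assumes decimal (SI) units, where 1 GB is 10^9 bytes and 1 ZB is 10^21 bytes; the variable names are illustrative only.

```python
# Decimal (SI) units: 1 GB = 10**9 bytes, 1 ZB = 10**21 bytes.
GIGABYTE = 10**9
ZETTABYTE = 10**21

# Gigabytes in one zettabyte.
gb_per_zb = ZETTABYTE // GIGABYTE

# The 2010 figure of 1.2 ZB expressed in gigabytes (integer math avoids rounding).
digital_universe_2010_gb = 12 * ZETTABYTE // 10 // GIGABYTE

print(f"{gb_per_zb:,} GB per ZB")
print(f"{digital_universe_2010_gb:,} GB in 1.2 ZB")
```

So 1 zettabyte is indeed one trillion gigabytes, and the 2010 estimate works out to 1.2 trillion gigabytes.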

Business intelligence (BI) systems have always had to deal with large data sets. Typically the strategy was to pull in “atomic” -level data at the lowest level of granularity, then aggregate the information to a consumable format for end users. In fact, it was preferable to have a lot of data since you could also “drill-down” from the aggregation layer to get at the more detailed information, as needed.
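As a rough illustration of the aggregate-then-drill-down pattern described above, the following Python sketch rolls a handful of atomic-level rows up to a summary and then drills back down. The data, the field names, and the `roll_up` and `drill_down` helpers are all hypothetical; a real BI system would do this inside a database or OLAP engine.

```python
from collections import defaultdict

# Hypothetical atomic-level rows: one row per (market, carrier) observation.
atomic_rows = [
    {"market": "San Francisco", "carrier": "A", "subscribers": 120},
    {"market": "San Francisco", "carrier": "B", "subscribers": 80},
    {"market": "New York", "carrier": "A", "subscribers": 200},
    {"market": "New York", "carrier": "B", "subscribers": 250},
]

def roll_up(rows, dimension, measure="subscribers"):
    """Aggregate the measure to one total per value of the chosen dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)

def drill_down(rows, dimension, value):
    """Return the atomic rows that sit behind one aggregated value."""
    return [row for row in rows if row[dimension] == value]

by_market = roll_up(atomic_rows, "market")
sf_detail = drill_down(atomic_rows, "market", "San Francisco")
```

Keeping the atomic rows around is what makes the drill-down possible; if only `by_market` were stored, the per-carrier detail would be lost.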

Large Data Sets and Sampling

Coming from a data background, I find that dealing with large data sets is both a blessing and a curse. One product that I managed analyzed share of wireless numbers. The number of wireless subscribers in 2011 according to CTIA was 322.9 million and growing. While that doesn’t seem like a lot of data at first, if each wireless number was a unique identifier, there could be any number of activities associated with each number. Therefore the amount of information generated from each number could be extensive, especially as the key element was seeing changes over time. For example, after 2003, mobile subscribers in the United States were able to port their numbers from one carrier to another. This is of great importance to market research since a shift from one carrier to another would indicate churn and also impact the market share of carriers in that Metropolitan Statistical Area (MSA).

Given that it would take a significant amount of resources to poll every household in the United States, market researchers often employ a technique called sampling. This is a statistical technique in which a panel that represents the population is used to estimate the activity of the overall population that you want to measure. This is a sound scientific technique if done correctly, but it’s not without its perils. For example, it’s often possible to get +/- 1% error at 95% confidence for a large population, but what happens once you start drilling down into more specific demographics and geographies? The risk lies not only in having enough sample (you can’t just have one subscriber represent the activity of a large group, for example) but also in ensuring that it is representative (is the subscriber that you are measuring representative of the population that you want to measure?). Sampling error is a classic problem with panel-based measurement. It’s fairly difficult to be completely certain that your sample is representative unless you’ve actually measured the entire population already (using it as a baseline), but if you’ve already done that, why bother sampling?
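The drill-down problem can be put in numbers. The sketch below uses the textbook margin-of-error formula for a proportion, and it rests on several assumptions that are not in the original discussion: a simple random sample, the normal approximation, the worst-case proportion p = 0.5, and a population large enough that no finite-population correction is needed. The function names are illustrative.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level (normal approximation)

def margin_of_error(n: int, p: float = 0.5, z: float = Z_95) -> float:
    """Margin of error for a proportion estimated from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

def sample_size(moe: float, p: float = 0.5, z: float = Z_95) -> int:
    """Smallest sample size that achieves the requested margin of error."""
    raw = z**2 * p * (1 - p) / moe**2
    # Round first to absorb floating-point noise, then round up to a whole person.
    return math.ceil(round(raw, 6))

overall_n = sample_size(0.01)        # panel size needed for +/- 1% overall
subgroup_moe = margin_of_error(100)  # error for a 100-person slice of the panel
```

Under these assumptions, about 9,604 panelists are needed for +/- 1% at 95% confidence, but a drill-down that leaves only 100 panelists in a segment has a margin of error of roughly +/- 9.8%. None of this addresses the second risk, representativeness, which no sample-size formula can fix.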


One of the most famous examples of sampling error was the 1948 election where a Gallup Poll all but declared that Thomas Dewey had defeated Harry Truman. Although Gallup used scientific sampling (as opposed to a straw poll), it was with a quota sample that proved to be a deeply flawed measurement tool. Since it relied on human intervention to choose the sample, it was inherently biased. Even with modern techniques, it is important to always take into account the margin of error and the confidence interval, which is the indication of the reliability of the measurement.

Of course, the real luxury is the ability to poll the entire population. The main issue with polling the entire population lies in the data collection (which is why the census is only conducted once a decade) rather than in the data analysis. Assuming that the data collection can be done, being able to analyze that large a data set quickly and efficiently would negate the need for using a sample. For example, while polling every single person in the United States is extremely expensive and difficult, collecting all the social network data regarding your brand should be fairly easy. The majority of social networks have an API, and most people who use them are already referencing your brand and/or posting to your content pages. The issue is less one of collection than of being able to analyze all that data in an efficient and timely manner.

As mentioned earlier, business intelligence has had to deal with this type of data problem and it was often solved by creating increasingly powerful proprietary hardware. Teradata was one of the early pioneers of this technique, selling large and powerful equipment that was used to process large amounts of data. A more modern incarnation, Netezza (now part of IBM), claimed to pull data at “physics speed,” which removes the database layer and interacts directly with the hardware to extract data as fast as data could be pulled from the spindle. It’s extremely fast, but still required expensive, proprietary hardware.

The Yellow Elephant

So large data sets have been around a long time. There have been attempts at trying to manage, wrangle, and tame the onslaught of data being generated from everywhere. But it was not until Jeffrey Dean and Sanjay Ghemawat of Google Labs published their influential paper on MapReduce in 2004 that Big Data really started to take shape. Google has had to deal with large amounts of raw data (such as crawled documents and web request logs) that needed to be analyzed in a timely manner. Creating MapReduce was their way of abstracting the compute parallelization, distribution of data, fault tolerance, and load balancing away from the developers so that they could focus on expressing the computations necessary to analyze the data. This seminal paper reportedly inspired Doug Cutting to develop an open-source implementation of the MapReduce framework called “Hadoop,” which was named after his son’s toy elephant. Yahoo famously embraced this implementation after hiring Cutting in 2006. Yahoo continued to build upon this technology and first used Hadoop in production in 2008 for its search “webmap,” which was an index of all known webpages and all the metadata needed to search them.

One of the key characteristics of Hadoop was that it could run on commodity hardware and automatically distribute jobs. By its nature, it is designed to be fault tolerant so jobs are not impacted by the failure of a single node. According to an article in Wired Magazine about Yahoo’s use of Hadoop, “Hadoop could ‘map’ tasks across a cluster of machines, splitting them into tiny sub-tasks, before ‘reducing’ the results into one master calculation.” Soon after, companies like eBay and Facebook were adopting the technology and implementing it internally. Reportedly, Facebook has the largest Hadoop Cluster in the world, currently at 30 petabytes (PB).
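The “map” and “reduce” steps in the Wired quote can be sketched with the classic word-count example. This is a single-process Python illustration of the programming model only: the function names are hypothetical and are not Hadoop’s API, and in a real cluster the framework would run the map and reduce tasks on different machines and handle distribution and failures itself.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map over every document, group by word, then reduce each group."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

docs = ["big data big cluster", "data on a cluster"]
counts = word_count(docs)
```

Because each call to `map_phase` looks at only one document and each call to `reduce_phase` looks at only one key, the work can be split across as many machines as are available, which is what makes the model a good fit for clusters of commodity hardware.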

While early adopters of Hadoop and other Big Data technologies tended to form around Internet, social media, and ad networks, Big Data is intended to be a general-purpose tool. With most companies now integrating social media into their offerings, the amount of data created internally combined with those extracted externally will only increase. This is an indication that companies from all industries will need to start investigating how to implement Big Data technologies to make use of all this data that they are collecting and creating.

In Part 2 of this Big Data series, I discuss how Big Data and the Cloud work together.


From February 13-16, 2012, in Santa Clara, CA, GoGrid sponsored Cloud Connect 2012, an expo devoted to educating professionals seeking to learn more about the benefits of Cloud Computing. We have been a longtime sponsor of this show and each year it seems to get better, not only in the caliber of content being presented, but also in the level of cloud computing expertise that attendees bring.


As I attend many of these conferences as a sponsor, exhibitor and interested party, I have seen a great evolution not only of knowledge and education but also in the cloud services being presented by various companies at the show. A few years ago, it was all about “what is cloud” and how do we define it. The past few years have allowed us to fine-tune the definition and move beyond it to rolling up our sleeves and implementing cloud solutions. I’m definitely encouraged by the progress of companies with their cloud innovations and by the individuals looking to capitalize on this influx of knowledge.


Talking with customers and prospects looking for or implementing cloud infrastructure solutions gives an insight into what is working in the cloud and what people are really looking for. For example, a few years ago, we introduced the concept of Hybrid Hosting – the ability to mix and match virtual and physical servers within the same architecture, all managed through a single pane of glass, so to speak. In fact, many of our recent Case Studies show that hybrid environments are really the reason why these companies turned to GoGrid for their cloud solution.

GoGrid Customer Presentation – Microgroove

At the show, one GoGrid customer, Microgroove, presented their decision-making process and implementation strategies on moving their music platform to our cloud. It was an interesting journey for them, as they first ran tests on other clouds that only provided a completely virtualized environment, which simply did not meet their requirements. In the coming days, we will post a video of their presentation, but if you are interested in reading about their case study, you can download it here.

We have compiled a list of new GoGrid customer success stories that you may find interesting. These stories can be downloaded here.

The full PowerPoint presentation can be viewed below:

A Hybrid Hosting Primer

David Michael, a Solutions Architect here at GoGrid, also gave a presentation on the benefits and use cases of our hybrid hosting model in the Demo Theater of the Cloud Connect show. To an attentive and interested audience, David walked through the things to consider and the advantages that companies can gain from a hybrid hosting scenario.


The full presentation is shown below.

Listening to You

While presenting success stories and best practices is always helpful, it’s also important to listen to and answer questions about GoGrid and the services that we provide. That is where I personally find it important to be an active exhibitor, staffing our booth. A while back, we developed this concept of the “Cloud Fingerprint” – essentially that your business and infrastructure needs are unique and the cloud partner you choose should be adaptable and flexible enough to meet those needs. We believe that it is important for not just the sales and marketing teams to be at these shows, but also representatives from all departments at GoGrid. On-site, we had executives from Engineering, Support and even Human Resources (yes, we are hiring) gathering their own perspectives of the show and what people are looking for in a cloud provider.

“Cloudy with a Chance of Cocktails”

Lastly, GoGrid also sponsored a party at Cloud Connect 2012. With everyone’s mind swirling with cloudy thoughts, it was important to relax, be social and have a gathering place to literally blow off some steam. Below are a few pictures from the Party (also be sure to view the Cloud Connect Flickr set that has more pictures from the party – towards the end of the set).

IMG_3278 IMG_3283

IMG_3268 IMG_3277

IMG_3272 IMG_3273

If you did attend Cloud Connect 2012, we hope that you found it educational and rewarding and that it allowed you to build up your set of cloud resources and tools. If you didn’t, we hope that you can attend next year.



To celebrate the release of their API, Spotify sponsored a Hack-a-thon at SPiN Ping Pong Club in New York City from Friday, February 24 until Sunday, February 26. Spotify was joined by big brands like Doritos, CW, McDonald’s, Showtime, State Farm and Mountain Dew. Technology companies sponsoring the event included Facebook, Twilio, FourSquare, The Echo Nest and, of course, GoGrid. GoGrid provided all the cloud servers for the event to support the developers as they created brand new apps using the Spotify API in conjunction with other APIs like Facebook’s Open Graph. GoGrid’s manager of cloud ecosystem, Paul Lancaster, and I were on hand to meet with developers and provide support for the event.


50 CentOS x64 cloud servers were provisioned to the hackers by GoGrid to build their applications free of charge with root level access for maximum flexibility. Hundreds of hackers showed up to build the next great apps and were treated to live performances by Blood Orange and MNDR. While hack-a-thons tend to have attrition over time, hackers stayed throughout the night and most for the entire weekend.

Museik App

There were roughly 30 projects worked on during the weekend which ranged from an app called Museik (UI shown above) that extracts content from the internet related to the release date of a song on Spotify to a project called Orbidal by the students of the VCU Brandcenter that gathers the collective feelings of your Facebook feed and creates a playlist based on that mood on Spotify.


GoGrid awarded a prize to JukeSpot, an app that combined the APIs of Spotify, Facebook, and Foursquare to allow users to play their songs and playlists at any bar/club/restaurant that has the JukeSpot app running on their system. It also allows users to compete against each other for points, share points socially, and buy points from marketers or retailers.

One app called Music Monster integrated Spotify with Foursquare and The Echo Nest. This app is designed to be used by a DJ to check-in to a location, set up a playlist and start tracking audience responses in real-time. The data comes from a client app that users at the venue use to check-in and then provide real-time feedback to the DJ (indicating that they want to hear more familiar tracks or something more mellow). This particular app won for best Echo Nest hack.

Another app called Swarm.fm, built by Peter Watts, inverts the paradigm of integrating Spotify activity into Facebook. Instead, activity from Facebook (and other sources) is integrated into Spotify, so that you can see your friends’ likes of particular bands and the activity of artists that you like, all tied to your music collection. It can also find similarities between friends and can generate playlists based on artists, interests and brands you have in common. This clever app won the Spotify Grand Prize and $10,000.

The Grand Prize was judged on the following criteria:

  • Overall Product Viability
  • Level of innovation
  • Depth of integration with Spotify API(s)
  • User Experience
  • Utility
  • Integration of Music

Overall, this was a well-organized, fun and productive event. Lots of great, innovative apps were built in a short period of time. Spotify and OMD provided a great atmosphere with plenty of food (a seemingly unlimited supply of Doritos and Mountain Dew), live music and support from technology partners and brand executives.


GoGrid is one of the Platinum Sponsors of this week’s Cloud Connect 2012 conference and Expo at the Santa Clara Convention Center. The event promises to be a memorable one for cloud newcomers as well as those of us trying to keep up with the blazing pace of cloud innovation.


This year, we’re particularly excited to be focusing on GoGrid’s hybrid infrastructure solution, which we think combines the best of both the physical and virtual worlds. We believe that your company is unique, and your infrastructure should be, too. Stop by our booth 709 to find out what your unique “cloud fingerprint” looks like. Chances are it’s a flavor of our hybrid solution.


Presentations

Maybe you’re wondering whether to keep your dedicated servers or move to the cloud. What if you could have it all? Join one of our solutions architects as he walks through real-life examples of how hybrid hosting can improve your business’s infrastructure: Tuesday, Feb. 14, 3:35 – 3:55pm in the Cloud Solutions Theater on the Expo Floor. Here’s the presentation description: “Different businesses have different infrastructure needs. And the choices of clouds, colocation, or dedicated services can be daunting if not confusing. So why choose just one when GoGrid’s hybrid architecture (a union of the best of virtual and physical) provides options for both flexibility and growth? Physical hardware provides guaranteed, dedicated, high performance coupled with an assurance of strict data control and security, while cloud architecture scales when your business demands it. Learn the secrets of hybrid hosting and how it can improve your business’s infrastructure in this 20-minute walk-through.”

Then on Wednesday, Feb. 15, 1:15 – 2:15pm (Grand Ballroom H), learn how Microgroove is creating a high-performance, cost-effective cloud environment for the music industry with GoGrid. The title says it all: “Performance Matters, Especially in the Music Industry – Global Hybrid Infrastructure Makes Artists Sing.” Here’s an overview: Learn first hand how Microgroove leveraged physical and virtual infrastructure components in creating a high-performance, cost-effective cloud environment for the music industry. One that easily supported their need for cloud scalability coupled with the permanence and single-tenancy of dedicated servers — a hybrid solution not found in commodity clouds. Microgroove’s technology platform running on GoGrid is powering hundreds of popular artists’ sites from Snoop Dogg to Yani as well as an eCommerce site of 1.5 million+ SKUs.

We’ll also be holding several short presentations at the GoGrid booth. Topics include:

Be sure to drop by for these quick primers on cloud computing.

GoGrid-Sponsored Reception


And don’t forget to drop by “Cloudy with a Chance of Cocktails,” the GoGrid party on Wednesday from 6 – 8pm in the Hyatt Mezzanine. Register now and beat the rush! This party is open to all Cloud Connect attendees, but space is limited, so act now.

Everybody Loves Free Stuff!

Lastly, Cloud Connect attendees can participate in the GoGrid Cloud Scavenger Hunt, where one lucky person will win an iPad 2. Stop by the GoGrid booth (709) for details. We’ll also be giving away various GoGrid goodies, including a Mystic Pyramid to help predict your future – a perfect complement to our Cloud Pyramid!


Register for the Expo

To make sure you don’t miss out, we’d like you to be our guest at the Expo; just complete the online registration form and mark your calendar. Hope to see you there!