Posts Tagged ‘Database’

 

The Big Data Revolution – Part 1 – The Origins

Tuesday, March 20th, 2012 by


For many years, companies collected data from various sources that often found its way into relational databases like Oracle and MySQL. However, the rise of the internet, Web 2.0, and more recently social media brought not only an enormous increase in the amount of data created, but also in the types of data. No longer was data relegated to types that fit easily into standard data fields – it now came in the form of photos, geographic information, chats, Twitter feeds, and emails. The age of Big Data is upon us.

A study by IDC titled “The Digital Universe Decade” projects a 45-fold increase in annual data by 2020. In 2010, the amount of digital information created was 1.2 zettabytes; 1 zettabyte equals 1 trillion gigabytes. To put that in perspective, 1.2 zettabytes is the equivalent of a full-length episode of “24” running continuously for 125 million years, according to IDC. That’s a lot of data. More importantly, this data has to go somewhere, and the report projects that by 2020, more than one-third of all digital information created annually will either live in or pass through the cloud. With all of this data being created, the challenge is to collect and store it, and then analyze what it all means.

Business intelligence (BI) systems have always had to deal with large data sets. Typically the strategy was to pull in “atomic”-level data at the lowest level of granularity, then aggregate that information into a consumable format for end users. In fact, it was preferable to have a lot of data, since you could also “drill down” from the aggregation layer to get at the more detailed information as needed.
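As a rough illustration of that pattern (not tied to any particular BI product; the table and column names here are made up), a roll-up from atomic records to an aggregate, with a drill-down back to the detail, might look like this in Python with pandas:

```python
# Illustrative only: atomic-level records rolled up into an aggregate that
# end users consume, with a drill-down back to the underlying detail.
import pandas as pd

# Atomic-level data: one row per individual transaction (hypothetical)
atomic = pd.DataFrame({
    "region":  ["West", "West", "West", "East", "East"],
    "store":   ["SF-01", "SF-01", "SJ-02", "NY-01", "NY-02"],
    "revenue": [120.0, 80.0, 200.0, 150.0, 95.0],
})

# Aggregation layer: the consumable summary shown in reports
summary = (atomic.groupby("region", as_index=False)
                 .agg(stores=("store", "nunique"),
                      total_revenue=("revenue", "sum")))
print(summary)

# Drill-down: from the "West" summary row back to the atomic records
print(atomic[atomic["region"] == "West"])
```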

Large Data Sets and Sampling

Coming from a data background, I find that dealing with large data sets is both a blessing and a curse. One product that I managed analyzed market share based on wireless numbers. According to CTIA, the number of wireless subscribers in 2011 was 322.9 million and growing. While that doesn’t seem like a lot of data at first, if each wireless number is a unique identifier, there can be any number of activities associated with each number. The amount of information generated from each number can therefore be extensive, especially since the key element was seeing changes over time. For example, after 2003, mobile subscribers in the United States were able to port their numbers from one carrier to another. This is of great importance to market research, since a shift from one carrier to another indicates churn and also affects the market share of carriers in that Metropolitan Statistical Area (MSA).
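As a toy sketch of that kind of analysis (all numbers and carrier names below are hypothetical, and this is not the actual product), detecting churn from two snapshots of number-to-carrier assignments and recomputing share might look like:

```python
# Toy sketch: inferring churn and carrier share shifts from two snapshots of
# (wireless number -> carrier) within one MSA. All data here is hypothetical.
from collections import Counter

q1 = {"415-555-0001": "CarrierA", "415-555-0002": "CarrierA",
      "415-555-0003": "CarrierB"}
q2 = {"415-555-0001": "CarrierA", "415-555-0002": "CarrierB",  # ported number
      "415-555-0003": "CarrierB"}

# A number present in both snapshots with a different carrier has churned
churned = {n for n in q1.keys() & q2.keys() if q1[n] != q2[n]}
print("churned numbers:", churned)

# Carrier share in this (tiny) MSA before and after the porting event
print("Q1 share:", Counter(q1.values()))
print("Q2 share:", Counter(q2.values()))
```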

Given that it would take a significant amount of resources to poll every household in the United States, market researchers often employ a technique called sampling: a statistical technique in which a panel that represents the population is used to stand in for the activity of the overall population you want to measure. This is a sound scientific technique if done correctly, but it’s not without its perils. For example, it’s often possible to achieve a +/- 1% margin of error at 95% confidence for a large population, but what happens once you start drilling down into more specific demographics and geographies? The risk is not only having a large enough sample (you can’t have one subscriber represent the activity of a large group, for example) but also ensuring that the sample is representative (are the subscribers you are measuring representative of the population you want to measure?). Sampling errors are a classic problem with panels. It’s difficult to be completely certain that your sample is representative unless you’ve already measured the entire population (using it as a baseline), but if you’ve already done that, why bother sampling?
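To make the numbers concrete, here is a back-of-the-envelope calculation using the standard normal approximation for a proportion (nothing specific to the wireless product above): the sample size needed for a +/- 1% margin of error at 95% confidence, and the error you are left with once a drill-down shrinks the effective sample.

```python
# Back-of-the-envelope sampling math (normal approximation for a proportion;
# worst case p = 0.5). Illustrative only.
import math

Z_95 = 1.96  # z-score for 95% confidence

def sample_size(margin_of_error, p=0.5, z=Z_95):
    """Sample size needed to hit a given margin of error on a proportion."""
    return math.ceil((z ** 2) * p * (1 - p) / margin_of_error ** 2)

def margin_of_error(n, p=0.5, z=Z_95):
    """Margin of error you actually get from a sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

print(sample_size(0.01))               # ~9,604 panelists for +/-1% overall
# Drill down to a demographic covered by only 100 of those panelists:
print(round(margin_of_error(100), 3))  # ~0.098, i.e. +/-9.8% -- far less precise
```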

(more…) «The Big Data Revolution – Part 1 – The Origins»

How To Optimize Your Database Backups and Text File Compression with pbzip2 and pigz

Thursday, February 9th, 2012 by

Recently, GoGrid was examining performance enhancements for several internal processes; among these enhancements was switching from standard gzip to “pigz”. Since I had never heard of pigz, I was intrigued by this “parallel” implementation of gzip, meaning it uses all available CPUs/cores, unlike gzip. This prompted me to ask, “I wonder if there is a parallel implementation of bzip2 as well?”, and thus began my endeavor.

pigz and pbzip2 are multi-threaded (SMP) implementations of gzip and bzip2, respectively. Both are actively maintained and are fully compatible with existing gzip and bzip2 archives.

If you’re like me, you might have stayed away from gzip or bzip2 for large files because they are single-threaded. If I try to compress a, let’s say, 2GB file, the system becomes rather sluggish; the reason being that the “compression tool of choice” saturates a single core of today’s multi-core, multi-CPU systems, leaving the other cores idle and the load unevenly distributed, so the hardware is used very inefficiently.

In this example I have a .tar file containing several databases, totaling 1.3GB. The system in question is a GoGrid dedicated server with 8 cores; it is a production database server running at a load of around 1.

Using bzip2, the file took approximately 6 minutes and 30 seconds to compress. Yikes!
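For comparison, here is a rough sketch of the same step using pbzip2 instead of bzip2. The -p flag for the number of CPUs and -k to keep the original file are how my pbzip2 build behaves, so check pbzip2 --help on your system before relying on them.

```python
# Rough sketch: compress the backup tarball with pbzip2 so the work spreads
# across multiple cores, while leaving headroom on a production DB server.
import subprocess
import time

archive = "databases.tar"   # the 1.3GB tarball from the example above
cores = 4                   # use half the 8 cores to keep the server responsive

start = time.time()
subprocess.run(["pbzip2", f"-p{cores}", "-k", archive], check=True)
print(f"compressed {archive} in {time.time() - start:.1f}s")
```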


(more…) «How To Optimize Your Database Backups and Text File Compression with pbzip2 and pigz»

Meetup: A NOSQL Evening in Palo Alto, CA – October 26, 2010

Tuesday, October 12th, 2010 by

Have you ever wondered what this “new” term “NOSQL” means? Is it something you need to seriously consider when crafting your cloud environment? Is it a viable alternative to MySQL, SQL Server, Oracle, PostgreSQL, or other database applications? Do you even know how to get started with it?

If any of the questions above left you scratching your head AND you are in Silicon Valley on October 26th, you might want to head over to the meetup “A NOSQL Evening in Palo Alto”.


Many industry experts say that NOSQL is, or will be, the database of choice for cloud computing because of its scalability and its ability to handle data-intensive applications. Many popular, high-traffic websites that deliver lots of rich media are adopting NOSQL and re-architecting their infrastructures to take advantage of it.

From the Event Page:

Tim Anglade (founder, A NOSQL Summer), hosts a 2-hour discussion and Q&A with representatives from some of the most prominent NOSQL vendors and projects. The topics will look back on the origins of the movement and its growing pains, discuss the current technological & business states, as well as look ahead to the opportunities for improvement and expansion.

So if you ever wanted to hear the lessons learned by an emerging tech industry, or wanted to have an inside look into the non-relational revolution, come join us for a unique evening at beautiful Dinah’s Garden Hotel, in Palo Alto, California.

(more…) «Meetup: A NOSQL Evening in Palo Alto, CA – October 26, 2010»

Partner Press Release & Webinar: Sentrigo & GoGrid to Host Webinar on Building Stable, Scalable, Secure Databases in the Cloud

Tuesday, August 10th, 2010 by

GoGrid partner Sentrigo, a database security software innovator, today announced that it will host a webinar with GoGrid focused on building stable, scalable, and secure databases in the cloud. The webinar details are as follows:

Date: Thursday August 12, 2010
Time: 10:00 am (Pacific Daylight Time)
Duration: 1 hour
Registration: http://bit.ly/bNkfd1

Topics to be covered are:

  • Intro to cloud web services and management tools
  • Best practices in deploying databases in the cloud
  • Available data storage options
  • Technical challenges when protecting remote databases
  • Securing databases in the cloud

Sentrigo currently offers its Hedgehog Database Compliance and Security Suite on GoGrid as CentOS 5.3 and Windows Server 2008 server images.


The following Press Release was delivered today discussing the upcoming Webinar.

(more…) «Partner Press Release & Webinar: Sentrigo & GoGrid to Host Webinar on Building Stable, Scalable, Secure Databases in the Cloud»

Partner Press Release: InfiniteGraph and GoGrid to Enable Large-Scale Graph Data Processing and Discovery, Free for Qualified Startups

Monday, July 26th, 2010 by

Today, Objectivity, a leader in distributed, scalable data management technology, delivered a press release about InfiniteGraph, a new distributed graph database product that is now available on GoGrid.


About Objectivity & InfiniteGraph

Objectivity, Inc., the leader in distributed, scalable data management technology, formed the InfiniteGraph business unit in 2010. The team was tasked with creating a product to meet a global need to obtain real-time answers from deep analysis of very large volumes of complex data.

The InfiniteGraph team has developed a solution that gives organizations significant technical, cost, and time-to-market advantages in developing advanced, large-scale social networking, business intelligence, scientific research, and national security systems across highly distributed environments. InfiniteGraph is unique in the marketplace, providing a high-performance, distributed graph database with virtually unlimited scalability.

The InfiniteGraph graph database leverages enterprise-proven technology that supports the highest graph computing requirements, helping partners and customers to find and utilize the valuable connections and multi-dimensional relationships that exist within their data.
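This is not InfiniteGraph’s API, but as a generic illustration of the kind of question a graph database is built to answer (how two entities are connected through intermediate relationships), a tiny adjacency-list traversal looks like this:

```python
# Not InfiniteGraph's API -- just a minimal illustration of finding the
# connection between two entities in a graph of relationships.
from collections import deque

# Relationships as an adjacency list (people, accounts, institutions, etc.)
edges = {
    "alice":  ["acct_1", "bob"],
    "bob":    ["acct_2"],
    "acct_1": ["bank_x"],
    "acct_2": ["bank_x"],
    "bank_x": [],
}

def shortest_path(start, goal):
    """Breadth-first search for the shortest chain of relationships."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path("alice", "bank_x"))  # ['alice', 'acct_1', 'bank_x']
```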

InfiniteGraph on the GoGrid Exchange

(more…) «Partner Press Release: InfiniteGraph and GoGrid to Enable Large-Scale Graph Data Processing and Discovery, Free for Qualified Startups»