 

Measuring the Performance of Clouds – GoGrid

March 17th, 2009

Raditha Dissanayake posted a blog entry comparing Amazon EC2 and GoGrid performance. Unfortunately, we think Raditha did not use the most rigorous methodology possible for doing his comparison. It would be inappropriate for GoGrid to performance test Amazon’s EC2. In fact, their Customer Agreement may actually make such activity questionable, but IANAL (I Am Not A Lawyer).

Let’s take a more rigorous look at GoGrid disk subsystem performance.

Framing the Issue

To start, the entire issue is far more complex than can be fully covered here. Today’s disks, hard drive controllers, and operating systems have many different kinds of caching mechanisms. In addition, virtualization systems like Xen can impact results in unexpected ways. For example, did you know that Xen can be deployed in two major ways, either ‘paravirtualized’ or ‘hardware virtualized’?

The two different models almost certainly impact any testing methodology. And yes, you guessed it, Amazon and GoGrid don’t configure Xen in the same way: Amazon uses paravirtualization and GoGrid uses hardware virtualization. Beyond this public information, neither Amazon nor GoGrid provides significant details about its infrastructure, considering it, rightfully so, proprietary intellectual property.
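If you’re curious which model a given instance runs under, there are a couple of rough checks you can make from inside a Linux guest. This is just a sketch of common heuristics, not a definitive test, and the exact markers vary by kernel and distribution:

# Rough heuristics only: a paravirtualized Xen guest typically exposes /proc/xen
# and boots a Xen-aware kernel, while a hardware-virtualized guest looks much
# like ordinary physical hardware from the inside.
ls /proc/xen 2>/dev/null && echo "looks like a paravirtualized Xen guest"
dmesg | grep -i xen | head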

Without a deep understanding of all of these issues, it’s difficult to do a test, much less a proper comparison.

But we are certain of a few very important things.

Clouds Are Multi-Tenant

First off, it’s hard to do a serious comparison like this using one server on each system. Clouds are inherently multi-tenant systems, and since end users have no visibility into who else is using or sharing their disk resources at any given time, there is no real way to verify that the results aren’t tainted by other activity.

Use the Right Tool

Secondly, hdparm -t isn’t a very good way to measure disk speed. It’s susceptible to noise from background activity; in fact, the man page says:

-t Perform timings of device reads for benchmark and comparison purposes. For meaningful results, this operation should be repeated 2-3 times on an otherwise inactive system (no other active processes) with at least a couple of megabytes of free memory. [...]

As you can see in Raditha’s test, hdparm doesn’t really do enough I/O to get consistent results in a multi-tenant environment. In these tests, hdparm is only active for a very short period of time, allowing other tenants’ activity to have a dramatic effect on the results. hdparm requires an inactive system, and since that can’t be guaranteed in the cloud, it fails the sniff test as a robust tool for cloud performance testing.

Another factor that goes unaccounted for here is that hdparm is a utility tuned for real physical disks, not virtual disks.
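For reference, the man page’s recommended procedure amounts to something like the following (the device name is just an example). Even when repeated back to back, each run only reads for a couple of seconds, which is exactly what makes it so sensitive to neighboring activity:

# Repeat the timing 2-3 times as the man page suggests; on a shared node the
# numbers will still drift, because an "otherwise inactive system" can't be
# guaranteed in a multi-tenant cloud.
for i in 1 2 3; do hdparm -t /dev/hda; done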

Better Measurements

Ideally, if you want to measure the streaming performance of a block device more reliably in a multi-tenant environment, use a larger amount of I/O. When doing this I/O, you want to try to eliminate (see the sketch after this list):

  • Hard disk controller layer cache effects
  • Hard disk layer cache effects
  • OS level cache effects
  • Effects of disk activity from other VMs
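
As a rough guide, here is how the dd invocations used in the tests below line up with those goals (the device name and sizes are simply the ones used in this post):

# iflag=direct -> bypass the OS page cache (O_DIRECT), avoiding OS-level cache effects
# bs=10M       -> large sequential reads instead of small, cache-friendly ones
# count=100    -> ~1 GB in total, enough to overflow controller/disk caches and to
#                 average out interference from other VMs on the node
dd if=/dev/hda bs=10M count=100 of=/dev/null iflag=direct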

All current GoGrid nodes have caches in the storage layer. These are designed to be robust and to absorb bursts of write activity. These caches are sufficiently large, though, that if you do repetitive small I/Os, what you end up measuring is the performance of pulling data out of the storage layer’s caches, not out of the storage itself.

To avoid OS-level cache effects, use ‘direct I/O’. High-performance applications and databases tend to use this internally for similar reasons (they want to avoid OS-level cache pollution and do their own caching). Oracle is probably the most obvious example here.
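
If you do want to test through the normal buffered path instead, the least you can do is flush the OS page cache between runs. A minimal sketch, assuming a Linux guest with root access:

sync                                  # flush dirty pages to the device first
echo 3 > /proc/sys/vm/drop_caches     # drop the page cache, dentries and inodes
dd if=/dev/hda bs=10M count=100 of=/dev/null   # buffered read against a cold OS cache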

Testing Performance

On a ‘small VM’ located on a fairly busy node:

[root@foo ~]# dd if=/dev/hda bs=10M of=/dev/null iflag=direct count=100     
100+0 records in      
100+0 records out      
1048576000 bytes (1.0 GB) copied, 3.50983 seconds, 299 MB/s
[root@foo ~]# dd if=/dev/hda bs=10M of=/dev/null iflag=direct count=100     
100+0 records in      
100+0 records out      
1048576000 bytes (1.0 GB) copied, 3.06811 seconds, 342 MB/s
[root@foo ~]# dd if=/dev/hda bs=10M of=/dev/null iflag=direct count=100     
100+0 records in      
100+0 records out      
1048576000 bytes (1.0 GB) copied, 2.14147 seconds, 490 MB/s

That test does enough I/O to minimize noise from other VM activity and is large enough to avoid hitting cache effects.

If the I/O load is small enough, you can hit storage-layer cache effects:

[root@foo ~]# dd if=/dev/hda bs=10M of=/dev/null iflag=direct count=10     
10+0 records in      
10+0 records out      
104857600 bytes (105 MB) copied, 0.116491 seconds, 900 MB/s
[root@foo ~]# dd if=/dev/hda bs=10M of=/dev/null iflag=direct count=10     
10+0 records in      
10+0 records out      
104857600 bytes (105 MB) copied, 0.16058 seconds, 653 MB/s
[root@foo ~]# dd if=/dev/hda bs=10M of=/dev/null iflag=direct count=10     
10+0 records in      
10+0 records out      
104857600 bytes (105 MB) copied, 0.115701 seconds, 906 MB/s

While this is a fairly contrived example, it’s useful in other ways because it shows you can get very good burst throughput (consider a database updating a few thousand pages).

Next, the same tests on a larger memory instance (where average performance should be a lot better).

Sustained (large) IO:

[root@ubdev1 ~]# dd if=/dev/hda bs=10M count=100 of=/dev/null iflag=direct     
100+0 records in      
100+0 records out      
1048576000 bytes (1.0 GB) copied, 1.80415 seconds, 581 MB/s
[root@ubdev1 ~]# dd if=/dev/hda bs=10M count=100 of=/dev/null iflag=direct     
100+0 records in      
100+0 records out      
1048576000 bytes (1.0 GB) copied, 1.70448 seconds, 615 MB/s
[root@ubdev1 ~]# dd if=/dev/hda bs=10M count=100 of=/dev/null iflag=direct     
100+0 records in      
100+0 records out      
1048576000 bytes (1.0 GB) copied, 1.6799 seconds, 624 MB/s

Burst (small) IO:

[root@ubdev1 ~]# dd if=/dev/hda bs=10M count=10 of=/dev/null iflag=direct     
10+0 records in      
10+0 records out      
104857600 bytes (105 MB) copied, 0.105183 seconds, 997 MB/s
[root@ubdev1 ~]# dd if=/dev/hda bs=10M count=10 of=/dev/null iflag=direct     
10+0 records in      
10+0 records out      
104857600 bytes (105 MB) copied, 0.089827 seconds, 1.2 GB/s
[root@ubdev1 ~]# dd if=/dev/hda bs=10M count=10 of=/dev/null iflag=direct     
10+0 records in      
10+0 records out      
104857600 bytes (105 MB) copied, 0.090264 seconds, 1.2 GB/s

Don’t take my word for any of this. Try it out. If you’re really bored, graph I/O performance vs. I/O size; you’ll likely see a step function with a soft edge that will give you some idea of what the storage system is capable of and how much I/O variation to expect.
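
A quick way to collect the data points for that graph is to sweep the block size while holding the total amount of direct I/O roughly constant. The device name and sizes below are illustrative; adjust them for your own instance:

#!/bin/bash
# Read roughly 1 GB with direct I/O at several block sizes and print dd's
# summary line for each, so throughput (MB/s) can be plotted against I/O size.
DEV=/dev/hda    # substitute your own block device
for bs_mb in 1 2 5 10 20 50; do
    count=$((1000 / bs_mb))
    echo "block size ${bs_mb}M:"
    dd if=$DEV bs=${bs_mb}M count=$count of=/dev/null iflag=direct 2>&1 | tail -n 1
done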

Bottom Line

It’s great that people are kicking the tires of various clouds, but let’s be careful to make sure our testing is rigorous and makes sense for the environment.  If you have questions about how to measure performance on clouds, please send them to us.  Or if you’re a performance and virtualization system guru and have some knowledge to share, please do so.

We always want to improve our cloud and take seriously any feedback that shows a real problem, but in this case the test needs tweaking, not GoGrid.

6 Responses to “Measuring the Performance of Clouds – GoGrid”

  1. raditha says:

    You have perhaps not noticed that I have done the tests three times exactly as specified by the hdparm man page. While I agree with you that it may not be the best tool for testing virtual hard drives, it certainly does give a fair indication.

    I think you have also missed the fact that I have not condemned your service but ended my blog posts suggesting that readers should try it out.

    Amazon AWS documentation actually suggests that you do your own benchmarking, so you will not be breaking their agreement if you did that (unless of course their agreement says, competitors may not sign up for our services :-) )

    all the best

  2. randybias says:

    Raditha,

    Thanks for following up. We appreciate your support of the GoGrid service. It's important to us as a business to identify and respond to blog postings like yours. In this case we felt the methodology you used may not fairly represent GoGrid's performance and we wanted to present our case.

    Open debates like this are what the Internet is about. :) We still have some concerns with the methodology that you used, but recognize that you put some effort into trying to make it fair.

    Would be happy to work with you on an update to your posting that perhaps uses a few different measures of performance (bonnie++, iozone, dd, and perhaps vmmark?) to compare disk subsystems.

    Best,

    –Randy

  3. Yoav says:

    Fact is I have been using GoGrid for over 6 months and it's true.
    IO performance can sometimes get so bad the entire system is slowed down to a crawl.

    The really sad thing is that GoGrid denies such a problem and the good people in Bangalore have no idea what they are doing.

    Answers from GoGrid regarding this phenomenon range from "delete your server and create a new one so you'll get a new node" to "the system was just fine when we looked at it".

    So Randy, I invite you to log on to my server and see first hand how i/o performance is indeed a very serious issue at GoGrid.

    Personally, I'm just waiting to have time to migrate my system out of here.

    • While I can't address your server environment directly, nor what you have configured therein, the point of the article was to outline other ways of potentially testing a cloud server.

      While this has been mentioned already, I would encourage you to re-test on a new server, using both your method and the ones outlined above. Please let me know your results: michael AT gogrid.com.

  4. Stuart says:

    Interesting. I'm currently trying to get to the bottom of some IO problems on GoGrid too at the moment. I'm coming across SQL Server errors – where SQL Server is telling me that some IO operations are taking longer than 15 seconds – this sort of thing – http://blogs.msdn.com/sqlserverstorageengine/arch… – and this is occurring on a brand new server (set up 1 week ago)

    The response so far from GG support is that if I add more RAM to the server then these errors will happen less often – which is not really much of a solution. As someone who used to write real time code on DSPs, I'm amazed that anyone should think any IO operation should ever take 15 seconds.

    I'm hopeful these problems will be analysed, identified and fixed.

  5. Dave Friedel says:

    I love the concept of the Cloud… but until more specifics and real world people start talking positively about it, I must keep my clients on physical dedicated boxes. I think what is lost on companies offering cloud services is the key issue of client retention. You may be trying to draw us to your service, but I have an obligation to deliver best-in-class service (regardless of the client), and if I do not, I lose multiple clients while you lose one.

    Please continue to advance the technology – many of us are waiting patiently.
