 

How To Optimize Your Database Backups and Text File Compression with pbzip2 and pigz

February 9th, 2012

Recently, GoGrid was examining performance enhancements for several internal processes; among these was switching from standard gzip to “pigz”. Since I had never heard of “pigz”, I was intrigued by this “parallel” implementation of gzip, which uses all available CPUs/cores rather than just one. This prompted me to ask, “I wonder if there is a parallel implementation of bzip2 as well?”, and there began my endeavor.

pigz and pbzip2 are multi-threaded (SMP) implementations of their respective single-threaded counterparts, gzip and bzip2. They are both actively maintained and are fully compatible with all current bzip2 and gzip archives.
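
That compatibility means the parallel tools are essentially drop-in replacements. As a quick sketch (hypothetical filename), an archive compressed with pbzip2 decompresses with the standard tool and vice versa:

# pbzip2 backup.tar
# bunzip2 backup.tar.bz2

One caveat worth noting: pbzip2 only decompresses in parallel files that pbzip2 itself created; a standard .bz2 file still decompresses fine, just on a single core.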

If you’re like me, you might’ve stayed away from gzip or bzip2 because they are single-threaded. Compressing, say, a 2GB file can make the system rather sluggish; the compression tool saturates a single core of today’s multi-core, multi-CPU systems, creating an uneven load across cores and leaving most of the machine’s CPU capacity idle.

In this example I have a .tar file containing several databases, totaling 1.3GB. The system in question is a GoGrid dedicated server with 8 cores; it is a production database server running at a load of around 1.

Using bzip2, the file took approximately 6 minutes and 30 seconds to compress. Yikes!

[Screenshot: bzip2 timing output]
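
Since the original screenshot is not preserved, here is a sketch of what that invocation presumably looked like (filename taken from the gzip example below):

# time bzip2 dbbackup-12_10_2011.tar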

Now we’ll try this again with pbzip2, the parallel implementation of bzip2.

[Screenshot: pbzip2 timing output]
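
For reference, the pbzip2 equivalent would be along these lines: pbzip2 autodetects the number of cores by default (-p8 would pin it to 8 explicitly), and -v enables the verbose output mentioned below:

# time pbzip2 -v dbbackup-12_10_2011.tar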

Not only did pbzip2 take roughly one-seventh the time of regular bzip2, it also has a verbose option that provides some nice output and a progress bar (not visible here) while compressing.

Using either bzip2 or pbzip2, the file compressed to an impressive 127MB, down from 1.3GB.

Now let’s try the same test with gzip and pigz.

# time gzip dbbackup-12_10_2011.tar

[Screenshot: gzip timing output]

gzip took considerably less time than bzip2 to compress the same archive: roughly 40 seconds instead of 6.5 minutes. However, the resulting file is a bit bigger at 177MB.

Now with pigz.

[Screenshot: pigz timing output]
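
Again as a sketch: pigz likewise uses all available cores by default (-p 8 would set the thread count explicitly):

# time pigz dbbackup-12_10_2011.tar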

pigz took about one-seventh the time of gzip, clocking in at around 6 seconds; an impressive speedup once again.

Conclusion

Both parallel implementations delivered dramatic time savings compared to the standard, single-threaded versions of these compression tools.

pbzip2 and pigz each ran in roughly one-seventh the time of their single-threaded counterparts, close to the ideal 7/8ths reduction you would expect going from 1 core to all 8. pbzip2 and bzip2 also compress the files a bit better than gzip, shrinking the archive by roughly 90% (1.3GB down to 127MB, versus 177MB for gzip).

In particular, this could save you a lot of money when using GoGrid Cloud Storage, and it frees up your Cloud or Dedicated server resources for the workloads they were intended for in the first place.

Depending on how these archives will be used, choose the implementation that fits your needs: pbzip2 compresses more tightly, while pigz is faster.
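
For scheduled database backups, either tool drops straight into a tar pipeline. A minimal sketch, assuming a hypothetical dump directory:

# tar -cf - /var/backups/mysql | pigz > dbbackup.tar.gz
# tar -cf - /var/backups/mysql | pbzip2 > dbbackup.tar.bz2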

Whatever you choose, hopefully this will increase the usability of these compression tools and help provide a more stable and optimized GoGrid environment.

Happy Compressing!

2 Responses to “How To Optimize Your Database Backups and Text File Compression with pbzip2 and pigz”

  1. jbohm says:

    I think there must be some typo near "127M down from 1.3GB using bzip2 or pbzip2" . If pbzip2 does what it says it should produce exactly the same output as bzip2, and certainly the same output as itself.

    Besides that, there is a key downside to this technique: While the wall time to do the compression gets smaller, the risk of taking away CPU from production processes (such as the database engine on this server) increases. Depending on server specifics, this may be mitigated by careful use of the "nice" command to run the parallel compression job at a lower priority.

    RAM consumption may or may not be an issue, as a parallel bzip2 will probably increase its memory consumption by about 9.4MB per CPU used (400K + 8 x 900K + 2x900K buffers to hold input and output as external input and output happens serially).

    A related issue for virtual servers is if the surrounding framework (such as the GoGrid API) allows the CPU and RAM allocation to a server to be temporarily boosted for the few minutes that particular server is running a scheduled backup job. So far in the industry, I have only seen APIs that allow such adjustments while the virtual server is shut down, which typically isn't a good thing to do in a nightly backup job. And yes, I have seen this limitation in the underlying virtualization engines too, so it is hardly GoGrid's fault.

  2. gogridzack says:

    Hi jbohm,

    Thanks for your reply.

    —8<—
    I think there must be some typo near "127M down from 1.3GB using bzip2 or pbzip2". If pbzip2 does what it says it should produce exactly the same output as bzip2, and certainly the same output as itself.
    —>8—

    There is no typo in this statement; "using bzip2 or pbzip2" was just expressing that either program resulted in the same compressed size. 1.3GB is the original size of the file.

    —8<—
    Besides that, there is a key downside to this technique: While the wall time to do the compression gets smaller, the risk of taking away CPU from production processes (such as the database engine on this server) increases. Depending on server specifics, this may be mitigated by careful use of the "nice" command to run the parallel compression job at a lower priority.
    —>8—

    Yes, one can use the nice command here, as with any CPU-intensive process; as I'm sure you would agree, that advice applies to anything CPU-intensive, compression included. I tested this on a production database server that handles about 2K queries/sec and saw almost no performance decrease when compressing those files with the parallel implementations. Your results may vary, and you should always exercise caution in production environments.
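
    For anyone who wants to try that, a lower-priority run would look something like this (filename from the examples above; the priority value is just an illustration):

    # nice -n 19 pbzip2 -v dbbackup-12_10_2011.tar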

    —8<—
    RAM consumption may or may not be an issue, as a parallel bzip2 will probably increase its memory consumption by about 9.4MB per CPU used (400K + 8 x 900K + 2x900K buffers to hold input and output as external input and output happens serially).
    —>8—

    Interesting information, and duly noted. Unfortunately, I can't cover every effect the parallel versions of these compression tools will have on a given system.

    —8<—
    A related issue for virtual servers is if the surrounding framework (such as the GoGrid API) allows the CPU and RAM allocation to a server to be temporarily boosted for the few minutes that particular server is running a scheduled backup job. So far in the industry, I have only seen APIs that allow such adjustments while the virtual server is shut down, which typically isn't a good thing to do in a nightly backup job. And yes, I have seen this limitation in the underlying virtualization engines too, so it is hardly GoGrid's fault.
    —>8—

    This seems more like a generality in cloud computing, and not really related to the topic at hand.

    Thanks again for your response, and hopefully I cleared up some of your concerns.
