I've been playing around a bit with file compression again, as we generate some very large backup files daily on Stack Overflow.
We're using the latest 64-bit version of 7zip (4.64) on our database server. I'm not a big fan of more than dual core on the desktop, but it's a no brainer for servers. The more CPU cores the merrier! This server has two quad-core CPUs, a total of 8 cores, and I was a little disheartened to discover that neither RAR nor 7zip seemed to make much use of more than 2.
Still, even if it does only use 2 cores to compress, the 7zip algorithm is amazingly effective, and has evolved over the last few years to be respectably fast. I used to recommend RAR over Zip, but given the increased efficiency of 7zip and the fact that it's free and RAR isn't, it's the logical choice now.
Here are some quick tests I performed compressing a single 4.76 GB database backup file. This was run on a server with dual quad-core 2.5 GHz Xeon E5420 CPUs.
| 7zip | fastest | 5 min | 14 MB/sec | 973 MB |
| 7zip | fast | 7 min | 11 MB/sec | 926 MB |
| 7zip | normal | 34 min | 2.5 MB/sec | 752 MB |
| 7zip | maximum | 41 min | 2.0 MB/sec | 714 MB |
| 7zip | ultra | 48 min | 1.7 MB/sec | 698 MB |
For those of you who are now wondering, wow, if 7zip does this well on maximum and ultra, imagine how it'd do on ultra-plus, don't count on it. There's a reason most compression programs default to certain settings as "normal". Above these settings, results tend to fall off a cliff; beyond that sweet spot, you tend to get absurdly tiny increases in compression ratio in exchange for huge order of magnitude increases in compression time.
Now watch what happens when I switch 7zip to use the bzip2 compression algorithm:
We'll compress that same 4.76 GB file, on the same machine:
| bzip2 | fastest | 2 min | 36 MB/sec | 1092 MB |
| bzip2 | fast | 2.5 min | 29 MB/sec | 1011 MB |
| bzip2 | normal | 3.5 min | 22 MB/sec | 989 MB |
| bzip2 | maximum | 7 min | 12 MB/sec | 987 MB |
| bzip2 | ultra | 21 min | 4 MB/sec | 986 MB |
Why is bzip2 able to work so much faster than 7zip? Simple:
7zip algorithm CPU usage
bzip2 algorithm CPU usage
Bzip2 uses more than 2 CPU cores to parallelize its work. I'm not sure what the limit is, but the drop-down selector in the 7zip GUI allows up to 16 when the bzip2 algorithm is chosen. I used 8 for the above tests, since that's how many CPU cores we have on the server.
Unfortunately, bzip2's increased speed is sort of moot at high compression levels. The difference between normal, maximum, and ultra compression is a meaningless 0.06 percent. It scales beautifully in time, but hardly at all in space. That's a shame, because that's exactly where you'd like to spend the speed increase of paralellization. Eking out a percent of size improvement could still make sense, depending on the circumstances:
total time = compression time + n * (compressed file size / network speed + decompression time)For instance, if you compress a file to send it over a network once, n equals one and compression time will have a big influence. If you want to post a file to be downloaded many times, n is big so long compression times will weigh less in the final decision. Finally, slow networks will do best with a slow but efficient algorithm, while for fast networks a speedy, possibly less efficient algorithm is needed.
On the other hand, the ability to compress a 5 GB source file to a fifth of its size in two minutes flat is pretty darn impressive. Still, I can't help wondering how fast the 7zip algorithm would be if it was rewritten and parallelized to take advantage of more than 2 CPU cores, too.
The bzip2 compressor splits data into blocks of 900kB and compresses each block separately; compressing your 4.7GB file could make use of over 5000 CPUs efficiently. The 7zip compressor isn't easily parallelized for exactly the same reason as it produces better compression: It doesn't work operate on small blocks. Compressors operate by forming a model of the data which is being compressed; the more data they're compressing at once the better the model.
Colin Percival on February 28, 2009 8:23 AMColin -- so it is effectively impossible to paralellize 7zip? I sort of suspected this, for the reasons you stated, but I was hesitant to come out and say that in the post.
Alternately, what's the largest block size that could be used without defeating parallelization? The same size as the L2 cache per CPU core?
Jeff Atwood on February 28, 2009 8:28 AMBreak your backups into various parts, and run parallel copies of your favorite tool to compress the individual parts. One way to deal with the problem if your tool-of-choice doesn't scale.
As for multi-core CPUs -- developers are *the* likely consumers to use multi-core CPUs on their desktop. My grandparents and parents aren't going to take advantage of multi-core when sending e-mail. But as a serious career developer, when I have multiple VMs running, I can see the load spread across the cores. I typically have several VMs running with different versions of my C++ environment/project, and another 1 or 2 VMs with python projects. I can have huge re-compile events going on in several VMs, while continuing to work in the foreground of yet another VM. And I do this regularly. I've never understood your stance on the topic.
(Otherwise, keep up the great work. :) Love the blog SO.)
Stéphane Charette
Stéphane on February 28, 2009 8:33 AMI'm not a big expert on compression algorithms but I'm still going to comment because this is the first time I've been on your site recently and there haven't been 500 comments so maybe this gets read :)
I started using 7zip after my last desktop upgrade because it was the first thing I grabbed that was free. I don't do any major compression/decompression more than a dozen MB or so. Speed isn't my major concern. It's fast enough. The only big files I sometimes have to decompress are gzipped log and pgsql backups and it does fine with those.
The benchmarks are interesting. 10x more wait for only a 10% decrease in files size seems like such a waste. I'd rather use 7zip than bzip if file size was an issue.
By the way, I'm confused. You say you don't like more than 2 cores on a desktop but your showing windows task manager.
Oh. Coding HORROR. Now I get it :)
htnmmo on February 28, 2009 8:38 AM7zip doesn't operate on blocks at all, it's entirely streaming. And since it uses an arithmetic coder, it probably only outputs one bit at a time.
You could parallelize it per file, though.
astrange on February 28, 2009 8:39 AM7zip doesn't operate on blocks at all, it's entirely streaming
The 7zip preferences allows a selection of 1 or 2 in the Number of CPU Threads option. Looks like it does *something* with the 2nd core, but obviously it doesn't extend to more than 2.
Jeff Atwood on February 28, 2009 8:49 AMastrange,
A streaming compressor counts as one with an infinite block size. :-)
Jeff,
I don't think it is possible to parallelize 7z compression given the current file format; but that doesn't mean that the file format couldn't be changed. The problem with stream compressors is that each time a byte is processed the model is updated; you can't have additional processors jump ahead and process data which the 1st processor hasn't reached, because you don't know what the compression model is going to look like until you get there. To solve this, you would need to have the compression state periodically revert to an earlier state; one way of doing this would be to zero the compression state (i.e., to operate on blocks completely independently of each other), but somewhat better results could be obtained by branching and merging models. Either way, however, you'd need to change the compression format, since the decompressor needs to update its model in sync with the compressor.
Colin Percival on February 28, 2009 8:59 AMBTW, the 7z compressor operates in several stages (optional preprocessing, deflation, arithmetic compression), so I'm guessing that the option to use two threads just pushes one of these (probably arithmetic compression) off to a second thread. This is a cheap way of parallelizing, but it doesn't scale.
Colin Percival on February 28, 2009 9:02 AMIf you broke up your backups into multiple files and did multiple invocations of 7zip as Stephane suggested, that would nullify the benefits of the 7zip format.
It may very well be possible to parallelize the format without blocking up the data, but as to how trivial it would be, I don't know.
If nothing else, reimplementing the algorithm in a functional language designed for parallelism might be worth the overhead of a 'slower-than-C' language with that many cores available.
I'm actually a tinkerer with asm and C, so if you (Jeff Atwood) expressed sufficient interest in the subject I could look at 7zip's code and see what the deal is.
Chris Allen on February 28, 2009 9:02 AMSo, 7zip with default settings takes 34 meters to compress the database, while bzip2 takes only three and a half meters. Cool. But I was confused with the MB/sec units you used in the forth column, This resembles me the unit for data throughput, but that is MB/s, not MB/sec.
Just kidding. :-)
As suggested, you should break the big database in separate tables, and compress tables in parallel. Since the LZMA algorithm is streaming, it makes more sense to compress tables individually, since data may look more alike inside each table (better opportunity for compression) than taking the whole database at once.
Juliano on February 28, 2009 9:07 AMWhat is bzip doing with all that cpu time? It looks like 7zip running fastest mode (with less parallelization) beats bzip ultra mode (in file size). Based upong the final file size, it doesn't look like bzip runs faster. It just generates more heat.
Metro Sauper on February 28, 2009 9:11 AMBased upong the final file size, it doesn't look like bzip runs faster. It just generates more heat.
This was my thought too. It's a good example of why I still don't think anything beyond 2 cores on the desktop is really necessary yet. Get a super fast dual core (mine runs at 3.8 GHz now), and you will win every time except for edge conditions.
(to be fair to @Stephane, C++ compilation is one of those edge conditions; it's very slow and highly, *highly* parallelizable.)
The only reason to go bzip2 is if you need the file compressed as rapidly as possible to a reasonable size.
On pure efficiency / energy, 7zip wins, despite bzip2 using 4x the CPU time.
Jeff Atwood on February 28, 2009 9:16 AMNever mind, I get it. The bzip throughput is higher (bytes processed faster0. However the result size is worse (larger). It would be interesting to come up with some ratio that would represent compressed bytes per minute.
Metro Sauper on February 28, 2009 9:17 AMMetro, maximum compression developed an efficiency measure:
http://www.maximumcompression.com/data/summary_mf2.php
That combines compression time, decompression time, and size, which sounds like what you're looking for. It used to be only size and compression time, which for some scenarios is more useful (like this one).
Well, actually, it seems to me to be feasible to have both the parallel operations and the better compression size.
Compression works on the model of the data. So effectively, you'd need some sort of modeller algorithm that can analyze blocks of the dataset, then merge the results together. The question becomes then, is that algorithm better than one that does not parallelize?
An easy way to prove something does not work is by trying to find a situation where it fails. Of course, logically, not finding a situation does not mean it always works either.
Let us assume that in each block there is a unique byte, b1. Because it is unique within only that block, but not in the context of the program, most naive modeling algorithms would miss the correlation. But it is possible, with an accompanying increase in memory consumption, to build a modeling algorithm that catches that pattern, while still parallelizing the work.
Now, for a real proof, we'd start at a base case, where each block is one byte. Since there is no repetition in the block, we cannot compress it, not without increasing the size of the file. Then we merge a two of the checked bytes together. And one of two situations happen: either there is repetition, or there is not. At each stage, the blocks are merged together with more blocks, and again, the same two situations happen. Each merged (n+1) block is checked. Then merged. If the modeler can find a pattern of data at an n-size block, it follows it will find more at each n+i-size block. Therefore, we achieve the parallelization and the proper data modeling we need to do this.
Not quite a full proper proof, but it should be sufficient. If not, I apologize. I'd suggest taking a look at the literature on how DBMS' deal with efficient indexing while taking advantage of multiple cores. ;) Most of the problems we face now were solved in that area 10-15-20 years ago.
Sorry, but your example is rather meaningless.
bzip2's ultra is worst in both time and size than 7zip's fastest.
So its rather pointless to say bzip2 is more parallel and scales better, when 7zip out right beats it.
However this is only a sample size of 1, so its hard to say.
Pyrolistical on February 28, 2009 11:39 AMAny reason you didn't use gzip? If possible, can you please post results for that, too?
My personal observation is gzip is extremely fast compared to bzip2 (and 7 zip) and the penalty of extra storage is not that high.
Shashikant on February 28, 2009 11:45 AMEr... look at the numbers...
We can just conclude that 7zip fastest does a great job in 5 minutes on 1 core, while bzip2 ultra isn't as good in 20 minutes on 8 cores.
Philippe on March 1, 2009 1:28 AMPhilippe: Or we can conclude that bzip is better if you need speed and 7zip is better if you need size. If you look at both of them they both shine in their specialty but suck when not.
Dobbs on March 1, 2009 1:39 AMDobbs: I stand corrected - bzip2 is really fast for fairly good results.
Philippe on March 1, 2009 2:16 AMHi Jeff: could you change the font; color for this blog please? otherwise an excellent place
maruti on March 1, 2009 2:47 AMJeff, you may want to try lzop too; it won't give as good compression, but it should be the fastest by CPU seconds, and the decompressor is very fast too.
Ralph Corderoy on March 1, 2009 3:30 AM
7zip fastest 5 min 14 MB/sec 973 MB
bzip2 ultra 21 min 4 MB/sec 986 MB
Why is bzip2 able to work so much faster than 7zip?
---
To compare like for like, we need similar final file sizes, and bzip is not faster than 7zip.
Peter Rounce on March 1, 2009 3:48 AMto hmm:
Does your quad core have more L2 or L3 cache than your dual core laptop?
It might not take much more cache to improve the hit rate an extra 1%. At high hit rates,(e.g. 90%), a 1% improvement hit rate improvement is a major (e.g. 10% ) improvement in the miss rate.
At a 100 clock tick penalty for a miss, that's the difference between 100 instructions in 1090 cycles vs 100 instructions in 991 cycles.
Depending on the working set of all the processes (including clock-tick, keystroke, and mickey triggered background processes) an additional 1 MB of cache can make a BIG difference.
David Kra
Another relevant measure with multicore is MiB/J - megabytes per joule.
Ed Davies on March 1, 2009 3:50 AMYes everyone saw that:
7zip fastest 5 min 14 MB/sec 973 MB
bzip2 ultra 21 min 4 MB/sec 986 MB
But that's missing the point, which was you can get a very fast compression of 5GB of data using bzip2: 2min vs 5min with 7zip, with a loss of 'only' 100MB, which when compared to 5GB is not that much (2%).
And at normal bzip2 gets a much smaller size diff for still 1.5 min less.
Of course one wouldn't want to max out ALL the server's cores, you may need to actually 'serve' things during that time. So IMO the 3 minutes saving isn't really worth that much in that use case.
But i guess if you needed to compress your skynet DB on with a gazillion cores then bzip2 would be the way to go.
jwickers on March 1, 2009 4:52 AM@Pyrolistical:
We can get a sample size greater than 1 if you compare something other than the one configuration that makes bzip2 look worst :).
bzip2 ultra is clearly the wrong choice on this hardware and dataset. But there's no configuration of 7zip that can get the original file under a quarter of its original size in under 5 minutes, which bzip2 normal (and faster) achieve. So bzip2 is clearly faster than 7zip in those scenarios.
Of course, as stressed above, the scenario must be considered to decide the proper space/time tradeoff.
Ens on March 1, 2009 5:04 AMInteresting, but WHY ARE YOU USING UNCOMPRESSED NATIVE SQL BACKUPS, and not compressed SQL backups (let the SQL engine compress the file on the fly as it is performing the backup)?
Yes, SQL 2008 supports it: http://sqlblog.com/blogs/tibor_karaszi/archive/2007/12/12/backup-compression-in-sql-server-2008.aspx
(ok, so only Enterprise version supports it, so if you aren't using that version, then use LightSpeed backup instead to compress backups on the fly. Also allows other cool stuff like restores of individual tables instead of forcing you to restore an entire copy of a huge database.
BradC: In Unix I take that for granted... I pipe the output of mysqldump to bzip2 and there's your on the fly as it's performing the backup.
Nicolas on March 1, 2009 6:17 AMDid you try LZMA2 in 7-zip 4.66 Alpha?
http://sourceforge.net/forum/forum.php?thread_id=2965956forum_id=45797
Not all the programs/algorithms can be parallelized infinitely. Some algorithms are inherently non-parallelizable beyond a limit(for ex.:Calculation of the Fibonacci series by use of the formula: F(k + 2) = F(k + 1) + F(k) is strictly sequential). It can be possible that the algorithm used by 7zip does not scale well beyond 2 threads.
Sachin Khot on March 1, 2009 6:36 AM Why bzip2 works so faster than 7zip? Everything is simple:
I do not see that it works faster, have a look for yourself:
7zip for 7 minutes to compress the archive 926 MB,
and bzip2 compresses over seven minutes to 987 MB
Ie for one time 7zip squeezed better means to the same size 7zip sozhmet faster.
In doing so, bzip2 for the use of all eight cores, while 7zip barely 2. If he uses all 8 cores, he would be back in 4 Araz faster. This 7zip does not semmitrichny algorithm, unpacks it 10 times faster.
Conclusion: 7zip much-a lot faster than bzip2.
You blog in bad character.
homm on March 1, 2009 6:55 AMI first came here to say what your first commenter said.
So instead I'll say that people using the original bzip2 program (rather than its algorithm implemented by 7zip) should install pbzip2 to get the parallelization advantages.
http://compression.ca/pbzip2/
(It seems to be available already in Ubuntu Linux.)
One other thought... have you considered how caching of some of the data between runs may affect your speed results? You didn't mention how much memory you have, but I assume it's at least the size of your data file.
rfunk on March 1, 2009 7:25 AMNat -- sounds very interesting!
copied from that thread, from Igor Pavlov (the author of 7zip):
------
LZMA2 provides the following advantages over LZMA:
1) Better compression ratio for data than can't be compressed. It can store such blocks of data in uncompressed form. Also it decompresses such data faster.
2) Better multithreading support. If you compress big file, LZMA2 can split that file to chunks and compress these chunks in multiple threads.
LZMA2 uses subchunks (64K of compressed / 2 MB of uncompressed). For each subchunk it strores uncompressed / compressed sizes (5 bytes) and restarts RangeCoder (~5 bytes). So there is overhead of 10 bytes per 50-64 KB.
That schema allows also multithreading decompression - just read big amount of data and walk over 64 KB subchunks headers searching new chunk start. But I don't plan to implement multithreading decompression in near future.
WHY ARE YOU USING UNCOMPRESSED NATIVE SQL BACKUPS, and not compressed SQL backups (let the SQL engine compress the file on the fly as it is performing the backup)?
Because the default compression offered is very poor -- and once compressed, it cannot be *re*compressed with any kind of savings.
We can literally get 1/2 the size of those compressed backups using 7zip.
Jeff Atwood on March 1, 2009 8:05 AM@Jeff
But even though it isn't compressed as much, the backups are many times faster when done to a compressed file. That's the biggest advantage - try benchmarking that! The difference is startling (like 90% speed improvement).
Cheers
Philip on March 1, 2009 8:58 AMThere are versions of gzip and bzip2 specifically written in a parallel fashion (at least for compression). A post with some benchmarks:
http://blogs.sun.com/timc/entry/tamp_a_lightweight_multi_threaded
You may want to check out the pigz utility (Parallel Implementation of GZip):
It leverages the standard zlib found on most Unix-y systems, and by default launches up to eight threads to go through the file / stream (configurable).
(Sun has some systems that can run in up-to 256 threads in parallel, so the people there are kind of interested in this.)
David Magda on March 1, 2009 9:48 AMOn the file format debate (which of 7zip, gzip, bzip2 is better?), wouldn't it depend on what you're doing with the compressed file?
If you're sending it out once or twice to a few people, then the overhead of bzip2 compression may not be worth it.
However, if you're posting it on a web or FTP site, to be downloaded by thousands of people, then the bandwidth savings may be worth waiting an extra five minutes for the file to be compressed (something you only have to do once). So if you're saving 1 MB by going from gzip to bzip2, and the file will be download 10,000 times in a month, that's a saving of 10 GB.
Depending on the the file savings, and your distribution audience, a few minutes' wait can save you in other places down the road.
David Magda on March 1, 2009 9:55 AMI tried unraring a huge 7-Zip archive using 7-Zip and WinRAR. 7 Zip takes 50-60% of CPU while rar is utilizing dual core close to 100%.
7-Zip reports 50 minutes to complete the operation, even with high CPU utilization rar is showing above 60 minutes to complete.
Basic queuing theory might indicate that 3 cores/CPUs may offer a practical advantage over just 2 even in light-load settings (under ~ 33% utilization) so a quad scenario *might* be overkill on the desktop.
However my quad-core desktop is vasty more responsive than my dual-core laptop. Similar clock speeds and same RAM configuration, very similar usage profiles. I doubt it's all about disk I/O but I can't discount it knowing the limitations of typical laptop drives.
hmm on March 1, 2009 10:19 AMJohn Goerzen recently blogged about comparing various different compression schemas, and the time/space tradeoff you get from each.
It's amazing how good gzip still is in terms the compression you get for a given amount of CPU time.
http://changelog.complete.org/archives/910-how-to-think-about-compression
Chris on March 1, 2009 11:06 AM
So, to test parallelization's effects, do a simple 'split' of your multi-GB file into 4 or 8 chunks and have 4 (2-CPU setting) or 8 (1-CPU setting) 7Zip processes running at them. The timings which would be relevant are the total process time for the longest-running of the 4 or 8 processes, and the relevant file size is the sum of all 4 (or 8) chunks post-compression.
Seems like the obvious next step, given the conclusion that bzip's performance gains are due to it using multiple cores. Also seems logical that you'd see a significant performance boost without a significant effectiveness drop, given the huge source file size (I have a hard time believing that any compression algorithm is going to be significantly more efficient on 4GB of data then on 2GB of data! Ten years ago I found the efficiency differences in gzip between 50MB and 100MB file sizes weren't significant, on the order of 0.1%, and I don't think the algorithms have improved enough to take advantage of the larger pattern space in the intervening years!) Who knows: you might find that a split/7zip process fits your needs the best.
On the other hand, it also seems like a straight block split to multiple cores would have occurred to the folks who make 7-zip too, so maybe they've already tried this and found no advantage. Still, their test cases and your specific case are likely different enough that what doesn't make sense for them as a general-use option does make enough sense in your scenario.
Tom Dibble on March 1, 2009 11:24 AMDoes this stuff really matters? Even for your backup solution Jeff, it is a little overkill to research this much if not just by curiosity.
For almost all users, it really doesn't matter which compression you pick, as long as you pick one (just like copyright licenses).
Hoffmann on March 1, 2009 11:32 AMThis article tells a bit of a different story: http://changelog.complete.org/archives/910-how-to-think-about-compression: 7za -2 (the LZMA algorithm used in 7-Zip and p7zip) is both faster and smaller than any possible bzip2 combination.
l0b0 on March 1, 2009 12:02 PMwow... This is impressive. You should try messing with dictionary size to see if there are any improvements
Gerghon on March 1, 2009 12:06 PMiirc 7zip can't hanle unix permissions so I've always used tar then compressed with the lzma util and uncompressed with lzma -d before untarring. more recently it is also possible to call lzma directly from tar.
Has anyone benchmarked the lzma util?
Tim on March 1, 2009 12:34 PMAny reason you didn't use gzip? If possible, can you please post results for that, too?
same file, same conditions, Gzip.
fastest:
2.5 min, 29 MB/sec, 1169 MB, one CPU core at 100%
normal:
17 min, 5 MB/sec, 1098 MB, one CPU core at 100%
Jeff Atwood on March 1, 2009 12:38 PMI second Nat's suggestion: keep an eye on 7-Zip development as LZMA2 is going to become a very interesting solution in near/mid term future if current multicore trend goes on.
Another interesting project to observe, in a longer time frame, is PAQ.
Based on neural networks, it is currently the top performing compressor (it means it, or derived works like WinRK and KGBArchiver, is in the top position of all comparative benchmarks); ZPAQ is an ongoing effort to standardize a format:
http://www.cs.fit.edu/~mmahoney/compression/#zpaq
It is too heavy in terms of memory usage and computation complexity to be a feasible choice for current generation's machine, but proven to be the best currently known approach in reducing file sizes to the theoretical amount of enthropy they contains.
Another way to reduce the size of your backups is to not take full backups every day. Instead, you can take weekly full backups and daily differential backups.
In fact, you can use a combination of full, differential and transaction log backups to increase the amount of data covered by your backups while still reducing the size of your backup files.
Have you compared your home built solution against RedGate SQL Backup?
http://www.red-gate.com/products/SQL_Backup/index.htm
And remember, recovery time is just as important as backup time. I will concur with BradC, native SQL 2008 compression does enough compression at speed, and is an easy option to implement to make it almost the default setting on new 2008 servers.
Jeff, which server is doing the compression? Are you maxing out your DB server for 3 or so minutes? What happens to site response times during this activity?
Guy on March 2, 2009 2:14 AMThe point is that it doesn't much to load up current CPUs - even a top of the line machine like mine with an efficient OS can easily make efficient use of more than 2 cores with typical consumer use.
No actual published benchmarks of typical computer use support your statement. I can point to dozens of articles backed by data on AnandTech, TechReport, Tom's Hardware, etc, that all show the same thing -- there is a *massive* point of diminishing return beyond 2 cores.
If you aren't in one of the narrow niches that can exploit parallelism, an ultra-fast dual core is all you need.
On the other hand, it is fun to believe that you need one CPU core for your email, one for your iTunes playback, etc. It's like people have forgotten how multi-tasking operating systems work.. :)
Jeff Atwood on March 2, 2009 4:11 AMOne thing that hasn't been noted, is that bzip2 takes all the cores, resulting in the sql service being put on 'pause' while it's working (it will be fighting with the sql service for cpu space)
While if you use 7zip, you will only hold 2 cores, while the other 6 are free for the server to do it's business :)
Thomas Winsnes on March 2, 2009 4:28 AMHow about lrzip? http://ck.kolivas.org/apps/lrzip/README
Slightly different implementation, uses your choice of bzip2, lzma, lzo.
Brian on March 2, 2009 6:33 AMHas anyone tried compiling either bzip2 and/or 7zip with intel's vector instructions i.e using XMM or MMX or any of the newer instructions ?
Sami on March 2, 2009 6:38 AMWall Clock vs. CPU time
It's a shame that 7zip, as a streaming algorithm, can't really be used in a massively parallel manner. If you're wondering just how parallel bzip2 can get (it is a block algorithm, with -1 through -9 correlating to the uncompressed data size), though, Wikipedia implemented a distributed network bzip2 compressor.
I've done a lot of thinking trying to address the Wikipedia backup problem (the English wiki is so huge that the backup takes months and regularly fails without completion), and I have favored 7zip entirely because of the performance of 7zip (or lzma) using faster compression. Tricks need to be used with 7zip for large amounts of data, and namely that trick is multiple input to multiple output streams. A la db-backup-1, db-backup-2, etc. Kind of like what we did in the 'old days' when you'd use 100 floppy disks to back up your desktop. That sucked, by the way.
Anyway, the answer at the end of it is that I pipe my database backup streams into 'lzma -2' because I find it compresses faster than bzip2 and compresses to a smaller size as well (assuming you're not running bzip2 with a 4 fold cpu advantage, at least).
Of course it's all moot in the sense of Wikipedia since I guess something on the order of 100 of us are willing to tackle the problem, and nobody seems to want to let us try to get our hands dirty. It was fun reading on the mailing list last week.
It's years old, but Linux Journal made one of the niftiest looking compression comparison graphs I've seen in their comparison of 18 compressors: http://www.linuxjournal.com/article/8051
-Jeff
Autocracy on March 2, 2009 6:48 AMIt never occurred to me before but since bzip uses the BWT (Burrows-Wheeler Transform), this does allow parallelism since the compression has two independent steps:
(1) cyclic suffix sort
(2) compress the last column
and it operates on fixed size chunks making it a perfect candidate for multi-core CPUs. Sounds like another blog posting, BWT is a beautiful and elegant idea. It is one of those ideas that seems mysterious and you wonder how it was ever invented.
Not to be a link-whore, but I tracked 7-zip's compression efficiency just recently: http://heliologue.com/2009/02/09/tracking-lzma-efficiency/
There was a precipitous drop in encoding time between the early 4.x series and about v4.57, at which time it's been idling. Looks like LZMA2 (4.66 in my data) promises to increase the efficiency even further.
Final compressed sizes are similar (too similar to bother reporting, I found, at least on Ultra).
Ben on March 2, 2009 8:37 AMThere is of course a certain point where current OSes and apps give diminishing returns when more CPUs are used. There are also some studies that say beyond a certain point processing actually slows down for some tasks.
However, having four or eight cores is not a waste for many consumer desktops. I have an eight core MacPro with 10GB or RAM, running OSX (and often Windows and Linux in VMWare). OSX has a nice little dev tool where you can view the CPU load graphically, and you can disable CPUs as you wish - down to one CPU.
To make a point this weekend to someone considering whether to get a dual core or a quad core machine, I ran only an email client, a browser and iTunes - a typical consumer app load. In iTunes I was using a 'spectrum analyzer' visualizer which does load down the CPUs a bit, but it is not unusual for consumers to do something like this (or to be watching a DVD or streaming video).
At 8 cores not all 8 cores were used at the same time - it was switching between them. At four cores it was using all four cores simultaneously at about 25% average load each. At two cores it was using about 50% average load each. At one core the CPU was often pegged.
The point is that it doesn't much to load up current CPUs - even a top of the line machine like mine with an efficient OS can easily make efficient use of more than 2 cores with typical consumer use. Whether it is worth the extra money for a given consumer depends on their budget, their use and what they are buying. I have seen Dell quad core desktops on sale for under $500 so I don't hesitate to tell someone to get a quad core machine if the price is right. It won't be wasted and in the near future apps and OSes will take even more advantage of more cores.
I remember when the '386 first came out. A pundit in a review article said it would never be useful for the desktop - it would only be useful for 'file servers'. I also remember a couple of years after the Mac came out - another pundit asserted that Macs would never run Windows.
Never say never.
Developer Dude on March 2, 2009 8:46 AMGuy: And remember, recovery time is just as important as backup time.
Why would you claim this? It seems that, generally speaking, you should be doing the backup well over 100x as many times as you do the recovery (other than spot checks for backup reliability, but that's not a time-sensitive task unless you insert it as a blocker). If I have a catastrophic failure and need to take 2 hours to decompress the backup, I'm willing to charge that to the unfortunate but uncommon catastrophic failure event; if I'm taking 2 hours to compress every day so that recovery is a 10 minute operation I think I'm wasting a heck of a lot more time for little real payoff!
Or, are you implying that every backup should be uncompressed and verified immediately before closing the backup window?
@ Tom Dibble: Recovery time is as important as backup time if you want to have a DR site ready to go when a failure happens. In that kind of situation, you'll want to restore your latest backup to your DR server as fast as possible.
Andrew Francis on March 2, 2009 10:04 AMUnfortunately compression algorithms are like symmetric block-ciphers. If you want fast, then it won't compress well / it will leak information.
If you want it to compress well / not leak information, the task becomes highly serialized.
If we can only standardize on a compression algorithm, and then add specialized circuits that would do that very quickly, then it would be fast. It's kinda like that test where the Via Nano trounced a C2Q on AES compression.
Calyth on March 2, 2009 12:29 PMIn 7-zip 4.19 (IIRC) bzip2 implementation was rewritten to archive higher compression ratios on redundant data (such as source tarballs) on highest modes while remaining compatible to bzip2. Unfortunatelly that rewrite also resulted in the normal mode becoming slower in total CPU time than the standard bzip2 implementation without higher compression ratio.
Guillermo Gabrielli on March 3, 2009 1:59 AM@Jeff: Even the simplest compression algorithms (Shannon-Fano and Huffman) already eliminate between 80-90% of the information content in a file.
With so much entropy reduced, trying to compress the file again usually yields less than 10% size reduction (or even a negative compression) so it's far more efficient to use a better algorithm from the very beginning (I bet SQL2008 is simply using simple ZIP's deflate compression).
Now, what I have never understood is: why has nobody ever created a compression card, just as we have specialized graphics, sound and network cards? As a DBA I have to do backups continuously, copy them over the network, burn them, etc. (and yes, I do full backups only on Sundays, differential backups at midnight the other days and transaction backups twice during working hours - I hope the same's done on StackOverflow's database).
I think there's a market for a card whose circuitry is dedicated to compress.
Just imagine it: it'd copy a stream of bytes from memory (using DMA, of course), compress it without bothering the main CPU(s), then copy the result back to main memory. For both database and web server scenarios this would be a huge win!!!
@Developer Dude:
You must be kidding. Seriously. Are you really a developer?
Watching a video is the only CPU intensive application from the bunch you mention. Fetching email is mostly bandwith-constrained rather than processor-bound - no matter what Intel tells you, you won't be either navigating faster or retrieving your mails faster by using a more powerful CPU (or multi-cores).
Writing to the HD is thousands of times slower than writing to memory - your backup task is constraint by your HD's speed, you're not gaining much sending it to its own core.
Placing data on a CD/DVD is actually a *slow* operation and requires very few CPU, the big CPU usage when burning is due to copying big chunks of data from HD to memory, and hence to the burner. The process is also highly intolerant of delays, so unless you have a RAID or a darn fast HD you should *not* be playing a video while burning a disk, anyway.
Joe on March 3, 2009 2:25 AM@Developer Dude:
You must be kidding. Seriously. Are you really a developer?
Watching a video is the only CPU intensive application from the bunch you mention. Fetching email is mostly bandwith-constrained rather than processor-bound - no matter what Intel tells you, you won't be either navigating faster or retrieving your mails faster by using a more powerful CPU (or multi-cores).
Writing to the HD is thousands of times slower than writing to memory - your backup task is constraint by your HD's speed, you're not gaining much sending it to its own core.
Placing data on a CD/DVD is actually a *slow* operation and requires very few CPU, the big CPU usage when burning is due to copying big chunks of data from HD to memory, and hence to the burner. The process is also highly intolerant of delays, so unless you have a RAID or a darn fast HD you should *not* be playing a video while burning a disk, anyway.
Joe on March 3, 2009 2:35 AMWhy not simply try splitting the original file in four equally sized parts and compress them all individually with 7zip?
I'd be really curious what the total size would be. If it's not significantly larger than the total size when compressing the original file in one piece, then doing this in parallel would be a piece of cake. Once you've established a maximum size of input file at which point making the input file larger doesn't give better compression, you can easily split up the input file. It would require a slightly different 7zip file format though.
If the combined size is much larger than the single size, it might be much harder to parallelize 7zip though...
Frederik Slijkerman on March 3, 2009 4:19 AMDid you notice that Apple has remained resolutely in favor of Core Duos in their latest refresh of their iMac consumer desktop? I'm in complete agreement with you, desktop PC performance is primarily driven by CPU clock speed, L1/L2 cache size and memory bus bandwidth rather than the number of cores.
As an aside did you measure file I/O during this test? I wonder how much of the bzip time was just spend reading/writing to the disk? I/O latency is definitely something that multiple cores can do very little about, and may actually make things worse (by randomizing the read/write operations).
Andrew on March 3, 2009 5:30 AMEven though additional cpus past 2 cannot be used efficently, I would think that a system with 3+ cores would feel more responsive in interactive use since there is a greater possibility that cpu resources would be available immediately to respond to the user.
Metro Sauper on March 3, 2009 7:14 AMBasic queuing theory might indicate that 3 cores/CPUs may offer a practical advantage over just 2 even in light-load settings (under ~ 33% utilization) so a quad scenario *might* be overkill on the desktop.
Yep, as a compromise between limiting to dual-core and going all the way to quad-core, AMD offers tri-core (X3) versions of the Phenom and Phenom II. Sadly, Intel doesn't offer tri-cores.
BTW: in moving LARGE databases, have you looked into MS' util called
ESEUTIL? This Technet article mentions it and it works fabulously on sql databases as well (it's designed to move large Exchange dbs).
http://blogs.technet.com/askperf/archive/2007/05/08/slow-large-file-copy-issues.aspx
File Size: 388 GB (417,091,305,984 bytes)in 10.27 hours.
Keng on March 3, 2009 7:45 AMbtw:btw: that same file transfer failed when it was taking almost 24hours.
keng on March 3, 2009 7:46 AMAs Frederik suggests, you should ...simply try splitting the original file in four equally sized parts and compress them all individually with 7zip.
Let us know what the result is. If this turns out to be a winner, it would be easy enough to automate the process.
PAWebb on March 3, 2009 10:47 AMI should mention, however, that in similar scenarios I have often found no improvement or even poorer performance when paralelizing encodings in this manner. You won't know until you try, though. It all depends on where your bottlenecks are, and what kind of resource contention might develop between the processes.
PAWebb on March 3, 2009 10:51 AMAnother alternative you can consider for remote backups is rdiff-backup. It uses delta compression like rsync does to only send the difference. It also keeps reverse diffs on the server side so that you can go back to a previous version of the data at any time. It works kind of like Time Machine on the Mac. You can see what everything looked like X days ago and extract files from the backup.
This means you can keep 60 days worth of old backups at a fraction of the cost of disk space. It's very convenient to be able to go back to old data in case you find a bug in your code that's been making your data go away.
We back up 25 GB in less than an hour over a 70kbytes / sec connection.
It can be tricky to get this going. Let me know if you want to try it and I'll give you some pointers.
Sarel Botha on March 3, 2009 11:10 AM JeffA wrote:
No actual published benchmarks of typical computer use support your statement. I can point to dozens of articles backed by data on AnandTech, TechReport, Tom's Hardware, etc, that all show the same thing -- there is a *massive* point of diminishing return beyond 2 cores.
If you aren't in one of the narrow niches that can exploit parallelism, an ultra-fast dual core is all you need.
lt;lt;
If you are running benchmarks where you are looking at the performance of a single app, sure, many apps are not very parallel including not being very parallel beyond 2 CPUs.
However, if you are looking at the bigger picture where someone is using more than one app at a time, then you can start exploiting the ability of most modern OSes to use more than one CPU, to assign a given application at least one CPU.
It is not a 'narrow' niche to be listening to tunes or watching a video while something else is happening on your computer. Whether you email client is fetching email/RSSfeeds/etc., and/or you are burning a DVD, and/or you have an automatic backup process running, and/or you are downloading something. These are all typical uses that if you - the user - multitask - can find that you can easily put more than 2 CPUs to good use.
I bought my MacPro because it supported OSX (so I can run OSX, Linux and Windows, including 64 bit versions, all at the same time - I am a cross platform developer) and because it supports up to 32 GB (those VMs and DBMS servers and web app back ends take up a lot of memory) - not so much because it had 8 Xeon CPUs - but I did get the 8 CPUs instead of the 4 because I knew from time to time I could use them and in the near future more and more apps would take advantage of them.
Developer Dude on March 3, 2009 11:54 AMJeff -
Is there some reason not to compress (and maybe encrypt) the db in situ? Would insert/compress or extract/decompress on-the-fly on (presumably) mostly small text files impact on response time? If not, your backup problem becomes a simple copy or ftp. Sarel's point is good - when the backup medium of choice was 9-track tapes, backing up changes was the only feasible method.
- Lepto
Lepto Spirosis on March 4, 2009 4:50 AMJeff,
We did the same benchmarks a year ago and winzip won hands down.
if you want fast and ok size it seems using a GPU would get even better results. has anyone worked on a compression program for that?
Andrew on March 6, 2009 7:15 AMwow... This is impressive. You should try messing with dictionary size to see if there are any improvements
a href=http://www.goarticles.com/cgi-bin/showa.cgi?C=1453262Forex Investment/a a href=http://www.goarticles.com/cgi-bin/showa.cgi?C=1455812Automated Forex/a
On the file format debate (which of 7zip, gzip, bzip2 is better?), wouldn't it depend on what you're doing with the compressed file?
If you're sending it out once or twice to a few people, then the overhead of bzip2 compression may not be worth it.
However, if you're posting it on a web or FTP site, to be downloaded by thousands of people, then the bandwidth savings may be worth waiting an extra five minutes for the file to be compressed (something you only have to do once). So if you're saving 1 MB by going from gzip to bzip2, and the file will be download 10,000 times in a month, that's a saving of 10 GB.
Depending on the the file savings, and your distribution audience, a few minutes' wait can save you in other places down the road.
http://www.goarticles.com/cgi-bin/showa.cgi?C=1455812
http://www.goarticles.com/cgi-bin/showa.cgi?C=1453262
I just came across this really interesting Masters' Thesis: A Parallel File I/O API for Cilk (Matthew S. DeBergalis, 2000)
http://gd.tuwien.ac.at/languages/c/cilk/deberg-thesis.pdf
I haven't finished reading it yet, but according to the abstract, the author claims to have achieved a speedup of 1.8x on a 2-processor system and a speedup of 3.75x on a 4-processor system for the popular `compress' benchmark.
Adam Rosenfield on March 12, 2009 10:15 AMBTW, 6-core desktop processors are coming in 2009 and 4 cores will be standard, Jeff, do you think 4-cores will make sense by then?
Yuhong Bao on March 15, 2009 3:21 AMOnly very selfish people write about compression speed on their blog. The unselfish ones write about the extraction speed which is where the largest differences are. Today the decompression is the bottleneck. You have to think always that the file is compressed only ones but could be decompressed many times by many people. So looking only at compression speed is very self serving and ignorant at the potential time wastage of everyone decompressing it later.
ac on August 2, 2009 2:30 AMIf you try this on a large database (ours is 120 GB) and set the dictionary size to minimum, you'll find that 7z actually runs faster than bzip2 in spite of only using 2 threads. It's a difference of about 5 hours vs. 3 hours.
It's not that surprising, really; 7z gives a much better compression ratio, and at lower compression rates your main bottleneck is going to be the disk I/O, not the CPU.
I've also noticed that hyperthreading somehow manages to spread the load even more. If you run it on a dual-core Xeon you'll see all 4 virtual cores brick-wall.
Once you hit this scale though, it's all kind of a moot point. You can't afford to have a 4-hour backup and a 4-hour compression window. Hardware is cheaper than time today, and in the long run it's better just to have a mirror at your transport endpoint (I assume you compress for transport) and do active replication.
Realtime compression using 3rd-party backup tools is alright, but even that will only get you so far. Backup compression is a total dead-end, IMO - nothing more than a band-aid fix for people who aren't using an appropriate backup solution. But as always, YMMV; you're dealing with a different kind of scale right now.
Incidentally, I believe that the reason why the SQL 2008 backup compression feature aims for a 25-35% ratio and doesn't provide much customization is that for large database, that is roughly the point at which CPU load will overtake I/O load. The main objective is not to generate tiny backups for transport, but to reduce overall backup times by reducing the amount of data that has to be written.
Aaron G on February 6, 2010 11:14 PM@Tom Dibble: In most businesses, the *only* important metrics in a backup strategy are RPO and RTO, and they are usually part of an SLA.
If your RTO is 1 day then sure, don't waste too much time optimizing for recovery. But for some businesses (banks, stock trading, military, etc.) the RTO is literally zero or measured in seconds. IIRC, our RTO is about 2-4 hours, which is about as much time as we can afford to lose before critical tasks and processes have to be shunted to the next day and a chain reaction begins. Even that's going to be too high when our scale triples.
Backup time is not only less important than recovery time - it's meaningless. The SOX auditors don't care how much time you spend on your backups as long as you meet your BCP requirements.
Welcome to the real world. It doesn't look a lot like your last PHP/Rails web startup.
Of course, backup time is important from a technical perspective; you generally need them to happen within a certain window so that they won't interrupt other jobs or operations. But that is incidental, not fundamental.
Aaron G on February 6, 2010 11:14 PMThis is only a preview. Your comment has not yet been posted.
As a final step before posting your comment, enter the letters and numbers you see in the image below. This prevents automated programs from posting comments.
Having trouble reading this image? View an alternate.
| Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved. |
Posted by: |