I've always been leery of RAID on the desktop. But on the server, RAID is a definite must:
"RAID" is now used as an umbrella term for computer data storage schemes that can divide and replicate data among multiple hard disk drives. The different schemes/architectures are named by the word RAID followed by a number, as in RAID 0, RAID 1, etc. RAID's various designs all involve two key design goals: increased data reliability or increased input/output performance. When multiple physical disks are set up to use RAID technology, they are said to be in a RAID array. This array distributes data across multiple disks, but the array is seen by the computer user and operating system as one single disk.
I hadn't worked much at all with RAID, as I felt the benefits did not outweigh the risks on the desktop machines I usually build. But the rules are different in the datacenter; the servers I built for Stack Overflow all use various forms of RAID, from RAID 1 to RAID 6 to RAID 10. While working with these servers, I was surprised to discover there are now umpteen zillion numbered variants of RAID -- but they all appear to be based on a few basic, standard forms:
RAID 0: Striping
Data is striped across (n) drives, which improves performance almost linearly with the number of drives, but at a steep cost in fault tolerance; a failure of any single striped drive renders the entire array unreadable.
RAID 1: Mirroring
The same data is written to each of (n) drives, which offers near-perfect redundancy at a slight performance decrease when writing -- and at the cost of half your overall storage in the usual two-drive mirror. As long as one drive in the mirror array survives, no data is lost.
RAID 5: Parity
Data is written across (n) drives with a parity block. The array can tolerate one drive failure, at the cost of one drive in storage. There may be a serious performance penalty when writing (as parity and blocks are calculated), and when the array is rebuilding.
RAID 6: Dual Parity
Data is written across (n) drives with two parity blocks. The array can tolerate two drive failures, at the cost of two drives in storage. There may be a serious performance penalty when writing (as parity and blocks are calculated), and when the array is rebuilding.
(yes, there are other forms of RAID, but they are rarely implemented or used as far as I can tell.)
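To make the parity idea concrete, here is a minimal Python sketch of single-parity (RAID 5 style) recovery. It assumes plain byte-wise XOR parity and a fixed parity position at the end of the stripe; the function names are made up for illustration, and real arrays rotate parity across drives and work on much larger blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block (hypothetical layout)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Recover the block at lost_index by XOR-ing every surviving block."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = make_stripe(data)      # three data "drives" plus one parity "drive"
recovered = rebuild(stripe, 1)  # pretend the second drive died
```

Any single missing block, data or parity, can be rebuilt this way, but every surviving drive has to be read to do it, which is why rebuilds are slow.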
It's also possible to generate so-called RAID 10 or RAID 50 arrays by nesting these RAID levels together. If you take four hard drives, stripe the two pairs, then mirror the two striped arrays -- why, you just created yourself a magical RAID 10 concoction! What's particularly magical about RAID 10 is that it inherits the strengths of both of its parents: mirroring provides excellent redundancy, and striping provides excellent speed. Some would say that RAID 10 is so good it completely obviates any need for RAID 5, and I for one agree with them.
This was all fascinating new territory to me; I knew about RAID in theory but had never spent hands-on time with it. The above is sufficient as a primer, but I recommend reading through the Wikipedia entry on RAID for more depth.
It's worth mentioning here that RAID is in no way a substitute for a sane backup regimen, but rather a way to offer improved uptime and survivability for your existing systems. Hard drives are cheap and getting cheaper every day -- why not use a whole slew of the things to get better performance and better reliability for your servers? That's always been the point of Redundant Array of Inexpensive Disks, as far as I'm concerned. I guess Sun agrees; check out this monster:
That's right, 48 commodity SATA drives in a massive array, courtesy of the Sun Fire X4500. It also uses a new RAID system dubbed RAID-Z:
RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe width. Every block is its own RAID-Z stripe, regardless of blocksize. This means that every RAID-Z write is a full-stripe write. This, when combined with the copy-on-write transactional semantics of ZFS, completely eliminates the RAID write hole. RAID-Z is also faster than traditional RAID because it never has to do read-modify-write. But far more important, going through the metadata means that ZFS can validate every block against its 256-bit checksum as it goes. Traditional RAID products can't do this; they simply XOR the data together blindly.
Which brings us to the coolest thing about RAID-Z: self-healing data. In addition to handling whole-disk failure, RAID-Z can also detect and correct silent data corruption. Whenever you read a RAID-Z block, ZFS compares it against its checksum. If the data disks didn't return the right answer, ZFS reads the parity and then does combinatorial reconstruction to figure out which disk returned bad data. It then repairs the damaged disk and returns good data to the application. ZFS also reports the incident through Solaris FMA so that the system administrator knows that one of the disks is silently failing.
Finally, note that RAID-Z doesn't require any special hardware. It doesn't need NVRAM for correctness, and it doesn't need write buffering for good performance. With RAID-Z, ZFS makes good on the original RAID promise: it provides fast, reliable storage using cheap, commodity disks.
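The "combinatorial reconstruction" described in the quote can be sketched roughly like this. It is a toy model, not how ZFS is actually implemented: it assumes a single XOR parity block, uses SHA-256 from Python's `hashlib` as a stand-in for the 256-bit checksum, and keeps the checksum outside the stripe the way a parent metadata block would. The names `write_block` and `read_block` are hypothetical.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def write_block(chunks):
    """Store data chunks plus one parity chunk; return (disks, checksum)."""
    disks = list(chunks) + [xor_blocks(chunks)]
    checksum = hashlib.sha256(b"".join(chunks)).digest()
    return disks, checksum


def read_block(disks, checksum):
    """Return (data, repaired_index), healing one silently corrupted data disk."""
    data, parity = disks[:-1], disks[-1]
    if hashlib.sha256(b"".join(data)).digest() == checksum:
        return b"".join(data), None  # common case: data verifies, parity untouched
    # Checksum mismatch: assume each data disk in turn is the liar, rebuild its
    # chunk from the others plus parity, and keep the first result that verifies.
    for bad in range(len(data)):
        others = [c for i, c in enumerate(data) if i != bad]
        candidate = xor_blocks(others + [parity])
        trial = data[:bad] + [candidate] + data[bad + 1:]
        if hashlib.sha256(b"".join(trial)).digest() == checksum:
            disks[bad] = candidate  # repair the damaged disk in place
            return b"".join(trial), bad
    raise IOError("unrecoverable: more than one chunk is bad")


disks, checksum = write_block([b"abcd", b"efgh", b"ijkl"])
disks[1] = b"XXXX"  # simulate silent corruption on one disk
data, repaired = read_block(disks, checksum)
```

The point of the sketch is that the checksum, not the disk, decides what counts as good data. Plain XOR parity can tell you that a stripe is inconsistent, but not which disk is lying.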
Pardon the pun, but I'm not sure if it makes traditional hardware RAID redundant, necessarily. Even so, there are certainly fantastic, truly next-generation ideas in ZFS. There's a great ACM interview with the creators of ZFS that drills down into much more detail. Hard drives may be (mostly) dumb hunks of spinning rust, but it's downright amazing what you can do when you get a whole bunch of them working together.
I never had the pleasure to handle a RAID configuration by myself, but ZFS really looks promising. The only concern I have is that all those checks that ZFS does will probably add a lot of I/O operations...
Lucacri on May 26, 2009 10:33 AM
> "...dumb hunks of spinning rust..."
yeah, but they are smooth, SHINY hunks of spinning rust.
Nice article Jeff.
Inigo Montoya on May 26, 2009 10:40 AM
Congratulations on another pointless post.
Phillip on May 26, 2009 10:51 AM
ZFS is one of "those things" that came out of Sun that they will very fondly be remembered for, in the same vein as how I look back at DEC and the incredible stuff they produced.
* Due to how ZFS handles parity data, expanding or shrinking volumes is cake. Delicious cake.
* With ZFS you can actually have separate disks for the transaction logs. For instance, you could use SSDs for the logs to cache writes, and let it flush to the slower disks later, for a nice boost.
* You can have local read caches on SSDs as well.
* NFS and iSCSI are native.
* Native LZJB and GZIP compression support, albeit CPU intensive.
* Again, due to how parity data is handled, the RAID5 write hole doesn't exist. Lose power to a RAID5 array and data is lost because uncommitted parity data is lost.
ZFS is storage porn. Additionally, the X4500 uses 3.5-inch form-factor disks. Once 2.5-inch form factor disks become a lot more prevalent (I'm hoping 2010 will be the year OEMs will really push to 2.5's), density will increase further. Even 1U pizza-box servers are pushing into 8-disk land, never mind storage boxen. :)
Wow,
If by "fantastic, truly next-generation ideas" you mean "a complete knock-off of NetApp's WAFL, which has been around for a decade and a half," then yes, I suppose ZFS is praise-worthy.
Now that's taking all that energy conservation talk for a spin!
Suyi on May 26, 2009 11:25 AM
We interviewed David Brittle of the ZFS team on FLOSS Weekly #58 (http://twit.tv/floss58). Good interview, check it out. Makes me wish Snow Leopard was here sooner, so I can boot off ZFS and say goodbye to quiet disk errors.
Randal L. Schwartz on May 26, 2009 11:48 AM
If you're intrigued by post-RAID data redundancy systems you might want to take a look at Isilon systems (http://www.isilon.com/). I have a few friends who work there, they roll their own drivers and file systems and the degree of redundancy they are able to provide on many terabytes of data is ridiculous (disk level redundancy, system level redundancy, etc, it's like RAID+SAN+smarts), combined with crazy things like the ability to grow and shrink storage volumes by adding more servers or drives. I don't know how detailed their for-public-consumption docs are, but they're here: http://www.isilon.com/products/OneFS.php
More than likely this kind of technology will eventually filter down to commodity computers (since it's mostly just software).
Wedge on May 26, 2009 12:03 PM
Welcome to the world of real datacenters :)
Creating really large storage systems (100's of TB) requires not only understanding the tradeoffs for each RAID level but also understanding the system requirements in throughput, latency and reliability.
Daniel on May 26, 2009 12:09 PM
ZFS rocks! I run Solaris on my home server and have been running ZFS for over 2 years (6x750GB drives in RAID-Z2, RAID-Z with dual parity).
One very important feature of ZFS is snapshots, which are like instant time-consistent copies of a filesystem at the time the snapshot was taken. Unlike copies, they take seconds to make and only occupy as much disk space as needed, because unchanged files use the same disk blocks. You can take daily or even hourly snapshots, and they essentially make backups obsolete for the purpose of fixing human error (you can roll back to a previous snapshot, like a database transaction rollback, or copy files from the snapshot to the active filesystem). Once you've experienced a filesystem with snapshot capabilities, there's no going back.
And yes, ZFS has taken many concepts from WAFL (just reimplemented better), but similarly NetApp borrowed NFS from Sun; what's sauce for the gander is also sauce for the goose.
Fazal Majid on May 26, 2009 12:10 PM
> I hadn't worked much at all with RAID, as I felt the benefits did not outweigh the risks on the desktop machines I usually build.
What risk? Just because RAID 0 is a dumb thing to do on a desktop machine doesn't mean RAID as a whole is. RAID 1, 5, 6, 10, etcetera are all just fine for desktops. Provided you can afford the extra disks of course, and that you stay away from the awful fakeraid implementations found on most motherboards.
Sander Marechal on May 26, 2009 12:44 PM
Why is RAID 0 even considered a RAID configuration? The "R" stands for redundant. What's redundant about RAID 0 if, when a disk fails, the whole storage fails?
(Did the Orange captcha fail or did you want to do some good by using reCaptcha?)
Abdu on May 26, 2009 12:59 PM
Never really messed with any RAID configurations. I know the theory and all that but what is the software that implements it?
Joe Beam on May 27, 2009 2:05 AM
or is it all hardware based?
Joe Beam on May 27, 2009 2:07 AM
I'm pretty surprised that nobody has brought up that Apple is moving towards ZFS on both the desktop and the server in a big way.
Glenn Howes on May 27, 2009 2:13 AM
Jeff,
You don't believe in RAID for the desktop? Try striping a couple of Velociraptors together and then tell me that. I'm running that setup and it's insane compared to a single drive. You can do multiple things on a striped drive and not thrash the drive like you do with a single drive, it's amazing.
Jeff Lorenzini on May 27, 2009 2:15 AM
I have always enjoyed the concepts behind RAID. On the desktop RAID is generally used as a way to avoid proper backups so I haven't really spent the massive time expense to understand it. With servers, just like you said, RAID shines. Disk prices have been getting to the point that with researching RAID on servers I'm starting to consider it for my desktops as a crutch to help prevent data loss on hardware failure. Backup are important but sometimes it can be impractical to backup 100's of GBs multiple times a day.
Cam Birch on May 27, 2009 2:35 AM
Back in the day, hard drives used to pause every 10 minutes or so to run recalibration or similar. If they still do, and do so asynchronously, that's a performance problem. The result I recall was you can't just stuff any old commodity hard drives in, but only the ones blessed by the RAID supplier, which, you guessed it, weren't so cheap.
Soon there will be a linux equivalent for the ZFS file system:
BTRFS
http://btrfs.wiki.kernel.org/index.php/Main_Page
Due to the license of ZFS they couldn't use the ZFS in linux.
For the time being there is a fuse ZFS implementation for linux to experiment with...
Oh, and to folk who think that a hardware RAID controller eliminates the write performance problem of RAID5.. nope, sorry it doesn't really work like that.
The big problem with all the parity/ECC RAID schemes (apart from RAID 2 or RAID 3 - which are rare these days) isn't calculating the parity/ECC bits at all, not with the controllers and CPUs we have today.
The real problem is that typical writes seldom lay down a full, perfectly-aligned stripe covering all the blocks in a single stripe-width with new data, overwriting everything that was there before.
Instead typical writes change only some subset of the data blocks in a given stripe. But you still need to know what the other blocks in that stripe contain in order to calculate the new parity/ECC. At the very least you need to read the parity disk and the current content of any block that you're replacing (presuming XOR) and possibly you need to read all the blocks from all the disks that aren't going to be touched by the write and do the whole stripe calculation afresh.
That incurs seek and read delays from the spinning rust, and causes more data to pass over the IO channels, clogging them up relative to the ideal scenario.
This read-before-write problem is why people aren't keen, even now, on parity/ECC RAID for OLTP-type scenarios, or even for heavily-used filesystems where you choose to allow the access times of files to be updated. For decision support databases, and other read-predominant (including metadata) scenarios, RAID 5/6/7/whatever is just a more resilient type of stripe and is a clear win.
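For the single-parity XOR case, the read-before-write cost of a small write looks roughly like this in Python. This is only a sketch under assumptions: a stripe is modelled as a list of byte blocks with the parity block last, and `small_write` is a made-up name, not any controller's API.

```python
def xor(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(stripe, index, new_data):
    """Overwrite one data block; return the (reads, writes) it cost."""
    old_data = stripe[index]   # read 1: the block being replaced
    old_parity = stripe[-1]    # read 2: the current parity block
    # Cancel the old data out of the parity, then fold the new data in.
    new_parity = xor(xor(old_parity, old_data), new_data)
    stripe[index] = new_data   # write 1
    stripe[-1] = new_parity    # write 2
    return 2, 2


blocks = [b"aaaa", b"bbbb", b"cccc"]
stripe = blocks + [xor(xor(blocks[0], blocks[1]), blocks[2])]
cost = small_write(stripe, 1, b"ZZZZ")
```

One logical write turns into two reads and two writes, and the reads have to finish before the writes can be issued. The XOR itself is the cheap part.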
BTW, apart from WAFL and ZFS, most storage subsystems all the way up to the user-level still happily believe that a block from a disk is what the disk says it is without referring to the other disks and/or calculating the parity bits even in a RAID.
They do this because disks are meant to be reliable and either fail-fast or at least meant to be honest and up-front about unrecoverable errors. By not being uber-paranoid you get more actual throughput and lower latency from the IO subsystem, which most-times seems like a good trade.
The bad news is that sometimes disks, controllers, IO busses, can and do suffer from Byzantine faults and you end up with unacknowledged garbage reaching your programs and users from supposedly protected storage. Sometimes you even resilver bad data over good in a mirror. And I have to tell you that it's a pain in the arse to figure out what's happening, how to fix it, and how to contain and recover from the business/science/whatever impact of that. Pick up your cane and vicodin, stick on the deerstalker and disguise and have at it.
And while ZFS RAID protects you against more hardware failures than ever before, and you can use ZFS and WAFL snapshots to help you recover more gracefully from user thinkos ("what do you mean, you didn't want to delete your whole working directory?"), you still need those remote off-site backups to protect against fire/flood/hackers/mad-axeman/police-confiscation/etc.
Blah, sorry, I typed too much. Storage, it matters y'know?
Yeah, ZFS is awesome - I run RAID-Z2 on my storage box (6x1TB, 6x500GB), never had a problem. Incredibly resilient to corruption too; you can even overwrite the "boot sector" of the disks in question and ZFS recognizes it and doesn't skip a beat.
Of course, I'd be even happier if my storage box wasn't slow as hell (40-50 MB/s peak throughput), but having tested the arrays on another computer, it's not really ZFS' fault.
I'm not going back to hardware RAID anytime soon, at least.
Asm on May 27, 2009 3:11 AM
It's very important to note that one should be very wary of hardware RAID, while software RAID is often useful. Details at the bottom of:
http://www.pixelbeat.org/docs/hard_disk_reliability/#RAID
I think the benefits of raid and/or zfs are fairly obvious. What are your opinions on ecc memory?
Tim M on May 27, 2009 4:24 AM
I'm probably preaching to the choir on this, but for home use the Windows Home Server software is pretty great. RAID-like redundancy without having to match drive sizes.
Austin Wise on May 27, 2009 4:37 AM
Try to compare RAID 6 with RAID 1+0 for a fairer comparison:
- Both use 4 drives as a base configuration
- Both deliver capacity 2*N for drives of N size.
- Both handle a failure in a single disk flawlessly
But:
Raid6 has higher overhead but also tolerates a second disk failure.
Raid1+0 has lower overhead but only gives you a 33% chance of the second disk failure also being tolerable.
When you go to 6 disks (as 5 disks on raid1+0 is pointless) it becomes even clearer:
Raid6 still has tolerance for any two disks failing, and has storage capacity 4*N
Raid1+0 still only tolerates 1 disk failing with 100% guarantee of certainty, 2 disks with 40% certainty and 3 disks (given that two already have failed) with only 25% chance of it being OK. The capacity remains at 3*N, which is less.
In other words, Raid1+0 gives you capacity N*(diskcount)/2 and redundancy of one disk (with if you're lucky up to N/2 disks), where Raid6 gives you capacity N*(diskcount-2) and redundancy of two disks guaranteed.
Taking into account that the RAID6 overhead is about 5-10% of the CPU power on a modern software raid6 with 4 disks, Raid0+1 seems like the dumbest thing you can do IMO.
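The odds in this comparison depend on which nesting order you assume, so here is a small brute-force check in Python. It is a sketch under a simple model: disks fail one at a time, uniformly at random, and the layouts are idealized. The function names are hypothetical. Under that model, the 33% / 40% / 25% figures above match a stripe-then-mirror layout (RAID 0+1), while mirror-then-stripe (RAID 1+0) comes out at 2/3 for the second failure on 4 disks, and 4/5 then 1/2 on 6 disks.

```python
from fractions import Fraction
from itertools import combinations


def raid10_survives(failed, disks):
    """Mirror pairs (0,1), (2,3), ... striped together: no pair may be fully lost."""
    return all(not {i, i + 1} <= failed for i in range(0, disks, 2))


def raid01_survives(failed, disks):
    """Two striped halves mirrored: at least one half must be completely intact."""
    half = disks // 2
    first, second = set(range(half)), set(range(half, disks))
    return not (failed & first) or not (failed & second)


def raid6_survives(failed, disks):
    """Dual parity: any two failures are tolerated."""
    return len(failed) <= 2


def survival(survives, disks, k):
    """Fraction of all k-disk failure sets that the layout survives."""
    subsets = [set(c) for c in combinations(range(disks), k)]
    return Fraction(sum(survives(s, disks) for s in subsets), len(subsets))


def next_failure_ok(survives, disks, k):
    """P(array survives the k-th failure, given it survived the first k-1)."""
    return survival(survives, disks, k) / survival(survives, disks, k - 1)


def capacity_mirrored(disks, size):
    """Usable capacity of RAID 1+0 or 0+1 with two-way mirrors."""
    return disks * size // 2


def capacity_raid6(disks, size):
    """Usable capacity of RAID 6: two disks' worth goes to parity."""
    return (disks - 2) * size


second_on_4_raid10 = next_failure_ok(raid10_survives, 4, 2)
second_on_4_raid01 = next_failure_ok(raid01_survives, 4, 2)
```

The capacity formulas are the ones given above; only the survival odds change with the nesting order.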
Peter Bindels on May 27, 2009 4:54 AM
Peter, RAID 10 has significantly better write performance than RAID 5 or RAID 6, especially when it comes to small random writes. You are correct that the CPU usage isn't the problem. These slow random writes can bring a database server to its knees.
To perform a random write on RAID 5/6 you have to:
1. Read one full stripe from each disk
2. Compute parity
3. Write one full stripe to each disk
This is the "write hole" that ZFS fixes with RAID-Z and RAID-Z2. ZFS uses copy-on-write to eliminate the read-before-write of RAID 5 and 6. It basically just writes the new data and marks the old as unused.
You might also be interested in something I only recently learned about. Linux MD RAID allows for odd numbers of drives in a RAID 10:
http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10
Pat Regan on May 27, 2009 5:17 AM
I don't understand the aversion to having RAID on your workstation. I've used a 2 disk (10,000rpm VelociRaptor) striped RAID on my desktop for a couple of years now and just wouldn't go back to a single disk setup.
Nowadays the reliability of a hard disk is excellent and in 10 years I've only had one failure. Also if your machine is built on a single disk and that fails you're still up the same creek and in effect you're only halving what is a very minimal risk. I exercise regular backups to a third disk but would do that even if on a single disk setup.
The extra pep you get in your machine on everyday tasks is truly beyond what you'd get from faster memory or overclocking (apart from maybe gaming).
Try it, it's totally worth it!
Paul Gwynn on May 27, 2009 5:20 AM
Check out this vid of ZFS in action using simple USB thumb drives and you get an idea of how cool ZFS can be..
http://video.google.com/videoplay?docid=8100808442979626078
CK on May 27, 2009 5:29 AM
"If you take four hard drives, stripe the two pairs, then mirror the two striped arrays -- why, you just created yourself a magical RAID 10 concoction!"
Create X mirrored pairs, then stripe the sets. If you stripe first, a single drive loss results in one stripe going down, and a second drive failure on the other stripe results in the loss of the array. If you mirror then stripe you can survive a second drive loss provided it is not the partner of the first drive failure. You also don't lose as much performance as the two stripes are both still online. With modern arrays a spare drive will automatically cover failed drives so don't forget that aspect of things.
I've been using Raid-5 in my laptop for some time (Yes, it's a rather big laptop that I use as my mobile desktop). It's fast, it's pretty secure and I wouldn't go back to JBOD quickly.
I remember the times when Oracle specialists insisted on *not* using Raid-5 on database servers. Times are changing.
Gabri van Lee on May 27, 2009 5:33 AM
I'd like to emphasize that RAID6 has a much bigger performance impact than RAID5: XOR for RAID5 can be done very easily, even in hardware, but RAID6 needs Reed-Solomon codes that are still very hard to implement nowadays.
Manuel on May 27, 2009 5:36 AM
You forgot to mention that a RAID 1 can have superior read performance (if the controller/driver do it right) since two different read requests can be executed at the same time (2 disks with identical data, after all). In that regard, RAID 1 can even be faster than RAID 0 for some special cases.
Moe on May 27, 2009 6:30 AM
> Backup are important but sometimes it can be
> impractical to backup 100's of GBs multiple times a day.
That's what Rsync is for. Unless you're generating 100s of GB of new data every day...
Anonymous on May 27, 2009 6:35 AM
Having a RAID on the desktop is a very good thing, provided you don't do it using one of those shitty built-in desktop motherboard chipsets. You need a decent RAID controller for it to be worthwhile. I've seen bad desktop RAID setups run at *less than half the speed* of a 5400rpm hard drive plugged into USB on the same PC. Also, for anybody who wants to jump on the SSD (SLC please!) bandwagon, having a decent RAID controller is the only way you're going to get awesome throughput *and* capacity.
Mark on May 27, 2009 6:52 AM
I see the DROBO advertised here and there and that looks interesting, though it definitely isn't cheap. I wonder if anyone here has any experience with them.
itsmatt on May 27, 2009 6:57 AM
Doesn't RAID1 theoretically also allow more disk seeks per second, and double the read data rate? But even without that, IMO using RAID1 is useful even on the desktop: when one drive breaks, you can still keep working until the replacement drive has arrived, and don't have the hassle of restoring from backup.
Btw. "many huge data centres each with hundreds of racks containing dozens of Sun Sunfires filled exclusively with IBM Deathstar 75GXP drives and plastic explosive": nice comparison :-D
You should look at 3PAR Luns.
Jeremy on May 27, 2009 7:04 AM
For the MTV generation in us all, maybe start with your point...
a new RAID product that you think is pretty cool.
The intro-to-raid primer seems a bit textbook for your post
as there appeared to be little of lessons learned until you mention RAID-Z and ZFS
"RAID's various designs all involve two key design goals: increased data reliability or increased input/output performance."
No - RAID 0 is literally not a redundant array. RAID = Redundant Array of Independent Disks. RAID 0 is an Array but with no redundancy and therefore no increased data reliability. In fact, RAID 0 is far less reliable because as soon as any 1 drive fails, you lose the entire array.
RAID0 - No hard drives lost - just "joined" in an Array.
RAID1 loses (normally) - 1 hard drive for every pair.
- No pair of drives can be lost.
RAID10 - Both RAID 1 and 0 - no pair can be lost or the entire array is lost.
RAID5 loses at MINIMUM - 1 hard drive for every array.
- Rebuilds take forever because every hard drive must be read in order to recover the information lost from 1 hard drive.
- no two hard drives can be lost or all information is lost.
- worst possible performance for writing
RAID5 loses at MAXIMUM - 1 hard drive for every three.
- In this situation it is possible to lose multiple hard drives at once without losing all information, as long as no two hard drives from a triplet are lost.
- Rebuilds are much faster.
Most other RAIDs are variants on this - although some are quite creative. I've always found hardware faster in side-by-side comparisons.
Philip on May 27, 2009 7:18 AM
mmm...
Thanks Jeff for the hardware porn. Nothing like a little hardware review to perk up a morning full of document authoring.
RickCabral on May 27, 2009 7:20 AM
Forget RAID on the desktop.
Go with RAIPCs (Redundant Array of Inexpensive PCs).
With all of this fawning over RAID technology, everyone forgets that everything else is a serious weak link: the motherboard, the power supply, the DRAM, the controller cards, the OS, etc. In all the years I've been working with PCs I've had exactly 2 hard drives die on me. Compare that to the dozens of power supplies and motherboards that have died on me in the same time period and you see where I'm coming from.
Take my advice for SOHOs: Forget all of the massively expensive RAIDs with the 48 drive house furnace. Instead buy 3 identical cheap PCs. Configure them like this:
#1. This is your on-site server.
#2. This is your on-site hot-backup. Set up everything to be exactly like #1 except the host name/IP address. Create an automatic data backup schedule (daily or even hourly).
#3. This is your off-site backup. Just like #2 except its automatic backup schedule will probably be nightly or perhaps every other day.
If #1 fails for any reason, switch to #2. (Simply change the host name/IP address and reboot.) Although you'll lose just the data that hadn't made it to #2 yet, there's a good chance it'll be non-critical data anyway.
If #2 or #3 fails for any reason, get it fixed/replaced ASAP.
John W on May 27, 2009 7:58 AM
I use a mirrored RAID on my desktop, mostly because several of the VMs are mission critical and are remotely logged into by several people. (Who knew a Unix command prompt could be useful to a programmer?)
The only difficulty is that after a few weeks to months when Windows inevitably locks up, I have to dirty boot the system (incidentally, the VMs still work flawlessly - go figure!) and the RAID rebuilding really slows things down for 4 or 5 hours.
p.s. My post script failed now that "orange" isn't the captcha key.
Jeff, your earlier piece on RAID-on-the-desktop that you linked to is pretty silly. It would be insane to use RAID 0 with any data you cared about; this is common knowledge of the highest order. You run RAID 1 on your desktop, for the same reason you run it in a server. My own desktop has two such arrays, a pair of WD Raptors for the system and apps, and a pair of 500GB disks for data. I run these on an Adaptec 31205 which is a true hardware RAID controller unlike the onboard crap endemic to newer motherboards.
Noah Yetter on May 27, 2009 8:05 AM
>Due to the license of ZFS they couldn't use the ZFS in linux.
>...
>Joepie on May 27, 2009 1:53 AM
Due to the license of the Linux kernel, they couldn't use ZFS in Linux.
Fixed that for you ;-)
ZFS is compatible with other Free and OpenSource operating systems. It is already available in NetBSD, FreeBSD, OSX 10.5+ and of course, OpenSolaris. On Linux it requires FUSE in order to get around some of the limitations of the GPL license which is used in the linux kernel.
bnitz on May 27, 2009 8:13 AM
I tend to agree with several people's assessment of ZFS. This is one cool technology. Adding in the ability to use SSDs as cache sweetens the pot. I've been running ZFS on a home server for over three years (RAID-1 and RAID-Z). The best promotion I can give is how it quietly protected my data even with occasional write errors caused by a flaky SATA driver (that was subsequently fixed in a later OpenSolaris build). The darn thing didn't even break a sweat.
Couple that with the native CIFS service that works seamlessly with ZFS. It is orders of magnitude faster and less resource consumptive than the Samba service it replaced.
As a comp-sci educated consultant-sysadmin type person I just want to say that articles like this, where hardened programmers learn about and share info on how computers really work, are a cause for great joy. :)
(The number of conversations we have to have with folk who work in Java, Ruby, Python, whatever, where our replies start with "well, it doesn't really work like that" and proceed to "what are you REALLY trying to do?" and end with "oh, that's easy"/"oh, that's provably impossible".. ;)
ZFS is nice, but will ultimately die. Scott McNealy is an arrogant fool and he wrecked his company in particular and IT in general with his Java bullshit. Java was intended to be a viable alternative to Windows, but Gates/Microsoft beat him like a rented mule. Write Once Run Anywhere turned into Write Once Debug Everywhere. Just as there used to be an endless string of not really compatible Unixes (Alpha/Ultrix/Irix/SunOS/AIX/...), the same paradigm came into software. McNealy took his stockholders down the rabbit hole (all the while selling all the SEC allows every quarter), and Oracle bought Sun for two reasons: 1) it was a fire sale 2) to save their "investment" in Java.
ZFS is already dead unless it goes to Linux/BSD/Windows.
McNealy, like all his predecessors, lost his own and his shareholders' asses by trying to do hardware and software. Only IBM wins that game. The list of those that died trying is long and distinguished: DEC, Prime, Apollo, Data General, Silicon Graphics, Intergraph. Must I go on? This is a business, not a party; the dot-com era is over. Get used to it.
There are only two operating systems Windows and BSD everything else is doomed to failure. Linux is a microkernel nothing more.
george on May 27, 2009 8:26 AM
Noah Yetter, you are wonderfully optimistic or oblivious to the endless September.
Newbies will always be in great supply and will constantly be asking "what drive should I choose to do RAID 0 for my PC?" some of the newbies like the word stripe better than RAID 0 but it is the same question and I see it month after month after month on the tech sites I frequent.
dhanson865 on May 27, 2009 8:33 AM
So Jeff's discovered ZFS and RAID. It seems like every time you poke your head out of the Windows development microcosm you find something cool that you like.
Keep reading and we'll see Jeff discover the joys of a decent shell and POSIXish userland, Solaris zones, ssh, vi ...
I'll be coming back in a year and SO will be rewritten in Perl running on a FreeBSD box under Postgres with Jeff expounding the benefits of bash job control and truss.
Keep going Jeff, you'll shed that abstracted MS "PC" developer world-view soon ;)
Sam on May 27, 2009 9:20 AM
We did tons of benchmarking/testing with various RAID levels and controllers across Progress and MSSQL database servers. RAID 5 for your data array saw a 30% decrease in performance on Progress boxes alone. RAID 10 for data, RAID 1 for logs, OS and TempDB was pretty optimal on an average sized system.
Dave on May 27, 2009 9:24 AM
"Raid array"
Tee hee!
"Forget RAID on the desktop.
Go with RAIPCs (Redundant Array of Inexpensive PCs)."
Say that out loud and tell me which you'd rather go with: RAID or RAIPCs.
Wait. On second thought, don't tell me. :)
Randolpho on May 27, 2009 9:33 AM
WTF are you all talking about? Get a decent SATA/SAS controller (LSI MegaRaid 8888ELP or 8880EM2, AKA Dell PERC6) and there's NO write penalty, and read numbers are also scary on wide RAID5 arrays.
DMB on May 27, 2009 10:00 AM
What would happen to RAID when disks go to solid state flash mem?
I don't see how this is coding related. Shouldn't it belong on serverhorror.com?
*post closed*
joe on May 27, 2009 10:28 AM
@George
Ummmm ZFS is already available in several distros of BSD...
As bnitz posts a couple before you,
"Due to the license of the Linux kernel, they couldn't use ZFS in Linux."
"ZFS is compatible with other Free and OpenSource operating systems. It is already available in NetBSD, FreeBSD, OSX 10.5+ and of course, OpenSolaris. On Linux it requires FUSE in order to get around some of the limitations of the GPL license which is used in the linux kernel."
I think ZFS will be around longer than you think :)
Dave on May 27, 2009 11:03 AMIn High School I sprayed some Raid on parsley and sold it to someone who smoked it.
Sammy on May 27, 2009 11:25 AMJeff,
Correction: RAID is Redundant Array of Independent Disks
Sami on May 27, 2009 11:34 AM@Phillip: "Congratulations on another pointless post." - Congrats on a pointless reply. Posts like this are equivalent to my co-worker coming up to me and saying "Hey! Have you heard of ZFS!? It's pretty cool!" which I very much like.
Steve-O on May 27, 2009 11:41 AM
ZFS is compatible with some other Free and OpenSource operating systems. It is already available in NetBSD, FreeBSD, OSX 10.5+ and of course, OpenSolaris, but it is intentionally incompatible with any using the GPL.
ZFS uses the CDDL licence, which was deliberately designed to be incompatible with the GPL and written to licence ZFS... There, fixed that for you.
Tim: "What are your opinions on ecc memory?"
I'm not sure it's worth the extra cost involved. If the gap were half as much I would seriously consider an upgrade to my current controller. I think the cost of ECC is really too high, and I'm not even sure it's worth it in the first place because I still don't understand what ECC memory is supposed to protect me from.
A Gamma-Ray striking that very same cell? Worth some extra cost. Still not worth the cost involved considering the probability this happens.
After seeing the Sun monster on SATA disks, I would like to ask another thing:
What are your opinions on SCSI (all versions included)?
It is my opinion that SCSI is definitely overrated, and I see the Sun monster as confirmation.
I'd like to read your opinions on that.
Let me make clear that storage is the last of my concerns.
MaxDZ8 on May 27, 2009 12:33 PMJeff -
RAID 0 is essential in scientific applications where the data rates are enormous (astronomy, seismology, etc.). There seems to be no other option given the input data rates and the write speed of the currently available disks.
I'm amazed/surprised/saddened spinning platters are still with us. I've been looking forward to 3-dimensional solid-state memory for 30+ years. Still waiting...
- Lepto
Lepto Spirosis on May 27, 2009 12:34 PMAnd if you're really paranoid, you assume that you will lose more than one drive at any time. From my experience, once you blow out one drive in a RAID-5 setup the extra load rebuilding may push another drive over the cliff.
A study was done regarding the Mean Time To Failure (MTTF) for real-world environments. A fascinating read:
http://www.usenix.org/events/fast07/tech/schroeder.html
What is "Parity"? Put simply, XOR.
Assume the minimum three-disk setup:
Disk1 = 44 - 00101100
Disk2 = 68 - 01000100
Disk3 holds the parity information, which is the XOR of Disk1 and Disk2:
Disk3 = Disk1 XOR Disk2 = 44 XOR 68
Disk3 = 104 - 01101000
Recovering Disk1 with only Disk2 and Disk3 is accomplished by XORing Disk2 and Disk3 together:
Disk1 = Disk2 XOR Disk3 = 68 XOR 104
Disk1 = 44
And doing it with more disks:
Disk1 = 44
Disk2 = 68
Disk3 = 35
Disk4 = 92
Disk5 = PARITY (Xor of all) = 23
Assume Disk3 dies
Xor the others:
Disk3 = 44 XOR 68 XOR 92 XOR 23
Disk3 = 35
Okay - there is a lot more to the implementation. But even so, such a simple concept works so well in practice.
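For anyone who wants to check the arithmetic above, here is a minimal Python sketch of the same idea. The function names `parity` and `recover` are made up for illustration, and each "disk" is a single integer rather than a real block of data, so this is a toy and not how any actual RAID controller is implemented.

```python
from functools import reduce
from operator import xor


def parity(blocks):
    """Return the XOR of all data blocks (the parity block)."""
    return reduce(xor, blocks)


def recover(surviving_blocks, parity_block):
    """Rebuild the one missing block by XORing the survivors with the parity."""
    return reduce(xor, surviving_blocks, parity_block)


# Three-disk example: two data disks plus one parity disk.
print(parity([44, 68]))        # parity of Disk1 and Disk2
print(recover([68], 104))      # rebuild Disk1 from Disk2 and the parity

# Five-disk example: four data disks plus one parity disk.
data = [44, 68, 35, 92]
p = parity(data)

# Pretend Disk3 (index 2) died and rebuild it from the rest.
lost_index = 2
survivors = data[:lost_index] + data[lost_index + 1:]
rebuilt = recover(survivors, p)
print(p, rebuilt)
```

The same `recover` call works no matter which single disk is lost, because XORing a value in twice cancels it out.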
Philip on May 27, 2009 12:55 PM
You forgot to mention that the ordering of a 'combined' RAID system is important, so RAID 10 is different from RAID 01. The difference is in the order the operations are applied: across 4 disks, RAID 10 mirrors the disks in pairs and then stripes across the mirrors, while RAID 01 stripes the disks in pairs and then mirrors the striped sets.
I'm in the process of having to help set up some systems like this as well though, and getting into the details of RAID when I've always ignored it in the past.
workmad3 on May 27, 2009 1:04 PM
Funny... we have a lot of xw8600 workstations with 15k RAID 5 setups, and you are so foolish as to think it is not needed on desktops? Come out of your hole more often.
DawnOfWar on May 27, 2009 1:36 PMGeorge said:
>> McNealy like all his predecessors lost their and their shareholders
>> asses by trying to do hardware and software. Only IBM wins that
>> game. The list of those that died trying is long and distinguished:
>> DEC, Prime, Apollo, Data General, Silicon Graphics, Intergraph.
>> Must I go on?
What about Apple?
Joseph Cooney on May 28, 2009 3:47 AMThat would come in handy when I am playing with old computer parts. Configure old drives from pentium 1's and 2's into a RAID. That would be fun some day for something to do. Man, some of those drives are pretty slow.
pinkturbokitty on May 28, 2009 5:06 AMJeff: You should take a read of Jim Gray's 1981 paper "The Transaction Concept: Virtues and Limitations" and in particular the chapter "UPDATE IN PLACE: A poison apple?".
The ideas behind ZFS, and WAFL before it, have been around for almost 30 years. This is nothing new.
http://research.microsoft.com/en-us/um/people/gray/papers/thetransactionconcept.pdf
Ausmith1 on May 28, 2009 7:09 AMHave you ever dug into what it takes to make these "dumb hunks of spinning rust?" Hard drive design and manufacturing is a real feat of engineering... control-systems, materials science, tribology, signal processing, low-level firmware... all manufactured in huge volumes and sold at tiny prices. Hard drives offer a variety of very tough challenges and satisfying careers for engineers. You should look into it - it's fascinating stuff.
I used RAID-0 when I was coming up with a low-cost streaming platform for Digital Cinema. Otherwise I wouldn't get the 250Mbs throughput from disk to the projector.
As for ZFS fading away... That would be a shame on a number of different levels, especially since it's a real breath of fresh air. Forget about all the times it already saved my data. Having the ability to throw new disks to increase my storage without dealing with partitions, mounts, logical volumes, etc. is a dream come true.
I believe that the licensing could be moved to GPLv2 and work had been done towards that goal at some point, but that won't help since Linus is hard-nosed against that too.
There is a point at which backup becomes unimportant with redundant disks in the broadest sense. Particularly so when you can eliminate all the single points of failure (such as having all your servers in one physical location.) For example, I doubt very much that Google backs up their database.
When would you consider offline back up to become redundant?
I can't believe the number of people here posting opinions after making the disclaimer, "I've never actually had any hands-on experience with RAID"... but I guess that this is a developer-oriented blog, so it should figure that a lot of you don't really care much about the hardware on which your apps run - you just expect the IT monkeys locked in the datacenter to make it work like magic.
Of course, you've walked into one of the biggest firestorms of IT engineering and datacenter design by posting your opinion, however green, on the matter. Better yet, you weighed in, inadvertently, on things like diskless backup and emerging commodity SATA RAID pool technology. (It may have been around in one form or another for a decade or more, but it is only now gaining momentum, and in my mind, remains largely unproven in the data-center, so, someone else can go first on adopting it - I'm sticking with traditional SANs for now).
Nice job.
PiddlyD on May 28, 2009 9:46 AM
@Philip - What are you talking about? You quoted a correct quote, said no, then reiterated many of the post's points.
JC on May 28, 2009 9:52 AMOne tip I read to decrease the chances of multiple drive failures is to buy hard drives from separate lots.
Who is your target audience? This stuff is basic review for anyone who went to college for anything related to hardware or software. Are there vast legions of Windows developers who are high-school educated and find this novel or informative? I'm really quite curious. Can anyone suggest a good programming blog by someone who would consider this type of article beneath their target audience, because I want to be reading that instead.
Randy on May 29, 2009 3:27 AMPeter,
Should mention that during a two-drive outage and recovery from same, RAID 6 performance will get terrible. Depending on the I/O load you need to service, you might not be able to run production, even if there is no data loss. Requiring two drives to fail to get to that state is a big improvement over RAID 5, of course.
As the disk system gets larger, same-mirror RAID 10 failures can be reduced by having a hot spare take over for the dead half of the mirror. This doesn't eliminate the problem, but it closes the window of vulnerability automatically and quickly.
Mike
Mike Hates Music on May 29, 2009 4:29 AM
I bought 3 hard drives for my home computer, but instead of setting them up in a RAID5 (which has pretty bad performance) I set the first two drives as a RAID0, with daily backups of important data to the third drive. Knocked about a third off my load times in games when I benchmarked it (that AnandTech article linked to was smoking crack - read http://www.hardwaresecrets.com/article/394/6 for example), and it also helps to reduce stuttering in games when it's paging in new data, which is one of the most important things in getting a fast game experience.
Built it in January 2005, and the drive benchmark for my RAID0 in Sandra is still faster than their uploaded results for SSDs.
Bill on May 30, 2009 6:57 AM"If you take four hard drives, stripe the two pairs, then mirror the two striped arrays -- why, you just created yourself a magical RAID 10 concoction!"
No, that's RAID 0+1. RAID 1+0 is a stripe of mirrors and RAID 0+1 is a mirror of stripes. Reread the wikipedia article you linked to.
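The practical difference shows up when a second drive dies. Here is a rough Python sketch, assuming exactly four disks numbered 0 to 3 and grouped as (0, 1) and (2, 3). It enumerates every possible two-disk failure and counts how many each layout survives. The function names are invented for illustration, and real controllers may be smarter about degraded stripes than this simple model.

```python
from itertools import combinations

# Assumed layout: four disks, grouped into two pairs.
GROUPS = [(0, 1), (2, 3)]


def raid10_survives(failed):
    """Stripe of mirrors: every mirror pair needs at least one live disk."""
    return all(any(d not in failed for d in pair) for pair in GROUPS)


def raid01_survives(failed):
    """Mirror of stripes: at least one striped set must be fully intact."""
    return any(all(d not in failed for d in stripe) for stripe in GROUPS)


two_disk_failures = [set(c) for c in combinations(range(4), 2)]
raid10_ok = sum(raid10_survives(f) for f in two_disk_failures)
raid01_ok = sum(raid01_survives(f) for f in two_disk_failures)

print(f"RAID 1+0 survives {raid10_ok} of {len(two_disk_failures)} two-disk failures")
print(f"RAID 0+1 survives {raid01_ok} of {len(two_disk_failures)} two-disk failures")
```

Both layouts survive any single failure; it is only on the second failure that the stripe of mirrors comes out ahead in this model.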
Brian on May 30, 2009 12:23 PMPeople that read these posts and then say something along the lines of...
>> Congratulations on another pointless post.
>> Phillip on May 26, 2009 9:51 PM
deserve to be chemically castrated for the good of all humanity. I mean, seriously, if there's something you don't enjoy reading... how about (/smacks self on head w/ microphone) NOT READING IT!!! I'd have to be a world-class idiot to keep coming back to something I feel needs criticizing that much. It's like buying a Playboy to make fun of how fat you think the models are. Seriously man, just get over yourself and find something else to read.
Aru on May 30, 2009 12:52 PM>> Some would say that RAID 10 is so good it completely obviates any need for RAID 5, and I for one agree with them.
Only problem here is that you are "losing" 50% of your disks rather than a single disk.
Sure, the performance is better, but 50% is a pretty big cost.
Taylor Marshall on May 31, 2009 11:21 AMFor RAID1 you have:
> Data is written across (n) drives, [...] at the cost of half your overall storage.
Shouldn't that be "at the cost of (n-1)/n of your overall storage"? You said that if any one drive survives, the data survives; so the data must be duplicated n times (once on each of n drives). Thus if you have 4 drives, the cost of redundancy is 3/4 of your drives.
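To put numbers on the storage cost, here is a small Python sketch of usable capacity per RAID level, following the descriptions in the post: RAID 5 costs one drive, RAID 6 costs two, and an n-way RAID 1 mirror keeps only one drive's worth. The function name `usable_fraction` is hypothetical, and the RAID 10 case assumes two-way mirrors; it ignores hot spares and filesystem overhead.

```python
from fractions import Fraction


def usable_fraction(level, n):
    """Fraction of raw capacity left for data with n equal-sized drives."""
    if level == "raid0":
        return Fraction(1)          # no redundancy at all
    if level == "raid1":
        return Fraction(1, n)       # every drive holds a full copy
    if level == "raid5":
        return Fraction(n - 1, n)   # one drive's worth of parity
    if level == "raid6":
        return Fraction(n - 2, n)   # two drives' worth of parity
    if level == "raid10":
        return Fraction(1, 2)       # assumes two-way mirrors
    raise ValueError(f"unknown RAID level: {level}")


for level in ("raid0", "raid1", "raid5", "raid6", "raid10"):
    usable = usable_fraction(level, 4)
    print(f"{level} with 4 drives: usable {usable}, cost {1 - usable}")
```

With two drives, RAID 1 costs half the storage, which matches the post; with four drives mirrored four ways, the cost is 3/4, as the comment above points out.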
Adam DiCarlo on May 31, 2009 11:50 AMI disagree that RAID isn't useful for desktop systems - but personally I prefer mirroring (and as an addition to, not replacement for, backups). Backups only run every 4 hours here, and I'd hate to lose 4 hours of work. And I'd hate having to wait for a replacement drive and rebuilding the system, mirroring also saves me from that.
Striping is pretty meh. Sure, you get wonderful linear performance, but it doesn't really help wrt random I/O... which is what you're doing most of the time, unless you have pretty specific needs. And it doesn't help with all game loadtimes. Heck, I even tried placing FarCry2 in a ramdisk, and it only gave a minimal performance boost compared to cold-loading from disk. Depends on what the game is doing during the loading phase, of course. But check Process Explorer I/O read stats while loading and you'll see that a lot of games don't even utilize single-drive bandwidth.
f0dder on June 1, 2009 8:34 AMJessica Boxer,
In reading your posting, I KNOW I have spoken with you before and would like to talk to you, privately. You have my e-mail addresses and Yahoo IM link. PLEASE get in touch!! I would greatly appreciate it! Thanks!
John Lanigan on June 4, 2009 3:02 AMIt's worth noting that the global credit crisis (which preceded the global financial crisis) was caused by the economic equivalent of RAID-Z.
In brief, investment firms believed that when individually risky elements are clustered together (in this case, home loans), the outcome is low risk. And they believed you could lower the risk to zero by scaling up the number of elements.
The crisis was caused when the demand for investing in home loans was so great that they started accepting riskier and riskier loans as long as the interest rate scaled with the risk.
So what they ended up with was the equivalent of many huge data centres, each with hundreds of racks containing dozens of Sun Sunfires filled exclusively with IBM Deathstar 75GXP drives and plastic explosive.
Simon Wright on February 6, 2010 11:13 PM>"Congratulations on another pointless post."
Stop congratulating yourself.
Anon on February 6, 2010 11:13 PM> You still need those remote off-site backups to protect against
> fire/flood/hackers/mad-axeman/police-confiscation/etc..
Very important. JournalSpace, AVSim, and Ma.gnolia all suffered complete loss of all data recently. In each case, they had online backups where both copies were destroyed. No offsite backups, no tape backups.
Ron Ruble on February 6, 2010 11:13 PMJeff: Thanks for writing a post that directly relates to what I do everyday as I write firmware for a SAN.
Wilson on February 6, 2010 11:13 PM@Jessica Boxer
If the data has lasting value (i.e., doesn't expire), then you should have an archived/read-only copy of it. Even if you all but guarantee yourself against hardware failures, you still need to protect yourself from human error or maliciousness. So in a sense, you will always need an 'offline backup'; it doesn't have to literally be 'offline', it just has to be un-modifiable. For example, you could keep all your data in your live ZFS file system and take periodic read-only snapshots.
However, if the data expires or needs to be updated on a regular basis anyway, then it probably doesn't matter. I'm guessing a lot of Google's data falls into that category.
The comments to this entry are closed.