My “fun” weekend nerd project: my iSCSI server is now dishing out respectable performance numbers and supports virtual space allocation (the physical location of stored data is decoupled from its “virtual” location, allowing for “hot moves” of data without bringing virtual disks offline). My next step is… PARITY. Yes… I’m going to build RAID-5-style parity into the system.
In a nutshell, RAID-5 works on a simple XOR principle to protect you against the loss of any one drive, as if you had mirrored your hard drive, but it doesn’t require you to duplicate ALL the data on ALL drives. How does it work?
Well, for the non-technical readers, if there are any, Xor (“exclusive or”) is a concept not really used in daily human life, where we typically only use “and” and “or”. If you consider 1 to mean “true” and 0 to mean “false”, then the following is fundamental to computing:
1 and 1 = 1
0 and 1 = 0
1 and 0 = 0
0 and 0 = 0
simple enough, right?
1 or 1 = 1
1 or 0 = 1
0 or 1 = 1
0 or 0 = 0
again, nothing terribly out of the ordinary.
Xor, however, is really only used in computing, where it is very valuable. It is true only when its two inputs differ:
1 xor 1 = 0
0 xor 1 = 1
1 xor 0 = 1
0 xor 0 = 0
Xor is at the core of what I’m about to attempt. An interesting characteristic of Xor is that it can help you find missing pieces of data. Maybe you could even apply it to real life.
Take 3 scraps of paper. Label one “A” and write a “0” on it. Label another “B” and write a “1”. Label the third “P” (for parity) and write the “exclusive or” value of A and B on it. 0 xor 1 = 1… so write a 1 on P.
Now throw away, burn, or ingest one of the three scraps of paper so that you’re left with only 2 scraps. You can throw away A, B, or P. But only throw away 1.
How do you determine the value of the missing scrap, assuming you don’t just remember it?
You can always find the answer by xor-ing the two remaining scraps of paper.
if B was missing: A xor P = 0 xor 1 = 1 … so B was 1
if A was missing: B xor P = 1 xor 1 = 0 … so A was 0
if P was missing: A xor B = 0 xor 1 = 1 … therefore P must have been 1.
The cool thing is that this works with more than just 3 scraps of paper. It will work with any number of scraps. But the one limitation is that it only protects you from losing exactly one scrap of paper. So if you try it with 50 scraps of paper… you’d better be able to recover 49 out of 50.
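A minimal Python sketch of the scrap-of-paper trick with more than 3 scraps, purely for illustration. It is not code from the iSCSI server, and xor_blocks is a made-up helper name. Each “scrap” here is a short run of bytes rather than a single bit, which is closer to what a disk array deals with.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# Three data "scraps" plus one parity scrap computed from them.
scraps = [b"\x00\x01", b"\x01\x01", b"\x7f\x10"]
parity = xor_blocks(scraps)

# Lose any single scrap, then rebuild it from the survivors plus parity.
lost_index = 1
survivors = [s for i, s in enumerate(scraps) if i != lost_index]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == scraps[lost_index])  # True
```

Changing lost_index to 0 or 2 gives the same result. Losing two scraps at once leaves nothing to rebuild from, which is the one-failure limit described above.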
RAID-5 disk arrays take this concept and apply it to how your data is stored, so that if one of your disk drives fails, there’s always enough parity information available to reconstruct the missing data without requiring you to specifically duplicate all the data on a backup disk.
You need at least 3 drives to do RAID-5, and your usable space will always be one disk’s worth less than the total you have. So in a 3-drive configuration, 33% of your space is wasted (which is better than the 50% waste you’d get if you mirrored the drives). If you have 4 drives… your waste is only 25%. 5 drives, 20%. 10 drives, 10%.
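Those percentages are just one disk’s worth out of however many disks you have. A tiny Python sketch, assuming identically sized drives (raid5_overhead is a name invented for this example):

```python
def raid5_overhead(num_disks):
    """Fraction of raw space given up to parity, assuming equal-size disks."""
    if num_disks < 3:
        raise ValueError("RAID-5 needs at least 3 disks")
    return 1 / num_disks

for n in (3, 4, 5, 10):
    print(f"{n} drives: {raid5_overhead(n):.0%} overhead")
# 3 drives: 33% overhead
# 4 drives: 25% overhead
# 5 drives: 20% overhead
# 10 drives: 10% overhead
```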
Additionally, since your drives are working together and each one stores only part of the data, you should, in theory, get higher throughput.
RAID-5 is nothing new. It has been around for decades. However, my weekend project intends to take it one step further. Typically when you create a RAID-5 partition, you use every drive you have, and they’re usually identical drives with identical sizes and identical performance characteristics. 5x 1TB drives will give you 4TB of usable space. But what if you have a motley collection of disks? Say you have a 120GB, a 500GB and a 1TB drive, and you want to maximize the redundancy and minimize the wasted space?
Well… in a typical RAID-5 setup… the biggest RAID disk you could create from this configuration is 240GB, consuming 120GB from each of the 3 disks and leaving the other 380GB and 880GB unused. I think this is crap. I can do better than that! Here’s how.
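To put a number on the problem, here is a small Python sketch of how a single traditional RAID-5 set treats mixed disk sizes. It assumes the usual rule that every member contributes only as much as the smallest disk; traditional_raid5 is a hypothetical helper, not part of any real RAID tool.

```python
def traditional_raid5(disk_sizes_gb):
    """Return (usable_gb, stranded_gb) for one RAID-5 set across all disks."""
    if len(disk_sizes_gb) < 3:
        raise ValueError("RAID-5 needs at least 3 disks")
    slice_gb = min(disk_sizes_gb)  # every disk contributes only this much
    usable = slice_gb * (len(disk_sizes_gb) - 1)  # one slice goes to parity
    stranded = sum(disk_sizes_gb) - slice_gb * len(disk_sizes_gb)
    return usable, stranded

print(traditional_raid5([120, 500, 1000]))  # (240, 1260)
print(traditional_raid5([1000] * 5))        # (4000, 0)
```

With the motley set, 1260GB of raw space sits outside the array entirely.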
In my iSCSI server… my data is dynamically/virtually allocated. It looks like a big disk to the operating system, but where the data is stored is completely hidden and abstracted.
Space is allocated as-needed from a “payload pool”. So even if I create a drive that is 4TB… I don’t actually need to provide a physical location for this data until I actually write 4TB out to the virtual disk.
Each time I need a new chunk of space, I allocate a new “big block”, which puts a new entry in the virtual address table. Under my current config, the Big Blocks are around 32MB in size.
When I need a new big block… I typically look at all the payloads, their sizes, and determine the best location where the new big block should reside. Each payload location has parameters that I can configure while the disk is hot and in use. I can, for example, set the quota for one of the locations to 0, which will cause the server to drain the physical data from that location and move it to other available locations without affecting running applications.
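A rough Python sketch of that allocate-on-first-write idea. Everything here is an assumption made for illustration: the class names, the “most free space wins” placement rule, and a quota measured in big blocks. It is not how the server is actually written, and it does not model draining existing data off a location.

```python
BIG_BLOCK_MB = 32  # the post says big blocks are around 32MB

class Payload:
    """One physical storage location in the pool (hypothetical model)."""
    def __init__(self, name, quota_blocks):
        self.name = name
        self.quota_blocks = quota_blocks  # max big blocks allowed here
        self.used_blocks = 0

    def free_blocks(self):
        return max(self.quota_blocks - self.used_blocks, 0)

class VirtualDisk:
    def __init__(self, payloads):
        self.payloads = payloads
        self.table = {}  # virtual big-block index -> payload name

    def locate(self, virtual_offset_mb):
        """Return the payload backing this offset, allocating on first use."""
        index = virtual_offset_mb // BIG_BLOCK_MB
        if index not in self.table:
            # Assumed policy: pick the location with the most free blocks.
            best = max(self.payloads, key=Payload.free_blocks)
            if best.free_blocks() == 0:
                raise RuntimeError("payload pool is full")
            best.used_blocks += 1
            self.table[index] = best.name
        return self.table[index]

pool = [Payload("small", quota_blocks=2), Payload("large", quota_blocks=10)]
disk = VirtualDisk(pool)
first = disk.locate(0)     # first touch allocates a big block on "large"
same = disk.locate(5)      # same 32MB big block, so no new allocation
pool[1].quota_blocks = 0   # "large" accepts no new big blocks from now on
other = disk.locate(64)    # a different big block, placed on "small"
```

The virtual disk can advertise any size it likes. Physical space is only claimed when an offset is touched for the first time, and changing a quota while the disk is live only affects where later big blocks go.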
To upgrade this system to support RAID-5, I simply need to come up with a system where, instead of choosing one location for all the data, I choose all eligible locations, then format just that specific 32MB big block using RAID-5-style redundancy, as opposed to formatting the ENTIRE disk that way… and wasting valuable space.
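One way this per-block protection could look, as a hedged Python sketch. The function names are invented, the parity piece always goes last (a real RAID-5 rotates which disk holds parity), and the padding scheme is an assumption. The data is cut into one piece per eligible location minus one, and the last piece is the Xor of the others.

```python
from functools import reduce

def xor_bytes(chunks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*chunks))

def protect_block(data, num_locations):
    """Split one big block into data pieces plus a final parity piece."""
    if num_locations < 2:
        raise ValueError("need at least 2 locations for any redundancy")
    data_count = num_locations - 1
    chunk_len = -(-len(data) // data_count)  # ceiling division
    padded = data.ljust(chunk_len * data_count, b"\x00")  # zero-pad the tail
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len]
              for i in range(data_count)]
    return chunks + [xor_bytes(chunks)]

def rebuild(pieces, missing):
    """Recover the piece at index `missing` from all the other pieces."""
    return xor_bytes([p for i, p in enumerate(pieces) if i != missing])

pieces = protect_block(bytes(range(10)), 3)  # 2 data pieces + 1 parity
print(rebuild(pieces, 0) == pieces[0])       # True
print(protect_block(b"abc", 2))              # [b'abc', b'abc']
```

With only 2 eligible locations the “parity” piece is an exact copy of the data, so the same routine falls back to plain mirroring.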
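Under my scenario, with 120GB, 500GB and 1000GB disks, allocating redundancy per big block works out like this:
240GB of space will have 33% overhead (RAID-5 across all 3 disks, 120GB from each)
380GB of space will have 50% overhead (mirrored across the remaining space on the 500GB and 1000GB disks)
500GB (what is left on the 1000GB disk) will not be fault-tolerant, but can be used if necessary.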
My total capacity will be 1120GB… more than 4.5 times the 240GB of traditional RAID-5.
This is a bit unfair, however, because if I lose the non-redundant part of the data, the whole partition would likely collapse. So a fairer number, including ONLY the fault-tolerant portion of the disk, would be 620GB (240GB + 380GB). Still quite an improvement, at roughly 2.6 times.
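The arithmetic above can be reproduced with a short Python sketch that peels off one redundancy tier at a time, widest first. It assumes capacity is the only constraint (no metadata, no big-block rounding), and tiered_layout is a hypothetical name rather than anything in the server.

```python
def tiered_layout(disk_sizes_gb):
    """Return a list of (disks_in_tier, usable_gb, is_redundant) tuples."""
    remaining = sorted(disk_sizes_gb)
    tiers = []
    while remaining:
        width = len(remaining)
        depth = remaining[0]  # the smallest disk left limits this tier
        if depth > 0:
            # With 2+ disks, one disk's share of the tier goes to parity
            # (for exactly 2 disks that is the same as mirroring).
            usable = depth * (width - 1) if width > 1 else depth
            tiers.append((width, usable, width > 1))
        # The smallest disk is now full; shrink the others by what was used.
        remaining = [size - depth for size in remaining[1:]]
    return tiers

tiers = tiered_layout([120, 500, 1000])
print(tiers)  # [(3, 240, True), (2, 380, True), (1, 500, False)]
print(sum(usable for _, usable, _ in tiers))            # 1120
print(sum(usable for _, usable, ok in tiers if ok))     # 620
```

Each tuple reads as “this many disks, this much usable space, redundant or not”, matching the three lines of the breakdown.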
Let’s run these numbers against my ACTUAL physical disk configuration. My server contains a motley collection of disks.
Under this scenario, the biggest traditional RAID-5 partition I could create would be 6TB, using 1TB from each of 7 disks, with about 14% (1/7) of that space going to parity overhead.
Doing this dynamically, however….
I could allocate 8.12TB fully redundant, plus have an extra 1TB of non-redundant space.
This would include:
840GB spanned across 8 disks
5280GB spanned across 7 disks
1500GB spanned across 4 disks
500GB mirrored on 2 disks
1000GB slack space, leftover, not redundant
I’ll let you know how it goes!