## Saturday, May 03, 2014

### Comparing consumer drives in small-systems RAID

This stems from an email conversation with a friend: why would he be interested in using 2.5" drives in RAID, not 3.5"?

There are two key questions for Admins at this scale, and my friend was exceedingly sceptical of my suggestion:
• Cost/GB
• 'performance' of 2.5" 5400RPM drives vs 7200RPM drives.
I've used retail pricing for the comparisons. Pricelist and sorted pricelist.

Prices used

• 4TB WD Red NAS 3.5" drive: $235, at 5.87 cents/GB
• 4TB Seagate (desktop) 3.5" drive: $189, at 4.73 cents/GB
• 1TB Hitachi 5400RPM 2.5" drive: $80; 4x$80 = $320 for 4TB, at 8.00 cents/GB
• 1.5TB Hitachi 5400RPM 2.5" drive: $139; 3x$139 = $417 for 4.5TB [12.5% more], at 9.27 cents/GB
• 1TB Hitachi 7200RPM 2.5" drive: $93; 4x$93 = $372 for 4TB, at 9.30 cents/GB

I've not said 2.5" drives are cheapest now on raw $$/GB. I do expect that in the future they could be.

#### On drive count

The minimum number of drives needed in RAID-3/5 is 3 (2 data + 1 parity).

For 4TB drives, that's ~$700 (3x$235), giving you 8TB usable. [Or 3x$189 for ~$570]

For the same capacity with 1TB drives, you only need 1 extra drive over usable capacity, i.e. 9 drives = 9x$80 = $720. The cost/GB is about the same.

For 1.5TB 2.5" drives the cost is 13.5% more, but the capacity ratios aren't the same. If you go for 5 data + 1 parity, you get 7.5TB usable (6% less) for $834: appreciably more expensive than 1TB drives, and with lower IO/sec.

Including a spare, hot or cold: a spare takes the 3.5" array to 4x$235 = ~$940 (or $756) and the 2.5" array to 10x$80 = $800.

This more conservative approach would almost never be used at minimum scale. You'd use RAID-10 (RAID-0 stripes of two RAID-1 volumes). It gives better performance and no RAID-parity overhead. [worked example at end]

For 8 vs 32 drives, a more realistic comparison, you spend $1,880 ($1,512) vs $2,560. You'd probably want to run 2 parity drives if this was a business: (6 data = 24TB) vs (30 data = 30TB). Price/TB = $78.33 ($63) vs $85.33, or 9% higher, while usable capacity is 25% more for 2.5" drives [caveat below on 24 vs 32 drives in a box]. I'm ignoring extra costs for high-count SAS/SATA PCI cards [possibly $750].

You have to add in the price of power and replacement drives over a 5-year lifetime. 8 drives will consume ~800kWh/year, or 4,000kWh in a lifetime = $800 @ 20c/kWh. The 2.5" drives use half that much power: a $400 saving, but not a slam dunk (we're looking for ~$650 (or $1000) in savings).

Over 5 years you'd expect to lose 10%-12% of drives: 1x3.5" and 3x2.5" = $235 vs $240. Close, but still no clear winner.

#### On rebuild time after failure

• a 4TB drive will take 3-4 times longer to rebuild than a 1TB 2.5" drive.
[only lose one drive at a time]
• if the rebuild fails, you've lost 4TB at a time, not 1TB [or the whole volume. Errors are proportional to bits read.]
• this will matter in an 8-drive array in heavy use: performance is degraded hugely during a rebuild, and competition for access slows the rebuild further.

What I can't calculate is actual RAID rebuild times. These depend on the speed of the shared connections to the drives and on the normal workload on the array, where competition slows down rebuilds. Some RAID arrays are documented as taking more than a week (200hrs) to rebuild a single failed drive.

At the very best, a single 4TB drive will take 32,000 seconds (8+ hours) to read or write. A 5400RPM 1TB drive will take around one-third of that (3 hours).

If you're heavily dependent on your RAID, or if it runs saturated for significant periods, then reducing rebuild times is a massive win for a business: the hourly wages cost or company turnover must be $100+/hr even for small businesses. It doesn't take much of an ...

As the RAID rebuild time increases, the likelihood of a second drive failure or, more likely, an unrecoverable read error rises. Either results in a business disaster: the total loss of data on the RAID array, requiring recovery from backups, if they've been done.

#### On Performance

Yes, individually 5400RPM drives are slower than 7200RPM drives, but for RAID volumes it comes down to the total number of rotations available (seek time is now much faster than rotation), especially for random I/O loads like databases.
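The rotation argument can be put into numbers. Below is a minimal sketch, assuming (as this post does) that a random IO costs half a rotation on average and that seek time can be ignored; the function names are illustrative only.

```python
def random_iops(rpm):
    """Approximate random IO/sec for one drive: two IOs per revolution."""
    revs_per_sec = rpm / 60  # Hz
    return 2 * revs_per_sec


def array_iops(drive_count, rpm):
    """Raw random IO/sec for an array, ignoring any RAID-parity overhead."""
    return drive_count * random_iops(rpm)


print(random_iops(7200))    # 240.0
print(random_iops(5400))    # 180.0
print(array_iops(3, 7200))  # 720.0  (3 x 3.5" drives)
print(array_iops(9, 5400))  # 1620.0 (9 x 2.5" drives)
```

This is a rough model of raw spindle capacity only; real drives will differ with caching and queueing.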
Looking at IO/sec [twice the revs per second, since seeks are "half a rotation" on average]. Hz below, or Hertz, is "revolutions per second":

• 7200RPM = 120Hz = 240 IO/sec per drive = $1.00 per IO/sec for 4TB NAS 3.5" drives
• 5400RPM = 90Hz = 180 IO/sec per drive = $0.45 per IO/sec for 1TB 2.5" drives

In the 3x3.5" 8TB RAID-5, you've 3x240 IO/sec = 720 IO/sec read or total, vs 9x2.5" = 9x180 = 1620 IO/sec, 125% faster [2.25 times the throughput].

In the 24TB case, 8x3.5" drives = 1920 IO/sec vs 32x2.5" drives = 5760 IO/sec [3 times the throughput].

If you're running databases in a few VMs and connecting to a modest RAID set or a NAS, 32x2.5" drives will be ~20% more expensive per GB, but give you 25% more capacity and three times the DB performance.

What I've calculated is raw IO performance, not including RAID-parity overheads.

My understanding is that you only get 24x2.5" drives, not 32, in the same space as 8x3.5" drives. Manufacturers haven't really pushed the density of 2.5" drives yet.

Even if you have modest performance needs, 2.5" drives are a slam-dunk: they deliver twice the IO/sec per dollar and, in RAIDs of modest size, three times the throughput for 17% more in $/GB.

So, I agree with you: at the low end, 3.5" drives are still the drive of choice. We haven't reached the tipping point yet, but every year the gap is closing. I think that's the big thing I've come to understand.

Just like when we moved from 5.25" drives to 3.5" drives on servers: it didn't happen overnight, but one day when it came time to replace an old server, like-for-like drives were not even considered.

#### Addendum 1 - Calculating the random write (update) performance of RAID

This compares the _write_ performance of a RAID array with 24TB usable on 7200RPM drives in a RAID-5 config (6+1) vs 24+1 x 1TB 2.5" drives.

RAID-6 has a higher compute load for the second parity, but the same IO overhead per drive. Because 3 drives, not 2, are involved in every update, each update costs 50% more IO, so array throughput for random IO updates is about a third lower than RAID-5. Read performance is unaffected.
The problem with distributed parity is that every update incurs a read/write-back cycle on two drives (data + parity), and that forces a full-revolution delay. Instead of 1/2 revolution for a read or 1/2 revolution for a write, you have 1/2 revolution for the read plus 1 full revolution for the write-back: 3/2 revs per update on each drive. That is 3 half-revolution IOs on each of 2 drives, so each update consumes 6 IOs of array capacity. [9 for RAID-6, where 3 drives are involved]

If you're prepared to buffer whole stripes during streaming writes (large sequential writes), RAID-5 & RAID-6 performance can be close to the raw IO of the drives, because the read/write-back cycle can be avoided. This addendum is about a typical database load, random IO in 4kB blocks, not about streaming data.

• 7x3.5" drives can deliver 240 IO/sec per drive = 1680 IO/sec reads and 280 IO/sec RAID-5 writes. [vs 187 IO/sec for RAID-6]
• 25x2.5" drives can deliver 180 IO/sec per drive = 4500 IO/sec reads and 750 IO/sec RAID-5 writes.

RAID-1 (2 mirrored drives) delivers twice the IO/sec for reads, but only the IO/sec of a single drive for writes, as both drives have to write the same data to the same block. RAID-1 isn't limited to mirroring on a pair of drives; any number of mirrors can be used. The same rule applies: read IO/sec increases 'N' times, while write IO/sec remains static.

Which is why RAID-10 (stripes of mirrored drives) is popular. Additionally, the data on a drive can be read outside the RAID: blocks aren't mangled or added, leaving a usable image of the filesystem or database, if it fits fully on a drive.

For 24TB in RAID-10, you need 12 drives, an additional 4 drives over RAID-6 [~$850 (50%) more than ~$1,900]:

• read throughput is: 12x240 = 2880 IO/sec [just over half the 25x2.5" drive RAID-5 performance]
• write throughput is: 6x240 = 1440 IO/sec [twice the 25x2.5" drive RAID-5 performance]

Note: For 12x3.5" drives delivering 24TB in RAID-10, the cost is $2,868 ($2,268), with half the read IO but twice the write IO of the 2.5" RAID-5.
25x2.5" drives delivering 24TB as RAID-5 cost $2,000, or 30% (10%) less $$/GB, for a higher IO throughput in normal 80%:20% read/write loads. Power required for the 2.5" drives is around 25% of (not 25% less than) that of the 3.5" drives: 30W vs 120W. Over 5 years, that's $1100 for the 3.5" drives vs $300 for the 2.5" drives in electricity, another $800 saving. Cooling is around 20% of direct power cost and is not included.
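The random-IO figures in this addendum can be reproduced with a few lines. This is only a sketch of the arithmetic described above, assuming every drive contributes its raw IO/sec to reads, that a random update consumes 6 raw IOs in RAID-5 and 9 in RAID-6, and that RAID-10 writes are limited by the number of mirror pairs. The function and constant names are made up for illustration.

```python
RAID5_IOS_PER_UPDATE = 6  # 3 half-revolution IOs on each of 2 drives
RAID6_IOS_PER_UPDATE = 9  # 3 half-revolution IOs on each of 3 drives


def parity_raid_iops(drives, iops_per_drive, ios_per_update):
    """Return (random reads/sec, random writes/sec) for a parity RAID."""
    reads = drives * iops_per_drive
    writes = reads / ios_per_update
    return reads, writes


def raid10_iops(drives, iops_per_drive):
    """Return (random reads/sec, random writes/sec) for RAID-10."""
    reads = drives * iops_per_drive
    writes = (drives // 2) * iops_per_drive  # one write per mirror pair
    return reads, writes


cases = [
    ('7 x 3.5" RAID-5', parity_raid_iops(7, 240, RAID5_IOS_PER_UPDATE)),
    # Same 7 drives with the RAID-6 update cost, as in the text.
    ('7 x 3.5" RAID-6', parity_raid_iops(7, 240, RAID6_IOS_PER_UPDATE)),
    ('25 x 2.5" RAID-5', parity_raid_iops(25, 180, RAID5_IOS_PER_UPDATE)),
    ('12 x 3.5" RAID-10', raid10_iops(12, 240)),
]
for label, (reads, writes) in cases:
    print(f"{label}: {reads} reads/sec, {round(writes)} writes/sec")
```

Run as-is, this gives 1680/280, 1680/187, 4500/750 and 2880/1440 reads/writes per second, matching the figures quoted in the addendum.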

To do a full TCO, you need to factor in the cost of replacement drives, the labour cost of replacing them, any business impact (lost sales or extra working hours of users), the cost of the "footprint" [floor space used] as well as the power/cooling costs.
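The electricity part of that TCO is the easiest to estimate. A rough sketch, assuming the drives run 24x7 at constant power and the 20c/kWh tariff used earlier; `cooling_fraction` is an optional knob for the ~20% cooling overhead mentioned above, and all names here are illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def electricity_cost(watts, years=5, dollars_per_kwh=0.20, cooling_fraction=0.0):
    """Dollar cost of running a constant load for a number of years."""
    kwh = watts * HOURS_PER_YEAR * years / 1000
    return kwh * dollars_per_kwh * (1 + cooling_fraction)


print(round(electricity_cost(120)))  # 3.5" array at 120W
print(round(electricity_cost(30)))   # 2.5" array at 30W
```

With these inputs the results come out a little under the rounded $1100 and $300 quoted above, with a gap of roughly $800 between them.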

The larger sizes of 3.5" drives increase the chances of a RAID rebuild failing and causing a complete Data Loss event (Bit Error Rate is the same, regardless of capacity. 4 times larger drives = 4 times the chance of errors during rebuild).
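To see how the error chance scales with drive size, here is a hedged sketch. It assumes an unrecoverable read error rate of 1 per 10^14 bits read, a figure commonly quoted on consumer drive datasheets but not taken from this post, and it treats bit errors as independent. The function name is illustrative.

```python
import math


def p_read_error(bytes_read, bit_error_rate=1e-14):
    """Probability of at least one unrecoverable error while reading bytes_read.

    bit_error_rate is an ASSUMED per-bit failure probability, not a measured one.
    """
    bits = bytes_read * 8
    # 1 - (1 - ber)**bits, computed in a numerically stable way.
    return -math.expm1(bits * math.log1p(-bit_error_rate))


p_1tb = p_read_error(1e12)
p_4tb = p_read_error(4e12)
print(f"1TB: {p_1tb:.1%}  4TB: {p_4tb:.1%}  ratio: {p_4tb / p_1tb:.2f}")
```

Under that assumed rate the 4TB read is a bit under 4 times as likely to hit an error as the 1TB read. The "4 times" rule is exact only while the probabilities are small.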

RAID-6, with its performance penalty and 12% extra cost, is mandatory for a 24TB RAID to keep the chance of Data Loss during rebuilds acceptably low. The RAID-5 comparison is very generous. In Real Life, it'd be 8x3.5" RAID-6 vs 25x2.5" RAID-5, slanting more to 2.5" drives.

Note: We are past the break-even point for RAID-1/10 of 3.5” drives vs RAID-5 of 2.5” drives.

For 8TB (2x2 = 4x3.5" drives in RAID-10 vs 8+1 2.5" drives in RAID-5):
raw drive costs are: $956 and $720.
Power consumption is: 40W vs 11W.
Read IO: 4x240 = 960 IO/sec vs 9x180 = 1620 IO/sec [69% higher]
Write IO: 2x240 = 480 IO/sec vs (9x180)/6 = 270 IO/sec [44% lower]

For an 80%:20% read/write load:
3.5" = 0.8x960 + 0.2x480 = 864 IO/sec
2.5" = 0.8x1620 + 0.2x270 = 1350 IO/sec
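The blended figure is just a weighted average of read and write throughput. A minimal sketch, assuming an 80%:20% read/write mix, all nine 2.5" spindles serving reads (9x180 = 1620 IO/sec) and RAID-5 random writes at one sixth of that, as in Addendum 1; `blended_iops` is an illustrative name.

```python
def blended_iops(read_iops, write_iops, read_fraction=0.8):
    """Weighted average of read and write IO/sec for a mixed workload."""
    write_fraction = 1 - read_fraction
    return read_fraction * read_iops + write_fraction * write_iops


# 4 x 3.5" drives in RAID-10: 960 reads/sec, 480 writes/sec
print(round(blended_iops(960, 480)))   # 864
# 9 x 2.5" drives in RAID-5: 1620 reads/sec, 1620 / 6 = 270 writes/sec
print(round(blended_iops(1620, 270)))  # 1350
```

Changing `read_fraction` shows how quickly the RAID-5 advantage shrinks as the workload becomes more write-heavy.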

I've left out streaming IO performance because that implies data has to be streamed via the external interface.
If we're talking low-end servers, they'll have a 1Gbps LAN, at best 2x1Gbps, which 2 drives of any size can easily saturate.

If you're talking 10Gbps interfaces, even SFP+ @ $50/cable and $300-$500/PCI card (guess) + 5 times the price/port for switches, you've got a much more expensive installation and are probably looking to maximise throughput-per-dollar, not primarily capacity.

These are notes on RAID-3/4, which use a single dedicated parity drive rather than distributed parity as in RAID-5/6. Normally it's avoided because the parity drive becomes the limiting factor.

The really interesting thing with dedicated parity drives is performance: the data drives don't just seem to be RAID-0 (stripes), they are RAID-0.

You get the full raw performance of the drives delivered to you in many situations. The cost of Data Protection is the CPU and RAM for the RAID-parity and at least one drive of overhead to store it. I suggest below that two parity drives can be used for 100+ drives in a RAID-group. That is a small extra cost and protects against a worst-case rebuild.

It does have a performance slowdown on random updates, read, then write-back, but only for isolated blocks.
If you’re streaming data with large sequential writes, you don’t need to read/write-back data or parity, and the parity drive keeps up with the data drives.

You can do 3 interesting things with a dedicated parity drive for Random IO:
• use an SSD for parity. Its random IO/sec is 500-1,000 times faster than a single drive. Easy to do, highest performance, lowest overhead. SSDs are 'only' about 2-3 times the cost of HDDs, making this attractive for large RAID groups (100+), but not so for 3-8 drives.
• use a HDD for parity and a small SSD buffer to convert Random I/O into a sequential 'scan' from one side of the disk to the other. You need to buffer new & old data blocks as well as parity blocks. Sequential IO is 100-150 times faster than Random IO on current drives, and logging or journaling writes can leverage that difference.
• use two HDDs for parity and a buffer. Read current parity from one drive, recalculate parity and write new parity to the other drive. Then swap which drive is 'new' and 'old' (in a 'ping-pong' fashion) and repeat. It may take 20-30 minutes on a 2.5" drive per scan, but you can use an SSD to store changes or even log updates to a HDD. The number of IOs to store, even on a very large RAID (300-500 drives), isn't that large: 750GB for 4kB blocks at 100% write load and no overwrites. For more realistic loads/sizes (120 drives, 20% write, 0% overwrite), it's 75GB for 4kB blocks.

There's a performance benefit to this long-range buffering: what may look like a bunch of isolated updates can become "known data" over 30 minutes, meaning you avoid the read/write-back cycle.