Sunday, November 20, 2011

Building a RAID disk array circa 1988

In "A Case for Redundant Arrays of Inexpensive Disks (RAID)" [1988], Patterson et al of University of California Berkeley started a revolution in Disk Storage still going today. Within 3 years, IBM had released the last of their monolithic disk drives, the 3390 Model K, with the line being discontinued and replaced with IBM's own Disk Array.

The 1988 paper has a number of tables where it compares Cost/Capacity, Cost/Performance and Reliability/Performance of IBM Large Drives, large SCSI drives and 3½in SCSI drives.

The prices ($$/MB) cited for the IBM 3380 drives are hard to reconcile with published prices:
 press releases in Computerworld  and IBM Archives for 3380 disks (7.5Gb, 14" platter, 6.5kW) and their controllers suggest $63+/Mb for 'SLED' (Single Large Expensive Disk) rather than the
"$18-10" cited in the Patterson paper.

The prices for the 600MB Fujitsu M2316A ("super eagle") [$20-$17] and 100Mb Conner Peripherals CP-3100 [$10-$7] are in-line with historical prices found on the web.

The last table in the 1988 paper lists projected prices for different proposed RAID configurations:
  • $11-$8 for 100 * CP-3100 [10,000MB] and
  • $11-$8 for 10 * CP-3100 [1,000MB]
There are no design details given.

1994, Chen et al in "RAID: High-Performance,Reliable Secondary Storage" use two widely sold commercial system as case studies:
The (low-end) NCR device was more what we'd call a 'hardware RAID controller' now, ranging from 5 to 25 disks. Pricing $22-102,000. It provided a SCSI interface and didn't buffer. A system diagram was included in the paper.

The StorageTek's Iceberg was high-end device meant for connection to IBM mainframes. Advertised as starting at 100GB (32 drives) for $1.3M, up to 400Gb for $3.6M, It provided multiple (4-16) IBM ESCON 'channels'.

For the NCR, from InfoWorld 1 Oct 1990, p 19 in Google Books
  • min config: 5 * 3½in drives, 420MB each.
  • $22,000 for 1.05Gb storage
  • Add 20*420Mb to 8.4Gb list $102,000. March 1991.
  • $4,000/drive + $2,000 controller.
  • NCR-designed controller chip + SCSI chip
  • 4 RAID implementations: RAID 0,1,3,5.
The StorTek Iceberg was released in late 1992 with projected shipments of 1,000 units in 1993.  It was aimed at replacing IBM 'DASD' (Direct Access Storage Device): exactly the comparison made in the 1988 RAID paper.
The IBM-compatible DASD, which resulted from an investment of $145 million and is technically styled the 9200 disk array subsystem, is priced at $1.3 million for a minimum configuration with 64MB of cache and 100GB of storage capacity provided by 32 Hewlett-Packard 5.25-inch drives.

A maximum configuration, with 512MB of cache and 400GB of storage capacity from 128 disks, will run more than $3.6 million. Those capacity figures include data compression and compaction, which can as much as triple the storage level beyond the actual physical capacity of the subsystem.
Elsewhere in the article more 'flexible pricing' (20-25% discount) is suggested:
with most of the units falling into the 100- to 200GB capacity range, averaging slightly in excess of $1 million apiece.
Whilst no technical reference is easily accessible on-line, more technical details are mentioned in the press release on the 1994 upgrade, the 9220. Chen et al [1994] claim "100,000 lines of code" were written.

More clues come from an feature, "Make Room for DASD" by Kathleen Melymuka  (p62) of CIO magazine, 1st June 1992 [accessed via Google Books, no direct link]:
  • 5¼in Hewlett-Packard drives were used. [model number & size not stated]
  • The "100Gb" may include compaction and compression. [300% claimed later]
  • (32 drives) "arranged in dual redundancy array of 16 disks each (15+1 spare)
  • RAID-6 ?
  • "from the cache, 14 pathways transfer data to and from the disk arrays, and each path can sustain a 5Mbps transfer rate"
The Chen et al paper (pg 175 of CACM,, Vol 26, No 2) gives this information on the Iceberg/9200:
  • it "implements an extended RAID level-5 and level-6 disk array"
    • 16 disks per 'array', 13 usable, 2 Parity (P+Q), 1 hot spare
    •  "data, parity and Reed-Solomon coding are striped across the 15 active drives of an array"
  • Maximum of 2 Penguin 'controllers' per unit.
  • Each controller is an 8-way processor, handling up to 4 'arrays' each, or 150Gb (raw).
    • Implying 2.3-2.5Gb per drive
      • The C3010, seemingly the largest HP disk in 1992, was 2.47Gb unformatted and 2Gb formatted (512by sectors), [notionally 595by unformatted sectors]
      • The C3010 specs included:
        • MTBF: 300,000 hrs
        • Unrecoverable Error Rate (UER): 1 in 10^14 bits transferred
        • 11.5 msec avg seek, (5.5msec rotational latency, 5400RPM)
        • 256Kb cache, 1:1 sector interleave, 1,7 RLL encoding, Reed-Solomon ECC.
        • max 43W 'fast-wide' option, 36W running.
    • runs up to 8 'channel programs' and independently transfer on 4 channels (to mainframe).
    • manages a 64-512Mb battery-backed cache (shared or per controller not stated)
    • implements on-the-fly compression, cites maximum doubling capacity.
      • and dynamic mapping necessary CKD (count, key, data) for variable-sized IBM blocks onto the fixed blocks internally.
      • a extra (local?) 8Mb of non-volatile memory is used to store these tables/maps.
    • Uses a "Log-Structured File System" so blocks are not written back to the same place on the disk.
    • Not stated if the SCSI buses are one-per-arry or 'orthogonal'. i.e. Redundancy groups are made up from one disk per 'array'.
Elsewhere, Katz, one of the authors, uses a diagram of a generic RAID system not subject to any "Single Point of Failure":
  • with dual-controllers and dual channel interfaces.
    • Controllers cross-connected to each interface.
  • dual-ported disks connected to both controllers.
    • This halves the number of unique drives in a system, or doubles the number of SCSI buses/HBA's, but copes with the loss of a controller.
  • Implying any battery-backed cache (not in diagram) would need to be shared between controllers.
From this, a reasonable guess at aspects of the design is:
  • HP C3010 drives were used, 2Gb formatted. [Unable to find list prices on-line]
    • These drives were SCSI-2 (up to 16 devices per bus)
    • available as single-ended (5MB/sec) or 'fast' differential (10MB/sec) or 'fast-wide' (16-bit, 20MB/sec). At least 'fast' differential, probably 'fast-wide'.
  • "14 pathways" could mean 14 SCSI buses, one per line of disks, but it doesn't match with the claimed 16 disks per array.
    • 16 SCSI buses with 16 HBA's per controller matches the design.
    • Allows the claimed 4 arrays of 16 drives per controller (64) and 128 max.
    • SCSI-2 'fast-wide' allows 16 devices total on a bus, including host initiators. This implies that either more than 16 SCSI
  • 5Mbps transfer rate probably means synchronous SCSI-1 rates of 5MB/sec or asynchronous SCSI-2 'fast-wide'.
    • It cannot mean the 33.5-42Mbps burst rate of the C3010.
    • The C3010 achieved transfer rates of 2.5MB/sec asynchronously in 'fast' mode, or 5MB/sec in 'fast-wide' mode.
    • Only the 'fast-wide' SCSI-2 option supported dual-porting.
    • The C3010 technical reference states that both powered-on and powered-off disks could be added/removed to/from a SCSI-2 bus without causing a 'glitch'. Hot swapping (failed) drives should've been possible.
  • RAID-5/6 groups of 15 with 2 parity/check disk overhead, 26Gb usable per array, max 208Gb.
    •  RAID redundancy groups are implied to be per (16-disk) 'array' plus one hot-spare .
    • But 'orthogonal' wiring of redundancy groups was probably used, so how many SCSI buses were needed per controller, in both 1 and 2-Controller configurations?
    • No two drives in a redundancy group should be connected via the same SCSI HBA, SCSI bus, power-group or cooling-group.
      This allows live hardware maintenance or single failures.
    • How were the SCSI buses organised?
      With only 14 devices total per SCSI-2 bus, a max of 7 disks per shared controller was possible.
      The only possibly configurations that allow in-place upgrades are: 4 or 6 drives per bus.
      The 4-drives/bus resolves to "each drive in an array on a separate bus".
    • For manufacturing reasons, components need standard configurations.
      It's reasonable to assume that all disk arrays would be wired identically, internally and with common mass terminations on either side, even to the extent of different connectors (male/female) per side.
      This allows simple assembly and expansion, and trivially correct installation of SCSI terminators on a 1-Controller system.
      Only separate-bus-per-drive-in-array (max 4-drives/bus), meets these constraints.
      SCSI required a 'terminator' at each end of the bus. Typically one end was the host initiator. For dual-host buses, one at each host HBA works.
    • Max 4-drives per bus results in 16 SCSI buses per Controller (64-disks per side).
      'fast-wide' SCSI-2 must have been used to support dual-porting.
      The 16 SCSI buses, one per slot in the disk arrays, would've continued across all arrays in a fully populated system.
      In a minimum system, 32 drives, would've been only 2 disks per SCSI bus.
  • 1 or 2 controllers with a shared 64M-512M cache and 8Mb for dynamic mapping.
This would be a high-performance and highly reliable design with a believable $1-2M price for 64 drives (200Gb notional, 150Gb raw):
  • 1 Controllers
  • 128Mb RAM
  • 8 ESCON channels
  • 16 SCSI controllers
  • 64 * 2Gb drives as 4*16 arrays, 60 drives active, 52 drive-equivalents after RAID-6 parity.
  • cabinets, packaging, fans and power-supplies
From the two price-points, can we tease out a little more of the costs [no allowance for ESCON channel cards]:
  • 1 Controller + 32 disks + 64M cache = $1.3M
  • 2 Controllers + 128 disks + 512M cache = $3.6M
As a first approximation, assume that 512M RAM costs half as much as 2 Controllers for a 'balanced' system. Giving us a solvable set of simultaneous equations:
  • 1.0625 Controllers + 32 disks  = $1.3M
  • 2.5 Controllers + 128 disks = $3.6M
  • $900,000 / Controller [probably $50,000 high]
  • $70,000 / 64M cache [probably $50,000 low]
  • $330,000 / 32 disks ($10k/drive, or $5/MB)
High-end multi-processor VAX system pricing at the time is in-line with this $900k estimate, but more likely an OEM'd RISC processor (MIPS or SPARC?) was used.
This was a specialist, low-volume device: expected 1st year sales volume was ~1,000.
In 1994, they'd reported sales of 775 units when the upgrade (9220) was released.

Another contemporary computer press article cites the StorageTek Array costing $10/Mb compared to $15/MB for IBM DASD. 100Gb @ $10/Mb  is $1M, so congruent with the claims above.

How do the real-world products in 1992 compare to the 1988 RAID estimates of Patterson et al?
  • StorageTek Iceberg: $10/Mb vs $11-$8 projected.
    • This was achieved using 2Gb 5¼in drives not the 100Mb 3½in drives modelled
    • HP sold a 1Gb 3½in SCSI-2 drive (C2247) in 1992. This may have formed the basis of the upgrade 9220 ~two years later.
  • Using the actual, not notional, supplied capacity (243Gb) the Iceberg cost $15/Mb.
  • The $15/Mb for IBM DASD compares well to the $18-$10 cited in 1988.
    • But IBM, in those intervening 5 years, had halved the per-Mb price of their drives once or twice. The 1988 "list price" from the archives of ~$60/Mb are reasonable.
  • In late 1992, 122Mb Conner CP-30104 were advertised for $400, or $3.25/Mb.
    These were IDE drives, though a 120Mb version of the SCSI CP-3100 was sold, price unknown.
The 8.4Gb 25-drive NCR  6298 gave $12.15/Mb, again close to the target zone.
From the Dahlin list, 'street prices' for 420Mb drives at the time, were $1600 for Seagate ST-1480A and $1300 for 425Mb Quantum or $3.05-$3.75/Mb.

The price can't be directly compared to either IBM DASD or StorageTek's Iceberg, because the NCR 6298 only provided a SCSI interface, not an IBM 'channel' interface.

The raw storage costs of the StorageTek Iceberg and NCR are roughly 2.5:1.
Not unexpected due to the extra complexity, size and functionality of the Iceberg.

    No comments: