Wednesday, February 01, 2012

Z-gen hard disks: shingled writes

We are approaching the areal-density limits of magnetic hard disk drives, probably before 2020, at around 4-8TB per 2.5 inch platter.

One of the key new technologies proposed is "shingled writes", where new tracks partially overwrite an adjacent, previously written track, making the effective written track-width much smaller than the write head itself. Across the disk, multiple inter-track blank (guard) areas are needed to allow the shingling to start and finish, creating "write regions" (super-tracks?) as the smallest recordable area, instead of single tracks. The cost of discarding the guard distance between tracks is higher cross-talk, requiring more aggressive low-level Error Correction and Detection schemes.

In the worst case, a single-sector update, the drive has to read the whole write-region into local memory, update the sector and then rewrite the whole write-region. These multi-track writes, at one disk revolution per track, are not only slow, making the drive unavailable for the duration, but require additional internal resources, including enough memory to hold a whole region. Current drive buffers would limit region sizes to a few tens of MB, which may not yield a worthwhile capacity improvement.
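
A minimal sketch of that read-modify-write cycle, in Python, with a hypothetical drive object whose read_region()/write_region() calls stand in for the multi-revolution internal operations:

    SECTOR = 4096   # assumed sector size in bytes

    def update_sector(drive, region_no, sector_no, data):
        # Worst case: the whole shingled region is read into drive RAM,
        # patched in place, then every track of it is rewritten.
        region = bytearray(drive.read_region(region_no))   # one revolution per track
        offset = sector_no * SECTOR
        region[offset:offset + SECTOR] = data               # patch the single sector
        drive.write_region(region_no, bytes(region))        # one revolution per track again
        # The drive must buffer the entire region and is unavailable throughout.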

The shingled-write technique has severe limitations for random "update-in-place" usage:
  • write-regions either have to be very small, with many inter-region gaps/guard areas, considerably reducing the areal recording density and negating its benefits, or
  • have relatively few very large write-regions to achieve 90+% theoretical maximum areal recording density at the expense of update times in the order of 10-100 seconds and significant on-drive memory. This substantially increases cost if SRAM is used and complexity if DRAM is used.
Clearly, shingled writes are not an optimal solution for drives used for random writes and update-in-place; they are perhaps the worst solution for this sort of workload.

A tempting solution is to adopt a "log structured" approach, such as that used by the FTL (Flash Translation Layer) in Flash Memory SSDs, and map logical sectors to physical locations:
write sector updates to a log, never update in place, and securely maintain a logical-to-physical sector map.
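
A minimal sketch of that log-structured approach, assuming an in-memory dict as the logical-to-physical map and a simple append pointer; real firmware would persist the map and handle wrap-around and compaction:

    class LogStructuredLayer:
        # Toy sector layer: never update in place, always append at the log tail.
        def __init__(self, media):
            self.media = media   # stand-in for the raw shingled write-regions
            self.tail = 0        # next free physical sector
            self.l2p = {}        # logical sector number -> physical sector number

        def write(self, logical, data):
            self.media[self.tail] = data    # append the new version
            self.l2p[logical] = self.tail   # remap; any old copy becomes a "dead" sector
            self.tail += 1

        def read(self, logical):
            return self.media[self.l2p[logical]]

Rewriting the same logical sector repeatedly leaves a trail of "dead" physical sectors behind it, which is exactly what forces the compaction described below.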

Contiguous logical sectors are initially written physically adjacent, but over time, as sectors are updated multiple times, they become spread widely across the disk. This radically slows streaming read rates, leaves many "dead" sectors that reduce the effective capacity, and requires active consolidation, or "compaction".

The drive controllers still have to perform logical-to-physical sector mapping, optimally order reads and reassemble the contiguous logical stream.

Methods to ameliorate disruption of spatial proximity must trade space for speed:
  • either allow low-density (non-shingled) sets of tracks in the inter-region gaps specifically for sector updates, or
  • leave an update area at the end of every write-region.
    Larger update areas lower effective capacity/areal density, whilst smaller areas are saturated more quickly.
Both these update-expansion area approaches have a "capacity wall", or an inherent hard-limit:
 what to do when the update-area is exhausted?

Pre-emptive strategies, such as predictive compaction, must be scheduled for low activity times to minimise performance impact, requiring the drive to have significant resources, a good time and date source and to second-guess its load-generators. Embedding additional complex software in millions of disk drives that need to achieve better than "six nines" reliability creates an administration, security and data-loss liability nightmare for consumers and vendors alike.

The potential for "unanticipated interactions" is high. The most probable are defeating Operating System block-driver attempts to optimally organise disk I/O requests and duplicating housekeeping functions like compaction and block relocation, resulting in write avalanches and infinite cascading updates triggered when drives near full capacity.

In RAID applications, the variable and unpredictable I/O response time would trigger many false parity recovery operations, offsetting the higher capacity gained with significant performance penalties.

In summary, trying to hide the recording structure from the Operating System, with its global view and deep resources, will be counter-productive for update-in-place use.

Shingled-write drives are not suitable for high-intensity update-in-place uses such as databases.



There are workloads that are a very good match to large-region update whole-disk structures:
  • write-once or very low change-rate data, such as video/audio files or Operating System libraries etc,
  • log files, when preallocated and written in append-only mode,
  • distributed/shared permanent data, such as Google's compressed web pages and index files,
  • read-only snapshots,
  • backups,
  • archives, and
  • hybrid systems designed for the structure, using techniques like Overlay mounts, with updates written to media better suited to volatile data, such as speed-optimised disks or Flash Memory.
Shingled-write drives with non-updateable, large write-regions are a perfect match for an increasingly important HDD application area: "Seek and Stream".

There are already multiple classes of disk drives:
  • cost-optimised drives,
  • robust drives for mobile applications,
  • capacity-optimised Enterprise drives,
  • speed-optimised Enterprise drives, and
  • "green" or power-minimised variants of each class.
Pure shingled-write drives could be considered a new class of drive:
  • capacity-optimised, write whole-region, never update. Not unlike CD-RW or DVD-RAM.
With Bit-Patterned-Media (BPM), another key technology needed for Z-gen drives, a further refinement is possible for write whole-region drives:
  • continuous spiral tracks per write-region, as used by Optical drives.
Lastly, an on-drive write-buffer of Flash Memory, sized for 1 or 2 write-regions, would, I suspect, improve drive performance significantly and allow additional optimisations or Forward Error Correction in the recording electronics/algorithms.

For a Z-gen drive with 4TB/platter and 4Gbps raw bit-rates, 2-8GB write-regions may be close to optimal. Around 1,000 regions per drive would also fit nicely with CLV (Constant Linear Velocity) and power-reducing slow-spin techniques.
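
A quick back-of-envelope check of those figures (the 4TB/platter and 4Gbps values are taken from the text; the region sizes are the suggested range):

    PLATTER_BITS = 4 * 10**12 * 8   # 4 TB per platter, in bits
    RAW_RATE     = 4 * 10**9        # 4 Gbps raw bit-rate

    for region_gb in (2, 4, 8):
        region_bits = region_gb * 8 * 10**9
        regions     = PLATTER_BITS / region_bits   # 2000, 1000, 500 regions per platter
        stream_secs = region_bits / RAW_RATE       # 4, 8, 16 seconds to stream one region
        print(f"{region_gb} GB regions: {regions:.0f} per platter, {stream_secs:.0f} s each")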

A refinement would be to allow variable size regions to precisely match the size of data written, in much the same way that 1/2 inch tape drives wrote variable sized blocks. This technique allows the Operating System to avoid wasted space or complex aggregation needed to match file and disk recording-unit sizes. This is not quite the "Count Key Data" organisation of old mainframe drives (described by Patterson et al in 1988 as "Single Large Expensive Drives").

Like Optical disks, particularly the CD-ROM "Mode 1" layer, additional Forward Error Correction can be cheaply built into the region data to protect against burst-errors and achieve unrecoverable bit-error rates better than 1 in 10^60, both on-disk and for off-disk transfers.

For 100-year archival data to be stored on disks, it has to be moved and recreated every 5-7 years, forcing errors to be crystallised each time. Petabyte RAID'd collections using drives with a 1 in 10^16 bit-error rate only achieve a 99.5% probability of successful rebuild with RAID-6. Data migrations are, in effect, RAID rebuilds. Twenty consecutive rebuilds have a 10% probability of complete data loss, an unacceptably high rate. Duplicating the systems reduces this to a 1% probability of complete data loss, but at a 100% overhead. The modern Error Correction techniques suggested here require modest (10-20%) overheads and would improve data protection to less than 0.1% data loss, though not obviating the problems of failing hardware.
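
Those loss figures follow directly from compounding the per-rebuild success rate; a quick check under the stated assumptions:

    p_rebuild  = 0.995                       # per-migration (RAID-6 rebuild) success probability
    migrations = 20                          # roughly one migration every 5 years over a century
    p_loss     = 1 - p_rebuild**migrations   # ~0.095: roughly a 10% chance of complete loss
    p_loss_dup = p_loss**2                   # two independent copies: ~0.009, about 1%
    print(round(p_loss, 3), round(p_loss_dup, 3))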

In a world of automatic data de-duplication and on-disk compression, data protection/preservation becomes a very high priority: storage efficiency comes at the cost of creating a "single point of failure". This adds another impetus to add good Error Correction to write-regions.

A 4GB region, written at 4Gbps, would stream in 8-10 seconds. Buffering an unwritten region in local Flash Memory would allow fast access to the data both before and during the commit-to-disk operation, given sufficient excess Flash read bandwidth.
Note that this 4-8Gbps bandwidth is an important constraint on the Flash Memory organisation. Unlike SSDs, no FTL is required because access is direct and writes are sequential, but bad-block and worn-block management are still necessary.

4GB, approximately the size of a DVD, is known to be a useful and manageable size with a well-understood file system (ISO 9660) available, and it would match shingled-disk write-regions well. Working with these region sizes and this file system builds on well-known, tested and understood capabilities, allowing rapid development and a safe transition.

Using low-cost MLC Flash Memory, with a life of perhaps 10,000 erase cycles, would allow the whole platter (1,000 regions) to be rewritten at least 10 times (a rough check of this arithmetic follows below). Allowing 25% over-provisioning of Flash may improve the lifetime appreciably, as is done with SSDs, which could be a "point of difference" between drive variants.
Specifically, the cache suggested is write-only, not a read-cache. The drive usage is intended to be "Seek and Stream", which does not benefit from an on-drive read-cache. For servers and disk-appliances, 4GB of DRAM cache is now an insignificant cost and the optimal location for a read cache.
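
A rough check of the endurance arithmetic above, assuming one region-sized Flash erase per region written and no write amplification; over-provisioning simply spreads the same wear over proportionally more cells:

    ERASE_CYCLES   = 10_000   # assumed MLC endurance
    REGIONS        = 1_000    # regions per platter (4 GB regions on a 4 TB platter)
    BUFFER_REGIONS = 1        # Flash write-buffer sized for one region

    platter_rewrites = ERASE_CYCLES * BUFFER_REGIONS / REGIONS   # = 10 full-platter rewrites
    with_25pc_extra  = platter_rewrites * 1.25                   # ~12.5, before any wear-levelling gains
    print(platter_rewrites, with_25pc_extra)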

Provisioning enough Flash Memory for multiple uncommitted regions, even 2 or 3, may also be a useful "point of difference" for either Enterprise or Consumer applications. Until this drive organisation is simulated and Operating and File Systems are written and trialled/tested against them, real-world requirements and advantages of larger cache sizes are uncertain.

Depending on the head configuration, the data could be read back after writing as a whole region and re-recorded as necessary, with the recording electronics perhaps adjusting for detected media defects and optimising recording parameters for the individual surface-head characteristics in the region.

If regions are only written at the unused end of drives, such as for archives or digital libraries, maximum effective drive capacity is guaranteed: there is no lost space. Write-Once, Read-Many is an increasingly common and important application for drives.
A side-effect is that individual drives will, like 1/2 inch tapes of old, vary in achieved capacity despite having the same notional size, and you can't know by how much until the limit is reached. Operating and File Systems have dealt with "bad blocks" for many decades and could potentially use that approach to cope with variable drive capacity, though it is not a perfect match. Artificially limiting drive capacity to a "Least Common Denominator", either by the consumer or the vendor, is also likely. "Over-clocking" of CPUs shows that some consumers will push the envelope and attempt to subvert or overcome any arbitrary hardware limits imposed. If any popular Operating System can't cope easily with uncertain drive capacity and variable regions, this will limit uptake in that market, though experience suggests not for long if there is an appreciable price or capacity/performance differential.

When re-writing a drive, the most cautious approach is to first logically erase the whole drive and then start recording again, overwriting everything. The most optimistic approach is to logically erase just 2 or 3 regions: the region you'd like to write plus enough of a physical cushion that defects etc. don't cause an unintended region overwrite.

This suggests two additional drive commands are needed:
  • write region without overwriting next region (or named region)
  • query notional region size available from "current position" to next region or end-of-disk.
This raises an implementation detail beyond me:
Are explicit "erase region" or "free region" operations required?
Would they physically write to every raw bit-location in a region or not?
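
To make the two proposed commands concrete, a toy model of the drive-side bookkeeping might look like the sketch below; the names, signatures and sector-level model are purely illustrative, not any real command set:

    class ShingledDriveModel:
        # Toy model of the two extra commands suggested above.
        def __init__(self, total_sectors):
            self.total   = total_sectors
            self.regions = []   # recorded regions as (start, length), kept sorted

        def query_region_space(self, position):
            # Notional space from 'position' to the next recorded region, or end-of-disk.
            next_starts = [start for start, _ in self.regions if start >= position]
            return (min(next_starts) if next_starts else self.total) - position

        def write_region(self, start, length):
            # Record a region at 'start' only if it cannot overrun the next region.
            if length > self.query_region_space(start):
                raise ValueError("write would overrun the next region")
            self.regions.append((start, length))
            self.regions.sort()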



On Heat-Assisted Magnetic Recording (HAMR), drive vibration, variable speed and multiple heads/arms.

HAMR is the other key technology (along with BPM and shingled-writes) being explored/researched to achieve Z-gen capacities.
It requires heating the media, presumably above the Curie Point, to erase the existing magnetic fields. Without specific knowledge, I'm guessing those heads will be bigger and heavier than current heads, and considerably larger and heavier than the read heads needed.

Large write-regions, with wide guard areas between them, would seem to be very well suited to HAMR and its implied low-precision heating element(s). Relieving the heating elements of the same precision requirements as the write and read heads may make the system easier to construct and control, and hence record more reliably, though this is pure conjecture on my part.

Dr. Sudhanva Gurumurthi and his students have extensively researched and written about the impact of drive rotational velocity, power-use and multiple heads. From the timing of their publications and the release of slow-spin and variable-speed drives, it's reasonable to infer that Gurumurthi's work was taken up by the HDD manufacturers.

Being used for "Seek and Stream" rather than high-intensity Random I/O, HDDs will exhibit considerably less head/actuator movement, and hence much less generated vibration, if nothing else changes. This improves operational reliability and greatly lowers induced errors by removing most of the drive-generated vibration. At the very least, less damping will be needed in high-density storage arrays.

Implied in the "Seek and Stream" is that I/O characteristics will be different, either:
  • nearly 100% write for an archive or logging drive with zero long seeks, or
  • nearly 100% read for a digital library or distributed data with moderate long seeking.
In both scenarios, seeks would reduce from the current 250-500/sec to 0.1-10/sec.
For continuous spiral tracks, head movement is continuous and smooth for the duration of the streaming read/write, removing entirely the sudden impulses of track seeks. For regions of discrete, concentric tracks, the head movements contain the minimum impulse energy. Good both for power-use and induced vibration.

Drawing on Gurumurthi's work on multiple heads compensating for the performance of slow-spin drives, this head/actuator arrangement for HAMR with shingled-writes may be beneficial:
  • Separate "heavy head", either heating element, write-head or combined heater-write head.
  • Dual light-weight read heads, mounted diagonally from each other and at right-angles to the "heavy head".
Because write operations are infrequent in both scenarios, the "heavy head" will be normally unloaded, even leaving the heating elements (if lasers aren't used) normally off. The 2-10 second region-write time, possibly with 1 or 2 rewrite attempts, means 10 msec heater ramp-up would not materially affect performance.

A single write head can only achieve half the maximum raw transfer rate of dual read heads. Operating Systems have not had to deal with this sort of asymmetry before and it could flush out bugs due to false assumptions.

Separating the read and "heavy" heads reduces the arm/bearing engineering requirements and actuator power for the usual dominant case: reading. Moving lighter loads around produces lower impulses. Because the drives are not attempting high-intensity random I/O, lower seek performance is acceptable. The energy used in accelerating/decelerating any mass is proportional to the square of the velocity. Reducing the arm seek velocity to about 70% of its former value halves the energy needed and the impulse energy needing to be dissipated. (Lower g-forces also reduce the amplitude of the impulse, though I can't remember the relationship.)
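
A one-line check of that scaling, since seek energy goes as the square of the velocity:

    full, reduced = 1.0, 0.71       # run the arm at ~70% of full seek velocity
    print(reduced**2 / full**2)     # ~0.50: half the kinetic energy per seek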

With "Seek and Stream" mode of operation, for a well tuned/balanced system, the dominant time factor is "streaming". The raw I/O transfer rate is of primary concern. The seek rate, especially for read, can be scaled back with little loss in aggregate throughput.

Optimising these factors is beyond my knowledge.

By using dual opposing read heads, impulses can be further reduced by synchronising the major seek movements of the read heads/arms.
As well, both heads can read the same region simultaneously, doubling the read throughput. This could be as simple as having each head read alternate tracks or, in a spiral track, having the second head start halfway through the read area. To achieve maximum bandwidth, though, the requesting initiator may have to cope with two parallel streams and join the fragments in its buffer. Not ideal, but attainable.

Assuming shingled-writes, dual spiral tracks would allow simple interleaving of simultaneous read streams, but would need either two write heads similarly diagonally opposed, or a single device assembled with two heads offset by a track width and possibly staggered in the direction of travel. Would a single laser heating element suffice for two write heads? This arrangement sounds overly complicated, difficult to manufacture consistently to high precision, and expensive.

For a single spiral track with dual read-heads, a dual spiral can be simulated, though achieving full throughput requires more local buffer space.
The controller moves the heads to adjacent tracks and reads a full track from each into a first set of buffers; it then concatenates the buffers and streams out the data. After the first track pair, the heads are leap-frogged and read into an alternate set of buffers, which are in turn concatenated and streamed while the heads leap-frog again and switch back to the first set of buffers, and so on.

This scheme doesn't need to buffer an exact track, just something larger than the longest track, at a small loss of speed. If a 1MB "track" size is chosen, then 4MB of buffer space is required. Data can begin streaming from the first byte of track 0, though full-speed transfers only happen once both buffers are full.
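
A minimal sketch of that leap-frog double-buffering, assuming hypothetical read_track() and emit() callbacks and an even number of tracks in the region; a real drive would overlap streaming one buffer pair with reading the next, rather than strictly alternating as shown here:

    def leapfrog_stream(read_track, emit, first_track, last_track):
        # Two read heads fill one pair of track buffers, which is streamed out
        # while the heads leap-frog forward two tracks to fill the other pair.
        buffers = [[None, None], [None, None]]   # two alternating sets of two track buffers
        active  = 0
        track   = first_track
        while track < last_track:
            pair = buffers[active]
            pair[0] = read_track(head=0, track=track)       # head 0 on track N
            pair[1] = read_track(head=1, track=track + 1)   # head 1 on track N+1
            emit(pair[0] + pair[1])                         # concatenate and stream the pair
            active ^= 1                                     # switch buffer sets
            track  += 2                                     # both heads jump forward two tracks

With a 1MB notional track size this gives the 4MB of buffer space described above: two pairs of 1MB track buffers.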

It's possible instead to de-interleave the data when written, reordering it before writing so that alternate sectors are offset by half the write-buffer size (2MB for a 4MB buffer). On reading, directly after the initial seek to the same 4MB segment, but at offsets zero and 2MB, the heads read alternate sectors which can then be interleaved easily and output at full bandwidth. When a head reaches the end of a segment (4MB), it jumps to the next segment and starts streaming again. Some buffering will still be required because of the variable track size and geometric head offsets. I'm not sure which scheme is superior.



Summary:
  • shingled-write drives form a new class of "write whole-region, never update" capacity-optimised drives. As such, they are NOT "drop-in replacements" for current HDDs, but require some tailoring of Operating and File Systems.
  • abandon the notional single-sector organisation for multi-sector variable blocking, similar to old 1/2 inch tape.
  • large write-regions (2-8GB) of variable size with small inter-region gaps maximise achievable drive capacity and minimise file system lost-space due to disk and file system size mismatches. If regions are fixed-sector organised, lost space will average around a half-sector, under 1/1000th overhead.
  • Appending regions to disks is the optimal recording method.
  • Optimisation techniques used in Optical Drives, such as continuous spiral tracks and CD-ROM's high resilience Error Correction, can be applied to fixed-sectors and whole shingled-write regions.
  • Integral high-bandwidth Flash Memory write caches would allow optimal region recording at low cost, including read-back and location optimised re-recording.
  • Shingled-writes would benefit from purpose-designed BPM media, but could be usefully implemented with current technologies to achieve higher capacities, though perhaps exposing individual drive variability.
  • shingled-writes and large, "never updated" regions work well with HAMR, BPM, separated read/write heads and dual light-weight read heads.
