Monday, February 04, 2013

Storage: New Era Taxonomies

There are 3 distinct consumer-facing market segments that must integrate seamlessly:
  • portable/mobile devices: "all flash".
  • Desktop (laptop) and Workstations
  • Servers and associated Storage Arrays.
We're heading into new territory in Storage:
  • "everything" is migrating on-line. Photos, Documents, Videos and messages.
    • but we don't yet have archival-grade digital storage media.
      • Write to a portable drive and 5-year data retention is probable; 10 years is unlikely. That's a guess on my part; real life may well be much worse.
    • Currently householders don't understand the problem.
      • Flash drives are not (nearly) permanent or error-free.
      • Most people have yet to experience catastrophic loss of data.
      • "Free" cloud storage may only be worth what you pay for it.
  • Disk Storage (magnetic HDD's) is entering its last factor-10 increase.
    • We should expect 5-10TB/platter for 2.5" drives as an upper bound.
    • Unsurprisingly, the rate of change has gone from "doubling every year" to 35%/year to 14%/year.
    • As engineers approach hard limits, the rate of improvement is slower and side-effects increase.
    • Do we build the first maximum-capacity HDD in 2020 or a bit later?
  • Flash memory is getting larger, cheaper and faster to access, but is itself entering an end-game.
    • but retention is declining, whilst wear issues may have been addressed, at least for now.
    • PCI-attached Flash, the minimum latency config, is set to become standard on workstations and servers.
      • How do we use it and adequately deal with errors?
  • Operating Systems, General Business servers and Internet-scale Datacentres and Storage Services have yet to embrace and support these new devices and constraints. 
David Patterson, author with Gibson & Katz of the landmark 1988 paper on "RAID", noted that every storage layer trades cost-per-byte against throughput/latency.
When a layer is no longer cheaper than a faster layer, consumers discard it. Tapes were once the only high-capacity long-term storage option; by that rule, they survive only as long as they stay cheaper than disk.

My view of FileSystems and Storage:
  • high-frequency, low-latency: PCI-Flash.
  • high-throughput, large-capacity: read/write HDD.
  • Create-Once Read-Maybe snapshot and archival: non-update-in-place HDD.
    • 'Create' not 'Write'-Once. Because latent errors can only be discovered actively, one of the tasks of Archival Storage systems is regularly reading and rewriting all data.
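A minimal sketch of that scrub pass, assuming the simplest possible layout where each archived file sits next to a stored SHA-256 checksum and a second copy exists for repair (the paths and layout here are purely illustrative):

    # scrub.py -- toy archival scrub pass: read everything, verify checksums,
    # rewrite any file whose stored checksum no longer matches.
    import hashlib, os, shutil

    ARCHIVE = "/archive"          # hypothetical archive root
    REPLICA = "/archive-replica"  # hypothetical second copy, used for repair

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def scrub():
        for root, _dirs, files in os.walk(ARCHIVE):
            for name in files:
                if name.endswith(".sha256"):
                    continue
                path = os.path.join(root, name)
                want = open(path + ".sha256").read().strip()
                if sha256_of(path) != want:
                    # Latent error found: repair from the replica (or at least flag it).
                    print("MISMATCH:", path)
                    shutil.copy2(os.path.join(REPLICA, os.path.relpath(path, ARCHIVE)), path)

    if __name__ == "__main__":
        scrub()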
What size HDD will become the norm for read/write and Create-Once use?
I suspect 2.5" because:

  • Watts-per-byte are low because aerodynamic drag increases as roughly the fifth power of platter diameter and around the cube of rotational speed.
    • A 3.5" disk has to spin at around 1700rpm to match a 5400rpm 2.5" drive in power/byte, and ~1950rpm to match a 7200rpm 2.5" drive.
    • All drives will use ~2.4 times the power to spin at 7200rpm vs 5400rpm.
    • Four 2.5" drives provide around the same capacity as a single 3.5" drive.
      • The area of a 2.5" platter is about half that of a 3.5" platter.
      • 2.5" drives are half the thickness of 3.5" drives (25.4mm).
      • 3.5" drives may squeeze in 5 platters, 25% more than 2.5" drives.
  • Drives are cheap, but four smaller drives will always be more expensive than a single larger drive.
    • Four sets of heads will always provide:
      • higher aggregate throughput
      • lower-latency
      • more diversity, hence more resilience and recovery options
      • "fewer eggs in one basket": the impact of a failure is limited to a single drive.
    • In raw terms, the cheapest, slowest, most error-prone storage will always be 3.5" drives. But admins build protected storage, not raw.
      • With 4TB 3.5" drives, 6 drives will provide 16TB in a RAID-6 config.
        • Note the lack of hot-spares.
      • With 1TB 2.5" drives, RAID-5 is still viable.
        • 24 drives, as two RAID-5 sets of 11 drives plus a hot-spare each, provide 20TB.
        • For protected storage, 3.5" drives offer at best 3.2 times the density (usable TB per drive slot), with many-fold lower throughput and higher latency.
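The arithmetic behind those two configurations, as a quick check (RAID-5 gives up one drive per set to parity, RAID-6 two; hot-spares add nothing to usable capacity):

    # Usable capacity of the two protected configurations above.
    def raid_usable_tb(drives, drive_tb, parity, hot_spares=0):
        """Usable TB of one set: (total - parity - spares) data drives."""
        return (drives - parity - hot_spares) * drive_tb

    # 6 x 4TB 3.5" drives in RAID-6 (2 parity, no hot-spare):
    big = raid_usable_tb(6, 4, parity=2)                       # 16 TB in 6 slots

    # 24 x 1TB 2.5" drives as two RAID-5 sets of 11 plus a hot-spare each:
    small = 2 * raid_usable_tb(12, 1, parity=1, hot_spares=1)  # 20 TB in 24 slots

    print(big, small)                   # 16 20
    print((big / 6) / (small / 24))     # ~3.2x usable TB per drive-slot for 3.5"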



Here are some of my take-aways from LCA 2013 in Canberra.

* We're moving towards 1-10TB of PCI-Flash or other Storage Class Memory being affordable and should expect it to be a 'normal' part of desktop & server systems. (Fusion IO now 'high-end')

  - Flash isn't that persistent; data does fade (is that with power on?).
    - How can that be managed to give decade long storage?
  - PCI-Flash/SCM could be organised as one or all of these:
     - direct slow-access memory [needs a block-oriented write model]
     - fast VM page store
     - File System. Either as:
        - a 'tmpfs'-style separate file system
        - seamlessly integrated & auto-managed, like AAPL's Fusion LVM
        - a massive write-through cache (more a block-driver? a toy sketch follows below)
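A toy sketch of that last option, a write-through cache sitting in front of a slower block store; the class and eviction policy here are mine and purely illustrative:

    # Toy write-through cache: every write goes to both flash and disk,
    # reads are served from flash when possible, so losing the flash costs no data.
    class WriteThroughCache:
        def __init__(self, backing_store, cache_blocks=1024):
            self.disk = backing_store      # dict-like: block_no -> bytes
            self.flash = {}                # the "PCI-Flash" tier
            self.cache_blocks = cache_blocks

        def write(self, block_no, data):
            self.disk[block_no] = data     # always lands on the slow tier too
            self._cache(block_no, data)

        def read(self, block_no):
            if block_no in self.flash:     # fast path
                return self.flash[block_no]
            data = self.disk[block_no]     # slow path, then populate the cache
            self._cache(block_no, data)
            return data

        def _cache(self, block_no, data):
            if len(self.flash) >= self.cache_blocks:
                self.flash.pop(next(iter(self.flash)))   # crude FIFO-ish eviction
            self.flash[block_no] = data

    cache = WriteThroughCache(backing_store={})
    cache.write(0, b"superblock")
    assert cache.read(0) == b"superblock"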

  - There was a talk on Checkpoint/Restart in the kernel, especially for virtual machines; it allows live migration and the potential for live kernel upgrades of some sort...
    - we might start seeing 4,000-day uptimes.
    - PCI-Flash/SCM would be the obvious place to store checkpoint images and to act as source/destination for copies.
    - nobody is talking about error-detection and data preservation for this new era: it's essential to explicitly detect/correct and auto-manage.

  - But handling read-errors and memory corruption wasn't talked about..
    - ECC won't be enough to *detect* let alone correct large block errors.
    - Long up-times mean we'll want H/A CPUs as well, to detect compute errors.
      E.g. triplicated CPU paths with result voting (a toy sketch follows below).

    - the 'new era' approach to resilience/persistence has been whole-system replication over a 'network' (Ethernet/LAN) connection, moving away from expensive internal replication for H/A.
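A toy sketch of that result-voting idea (triple modular redundancy): run the computation three times, ideally on independent hardware paths, and take the majority answer. The function is mine, only to show the shape of it:

    # Toy triple-modular-redundancy: compute three times and majority-vote.
    from collections import Counter

    def tmr(compute, *args):
        results = [compute(*args) for _ in range(3)]   # ideally three independent paths
        value, votes = Counter(results).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no two paths agree -- uncorrectable compute error")
        return value

    print(tmr(lambda x: x * x, 12))    # 144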


==> As we require more and more of 'normal' systems, they start to need more and more Real-time and H/A components.

==> For "whole-system" replication, end-to-end error detection of all storage transfers starts to look necessary, i.e. an MD5 or other checksum generated by the drive or Object store and passed along the
chain into PCI-Flash and RAM, and maybe kept for rechecking.
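A sketch of what that chain might look like, with the checksum generated once at the source and re-verified at every hop; I've used SHA-256 from Python's standard library rather than MD5, and the flow, not the algorithm, is the point:

    # End-to-end check: the checksum travels with the data through every tier
    # and is re-verified at each hop (and could be kept for later re-checking).
    import hashlib
    from typing import NamedTuple

    class TaggedBlock(NamedTuple):
        data: bytes
        checksum: str        # generated once, at the drive / Object store

    def tag(data: bytes) -> TaggedBlock:
        return TaggedBlock(data, hashlib.sha256(data).hexdigest())

    def verify(block: TaggedBlock, where: str) -> TaggedBlock:
        if hashlib.sha256(block.data).hexdigest() != block.checksum:
            raise IOError("corruption detected at " + where)
        return block

    # Simulated chain: Object store -> PCI-Flash -> RAM -> application.
    block = tag(b"some archived object")
    for tier in ("pci-flash", "ram", "application"):
        block = verify(block, tier)      # each hop re-checks before passing it on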

==> With multiple levels of storage with latency between them and very high compute rates in CPU's, we're heading into the same territory that Databases addressed (in the 80's?) with Transactions and ACID.


* Log-structured File Systems are a perfect fit for large Flash-based datastores.
  - but log-structured FS may also be perfect for:
    - write-once data, like long-term archives (eg. git repos)
    - shingled write (no update-in-place) disks, effectively WORM.
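A minimal sketch of why the log-structured model suits no-update-in-place media: every write is an append, and an in-memory index maps keys to offsets. The format and names are mine, purely illustrative:

    # Toy log-structured store: records are only ever appended (WORM-friendly);
    # an index of key -> (offset, length) finds the latest version of anything.
    import os

    class LogStore:
        def __init__(self, path):
            self.log = open(path, "ab+")
            self.index = {}                          # key -> (offset, length)

        def put(self, key, value):
            self.log.seek(0, os.SEEK_END)
            offset = self.log.tell()
            self.log.write(value)
            self.log.flush()
            self.index[key] = (offset, len(value))   # old data is never rewritten

        def get(self, key):
            offset, length = self.index[key]
            self.log.seek(offset)
            return self.log.read(length)

    store = LogStore("/tmp/example.log")             # hypothetical path
    store.put("photo-001", b"...jpeg bytes...")
    assert store.get("photo-001") == b"...jpeg bytes..."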

==> I think we need an explicit Storage Error Detect/Correct layer between disks and other storage, to improve bit error rates from around 1-in-10^14 or 1-in-10^16 to more like 1-in-10^25 or 1-in-10^30. [I need to calculate what numbers are actually needed.] Especially as everything gets stored digitally and people expect digital archives to be like paper and "just work" over many decades.
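A back-of-envelope version of that calculation, under assumptions that are entirely mine (a 10TB personal archive, read in full once a month for 30 years, with a target of effectively zero unrecoverable errors over that period):

    # Rough estimate: how good does the unrecoverable bit-error rate need to be?
    archive_bits   = 10e12 * 8                # 10 TB archive, in bits
    reads_per_year = 12                       # full scrub/read once a month
    years          = 30
    bits_read      = archive_bits * reads_per_year * years   # ~2.9e16 bits

    for exponent in (14, 16, 25, 30):
        expected_errors = bits_read / 10**exponent
        print("1-in-10^%d: about %.2e unrecoverable errors" % (exponent, expected_errors))
    # At 1-in-10^14 that is hundreds of expected errors over the archive's life;
    # at 1-in-10^25 or better it is effectively zero, which is what
    # "just works for decades" actually requires.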