Thursday, January 17, 2013

Storage: FileSystems, Block/Object Storage and Physical Disk Management in 21st Century Systems

The central social contract filesystem and storage layers have with users is:
  • Don't lose data
  • Make it easy to get data in and out, preferably verifiably correct.
  • Performance is nice, but can never take precedence over preserving data and replaying it correctly.
The approaches and paradigms that worked for Unix in 1970 won't work now. It was a world of 5-10MB drives @ 1-10Mbps, 1MHz CPUs without cache and "off-line" storage was 6250bpi 9-track 0.5in tape (2400' ~120Mb).

Even nearly 20 years later in 1988, the year of the Patterson/Gibson/Katz RAID paper, streaming the full contents of a drive for a rebuild (100MB 5.25" SCSI drives) was ~100 seconds and ~1000 seconds for 1GB 8" Fujitsu Eagle drives preferred by the first Storage Arrays.

What's changed is the relative capacity and speeds of storage devices, the demands of "average users" and some additional layers of storage, like cache and Flash memory.

The old approaches are creaking and becoming more & more complex in attempts to handle performance (rate), volume and size. One "fast" filesystem, ReiserFS, was popular for a time but notorious with users for corrupting disks and losing data. Breaking the contract loses users...

The 10 TB/platter 2.5" drives expected by 2020 will only read 2-3 times faster than current 1TB drives (250-400MB/sec). That's 40,000 seconds to stream the whole drive: 10-12 hours. Increasingly, Jim Gray's millennial observation, "Tape is Dead, Disk is the new Tape" (meaning disks are good at streaming, poor at random I/O), is driving Storage designs. Enterprise Class Storage Arrays cannot compete with Flash memory for random I/O and, to cover increasingly long drive rebuild times (4-150 hours), have adopted slower, more inefficient/complex parity schemes.
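
The streaming figure above is simply capacity divided by sustained transfer rate. A minimal Python sketch of that arithmetic follows; the function name and the use of decimal units (1 TB = 10^12 bytes) are my own assumptions for illustration:

```python
def stream_time_seconds(capacity_bytes: float, rate_bytes_per_sec: float) -> float:
    """Time to read or write a whole drive sequentially, ignoring seeks and retries."""
    if rate_bytes_per_sec <= 0:
        raise ValueError("transfer rate must be positive")
    return capacity_bytes / rate_bytes_per_sec


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

# A 10 TB drive at the low end of the 250-400 MB/sec range quoted in the text.
seconds = stream_time_seconds(10 * TB, 250 * MB)
print(f"{seconds:,.0f} s = {seconds / 3600:.1f} hours")
```

At the upper 400MB/sec figure the same drive still needs about 7 hours, so rebuild windows stay in the "many hours" range either way.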

We now have chips with 3 levels of cache, soon with on-chip DRAM, on-board DDR3 DRAM, PCIe Flash, SATA/SAS Flash and HDD drives and soon "no update-in-place" Shingled-Write drives.
SCM, Storage Class Memories, like Flash are hoped to provide the path to higher capacity devices, but to date, there are no obvious commercial technologies.

This is in the context of at least 4 types of compute devices, each with different demands for Storage and data recovery and protections.
  • Mobile: smartphones and tablets. Not usually "content creators" but "viewers". Software from Firmware and vendor App Stores. Auto-sync config and data to "Cloud" or desktop.
  • Laptop/low-end Desktop: Limited "content creation". Restores via vendor products, erratic/random backups and data protection.
  • "Power user" Workstations: Professional platforms for content creation. Dedicated Storage Appliances, with problematic & erratic data protection.
  • Servers:
    • SOHO/SME, small ISP: single servers or small farms. Nil or problematic data protection.
    • SMP servers, business server farms: SAN's + Storage Arrays, H/A, multiple-sites, fail-over, ...
    • Clusters and large arrays: special filesystems, lots of storage, fast networks.
    • Internet-scale Data Centres: purpose built hardware and storage solutions.
On my average Mac desktop there are over 2M files after 5 years, a scale not anticipated by the original Unix Directory-Inode-Links-Blocks design.

It's now possible and feasible for individuals to follow Gordon Bell and digitally record their entire lives. This is more than storing random snaps from smartphones, but creating a usable, accessible store.
In 10M recording seconds per year, individuals can create 100k files/year, 1TB at low data rates (100KB/sec) and view 1-10M files/web-pages.
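
As a rough check of those lifelogging figures, here is a small Python sketch. The constants are the ones quoted above; the function name and the decimal units are assumptions of mine:

```python
RECORDING_SECONDS_PER_YEAR = 10_000_000  # figure used in the text
DATA_RATE_BYTES_PER_SEC = 100 * 1000     # 100 KB/sec, the "low data rate" in the text


def yearly_volume_bytes(seconds: int, rate_bytes_per_sec: int) -> int:
    """Bytes recorded in a year at a constant data rate."""
    return seconds * rate_bytes_per_sec


volume = yearly_volume_bytes(RECORDING_SECONDS_PER_YEAR, DATA_RATE_BYTES_PER_SEC)
hours_per_day = RECORDING_SECONDS_PER_YEAR / 3600 / 365

print(f"{volume / 10**12:.1f} TB per year")
print(f"{hours_per_day:.1f} recording hours per day")
```

Ten million seconds works out to a little under 8 recording hours a day, so the 1TB/year figure corresponds to recording for about half of each waking day.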

This load for even the current 1-2B smartphone users (not the 6B cell phone services), whilst potentially being a boon for Network Operators & Storage vendors, requires new services and new approaches.  Especially:
  • Strong User Identification with many roles per individual, for work, interests and personal life.
  • Single Federated views of individual-Identity storage.
  • New Search, Indexing, tagging and annotation tools.
  • Integrated "point-in-time" file browsing and scanning.
  • Internet-scale data de-duplication and peer-peer Storage.
We already have definitive solutions for "point-in-time"  recovery of:
  • Text files via Version Control Systems like SVN, CVS, RCS, ...
  • Relational Databases with full-DB snapshot and "roll-forward" transaction logs,
  • but other important binary data types, {DB's, images, videos, sound, PDF-docs, geo-data, machine control, ...}, aren't born with verifiable digital signatures, nor their own change logs.
Metadata, both system generated like timestamps, geo-location and user GUID, and user-supplied data, like tags and text, are as important as the data changes.

Backups and Version Control Systems typically offer 3 sorts of versioning. A combination of these methodologies will be used at various levels:

  • Full Backup. 100% replication of all bits.
  • Incremental: Store only the bits changed since last Incremental.
    • Notionally, the minimum storage required.
    • Slowest to recover: all Incrementals must be applied sequentially, in order.
    • Most prone to error and data loss.
      • If one delta-file is deleted or corrupted, the entire set is useless.
  • Differential: Store all bits changed since last Full Backup.
    • Each differential is larger than the last, potentially up to the size of a Full Backup.
    • Fastest to recover.
    • Simplest to manage
    • Robust against errors and deletions, if the dataset was stored.
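
The recovery trade-off between the schemes can be sketched in a few lines of Python. This is an illustrative model only: the `Backup` record and `restore_chain` function are hypothetical names, and it assumes a schedule uses either Incrementals or Differentials after a Full Backup, never a mix:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int   # day the backup was taken
    kind: str  # "full", "incremental" or "differential"


def restore_chain(backups: list[Backup], target_day: int) -> list[Backup]:
    """Return the backups that must be applied, in order, to restore target_day."""
    usable = sorted((b for b in backups if b.day <= target_day), key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target day")
    base = fulls[-1]
    later = [b for b in usable if b.day > base.day]
    kinds = {b.kind for b in later}
    if len(kinds) > 1:
        raise ValueError("mixed schedules are outside this sketch")
    if kinds == {"differential"}:
        # The latest differential holds every change since the full backup.
        return [base, later[-1]]
    # Incrementals (or nothing) follow the full: every one is needed, in order.
    return [base, *later]


incremental_week = [Backup(0, "full")] + [Backup(d, "incremental") for d in range(1, 6)]
differential_week = [Backup(0, "full")] + [Backup(d, "differential") for d in range(1, 6)]

print([b.day for b in restore_chain(incremental_week, 4)])
print([b.day for b in restore_chain(differential_week, 4)])
```

The incremental chain grows with every day since the Full Backup, which is why one missing delta-file breaks the restore, while the differential chain never needs more than two pieces.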

Work on non-Relational Databases is occurring, but there are important challenges for relational Databases in providing a continuous-timeline view of storage, beyond the current transactional/data-warehouse duality/conversion:
  • limited data storage formats can be supported, "importing and conversion" 
  • indexing of data is a separate activity and stored/accessed differently.
  • Schemas and Database names have to survive changes.
  • Semantics of individual fields are as important as
The Wayback Machine, a.k.a. The Internet Archive, gives us a working model and informs us that people can tolerate a) retrieval delays, b) some datasets unavailable and c) some data loss.

It costs a lot less for "Best Effort" rather than "Guaranteed" storage services, suggesting multiple approaches, cost structures and service offerings in the marketplace. Hopefully consumers won't be inveigled to over-pay or complacently rely on inappropriate low-cost providers.

Will current Consumer Protection laws need to be extended to this area??
If you share data within a group (Family and Friends) and some people don't maintain their part of the archive - losing data for people that rely on them, do current laws apply or will new law be needed?

Will this lead to new businesses of "Archive Auditing"?

There are currently three "drop-dead" problems for these services, ignoring the current "unsupported file format" and "ancient system & run-time" issues:
  • Currently, there is no archival quality digital media.
    Hard Disks, Flash memory and CD/DVD's have limited lifetimes. They cannot be left on a shelf and be expected to work a couple of decades on... Data must be constantly scanned, rebuilt and migrated to new storage systems.
    • Acid-free paper and microforms will store documents for over 100 years.
    • Colour film is still the only archival media for movies and still images.
    • No good magnetic media exist for medium-long term storage of sound recordings.
  • Vendor longevity and professional misconduct or negligence, even systemic corruption.
    • When an Archival Storage Service goes bust, how do the owners of the data recover their data? Not over network links and if the facilities are locked and powered-off by administrators or sheriffs, not physically either.
    • There are around the world, just a few Telcos or Power Utilities that are 100 years old. Can we really expect profitable Storage to start now and last 5 times longer than Google without any commercial upsets? I'd argue "no".
    • Rogue admins and managers are the least of the problem, though they'll exist and cause problems.
    • Expecting ordinary, fallible owners, workers and managers to always resist temptation, bribery and sloth/negligence is more than naive and simplistic. Mistakes will happen, security breaches will occur and ordinary folk doing boring jobs will take shortcuts.
    • Valuable resources will always attract those wishing to steal them. These sorts of facilities must begin by never storing anything of value. Organised crime's only access must be via the individual users' system/device, not in a single, centralised resource.
  • Legal access issues: a whole new area of lucrative International Law awaits us...
    • Who has the right to look at data?
    • Can data "in default" (unpaid fees) be sold? To whom? At what price?
    • Can a Vendor move data from the Jurisdiction of origin, with or without permission?
    • Can Vendors share data across facilities in different Jurisdictions?
    • Can Storage custodians be forced to grant local Law Enforcement Offices access to individual or bulk data?
There are now three distinct views of the filesystem provided in the abstract model for user applications:
  • Current files
  • snapshots
  • archives
The O/S has to provide these services for each of those dimensional slices through the storage:
  • map names (paths) to inodes. Subsumes a "mount device/mount-point" model.
  • inodes (the immutable file, with metadata)
  • datablock link map, which reduces to start/end for contiguous allocation.
  • data blocks and free block list
  • Physical drive management, like LVM.
Systems have to address four different aspects of real-world storage access:
  • availability and connection paths
  • errors and rereads
  • erasures and failures
  • durability and longevity of data sets (protection and archive)
Overlaid on this are 4-5 distinct access patterns, similar to metalworking "temperatures":
  • "white-hot" region: read/write access on-board (RAM and PCIe Flash)
  • "red-hot" region: read/write access to direct-connect updatable HDD's
  • cool region: write once access to Big, Slow HDD's, probably non-update-in-place.
  • "blue" (cold) region: write once, seldom read HDD's. No update-in-place, append-only.
  • "black" (frozen) region: remote and archival storage. Rarely Accessed, Critical when needed.
There is a direct correspondence between different temperature regions and the filesystem abstraction they are providing.
  • Archives are read-only and live only in cool, cold and frozen regions.
  • Snapshots may be in a "red-hot" region, but otherwise in cool and cold regions.
    • Files are only ever moved to Archive from the Snapshot areas.
  • Current files will be migrated from, or cached into, the high-speed read/write regions on demand.
    • The link between Snapshots and Current files is: Snapshot[0] == Current filesystem.
My thesis is that the traditional Unix filesystem and O/S structure of Directories-Inodes-Block_maps-Data_blocks cannot serve all these demands well, but that we already have very good tools to handle them.

Schemes to handle inodes, Block_Maps & Linking and Block access for each "temperature" storage can be designed well for the specific trade-offs and performance expectations.

The major problem appears to me to be mapping File Names to Inodes:
  • It either requires very high performance and low-latency for the hottest I/O region, or
  • requires very large namespaces for snapshots and archives.
  • Indexes for Current & Snapshot views may be stored in low-latency storage, but the volume of names stored in long-term Archives means they cannot.
Neither of which is well served by the traditional "directory in a block", backed by O/S cache model.
But both are robustly handled by Database systems, albeit differently organised, indexed and tuned.
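
To make the "names in a database" idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout and the function names (`open_name_index`, `bind`, `lookup`) are hypothetical, chosen only for illustration; a real system would organise, index and tune the hot and archival name stores quite differently:

```python
import sqlite3


def open_name_index(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a name-to-inode index keyed by snapshot and path."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS names (
               snapshot INTEGER NOT NULL,  -- 0 = current filesystem
               path     TEXT    NOT NULL,
               inode    INTEGER NOT NULL,
               PRIMARY KEY (snapshot, path)
           )"""
    )
    return db


def bind(db: sqlite3.Connection, snapshot: int, path: str, inode: int) -> None:
    """Record that `path` names `inode` in the given snapshot."""
    db.execute("INSERT OR REPLACE INTO names VALUES (?, ?, ?)", (snapshot, path, inode))


def lookup(db: sqlite3.Connection, snapshot: int, path: str):
    """Return the inode for `path` in `snapshot`, or None if the name is absent."""
    row = db.execute(
        "SELECT inode FROM names WHERE snapshot = ? AND path = ?", (snapshot, path)
    ).fetchone()
    return row[0] if row else None


db = open_name_index()
bind(db, 0, "/home/me/notes.txt", 1001)  # current version
bind(db, 1, "/home/me/notes.txt", 977)   # older snapshot of the same name
print(lookup(db, 0, "/home/me/notes.txt"), lookup(db, 1, "/home/me/notes.txt"))
```

Using snapshot 0 for the live view follows the "Snapshot[0] == Current filesystem" convention above, so the same lookup serves Current, Snapshot and Archive views.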

What is missing in normal systems is:

  • Filesystem or storage layer of "What's Changed?" (Deltas) via md5sums or change messages.
  • Swapping snapshot views between "Delta" and "Full" filesystem views:
    • 'rsync' identifies changed files, but users have to create full filesystem images themselves.
    • Apple's TimeMachine creates a full filesystem image at a point-in-time, but provides no "Delta" interface beyond a single file or directory.
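
A "What's Changed?" layer driven by md5sums can be approximated in user space. The sketch below is an illustration, not a proposal for how a filesystem would do it internally; `manifest` and `delta` are hypothetical names, and it uses only the standard `hashlib` and `os` modules:

```python
import hashlib
import os


def manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to the MD5 hex digest of its contents."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.md5()
            with open(full, "rb") as fh:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            result[os.path.relpath(full, root)] = digest.hexdigest()
    return result


def delta(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests and list added, removed and changed paths."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }
```

Taking `manifest()` at two points in time and passing both to `delta()` gives the "Delta" view; a "Full" view is just the newer manifest on its own. Walking the whole tree each time is exactly the cost a storage-layer change feed would avoid.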
There are two implications that fall out of this analysis:

  • Consumers will demand "Open" storage standards allowing them to swap devices, systems and Storage Vendors, not be locked into Proprietary standards, especially single-vendor solutions, and
  • a software solution model based on the Apache web-server or Linux kernel: co-operative Open Source backed by the GNU license. This allows all vendors to avoid license and patent issues, share work, leverage prior work, support and develop common standards, whilst also allowing market-differentiation by offering specific tools or hardware/software combinations.
The current Unix-like approaches of filesystems, O/S supported directory scanning (name to inode mapping), LVM handling {data protection, logical and physical volumes}, independent snapshot/archive facilities, independent hot-plug media and manual setup and operation of Archival stores cannot provide an Identity-keyed Federated Storage & Archive system.

Not all data stores or vendors will provide the same grade of service. Features for rating and choosing between them can be borrowed from:

  • NTP (Network Time Protocol): stratum level of server. Just how good are they?
  • IP Routing: "cost of routes". Preferentially choose the faster, cheaper services.
The main features required in an Identity-keyed Federated Storage & Archive system are:
  • Data access limited by Identity (data privacy as part of "Security")
    • Multiple Identities per user, based on role or use.
    • Multiple Users and Identities per Device.
    • Master Identity access to specified data, for work and families.
  • Automatic implementation of Policies
  • Addition and management of user-managed hot-plug media
  • Automatic integration across all single-Identity devices of local disk, local network storage, peer storage and multiple Vendor services
  • Policies set as targets:
    • Cost
    • Maximum size of store
    • Maximum data recovery time
    • Minimum and Maximum times between recovery points:
      • every minute for the last 36 hours
      • every hour for the last fortnight
      • every day for the last year
      • every week for the last decade
      • every month after that
    • normal performance: access rate, I/O per sec
    • By datatype, Data Resilience and Longevity (Probability Data Loss per period, Maximum data loss event size)
    • Warnings, Alerts and Alarms.
    • Default and specified Data Destruction dates
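
The recovery-point schedule in the Policies list maps naturally onto a small table. The following Python sketch shows one way to express it; `RETENTION_TIERS` and `spacing_for_age` are hypothetical names, and treating "a month" as 30 days and "a decade" as 3650 days are simplifying assumptions:

```python
from datetime import timedelta

# (maximum age of a recovery point, spacing between retained points)
RETENTION_TIERS = [
    (timedelta(hours=36), timedelta(minutes=1)),  # every minute for the last 36 hours
    (timedelta(days=14), timedelta(hours=1)),     # every hour for the last fortnight
    (timedelta(days=365), timedelta(days=1)),     # every day for the last year
    (timedelta(days=3650), timedelta(weeks=1)),   # every week for the last decade
]
OLDEST_SPACING = timedelta(days=30)  # "every month after that" (30 days assumed)


def spacing_for_age(age: timedelta) -> timedelta:
    """Return how far apart retained recovery points should be at a given age."""
    for max_age, spacing in RETENTION_TIERS:
        if age <= max_age:
            return spacing
    return OLDEST_SPACING


for label, age in [("2 hours", timedelta(hours=2)), ("3 months", timedelta(days=90)),
                   ("20 years", timedelta(days=7300))]:
    print(label, "->", spacing_for_age(age))
```

A thinning job could walk the stored recovery points from newest to oldest and drop any point closer to its kept neighbour than `spacing_for_age` allows.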

Wednesday, January 02, 2013

Storage: Specifying Data Resilience and Data Protection



In Communications theory, there are two distinct concepts:

  • Errors [you get a signal, but noise has induced one or more symbol errors], and
  • Erasures [you lost the signal for a time or it was swamped by noise]

Erasures are often in "bursts", so techniques are needed to not just recover/correct a small number of symbols, but

This is the theory behind the Reed-Solomon [Galois Field] encoding for CD's and later DVD's.
It uses redundant symbols to recreate the data, needing twice as many symbols to correct errors as to recreate erasures. A [24,28] RS code encodes 24 symbols into 28, with 4 symbols/bytes of redundancy. This can be used to correct up to 2 errors (2*2 symbols used), or 1 error plus 2 erasures, or 4 erasures.
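
The trade-off between errors and erasures follows the usual Reed-Solomon bound: with n - k redundant symbols, a codeword is recoverable when 2 * errors + erasures <= n - k. A small Python check of that bound (the function name is mine; it tests the bound only and does not implement the decoder):

```python
def rs_can_correct(n: int, k: int, errors: int, erasures: int) -> bool:
    """True if an RS codeword of n symbols carrying k data symbols can recover
    from `errors` symbol errors (position unknown) plus `erasures` (position known)."""
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    if errors < 0 or erasures < 0:
        raise ValueError("counts must be non-negative")
    # Each error costs two redundant symbols, each erasure costs one.
    return 2 * errors + erasures <= n - k


# 24 data symbols encoded into 28: 28 - 24 = 4 redundant symbols.
for errors, erasures in [(2, 0), (1, 2), (0, 4), (2, 1)]:
    print(errors, "errors +", erasures, "erasures:", rs_can_correct(28, 24, errors, erasures))
```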

The innovation in CD's was applying 2 R-S codes successively, but between them using Cross Interleaving to handle burst errors by spreading a single L1 [24,28] frame across a whole 2352 sector [86?,98]. Only 1 byte of an erased L1 frame would appear in any single L2 sector.

DVD's use a related but different form of combining two R-S codes: Internal/External Parity.

CD-ROM's apply an L3 R-S Product Code on top of the L1&L2 RS codes + CIRC to get more acceptable Bit Error Rates (BER's) of ~1 in 10^15, vs 1 in 10^9. Data per sector goes down to 2048 bytes (2KB) from 2352 bytes.

With Hard Disks, and Storage in general, the last two big advances were:

  • RAID [1988/9, Patterson, Gibson, Katz]
  • Snapshots [Network Appliance, early 1990's]

RAID-3/4/5 was notionally about catering for erasures caused by the failure of a whole drive or component, such as a controller or cable.
This was done with low overhead by using the computationally cheap and fast XOR operation to calculate a single parity block.
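
A minimal Python sketch of single-parity protection: the parity block is the XOR of the data blocks, and any one missing block is the XOR of everything that survives. The helper name `xor_blocks` and the toy 4-byte blocks are illustrative assumptions, not any particular RAID implementation:

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC"]  # one stripe across three data drives
parity = xor_blocks(data)           # stored on the parity drive

# The drive holding data[1] fails: rebuild it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

This only works because the failed drive's position is known, which is the erasure case; a single parity block cannot locate a silently corrupted block.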

But in use, the ability to correct both errors and erasures with parity blocks has been conflated...

RAID-3/4/5 is now generally thought to be about Error Correction, not Failure Protection.

The usual metrics quoted for HDD's & SSD's are:
 - MTBF (~1M hours) or Annualised Failure Rate (AFR) 0.6-0.7%
 - BER (unrecoverable Bit Error Rate) 1 in 10^15
 - Size, Avg seek time, max/sustained transfer rate.

Operational Questions, Drive Reliability:

 - For a fleet, per 1000 drives, how many drives fail per year on average?
    [1 year = ~8700 hrs, so 1000 drives = ~8.7M drive-hours/year; at 1M hours MTBF = ~8.7 drive fails/year]
   Alternatively, AFR: 0.6-0.7% * 1000 = 6-7 drives/1000/year

 - What's the minimum wall-clock time to rebuild a full drive?
    [Size / sustained transfer rate: 4TB @ 150MB/sec write = ~7.5Hrs ]

 - what's the likelihood of a drive fail during a rebuild?
    7.5 hrs / 1M hrs = ~0.001% [???] per drive.
   - for RAID-set of 10, (7.5/1M) * 10 = ~0.01%

 - probability data loss in rebuild (N = 10):
    Transfer * BER: 4TB * 10 drives = 32 * 10^12 bits * 10 = 3.2 * 10^14 bits
    3.2 * 10^14 / 10^15 = 0.32 expected unrecoverable errors,
    i.e. up to a 32% chance of hitting at least one [suggests further protection is needed against data loss]
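
The operational questions above reduce to a few lines of arithmetic. This Python sketch uses the quoted figures (1M hours MTBF, BER of 1 in 10^15, 4TB drives, 150MB/sec) and assumes a constant failure rate and independent bit errors, which real drives do not strictly obey; all names are my own:

```python
import math

HOURS_PER_YEAR = 8766      # 365.25 days
MTBF_HOURS = 1_000_000     # quoted drive MTBF
BER = 1e-15                # unrecoverable errors per bit read
DRIVE_BYTES = 4 * 10**12   # 4 TB, decimal
WRITE_RATE = 150 * 10**6   # 150 MB/sec sustained


def fleet_failures_per_year(drives: int, mtbf_hours: float = MTBF_HOURS) -> float:
    """Expected drive failures per year in a fleet."""
    return drives * HOURS_PER_YEAR / mtbf_hours


def rebuild_hours(drive_bytes: float = DRIVE_BYTES, rate: float = WRITE_RATE) -> float:
    """Minimum wall-clock time to rewrite a whole drive."""
    return drive_bytes / rate / 3600


def p_drive_failure_during(hours: float, drives: int, mtbf_hours: float = MTBF_HOURS) -> float:
    """Chance that at least one of `drives` fails within `hours` (exponential model)."""
    return 1 - math.exp(-drives * hours / mtbf_hours)


def p_unrecoverable_read_error(bytes_read: float, ber: float = BER) -> float:
    """Chance of at least one unrecoverable bit error while reading `bytes_read`."""
    return 1 - math.exp(-bytes_read * 8 * ber)


print(f"failures per 1000 drives per year: {fleet_failures_per_year(1000):.1f}")
print(f"rebuild time: {rebuild_hours():.1f} hours")
print(f"drive failure during rebuild (10 drives): {p_drive_failure_during(7.5, 10):.4%}")
print(f"read error during rebuild (10 x 4TB): {p_unrecoverable_read_error(10 * DRIVE_BYTES):.0%}")
```

Under these assumptions the read-error risk during a rebuild is thousands of times larger than the risk of a second whole-drive failure, which is the point of the last calculation above.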

Data Protection questions. I don't know how to address these...

 - If we store data in RAID-6/7 units of 10-drive-equivalents
    with a lifetime of 5 years per set:
  - In a "lifetime" (60 year = 12 sets),
    what's the probability of Data Loss?

 - How many geographically separated replicas do we need to
    store data 100 years?


I think I know how to specify Data Protection: the same way (%) as AFR.

What you have to build for is Mean-Years-Between-Dataloss
and I guess that implies the degree of Dataloss: 1 byte, 1 block (4KB), 1MB?
As well as complete failure of a dataset-copy.

Typical AFR's are 0.4%-0.7%, as quoted by drive manufacturers based on
accelerated testing.

We know from those 2008(?) studies of large cohorts of drives, this is
optimistic by an order of magnitude...

An AFR of 1 in 10^6 results in a 99.99% chance of surviving 100 years (100YR-F-R):
(1 - 0.000001) ^ 100

AFR of 1 in 10^5 is 99.9% survival over 100 years (CFR? Century Failure Rate)

AFR of 1 in 10^4 is 99.0% survival over a century.
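
The century figures come from compounding the annual rate. A short Python sketch, assuming the annual loss rate is constant and each year is independent; both function names are hypothetical:

```python
def survival_probability(annual_loss_rate: float, years: int = 100) -> float:
    """Chance of getting through `years` with no loss at a constant annual rate."""
    return (1 - annual_loss_rate) ** years


def required_annual_rate(target_survival: float, years: int = 100) -> float:
    """Annual loss rate needed to reach a target survival probability."""
    return 1 - target_survival ** (1 / years)


for rate in (1e-6, 1e-5, 1e-4):
    print(f"annual rate {rate:.0e} -> {survival_probability(rate):.2%} over 100 years")

# Working backwards: what annual rate gives 99.9% over a century?
print(f"{required_annual_rate(0.999):.2e}")
```

By the same formula, a single unreplaced copy at a drive-like 0.7% annual rate has only about a 50% chance of lasting the century, which is why the additional probabilities below matter.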

So we have to estimate a few more probabilities:
 - site suffering natural disaster or fire etc.
 - site suffering war damage or intentional attack
 - country or economy crumbling [ every 40-50 yrs a depression ]
 - company surviving (Kodak lasted ~100 yrs)
 - admins doing their job competently and fully.
 - managers not scamming (selling disks, not providing service)

Are there more??