Wednesday, January 02, 2013

Storage: Specifying Data Resilience and Data Protection



In Communications theory, there are two distinct concepts:

  • Errors [you get a signal, but noise has induced one or more symbol errors], and
  • Erasures [you lost the signal for a time or it was swamped by noise]

Erasures often come in "bursts", so techniques are needed to recover not just a small number of symbols, but long runs of consecutive symbols.

This is the theory behind the Reed-Solomon [Galois Field] encoding for CD's and later DVD's.
It uses redundant symbols to recreate the data, needing twice as many symbols to correct an error as to recreate an erasure. A [24,28] RS code encodes 24 symbols into 28, with 4 symbols/bytes of redundancy. This can be used to correct up to 2 errors (2*2 symbols used), or 1 error plus 2 erasures, or up to 4 erasures.
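
The error/erasure trade-off can be checked with a small budget calculation. This is a minimal sketch, assuming the usual Reed-Solomon bound that 2 * errors + erasures must not exceed the number of redundant symbols; the function name `rs_can_correct` is made up for illustration and is not from any library.

```python
def rs_can_correct(n, k, errors, erasures):
    """Return True if an RS code with k data symbols and n total symbols
    can handle this mix of symbol errors and erasures.

    An error costs two redundant symbols (find it, then fix it).
    An erasure costs one, because its position is already known.
    """
    redundancy = n - k
    return 2 * errors + erasures <= redundancy


# 24 data symbols encoded into 28 leaves 28 - 24 = 4 redundant symbols.
for errors, erasures in [(2, 0), (1, 2), (0, 4), (2, 1)]:
    verdict = "ok" if rs_can_correct(28, 24, errors, erasures) else "too many"
    print(f"{errors} errors + {erasures} erasures: {verdict}")
```

Only the last combination (2 errors plus 1 erasure) exceeds the budget, since it would need 5 redundant symbols.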

The innovation in CD's was applying 2 R-S codes successively, with Cross Interleaving between them to handle burst errors by spreading a single L1 [24,28] frame across a whole 2352-byte sector [86?,98]. Only 1 byte of an erased L1 frame would appear in any single L2 sector.
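
Why interleaving turns one burst into many single-symbol erasures can be shown with a toy example. This sketch uses a plain block interleaver (a matrix transpose) on four 4-symbol frames. That is a deliberate simplification, not the actual cross-interleaving layout used on CD's, and the frame labels are invented for the illustration.

```python
ERASED = None


def transpose(rows):
    """Swap rows and columns; applying it twice restores the original."""
    return [list(column) for column in zip(*rows)]


# Four 4-symbol frames, labelled so each symbol's origin stays visible.
frames = [[f"{name}{i}" for i in range(4)] for name in "ABCD"]

on_disc = transpose(frames)      # interleave before writing
on_disc[1] = [ERASED] * 4        # a burst wipes out one whole written row
recovered = transpose(on_disc)   # de-interleave after reading

for frame in recovered:
    print(frame, "erasures:", frame.count(ERASED))
```

Each recovered frame is missing exactly one symbol, which is well within what a code with a few redundant symbols per frame can rebuild.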

DVD's use a related but different form of combining two R-S codes: Internal/External Parity.

CD-ROM's apply an L3 R-S Product Code on top of the L1 & L2 RS codes + CIRC to get more acceptable Bit Error Rates (BER's) of ~1 in 10^15, vs 1 in 10^9. Data per sector goes down to 2048 bytes (2KB) from 2352 bytes.

With Hard Disks, and Storage in general, the last two big advances were:

  • RAID [1988/9, Patterson, Gibson, Katz]
  • Snapshots [Network Appliance, early 1990's]

RAID-3/4/5 was notionally about catering for erasures caused by the failure of a whole drive or component, such as a controller or cable.
This was done with low overhead by using the computationally cheap and fast XOR operation to calculate a single parity block.
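
The XOR parity idea is small enough to show directly. This is a minimal sketch with three tiny in-memory "drives"; the helper name `xor_blocks` and the sample data are invented for illustration, and real RAID works on fixed-size stripes rather than short byte strings.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"disk-one", b"disk-two", b"disk-3!!"]
parity = xor_blocks(data)

# Lose the second drive, then rebuild it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

This works because the position of the lost block is known (an erasure). A single parity block cannot by itself say which drive returned bad data.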

But in use, the ability to correct both errors and erasures with parity blocks has been conflated...

RAID-3/4/5 is now generally thought to be about Error Correction, not Failure Protection.

The usual metrics quoted for HDD's & SSD's are:
 - MTBF (~1M hours) or Annualised Failure Rate (AFR) 0.6-0.7%
 - BER (unrecoverable Bit Error Rate) 1 in 10^15
 - Size, Avg seek time, max/sustained transfer rate.

Operational Questions, Drive Reliability:

 - For a fleet, per 1000 drives, how many drives fail per year on average?
    [1 year = ~8760 hrs, so 1000 drives = ~8.8M drive-hours/year;
     at 1M hours MTBF that is ~8.8 drive failures/year]
   Alternatively, AFR: 0.6-0.7% * 1000 = 6-7 drives/1000/year

 - What's the minimum wall-clock time to rebuild a full drive?
    [Size / sustained transfer rate: 4TB @ 150MB/sec write = ~7.5 hrs]

 - What's the likelihood of a drive failing during a rebuild?
    7.5 hrs / 1M hrs = 0.00075% (~0.001%) per drive.
   - for a RAID-set of 10: (7.5/1M) * 10 = 0.0075% (~0.01%)

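 - Probability of data loss (an unrecoverable read error) in a rebuild (N = 10):
    Bits transferred * BER: 4TB * 10 = 32 * 10 * 10^12 bits = 3.2 * 10^14 bits
    3.2 * 10^14 / 10^15 = 0.32 expected unrecoverable errors,
    a ~27% chance of at least one (1 - e^-0.32)
    [suggests further protection is needed against data loss]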
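
The back-of-envelope figures in this list can be reproduced in a few lines. This is a rough sketch, assuming a constant failure rate (so MTBF can be turned into a probability with an exponential) and independent bit errors; all function names are made up for illustration. Note that bits read times BER gives the expected number of unrecoverable errors, and the chance of hitting at least one is somewhat lower.

```python
import math

HOURS_PER_YEAR = 8760


def fleet_failures_per_year(drives, mtbf_hours):
    """Expected drive failures per year for a fleet."""
    return drives * HOURS_PER_YEAR / mtbf_hours


def rebuild_hours(capacity_bytes, write_bytes_per_sec):
    """Minimum wall-clock hours to rewrite a whole drive."""
    return capacity_bytes / write_bytes_per_sec / 3600


def p_drive_failure_during(hours, mtbf_hours, drives=1):
    """Chance that at least one of `drives` fails within `hours`."""
    return -math.expm1(-drives * hours / mtbf_hours)


def p_unrecoverable_read_error(bytes_read, ber):
    """Chance of at least one unrecoverable bit error while reading."""
    return -math.expm1(-bytes_read * 8 * ber)


hours = rebuild_hours(4e12, 150e6)
print(f"fleet failures/year: {fleet_failures_per_year(1000, 1e6):.2f}")
print(f"rebuild time: {hours:.1f} hrs")
print(f"drive fails in rebuild (1 drive): {p_drive_failure_during(hours, 1e6):.5%}")
print(f"drive fails in rebuild (10 drives): {p_drive_failure_during(hours, 1e6, 10):.4%}")
print(f"read error in rebuild (10 x 4TB): {p_unrecoverable_read_error(10 * 4e12, 1e-15):.0%}")
```

With these inputs the second drive failure is a minor risk (under 0.01%), while an unrecoverable read error during the rebuild is around 27%, which is the number that drives the need for extra protection.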

Data Protection questions. I don't know how to address these...

 - If we store data in RAID-6/7 units of 10-drive-equivalents
    with a lifetime of 5 years per set:
  - In a "lifetime" (60 years = 12 sets),
    what's the probability of Data Loss?

 - How many geographically separated replicas do we need to
    store data 100 years?
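
One possible way to frame both questions, offered only as a sketch: treat each 5-year set (or each replica) as failing independently with some probability. Independence is a strong assumption, the model ignores repairing a lost replica from a surviving one, and the probabilities used below are placeholder inputs rather than measured figures. The function names are invented for illustration.

```python
def p_loss_over_sets(p_set_loss, sets):
    """Chance that at least one of `sets` successive sets loses data."""
    return 1 - (1 - p_set_loss) ** sets


def replicas_needed(p_replica_loss, target):
    """Smallest number of independent replicas so that the chance of
    losing every one of them is at or below `target`.
    Expects 0 < p_replica_loss < 1 and target > 0."""
    n = 1
    while p_replica_loss ** n > target:
        n += 1
    return n


# Placeholder inputs: 1% loss per 5-year set, 5% loss per replica.
print(f"loss over 12 sets: {p_loss_over_sets(0.01, 12):.1%}")
print(f"replicas for 1-in-a-million: {replicas_needed(0.05, 1e-6)}")
```

The hard part remains estimating the per-set and per-replica probabilities, which is what the rest of this post is circling around.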


I think I know how to specify Data Protection: the same way (%) as AFR.

What you have to build for is Mean-Years-Between-Dataloss,
and I guess that implies specifying the degree of Dataloss: 1 byte, 1 block (4KB), 1MB?
As well as complete failure of a dataset-copy.

Typical AFR's are 0.4%-0.7%, as quoted by drive manufacturers based on
accelerated testing.

We know from those 2008(?) studies of large cohorts of drives, this is
optimistic by an order of magnitude...

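An AFR of 1 in 10^6 results in a 99.99% chance of surviving 100 years:
(1 - 0.000001) ^ 100
That is a 100-year failure rate (CFR? Century Failure Rate) of 0.01%.

AFR of 1 in 10^5 is 99.9% survival over 100 years (0.1% CFR).

AFR of 1 in 10^4 is 99.0% survival over 100 years (1.0% CFR).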
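
The century figures come from compounding the annual rate over 100 years. A small sketch, assuming the annual rate stays constant and years are independent; the function name is made up for illustration.

```python
def survival_over_years(annual_loss_rate, years=100):
    """Chance of getting through `years` with no loss, at a fixed annual rate."""
    return (1 - annual_loss_rate) ** years


for exponent in (6, 5, 4):
    afr = 10 ** -exponent
    survive = survival_over_years(afr)
    print(f"AFR 1 in 10^{exponent}: survive {survive:.4%}, lose {1 - survive:.4%}")
```

For rates this small the century loss is close to 100 times the annual rate, which is why each extra "nine" on the annual figure buys one extra "nine" over the century.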

So we have to estimate a few more probabilities:
 - site suffering natural disaster or fire etc.
 - site suffering war damage or intentional attack
 - country or economy crumbling [ every 40-50 yrs a depression ]
 - company surviving (Kodak lasted 100yrs)
 - admins doing their job competently and fully.
 - managers not scamming (selling disks, not providing service)

Are there more??
