Thursday, June 07, 2012

The New Storage Hierarchy and 21st Century Databases

The new hardware capabilities and cost/performance characteristics of storage and computer systems mean there has to be a radical rethink of how databases work and are organised.

The three main challenges I see are:
  • SSD and PCI Flash memory with "zero" seek time (see the latency sketch after this list),
  • affordable Petabyte HDD storage, and
  • object-based storage replacing "direct attach" devices.
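
As a rough illustration of the "zero" seek time point, here is a back-of-the-envelope latency comparison. The figures are assumed, typical numbers of the era, not measurements from any particular device:

# Back-of-the-envelope single-record read latency (assumed, typical figures).
# The point: on HDD, mechanical delays dominate; on Flash they vanish.
HDD_SEEK_MS     = 8.0    # average seek (assumed)
HDD_ROTATION_MS = 4.2    # roughly half a rotation at 7,200 rpm
HDD_TRANSFER_MS = 0.05   # an 8 KB block at ~150 MB/s
SSD_READ_MS     = 0.1    # typical Flash random-read latency (assumed)

hdd_total = HDD_SEEK_MS + HDD_ROTATION_MS + HDD_TRANSFER_MS
mechanical = (HDD_SEEK_MS + HDD_ROTATION_MS) / hdd_total
print(f"HDD single-record read: ~{hdd_total:.1f} ms ({mechanical:.1%} mechanical delay)")
print(f"Flash single-record read: ~{SSD_READ_MS:.1f} ms (~{hdd_total / SSD_READ_MS:.0f}x faster)")

Almost all of the HDD figure is seek and rotation, which is exactly the component that decades of DB 'optimisation' were built around.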
These changed technical tradeoffs force design changes:
  • single-record access time is no longer dominated by seek and rotational delay, so the old 'optimisations' built around them are large, costly, slow and irrelevant,
  • the whole "write region" can be held in fast memory, changing cache requirements and design,
  • Petabyte storage allows "never delete" datasets, which pose new problems:
    • how does old and new data get physically organised?
    • what logical representations can be used to reduce queries to minimal collections?
    • how does a single datastore support conflicting usage types? [real-time transactions vs data warehouse]
    • How are changed Data Dictionaries supported?
    • common DB formats are necessary as the lifetime of data will cover multiple products and their versions. 
  • Filesystems and Databases have to use the same primitives and use common tools for backups, snapshots and archives.
    • As must higher-order functions/facilities:
      • compression, de-duplication, transparent provisioning, Access Control and Encryption
      • Data Durability and Reliability [RAID + geo-replication]
  • How is security managed over time with unchanging datasets?
  • How are Performance Analysis and 'Tuning' performed?
  • Can Petabyte datasets be restored or migrated at all?
    • DB's must continue running without data loss or performance degradation as the underlying storage and compute elements are changed or re-arranged.
  • How is expired data 'cleaned' whilst respecting/enforcing any legal caveats or injunctions?
  • What data are new Applications tested against?
    • Just a subset of "full production"? [doesn't allow Sizing or Performance Testing]
    • Testing and Developing against "live production" data is either extremely unwise [unintended changes/damage] or a massive security hole. But when there's only One DB, what to do?
  • What does DB roll-back and recovery mean now? What actions should be expected?
    • Is "roll-back" or reversion allowable or supportable in this new world?
    • Can data really be deleted in a "never delete" dataset?
      • Is the Accounting notion of "journal entries" necessary? (see the sketch after this list)
    • What happens when logical inconsistencies appear in geo-diverse DB copies?
      • can they be detected?
      • can they ever be resolved?
  • How do these never-delete DB's interface with, or support, corporate Document and Knowledge Management systems?
  • Should summaries ever be made and stored automatically, given the many privacy and legal data-retention laws, regulations and policies in force?
  • How are conflicting multi-jurisdiction issues resolved for datasets with wide geo-coverage?
  • How are organisation mergers accomplished?
    • Who owns what data when an organisation is de-merged?
    • Who is responsible for curating important data when an organisation disbands?
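
On the "journal entries" question above: a minimal sketch, in Python, of what deletion-by-compensation might look like in a never-delete store. This is my own illustration with an assumed in-memory journal; no real product or schema is implied.

# Append-only "journal": nothing is physically deleted; a 'delete' or
# correction is a new reversing entry, as in double-entry accounting.
from datetime import datetime, timezone

journal = []   # append-only; existing entries are never modified or removed

def post(entity_id, field, value, reverses=None):
    """Append a journal entry; 'reverses' names the entry being undone."""
    entry = {
        "seq": len(journal),
        "at": datetime.now(timezone.utc).isoformat(),
        "entity": entity_id,
        "field": field,
        "value": value,
        "reverses": reverses,
    }
    journal.append(entry)
    return entry["seq"]

def current_view():
    """Fold the journal into the current logical state, honouring reversals."""
    reversed_seqs = {e["reverses"] for e in journal if e["reverses"] is not None}
    state = {}
    for e in journal:
        if e["seq"] in reversed_seqs or e["reverses"] is not None:
            continue   # skip reversed entries and the reversing entries themselves
        state.setdefault(e["entity"], {})[e["field"]] = e["value"]
    return state

post("cust-42", "name", "Acme Pty Ltd")
seq = post("cust-42", "credit_limit", 10000)
post("cust-42", "credit_limit", None, reverses=seq)   # 'delete' by reversal
print(current_view())   # {'cust-42': {'name': 'Acme Pty Ltd'}}

The original entries remain in the journal for audit, which is precisely what makes the earlier questions about legal caveats and 'real' deletion so awkward.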
XML is not the answer: it is a perfectly good self-contained data interchange format, but not an internal DB format.
Redesign and adaptation are needed at three levels:
  • Logical Data layout, query language and Application interface.
  • Physical to Logical mapping and supporting DB engines.
  • Systems Configuration, Operations and Admin.
We now live in a world of VM's, transparent migration and continuous uninterrupted operations: DB's have to catch up.

They also have to embrace the integration of multiple disparate data sources/streams as laid out in the solution Jerry Gregoire created for Dell in 1999 with his "G2 Strategy":
  • Everything should be scalable through the addition of servers.
  • Principal application interface should be a web browser.
  • Key programming with Java or ActiveX-type languages.
  • Message brokers used for application interfacing (see the sketch after this list).
  • Technology selection on an application-by-application basis.
  • Databases should be interchangeable.
  • Extend the life of legacy systems by wrapping them in a new interface.
  • Utilize "off the shelf systems" where appropriate.
  • In-house development should rely on object-based technology - new applications should be made up of proven object 'puzzle pieces'.
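
On the message-broker point: a minimal in-process sketch of broker-style application interfacing. It is written in Python rather than the Java/ActiveX of the era, and the topic names and payloads are invented for illustration.

# Broker-style decoupling: applications publish and subscribe by topic name
# and never reference each other directly.
from collections import defaultdict

class Broker:
    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> list of handlers

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(message)

broker = Broker()

# The order-entry application publishes; warehouse and finance consume.
broker.subscribe("order.created", lambda m: print("warehouse: pick", m["sku"]))
broker.subscribe("order.created", lambda m: print("finance: invoice", m["order_id"]))
broker.publish("order.created", {"order_id": 1001, "sku": "X-200"})

Swapping the in-process broker for a real one changes the transport, not the applications - which is the point of that strategy item.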
Data Discovery, Entity Semantics with range/limits (metadata?) and Rapid/Agile Application development are critical issues in this new world.
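
As a concrete (and entirely invented) example of what "Entity Semantics with range/limits" could look like as machine-readable metadata, here is a sketch assuming a simple data-dictionary entry per field:

# Sketch: a data-dictionary entry carrying a field's semantics and range/limits,
# so discovery tools and applications can interpret and validate data uniformly.
from dataclasses import dataclass

@dataclass
class FieldSpec:
    name: str
    unit: str            # semantic unit, e.g. "AUD", "kg", "ISO-8601 date"
    minimum: float
    maximum: float
    description: str

    def validate(self, value):
        if not (self.minimum <= value <= self.maximum):
            raise ValueError(
                f"{self.name}={value} outside [{self.minimum}, {self.maximum}] {self.unit}")
        return value

# Invented entry, for illustration only.
credit_limit = FieldSpec("credit_limit", "AUD", 0, 1_000_000,
                         "Maximum credit extended to a customer")
credit_limit.validate(10_000)     # passes
# credit_limit.validate(-5)       # would raise ValueError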
