Monday, November 30, 2009

DDR3: Memories ain't Memories

Purpose: Summary/Intro to DDR3 memory with Intel Nehalem (Core i7, Xeon 3400, 3500, 5500, 7500). NOT a reference document. Dell biased. [Nov 2009]

Audience:
Those needing to configure memory sub-systems on Intel Nehalem/DDR3 systems:
  • Selecting between pre-configured systems.
  • Populating DDR3 memory on a new motherboard
  • Upgrading an existing system
  • Configuring a new server
Background: I got into trouble specifying some new Dell servers. DDR3 is different enough that you will most likely not get the result you intend.  This is about Intel, not AMD, processors though some concepts will transfer.

Selecting a DDR3 memory sub-system config requires knowing what you want to Optimise/Maximise:

Speed (Latency or Throughput) vs RAM Size vs Price vs Power vs H/A features

This note is skewed to Dell equipment; read their "DDR3 Technical Whitepaper" for PowerEdge 11G servers (PDF).
The Intel Xeon 3400 series architecture is described in this Intel PDF.
 

Dell's "Quick Reference Guide" for Installing and Upgrading DDR3 Memory in 11G servers contains an excellent summary of memory options for 5500 series processors. [There is an error in the last diagram: the text says 24GB, the diagram shows 36GB.]



Dell's "Transition Guide" for 11th Generation servers, (PDF), page 11, outlines their DDR3 memory restrictions by type, speed and ranks per MCH.


Reading the Intel doco, these acronyms were unclear to me:
UP = UniProcessor, DP = Dual Processor, MP = Multi Processor

Wikipedia has a good page on "DDR3 SDRAM", lots of technical info.
Their other articles describe "ranks" and the relationship to x4 and x8 (4-bit or 8-bit DRAM chips): DDR_SDRAM, and DIMM#Ranking.
Dual-rank (DR) DDR3 is made with x8 chips, Single-rank (SR) with x4 chips.
Quad-rank (QR) modules exist; I'm not sure of their chip organisation.

Basic DDR3 data:
1GB to 16GB DIMM's possible. 1, 2, 4, 8 GB sold by Dell.
Speed (data rate in MT/s, loosely written as MHz in this note), in steps of 266: 800, 1066, 1333, ?1600? (not available from Dell in 2009)
64-bit transfers. 72 bits transferred for ECC.
Cycle time (of the internal memory clock) and peak transfer rate per channel, in GB/s (1 GB = 1,000,000,000 bytes):
 800 = 10 ns,    6.4   GB/s
1066 = 7.5 ns,   8.533 GB/s
1333 = 6.0 ns,  10.667 GB/s
1600 = 5.0 ns,  12.8   GB/s
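
The figures in this list follow from simple arithmetic, sketched here in Python. It assumes a 64-bit (8-byte) data path per channel and that the DDR3 internal memory clock runs at one eighth of the data rate. The function and constant names are mine, for illustration only.

```python
# Sketch: derive cycle time and peak per-channel bandwidth from the DDR3 data rate.
BYTES_PER_TRANSFER = 8   # 64 data bits per transfer; the extra ECC bits are not payload
PREFETCH = 8             # assumption: internal memory clock = data rate / 8

# The nominal names round the real rates: "1066" is 1066.67 and "1333" is 1333.33.
RATES_MT_S = {"800": 800.0, "1066": 3200 / 3, "1333": 4000 / 3, "1600": 1600.0}

def cycle_time_ns(rate_mt_s):
    """Cycle time of the internal memory clock, in nanoseconds."""
    return 1000.0 * PREFETCH / rate_mt_s

def peak_gb_per_s(rate_mt_s):
    """Peak transfer rate per channel in GB/s (1 GB = 10**9 bytes)."""
    return rate_mt_s * 1e6 * BYTES_PER_TRANSFER / 1e9

for name, rate in RATES_MT_S.items():
    print(f"{name:>4}: {cycle_time_ns(rate):4.1f} ns, {peak_gb_per_s(rate):6.3f} GB/s")
```

These are theoretical peaks; real sustained throughput will be lower.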

Intel describe Nehalem as a 'tock' (new microarchitecture) on 45nm. A 'tick' (a die shrink of the same architecture) comes next.

32nm variants coming later.

Wikipedia on Intel Nehalem microarchitecture:
on-chip Memory Controller Hubs (MCH).
DDR3 memory, or DDR2 "Fully Buffered" on some chips.
No Northbridge chip on 'advanced' chips, with Southbridge collapsed into PCH (Platform Controller Hub).
QPI (Quick Path Interconnect) for NUMA: access non-local memory in multi-chip systems.
 Replaces FSB on high-end.

Wikipedia summary of the Xeon 5500 series (DP not UP). Note the 'basic', 'standard', 'advanced' ranges. DDR3 and QPI speeds change as well as CPU clock speed through the range.

There is an obvious throughput limiter: the common 8MB L3 cache and the MCH interface.
Presumably it's "non-blocking", i.e. for 3*MCH @ 1333MHz, around 32GB/sec (Intel X5570).
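
As a rough check of that figure, here is a sketch that multiplies the peak per-channel rate by the number of channels. It assumes 8 bytes per transfer and ideal, fully parallel channels, so it gives an upper bound rather than a measured value. The function name is made up for this example.

```python
def aggregate_peak_gb_per_s(channels, rate_mt_s):
    """Upper-bound bandwidth in GB/s across all channels of one processor."""
    bytes_per_transfer = 8  # 64-bit data path per channel
    return channels * rate_mt_s * 1e6 * bytes_per_transfer / 1e9

# Three channels at the nominal "1333" rate (really 1333.33 MT/s).
print(round(aggregate_peak_gb_per_s(3, 4000 / 3), 1))
```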


DDR3 memory sub-system Variables:

Prices and Sizes:
faster DIMM's are more expensive.
larger DIMM's are more expensive per GB.
8GB is largest from Dell and premium priced.
Registered DIMM's are more expensive and also use more power.
Different sized DIMM's can be mixed - but same size & rank in every corresponding slot.

Speeds:
DIMM speed - not the whole story. It only sets the fastest bus speed possible.
Bus Speed. All buses clock at the same rate, including with Dual Processor.
Processor Memory Controller Hub speed (MCH) - may be limited.
 5500 series (DP) has 3 MCH + 2 QPI links, but MCH speed varies from 800MHz to 1333MHz
 3400 series (UP) has 2 MCH at 1333MHz. May only support 2 slots/MCH.

NUMA:
Memory is tied to a chip and non-local memory is accessed across the QPI (direct MCH to MCH).
Without a second processor, only one set of DDR3 slots is accessible: on a DP board, this halves the available slots.
Processors on a DP board have to be identical. Can't mix'n'match different models.

Slots:
Max DIMM Slots/channel: 3.
 Some processors only support
Total DIMMs: Processors * MCH * slots/MCH.
  Some  boards only supply 2 slots/MCH.
  Lower-end motherboards may only provide 2 of 3 available MCH
slot 0 - furthest from MCH (?). Must be populated first. Implies geographic addressing.
Dell derates maximum bus speed:
 - slot 0: 1333
 - slot 1: 1066
 - slot 2:  800MHz
I.e. if anywhere on the bus there is a DIMM plugged into any 'slot 2', the bus is limited to 800MHz.
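
The slot-count formula and Dell's derating rule can be sketched as follows. This is an illustration of the rules as stated in this note, not Dell's or Intel's logic. It assumes slots are filled from slot 0 upwards, so the number of DIMM's on a channel tells you the deepest slot in use. All names are hypothetical.

```python
# DIMM's on the fullest channel -> maximum bus speed (Dell figures from this note).
SPEED_BY_DIMMS_PER_CHANNEL = {1: 1333, 2: 1066, 3: 800}

def total_slots(processors, channels_per_processor, slots_per_channel):
    """Total DIMMs: Processors * MCH * slots/MCH."""
    return processors * channels_per_processor * slots_per_channel

def bus_speed(dimms_per_channel, dimm_speed=1333, cpu_max_speed=1333):
    """Speed shared by every channel.

    dimms_per_channel: one count per channel, e.g. [2, 2, 2].
    The fullest channel, the DIMM rating and the processor limit each cap the result.
    """
    fullest = max(dimms_per_channel)
    if fullest == 0:
        raise ValueError("no DIMM's installed")
    return min(SPEED_BY_DIMMS_PER_CHANNEL[fullest], dimm_speed, cpu_max_speed)

print(total_slots(2, 3, 3))     # two 5500-series chips, 3 MCH each, 3 slots per MCH
print(bus_speed([1, 1, 1]))     # slot 0 only
print(bus_speed([3, 1, 1]))     # a single DIMM in any 'slot 2' slows every channel
```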

Ranks:
ranks - roughly 'chip select' or 'bank select'. A type of Interleaving.
 up to 8 ranks per Channel.
DIMM's come in 1-, 2-, 4-rank: SR, DR, QR
 single QR allowed (Dell), slot 0 only.
Differently ranked DIMM's can be mixed - but same size & rank in every corresponding slot.
 Dell don't supply SR or QR DIMM's as a server option. (2009, appears that way)
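
A minimal sketch of the rank rules listed above, for checking one channel's population. It encodes only what this note states (max 8 ranks per channel, a single QR DIMM, QR in slot 0 only); the real rules for a given server may differ, and the names are made up for the example.

```python
MAX_RANKS_PER_CHANNEL = 8
RANKS = {"SR": 1, "DR": 2, "QR": 4}

def channel_problems(dimms):
    """Return a list of rule violations for one channel.

    dimms: rank labels in slot order, slot 0 first, e.g. ["QR", "DR"].
    """
    problems = []
    total = sum(RANKS[d] for d in dimms)
    if total > MAX_RANKS_PER_CHANNEL:
        problems.append(f"{total} ranks exceeds {MAX_RANKS_PER_CHANNEL} per channel")
    qr_slots = [slot for slot, d in enumerate(dimms) if d == "QR"]
    if len(qr_slots) > 1:
        problems.append("more than one QR DIMM on the channel")
    if any(slot != 0 for slot in qr_slots):
        problems.append("QR DIMM outside slot 0")
    return problems

print(channel_problems(["DR", "DR", "DR"]))   # three dual-rank DIMM's: 6 ranks
print(channel_problems(["DR", "QR"]))         # QR in slot 1
```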

DRAM chips:
x4, x8 - DRAM chip organisation. 4- and 8-bit chips.
x4 is SR.
x8 is DR. QR organisation is unknown to me.

DIMM Types:
Unbuffered vs Registered (UDIMM, RDIMM)
Parity vs ECC
address parity only on RDIMM's.
RDIMM's use 1 Watt more per DIMM.
Can't mix DIMM types on a bus?

Other:
Other memory configs available: Mirror, Lock-step and SDDC (Advanced ECC)

  • "Mirror" mode - 2 DIMM's per MCH, but in RAID 1 config.
      For high reliability. Even with ECC, caters for dead chips/connection.
      reports half installed memory to O/S.
  • Advanced ECC mode - ganged MCH's. 128-bit fetches.
    Implements SDDC - Single Device Data Correction - for x8 DIMM's

Dell and Intel recommend "Balanced Configurations" and "Memory Optimised" modes.

Standard Memory organisations from Dell are:
  • High Performance (Slot 0 only used)
  • Balanced Performance
  • High Capacity (all slots populated)
  • Mirror Mode or Advanced ECC
  • Power Conscious
Intel Xeon Nehalem processor ranges:
  • 3400 ("Lynnfield") UP
  • 3500 ("Bloomfield") UP
  • 5500 ("Gainestown") DP
  • 7500 ("Beckton"?) MP. Not released (Q4-2009)
Notes:

The FSB has gone and is now replaced by Memory Controller Hubs (MCH) and QuickPath Interconnect (QPI). The word 'interleaving' is not used.

Each MCH can control just 3 DIMM's. But Dell's DDR3 bus clocks slower as you add more DIMM's: 1333MHz for 1 DIMM, 1066MHz for 2, 800MHz for 3 DIMM's.
There are Single-, Dual- and Quad-rank DIMM's. Max 8 ranks per MCH.
[Connection charts exist that visualise the rules]

The QPI allows a chip to access memory via off-chip MCH's.
Of course, if you only install one processor chip in a two-proc board, half the DIMM slots are unusable. You need the MCH's of the 2nd chip to get to them.

RDIMM's use more power, but can be larger and faster, support 'address parity', and are required for the more esoteric modes.

The Xeon 3400 series has 2 MCH's and often only 2 DIMM's per MCH on the motherboard.

The 5500 series has 3 MCH per chip. Some motherboards have only 2 DIMM's/MCH.
For a two-chip system, max 18 slots - with the caveat that only 9 are accessible if a single chip is used.

Upshot:
  • If you want 'cheap', you'll use UDIMM's@1066
  • If you want 'fast', you'll use 1 RDIMM@1333 per MCH and a CPU to match.
  • For 'max memory' - you're down to 800MHz and fill all slots.
  • For high reliability, you might use 'Mirror mode' or SDDC.

Speed: Throughput vs Latency:
There is around a 20% memory throughput improvement from going with 1333MHz memory where possible.

This piece on a SUN blog has a table of memory throughputs for Intel Xeon 5500 series with DDR3:
  • "Table 3 - Relative Bandwidth Comparisons"
  • "Table 4 - Relative DIMM Power Comparisons"

It lines up ranks, bus speed, DIMM speed and number of channels populated against 'throughput'.
[Not access speed. High-performance mode is 1*DIMM per channel]



I've not been able to get clear info of the effect of DDR3 "ranks" on performance.
The tables in the SUN piece seem to imply that memory controllers understand 'ranks' and can utilise them to improve throughput (not latency).

Another caveat is that SUN claim to be able to run their memory bus @ 1333MHz even if 2 or 3 slots per channel are filled. The generic and Dell doco I've found says the bus derates to 1066MHz for 2 slots and 800MHz for 3 slots.

SUN don't give the bus throughput for "100%". A snippet from their table.
 

1xDR @ 1066 = 36%
1xDR @ 1333 = 44%
2xDR @ 1066 = 68%
3xDR @ 800  = 74%
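
To compare two rows of the snippet directly, divide one by the other. The sketch below does that with the four figures exactly as quoted here. The keys and function name are my own, and the percentages are only relative to SUN's unstated "100%".

```python
# (DR DIMM's per channel, bus speed) -> relative bandwidth in percent, as quoted in this note.
RELATIVE_BW = {(1, 1066): 36, (1, 1333): 44, (2, 1066): 68, (3, 800): 74}

def gain_percent(base, other):
    """Throughput of `other` relative to `base`, as a percentage gain."""
    return (RELATIVE_BW[other] / RELATIVE_BW[base] - 1) * 100

print(round(gain_percent((1, 1066), (1, 1333))))   # 1333 vs 1066, one DIMM per channel
print(round(gain_percent((1, 1066), (2, 1066))))   # adding a second DR DIMM per channel
print(round(gain_percent((2, 1066), (3, 800))))    # filling the third slot at 800
```

On these figures the step from 1066 to 1333 with one DIMM per channel is a little over 20%, which is where the "around 20%" estimate comes from.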


Quotes from the SUN conclusions:

Key takeaways from the above (Table 3 on bandwidth) are:

  • for DIMM configurations that support both speeds, memory bandwidth is up to 12% higher with 1333 DIMMs than with 1066 DIMMs (which is not obvious from chart since it shows bandwidths relative to 1x DR per channel and not a direct comparison between 1333 and 1066 for a given configuration)
  • for a given capacity, one dual rank DIMM per channel provides higher bandwidth performance than two single rank DIMMs per channel
and
From the table (4: Power) it can be determined that:

  • for a particular DIMM configuration and bus speed, DDR3-1333 DIMMs consume up to 6% less power than DDR3-1066 modules
  • for a given DIMM configuration, the incremental power required to operate DDR3-1333 DIMMs at 1333MT/s data rate vs. 1066MT/s is 8% or less (also not obvious from chart since it shows power relative to 1x DR per channel and not a direct comparison between 1333 and 1066 for a given configuration)
and somewhat less obvious but equally important:
  • a dual rank DIMM operating at 1333MT/s consumes less power than two single rank DIMMs at 1066MT/s