A DRAM failure

One of our Ryzen-based servers started misbehaving after almost a year of flawless operation: it spontaneously rebooted twice (25 minutes apart) and then reported 19 ECC errors over the next 14 hours, at which point we shut it down for repair. The log contained messages like:

Mar 29 04:10:35 a4 kernel: [ 6542.673070] mce: [Hardware Error]: Machine check events logged
Mar 29 04:10:35 a4 kernel: [ 6542.673079] Memory failure: 0x6a0430: recovery action for free buddy page: Delayed
Mar 29 04:10:35 a4 kernel: [ 6542.673116] [Hardware Error]: Deferred error, no action required.
Mar 29 04:10:35 a4 kernel: [ 6542.673119] [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Mar 29 04:10:35 a4 kernel: [ 6542.673124] [Hardware Error]: Error Addr: 0x00000006a04302c0
Mar 29 04:10:35 a4 kernel: [ 6542.673125] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x77778a000b800000
Mar 29 04:10:35 a4 kernel: [ 6542.673128] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Mar 29 04:10:35 a4 kernel: [ 6542.673140] EDAC MC0: 1 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:0x1ab10c0 offset:0x9c0 grain:64)
Mar 29 04:10:35 a4 kernel: [ 6542.673142] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

18 of the errors were uncorrectable (UE), 1 was correctable (CE). We noticed no software crashes from the uncorrectable errors, so the processor probably noticed the errors during scrubbing (it walks through the memory to detect and correct any flipped bits before they accumulate) rather than on a software read access; the "Deferred" and "Scrub" flags in the MC18_STATUS line are consistent with that. All these messages refer to "mc#0csrow#0channel#1" (csrow 0, channel 1), which indicates that one specific DIMM causes these problems, rather than the memory controller (i.e., the CPU) or the mainboard in general. Also, when doing

  grep . /sys/devices/system/edac/mc/mc*/rank*/dimm_?e_count

we saw that all errors were in rank1 (i.e., the same side of the same DIMM). The question was: which DIMM should we replace?
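
The same per-location tally can be taken from the kernel log itself. What follows is only a minimal sketch, assuming EDAC lines in the format shown in the log excerpt above; the function name edac_tally is made up here and is not part of any tool.

```shell
# Tally EDAC error reports per (type, location) from a saved kernel log.
# Assumes lines like: "EDAC MC0: 1 UE on mc#0csrow#0channel#1 (...)".
edac_tally() {
    awk '/EDAC MC[0-9]+: [0-9]+ (UE|CE) / {
        type = ""; loc = ""; count = 0
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^(UE|CE)$/) {
                type = $i; count = $(i - 1)   # the number precedes UE/CE
            } else if ($i ~ /^mc#/) {
                loc = $i                      # e.g. mc#0csrow#0channel#1
            }
        }
        n[type " " loc] += count
    }
    END { for (k in n) print n[k], k }' "$1" | sort -k2
}
```

Called with a log file as its argument (the path depends on the distribution), it prints one line per error type and location; for the single EDAC line in the excerpt above that would be "1 UE mc#0csrow#0channel#1".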

  dmidecode -t memory

gives us some information about the DIMMs, but unfortunately nothing that can be matched against the information about the ranks. We eventually resorted to pulling out one DIMM at a time and checking which ranks were still present when doing the grep above (fortunately only ranks that are actually present are shown), repeating that until rank1 was no longer there. Then we replaced that DIMM with a working one, and we again have a machine that does not frequently produce ECC errors.
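
The pull-and-check procedure is easier to follow with one line per rank. Here is a minimal sketch, assuming that each rank directory contains the files dimm_location, dimm_ue_count and dimm_ce_count (the counters are the ones matched by the grep above); the function name edac_ranks is made up here.

```shell
# List the ranks the EDAC driver currently reports, one per line,
# with their location and error counters.
# The optional argument overrides the sysfs directory used in the text.
edac_ranks() {
    base=${1:-/sys/devices/system/edac/mc}
    for d in "$base"/mc*/rank*; do
        [ -d "$d" ] || continue      # glob did not match: no ranks reported
        printf '%s: %s (UE %s, CE %s)\n' \
            "${d#"$base"/}" \
            "$(cat "$d/dimm_location")" \
            "$(cat "$d/dimm_ue_count")" \
            "$(cat "$d/dimm_ce_count")"
    done
}
```

Running edac_ranks before and after pulling a DIMM shows which ranks have disappeared and therefore belong to that DIMM.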

In case you also have an ASUS TUF Gaming B550M-Plus board, here are the mappings we found between ranks and DIMM slots.

      dimm_location  DIMM
rank  csrow channel  slot
   0      0       0  B1
   1      0       1  A1
   2      1       0  B1
   3      1       1  A1
   4      2       0  B2
   5      2       1  A2
   6      3       0  B2
   7      3       1  A2
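
For scripting, the table can be written as a small lookup. This is only a sketch of the mapping we observed on our board; the function name slot_for_rank is made up here, and other boards (or other BIOS versions) may map ranks differently.

```shell
# Map an EDAC rank number to the DIMM slot label, as observed on one
# ASUS TUF Gaming B550M-Plus board. Not valid for other boards.
slot_for_rank() {
    case "$1" in
        0|2) echo B1 ;;
        1|3) echo A1 ;;
        4|6) echo B2 ;;
        5|7) echo A2 ;;
        *)   echo "unknown rank: $1" >&2; return 1 ;;
    esac
}
```

With this, slot_for_rank 1 prints A1, the slot of the DIMM that held the failing rank1 in our case.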

Did ECC help in this case? Given that 18 of the 19 errors were uncorrectable, one may doubt that. However, the ECC reports were the only indication of a hardware problem we saw, and they allowed us to pinpoint the failing component relatively quickly. That would have been hard to do from occasional software failures alone; would we have started to use memtest after a few unexplained software crashes?

Just in case you are wondering about the symptoms: With a degraded bit in a DRAM device (i.e., DRAM chip, 9 per side on our DIMMs) we would expect correctable errors, and no spontaneous reboots. In case of some other degradation of a DRAM device we would, in most cases, expect errors to be logged first. Our theory is that some common cause resulted in the spontaneous reboots and also caused degradation of a more significant part than one bit in a DRAM device (my wild guess would be that a row decoder was damaged). The fact that we don't know what this cause was worries me.

Anton Ertl