Machine Check Exceptions (MCE)
Over the last month or so my homelab server has been rebooting unexpectedly. This started shortly after I began using InfluxDB to collect metrics from the host system, but disabling that has not resolved the problem. After one of these reboots, I noticed for the first time that a machine check exception (MCE) had been logged in the system journal. It happened again today, hence this post.
Sep 25 08:00:44 pve1 kernel: mce: [Hardware Error]: Machine check events logged
Sep 25 08:00:44 pve1 kernel: mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5: bea0000000000108
Sep 25 08:00:44 pve1 kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff8fd156d6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Sep 25 08:00:44 pve1 kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1664118033 SOCKET 0 APIC 5 microcode 8001138
Sep 25 08:00:44 pve1 kernel: mce: [Hardware Error]: Machine check events logged
Sep 25 08:00:44 pve1 kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 5: bea0000000000108
Sep 25 08:00:44 pve1 kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff90a01186 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Sep 25 08:00:44 pve1 kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1664118033 SOCKET 0 APIC d microcode 8001138
The interesting value here is bea0000000000108. After some googling, I came across results I had seen before about existing AMD CPU errata, as mentioned in the Arch Forums. I’d already added the options below to my kernel command line, updated to the latest available/recommended BIOS version, and disabled various power-saving options in the BIOS.
rcu_nocbs=0-15 processor.max_cstate=1 idle=nomwait
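To see what the status value actually says, here is a minimal Python sketch that pulls apart the flag bits of bea0000000000108 and converts the TIME field from the log. It assumes the generic x86 MCi_STATUS layout for bits 57 to 63 and the low 16 bits, and that TIME is a Unix timestamp in seconds; the function names are my own, and the AMD-specific bits in between are deliberately left undecoded.

```python
from datetime import datetime, timezone

# Flag bits shared by x86 MCi_STATUS registers (assumed generic layout;
# vendor-specific bits are not decoded here).
STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled for this bank
    "MISCV": 59,  # MISC register contents are valid
    "ADDRV": 58,  # ADDR register contents are valid
    "PCC": 57,    # processor context may be corrupt
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flag bits and the MCA error code."""
    fields = {name: (status >> bit) & 1 for name, bit in STATUS_FLAGS.items()}
    fields["error_code"] = status & 0xFFFF  # low 16 bits
    return fields


def mce_time_utc(epoch_seconds: int) -> datetime:
    """Interpret the TIME field as Unix seconds (assumption) and return UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    for name, value in decode_status(0xBEA0000000000108).items():
        print(f"{name:>10}: {value:#x}")
    print(mce_time_utc(1664118033).isoformat())
```

For this value, VAL, UC, EN, MISCV, ADDRV, and PCC are all set, OVER is clear, and the error code is 0x108. In other words, it is an uncorrected error with possibly corrupt processor context, which fits a spontaneous reboot. TIME works out to 15:00:33 UTC on 2022-09-25. If the host clock is at UTC-7, that is 08:00:33 local time, eleven seconds before the journal entries, which suggests the error was recorded before the reset and only reported on the next boot.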
I ran good old memtest86 for several hours with no errors reported, so I don’t think it’s a memory issue.
Interestingly, there’s a lot of mention out there of this being an issue with the GPU, possibly relating to power state changes. That alone isn’t that interesting, but my system runs headless (the 1700X has no integrated graphics and I have no GPU installed)! Suggestions on the AMD forums and elsewhere include adding more kernel command line parameters:
amdgpu.ppfeaturemask=0xffffbfff amdgpu.dpm=0
I can add these, but I wouldn’t expect them to do anything given the absence of any GPU, let alone an AMD GPU/APU.
My suspicion is that the recent heatwave in the American Southwest has accelerated hardware aging and I’m seeing a hardware-level failure in the memory, PSU, or CPU. This is cheap, three-year-old, consumer-grade hardware running in my garage, so this isn’t much of a stretch.
I’ll probably replace it with an AMD 5700G now that prices are coming down and if I still have problems, I’ll replace the motherboard, memory, and PSU.
SSD Failures
I obtained my first SSD nearly 15 years ago; I can’t remember exactly what it was, but it was small (a handful of GB) and slow. Windows XP worked on it, and it had good IOPS compared to spinning rust but terrible throughput.
I’ve been on the SSD bandwagon ever since. Most of the SSDs I’ve seen deployed didn’t fail before we retired and replaced the hardware at the end of its depreciation period, but in the last few years a few have started to fail.
I burned this SSD out by using it for ZFS L2ARC for the better part of a year. I suspect write amplification may have brought on its early demise, but I simply disabled L2ARC and moved on with life. The drive’s read performance is now relatively bad (~50 MB/sec) and writes to it eventually lead to errors like Buffer I/O error on dev sdb, logical block 6402, lost async page write, aka “it’s dead, Jim”. Anyways, since I’ve been doing so much maintenance on my homelab server, I noticed this in the journal:
Sep 25 14:23:22 pve1 smartd[4784]: Device: /dev/bus/9 [megaraid_disk_14] [SAT], Failed SMART usage Attribute: 202 Percent_Lifetime_Remain.
And reading the SMART attributes shows:
root@pve1:~# smartctl -a /dev/bus/9 -d megaraid,14
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.53-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron Client SSDs
Device Model: CT500MX500SSD1
Firmware Version: M3CR020
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 29100
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 196
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 001 001 000 Old_age Always - 1486
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 92
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 44
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 059 042 000 Old_age Always - 41 (Min/Max 0/58)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 001 001 001 Old_age Offline FAILING_NOW 99
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 19698105935
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 3656623244
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 22684509485
So this drive has 1% of its useful life left. Maybe I’ll replace the entire array with some of the cheap 1 TB Crucial SSDs I’ve seen on Slickdeals if they go on sale again around Black Friday.
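Since I keep blaming write amplification, here is a rough Python sketch that estimates it from the attributes above. It assumes attribute 246 counts 512-byte logical sectors (matching the logical sector size smartctl reports) and uses the commonly cited formula for Crucial/Micron drives, (attribute 247 + attribute 248) / attribute 247. Treat both as assumptions rather than anything confirmed for this firmware; the function names are my own.

```python
# Assumed unit for attribute 246, taken from "Sector Sizes: 512 bytes logical".
LOGICAL_SECTOR_BYTES = 512


def host_bytes_written(total_lbas_written: int,
                       sector_bytes: int = LOGICAL_SECTOR_BYTES) -> int:
    """Convert Total_LBAs_Written (attribute 246) into bytes."""
    return total_lbas_written * sector_bytes


def write_amplification(host_pages: int, ftl_pages: int) -> float:
    """Estimate write amplification from attributes 247 and 248.

    Assumed formula: (host pages + FTL pages) / host pages.
    """
    return (host_pages + ftl_pages) / host_pages


if __name__ == "__main__":
    written = host_bytes_written(19698105935)
    waf = write_amplification(3656623244, 22684509485)
    print(f"Host writes: {written / 1e12:.2f} TB")
    print(f"Estimated write amplification: {waf:.2f}")
```

With the numbers from the table, that comes to roughly 10 TB of host writes and an estimated write amplification of about 7.2. If the formula holds for this drive, the flash has absorbed several times more writes than the host actually issued.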