Homelab Troubleshooting
I spent today debugging several issues I’ve had with a server in my homelab. The server happens to be part of a Proxmox cluster, which makes the issues more annoying.
HDDs not enumerating
I have four SATA HDDs in a USB-C enclosure attached to the host. The four drives form a ZFS raidz pool and are identified by their World Wide Name (wwn).
Occasionally, when the system reboots or powers on, only three of the four disks have their wwn- symlink appear in /dev/disk/by-id/:
lrwxrwxrwx 1 root root 9 Sep 24 09:41 wwn-0x50014ee210f452db -> ../../sdk
lrwxrwxrwx 1 root root 10 Sep 24 09:41 wwn-0x50014ee210f452db-part1 -> ../../sdk1
lrwxrwxrwx 1 root root 10 Sep 24 09:41 wwn-0x50014ee210f452db-part9 -> ../../sdk9
lrwxrwxrwx 1 root root 9 Sep 24 09:41 wwn-0x50014ee2664999d1 -> ../../sdj
lrwxrwxrwx 1 root root 10 Sep 24 09:41 wwn-0x50014ee2664999d1-part1 -> ../../sdj1
lrwxrwxrwx 1 root root 10 Sep 24 09:41 wwn-0x50014ee2664999d1-part9 -> ../../sdj9
lrwxrwxrwx 1 root root 9 Sep 24 09:41 wwn-0x50014ee2bb9f24e0 -> ../../sdl
lrwxrwxrwx 1 root root 10 Sep 24 09:41 wwn-0x50014ee2bb9f24e0-part1 -> ../../sdl1
lrwxrwxrwx 1 root root 10 Sep 24 09:41 wwn-0x50014ee2bb9f24e0-part9 -> ../../sdl9
lrwxrwxrwx 1 root root 9 Sep 24 09:41 wwn-0x50025385a013ece8 -> ../../sdi
This results in the pool being degraded:
root@pve1:/sys/bus/pci_express/devices# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 1.19G in 00:00:28 with 0 errors on Tue Sep 20 08:55:45 2022
config:
        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          raidz1-0                  DEGRADED     0     0     0
            wwn-0x50014ee2664999d1  ONLINE       0     0     0
            wwn-0x50014ee210f452db  ONLINE       0     0     0
            wwn-0x50014ee2bb9f24e0  ONLINE       0     0     0
            wwn-0x50014ee210f434b7  UNAVAIL      0     0     0
errors: No known data errors
Up until today I’ve been lazy and solved it by shutting down the system, powering down the enclosure, powering up the enclosure, and then turning on the system. This has reliably ensured that the enclosure and pool start correctly. Today I tried rescanning the SCSI hosts and re-triggering udev instead:
root@pve1:/sys/class/scsi_disk# echo "- - -" | tee /sys/class/scsi_host/host1{0,1,2,3}/scan
- - -
root@pve1:/sys/class/scsi_disk# sudo udevadm trigger
Afterwards, the missing disk shows up in /dev/disk/by-id/:
lrwxrwxrwx 1 root root 9 Sep 24 09:56 wwn-0x50014ee210f434b7 -> ../../sdm
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee210f434b7-part1 -> ../../sdm1
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee210f434b7-part9 -> ../../sdm9
lrwxrwxrwx 1 root root 9 Sep 24 09:56 wwn-0x50014ee210f452db -> ../../sdk
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee210f452db-part1 -> ../../sdk1
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee210f452db-part9 -> ../../sdk9
lrwxrwxrwx 1 root root 9 Sep 24 09:56 wwn-0x50014ee2664999d1 -> ../../sdj
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee2664999d1-part1 -> ../../sdj1
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee2664999d1-part9 -> ../../sdj9
lrwxrwxrwx 1 root root 9 Sep 24 09:56 wwn-0x50014ee2bb9f24e0 -> ../../sdl
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee2bb9f24e0-part1 -> ../../sdl1
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee2bb9f24e0-part9 -> ../../sdl9
And the array automatically resilvers:
root@pve1:~# zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 29.6M in 00:00:03 with 0 errors on Sat Sep 24 09:56:38 2022
config:
        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x50014ee2664999d1  ONLINE       0     0     0
            wwn-0x50014ee210f452db  ONLINE       0     0     0
            wwn-0x50014ee2bb9f24e0  ONLINE       0     0     0
            wwn-0x50014ee210f434b7  ONLINE       0     0     0
errors: No known data errors
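The check-and-rescan steps above could be wrapped in a small script. This is only a sketch under a few assumptions: the function names and the BY_ID_DIR variable are made up for illustration, and unlike the command I ran (which only touched host10 through host13) it asks every SCSI host to rescan, which may or may not be what you want on your system.

```shell
#!/bin/sh
# Sketch: find expected WWN symlinks that are missing, and rescan if needed.
# BY_ID_DIR, missing_wwns and rescan_scsi_hosts are illustrative names only.

BY_ID_DIR="${BY_ID_DIR:-/dev/disk/by-id}"

# Print each expected WWN (given as arguments) that has no symlink in BY_ID_DIR.
missing_wwns() {
    for wwn in "$@"; do
        [ -e "$BY_ID_DIR/wwn-$wwn" ] || echo "$wwn"
    done
}

# Ask every writable SCSI host to rescan its bus, then replay udev events.
# Assumption: rescanning all hosts (not just host10..host13) is acceptable.
rescan_scsi_hosts() {
    for scan in /sys/class/scsi_host/host*/scan; do
        [ -w "$scan" ] && echo "- - -" > "$scan"
    done
    udevadm trigger
}

# Example use (run as root), with the WWNs of the four pool members:
#   missing=$(missing_wwns 0x50014ee2664999d1 0x50014ee210f452db \
#                          0x50014ee2bb9f24e0 0x50014ee210f434b7)
#   [ -n "$missing" ] && rescan_scsi_hosts
```

The detection is kept separate from the rescan so the check can be run on its own, for example from a boot-time unit, without touching /sys unless a disk is actually missing.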
Unexpected serial port changes
My server has a serial port that I can use as a console, but I was surprised to find that once systemd began starting up, the console would stop working. Output from grub and the kernel worked fine.
I checked the serial port and noticed that the baud rate had switched from the kernel command line setting to a much lower speed, and that it kept changing between checks:
root@pve1:~# stty -a -F /dev/ttyS0
speed 1200 baud; rows 24; columns 80; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>; eol2 = <undef>; swtch = <undef>; start = ^Q;
stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W; lnext = ^V; discard = ^O; min = 0; time = 5;
-parenb -parodd -cmspar cs8 -hupcl -cstopb cread clocal -crtscts
ignbrk -brkint ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon -ixoff -iuclc -ixany -imaxbel -iutf8
-opost -olcuc -ocrnl -onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
-isig -icanon -iexten -echo -echoe -echok -echonl -noflsh -xcase -tostop -echoprt -echoctl -echoke -flusho -extproc
root@pve1:~# stty -a -F /dev/ttyS0
speed 2400 baud; rows 24; columns 80; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>; eol2 = <undef>; swtch = <undef>; start = ^Q;
stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W; lnext = ^V; discard = ^O; min = 0; time = 5;
-parenb -parodd -cmspar cs8 -hupcl -cstopb cread clocal -crtscts
ignbrk -brkint ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon -ixoff -iuclc -ixany -imaxbel -iutf8
-opost -olcuc -ocrnl -onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
-isig -icanon -iexten -echo -echoe -echok -echonl -noflsh -xcase -tostop -echoprt -echoctl -echoke -flusho -extproc
Setting the baud rate manually to 115200 would make it briefly work:
root@pve1:~# stty -F /dev/ttyS0 115200
But after 10 seconds or so the port would revert to 1200 baud and the console output would break again.
Turns out the problem was my UPS monitoring software, pwrstatd:
root@pve1:~# lsof -n | grep ttyS0
pwrstatd 4892 root 4u CHR 4,64 0t0 89 /dev/ttyS0
agetty 4927 root 0u CHR 4,64 0t0 89 /dev/ttyS0
agetty 4927 root 1u CHR 4,64 0t0 89 /dev/ttyS0
agetty 4927 root 2u CHR 4,64 0t0 89 /dev/ttyS0
agetty 4927 root 4r a_inode 0,14 0 12461 inotify
I modified /etc/pwrstatd.conf:
allowed-device-nodes = libusb
But that didn’t fix it. I tried another setting:
allowed-device-nodes = libusb;hiddev;ttyUSB
And the service stopped trying to access /dev/ttyS0:
root@pve1:~# lsof -n | grep ttyS0
agetty 89648 root 0u CHR 4,64 0t0 89 /dev/ttyS0
agetty 89648 root 1u CHR 4,64 0t0 89 /dev/ttyS0
agetty 89648 root 2u CHR 4,64 0t0 89 /dev/ttyS0
Yay. Restarted getty and success!
root@pve1:~# systemctl stop serial-getty@ttyS0.service
root@pve1:~# systemctl start serial-getty@ttyS0.service
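If the port ever starts drifting again, a small watcher could catch the culprit faster than checking stty by hand. This is a rough sketch, not something I ran during the debugging above: the function names are invented for illustration, the expected speed is whatever you pass in, and it assumes GNU stty and lsof are installed.

```shell
#!/bin/sh
# Sketch: watch a serial port's baud rate and show who has it open when it drifts.
# parse_speed and watch_baud are illustrative names, not part of any tool.

# Read `stty -a` output on stdin and print only the numeric speed.
parse_speed() {
    sed -n 's/^speed \([0-9][0-9]*\) baud.*/\1/p'
}

# Poll the port every few seconds; on a mismatch, log it and list open handles.
# Arguments: port, expected speed, optional poll interval in seconds (default 5).
watch_baud() {
    port="$1"; expected="$2"; interval="${3:-5}"
    while :; do
        current=$(stty -a -F "$port" | parse_speed)
        if [ "$current" != "$expected" ]; then
            echo "$(date) $port is at $current baud, expected $expected"
            lsof -n "$port"
        fi
        sleep "$interval"
    done
}

# Example use (run as root):
#   watch_baud /dev/ttyS0 115200
```

Anything other than agetty in the lsof output is a good candidate for whatever is reconfiguring the port.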
Mysterious Boot Failure
The same Proxmox server runs headless: I somehow got it to boot with a Mellanox ConnectX-3 adapter installed instead of a video card.
After updating the kernel I noticed that the system didn’t come back up. The serial console didn’t even show output from grub, so there was either a significant hardware problem or the bootloader had somehow become corrupted. I tried power cycling it several times but it would never boot. Finally I replaced the Mellanox card with an old video card so I could see what was going on.
Apparently during a previous power cycle the LSI MegaRAID 9260-8i card had an unclean shutdown and was not able to flush its write cache. The card decided to let me know by prompting me to press any key to continue or to press ‘C’ to enter the configuration utility. The prompt is a one-time event (per occurrence, I’m sure), so after pressing spacebar and verifying grub came up, I tested that grub would come up after a reboot, then restored the Mellanox card. The system booted up fine after that. The full prompt read:
Cache data was lost due to an unexpected power-off or reboot during a write operation, but the adapter has recovered. This could be due to a memory problem, bad battery, or you may not have a battery installed.
Press any key to continue or 'C' to load the configuration utility.
Until recently the LSI card had a BBU-iBBU08 battery pack installed, but I removed it when I noticed significant bulging on the side of the battery pack. I bought the battery pack three years ago, so it had a good run.