Homelab Troubleshooting

I spent today debugging several issues I’ve had with a server in my homelab. The server is part of a Proxmox cluster, which makes the issues more annoying.

HDDs not enumerating

I have four SATA HDDs in a USB-C enclosure attached to the host. The four drives form a ZFS raidz pool and are identified by their World Wide Name (WWN).

Occasionally, when the system reboots or powers on, only three of the four disks have their wwn- symlink appear in /dev/disk/by-id/.

lrwxrwxrwx 1 root root  9 Sep 24 09:41 wwn-0x50014ee210f452db -> ../../sdk
lrwxrwxrwx 1 root root 10 Sep 24 09:41 wwn-0x50014ee210f452db-part1 -> ../../sdk1
lrwxrwxrwx 1 root root 10 Sep 24 09:41 wwn-0x50014ee210f452db-part9 -> ../../sdk9
lrwxrwxrwx 1 root root  9 Sep 24 09:41 wwn-0x50014ee2664999d1 -> ../../sdj
lrwxrwxrwx 1 root root 10 Sep 24 09:41 wwn-0x50014ee2664999d1-part1 -> ../../sdj1
lrwxrwxrwx 1 root root 10 Sep 24 09:41 wwn-0x50014ee2664999d1-part9 -> ../../sdj9
lrwxrwxrwx 1 root root  9 Sep 24 09:41 wwn-0x50014ee2bb9f24e0 -> ../../sdl
lrwxrwxrwx 1 root root 10 Sep 24 09:41 wwn-0x50014ee2bb9f24e0-part1 -> ../../sdl1
lrwxrwxrwx 1 root root 10 Sep 24 09:41 wwn-0x50014ee2bb9f24e0-part9 -> ../../sdl9
lrwxrwxrwx 1 root root  9 Sep 24 09:41 wwn-0x50025385a013ece8 -> ../../sdi

This results in the pool being degraded:

root@pve1:/sys/bus/pci_express/devices# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 1.19G in 00:00:28 with 0 errors on Tue Sep 20 08:55:45 2022
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          raidz1-0                  DEGRADED     0     0     0
            wwn-0x50014ee2664999d1  ONLINE       0     0     0
            wwn-0x50014ee210f452db  ONLINE       0     0     0
            wwn-0x50014ee2bb9f24e0  ONLINE       0     0     0
            wwn-0x50014ee210f434b7  UNAVAIL      0     0     0

errors: No known data errors
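
Spotting the missing disk by eye gets old, so here is a minimal sketch of a helper that pulls the UNAVAIL device names out of `zpool status` text. The function name `unavail_vdevs` is made up for illustration, and it assumes the default column layout shown above (NAME, STATE, READ, WRITE, CKSUM).

```shell
# Print the NAME of every row in the config table whose STATE is UNAVAIL.
# Reads `zpool status` output on stdin; assumes the default column layout.
unavail_vdevs() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # table header row
        in_config && NF == 0          { in_config = 0 }         # blank line ends the table
        in_config && $2 == "UNAVAIL"  { print $1 }
    '
}

# Example (on the host):
#   zpool status tank | unavail_vdevs
```

Fed the degraded output above, this should print only the one device that is missing from /dev/disk/by-id/.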

Up until today I’ve been lazy and solved it by shutting down the system, power cycling the enclosure, and then turning the system back on. This has reliably brought the enclosure and pool up correctly. Today I rescanned the SCSI hosts and re-triggered udev instead:

root@pve1:/sys/class/scsi_disk# echo "- - -" | tee /sys/class/scsi_host/host1{0,1,2,3}/scan
- - -
root@pve1:/sys/class/scsi_disk# sudo udevadm trigger

Afterwards, all four disks show up in /dev/disk/by-id/:

lrwxrwxrwx 1 root root  9 Sep 24 09:56 wwn-0x50014ee210f434b7 -> ../../sdm
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee210f434b7-part1 -> ../../sdm1
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee210f434b7-part9 -> ../../sdm9
lrwxrwxrwx 1 root root  9 Sep 24 09:56 wwn-0x50014ee210f452db -> ../../sdk
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee210f452db-part1 -> ../../sdk1
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee210f452db-part9 -> ../../sdk9
lrwxrwxrwx 1 root root  9 Sep 24 09:56 wwn-0x50014ee2664999d1 -> ../../sdj
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee2664999d1-part1 -> ../../sdj1
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee2664999d1-part9 -> ../../sdj9
lrwxrwxrwx 1 root root  9 Sep 24 09:56 wwn-0x50014ee2bb9f24e0 -> ../../sdl
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee2bb9f24e0-part1 -> ../../sdl1
lrwxrwxrwx 1 root root 10 Sep 24 09:56 wwn-0x50014ee2bb9f24e0-part9 -> ../../sdl9

And the array automatically resilvers:

root@pve1:~# zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 29.6M in 00:00:03 with 0 errors on Sat Sep 24 09:56:38 2022
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x50014ee2664999d1  ONLINE       0     0     0
            wwn-0x50014ee210f452db  ONLINE       0     0     0
            wwn-0x50014ee2bb9f24e0  ONLINE       0     0     0
            wwn-0x50014ee210f434b7  ONLINE       0     0     0

errors: No known data errors
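
The manual rescan could be wrapped up so it only runs when a symlink is actually missing. This is a rough sketch, not something tested on every boot: the function names `missing_wwns` and `rescan_hosts` are invented here, and the host numbers in the example are the ones from this machine and will differ elsewhere.

```shell
# Print every expected link name that is missing from the by-id directory.
# $1 is the directory (normally /dev/disk/by-id); the rest are link names.
missing_wwns() {
    dir=$1; shift
    for name in "$@"; do
        # -e follows the symlink, so a dangling link also counts as missing
        [ -e "$dir/$name" ] || printf '%s\n' "$name"
    done
}

# Rescan the given SCSI hosts and replay udev events, as in the transcript above.
# Must run as root.
rescan_hosts() {
    for host in "$@"; do
        echo "- - -" > "/sys/class/scsi_host/$host/scan"
    done
    udevadm trigger
    udevadm settle   # wait for udev to finish processing the events
}

# Example (on the host, as root):
#   missing=$(missing_wwns /dev/disk/by-id \
#       wwn-0x50014ee2664999d1 wwn-0x50014ee210f452db \
#       wwn-0x50014ee2bb9f24e0 wwn-0x50014ee210f434b7)
#   [ -z "$missing" ] || rescan_hosts host10 host11 host12 host13
```

Keeping the check separate from the rescan means the check can be run on its own, for example from a monitoring script, without touching /sys.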

Unexpected serial port changes

My server has a serial port that I use as a console, but I was surprised to find that once systemd began starting up, the console would stop working. GRUB and kernel output worked fine.

I checked the serial port and noticed that the baud rate kept switching from the kernel command line setting to much lower speeds:

root@pve1:~# stty -a -F /dev/ttyS0
speed 1200 baud; rows 24; columns 80; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>; eol2 = <undef>; swtch = <undef>; start = ^Q;
stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W; lnext = ^V; discard = ^O; min = 0; time = 5;
-parenb -parodd -cmspar cs8 -hupcl -cstopb cread clocal -crtscts
ignbrk -brkint ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon -ixoff -iuclc -ixany -imaxbel -iutf8
-opost -olcuc -ocrnl -onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
-isig -icanon -iexten -echo -echoe -echok -echonl -noflsh -xcase -tostop -echoprt -echoctl -echoke -flusho -extproc
root@pve1:~# stty -a -F /dev/ttyS0
speed 2400 baud; rows 24; columns 80; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>; eol2 = <undef>; swtch = <undef>; start = ^Q;
stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W; lnext = ^V; discard = ^O; min = 0; time = 5;
-parenb -parodd -cmspar cs8 -hupcl -cstopb cread clocal -crtscts
ignbrk -brkint ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon -ixoff -iuclc -ixany -imaxbel -iutf8
-opost -olcuc -ocrnl -onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
-isig -icanon -iexten -echo -echoe -echok -echonl -noflsh -xcase -tostop -echoprt -echoctl -echoke -flusho -extproc

Setting the baud rate manually to 115200 would make it briefly work:

root@pve1:~# stty -F /dev/ttyS0 115200

But after 10 seconds or so the port would revert to 1200 and the console output would stop working.

Turns out the problem was my UPS monitoring software, pwrstatd:

root@pve1:~# lsof -n | grep ttyS0
pwrstatd   4892                             root    4u      CHR               4,64          0t0         89 /dev/ttyS0
agetty     4927                             root    0u      CHR               4,64          0t0         89 /dev/ttyS0
agetty     4927                             root    1u      CHR               4,64          0t0         89 /dev/ttyS0
agetty     4927                             root    2u      CHR               4,64          0t0         89 /dev/ttyS0
agetty     4927                             root    4r  a_inode               0,14            0      12461 inotify

I modified /etc/pwrstatd.conf:

allowed-device-nodes = libusb

But that didn’t fix it. I tried another setting:

allowed-device-nodes = libusb;hiddev;ttyUSB

And the service stopped trying to access /dev/ttyS0:

root@pve1:~# lsof -n | grep ttyS0
agetty    89648                             root    0u      CHR               4,64          0t0         89 /dev/ttyS0
agetty    89648                             root    1u      CHR               4,64          0t0         89 /dev/ttyS0
agetty    89648                             root    2u      CHR               4,64          0t0         89 /dev/ttyS0

Yay. Restarted getty and success!

root@pve1:~# systemctl stop serial-getty@ttyS0.service
root@pve1:~# systemctl start serial-getty@ttyS0.service
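
In hindsight, a small watcher would have pointed at the culprit faster than staring at stty output. A sketch under a few assumptions: GNU `stty` and `lsof` are installed, the function names `tty_holders` and `watch_baud` are hypothetical, and the lsof columns look like the output above (command first, PID second, file name last).

```shell
# Print the distinct "command pid" pairs that have the device open.
# Reads `lsof` output on stdin; $1 is the device path to match.
tty_holders() {
    target=$1
    awk -v target="$target" '$NF == target { print $1, $2 }' | sort -u
}

# Poll the baud rate and, whenever it changes, report who has the port open.
# $1 is the device, $2 the polling interval in seconds (default 1).
# Runs until interrupted with Ctrl-C.
watch_baud() {
    dev=$1; interval=${2:-1}; last=
    while :; do
        speed=$(stty -F "$dev" speed)
        if [ "$speed" != "$last" ]; then
            printf '%s %s baud, open by:\n' "$(date +%T)" "$speed"
            lsof -n "$dev" 2>/dev/null | tty_holders "$dev"
            last=$speed
        fi
        sleep "$interval"
    done
}

# Example (on the host, as root):
#   watch_baud /dev/ttyS0
```

This only shows which processes hold the port at the moment the rate changes, not which one changed it, but with two holders and one of them being agetty that narrows it down quickly.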

Mysterious boot failure

The same Proxmox server runs headless; I somehow got it to boot with a Mellanox ConnectX-3 adapter installed in place of a video card.

After updating the kernel I noticed that the system didn’t come back up. The serial console didn’t even show output from grub, so there was either a significant hardware problem or the bootloader had somehow become corrupted. I tried power cycling it several times but it would never boot. Finally I replaced the Mellanox card with an old video card so I could see what’s going on.

Apparently during a previous power cycle the LSI MegaRAID 9260-8i card had an unclean shutdown and was not able to flush its write cache. The card decided to let me know by prompting me to press any key to continue or to press ‘C’ to enter the configuration utility. The prompt is a one-time event (per occurrence, I’m sure), so after pressing the spacebar and verifying that grub came up, I tested that grub would come up after a reboot, then restored the Mellanox card. The system booted up fine after that.

Cache data was lost due to an unexpected power-off or reboot during a write operation, but the adapter has recovered. This could be due to a memory problem, bad battery, or you may not have a battery installed.
Press any key to continue or 'C' to load the configuration utility.

Until recently the LSI card had a BBU-iBBU08 battery pack installed, but I removed it when I noticed significant bulging on the side of the pack. I bought the battery pack three years ago, so it had a good run.