107% wear: chronicle of a turbulent disk swap

If you've experienced some unusual downtime over the past few days, here's the story of what happened behind the scenes.

The problem

It all started with a fairly unsettling monitoring email: both NVMe SSDs in the Leek Wars server were showing 107% wear. For those wondering how a disk can be worn out beyond 100%: manufacturers give an estimated endurance (in terabytes written), beyond which they no longer guarantee anything. SMART keeps counting past that, and 107% basically means "technically, this thing should already be dead".
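To make the arithmetic concrete, here is a minimal sketch of how a figure like 107% comes about. The 600 TB rating and the 642 TB written are made-up numbers chosen for illustration, not the actual specs of these drives. On a real NVMe disk you don't compute this yourself: the drive reports it, for instance as "Percentage Used" in the output of smartctl -a.

```shell
# wear_percent WRITTEN_TB RATED_TB
# Prints how much of the rated endurance has been consumed, as an integer
# percentage (rounded down). Nothing caps the result at 100.
wear_percent() {
    echo $(( $1 * 100 / $2 ))
}

# Hypothetical drive: rated for 600 TB written, 642 TB actually written.
wear_percent 642 600   # prints 107
```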

The two disks, mirrored in RAID 1, were twins: born at the same time, aging at the same pace. So: emergency OVH intervention to replace them.

Phase 1: paranoia

Before any intervention, I wanted to make sure we had complete backups elsewhere. The usual backups cover the main PostgreSQL database and critical directories, but some historical tables were excluded from the daily backup for size reasons.

A few hours of pg_dump later, those tables were safe, along with the critical Docker volumes. Time to breathe.
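For the curious, dumping a handful of tables on their own looks roughly like this. It is only a sketch: the database name, the destination directory and the table names in the usage line are placeholders, not the real ones.

```shell
# dump_tables DB OUTDIR TABLE...
# Writes one custom-format archive per table, so each one can be restored
# independently with pg_restore.
dump_tables() {
    db=$1; outdir=$2; shift 2
    for table in "$@"; do
        pg_dump --format=custom --table="$table" \
                --file="$outdir/$table.dump" "$db" || return 1
    done
}

# Hypothetical usage (database, directory and table names are placeholders):
# dump_tables leekwars /mnt/backup fight_archive trophy_history
```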

Phase 2: first disk, first trap

OVH replaces the first disk. The server reboots... into rescue mode. No problem, we switch to disk boot, and the system comes back. Except Traefik refuses to start, with a cryptic message complaining that one of its ports is already in use.

Except nothing is listening on that port. Unless... Docker is keeping ghost entries in its internal local-kv.db database, pointing to an old network bridge IP (172.18.0.6) that no longer exists. On reboot, Docker mechanically recreates orphaned docker-proxy processes that squat on the ports exposed by the services, preventing Traefik from using them.

Solution: stop Docker, delete the local-kv.db, restart Docker. It rebuilds its state cleanly from the Swarm. Service restored.
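In commands, that fix is about three lines. A minimal sketch, assuming Docker's default data root (/var/lib/docker) and a daemon managed by systemd; the path will differ if data-root was changed in daemon.json. One deliberate difference from what I did: this version renames the file instead of deleting it, so it can still be inspected afterwards.

```shell
# reset_docker_network_state [KV_FILE]
reset_docker_network_state() {
    kv=${1:-/var/lib/docker/network/files/local-kv.db}
    # Stop the socket too, otherwise socket activation can restart the daemon.
    systemctl stop docker.socket docker.service || return 1
    # Move the stale network database out of the way (kept as a backup).
    if [ -f "$kv" ]; then
        mv "$kv" "$kv.stale"
    fi
    # On start, Docker recreates the file from scratch.
    systemctl start docker.service
}

# Usage (as root): reset_docker_network_state
```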

First disk added to the RAID, resync (~40 min), boot installed on the new disk. All good.
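Adding the disk and keeping an eye on the resync can be sketched as follows. The array and partition names (/dev/md2, /dev/nvme0n1p2) are assumptions for the example; check /proc/mdstat for the real ones.

```shell
# Add the new partition to the mirror (hypothetical device names):
#   mdadm --manage /dev/md2 --add /dev/nvme0n1p2

# resync_progress [MDSTAT_FILE]
# Prints the rebuild percentage of every array currently resyncing,
# e.g. "12.6%". Prints nothing once the arrays are back in sync.
resync_progress() {
    grep -Eo '(recovery|resync) *= *[0-9.]+%' "${1:-/proc/mdstat}" \
        | grep -Eo '[0-9.]+%'
}

# Usage: resync_progress
```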

Phase 3: second disk, real headache

OVH comes back and tells us that the second disk (the one we hadn't replaced yet) also failed its SMART check. Not surprising. We schedule the replacement.

The swap happens. The server reboots... and lands in a UEFI shell. The firmware can't find any valid boot entry. Of course: we had installed GRUB on the disk OVH just replaced.

Off to rescue mode. Now, open-heart surgery:

1. Mount the degraded RAID (nvme1n1 survived, it has the entire OS)
2. Format the empty EFI partition on the new disk
3. Chroot, reinstall GRUB
4. Manually create the EFI entry in NVRAM with efibootmgr
5. Repartition the new disk to match the old one
6. Add it to the RAID, start the resync
7. Reboot
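The steps above can be sketched as a script. Treat it as an illustration under several assumptions, not a transcript of what I typed: the array name (/dev/md2), the partition numbers (p1 for EFI, p2 for the RAID member), the /mnt and /boot/efi mount points and the "debian" bootloader id are all hypothetical. The sketch also copies the partition table before formatting anything, which is a reordering of the list. It only prints the commands unless DRY_RUN=0 is set.

```shell
# Dry run by default: commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

rescue_steps() {
    old=/dev/nvme1n1   # surviving disk
    new=/dev/nvme0n1   # replacement disk
    md=/dev/md2        # hypothetical array name
    id=debian          # hypothetical bootloader id (distribution-dependent)

    # Start the mirror with a single member and mount the OS.
    run mdadm --assemble --run "$md" "${old}p2"
    run mount "$md" /mnt

    # Copy the partition table to the new disk, then give it fresh GUIDs.
    run sgdisk --replicate="$new" "$old"
    run sgdisk --randomize-guids "$new"

    # Fresh EFI system partition; -n sets the label, which matters
    # if /etc/fstab mounts it by LABEL=.
    run mkfs.vfat -F 32 -n EFI_SYSPART "${new}p1"
    run mount "${new}p1" /mnt/boot/efi

    # Chroot plumbing, then GRUB without touching NVRAM.
    for fs in dev proc sys; do
        run mount --rbind "/$fs" "/mnt/$fs"
    done
    run chroot /mnt grub-install --target=x86_64-efi \
        --efi-directory=/boot/efi --bootloader-id="$id" --no-nvram

    # Register the boot entry by hand (needs a rescue system booted in UEFI mode).
    run efibootmgr --create --disk "$new" --part 1 --label "$id" \
        --loader "\\EFI\\$id\\grubx64.efi"

    # Rejoin the mirror; the resync starts immediately.
    run mdadm --manage "$md" --add "${new}p2"
}

# Usage: rescue_steps            (prints the plan)
#        DRY_RUN=0 rescue_steps  (executes it, as root, from rescue mode)
```

Printing first and executing second is a cheap safety net when half of the commands can wipe a disk.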

The server reboots, GRUB loads the kernel, Linux starts up... and then nothing. Ping OK, but SSH refuses connections. Meanwhile, in the KVM console, the firewall is logging blocked connections: a sign that the system is running, but the application services aren't started.

Phase 4: investigation

The site has been down for two hours. Back to rescue mode, I mount the production filesystem and check the systemd journal from the last boot. And there, the culprit appears:

Emergency mode. That's where systemd takes refuge when a critical mount fails. In emergency mode, no network service starts: no SSH, no Docker, nothing. But the kernel stays active, so ping and firewall keep working. The perfect trap.
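Reading the journal of a system that isn't running, from rescue mode, can be sketched like this. It assumes the production root is mounted on /mnt (a hypothetical mount point) and that the journal is persistent, i.e. /var/log/journal exists on that filesystem. The helper looks up the most recent recorded boot by its ID rather than relying on a relative offset.

```shell
# last_boot_errors [ROOT]
# Prints warnings and errors from the most recent boot recorded in the
# journal of the system mounted at ROOT.
last_boot_errors() {
    dir=${1:-/mnt}/var/log/journal
    # The last line of --list-boots is the most recent boot; column 2 is its ID.
    boot_id=$(journalctl --directory="$dir" --list-boots --no-pager \
              | tail -n 1 | awk '{print $2}')
    [ -n "$boot_id" ] || { echo "no boots recorded in $dir" >&2; return 1; }
    journalctl --directory="$dir" --boot="$boot_id" --priority=warning --no-pager
}

# Usage: last_boot_errors /mnt
```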

Root cause? The /etc/fstab mounts the EFI partition by its label, EFI_SYSPART.

When I formatted the EFI partitions on the new disks with mkfs.vfat, I forgot to give them the EFI_SYSPART label. So systemd waits 90 seconds for a partition that doesn't exist, gives up, and brings down all of local-fs.target in cascade.

A quick fatlabel /dev/nvme0n1p1 EFI_SYSPART later, plus a cross-check of every fstab mount, and we reboot. This time it works: SSH responds in under 2 minutes, all services start, the site is back.
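The cross-check is easy to automate. Here is a minimal sketch (the function name is mine) that asks findfs whether every LABEL= or UUID= source in fstab actually resolves to a device. Recent util-linux versions also ship findmnt --verify, which covers similar ground.

```shell
# check_fstab_specs [FSTAB]
# For every LABEL=/UUID=/PARTLABEL=/PARTUUID= source in FSTAB, asks findfs
# whether a matching block device exists. Prints the ones that do not and
# returns non-zero if any is missing.
check_fstab_specs() {
    fstab=${1:-/etc/fstab}
    missing=0
    for spec in $(awk '$1 ~ /^(LABEL|UUID|PARTLABEL|PARTUUID)=/ {print $1}' "$fstab"); do
        if ! findfs "$spec" >/dev/null 2>&1; then
            echo "MISSING: $spec"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from rescue mode with the production root on /mnt (hypothetical):
# check_fstab_specs /mnt/etc/fstab
```

Run before the reboot, a check like this would have flagged the missing label in seconds.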

The takeaway

In the end:

Two brand new disks, 0% wear, RAID synced
All backups safe off-site
A few hours of cumulative downtime
One lesson learned: before rebooting, read the systemd journal instead of guessing. I wasted a good hour chasing ghosts in Docker when the real problem was written in black and white in journalctl --boot=-1.

Thanks for your patience during these interruptions. The server is now good to go for years to come with its new disks. Until next time!