Sudden hard disk failure

Well, sometimes you have accumulated a setup that is not exactly foolproof but sufficient for the needs you have at the moment. I have such a setup: an LVM volume group spanning multiple disks, with XFS on top of that. This is the kind of setup where you say: “It will be fine”. But sometimes things can go wrong fast.

1. S.M.A.R.T. mails going fast

First we got unreadable (pending) sectors:

This message was generated by the smartd daemon running on:

   host name:  archserver
   DNS domain: localdomain

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 24 Currently unreadable (pending) sectors

Device info:
ST10000VN0004-1ZD101, S/N:XXXXXXXX, WWN:5-000c50-0a3049ec9, FW:SC60, 10.0 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.

Very quickly after that, those sectors were already uncorrectable:

This message was generated by the smartd daemon running on:

   host name:  archserver
   DNS domain: localdomain

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 24 Offline uncorrectable sectors

Device info:
ST10000VN0004-1ZD101, S/N:XXXXXXXX, WWN:5-000c50-0a3049ec9, FW:SC60, 10.0 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.

Well, this is no fun. Usually these mails do not come in this fast, so you have enough time to get a new hard disk, move the data off the disk that is going bad, and be fine. But this time it got even worse very quickly.

This message was generated by the smartd daemon running on:  
  
   host name:  archserver  
   DNS domain: localdomain  
  
The following warning/error was logged by the smartd daemon:  
  
Device: /dev/sda [SAT], Self-Test Log error count increased from 0 to 1  
  
Device info:  
ST10000VN0004-1ZD101, S/N:XXXXXXXX, WWN:5-000c50-0a3049ec9, FW:SC60, 10.0 TB  
  
For details see host's SYSLOG.  
  
You can also use the smartctl utility for further investigation.  
No additional messages about this problem will be sent.

OK, this is very bad.
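As the mails suggest, smartctl shows the details behind these warnings. Something along these lines dumps the attributes, the device error log and the self-test log (output omitted here; the device name is the one from the mails):

smartctl -x /dev/sda            # extended SMART info: attributes, error log, self-test log
smartctl -l selftest /dev/sda   # just the self-test log that smartd complained about
smartctl -t long /dev/sda       # start a new long self-test if you want fresh results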

2. Errors and timeouts

Looking at the dmesg output, there were a whole lot of errors like the ones below, so quick action was needed.

kernel: sd 5:0:0:0: [sda] Unaligned partial completion (resid=65528, sector_sz=512)
kernel: sd 5:0:0:0: [sda] tag#7 CDB: Read(16) 88 00 00 00 00 00 07 49 4f 80 00 00 00 80 00 00
kernel: sd 5:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=7s
kernel: sd 5:0:0:0: [sda] tag#7 Sense Key : Medium Error [current]
kernel: sd 5:0:0:0: [sda] tag#7 Add. Sense: Unrecovered read error
kernel: sd 5:0:0:0: [sda] tag#7 CDB: Read(16) 88 00 00 00 00 00 07 49 4f 80 00 00 00 80 00 00
kernel: critical medium error, dev sda, sector 122244992 op 0x0:(READ) flags 0x800 phys_seg 16 prio class 0
kernel: sd 5:0:0:0: [sda] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=6s
kernel: sd 5:0:0:0: [sda] tag#4 Sense Key : Hardware Error [current]
kernel: sd 5:0:0:0: [sda] tag#4 Add. Sense: Internal target failure
kernel: sd 5:0:0:0: [sda] tag#4 CDB: Read(16) 88 00 00 00 00 00 07 49 50 00 00 00 00 80 00 00
kernel: critical target error, dev sda, sector 122245120 op 0x0:(READ) flags 0x800 phys_seg 16 prio class 0
kernel: sd 5:0:0:0: [sda] tag#5 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN
kernel: sd 5:0:0:0: [sda] tag#5 CDB: Read(16) 88 00 00 00 00 00 07 49 50 80 00 00 00 80 00 00
kernel: scsi host5: uas_eh_device_reset_handler start
kernel: usb 2-4.3: reset SuperSpeed Plus Gen 2x1 USB device number 64 using xhci_hcd
kernel: scsi host5: uas_eh_device_reset_handler success
kernel: sd 5:0:0:0: [sda] tag#13 timing out command, waited 180s
kernel: sd 5:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=183s
kernel: sd 5:0:0:0: [sda] tag#13 Sense Key : Not Ready [current]
kernel: sd 5:0:0:0: [sda] tag#13 Add. Sense: Logical unit is in process of becoming ready
kernel: sd 5:0:0:0: [sda] tag#13 CDB: Read(16) 88 00 00 00 00 00 07 49 50 80 00 00 00 80 00 00
kernel: I/O error, dev sda, sector 122245248 op 0x0:(READ) flags 0x800 phys_seg 16 prio class 0

Since this disk is a physical volume in an LVM volume group, my first thought was: “Just add a new disk to the volume group and pvmove the data off the failing one”. This failed almost immediately. Well, it did not exactly fail, but it got stuck. Reading the disk with dd also got stuck pretty quickly.
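For reference, this was the idea; a rough sketch, assuming the failing disk is a whole-disk physical volume, /dev/sdb is the replacement disk (the same names as in the ddrescue run below) and volgroup is the volume group name used further down:

pvcreate /dev/sdb            # prepare the new disk as a physical volume
vgextend volgroup /dev/sdb   # add it to the existing volume group
pvmove /dev/sda /dev/sdb     # move all extents off the failing disk
# in practice pvmove hung as soon as it hit the unreadable sectors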

3. Thanks to ddrescue

GNU ddrescue reads whatever it can, skips over the unreadable areas and keeps a map file (rescue.log below), so the copy can be interrupted and resumed at any point. That is exactly what was needed here:

ddrescue --force --idirect --odirect --sector-size=4096 --no-scrape --verbose /dev/sda /dev/sdb rescue.log  
GNU ddrescue 1.28  
About to copy 10000 GBytes from '/dev/sda' to '/dev/sdb'  
   Starting positions: infile = 0 B,  outfile = 0 B  
   Copy block size:  16 sectors       Initial skip size: 24432 sectors  
Sector size: 4096 Bytes  
  
Press Ctrl-C to interrupt  
    ipos:   10000 GB, non-trimmed:    2949 kB,  current rate:    107 MB/s  
    opos:   10000 GB, non-scraped:        0 B,  average rate:    119 MB/s  
non-tried:    4903 MB,  bad-sector:        0 B,    error rate:       0 B/s  
 rescued:    9995 GB,   bad areas:        0,        run time: 23h 12m 20s  
pct rescued:   99.95%, read errors:       45,  remaining time:         44s  
                             time since last successful read:          0s  
Copying non-tried blocks... Pass 1 (forwards)  
    ipos:   62684 MB, non-trimmed:    4587 kB,  current rate:       0 B/s  
    opos:   62684 MB, non-scraped:        0 B,  average rate:    118 MB/s  
non-tried:    1245 MB,  bad-sector:        0 B,    error rate:   10922 B/s  
 rescued:    9999 GB,   bad areas:        0,        run time: 23h 31m 43s  
pct rescued:   99.98%, read errors:       70,  remaining time:      1m 15s  
                             time since last successful read:          6s  
Copying non-tried blocks... Pass 2 (backwards)  
    ipos:   62589 MB, non-trimmed:    8781 kB,  current rate:       0 B/s  
    opos:   62589 MB, non-scraped:        0 B,  average rate:    116 MB/s  
non-tried:    1240 MB,  bad-sector:        0 B,    error rate:       0 B/s  
 rescued:    9999 GB,   bad areas:        0,        run time: 23h 48m  2s  
pct rescued:   99.98%, read errors:      134,  remaining time:  9d 15h 46m  
                             time since last successful read:     10m 46s  
Copying non-tried blocks... Pass 5 (forwards)

So instead of losing the full disk, the loss appears to be around 1.3 GB. Since this storage LVM contains archives of things we also have in other locations, it will be reasonably easy to fill those 1.3 GB with the correct data again. As an extra, I always keep par files for archives, since an archive no longer changes.
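The map file also records exactly which areas never got read, so it can be queried afterwards with ddrescuelog, which ships with GNU ddrescue. Roughly like this, where the type letters select non-tried, non-trimmed, non-scraped and bad-sector areas:

ddrescuelog --show-status rescue.log        # summary of rescued vs. missing areas
ddrescuelog -b 4096 -l '?*/-' rescue.log    # 4 KiB block numbers that were not rescued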

4. xfs needs repairing

[Sun Jun  9 19:01:47 2024] XFS (dm-0): Metadata corruption detected at xfs_buf_ioend+0x193/0x550 [xfs], xfs_inode block 0xb88824168 xfs_inode_buf_verify  
[Sun Jun  9 19:01:47 2024] XFS (dm-0): Unmount and run xfs_repair                                                         
[Sun Jun  9 19:01:47 2024] XFS (dm-0): First 128 bytes of corrupted metadata buffer:  
[Sun Jun  9 19:01:47 2024] 00000000: 83 95 17 88 0f cf 93 63 bc 73 19 63 c6 db 1d ac  .......c.s.c....                    
[Sun Jun  9 19:01:47 2024] 00000010: 9c 9b 6a 3b 16 b3 e8 85 16 57 35 fe 65 ca 03 e8  ..j;.....W5.e...                    
[Sun Jun  9 19:01:47 2024] 00000020: 5f 1a 70 6e ea 04 d1 a6 77 eb 0a 3d c6 27 e9 62  _.pn....w..=.'.b  
[Sun Jun  9 19:01:47 2024] 00000030: c2 53 9d d8 06 85 5e 1d dc 93 1b 41 5d 1e 2a bc  .S....^....A].*.                    
[Sun Jun  9 19:01:47 2024] 00000040: 38 9a 2b 22 9e fc c9 15 e7 d3 52 ae fa 3b 10 bd  8.+"......R..;..                    
[Sun Jun  9 19:01:47 2024] 00000050: 8f ae 95 95 33 f3 b2 10 87 cd 51 e4 eb 7b a1 24  ....3.....Q..{.$  
[Sun Jun  9 19:01:47 2024] 00000060: 16 cc 46 b4 c8 0f ca b0 e2 1c 5e dd 58 6e 9a e7  ..F.......^.Xn..                    
[Sun Jun  9 19:01:47 2024] 00000070: 1c 2f 7c 50 23 2f 60 aa fc b1 8e e8 2d 2f 0c 43  ./|P#/`.....-/.C  
[Sun Jun  9 19:01:47 2024] XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x66/0xe0 [xfs]" at daddr 0xb88824168 len 32 error 117

Since there is metadata corruption, the filesystem must be unmounted before running the repair.
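If you want to see the damage before letting xfs_repair change anything, a dry run with -n only reports problems and modifies nothing:

xfs_repair -n /dev/mapper/volgroup-logicalvolume

Then the actual repair: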

xfs_repair /dev/mapper/volgroup-logicalvolume

After remounting, the XFS filesystem no longer reports issues with the metadata. So that part is fine already.
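For completeness, the remount is just a normal mount again, followed by keeping an eye on the kernel log (the mountpoint here is only an example):

mount /dev/mapper/volgroup-logicalvolume /srv/archive   # example mountpoint
dmesg | tail                                            # no new XFS corruption messages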

5. Find what data is lost

Because of the nature of the data and the presence of par files, it was straightforward to find the corrupted data. Due to sheer luck, no data was actually lost and the archives could be repaired. The most annoying part is the time it takes to check everything and locate the corrupted data.
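A sketch of that check with par2cmdline; /srv/archive is an assumed location, and the find expression skips the extra .vol recovery files that par2 creates next to the index file:

find /srv/archive -name '*.par2' ! -name '*.vol*' -print0 |
  while IFS= read -r -d '' p; do
    par2 verify "$p" > /dev/null || echo "needs repair: $p"
  done
# anything that gets flagged can then be fixed in place with: par2 repair <archive>.par2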

Warning

Without the presence of par files, it would have been way harder to find the corrupted data.

Takeaways

Multi-disk LVM is easy to set up. But if a disk fails quickly, you’re in for a lot of work.
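To illustrate: the whole setup is only a handful of commands (disk names are illustrative, the volume names match the ones used above), and nothing in it protects you when one of those disks dies:

pvcreate /dev/sda /dev/sdb
vgcreate volgroup /dev/sda /dev/sdb
lvcreate -l 100%FREE -n logicalvolume volgroup
mkfs.xfs /dev/volgroup/logicalvolume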

It is also important to keep close track of the S.M.A.R.T. status of your disks and to set up checks for that.
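The mails in this post came from smartd (part of smartmontools). A minimal configuration could look like this; the test schedule is the example from the smartd.conf man page and the mail address is obviously an assumption:

# /etc/smartd.conf
# monitor all disks, run a short self-test daily at 02:00 and a long one on
# Saturday at 03:00, and mail a warning when something looks bad
DEVICESCAN -a -s (S/../.././02|L/../../6/03) -m admin@example.org

systemctl enable --now smartd   # the smartmontools service that reads this file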

For any kind of disk emergency it is good to have a spare disk on hand. Since this incident I also keep a spare disk of equal size for my ZFS pools, and for the LVM I now keep a spare disk the same size as the largest disk in the volume group.