Sunday, February 17, 2008

Fixing raid-5 failures, the adventurous approach

You might remember the trouble I had with my raid5 before. Well, it's still not 100% sorted out, but I know the cause now. It really was a faulty drive! I noticed that after I replaced the motherboard, CPU and RAM with new components. After I had installed them and booted into the system (which worked flawlessly on the first try, by the way, although the new hardware has nothing in common with the old) I heard a clicking sound from one of the hard disks. I immediately realized that I had bought new hardware for nothing. But at least I was now sure which component was causing the failure, plus I got 5 free SATA ports to upgrade the RAID. Previously I had no unused ports, which left no room for an upgrade.

But somehow the raid got messed up in the process. I wasn't able to assemble it with the remaining 3 disks because one disk was always added as a spare. So I had 2 functional devices and one spare, which is obviously not enough to run the raid. This is due to a corrupted superblock, but luckily the superblock is just metadata which can be recreated. If I had known which devices were in the array and which slots they occupied before all this happened, I could have recreated the array with mdadm --create and the correct params. Unfortunately, I did not know the exact params so I had to take a more... adventurous approach.

There's a Perl script on the linux-raid wiki which iterates over every possible ordering of the devices (with one device marked as missing) and tries to mount the resulting array. It does everything in read-only mode so no actual data is touched, only metadata. If it can mount the raid, it prints the mdadm --create command used to build it, stops the array and goes on. You can then execute the creation commands yourself and see if everything's right. In my case, luckily, it was and I got all my data back.

Note that I had to connect the failed drive for this to work, because the script always replaces one of the given devices with 'missing' ('missing' tells mdadm that this device is, well, missing) instead of adding 'missing' to the device list. This is because it's meant to recreate a complete array, not a partial one. So you need to pass ALL raid members on the command line, otherwise it won't work. It should be fairly easy to hack the script to work for partial arrays, too, but it was easier for me to add the drive again than to hack Perl code.
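
To give an idea of what the script has to try, here is a minimal Python sketch of the enumeration step only. It is not the Perl script from the wiki: it just prints candidate mdadm --create command lines and never runs them. The device names, /dev/md0 and the function name candidate_commands are made up for illustration, and a real recreate would also have to match things like the chunk size and layout of the original array.

```python
from itertools import permutations


def candidate_commands(devices, md_device="/dev/md0", level=5):
    """Yield one candidate 'mdadm --create' command (as an argument list)
    for every ordering of the devices and every choice of missing slot."""
    for order in permutations(devices):
        for slot in range(len(order)):
            members = list(order)
            # Replace exactly one member with the keyword 'missing',
            # as the script described above does.
            members[slot] = "missing"
            yield [
                "mdadm", "--create", md_device,
                f"--level={level}",
                f"--raid-devices={len(members)}",
                *members,
            ]


if __name__ == "__main__":
    # Hypothetical member partitions; substitute your own.
    devices = ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]
    for cmd in candidate_commands(devices):
        # Only print the command; nothing is executed here.
        print(" ".join(cmd))
```

For 4 devices this gives 4! orderings times 4 possible missing slots, so 96 candidates, which is why you want a script doing this rather than typing the commands by hand.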

After this the raid was up and I needed to mark the drive as faulty and remove it so it couldn't cause any more problems. It's always a bit tricky to map the device names (/dev/sdx) to the physical hard drives, and if you pull out the wrong one you may end up with even more problems. I found a reliable way to identify the drives:

hdparm -I /dev/sdx | grep 'Serial Number'
This will print the serial number, which is usually printed on the label of the actual disk, too. Somehow the -I option to hdparm never occurred to me before. The serial number matched one of my disks and so I was able to locate and remove the faulty drive.
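
If you have more than a couple of disks, the same lookup can be done for all of them in one go. The following is a rough Python sketch that simply wraps the hdparm -I call from above. The helper names parse_serial and serial_of and the /dev/sd? pattern are my own assumptions, and it needs to run as root with hdparm installed to report anything useful.

```python
import glob
import re
import subprocess

# Matches the "Serial Number:" line in the output of 'hdparm -I'.
SERIAL_RE = re.compile(r"^\s*Serial Number:\s*(\S.*?)\s*$", re.MULTILINE)


def parse_serial(hdparm_output):
    """Return the serial number found in 'hdparm -I' output, or None."""
    match = SERIAL_RE.search(hdparm_output)
    return match.group(1) if match else None


def serial_of(device):
    """Run 'hdparm -I' on a device and return its serial number, or None."""
    try:
        result = subprocess.run(
            ["hdparm", "-I", device],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        # hdparm is not installed.
        return None
    return parse_serial(result.stdout)


if __name__ == "__main__":
    # Assumes the disks show up as /dev/sda, /dev/sdb, ...
    for device in sorted(glob.glob("/dev/sd?")):
        print(device, serial_of(device) or "unknown")
```

The output is one line per device with its serial number, which you can then compare against the labels on the drives before pulling anything out.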


Next step is to contact the reseller for a replacement. I hope the next bad drive will be less problematic.
