Softraid Multi-drive Failure

Charles Jones charles.jones at ciscolearning.org
Fri Jan 9 16:34:20 MST 2009


Glad to help. In a way I like software RAID better than hardware 
RAID, because of workarounds like this. I've had a 3ware hardware 
RAID card fail, and there was nothing I could do until I snagged 
another 3ware card from eBay. I also had an old Promise RAID card 
drop 2 drives at once, but there was no option to force the array 
back together without a rebuild. So far I've been lucky enough to 
recover from multi-drive failures every time with software RAID. As 
for performance, I benchmarked a server we have with a 12-disk 
software RAID-5, and it got the highest IO of any server I had 
tested, even with crappy ATA-133 PATA drives. I've also had lots of 
problems with those darn Dell PERC RAID controllers - they seem to 
flake out for no reason.
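
For a quick-and-dirty sequential-read figure on an md device, 
something along these lines is enough (not the exact benchmark I ran 
back then, just an illustration, and the device name is only an 
example):

   # timed buffered reads straight off the array device
   hdparm -t /dev/md0

   # or pull a few GB with O_DIRECT to bypass the page cache
   dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct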

Joe Fleming wrote:
> You're my savior, man! I found a post on some forum talking about 
> using mdadm --examine to check the superblock on the drives. 
> /dev/sdc1 was a complete no-show, but /dev/sdd1 (which had also 
> failed) looked OK, though outdated. I deactivated the array with 
> mdadm --stop /dev/md0 and forced an assemble with the command you 
> gave me:
>
> mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdd1
>
> And I'm back online! Time to copy files off ASAP. I still hear the 
> chirping noise from one of the drives, but at least it's back up. 
> Thanks again!
>
> -Joe
>
> Charles Jones wrote:
>> Joe Fleming wrote:
>>> Hey all, I have a Debian box that was acting as a 4-drive RAID-5 
>>> mdadm softraid server. I heard one of the drives making strange 
>>> noises, but mdstat reported no problems with any of the drives. I 
>>> decided to copy the data off the array so I had a backup before I 
>>> tried to figure out which drive it was. Unfortunately, in the 
>>> middle of copying said data, 2 of the drives dropped out at the 
>>> same time. Since RAID-5 only tolerates one failure at a time, the 
>>> whole array is basically hosed now. I've had drives drop out on 
>>> me before, but never 2 at once. Sigh.
>>>
>>> I tried to Google a little about dealing with multi-drive failures 
>>> with mdadm, but I couldn't find much in my initial looking. I'm 
>>> going to keep digging, but I thought I'd post a question to the 
>>> group and see what happens. So, is there a way to tell mdadm to 
>>> "unmark" one of the 2 drives as failed and try to bring up the array 
>>> again WITHOUT rebuilding it? I really don't think both of the drives 
>>> failed on me simultaneously and I'd like to try to return 1 of the 2 
>>> to the array and test my theory. If I can get the array back up, I 
>>> can either keep trying to copy data off it or add a new replacement 
>>> and try to rebuild. I'm pretty new to mdadm, though, and I don't 
>>> see an option that will let me do what I want. Can anyone offer 
>>> me some advice or point me in the right direction... or am I just SOL?
>>>
>>> As a side note, why can't hard drive manufacturers make drives that 
>>> last anymore? I've had like 5 drives fail on me in the last year... 
>>> WD, Seagate, Hitachi, they all suck equally! I can't find any that 
>>> last for any reasonable amount of time, and all the warranties leave 
>>> you with reman'd drives, which fail even more rapidly; some even show 
>>> up DOA. Plus, I'm not sending my unencrypted data off to some random 
>>> place! Sorry for venting, just a little ticked off at all of this. 
>>> Thanks in advance for any help.
>>>
>>> -Joe
>>
>> I've had luck in the past recovering from a multi-drive failure 
>> where the other "failed" drive was not truly dead, but rather was 
>> dropped because of an IO error caused by a thermal calibration or 
>> something similar. The trick is to re-add the drive to the array 
>> using the option that forces it NOT to try to rebuild the array. 
>> This used to require several options like --really-force and 
>> --really-dangerous, but now I think it's just something like 
>> mdadm --assemble --force /dev/md0. This brings the array back up 
>> in its degraded state (still down 1 disk). If possible, replace the 
>> failed disk or copy your data off before the other flaky drive fails.
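
To recap the workaround from this thread for the archive - this is 
roughly the sequence Joe used, with his device names, so adjust it to 
your own setup and treat it as a last resort before copying data off:

   # check whether a "failed" member still has a sane (if outdated) superblock
   mdadm --examine /dev/sdd1

   # stop the broken array
   mdadm --stop /dev/md0

   # force-assemble it from the members that look usable
   mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdd1

   # confirm it came back up in a degraded state
   cat /proc/mdstat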


