Softraid Multi-drive Failure
Eric Shubert
ejs at shubes.net
Fri Jan 9 18:11:27 MST 2009
I prefer SW to HW raid as well. It runs on generic HW (controllers), so
there's nothing special to replace. Finding a replacement HW raid card
can be difficult, especially if it's an older card.
Charles Jones wrote:
> Glad to help. In a way I sort of like software RAID better than hardware
> RAID, because of workarounds like this. I've had a 3ware hardware RAID
> card fail, and there was nothing I could do until I snagged another
> 3ware card from eBay. I also had an old Promise RAID card drop 2
> drives at once, and there was no option to force the array back together
> without a rebuild. So far I've been lucky enough to recover from
> multi-drive failures every time when using software RAID. As for
> performance, I benchmarked a server we have with a 12-disk software
> RAID5, and it got the highest I/O of any server I had tested, and that
> was with crappy ATA-133 PATA drives. I've also experienced lots of
> problems with those darn Dell PERC RAID controllers - they seem to
> flake out for no reason.
>
> Joe Fleming wrote:
>> You're my savior man! I found a post on some forum talking about using
>> mdadm --examine to check the superblock on the drives. /dev/sdc1 was a
>> complete no show, but /dev/sdd1 (which was also failed) looked ok,
>> though outdated. I deactivated the array with mdadm --stop /dev/md0
>> and forced an assemble with the command you gave me.
>>
>> mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdd1
>>
>> And I'm back online! Time to copy files off ASAP. I still hear the
>> chirping noise from one of the drives, but at least the array is back up.
>> Thanks again!
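
[Editor's note: a quick sanity check before copying data off a force-assembled
array can be sketched like this. /dev/md0 is the device from this thread; the
mount point and backup path are made-up examples. Each command is echoed as a
dry run rather than executed, so nothing destructive happens by accident.]

```shell
# Dry-run sketch: verify the forced assembly before copying data off.
# /dev/md0 comes from the thread; /mnt/rescue and /backup are
# hypothetical paths used only for illustration.
run() { echo "# $*"; }   # change the body to "$@" to actually execute

run cat /proc/mdstat                  # array should appear degraded, e.g. [UU_U]
run mdadm --detail /dev/md0           # confirm state and which member is missing
run mount -o ro /dev/md0 /mnt/rescue  # mount read-only to avoid further writes
run rsync -a /mnt/rescue/ /backup/    # copy everything off ASAP
```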
>>
>> -Joe
>>
>> Charles Jones wrote:
>>> Joe Fleming wrote:
>>>> Hey all, I have a Debian box that was acting as a 4 drive RAID-5
>>>> mdadm softraid server. I heard one of the drives making strange
>>>> noises but mdstat reported no problems with any of the drives. I
>>>> decided to copy the data off the array so I had a backup before I
>>>> tried to figure out which drive it was. Unfortunately, in the middle
>>>> of copying said data, 2 of the drives dropped out at the same time.
>>>> Since RAID-5 only tolerates a single drive failure at a time, the
>>>> whole array is basically hosed now. I've had drives drop out on me
>>>> before, but never 2 at once. Sigh.
>>>>
>>>> I tried to Google a little about dealing with multi-drive failures
>>>> with mdadm, but I couldn't find much in my initial looking. I'm
>>>> going to keep digging, but I thought I'd post a question to the
>>>> group and see what happens. So, is there a way to tell mdadm to
>>>> "unmark" one of the 2 drives as failed and try to bring up the array
>>>> again WITHOUT rebuilding it? I really don't think both of the drives
>>>> failed on me simultaneously and I'd like to try to return 1 of the 2
>>>> to the array and test my theory. If I can get the array back up, I
>>>> can either keep trying to copy data off it or add a new replacement
>>>> and try to rebuild. I'm pretty novice with mdadm thought I don't see
>>>> an option that will let me do what I want. Can anyone offer me some
>>>> advice or point me in the right direction..... or am I just SOL?
>>>>
>>>> As a side note, why can't hard drive manufacturers make drives that
>>>> last anymore? I've had like 5 drives fail on me in the last year...
>>>> WD, Seagate, Hitachi, they all suck equally! I can't find any that
>>>> last for any reasonable amount of time, and all the warranties leave
>>>> you with reman'd drives which fail even more rapidly, some even show
>>>> up DOA. Plus, I'm not sending my unencrypted data off to some random
>>>> place! Sorry for venting, just a little ticked off at all of this.
>>>> Thanks in advance for any help.
>>>>
>>>> -Joe
>>>
>>> I've had luck in the past recovering from a multi-drive failure
>>> where the other "failed" drive was not truly dead, but rather was
>>> dropped because of an IO error caused by a thermal recalibration or
>>> something similar. The trick is to re-add the drive to the array
>>> using the option that forces it NOT to try to rebuild the array. This
>>> used to require several options like --really-force and
>>> --really-dangerous, but now I think it's just something like --assemble
>>> --force /dev/md0. This forces the array to come back up in its
>>> degraded (still down 1 disk) state. If possible, replace the failed
>>> disk or copy your data off before the other flaky drive fails too.
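
[Editor's note: the sequence Charles describes, and which Joe reported using,
can be sketched as a shell dry run. The device names are the ones from this
thread; the `run` wrapper just echoes each command so nothing destructive
executes while reading along.]

```shell
# Dry-run sketch of the forced-assembly recovery described above.
# /dev/md0 and /dev/sd[a-d]1 are the names from this thread - adjust
# for your own system before running anything for real.
run() { echo "# $*"; }   # change the body to "$@" to actually execute

# 1. Examine each member's superblock; compare the Events counters to
#    see which "failed" members are actually still usable.
run mdadm --examine /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

# 2. Stop the broken array so it can be reassembled.
run mdadm --stop /dev/md0

# 3. Force assembly from the sane members, leaving out the truly dead
#    disk (/dev/sdc1 in this thread). The array comes up degraded.
run mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdd1
```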
>
--
-Eric 'shubes'
More information about the PLUG-discuss mailing list