need help troubleshooting software raid failures

Alex Dean alex at crackpot.org
Thu Jul 16 08:49:59 MST 2009


I have a fileserver which I used to make backups on my local network.   
The 2 disks (Seagate 640GB, purchased new about 6 months ago) are set  
up with software RAID1.  Each disk has 4 partitions, assembled into 4  
RAID1 devices.

On Monday, I lost power to the machine.  (Power cord yanked out, not  
power failure.)  I powered the machine back on, and it started up  
without issues.  About an hour later, I got emails from mdadm saying  
that 2 of my 4 RAID devices had failed.

I assumed disk had not actually physically failed, since 2 of my 4  
devices were still online and (apparently) fine.  I removed the failed  
partitions from their arrays, and they re-added them.  mdadm happily  
re-synced them, and everything seemed fine.  Now, a few days later,  
the same 2 partitions (md0, md1) are again saying they have failed,  
and md3 has also failed.  The failures didn't happen at the same time.

I would appreciate some help troubleshooting here.  What else should I  
do to investigate the problem partitions, or the disk itself?  Prior  
to this little poweroff incident, this machine had been running  
without any issues for several months.  I'm using an old workstation  
with a cheapo IDE SATA controller.  (Maybe that's the problem?)

*** Timeline ***

7/13 18:00 (roughly) : power yanked
7/13 22:00 (roughly) : power restored
7/14  9:24 : email from mdadm.  sda5 (md1) failed.
7/14  9:29 : email from mdadm.  sda1 (md0) failed.
7/15 morning (can't find the log entry to get exact time) : remove/add  
sda5 and sda1.  resynced w/o issues.
7/15 20:20 : email from mdadm.  sda1 (md0) failed.
7/16  0:07 : email from mdadm.  sda7 (md3) failed.
7/16  9:38 : email from mdadm.  sda5 (md1) failed.

*** Syslog Errors ***
I notice errors like this in /var/log/syslog in a few places.  I can  
try to track them all down if that's helpful.

Jul 14 09:24:29 gringotts kernel: [50073.107428] ata3.00: exception  
Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Jul 14 09:24:29 gringotts kernel: [50073.107509] ata3: SError:  
{ PHYRdyChg }
Jul 14 09:24:29 gringotts kernel: [50073.107550] ata3.00: cmd ea/ 
00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jul 14 09:24:29 gringotts kernel: [50073.107553]          res d0/ 
d0:d0:d0:d0:d0/00:00:00:00:00/d0 Emask 0x12 (ATA bus error)
Jul 14 09:24:29 gringotts kernel: [50073.107639] ata3.00: status:  
{ Busy }
Jul 14 09:24:29 gringotts kernel: [50073.107670] ata3.00: error:  
{ ICRC UNC IDNF }
Jul 14 09:24:29 gringotts kernel: [50073.107714] ata3: hard resetting  
link
Jul 14 09:24:35 gringotts kernel: [50078.590072] ata3: SATA link up  
1.5 Gbps (SStatus 113 SControl 310)
Jul 14 09:24:35 gringotts kernel: [50078.651141] ata3.00: configured  
for UDMA/100
Jul 14 09:24:35 gringotts kernel: [50078.651190] end_request: I/O  
error, dev sda, sector 58604927
Jul 14 09:24:35 gringotts kernel: [50078.651244] md: super_written  
gets error=-5, uptodate=0
Jul 14 09:24:35 gringotts kernel: [50078.651257] raid1: Disk failure  
on sda5, disabling device.
Jul 14 09:24:35 gringotts kernel: [50078.651260] raid1: Operation  
continuing on 1 devices.
Jul 14 09:24:35 gringotts kernel: [50078.651353] ata3: EH complete
Jul 14 09:24:35 gringotts kernel: [50078.653030] sd 2:0:0:0: [sda]  
1250263728 512-byte hardware sectors (640135 MB)
Jul 14 09:24:35 gringotts kernel: [50078.664063] sd 2:0:0:0: [sda]  
Write Protect is off
Jul 14 09:24:35 gringotts kernel: [50078.664080] sd 2:0:0:0: [sda]  
Mode Sense: 00 3a 00 00
Jul 14 09:24:35 gringotts kernel: [50078.664184] sd 2:0:0:0: [sda]  
Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 14 09:24:35 gringotts kernel: [50078.677073] RAID1 conf printout:
Jul 14 09:24:35 gringotts kernel: [50078.677094]  --- wd:1 rd:2
Jul 14 09:24:35 gringotts kernel: [50078.677103]  disk 0, wo:1, o:0,  
dev:sda5
Jul 14 09:24:35 gringotts kernel: [50078.677109]  disk 1, wo:0, o:1,  
dev:sdb5
Jul 14 09:24:35 gringotts kernel: [50078.679371] RAID1 conf printout:
Jul 14 09:24:35 gringotts kernel: [50078.679389]  --- wd:1 rd:2
Jul 14 09:24:35 gringotts kernel: [50078.679399]  disk 1, wo:0, o:1,  
dev:sdb5
Jul 14 09:24:35 gringotts mdadm[3495]: Fail event detected on md  
device /dev/md1, component device /dev/sda5

Jul 14 19:02:05 gringotts kernel: [84728.741110] ata3: exception Emask  
0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Jul 14 19:02:05 gringotts kernel: [84728.741190] ata3: SError:  
{ PHYRdyChg }
Jul 14 19:02:05 gringotts kernel: [84728.741234] ata3: hard resetting  
link
Jul 14 19:02:11 gringotts kernel: [84734.530052] ata3: link is slow to  
respond, please be patient (ready=-19)
Jul 14 19:02:15 gringotts kernel: [84738.790046] ata3: COMRESET failed  
(errno=-16)
Jul 14 19:02:15 gringotts kernel: [84738.790108] ata3: hard resetting  
link
Jul 14 19:02:16 gringotts kernel: [84739.600066] ata3: SATA link up  
1.5 Gbps (SStatus 113 SControl 310)
Jul 14 19:02:16 gringotts kernel: [84739.660599] ata3.00: configured  
for UDMA/100
Jul 14 19:02:16 gringotts kernel: [84739.660627] ata3: EH complete
Jul 14 19:02:16 gringotts kernel: [84739.661665] sd 2:0:0:0: [sda]  
1250263728 512-byte hardware sectors (640135 MB)
Jul 14 19:02:16 gringotts kernel: [84739.662300] sd 2:0:0:0: [sda]  
Write Protect is off
Jul 14 19:02:16 gringotts kernel: [84739.662315] sd 2:0:0:0: [sda]  
Mode Sense: 00 3a 00 00
Jul 14 19:02:16 gringotts kernel: [84739.663229] sd 2:0:0:0: [sda]  
Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Jul 16 09:38:16 gringotts kernel: [223700.170900] ata3.00: exception  
Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Jul 16 09:38:16 gringotts kernel: [223700.170985] ata3: SError:  
{ PHYRdyChg }
Jul 16 09:38:16 gringotts kernel: [223700.171029] ata3.00: cmd ea/ 
00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jul 16 09:38:16 gringotts kernel: [223700.171033]          res d0/ 
d0:d0:d0:d0:d0/d0:d0:d0:d0:d0/d0 Emask 0x12 (ATA bus error)
Jul 16 09:38:16 gringotts kernel: [223700.171120] ata3.00: status:  
{ Busy }
Jul 16 09:38:16 gringotts kernel: [223700.171151] ata3.00: error:  
{ ICRC UNC IDNF }
Jul 16 09:38:16 gringotts kernel: [223700.171197] ata3: hard resetting  
link
Jul 16 09:38:22 gringotts kernel: [223705.600073] ata3: SATA link up  
1.5 Gbps (SStatus 113 SControl 310)
Jul 16 09:38:22 gringotts kernel: [223705.660564] ata3.00: configured  
for UDMA/100
Jul 16 09:38:22 gringotts kernel: [223705.660609] end_request: I/O  
error, dev sda, sector 58604927
Jul 16 09:38:22 gringotts kernel: [223705.660667] md: super_written  
gets error=-5, uptodate=0
Jul 16 09:38:22 gringotts kernel: [223705.660678] raid1: Disk failure  
on sda5, disabling device.
Jul 16 09:38:22 gringotts kernel: [223705.660682] raid1: Operation  
continuing on 1 devices.
Jul 16 09:38:22 gringotts kernel: [223705.660771] ata3: EH complete
Jul 16 09:38:22 gringotts kernel: [223705.660902] sd 2:0:0:0: [sda]  
1250263728 512-byte hardware sectors (640135 MB)
Jul 16 09:38:22 gringotts kernel: [223705.660954] sd 2:0:0:0: [sda]  
Write Protect is off
Jul 16 09:38:22 gringotts kernel: [223705.660962] sd 2:0:0:0: [sda]  
Mode Sense: 00 3a 00 00
Jul 16 09:38:22 gringotts kernel: [223705.661043] sd 2:0:0:0: [sda]  
Write cache: enabled, read cache: enabled, doesn't support DPO or FUA


*** Background Info ***
root at gringotts:~# uname -a
Linux gringotts 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC 2009  
i686 GNU/Linux

root at gringotts:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]  
[raid4] [raid10]
md3 : active raid1 sda7[2](F) sdb7[1]
       593874752 blocks [2/1] [_U]

md2 : active raid1 sda6[0] sdb6[1]
       1951744 blocks [2/2] [UU]

md1 : active raid1 sda5[2](F) sdb5[1]
       19534912 blocks [2/1] [_U]

md0 : active raid1 sda1[2](F) sdb1[1]
       9767424 blocks [2/1] [_U]

unused devices: <none>


root at gringotts:~# mount
/dev/md0 on / type ext3 (rw,relatime,errors=remount-ro)
/dev/md1 on /home type ext3 (rw,relatime,user_xattr)
/dev/md3 on /opt/backup type ext3 (rw,relatime,user_xattr)

NOTE /dev/md2 is swap space.

root at gringotts:~# mdadm --detail /dev/md0
/dev/md0:
         Version : 00.90
   Creation Time : Sun Feb 22 10:07:22 2009
      Raid Level : raid1
      Array Size : 9767424 (9.31 GiB 10.00 GB)
   Used Dev Size : 9767424 (9.31 GiB 10.00 GB)
    Raid Devices : 2
   Total Devices : 2
Preferred Minor : 0
     Persistence : Superblock is persistent

     Update Time : Thu Jul 16 10:20:08 2009
           State : clean, degraded
  Active Devices : 1
Working Devices : 1
  Failed Devices : 1
   Spare Devices : 0

            UUID : e002f922:2f6ba792:a34321e9:b82a201b
          Events : 0.10924

     Number   Major   Minor   RaidDevice State
        0       0        0        0      removed
        1       8       17        1      active sync   /dev/sdb1

        2       8        1        -      faulty spare   /dev/sda1


root at gringotts:~# mdadm --detail /dev/md1
/dev/md1:
         Version : 00.90
   Creation Time : Sun Feb 22 10:07:32 2009
      Raid Level : raid1
      Array Size : 19534912 (18.63 GiB 20.00 GB)
   Used Dev Size : 19534912 (18.63 GiB 20.00 GB)
    Raid Devices : 2
   Total Devices : 2
Preferred Minor : 1
     Persistence : Superblock is persistent

     Update Time : Thu Jul 16 10:20:21 2009
           State : clean, degraded
  Active Devices : 1
Working Devices : 1
  Failed Devices : 1
   Spare Devices : 0

            UUID : 799faf28:c7ea5c83:21a2963b:0dddc562
          Events : 0.1602

     Number   Major   Minor   RaidDevice State
        0       0        0        0      removed
        1       8       21        1      active sync   /dev/sdb5

        2       8        5        -      faulty spare   /dev/sda5

root at gringotts:/var/log# mdadm --detail /dev/md2
/dev/md2:
         Version : 00.90
   Creation Time : Sun Feb 22 10:07:40 2009
      Raid Level : raid1
      Array Size : 1951744 (1906.32 MiB 1998.59 MB)
   Used Dev Size : 1951744 (1906.32 MiB 1998.59 MB)
    Raid Devices : 2
   Total Devices : 2
Preferred Minor : 2
     Persistence : Superblock is persistent

     Update Time : Thu Jul 16 07:23:32 2009
           State : clean
  Active Devices : 2
Working Devices : 2
  Failed Devices : 0
   Spare Devices : 0

            UUID : d78f4817:88606cc2:cd273dbb:85d5ca20
          Events : 0.18

     Number   Major   Minor   RaidDevice State
        0       8        6        0      active sync   /dev/sda6
        1       8       22        1      active sync   /dev/sdb6


root at gringotts:/var/log# mdadm --detail /dev/md3
/dev/md3:
         Version : 00.90
   Creation Time : Sun Feb 22 10:07:47 2009
      Raid Level : raid1
      Array Size : 593874752 (566.36 GiB 608.13 GB)
   Used Dev Size : 593874752 (566.36 GiB 608.13 GB)
    Raid Devices : 2
   Total Devices : 2
Preferred Minor : 3
     Persistence : Superblock is persistent

     Update Time : Thu Jul 16 01:43:28 2009
           State : clean, degraded
  Active Devices : 1
Working Devices : 1
  Failed Devices : 1
   Spare Devices : 0

            UUID : 44f192d1:72c1a98b:1027f96a:5aea9347
          Events : 0.2524

     Number   Major   Minor   RaidDevice State
        0       0        0        0      removed
        1       8       23        1      active sync   /dev/sdb7

        2       8        7        -      faulty spare   /dev/sda7

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 194 bytes
Desc: This is a digitally signed message part
Url : http://lists.PLUG.phoenix.az.us/pipermail/plug-discuss/attachments/20090716/f8249cd5/attachment.pgp 


More information about the PLUG-discuss mailing list