need help troubleshooting software raid failures

Eric Shubert ejs at shubes.net
Thu Jul 16 09:08:49 MST 2009


I'd look into the controller. The Promise PCI SATA cards are known to be 
problematic with linux. I don't know the details, but have seen 2 of 
them have problems with software raid at least. The problem can be very 
sporadic, so it seems to fit your symptoms. Google for the details.

Please let me know if/when you find a PCI/SATA card that works!

Alex Dean wrote:
> I have a fileserver which I used to make backups on my local network.  
> The 2 disks (Seagate 640GB, purchased new about 6 months ago) are set up 
> with software RAID1.  Each disk has 4 partitions, assembled into 4 RAID1 
> devices.
> 
> On Monday, I lost power to the machine.  (Power cord yanked out, not 
> power failure.)  I powered the machine back on, and it started up 
> without issues.  About an hour later, I got emails from mdadm saying 
> that 2 of my 4 RAID devices had failed.
> 
> I assumed disk had not actually physically failed, since 2 of my 4 
> devices were still online and (apparently) fine.  I removed the failed 
> partitions from their arrays, and they re-added them.  mdadm happily 
> re-synced them, and everything seemed fine.  Now, a few days later, the 
> same 2 partitions (md0, md1) are again saying they have failed, and md3 
> has also failed.  The failures didn't happen at the same time.
> 
> I would appreciate some help troubleshooting here.  What else should I 
> do to investigate the problem partitions, or the disk itself?  Prior to 
> this little poweroff incident, this machine had been running without any 
> issues for several months.  I'm using an old workstation with a cheapo 
> IDE SATA controller.  (Maybe that's the problem?)
> 
> *** Timeline ***
> 
> 7/13 18:00 (roughly) : power yanked
> 7/13 22:00 (roughly) : power restored
> 7/14  9:24 : email from mdadm.  sda5 (md1) failed.
> 7/14  9:29 : email from mdadm.  sda1 (md0) failed.
> 7/15 morning (can't find the log entry to get exact time) : remove/add 
> sda5 and sda1.  resynced w/o issues.
> 7/15 20:20 : email from mdadm.  sda1 (md0) failed.
> 7/16  0:07 : email from mdadm.  sda7 (md3) failed.
> 7/16  9:38 : email from mdadm.  sda5 (md1) failed.
> 
> *** Syslog Errors ***
> I notice errors like this in /var/log/syslog in a few places.  I can try 
> to track them all down if that's helpful.
> 
> Jul 14 09:24:29 gringotts kernel: [50073.107428] ata3.00: exception 
> Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Jul 14 09:24:29 gringotts kernel: [50073.107509] ata3: SError: { 
> PHYRdyChg }
> Jul 14 09:24:29 gringotts kernel: [50073.107550] ata3.00: cmd 
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Jul 14 09:24:29 gringotts kernel: [50073.107553]          res 
> d0/d0:d0:d0:d0:d0/00:00:00:00:00/d0 Emask 0x12 (ATA bus error)
> Jul 14 09:24:29 gringotts kernel: [50073.107639] ata3.00: status: { Busy }
> Jul 14 09:24:29 gringotts kernel: [50073.107670] ata3.00: error: { ICRC 
> UNC IDNF }
> Jul 14 09:24:29 gringotts kernel: [50073.107714] ata3: hard resetting link
> Jul 14 09:24:35 gringotts kernel: [50078.590072] ata3: SATA link up 1.5 
> Gbps (SStatus 113 SControl 310)
> Jul 14 09:24:35 gringotts kernel: [50078.651141] ata3.00: configured for 
> UDMA/100
> Jul 14 09:24:35 gringotts kernel: [50078.651190] end_request: I/O error, 
> dev sda, sector 58604927
> Jul 14 09:24:35 gringotts kernel: [50078.651244] md: super_written gets 
> error=-5, uptodate=0
> Jul 14 09:24:35 gringotts kernel: [50078.651257] raid1: Disk failure on 
> sda5, disabling device.
> Jul 14 09:24:35 gringotts kernel: [50078.651260] raid1: Operation 
> continuing on 1 devices.
> Jul 14 09:24:35 gringotts kernel: [50078.651353] ata3: EH complete
> Jul 14 09:24:35 gringotts kernel: [50078.653030] sd 2:0:0:0: [sda] 
> 1250263728 512-byte hardware sectors (640135 MB)
> Jul 14 09:24:35 gringotts kernel: [50078.664063] sd 2:0:0:0: [sda] Write 
> Protect is off
> Jul 14 09:24:35 gringotts kernel: [50078.664080] sd 2:0:0:0: [sda] Mode 
> Sense: 00 3a 00 00
> Jul 14 09:24:35 gringotts kernel: [50078.664184] sd 2:0:0:0: [sda] Write 
> cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Jul 14 09:24:35 gringotts kernel: [50078.677073] RAID1 conf printout:
> Jul 14 09:24:35 gringotts kernel: [50078.677094]  --- wd:1 rd:2
> Jul 14 09:24:35 gringotts kernel: [50078.677103]  disk 0, wo:1, o:0, 
> dev:sda5
> Jul 14 09:24:35 gringotts kernel: [50078.677109]  disk 1, wo:0, o:1, 
> dev:sdb5
> Jul 14 09:24:35 gringotts kernel: [50078.679371] RAID1 conf printout:
> Jul 14 09:24:35 gringotts kernel: [50078.679389]  --- wd:1 rd:2
> Jul 14 09:24:35 gringotts kernel: [50078.679399]  disk 1, wo:0, o:1, 
> dev:sdb5
> Jul 14 09:24:35 gringotts mdadm[3495]: Fail event detected on md device 
> /dev/md1, component device /dev/sda5
> 
> Jul 14 19:02:05 gringotts kernel: [84728.741110] ata3: exception Emask 
> 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Jul 14 19:02:05 gringotts kernel: [84728.741190] ata3: SError: { 
> PHYRdyChg }
> Jul 14 19:02:05 gringotts kernel: [84728.741234] ata3: hard resetting link
> Jul 14 19:02:11 gringotts kernel: [84734.530052] ata3: link is slow to 
> respond, please be patient (ready=-19)
> Jul 14 19:02:15 gringotts kernel: [84738.790046] ata3: COMRESET failed 
> (errno=-16)
> Jul 14 19:02:15 gringotts kernel: [84738.790108] ata3: hard resetting link
> Jul 14 19:02:16 gringotts kernel: [84739.600066] ata3: SATA link up 1.5 
> Gbps (SStatus 113 SControl 310)
> Jul 14 19:02:16 gringotts kernel: [84739.660599] ata3.00: configured for 
> UDMA/100
> Jul 14 19:02:16 gringotts kernel: [84739.660627] ata3: EH complete
> Jul 14 19:02:16 gringotts kernel: [84739.661665] sd 2:0:0:0: [sda] 
> 1250263728 512-byte hardware sectors (640135 MB)
> Jul 14 19:02:16 gringotts kernel: [84739.662300] sd 2:0:0:0: [sda] Write 
> Protect is off
> Jul 14 19:02:16 gringotts kernel: [84739.662315] sd 2:0:0:0: [sda] Mode 
> Sense: 00 3a 00 00
> Jul 14 19:02:16 gringotts kernel: [84739.663229] sd 2:0:0:0: [sda] Write 
> cache: enabled, read cache: enabled, doesn't support DPO or FUA
> 
> Jul 16 09:38:16 gringotts kernel: [223700.170900] ata3.00: exception 
> Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Jul 16 09:38:16 gringotts kernel: [223700.170985] ata3: SError: { 
> PHYRdyChg }
> Jul 16 09:38:16 gringotts kernel: [223700.171029] ata3.00: cmd 
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Jul 16 09:38:16 gringotts kernel: [223700.171033]          res 
> d0/d0:d0:d0:d0:d0/d0:d0:d0:d0:d0/d0 Emask 0x12 (ATA bus error)
> Jul 16 09:38:16 gringotts kernel: [223700.171120] ata3.00: status: { Busy }
> Jul 16 09:38:16 gringotts kernel: [223700.171151] ata3.00: error: { ICRC 
> UNC IDNF }
> Jul 16 09:38:16 gringotts kernel: [223700.171197] ata3: hard resetting link
> Jul 16 09:38:22 gringotts kernel: [223705.600073] ata3: SATA link up 1.5 
> Gbps (SStatus 113 SControl 310)
> Jul 16 09:38:22 gringotts kernel: [223705.660564] ata3.00: configured 
> for UDMA/100
> Jul 16 09:38:22 gringotts kernel: [223705.660609] end_request: I/O 
> error, dev sda, sector 58604927
> Jul 16 09:38:22 gringotts kernel: [223705.660667] md: super_written gets 
> error=-5, uptodate=0
> Jul 16 09:38:22 gringotts kernel: [223705.660678] raid1: Disk failure on 
> sda5, disabling device.
> Jul 16 09:38:22 gringotts kernel: [223705.660682] raid1: Operation 
> continuing on 1 devices.
> Jul 16 09:38:22 gringotts kernel: [223705.660771] ata3: EH complete
> Jul 16 09:38:22 gringotts kernel: [223705.660902] sd 2:0:0:0: [sda] 
> 1250263728 512-byte hardware sectors (640135 MB)
> Jul 16 09:38:22 gringotts kernel: [223705.660954] sd 2:0:0:0: [sda] 
> Write Protect is off
> Jul 16 09:38:22 gringotts kernel: [223705.660962] sd 2:0:0:0: [sda] Mode 
> Sense: 00 3a 00 00
> Jul 16 09:38:22 gringotts kernel: [223705.661043] sd 2:0:0:0: [sda] 
> Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> 
> 
> *** Background Info ***
> root at gringotts:~# uname -a
> Linux gringotts 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC 2009 
> i686 GNU/Linux
> 
> root at gringotts:~# cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] 
> [raid4] [raid10]
> md3 : active raid1 sda7[2](F) sdb7[1]
>       593874752 blocks [2/1] [_U]
> 
> md2 : active raid1 sda6[0] sdb6[1]
>       1951744 blocks [2/2] [UU]
> 
> md1 : active raid1 sda5[2](F) sdb5[1]
>       19534912 blocks [2/1] [_U]
> 
> md0 : active raid1 sda1[2](F) sdb1[1]
>       9767424 blocks [2/1] [_U]
> 
> unused devices: <none>
> 
> 
> root at gringotts:~# mount
> /dev/md0 on / type ext3 (rw,relatime,errors=remount-ro)
> /dev/md1 on /home type ext3 (rw,relatime,user_xattr)
> /dev/md3 on /opt/backup type ext3 (rw,relatime,user_xattr)
> 
> NOTE /dev/md2 is swap space.
> 
> root at gringotts:~# mdadm --detail /dev/md0
> /dev/md0:
>         Version : 00.90
>   Creation Time : Sun Feb 22 10:07:22 2009
>      Raid Level : raid1
>      Array Size : 9767424 (9.31 GiB 10.00 GB)
>   Used Dev Size : 9767424 (9.31 GiB 10.00 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 0
>     Persistence : Superblock is persistent
> 
>     Update Time : Thu Jul 16 10:20:08 2009
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 1
>   Spare Devices : 0
> 
>            UUID : e002f922:2f6ba792:a34321e9:b82a201b
>          Events : 0.10924
> 
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1       8       17        1      active sync   /dev/sdb1
> 
>        2       8        1        -      faulty spare   /dev/sda1
> 
> 
> root at gringotts:~# mdadm --detail /dev/md1
> /dev/md1:
>         Version : 00.90
>   Creation Time : Sun Feb 22 10:07:32 2009
>      Raid Level : raid1
>      Array Size : 19534912 (18.63 GiB 20.00 GB)
>   Used Dev Size : 19534912 (18.63 GiB 20.00 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 1
>     Persistence : Superblock is persistent
> 
>     Update Time : Thu Jul 16 10:20:21 2009
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 1
>   Spare Devices : 0
> 
>            UUID : 799faf28:c7ea5c83:21a2963b:0dddc562
>          Events : 0.1602
> 
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1       8       21        1      active sync   /dev/sdb5
> 
>        2       8        5        -      faulty spare   /dev/sda5
> 
> root at gringotts:/var/log# mdadm --detail /dev/md2
> /dev/md2:
>         Version : 00.90
>   Creation Time : Sun Feb 22 10:07:40 2009
>      Raid Level : raid1
>      Array Size : 1951744 (1906.32 MiB 1998.59 MB)
>   Used Dev Size : 1951744 (1906.32 MiB 1998.59 MB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 2
>     Persistence : Superblock is persistent
> 
>     Update Time : Thu Jul 16 07:23:32 2009
>           State : clean
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
> 
>            UUID : d78f4817:88606cc2:cd273dbb:85d5ca20
>          Events : 0.18
> 
>     Number   Major   Minor   RaidDevice State
>        0       8        6        0      active sync   /dev/sda6
>        1       8       22        1      active sync   /dev/sdb6
> 
> 
> root at gringotts:/var/log# mdadm --detail /dev/md3
> /dev/md3:
>         Version : 00.90
>   Creation Time : Sun Feb 22 10:07:47 2009
>      Raid Level : raid1
>      Array Size : 593874752 (566.36 GiB 608.13 GB)
>   Used Dev Size : 593874752 (566.36 GiB 608.13 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 3
>     Persistence : Superblock is persistent
> 
>     Update Time : Thu Jul 16 01:43:28 2009
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 1
>   Spare Devices : 0
> 
>            UUID : 44f192d1:72c1a98b:1027f96a:5aea9347
>          Events : 0.2524
> 
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1       8       23        1      active sync   /dev/sdb7
> 
>        2       8        7        -      faulty spare   /dev/sda7
> 


-- 
-Eric 'shubes'



More information about the PLUG-discuss mailing list