Re: need help troubleshooting software raid failures

Author: Eric Shubert
Date:  
To: plug-discuss
Subject: Re: need help troubleshooting software raid failures
I'd look into the controller. The Promise PCI SATA cards are known to be
problematic with Linux. I don't know the details, but I have seen at
least 2 of them have problems with software RAID. The problem can be
very sporadic, so it seems to fit your symptoms. Google for the details.
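
If you want a quick sanity check before swapping hardware, here is a
rough sketch (not something I've run on your box) that tallies how many
times each ATA port had its link hard-reset, going by the "hard
resetting link" lines in the syslog you posted. count_link_resets is
just a name I made up, and it assumes each kernel message sits on one
line in the log file rather than wrapped the way mail clients wrap it.
If every reset lands on the same port, that would lean toward the link
(controller, cable, or power) rather than the platters, though it
doesn't prove it.

```shell
#!/bin/sh
# Tally libata "hard resetting link" events per ATA port.
# Usage: count_link_resets /var/log/syslog
count_link_resets() {
    # -o keeps only the matched text, e.g. "ata3: hard resetting link";
    # cut then keeps the port name in front of the colon.
    grep -oE 'ata[0-9]+: hard resetting link' "$1" \
        | cut -d: -f1 \
        | sort \
        | uniq -c
}
```

Call it as `count_link_resets /var/log/syslog`; each output line is a
count followed by the port name.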

Please let me know if/when you find a PCI/SATA card that works!

Alex Dean wrote:
> I have a fileserver which I use to make backups on my local network.
> The 2 disks (Seagate 640GB, purchased new about 6 months ago) are set up
> with software RAID1. Each disk has 4 partitions, assembled into 4 RAID1
> devices.
>
> On Monday, I lost power to the machine. (Power cord yanked out, not
> power failure.) I powered the machine back on, and it started up
> without issues. About an hour later, I got emails from mdadm saying
> that 2 of my 4 RAID devices had failed.
>
> I assumed the disk had not actually physically failed, since 2 of my 4
> devices were still online and (apparently) fine. I removed the failed
> partitions from their arrays, and then re-added them. mdadm happily
> re-synced them, and everything seemed fine. Now, a few days later, the
> same 2 partitions (md0, md1) are again saying they have failed, and md3
> has also failed. The failures didn't happen at the same time.
>
> I would appreciate some help troubleshooting here. What else should I
> do to investigate the problem partitions, or the disk itself? Prior to
> this little poweroff incident, this machine had been running without any
> issues for several months. I'm using an old workstation with a cheapo
> IDE SATA controller. (Maybe that's the problem?)
>
> *** Timeline ***
>
> 7/13 18:00 (roughly) : power yanked
> 7/13 22:00 (roughly) : power restored
> 7/14 9:24 : email from mdadm. sda5 (md1) failed.
> 7/14 9:29 : email from mdadm. sda1 (md0) failed.
> 7/15 morning (can't find the log entry to get exact time) : remove/add
> sda5 and sda1. resynced w/o issues.
> 7/15 20:20 : email from mdadm. sda1 (md0) failed.
> 7/16 0:07 : email from mdadm. sda7 (md3) failed.
> 7/16 9:38 : email from mdadm. sda5 (md1) failed.
>
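
The failure times in a timeline like this can be rebuilt from syslog
instead of from the emails. A minimal sketch, assuming mdadm's monitor
logs every failure in the same "Fail event detected on md device ..."
form as the one line in your excerpt further down, and that each
message is on a single line in the file. md_fail_events is a made-up
helper name, and older entries may have been rotated into other log
files.

```shell
#!/bin/sh
# Print "<month> <day> <time> <array> <member>" for each mdadm Fail event.
# Usage: md_fail_events /var/log/syslog
md_fail_events() {
    awk '/Fail event detected on md device/ {
        array = $(NF - 3)
        sub(/,$/, "", array)    # strip the comma that follows the array name
        print $1, $2, $3, array, $NF
    }' "$1"
}
```

For the mdadm line you quoted, this would print
"Jul 14 09:24:35 /dev/md1 /dev/sda5".
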
> *** Syslog Errors ***
> I notice errors like this in /var/log/syslog in a few places. I can try
> to track them all down if that's helpful.
>
> Jul 14 09:24:29 gringotts kernel: [50073.107428] ata3.00: exception 
> Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Jul 14 09:24:29 gringotts kernel: [50073.107509] ata3: SError: { 
> PHYRdyChg }
> Jul 14 09:24:29 gringotts kernel: [50073.107550] ata3.00: cmd 
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Jul 14 09:24:29 gringotts kernel: [50073.107553]          res 
> d0/d0:d0:d0:d0:d0/00:00:00:00:00/d0 Emask 0x12 (ATA bus error)
> Jul 14 09:24:29 gringotts kernel: [50073.107639] ata3.00: status: { Busy }
> Jul 14 09:24:29 gringotts kernel: [50073.107670] ata3.00: error: { ICRC 
> UNC IDNF }
> Jul 14 09:24:29 gringotts kernel: [50073.107714] ata3: hard resetting link
> Jul 14 09:24:35 gringotts kernel: [50078.590072] ata3: SATA link up 1.5 
> Gbps (SStatus 113 SControl 310)
> Jul 14 09:24:35 gringotts kernel: [50078.651141] ata3.00: configured for 
> UDMA/100
> Jul 14 09:24:35 gringotts kernel: [50078.651190] end_request: I/O error, 
> dev sda, sector 58604927
> Jul 14 09:24:35 gringotts kernel: [50078.651244] md: super_written gets 
> error=-5, uptodate=0
> Jul 14 09:24:35 gringotts kernel: [50078.651257] raid1: Disk failure on 
> sda5, disabling device.
> Jul 14 09:24:35 gringotts kernel: [50078.651260] raid1: Operation 
> continuing on 1 devices.
> Jul 14 09:24:35 gringotts kernel: [50078.651353] ata3: EH complete
> Jul 14 09:24:35 gringotts kernel: [50078.653030] sd 2:0:0:0: [sda] 
> 1250263728 512-byte hardware sectors (640135 MB)
> Jul 14 09:24:35 gringotts kernel: [50078.664063] sd 2:0:0:0: [sda] Write 
> Protect is off
> Jul 14 09:24:35 gringotts kernel: [50078.664080] sd 2:0:0:0: [sda] Mode 
> Sense: 00 3a 00 00
> Jul 14 09:24:35 gringotts kernel: [50078.664184] sd 2:0:0:0: [sda] Write 
> cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Jul 14 09:24:35 gringotts kernel: [50078.677073] RAID1 conf printout:
> Jul 14 09:24:35 gringotts kernel: [50078.677094]  --- wd:1 rd:2
> Jul 14 09:24:35 gringotts kernel: [50078.677103]  disk 0, wo:1, o:0, 
> dev:sda5
> Jul 14 09:24:35 gringotts kernel: [50078.677109]  disk 1, wo:0, o:1, 
> dev:sdb5
> Jul 14 09:24:35 gringotts kernel: [50078.679371] RAID1 conf printout:
> Jul 14 09:24:35 gringotts kernel: [50078.679389]  --- wd:1 rd:2
> Jul 14 09:24:35 gringotts kernel: [50078.679399]  disk 1, wo:0, o:1, 
> dev:sdb5
> Jul 14 09:24:35 gringotts mdadm[3495]: Fail event detected on md device 
> /dev/md1, component device /dev/sda5
>
> Jul 14 19:02:05 gringotts kernel: [84728.741110] ata3: exception Emask
> 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Jul 14 19:02:05 gringotts kernel: [84728.741190] ata3: SError: {
> PHYRdyChg }
> Jul 14 19:02:05 gringotts kernel: [84728.741234] ata3: hard resetting link
> Jul 14 19:02:11 gringotts kernel: [84734.530052] ata3: link is slow to
> respond, please be patient (ready=-19)
> Jul 14 19:02:15 gringotts kernel: [84738.790046] ata3: COMRESET failed
> (errno=-16)
> Jul 14 19:02:15 gringotts kernel: [84738.790108] ata3: hard resetting link
> Jul 14 19:02:16 gringotts kernel: [84739.600066] ata3: SATA link up 1.5
> Gbps (SStatus 113 SControl 310)
> Jul 14 19:02:16 gringotts kernel: [84739.660599] ata3.00: configured for
> UDMA/100
> Jul 14 19:02:16 gringotts kernel: [84739.660627] ata3: EH complete
> Jul 14 19:02:16 gringotts kernel: [84739.661665] sd 2:0:0:0: [sda]
> 1250263728 512-byte hardware sectors (640135 MB)
> Jul 14 19:02:16 gringotts kernel: [84739.662300] sd 2:0:0:0: [sda] Write
> Protect is off
> Jul 14 19:02:16 gringotts kernel: [84739.662315] sd 2:0:0:0: [sda] Mode
> Sense: 00 3a 00 00
> Jul 14 19:02:16 gringotts kernel: [84739.663229] sd 2:0:0:0: [sda] Write
> cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> Jul 16 09:38:16 gringotts kernel: [223700.170900] ata3.00: exception 
> Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Jul 16 09:38:16 gringotts kernel: [223700.170985] ata3: SError: { 
> PHYRdyChg }
> Jul 16 09:38:16 gringotts kernel: [223700.171029] ata3.00: cmd 
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Jul 16 09:38:16 gringotts kernel: [223700.171033]          res 
> d0/d0:d0:d0:d0:d0/d0:d0:d0:d0:d0/d0 Emask 0x12 (ATA bus error)
> Jul 16 09:38:16 gringotts kernel: [223700.171120] ata3.00: status: { Busy }
> Jul 16 09:38:16 gringotts kernel: [223700.171151] ata3.00: error: { ICRC 
> UNC IDNF }
> Jul 16 09:38:16 gringotts kernel: [223700.171197] ata3: hard resetting link
> Jul 16 09:38:22 gringotts kernel: [223705.600073] ata3: SATA link up 1.5 
> Gbps (SStatus 113 SControl 310)
> Jul 16 09:38:22 gringotts kernel: [223705.660564] ata3.00: configured 
> for UDMA/100
> Jul 16 09:38:22 gringotts kernel: [223705.660609] end_request: I/O 
> error, dev sda, sector 58604927
> Jul 16 09:38:22 gringotts kernel: [223705.660667] md: super_written gets 
> error=-5, uptodate=0
> Jul 16 09:38:22 gringotts kernel: [223705.660678] raid1: Disk failure on 
> sda5, disabling device.
> Jul 16 09:38:22 gringotts kernel: [223705.660682] raid1: Operation 
> continuing on 1 devices.
> Jul 16 09:38:22 gringotts kernel: [223705.660771] ata3: EH complete
> Jul 16 09:38:22 gringotts kernel: [223705.660902] sd 2:0:0:0: [sda] 
> 1250263728 512-byte hardware sectors (640135 MB)
> Jul 16 09:38:22 gringotts kernel: [223705.660954] sd 2:0:0:0: [sda] 
> Write Protect is off
> Jul 16 09:38:22 gringotts kernel: [223705.660962] sd 2:0:0:0: [sda] Mode 
> Sense: 00 3a 00 00
> Jul 16 09:38:22 gringotts kernel: [223705.661043] sd 2:0:0:0: [sda] 
> Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
>
> *** Background Info ***
> root@gringotts:~# uname -a
> Linux gringotts 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC 2009
> i686 GNU/Linux
>
> root@gringotts:~# cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] 
> [raid4] [raid10]
> md3 : active raid1 sda7[2](F) sdb7[1]
>       593874752 blocks [2/1] [_U]
>
> md2 : active raid1 sda6[0] sdb6[1]
>       1951744 blocks [2/2] [UU]
>
> md1 : active raid1 sda5[2](F) sdb5[1]
>       19534912 blocks [2/1] [_U]
>
> md0 : active raid1 sda1[2](F) sdb1[1]
>       9767424 blocks [2/1] [_U]
>
> unused devices: <none>
>
>
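
For keeping an eye on this without waiting for the mdadm emails, here is
a small sketch that lists the arrays whose member map in /proc/mdstat
shows a missing device (an underscore, as in [_U] above). The function
name degraded_arrays is my own invention, and it only understands the
mdstat layout shown in your output.

```shell
#!/bin/sh
# List md arrays whose member map (e.g. [_U]) shows a missing device.
# Reads /proc/mdstat unless another file is given as the first argument.
degraded_arrays() {
    awk '/^md[0-9]+ :/        { name = $1 }     # remember the current array
         /\[[U_]*_[U_]*\]/    { print name }    # member map with a gap
        ' "${1:-/proc/mdstat}"
}
```

Against the mdstat output above it would report md3, md1 and md0, and
skip md2.
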
> root@gringotts:~# mount
> /dev/md0 on / type ext3 (rw,relatime,errors=remount-ro)
> /dev/md1 on /home type ext3 (rw,relatime,user_xattr)
> /dev/md3 on /opt/backup type ext3 (rw,relatime,user_xattr)
>
> NOTE /dev/md2 is swap space.
>
> root@gringotts:~# mdadm --detail /dev/md0
> /dev/md0:
>         Version : 00.90
>   Creation Time : Sun Feb 22 10:07:22 2009
>      Raid Level : raid1
>      Array Size : 9767424 (9.31 GiB 10.00 GB)
>   Used Dev Size : 9767424 (9.31 GiB 10.00 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 0
>     Persistence : Superblock is persistent
>
>     Update Time : Thu Jul 16 10:20:08 2009
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 1
>   Spare Devices : 0
>
>            UUID : e002f922:2f6ba792:a34321e9:b82a201b
>          Events : 0.10924
>
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1       8       17        1      active sync   /dev/sdb1
>
>        2       8        1        -      faulty spare   /dev/sda1
>
> root@gringotts:~# mdadm --detail /dev/md1
> /dev/md1:
>         Version : 00.90
>   Creation Time : Sun Feb 22 10:07:32 2009
>      Raid Level : raid1
>      Array Size : 19534912 (18.63 GiB 20.00 GB)
>   Used Dev Size : 19534912 (18.63 GiB 20.00 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 1
>     Persistence : Superblock is persistent
>
>     Update Time : Thu Jul 16 10:20:21 2009
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 1
>   Spare Devices : 0
>
>            UUID : 799faf28:c7ea5c83:21a2963b:0dddc562
>          Events : 0.1602
>
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1       8       21        1      active sync   /dev/sdb5
>
>        2       8        5        -      faulty spare   /dev/sda5
>
> root@gringotts:/var/log# mdadm --detail /dev/md2
> /dev/md2:
>         Version : 00.90
>   Creation Time : Sun Feb 22 10:07:40 2009
>      Raid Level : raid1
>      Array Size : 1951744 (1906.32 MiB 1998.59 MB)
>   Used Dev Size : 1951744 (1906.32 MiB 1998.59 MB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 2
>     Persistence : Superblock is persistent
>
>     Update Time : Thu Jul 16 07:23:32 2009
>           State : clean
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
>
>            UUID : d78f4817:88606cc2:cd273dbb:85d5ca20
>          Events : 0.18
>
>     Number   Major   Minor   RaidDevice State
>        0       8        6        0      active sync   /dev/sda6
>        1       8       22        1      active sync   /dev/sdb6
>
> root@gringotts:/var/log# mdadm --detail /dev/md3
> /dev/md3:
>         Version : 00.90
>   Creation Time : Sun Feb 22 10:07:47 2009
>      Raid Level : raid1
>      Array Size : 593874752 (566.36 GiB 608.13 GB)
>   Used Dev Size : 593874752 (566.36 GiB 608.13 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 3
>     Persistence : Superblock is persistent
>
>     Update Time : Thu Jul 16 01:43:28 2009
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 1
>   Spare Devices : 0
>
>            UUID : 44f192d1:72c1a98b:1027f96a:5aea9347
>          Events : 0.2524
>
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1       8       23        1      active sync   /dev/sdb7
>
>        2       8        7        -      faulty spare   /dev/sda7


--
-Eric 'shubes'

---------------------------------------------------
PLUG-discuss mailing list -
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss