I'd look into the controller. The Promise PCI SATA cards are known to be
problematic with linux. I don't know the details, but have seen 2 of
them have problems with software raid at least. The problem can be very
sporadic, so it seems to fit your symptoms. Google for the details.
> I have a fileserver which I used to make backups on my local network.
> The 2 disks (Seagate 640GB, purchased new about 6 months ago) are set up
> with software RAID1. Each disk has 4 partitions, assembled into 4 RAID1
> devices.
>
> On Monday, I lost power to the machine. (Power cord yanked out, not
> power failure.) I powered the machine back on, and it started up
> without issues. About an hour later, I got emails from mdadm saying
> that 2 of my 4 RAID devices had failed.
>
> I assumed disk had not actually physically failed, since 2 of my 4
> devices were still online and (apparently) fine. I removed the failed
> partitions from their arrays, and they re-added them. mdadm happily
> re-synced them, and everything seemed fine. Now, a few days later, the
> same 2 partitions (md0, md1) are again saying they have failed, and md3
> has also failed. The failures didn't happen at the same time.
>
> I would appreciate some help troubleshooting here. What else should I
> do to investigate the problem partitions, or the disk itself? Prior to
> this little poweroff incident, this machine had been running without any
> issues for several months. I'm using an old workstation with a cheapo
> IDE SATA controller. (Maybe that's the problem?)
>
> *** Timeline ***
>
> 7/13 18:00 (roughly) : power yanked
> 7/13 22:00 (roughly) : power restored
> 7/14 9:24 : email from mdadm. sda5 (md1) failed.
> 7/14 9:29 : email from mdadm. sda1 (md0) failed.
> 7/15 morning (can't find the log entry to get exact time) : remove/add
> sda5 and sda1. resynced w/o issues.
> 7/15 20:20 : email from mdadm. sda1 (md0) failed.
> 7/16 0:07 : email from mdadm. sda7 (md3) failed.
> 7/16 9:38 : email from mdadm. sda5 (md1) failed.
>
> *** Syslog Errors ***
> I notice errors like this in /var/log/syslog in a few places. I can try
> to track them all down if that's helpful.
>
> Jul 14 09:24:29 gringotts kernel: [50073.107428] ata3.00: exception
> Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Jul 14 09:24:29 gringotts kernel: [50073.107509] ata3: SError: {
> PHYRdyChg }
> Jul 14 09:24:29 gringotts kernel: [50073.107550] ata3.00: cmd
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Jul 14 09:24:29 gringotts kernel: [50073.107553] res
> d0/d0:d0:d0:d0:d0/00:00:00:00:00/d0 Emask 0x12 (ATA bus error)
> Jul 14 09:24:29 gringotts kernel: [50073.107639] ata3.00: status: { Busy }
> Jul 14 09:24:29 gringotts kernel: [50073.107670] ata3.00: error: { ICRC
> UNC IDNF }
> Jul 14 09:24:29 gringotts kernel: [50073.107714] ata3: hard resetting link
> Jul 14 09:24:35 gringotts kernel: [50078.590072] ata3: SATA link up 1.5
> Gbps (SStatus 113 SControl 310)
> Jul 14 09:24:35 gringotts kernel: [50078.651141] ata3.00: configured for
> UDMA/100
> Jul 14 09:24:35 gringotts kernel: [50078.651190] end_request: I/O error,
> dev sda, sector 58604927
> Jul 14 09:24:35 gringotts kernel: [50078.651244] md: super_written gets
> error=-5, uptodate=0
> Jul 14 09:24:35 gringotts kernel: [50078.651257] raid1: Disk failure on
> sda5, disabling device.
> Jul 14 09:24:35 gringotts kernel: [50078.651260] raid1: Operation
> continuing on 1 devices.
> Jul 14 09:24:35 gringotts kernel: [50078.651353] ata3: EH complete
> Jul 14 09:24:35 gringotts kernel: [50078.653030] sd 2:0:0:0: [sda]
> 1250263728 512-byte hardware sectors (640135 MB)
> Jul 14 09:24:35 gringotts kernel: [50078.664063] sd 2:0:0:0: [sda] Write
> Protect is off
> Jul 14 09:24:35 gringotts kernel: [50078.664080] sd 2:0:0:0: [sda] Mode
> Sense: 00 3a 00 00
> Jul 14 09:24:35 gringotts kernel: [50078.664184] sd 2:0:0:0: [sda] Write
> cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Jul 14 09:24:35 gringotts kernel: [50078.677073] RAID1 conf printout:
> Jul 14 09:24:35 gringotts kernel: [50078.677094] --- wd:1 rd:2
> Jul 14 09:24:35 gringotts kernel: [50078.677103] disk 0, wo:1, o:0,
> dev:sda5
> Jul 14 09:24:35 gringotts kernel: [50078.677109] disk 1, wo:0, o:1,
> dev:sdb5
> Jul 14 09:24:35 gringotts kernel: [50078.679371] RAID1 conf printout:
> Jul 14 09:24:35 gringotts kernel: [50078.679389] --- wd:1 rd:2
> Jul 14 09:24:35 gringotts kernel: [50078.679399] disk 1, wo:0, o:1,
> dev:sdb5
> Jul 14 09:24:35 gringotts mdadm[3495]: Fail event detected on md device
> /dev/md1, component device /dev/sda5
>
> Jul 14 19:02:05 gringotts kernel: [84728.741110] ata3: exception Emask
> 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Jul 14 19:02:05 gringotts kernel: [84728.741190] ata3: SError: {
> PHYRdyChg }
> Jul 14 19:02:05 gringotts kernel: [84728.741234] ata3: hard resetting link
> Jul 14 19:02:11 gringotts kernel: [84734.530052] ata3: link is slow to
> respond, please be patient (ready=-19)
> Jul 14 19:02:15 gringotts kernel: [84738.790046] ata3: COMRESET failed
> (errno=-16)
> Jul 14 19:02:15 gringotts kernel: [84738.790108] ata3: hard resetting link
> Jul 14 19:02:16 gringotts kernel: [84739.600066] ata3: SATA link up 1.5
> Gbps (SStatus 113 SControl 310)
> Jul 14 19:02:16 gringotts kernel: [84739.660599] ata3.00: configured for
> UDMA/100
> Jul 14 19:02:16 gringotts kernel: [84739.660627] ata3: EH complete
> Jul 14 19:02:16 gringotts kernel: [84739.661665] sd 2:0:0:0: [sda]
> 1250263728 512-byte hardware sectors (640135 MB)
> Jul 14 19:02:16 gringotts kernel: [84739.662300] sd 2:0:0:0: [sda] Write
> Protect is off
> Jul 14 19:02:16 gringotts kernel: [84739.662315] sd 2:0:0:0: [sda] Mode
> Sense: 00 3a 00 00
> Jul 14 19:02:16 gringotts kernel: [84739.663229] sd 2:0:0:0: [sda] Write
> cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> Jul 16 09:38:16 gringotts kernel: [223700.170900] ata3.00: exception
> Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Jul 16 09:38:16 gringotts kernel: [223700.170985] ata3: SError: {
> PHYRdyChg }
> Jul 16 09:38:16 gringotts kernel: [223700.171029] ata3.00: cmd
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Jul 16 09:38:16 gringotts kernel: [223700.171033] res
> d0/d0:d0:d0:d0:d0/d0:d0:d0:d0:d0/d0 Emask 0x12 (ATA bus error)
> Jul 16 09:38:16 gringotts kernel: [223700.171120] ata3.00: status: { Busy }
> Jul 16 09:38:16 gringotts kernel: [223700.171151] ata3.00: error: { ICRC
> UNC IDNF }
> Jul 16 09:38:16 gringotts kernel: [223700.171197] ata3: hard resetting link
> Jul 16 09:38:22 gringotts kernel: [223705.600073] ata3: SATA link up 1.5
> Gbps (SStatus 113 SControl 310)
> Jul 16 09:38:22 gringotts kernel: [223705.660564] ata3.00: configured
> for UDMA/100
> Jul 16 09:38:22 gringotts kernel: [223705.660609] end_request: I/O
> error, dev sda, sector 58604927
> Jul 16 09:38:22 gringotts kernel: [223705.660667] md: super_written gets
> error=-5, uptodate=0
> Jul 16 09:38:22 gringotts kernel: [223705.660678] raid1: Disk failure on
> sda5, disabling device.
> Jul 16 09:38:22 gringotts kernel: [223705.660682] raid1: Operation
> continuing on 1 devices.
> Jul 16 09:38:22 gringotts kernel: [223705.660771] ata3: EH complete
> Jul 16 09:38:22 gringotts kernel: [223705.660902] sd 2:0:0:0: [sda]
> 1250263728 512-byte hardware sectors (640135 MB)
> Jul 16 09:38:22 gringotts kernel: [223705.660954] sd 2:0:0:0: [sda]
> Write Protect is off
> Jul 16 09:38:22 gringotts kernel: [223705.660962] sd 2:0:0:0: [sda] Mode
> Sense: 00 3a 00 00
> Jul 16 09:38:22 gringotts kernel: [223705.661043] sd 2:0:0:0: [sda]
> Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
>
> *** Background Info ***
> root@gringotts:~# uname -a
> Linux gringotts 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC 2009
> i686 GNU/Linux
>
> root@gringotts:~# cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
> [raid4] [raid10]
> md3 : active raid1 sda7[2](F) sdb7[1]
> 593874752 blocks [2/1] [_U]
>
> md2 : active raid1 sda6[0] sdb6[1]
> 1951744 blocks [2/2] [UU]
>
> md1 : active raid1 sda5[2](F) sdb5[1]
> 19534912 blocks [2/1] [_U]
>
> md0 : active raid1 sda1[2](F) sdb1[1]
> 9767424 blocks [2/1] [_U]
>
> unused devices: <none>
>
>
> root@gringotts:~# mount
> /dev/md0 on / type ext3 (rw,relatime,errors=remount-ro)
> /dev/md1 on /home type ext3 (rw,relatime,user_xattr)
> /dev/md3 on /opt/backup type ext3 (rw,relatime,user_xattr)
>
> NOTE /dev/md2 is swap space.
>
> root@gringotts:~# mdadm --detail /dev/md0
> /dev/md0:
> Version : 00.90
> Creation Time : Sun Feb 22 10:07:22 2009
> Raid Level : raid1
> Array Size : 9767424 (9.31 GiB 10.00 GB)
> Used Dev Size : 9767424 (9.31 GiB 10.00 GB)
> Raid Devices : 2
> Total Devices : 2
> Preferred Minor : 0
> Persistence : Superblock is persistent
>
> Update Time : Thu Jul 16 10:20:08 2009
> State : clean, degraded
> Active Devices : 1
> Working Devices : 1
> Failed Devices : 1
> Spare Devices : 0
>
> UUID : e002f922:2f6ba792:a34321e9:b82a201b
> Events : 0.10924
>
> Number Major Minor RaidDevice State
> 0 0 0 0 removed
> 1 8 17 1 active sync /dev/sdb1
>
> 2 8 1 - faulty spare /dev/sda1
>
>
> root@gringotts:~# mdadm --detail /dev/md1
> /dev/md1:
> Version : 00.90
> Creation Time : Sun Feb 22 10:07:32 2009
> Raid Level : raid1
> Array Size : 19534912 (18.63 GiB 20.00 GB)
> Used Dev Size : 19534912 (18.63 GiB 20.00 GB)
> Raid Devices : 2
> Total Devices : 2
> Preferred Minor : 1
> Persistence : Superblock is persistent
>
> Update Time : Thu Jul 16 10:20:21 2009
> State : clean, degraded
> Active Devices : 1
> Working Devices : 1
> Failed Devices : 1
> Spare Devices : 0
>
> UUID : 799faf28:c7ea5c83:21a2963b:0dddc562
> Events : 0.1602
>
> Number Major Minor RaidDevice State
> 0 0 0 0 removed
> 1 8 21 1 active sync /dev/sdb5
>
> 2 8 5 - faulty spare /dev/sda5
>
> root@gringotts:/var/log# mdadm --detail /dev/md2
> /dev/md2:
> Version : 00.90
> Creation Time : Sun Feb 22 10:07:40 2009
> Raid Level : raid1
> Array Size : 1951744 (1906.32 MiB 1998.59 MB)
> Used Dev Size : 1951744 (1906.32 MiB 1998.59 MB)
> Raid Devices : 2
> Total Devices : 2
> Preferred Minor : 2
> Persistence : Superblock is persistent
>
> Update Time : Thu Jul 16 07:23:32 2009
> State : clean
> Active Devices : 2
> Working Devices : 2
> Failed Devices : 0
> Spare Devices : 0
>
> UUID : d78f4817:88606cc2:cd273dbb:85d5ca20
> Events : 0.18
>
> Number Major Minor RaidDevice State
> 0 8 6 0 active sync /dev/sda6
> 1 8 22 1 active sync /dev/sdb6
>
>
> root@gringotts:/var/log# mdadm --detail /dev/md3
> /dev/md3:
> Version : 00.90
> Creation Time : Sun Feb 22 10:07:47 2009
> Raid Level : raid1
> Array Size : 593874752 (566.36 GiB 608.13 GB)
> Used Dev Size : 593874752 (566.36 GiB 608.13 GB)
> Raid Devices : 2
> Total Devices : 2
> Preferred Minor : 3
> Persistence : Superblock is persistent
>
> Update Time : Thu Jul 16 01:43:28 2009
> State : clean, degraded
> Active Devices : 1
> Working Devices : 1
> Failed Devices : 1
> Spare Devices : 0
>
> UUID : 44f192d1:72c1a98b:1027f96a:5aea9347
> Events : 0.2524
>
> Number Major Minor RaidDevice State
> 0 0 0 0 removed
> 1 8 23 1 active sync /dev/sdb7
>
> 2 8 7 - faulty spare /dev/sda7
>