Hard drive tribulations and ReiserFS

Shawn T. Rutledge rutledge@cx47646-a.phnx1.az.home.com
Tue, 5 Sep 2000 00:43:54 -0700


(Long story ahead, but ReiserFS comes at the end...)

A few months ago I had a cheapie Celeron motherboard from Camelback
Computers.  That machine had the DPT RAID controller installed.  After 
2 of the RAID drives died (due to overheating probably - they ran too 
hot) and I gave up on the idea, I got a new Seagate IDE disk and tried 
to use it with the motherboard's built-in controller, which up to that 
point had seen no use.  Well, I started seeing kernel messages about
IDE problems (0x51 blah blah blah or something like that).  I tried 
several kernels of different vintages, tried various IDE-related kernel
config options, etc., to no avail.  Sometimes I'd hear some violent-sounding
seek noises, like it was banging the heads against the drive case or 
something.  (I didn't think that was possible with an IDE drive - it
should be smart enough to take care of itself even if given wrong 
cylinder requests, shouldn't it?)  A few days later the drive died.  I
took it back to Fry's, said "so much for Seagate" and got an IBM.  As
soon as I got Linux installed on it, I started seeing the same errors and
hearing the same kinds of unusual noises.  So I figured maybe it was 
the controller's fault and maybe I was about to kill another hard drive,
so I immediately quit using it, and now that motherboard is over at my
dad's place running Win98 as the manufacturer intended, so my younger
sisters can play their silly girlie games etc.

So I figured I was tired of crappy hardware and got a "guaranteed
overclockable" setup online, consisting of a BP6 motherboard and two
Celeron 366s and two humongous 32 CFM fansinks.  I put the same IBM
hard drive in there, and on the HPT366 IDE controller too (it has 
both a garden-variety UDMA33 controller and the UDMA66 controller).
I tried overclocking but it wasn't stable; once I got it to run for 
15 minutes or so at the next higher bus speed, but then it crashed,
and for some reason I've never been able to make slight adjustments
to the bus clock and have it work; it only works if I go to the next
major speed bump.  Odd.  Anyway, it seems reliable enough at 366MHz.

I never got any more IDE kernel errors that I can remember, but once
in a while the machine would just freeze up.  Usually either X was
running, and the machine would freeze with a frozen graphical display,
or else the stinkin' console screen saver would have blanked the screen
so that there weren't any errors on the screen either.  I didn't find 
anything weird in any logs after rebooting, and then it would be fine
for several days again.  But the fsck errors kept getting worse; I was
getting some fairly severe filesystem corruption, and stuff was showing
up in lost+found.  I tried putting the drive onto the slower IDE 
controller, but it still continued to hang up every few days.  I 
figured out how to get rid of the console screen saver 
(setterm -blank 0), and this weekend, it hung once, while the 
console was up, and I saw no error messages at all.  So I said "this
blows" and dug out my old Buslogic SCSI controller and an extra 4 gig
drive I had lying around, and installed a fresh copy of Potato on it.
I tried to back up the files from the IDE drive into a directory on the
SCSI drive, and twice while doing that under the 2.3.42 SMP kernel
I'd been running on that machine, it hung again.  So I rebooted
to the old 2.0.36 kernel that came with Slink, and successfully
copied the files.  Then I built a fresh 2.4.0-test7+reiserfs kernel
on the SCSI drive and booted with that.  So far I have had no more
hangs or crashes with this setup.  But I wanted to reformat the IDE
drive and torture-test it to see if it's going to be reliable, or if
the problems have been its fault all along.  Since the test had an iffy
chance of success anyway, I formatted it with ReiserFS and mounted it
at /var.
My next step is to create a Postgres database for weather data like
the one currently on electron (http://gw.kb7pwd.ampr.org) and see how
long that runs.  The weather data gets written fairly continuously, 
and last time I tried this experiment on the previous install on the
IDE drive, the database got corrupted in a couple days; so this should
be a good test.  Then again, maybe I should do it on an ext2 fs, so
that if it fails I'll know the drive is the problem; ReiserFS might
mask some of the drive's problems, or fail of its own accord.  The
drive is now on the slower IDE controller.  If it survives, I'll try
the fast one again.
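
For anyone who wants to try the same kind of torture test, here is a
rough sketch of one pass in Python.  It is not what I actually ran; the
function name torture_pass, the file names and the sizes are all made
up for illustration.  It writes files of random data into a directory
on the suspect drive, forces them out with fsync, reads them back and
compares SHA-256 checksums.  One caveat: a read right after a write may
be served from the kernel's cache rather than the disk, so for a real
test the total data written should be larger than RAM (or the
filesystem remounted between the write and the read).

```python
import hashlib
import os


def torture_pass(directory, file_count=4, block_size=1 << 20, blocks_per_file=8):
    """Write random files, read them back, return the paths that mismatched.

    All names and default sizes here are illustrative assumptions.
    """
    expected = {}
    for i in range(file_count):
        path = os.path.join(directory, "torture_%03d.bin" % i)
        digest = hashlib.sha256()
        with open(path, "wb") as f:
            for _ in range(blocks_per_file):
                block = os.urandom(block_size)
                digest.update(block)
                f.write(block)
            # Push the data out of Python's buffers and ask the kernel
            # to write it to the device before moving on.
            f.flush()
            os.fsync(f.fileno())
        expected[path] = digest.hexdigest()

    mismatches = []
    for path, want in expected.items():
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in block_size chunks until read() returns b"" at EOF.
            for block in iter(lambda: f.read(block_size), b""):
                digest.update(block)
        if digest.hexdigest() != want:
            mismatches.append(path)
        os.remove(path)
    return mismatches
```

Run in a loop against a directory on the drive under test, something
like torture_pass("/var/tmp/torture") (a hypothetical path), and treat
any non-empty result, or a hang, as a failure.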

So anyway... is anybody else trying ReiserFS, and has it been stable so far?
It's a little troubling that, according to the web page, reiserfsck is
not working very well right now, so if I do get some corruption, I'll be SOL.
At least the weather data is no great loss.  I got this relatively large
IDE drive with the idea of using it as my main NFS server, but so far
I can't trust it enough for that.  (17 gigs... ha, it was big when I
got it, but now a 30 gig costs less than it did.  Geez.)

The other experiment ongoing with this machine is tv-watching, via my
Hauppauge TV card.  (I had TiVo envy, what can I say.)  So I might try 
recording video on that drive too; that ought to stress it a bit.  
Ironically, when it crashed early last week, xawtv was running at the 
time, and the TV picture kept working all week.  I noticed I couldn't
change channels and suspected the batteries in the wireless keyboard, 
until I noticed I couldn't ping the box either.  So I guess the TV 
operation really is CPU independent - just two PCI cards talking to 
each other.  Neato.

My dual-Pentium has developed a habit of hanging now and then too.
My dad's gateway dials in via PPP, and for a couple of weeks it had been
hosed because of a power outage; I went and fixed it Saturday - just 
had to walk it through fsck errors (fix this?  uh yeah, what else would 
I do!?!?) and changed the init script so hopefully next time it won't 
ask dumb questions and will just fix the filesystem.  If ReiserFS proves
stable enough, that machine's going to get it eventually.  Anyway...
while my dad's gateway was down, my dual-Pentium gateway machine had
a perfect uptime record.  And when I went to fix Dad's gateway, it 
managed to get one PPP connection, and then my machine hung again.  So
whatever it is, it seems to be PPP-related.  How odd.

-- 
  _______                   Shawn T. Rutledge / KB7PWD  ecloud@bigfoot.com
 (_  | |_)          http://www.bigfoot.com/~ecloud  kb7pwd@kb7pwd.ampr.org
 __) | | \________________________________________________________________
Get money for spare CPU cycles at http://www.ProcessTree.com/?sponsor=5903