Frustrated - Weird problem

Simon Chatfield simon at thechatfieldgroup.com
Thu Sep 2 20:03:33 MST 2010


Ok, I've got a doozy of an issue which has happened twice this week and 
is absolutely crushing to my clients who are in busy season right about 
now. Here's the issue...

I have a beefy linux database server which runs both postgres and mysql. 
We just recently loaded mysql and have been putting it under significant load.

Apparently at random, twice this week (Monday and this evening), it 
appears to take the network down, save for a single machine which we are 
still able to ssh into. There are 6 other boxes which we cannot ssh into 
when this occurs. The link lights do still show activity on 
the network. So far, the only fix has been to hard reboot 
this specific server and as soon as it goes down, we can access the 
other boxes via ssh and they start working again. When the box comes 
back up, we can then ssh into that machine and everything is good (until 
it happens again, that is). After the reboot, there isn't much in the 
logs, but I do see log entries for the tech unplugging the computer 
from the switch and plugging it back in PRIOR to the reboot, so the link 
change was detected and logged even though the box was not responding 
to ssh.

These machines are hosted down at i/o, so every hard reboot costs us 
significant time while we wait for a tech to handle it.

Has anyone ever heard of a single linux box bringing down 'most' of a 
network, with the other boxes becoming accessible again once it is rebooted?

My client is at his wits' end, and I don't blame him. However, I'm not 
even sure what kind of problem this is. Hardware on that box? System 
configuration? A bad switch?

Looking for ideas at least, and if someone has time and ability, I'd 
love to have someone on-site to help debug and fix this issue...

Thanks everyone!

-- 
Simon Chatfield



More information about the PLUG-discuss mailing list