Ok, I've got a doozy of an issue which has happened twice this week and
is absolutely crushing to my clients who are in busy season right about
now. Here's the issue...
I have a beefy linux database server which runs both postgres and mysql.
We just recently loaded mysql and are putting it under significant load.
Apparently at random, twice this week (Monday and this evening), it
appears to take the network down, save for a single machine which we are
still able to ssh into. There are 6 other boxes which we cannot ssh into
when this occurs. The link lights do still appear to show activity on
the network. The only fix so far has been to hard reboot
this specific server and as soon as it goes down, we can access the
other boxes via ssh and they start working again. When the box comes
back up, we can then ssh into that machine and everything is good (until
it happens again, that is). After the reboot, there isn't much in the
logs, but I do see the log entry from the tech unplugging the computer
from the switch and plugging it back in PRIOR to the reboot, so the
network link was detected and logged even though the box was not
responding to ssh.
These machines are hosted down at i/o, so every hard reboot costs us
significant time while we wait for a tech to handle it.
Has anyone ever heard of a single linux box bringing down 'most' of a
network, where rebooting it makes the other boxes accessible again?
My client is at his wits' end, and I don't blame him. However, I'm not
even sure what kind of problem this is. Hardware on that box? System
configuration? A bad switch?
Looking for ideas at least, and if someone has time and ability, I'd
love to have someone on-site to help debug and fix this issue...
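
In case it helps frame the debugging, here is a rough sketch of a
reachability logger that could run on the one machine we can still ssh
into, so there would be timestamps showing exactly when each box stops
answering. This is only an illustration, not something that is running
today: the addresses in HOSTS are placeholders, and the function names
are made up for the example. It only tests whether a TCP connection to
port 22 succeeds, which is a cruder check than a full ssh login.

```python
import socket
import time
from datetime import datetime

# Placeholder addresses (assumption): replace with the real boxes.
HOSTS = ["192.0.2.10", "192.0.2.11"]


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def poll_once(hosts, port=22, timeout=3.0):
    """Check every host once and return a {host: reachable} mapping."""
    return {host: ssh_port_open(host, port, timeout) for host in hosts}


def log_forever(hosts, interval=30):
    """Print one timestamped status line per host every `interval` seconds."""
    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        for host, up in poll_once(hosts).items():
            print(f"{stamp} {host} {'up' if up else 'DOWN'}", flush=True)
        time.sleep(interval)
```

Calling log_forever(HOSTS) with the output redirected to a file would
give a timeline to compare against the database server's own logs the
next time this happens.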
Thanks everyone!
--
Simon Chatfield
---------------------------------------------------
PLUG-discuss mailing list -
PLUG-discuss@lists.plug.phoenix.az.us
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss