Frustrated - Weird problem

Ed plug at 0x1b.com
Wed Sep 8 14:18:40 MST 2010


On Thu, Sep 2, 2010 at 8:03 PM, Simon Chatfield
<simon at thechatfieldgroup.com> wrote:
>
> Ok, I've got a doozy of an issue which has happened twice this week and is
> absolutely crushing to my clients who are in busy season right about now.
> Here's the issue...
>
> I have a beefy linux database server which runs both postgres and mysql. We
> just recently loaded mysql and are putting it under significant load.
>
> Apparently at random, twice this week (Monday and this evening), it appears to
> take the network down save for a single machine which we are still able to
> ssh into. There are 6 other boxes which we cannot ssh into when this occurs.
> Link light activity does appear to still be active on the network. The
> method for solving the problem has been to hard reboot this specific server
> and as soon as it goes down, we can access the other boxes via ssh and they
> start working again. When the box comes back up, we can then ssh into that
> machine and everything is good (until it happens again that is). After the
> reboot, there isn't much in the logs, but I see the log entry for the tech
> unplugging and plugging in the computer from the switch PRIOR to the reboot
> so the network link was detected and logged even though it was not
> responding to ssh.
>
> These machines are hosted down at i/o, so a hard reboot costs us
> significant time while we wait for a tech to handle it.
>
> Has anyone ever heard of a single linux box bringing down 'most' of a
> network? then reboot and the other boxes are then accessible?
>
> My client is at his wits' end, and I don't blame him. However, I'm not even
> sure what kind of problem this is. hardware on that box? system
> configuration? a bad switch?
>
> Looking for ideas at least, and if someone has time and ability, I'd love to
> have someone on-site to help debug and fix this issue...
>
> Thanks everyone!
>
> --
> Simon Chatfield
>

Are the 6 machines actually crashing, or are they losing their routing
tables? If the systems just got un-networked, see if/how the routing
tables change. OTOH, do you have avahi running? Look for 169.254/16 IP
addresses in the wrong place.
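
One way to do that check, as a rough sketch only: save the interface and
routing output from each box (for example from `ip -4 addr show` and
`ip route show`) and scan the text for anything inside 169.254.0.0/16.
The function name `find_link_local` and the regex are made up for this
illustration; nothing here is specific to avahi or to the setup described
above.

```python
import ipaddress
import re

# The IPv4 link-local block that zeroconf/auto-IP tools hand out.
LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")

# Matches dotted-quad IPv4 addresses, optionally followed by a /prefix.
IPV4_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})(?:/\d{1,2})?\b")


def find_link_local(text):
    """Return (line_number, address, line) for each 169.254/16 address in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IPV4_RE.finditer(line):
            try:
                addr = ipaddress.ip_address(match.group(1))
            except ValueError:
                continue  # looked like an address but is not valid, e.g. 999.1.1.1
            if addr in LINK_LOCAL:
                hits.append((lineno, str(addr), line.strip()))
    return hits
```

Running it over output captured before and during an outage, then comparing
the two results, would show whether link-local addresses or routes appear
only when the network goes away. A hit on its own does not prove anything;
it just points at which interface or route to look at next.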


More information about the PLUG-discuss mailing list