Re: Frustrated - Weird problem

Author: Ed
Date:  
To: Main PLUG discussion list
Subject: Re: Frustrated - Weird problem
On Thu, Sep 2, 2010 at 8:03 PM, Simon Chatfield wrote:
>
> Ok, I've got a doozy of an issue which has happened twice this week and is
> absolutely crushing to my clients who are in busy season right about now.
> Here's the issue...
>
> I have a beefy linux database server which runs both postgres and mysql. We
> just recently loaded mysql and are putting it under significant load.
>
> Apparently at random, twice this week (Monday and this evening), it appears to
> take the network down, save for a single machine which we are still able to
> ssh into. There are 6 other boxes which we cannot ssh into when this occurs.
> Link light activity does appear to still be active on the network. The
> method for solving the problem has been to hard reboot this specific server
> and as soon as it goes down, we can access the other boxes via ssh and they
> start working again. When the box comes back up, we can then ssh into that
> machine and everything is good (until it happens again that is). After the
> reboot, there isn't much in the logs, but I see the log entry for the tech
> unplugging and plugging in the computer from the switch PRIOR to the reboot
> so the network link was detected and logged even though it was not
> responding to ssh.
>
> These machines are hosted down at i/o, so a hard reboot costs us
> significant time while we get a tech to handle it.
>
> Has anyone ever heard of a single linux box bringing down 'most' of a
> network, with the other boxes becoming accessible again once it is rebooted?
>
> My client is at his wits' end, and I don't blame him. However, I'm not even
> sure what kind of problem this is. hardware on that box? system
> configuration? a bad switch?
>
> Looking for ideas at least, and if someone has time and ability, I'd love to
> have someone on-site to help debug and fix this issue...
>
> Thanks everyone!
>
> --
> Simon Chatfield
>


Are the 6 machines actually crashing, or are they losing their routing
tables? If the systems just got un-networked, see if/how the routing
tables change. OTOH, do you have avahi running? Look for 169.254/16 IP
addresses in the wrong place.
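
If you would rather not eyeball the output for 169.254/16 addresses, here
is a minimal Python sketch of that check. It assumes you have captured the
text output of "ip -4 addr show" and "ip route show" (or the ifconfig/route
equivalents) on one of the affected boxes. The function name
find_link_local is made up for illustration, and it only flags the
addresses; whether one is "in the wrong place" is still your call.

```python
import ipaddress
import re

# IPv4 link-local (zeroconf) range, the one avahi-autoipd hands out.
LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")

# Loose match for dotted quads; ipaddress does the real validation.
DOTTED_QUAD = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def find_link_local(text):
    """Return (line_number, address) pairs for each 169.254/16 address in text.

    `text` is assumed to be captured command output, e.g. from
    `ip -4 addr show` or `ip route show`.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for candidate in DOTTED_QUAD.findall(line):
            try:
                addr = ipaddress.ip_address(candidate)
            except ValueError:
                # Looked like an address but is not one (e.g. 999.1.1.1).
                continue
            if addr in LINK_LOCAL:
                hits.append((lineno, candidate))
    return hits
```

Capture the output once while things are healthy and once during the
outage, run both through the function, and compare: a 169.254.x.x address
or route that only shows up in the second capture is worth chasing.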
---------------------------------------------------
PLUG-discuss mailing list -
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss