Frustrated - Weird problem

Dan Dubovik dandubo at gmail.com
Fri Sep 3 17:32:16 MST 2010


I suspect the answer may come from determining what resources the 6
servers that go down have in common that the one that remains up does not.

As others have mentioned, when I've seen this it has been because a
single server was flooding the switch, leaving no bandwidth for any
other services.

Do you have access to any of the sar data by chance?  Or can the tech
that reboots it archive the output of dmesg before the reboot?
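
If sar isn't collecting on that box, something along these lines could
stand in for it. This is only a rough sketch, assuming Linux's
/proc/net/dev counters and Python 3; the log path and the 10-second
interval are made-up values, and the function names are just for
illustration. It appends per-interface byte and packet rates to a file
on disk, so the numbers from just before a hang survive the hard reboot.

```python
import time


def parse_net_dev(text):
    """Parse /proc/net/dev text into
    {iface: (rx_bytes, rx_packets, tx_bytes, tx_packets)}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 10:
            continue
        # Receive columns start at index 0, transmit columns at index 8.
        counters[name.strip()] = (
            int(fields[0]), int(fields[1]), int(fields[8]), int(fields[9])
        )
    return counters


def per_second(before, after, seconds):
    """Turn two counter snapshots into per-second rates per interface."""
    rates = {}
    for iface, new in after.items():
        old = before.get(iface)
        if old is None:
            continue  # interface appeared between samples
        rates[iface] = tuple((n - o) / seconds for n, o in zip(new, old))
    return rates


def read_counters(path="/proc/net/dev"):
    with open(path) as f:
        return parse_net_dev(f.read())


if __name__ == "__main__":
    interval = 10                      # seconds between samples (assumption)
    log_path = "/var/tmp/netrate.log"  # hypothetical location
    previous = read_counters()
    while True:
        time.sleep(interval)
        current = read_counters()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(log_path, "a") as log:
            rates = per_second(previous, current, interval)
            for iface, (rx_b, rx_p, tx_b, tx_p) in sorted(rates.items()):
                log.write(
                    "{} {} rx_B/s={:.0f} rx_pkt/s={:.0f} "
                    "tx_B/s={:.0f} tx_pkt/s={:.0f}\n".format(
                        stamp, iface, rx_b, rx_p, tx_b, tx_p
                    )
                )
        previous = current
```

A sudden jump in tx_pkt/s on the database server right before things go
quiet would point at that box flooding the switch; a jump in rx_pkt/s on
every machine at once would look more like a loop or broadcast storm.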

-- Dan.

On Thu, Sep 2, 2010 at 9:11 PM, Brian Cluff <brian at snaptek.com> wrote:
> When I've seen this in the past it's been caused by a loop in the network,
> and it takes a while for the broadcast traffic, mostly DHCP in my case, to
> build up to the point that there is nothing left of the network.  In my
> case it was usually caused by someone bringing in a rogue wireless access
> point and plugging multiple cables into it.
>
> Try loading something like iptraf and see if you see a ton of odd traffic
> from all over the place.
>
> Brian Cluff
>
> On 09/02/2010 08:03 PM, Simon Chatfield wrote:
>>
>> Ok, I've got a doozy of an issue which has happened twice this week and
>> is absolutely crushing to my clients who are in busy season right about
>> now. Here's the issue...
>>
>> I have a beefy linux database server which runs both postgres and mysql.
>> We just recently loaded mysql and are putting it under significant load.
>>
>> Apparently at random, twice this week (Monday and this evening) it
>> appears to take the network down save for a single machine which we are
>> still able to ssh into. There are 6 other boxes which we cannot ssh into
>> when this occurs. Link light activity does appear to still be active on
>> the network. The method for solving the problem has been to hard reboot
>> this specific server and as soon as it goes down, we can access the
>> other boxes via ssh and they start working again. When the box comes
>> back up, we can then ssh into that machine and everything is good (until
>> it happens again that is). After the reboot, there isn't much in the
>> logs, but I see the log entry for the tech unplugging and plugging in
>> the computer from the switch PRIOR to the reboot so the network link was
>> detected and logged even though it was not responding to ssh.
>>
>> These machines are hosted down at i/o, so a hard reboot costs us
>> significant time while we get a tech to handle it.
>>
>> Has anyone ever heard of a single linux box bringing down 'most' of a
>> network, with the other boxes becoming accessible again once it reboots?
>>
>> My client is at his wits' end, and I don't blame him. However, I'm not
>> even sure what kind of problem this is. Hardware on that box? System
>> configuration? A bad switch?
>>
>> Looking for ideas at least, and if someone has time and ability, I'd
>> love to have someone on-site to help debug and fix this issue...
>>
>> Thanks everyone!
>>
>

