Frustrated - Weird problem

Bryan O'Neal Bryan.ONeal at TheONealAndAssociates.com
Mon Sep 6 11:54:51 MST 2010


I too recommend sniffing the network traffic on multiple boxes. If you
find a build-up of network traffic, you can be fairly confident that
something is looping back into the switch. If you have a managed
switch, set up Cacti or Zabbix to monitor it. That should help you
track down the rebroadcasts.
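
As a rough starting point (the interface name, switch address, and SNMP
community string below are placeholders, not from this thread), something
like this can quantify broadcast traffic on a host and pull port counters
off a managed switch:

    # Count broadcast/multicast frames seen on eth0 in 10 seconds;
    # a storm will show thousands where dozens would be normal.
    timeout 10 tcpdump -ni eth0 'broadcast or multicast' 2>/dev/null | wc -l

    # Walk the per-port traffic counters on the managed switch via SNMP
    # (Cacti/Zabbix graph these same counters over time).
    snmpwalk -v2c -c public 192.168.1.2 IF-MIB::ifInOctets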

On Fri, Sep 3, 2010 at 5:32 PM, Dan Dubovik <dandubo at gmail.com> wrote:
> I suspect the answer may come from determining what resources the 6
> servers that go down have in common that the one that remains up does not.
>
> As others have mentioned, when I've seen this, it has been due to the
> switch being flooded by a single server, leaving no bandwidth for any
> other services.
>
> Do you have access to any of the sar data by chance?  Or can the tech
> that reboots it archive the output of dmesg before the reboot?
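>
> If sysstat is installed, something like this should pull the historical
> interface stats, and a one-liner in the reboot procedure can preserve
> dmesg (the file paths here are just examples):
>
>     # Review per-interface network stats from the day of the outage
>     sar -n DEV -f /var/log/sa/sa03
>
>     # Before the hard reboot, save the kernel ring buffer
>     dmesg > /root/dmesg-$(date +%Y%m%d-%H%M).log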
>
> -- Dan.
>
> On Thu, Sep 2, 2010 at 9:11 PM, Brian Cluff <brian at snaptek.com> wrote:
>> When I've seen this in the past it's been caused by a loop in the network,
>> and it takes a while for the broadcast traffic, mostly DHCP in my case, to
>> build up to the point that there is nothing left of the network.  In my
>> case it was usually caused by someone bringing in a rogue wireless access
>> point and plugging multiple cables into it.
>>
>> Try loading something like iptraf and see if you see a ton of odd traffic
>> from all over the place.
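>>
>> iptraf is normally interactive, but if memory serves it can also log in
>> the background; a rough invocation (the interface name is a guess) would
>> be:
>>
>>     # Watch IP traffic on eth0 interactively
>>     iptraf -i eth0
>>
>>     # Or log detailed eth0 stats for 10 minutes in the background
>>     iptraf -d eth0 -B -t 10 -L /var/log/iptraf-eth0.log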
>>
>> Brian Cluff
>>
>> On 09/02/2010 08:03 PM, Simon Chatfield wrote:
>>>
>>> Ok, I've got a doozy of an issue which has happened twice this week and
>>> is absolutely crushing to my clients who are in busy season right about
>>> now. Here's the issue...
>>>
>>> I have a beefy linux database server which runs both postgres and mysql.
>>> We just recently installed mysql and are putting it under significant load.
>>>
>>> Apparently at random, twice this week (Monday and this evening) it
>>> appears to take the network down, save for a single machine which we are
>>> still able to ssh into. There are 6 other boxes which we cannot ssh into
>>> when this occurs. Link light activity does appear to still be active on
>>> the network. The method for solving the problem has been to hard reboot
>>> this specific server; as soon as it goes down, we can access the
>>> other boxes via ssh and they start working again. When the box comes
>>> back up, we can then ssh into that machine and everything is good (until
>>> it happens again, that is). After the reboot, there isn't much in the
>>> logs, but I do see the log entry for the tech unplugging the computer
>>> from the switch and plugging it back in PRIOR to the reboot, so the
>>> network link was detected and logged even though the machine was not
>>> responding to ssh.
>>>
>>> These machines are hosted down at i/o, so a hard reboot costs us
>>> significant time getting a tech to handle it.
>>>
>>> Has anyone ever heard of a single linux box bringing down 'most' of a
>>> network, where rebooting it makes the other boxes accessible again?
>>>
>>> My client is at his wits' end, and I don't blame him. However, I'm not
>>> even sure what kind of problem this is. hardware on that box? system
>>> configuration? a bad switch?
>>>
>>> Looking for ideas at least, and if someone has time and ability, I'd
>>> love to have someone on-site to help debug and fix this issue...
>>>
>>> Thanks everyone!
>>>

