I too recommend sniffing the network traffic on multiple boxes. If you find a
buildup of broadcast traffic you can be fairly confident that something is
looping back into the switch. If you have a managed switch, set up Cacti or
Zabbix to monitor it; that should help you track down the rebroadcasts. A
rough sketch of the commands I'd start with is at the bottom of this message,
below the quoted thread.

On Fri, Sep 3, 2010 at 5:32 PM, Dan Dubovik wrote:

> I suspect the answer may come from determining what resources the 6 servers
> that go down have in common that the one that stays up does not.
>
> As others have mentioned, when I've seen this it is due to the switch being
> flooded by a single server, leaving no bandwidth for any other services.
>
> Do you have access to any of the sar data by chance?  Or can the tech who
> reboots it archive the output of dmesg before the reboot?
>
> -- Dan.
>
> On Thu, Sep 2, 2010 at 9:11 PM, Brian Cluff wrote:
>
>> When I've seen this in the past it's been caused by a loop in the network,
>> and it takes a while for the broadcast traffic, mostly DHCP in my case, to
>> build up to the point that there is nothing left of the network. In my
>> case it was usually caused by someone bringing in a rogue wireless access
>> point and plugging multiple cables into it.
>>
>> Try loading something like iptraf and see if you see a ton of odd traffic
>> from all over the place.
>>
>> Brian Cluff
>>
>> On 09/02/2010 08:03 PM, Simon Chatfield wrote:
>>
>>> Ok, I've got a doozy of an issue which has happened twice this week and
>>> is absolutely crushing to my clients, who are in their busy season right
>>> about now. Here's the issue...
>>>
>>> I have a beefy Linux database server which runs both PostgreSQL and
>>> MySQL. We just recently loaded MySQL and have been putting it under
>>> significant load.
>>>
>>> Apparently at random, twice this week (Monday and this evening), it
>>> appears to take the network down save for a single machine which we are
>>> still able to ssh into. There are 6 other boxes which we cannot ssh into
>>> when this occurs. Link-light activity does still appear on the network.
>>> The only fix so far has been to hard-reboot this specific server; as soon
>>> as it goes down, we can reach the other boxes via ssh and they start
>>> working again. When the box comes back up, we can then ssh into that
>>> machine and everything is good (until it happens again, that is). After
>>> the reboot there isn't much in the logs, but I do see the log entry for
>>> the tech unplugging and replugging the computer at the switch PRIOR to
>>> the reboot, so the network link was detected and logged even though the
>>> box was not responding to ssh.
>>>
>>> These machines are hosted down at i/o, so a hard reboot costs us
>>> significant time waiting for a tech to handle it.
>>>
>>> Has anyone ever heard of a single Linux box bringing down 'most' of a
>>> network, with the other boxes becoming accessible again after it is
>>> rebooted?
>>>
>>> My client is at his wits' end, and I don't blame him. However, I'm not
>>> even sure what kind of problem this is. Hardware on that box? System
>>> configuration? A bad switch?
>>>
>>> Looking for ideas at least, and if someone has the time and ability, I'd
>>> love to have someone on-site to help debug and fix this issue...
>>>
>>> Thanks everyone!
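
To make the "sniff on multiple boxes" suggestion concrete, here is a rough
sketch of what I would run on a couple of the affected servers and on the
database box itself. I'm assuming the NIC is eth0 and picking an arbitrary
path for the dmesg capture; adjust for your setup. These are just the stock
tcpdump, sysstat, and iproute2 tools, nothing specific to your environment:

  # Grab a few thousand broadcast/multicast frames; during a loop or storm
  # this finishes in a second or two instead of minutes.
  tcpdump -ni eth0 -c 5000 broadcast or multicast

  # Per-interface packet and throughput rates, one-second samples.
  sar -n DEV 1 10

  # Interface counters, including errors and drops.
  ip -s link show eth0

  # Live view of who is talking to whom (iptraf-ng on newer distros).
  iptraf

  # Before the next hard reboot, have the tech save the kernel log:
  dmesg > /var/tmp/dmesg-before-reboot.txt

If the tcpdump capture fills almost instantly with the same DHCP or ARP
frames over and over, that points at a loop or a rebroadcasting device rather
than at the database server itself.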
---------------------------------------------------
PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss