Some more pieces to the puzzle...

I ran the same ping test last night (laptop to server), but stopped the following services on the server:
apache2
exim4
mediatomb
mysql
nfs-kernel-server
nfs-common
openvpnas
cups
ntp
rpcbind

And there were no packets lost!

2802 packets transmitted, 2802 received, 0% packet loss, time 28010243ms
rtt min/avg/max/mdev = 0.063/0.157/0.319/0.033 ms

This is all that was running:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   2036   736 ?        Ss   Jun25   0:01 init [2] 
root         2  0.0  0.0      0     0 ?        S    Jun25   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Jun25   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S    Jun25   0:00 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S    Jun25   0:00 [watchdog/0]
root         6  0.0  0.0      0     0 ?        S    Jun25   0:00 [events/0]
root         7  0.0  0.0      0     0 ?        S    Jun25   0:00 [cpuset]
root         8  0.0  0.0      0     0 ?        S    Jun25   0:00 [khelper]
root         9  0.0  0.0      0     0 ?        S    Jun25   0:00 [netns]
root        10  0.0  0.0      0     0 ?        S    Jun25   0:00 [async/mgr]
root        11  0.0  0.0      0     0 ?        S    Jun25   0:00 [pm]
root        12  0.0  0.0      0     0 ?        S    Jun25   0:00 [sync_supers]
root        13  0.0  0.0      0     0 ?        S    Jun25   0:00 [bdi-default]
root        14  0.0  0.0      0     0 ?        S    Jun25   0:00 [kintegrityd/0]
root        15  0.0  0.0      0     0 ?        S    Jun25   0:00 [kblockd/0]
root        16  0.0  0.0      0     0 ?        S    Jun25   0:00 [kacpid]
root        17  0.0  0.0      0     0 ?        S    Jun25   0:00 [kacpi_notify]
root        18  0.0  0.0      0     0 ?        S    Jun25   0:00 [kacpi_hotplug]
root        19  0.0  0.0      0     0 ?        S    Jun25   0:00 [kseriod]
root        21  0.0  0.0      0     0 ?        S    Jun25   0:00 [kondemand/0]
root        22  0.0  0.0      0     0 ?        S    Jun25   0:00 [khungtaskd]
root        23  0.0  0.0      0     0 ?        S    Jun25   0:00 [kswapd0]
root        24  0.0  0.0      0     0 ?        SN   Jun25   0:00 [ksmd]
root        25  0.0  0.0      0     0 ?        S    Jun25   0:00 [aio/0]
root        26  0.0  0.0      0     0 ?        S    Jun25   0:00 [crypto/0]
root       154  0.0  0.0      0     0 ?        S    Jun25   0:00 [ksuspend_usbd]
root       155  0.0  0.0      0     0 ?        S    Jun25   0:00 [khubd]
root       157  0.0  0.0      0     0 ?        S    Jun25   0:00 [ata/0]

So, perhaps I don't have a hardware problem, but a software problem?

Mark

On Tue, Jun 26, 2012 at 9:02 PM, Stephen <cryptworks@gmail.com> wrote:

What about pings from the server? Also, my paranoia about this would have me checking the ARP tables to see if the IP address is getting mis-somethinged. Also see if you can make a task that will write to a file once every 10 sec and see if the server is falling asleep or if it is network related. It may be the NIC or the switch, for example.
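
A minimal sketch of that kind of heartbeat task, in case it helps. The function name, the file path and the 10-second interval are all made up for illustration; nothing here comes from the server in question. The idea is that if the file shows a gap at the same time the pings stall, the box itself paused, and if the file has no gap, the network (NIC, cable, switch) is the more likely suspect.

```python
import time
from datetime import datetime


def write_heartbeats(path, interval_s=10.0, count=None):
    """Append one ISO-format timestamp per interval to `path`.

    Runs until interrupted when `count` is None; otherwise stops after
    `count` lines. The file is reopened for every write so each timestamp
    is handed to the OS before the next sleep.
    """
    written = 0
    while count is None or written < count:
        with open(path, "a") as fh:
            fh.write(datetime.now().isoformat() + "\n")
        written += 1
        if count is None or written < count:
            time.sleep(interval_s)
    return written


# Hypothetical usage on the server (path is an assumption):
#   write_heartbeats("/tmp/heartbeat.log", interval_s=10.0)
```

If the whole machine freezes, the sleep simply returns late, so the stall shows up as two consecutive timestamps that are much more than one interval apart.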

On Jun 26, 2012 7:42 PM, "Mark Phillips" <mark@phillipsmarketing.biz> wrote:
My ping test is not what I expected but still shows a problem.

I set up the test to ping (64 bytes, ttl=64) the problem server every 10 seconds from my laptop. Both are plugged in and on the same subnet. The boxes are about 5 feet apart. Here are the results:

3434 packets transmitted, 3307 received, 3% packet loss, time 34336321ms
rtt min/avg/max/mdev = 0.100/0.231/4.332/0.275 ms

I found 38 instances where the time stamps for the pings "hiccuped" and there was a delay. Each hiccup lasted (min/avg/max) 10/44/70 seconds. The time between hiccups was (min/avg/max) 1:20/14:29/29:00 minutes.
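
For anyone who wants to repeat this kind of gap hunt, here is a rough sketch of one way to do it. It assumes the ping log has a Unix timestamp in square brackets at the start of each reply line (the format `ping -D` prints); that format, the function names and the 1-second slack are assumptions for illustration, not a description of how the numbers above were produced.

```python
import re

# Leading "[1340766123.456789]" style timestamp at the start of a line.
STAMP = re.compile(r"^\[(\d+\.\d+)\]")


def find_hiccups(lines, interval_s=10.0, slack_s=1.0):
    """Return (start_time, gap_seconds) for every pause between
    consecutive timestamped lines longer than interval_s + slack_s."""
    stamps = []
    for line in lines:
        match = STAMP.match(line)
        if match:
            stamps.append(float(match.group(1)))
    hiccups = []
    for prev, cur in zip(stamps, stamps[1:]):
        gap = cur - prev
        if gap > interval_s + slack_s:
            hiccups.append((prev, gap))
    return hiccups


def summarize(values):
    """Return (min, avg, max) of a non-empty list of numbers."""
    return min(values), sum(values) / len(values), max(values)
```

Only replies that actually arrive produce a line, so lost packets show up as a longer-than-expected gap between two neighbouring timestamps. Feeding the gap lengths into `summarize` gives the same kind of min/avg/max figures quoted above.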

I grep'd the log files for messages around these 38 incidents, but did not find any messages in any of the logs. However, this is not a good test, so I just trawled the logs and didn't find anything significant.

Do these numbers strike a chord with anyone?

I will check the caps on the MB later this week.

I looked in syslog, and could not find any correlation with the 38 hiccups. However, what do these two cron jobs do, since they run quite frequently:
CMD (   cd / && run-parts --report /etc/cron.hourly)
CMD (  [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -delete)

A search through the log files for "error" does not return anything interesting.

Since apache and a few other apps were running on the server, I will run the ping test again tonight after I kill everything on the server.

Thanks!

Mark

On Tue, Jun 26, 2012 at 9:35 AM, Carruth, Rusty <Rusty.Carruth@smartstoragesys.com> wrote:

Curious how your test turned out.

 

You may also want to run an iostat to a file and see if that correlates to the slow responses.

 

However, that ‘bulging capacitor’ thing others have mentioned sounds like a pretty convincing coincidence, as it were….

 

(I will say that USUALLY I’d agree with JD: I/O bound or low RAM (and thus I/O bound on swap space) are the only things I’ve seen that cause unresponsiveness. I’ve never seen an overheat slow a machine down; usually it just dies suddenly. I probably get fast overheating and not slow increases in heat levels :) )

 

OH!  WAIT!  I just remembered another event - and it WON’T show up in normal performance logs.  If ‘you’ send a command to a disk drive, and it goes busy for a long time, your system can become totally locked until the timeout happens and the kernel gives up.  (If that happens, there SHOULD be a timeout recorded in the syslog or /var/log/messages.  Check there for timeouts on disk drives or hard resets or such.)  (I know this because of where I work :) )  (Disk drives are supposed to acknowledge the command almost immediately.  It is almost always a bad thing when the drive takes the command but does not finish the initial command handshake sequence...  You might want to look at the S.M.A.R.T. attributes for your drives as well to see if any of them are showing ‘pre-fail’ conditions.)
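
A rough sketch of that log check, for what it's worth. The patterns below are only examples of the sort of wording the kernel tends to use for drive timeouts, link resets, I/O errors and hung tasks; the exact text varies by driver and kernel version, so treat the list (and the function name) as assumptions to adjust, not as a definitive filter.

```python
import re

# Illustrative patterns only; real kernel messages differ between
# drivers and versions, so extend or trim this list as needed.
PATTERNS = [
    r"ata\d+(?:\.\d+)?: .*(?:timeout|hard resetting link|exception Emask|failed command)",
    r"I/O error",
    r"blocked for more than \d+ seconds",
]
DISK_TROUBLE = re.compile("|".join(PATTERNS), re.IGNORECASE)


def disk_trouble_lines(lines):
    """Return the log lines that match any of the patterns above."""
    return [line.rstrip("\n") for line in lines if DISK_TROUBLE.search(line)]


# Hypothetical usage against the logs mentioned above:
#   with open("/var/log/syslog", errors="replace") as fh:
#       for hit in disk_trouble_lines(fh):
#           print(hit)
```

Comparing the timestamps on any hits with the times of the stalls would show whether the two line up.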

 

Rusty

 

From: plug-discuss-bounces@lists.plug.phoenix.az.us [mailto:plug-discuss-bounces@lists.plug.phoenix.az.us] On Behalf Of Mark Phillips
Sent: Monday, June 25, 2012 10:13 PM
To: Main PLUG discussion list
Subject: Re: Strange Server Behavior

 

Right now, the server is not doing anything but sitting there....

Tasks:  98 total,   1 running,  97 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1033780k total,   217560k used,   816220k free,     6220k buffers
Swap:  2019320k total,        0k used,  2019320k free,    94056k cached

Plenty of swap, not very busy. It may be overheating, but I'm not sure why.

I am going to run a test tonight - ping every 10 seconds and time stamp the output into a file. Perhaps I will see gaps or unusually long response times and I can correlate that with the log files.

Mark

On Mon, Jun 25, 2012 at 10:09 PM, JD Austin <> wrote:

I've had servers that act like that. Usually they're overheating, completely I/O bound, or swapping due to low available memory.

 

On Mon, Jun 25, 2012 at 10:00 PM, Mark Phillips <mark@phillipsmarketing.biz> wrote:

Nope - everything just stops - ping waits for a response, web services just wait for the server, file transfers stop and wait.......as if time just stopped for the server, then starts again without any errors being evident.

Mark

 

On Mon, Jun 25, 2012 at 9:57 PM, Stephen < > wrote:

Can you access any other services hosted by the server during this time? Or even an extended ping?

On Jun 25, 2012 9:53 PM, "Mark Phillips" < > wrote:

I have a headless server running Linux version 2.6.32-5-686 (Debian 2.6.32-45) (dannf@debian.org) (gcc version 4.3.5 (Debian 4.3.5-4)) with no X or window manager. In the past couple of days I have noticed that when I ssh into the server it occasionally stops responding for a minute or two, then comes back as if nothing had happened. It is a random event, maybe once an hour. I cannot find anything in the logs: no error messages. There is nothing wrong with the machine where I initiated the ssh session, and the problem does not appear to be tied to ssh. The server completely stops responding, then recovers on its own.

How would I go about diagnosing this problem?

Thanks,


 


---------------------------------------------------
PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss

