Strange Server Behavior

Wed Jun 27 06:51:01 MST 2012

Some more pieces to the puzzle...

I ran the same ping test last night (laptop to server), but stopped the
following services on the server:
apache2
exim4
mediatomb
mysql
nfs-kernel-server
nfs-common
openvpnas
cups
ntp
rpcbind

And there were no packets lost!

2802 packets transmitted, 2802 received, 0% packet loss, time 28010243ms
rtt min/avg/max/mdev = 0.063/0.157/0.319/0.033 ms

This is all that was running:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   2036   736 ?        Ss   Jun25   0:01 init [2]
root         2  0.0  0.0      0     0 ?        S    Jun25   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Jun25   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S    Jun25   0:00 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S    Jun25   0:00 [watchdog/0]
root         6  0.0  0.0      0     0 ?        S    Jun25   0:00 [events/0]
root         7  0.0  0.0      0     0 ?        S    Jun25   0:00 [cpuset]
root         8  0.0  0.0      0     0 ?        S    Jun25   0:00 [khelper]
root         9  0.0  0.0      0     0 ?        S    Jun25   0:00 [netns]
root        10  0.0  0.0      0     0 ?        S    Jun25   0:00 [async/mgr]
root        11  0.0  0.0      0     0 ?        S    Jun25   0:00 [pm]
root        12  0.0  0.0      0     0 ?        S    Jun25   0:00 [sync_supers]
root        13  0.0  0.0      0     0 ?        S    Jun25   0:00 [bdi-default]
root        14  0.0  0.0      0     0 ?        S    Jun25   0:00 [kintegrityd/0]
root        15  0.0  0.0      0     0 ?        S    Jun25   0:00 [kblockd/0]
root        16  0.0  0.0      0     0 ?        S    Jun25   0:00 [kacpid]
root        17  0.0  0.0      0     0 ?        S    Jun25   0:00 [kacpi_notify]
root        18  0.0  0.0      0     0 ?        S    Jun25   0:00 [kacpi_hotplug]
root        19  0.0  0.0      0     0 ?        S    Jun25   0:00 [kseriod]
root        21  0.0  0.0      0     0 ?        S    Jun25   0:00 [kondemand/0]
root        22  0.0  0.0      0     0 ?        S    Jun25   0:00 [khungtaskd]
root        23  0.0  0.0      0     0 ?        S    Jun25   0:00 [kswapd0]
root        24  0.0  0.0      0     0 ?        SN   Jun25   0:00 [ksmd]
root        25  0.0  0.0      0     0 ?        S    Jun25   0:00 [aio/0]
root        26  0.0  0.0      0     0 ?        S    Jun25   0:00 [crypto/0]
root       154  0.0  0.0      0     0 ?        S    Jun25   0:00 [ksuspend_usbd]
root       155  0.0  0.0      0     0 ?        S    Jun25   0:00 [khubd]
root       157  0.0  0.0      0     0 ?        S    Jun25   0:00 [ata/0]

So, perhaps I don't have a hardware problem, but a software problem?

Mark

On Tue, Jun 26, 2012 at 9:02 PM, Stephen <cryptworks at gmail.com> wrote:

> What about pings from the server? Also, my paranoia about this would have
> me checking the ARP tables to see if the IP address is getting
> mis-somethinged. Also see if you can make a task that will write to a file
> once every 10 sec and see if the server is falling asleep or if it is
> network related.  It may be the NIC or the switch, for example.
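
For the "write to a file every so often" idea, something like this might do. It is only a rough sketch: the function name, the log path and the 10 second default are my own assumptions, not anything from the thread. If the whole box stalls, the wall-clock timestamps in the file should show a jump; if the file stays regular while pings drop, the network side looks more suspect.

```python
import time
from datetime import datetime


def write_heartbeats(path, interval_s=10.0, count=None, sleep=time.sleep):
    """Append one timestamp line to `path` every `interval_s` seconds.

    `count=None` means run until interrupted. `sleep` is injectable so the
    loop can be exercised without actually waiting.
    """
    written = 0
    while count is None or written < count:
        # Re-open the file on every beat so each line is flushed and closed
        # before the next sleep.
        with open(path, "a") as f:
            f.write(datetime.now().isoformat(timespec="seconds") + "\n")
        written += 1
        if count is None or written < count:
            sleep(interval_s)
    return written
```

Running `write_heartbeats("/tmp/heartbeat.log")` (hypothetical path) on the server overnight and then looking for consecutive lines more than about 10 seconds apart would show whether the machine itself paused.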
> On Jun 26, 2012 7:42 PM, "Mark Phillips" <mark at phillipsmarketing.biz>
> wrote:
>
>> My ping test is not what I expected but still shows a problem.
>>
>> I set up the test to ping (64 bytes, ttl=64) the problem server every 10
>> seconds from my laptop. Both are plugged in, on the same subnet. The boxes
>> are about 5 feet apart. Here are the results:
>>
>> 3434 packets transmitted, 3307 received, 3% packet loss, time 34336321ms
>> rtt min/avg/max/mdev = 0.100/0.231/4.332/0.275 ms
>>
>> I found 38 instances where the time stamps for the pings "hiccuped" and
>> there was a delay. Each hiccup lasted (min/avg/max) 10/44/70 seconds. The
>> time between hiccups was (min/avg/max) 1:20/14:29/29:00 minutes.
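
For anyone wanting to repeat this kind of tally, here is a minimal sketch of how the gaps could be pulled out of a list of reply timestamps. It is not the method actually used above; the function names, the 5 second slack, and the choice to count a hiccup as the delay beyond the expected 10 second interval are all assumptions.

```python
from datetime import datetime, timedelta


def find_hiccups(timestamps, interval_s=10.0, slack_s=5.0):
    """Return (previous_reply, next_reply, extra_seconds) for each long gap.

    A gap counts as a hiccup when two consecutive replies are more than
    interval_s + slack_s apart; extra_seconds is the delay beyond the
    expected interval.
    """
    hiccups = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = (cur - prev).total_seconds()
        if gap > interval_s + slack_s:
            hiccups.append((prev, cur, gap - interval_s))
    return hiccups


def min_avg_max(values):
    """Summarise a non-empty list of numbers as (min, avg, max)."""
    return min(values), sum(values) / len(values), max(values)
```

Feeding it the parsed time stamps from the ping log and passing the third element of each tuple to `min_avg_max` gives the same kind of min/avg/max summary.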
>>
>> I grep'd the log files for messages around these 38 incidents, but did
>> not find any messages in any of the logs. However, this is not a good test,
>> so I just trawled the logs and didn't find anything significant.
>>
>> Do these numbers strike a chord with anyone?
>>
>> I will check the caps on the MB later this week.
>>
>> I looked in syslog, and could not find any correlation with the 38
>> hiccups. However, what do these two cron jobs do, since they run quite
>> frequently:
>> CMD (   cd / && run-parts --report /etc/cron.hourly)
>> CMD (  [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find
>> /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -delete)
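
For what it's worth, the first entry just runs every executable script in /etc/cron.hourly. The second looks like the PHP session cleanup job that comes with Debian's php5 packages: it deletes regular files under /var/lib/php5 whose status change time is older than the number of minutes printed by /usr/lib/php5/maxlifetime. A rough Python sketch of which files that find would select follows. It only lists them instead of deleting, the function name is made up, and find's exact rounding of minutes is approximated here.

```python
import os
import stat
import time


def stale_session_files(directory, max_age_minutes, now=None):
    """List regular files whose ctime is more than max_age_minutes old.

    Roughly mirrors `find DIR -type f -cmin +N`, without the -delete.
    """
    now = time.time() if now is None else now
    stale = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)
            # -type f: skip symlinks, sockets and other non-regular files.
            if not stat.S_ISREG(st.st_mode):
                continue
            if now - st.st_ctime > max_age_minutes * 60:
                stale.append(path)
    return sorted(stale)
```

Neither job looks like something that should freeze a machine for a minute, but a large session directory on a struggling disk could make the find slow.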
>>
>> A search through the log files for "error" does not return anything
>> interesting.
>>
>> Since apache and a few other apps were running on the server, I will run
>> the ping test again tonight after I kill everything on the server.
>>
>> Thanks!
>>
>> Mark
>>
>> On Tue, Jun 26, 2012 at 9:35 AM, Carruth, Rusty <
>> Rusty.Carruth at smartstoragesys.com> wrote:
>>
>>> Curious how your test turned out.
>>>
>>> You may also want to run an iostat to a file and see if that correlates
>>> to the slow responses.
>>>
>>> However, that ‘bulging capacitor’ thing others have mentioned sounds
>>> like a pretty convincing coincidence, as it were….
>>>
>>> (I will say that USUALLY I’d agree with JD – being I/O bound or low on
>>> RAM (and thus I/O bound on swap space) are the only things I’ve seen that
>>> cause unresponsiveness. I’ve never seen an overheat slow it down; usually
>>> it just dies suddenly.  I probably get fast overheating and not slow
>>> increases in heat levels :) )
>>>
>>> OH!  WAIT!  I just remembered another event – and it WON’T show up in
>>> normal performance logs.  If ‘you’ send a command to a disk drive, and it
>>> goes busy for a long time, your system can become totally locked until the
>>> timeout happens and the kernel gives up.  (If that happens, there SHOULD be
>>> a timeout recorded in the syslog or /var/log/messages.  Check there for
>>> timeouts on disk drives or hard resets or such.)  (I know this because of
>>> where I work :) )  (Disk drives are supposed to acknowledge the command
>>> almost immediately.  It is almost always a bad thing when the drive takes
>>> the command but does not finish the initial command handshake sequence…
>>> You might want to look at the S.M.A.R.T. attributes for your drives as well
>>> to see if any of them are showing ‘pre-fail’ conditions.)
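
One way to do that log check without eyeballing everything is sketched below. It keeps only the log lines that mention a disk-ish keyword and fall within a couple of minutes of a known hiccup. The keyword list is my own guess at what to look for, not an authoritative list of kernel messages, and the function names and the 120 second window are assumptions too. It expects classic syslog lines that start with a stamp like "Jun 26 21:03:14" and needs the year supplied, since syslog leaves it out.

```python
from datetime import datetime, timedelta

# Assumed keywords; adjust to whatever your kernel and drivers actually log.
KEYWORDS = ("timeout", "timed out", "hard resetting link", "i/o error")


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp, or return None."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def suspicious_lines(lines, incidents, year, window_s=120):
    """Return keyword-matching lines logged within window_s of any incident."""
    window = timedelta(seconds=window_s)
    hits = []
    for line in lines:
        when = parse_syslog_time(line, year)
        if when is None:
            continue
        lowered = line.lower()
        if not any(k in lowered for k in KEYWORDS):
            continue
        if any(abs(when - t) <= window for t in incidents):
            hits.append(line.rstrip("\n"))
    return hits
```

Passing it the lines of /var/log/syslog plus the list of hiccup times from the ping log would narrow things down to entries worth reading by hand.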
>>>
>>> Rusty
>>>
>>> From: plug-discuss-bounces at lists.plug.phoenix.az.us [mailto:
>>> plug-discuss-bounces at lists.plug.phoenix.az.us] On Behalf Of Mark
>>> Phillips
>>> Sent: Monday, June 25, 2012 10:13 PM
>>> To: Main PLUG discussion list
>>> Subject: Re: Strange Server Behavior
>>>
>>> Right now, the server is not doing anything but sitting there....
>>>
>>> Tasks:  98 total,   1 running,  97 sleeping,   0 stopped,   0 zombie
>>> Cpu(s):  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,
>>> 0.0%st
>>> Mem:   1033780k total,   217560k used,   816220k free,     6220k buffers
>>> Swap:  2019320k total,        0k used,  2019320k free,    94056k cached
>>>
>>> Plenty of swap, not very busy. It may be overheating, but I'm not sure why.
>>>
>>> I am going to run a test tonight - ping every 10 seconds and time stamp
>>> the output into a file. Perhaps I will see gaps or unusually long response
>>> times and I can correlate that with the log files.
>>>
>>> Mark
>>>
>>> On Mon, Jun 25, 2012 at 10:09 PM, JD Austin <jd at twingeckos.com>
>>> wrote:
>>>
>>> I've had servers that act like that. Usually they're overheating,
>>> completely I/O bound, or swapping due to low available memory.
>>>
>>> On Mon, Jun 25, 2012 at 10:00 PM, Mark Phillips <
>>> mark at phillipsmarketing.biz> wrote:
>>>
>>> Nope - everything just stops - ping waits for a response, web services
>>> just wait for the server, file transfers stop and wait.......as if time
>>> just stopped for the server, then starts again without any errors being
>>> evident.
>>>
>>> Mark
>>>
>>> On Mon, Jun 25, 2012 at 9:57 PM, Stephen wrote:
>>>
>>> Can you access any other services hosted by the server during this
>>> time? Or even an extended ping?
>>>
>>> On Jun 25, 2012 9:53 PM, "Mark Phillips" wrote:
>>>
>>> I have a headless server running Linux version 2.6.32-5-686 (Debian
>>> 2.6.32-45) (dannf at debian.org) (gcc version 4.3.5 (Debian 4.3.5-4)) and
>>> no X or window manager, and I have noticed in the past couple of days that
>>> when I ssh into the server it occasionally stops responding for a minute or
>>> two, then comes back as if nothing had happened. It is a random event -
>>> maybe once an hour. I cannot find anything in the logs - no error messages.
>>> There is nothing wrong with the machine where I initiated the ssh session,
>>> and the problem is not limited to ssh. The server completely stops
>>> responding, then comes back as if nothing had happened.
>>>
>>> How would I go about diagnosing this problem?
>>>
>>> Thanks,
>>>
>>> ---------------------------------------------------
>>> PLUG-discuss mailing list - PLUG-discuss at lists.plug.phoenix.az.us
>>> To subscribe, unsubscribe, or to change your mail settings:
>>> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
>>>