From owner-freebsd-questions@FreeBSD.ORG  Wed Dec 29 15:47:42 2004
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id C5D8816A4CE
	for <freebsd-questions@freebsd.org>;
	Wed, 29 Dec 2004 15:47:42 +0000 (GMT)
Received: from kenmore.kozy-kabin.nl (fia148-72.dsl.hccnet.nl [62.251.72.148])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 864E343D2F
	for <freebsd-questions@freebsd.org>;
	Wed, 29 Dec 2004 15:47:41 +0000 (GMT)
	(envelope-from colin@kenmore.kozy-kabin.nl)
Received: from localhost (colin@localhost)
	by kenmore.kozy-kabin.nl (8.11.6p2/8.11.6) with ESMTP id iBTFlRD21614;
	Wed, 29 Dec 2004 16:47:31 +0100 (CET)
Date: Wed, 29 Dec 2004 16:47:27 +0100
From: "Colin J. Raven" <colin@kenmore.kozy-kabin.nl>
To: Rob <spamrefuse@yahoo.com>
In-Reply-To: <41D281B5.3050107@yahoo.com>
Message-ID: <Pine.NEB.4.61.9.0412291631340.6312@kenmore.kozy-kabin.nl>
References: <41D27378.7010103@yahoo.com>
	<Pine.NEB.4.61.9.0412291012130.6312@kenmore.kozy-kabin.nl>
	<41D281B5.3050107@yahoo.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
cc: FreeBSD <freebsd-questions@freebsd.org>
Subject: Re: 5.3 in diskless cluster: irregular reboots at 14:09 hr. ?!?!
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
X-List-Received-Date: Wed, 29 Dec 2004 15:47:43 -0000

On Dec 29, Rob launched this into the bitstream:

> Colin J. Raven wrote:
>> On Dec 29, Rob launched this into the bitstream:
>> 
>>> 
>>> I'm running 5.3-Stable on all PC's.
>>> 
>>> I have a master/router with 7 diskless slaves. One of the
>>> slaves shows irregular reboots, without a trace, not even
>>> a shutdown message in the logs.
>>> 
>>> Until now I have the following sudden reboots of one particular
>>> slave happen:
>>>        Nov. 16 14:09:41
>>>        Nov. 30 14:09:23
>>>        Dec. 28 14:09:34
>>> 
>>> Each is exactly at the same time; this is rather peculiar, isn't it?
>>> 
>>> Any idea what's going on here, or how to trace this problem?
>> 
>> 
>> What *else* is happening at (or immediately before) 14:09 on this machine?? 
>> For example is something rather intense occurring immediately beforehand? 
>> I'm thinking power supply failure when it gets loaded beyond a certain
>> point...so, pursuant to that is there maybe a big log grep happening
>> beforehand, or some other event that stresses components, thus consuming 
>> more power?
>
> Thank you Colin.
>
> What would be a good command to run, to find out how stressful the
> PC is right before the reboot? Is 'top' good enough? Or is there
> something better? 'ps auxw' for example?

That's a good question. I suspect there may be a wide spectrum of 
opinions on that one.

My own instinct would be to write the output of ps (with whatever
switches you like) to a file, *then* append top output to the same
file. Just to be obsessive about it, I'd also append
tail -[some number] /var/log/messages to that file and have cron send
it to you at some external address. One day's output prolly wouldn't
show you anything, but an accumulation of data *might* help you get to
grips with conditions that immediately precede the witching hour of
14:09.
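
A rough sketch of the collection step, not a tested recipe. It assumes
FreeBSD's top accepts -b (batch mode) and -d 1 (one display, then exit);
Linux procps top would need -b -n 1 instead. The snapshot function and
the PS_CMD, TOP_CMD, LOG_FILE and LOG_LINES variables are names made up
for illustration.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped snapshot to a file.
# All variable names below are illustrative assumptions and can be
# overridden from the environment.
: "${PS_CMD:=ps auxw}"
: "${TOP_CMD:=top -b -d 1}"        # assumed FreeBSD syntax: batch mode, one display
: "${LOG_FILE:=/var/log/messages}"
: "${LOG_LINES:=50}"

snapshot() {
    out=$1
    {
        # Header line so successive snapshots can be told apart.
        echo "===== snapshot $(date '+%Y-%m-%d %H:%M:%S') on $(uname -n) ====="
        echo "--- ps ---"
        $PS_CMD 2>&1
        echo "--- top ---"
        $TOP_CMD 2>&1
        echo "--- tail of $LOG_FILE ---"
        tail -n "$LOG_LINES" "$LOG_FILE" 2>&1
    } >> "$out"     # append, so the file accumulates one block per run
}
```

Calling snapshot /some/path/snapshots.txt once a day from a small wrapper
script would build up the history described above; the path is a
placeholder.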

> Since I don't know on what date it happens a next time, I will start
> a cron job each day at 14:08 to check how stressful the PC is. It will
> output the result of the job to disk.

Yes, for sure, a daily cron job is clearly required here. Opinions
vary, but FWIW I wouldn't rely on reading the job output from the
local disc. If this is serious enough you may wanna read it outside of
the cluster environment, as said above.
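
Getting the output off the box might look something like the sketch
below. It assumes a working mail(1) and outbound mail delivery on the
slave, which this thread does not confirm. The ship_report function, the
REPORT_TO and MAIL_CMD variables, the address and the script path in the
crontab comment are all hypothetical.

```shell
#!/bin/sh
# Hypothetical names throughout: REPORT_TO, MAIL_CMD and the paths are
# assumptions for illustration, not details from this thread.
: "${REPORT_TO:=you@example.org}"
: "${MAIL_CMD:=mail}"

ship_report() {
    file=$1
    # Nothing to send if the file is missing or empty.
    [ -s "$file" ] || return 1
    # Mail the collected output to an address outside the cluster.
    $MAIL_CMD -s "pre-14:09 snapshot from $(uname -n)" "$REPORT_TO" < "$file"
}

# Example crontab entry (one line), one minute before the usual reboot:
# 8 14 * * * /usr/local/sbin/collect-and-ship.sh
```

Mailing at 14:08 leaves only a minute before the expected reboot, so it
may be worth running the job a little earlier as well if the mail does
not make it out in time.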

>> It has that funny; "I'll bet the PSU is on the way out" feeling to it,
>> but actually proving that can be tedious.
>
> I may also swap UPS between two slaves and see if the reboots are
> related to a shaky UPS. I don't want to replace the PSU yet :(.

Can't hurt, but think for a quick moment; if the box PSU is going down 
and the UPS is also shaky, then you potentially have two problems and 
not one. I'd (personally) take the step-by-step methodical approach. 
First examine the box environment for some time until you can see what 
immediately precedes the apparently spontaneous reboot, then focus on 
external issues like the UPS. Eliminate one factor at a time, even if 
you have innumerable items on your own inner list of possible suspects.

Keep us posted please. There have been a couple of instances of this
behaviour posted to the list recently, and it would be interesting (as
well as instructive) to learn in what proportion of cases the PSU is
ultimately proven to be the cause. I doubt the OS itself is to blame
in almost all cases. I mean, ffs it's FreeBSD.

Regards & HTH,
-Colin