Date: Wed, 29 Dec 2004 16:47:27 +0100
From: "Colin J. Raven"
To: Rob
cc: FreeBSD
Subject: Re: 5.3 in diskless cluster: irregular reboots at 14:09 hr. ?!?!
In-Reply-To: <41D281B5.3050107@yahoo.com>
References: <41D27378.7010103@yahoo.com> <41D281B5.3050107@yahoo.com>

On Dec 29, Rob launched this into the bitstream:

> Colin J. Raven wrote:
>> On Dec 29, Rob launched this into the bitstream:
>>
>>>
>>> I'm running 5.3-Stable on all PC's.
>>>
>>> I have a master/router with 7 diskless slaves. One of the
>>> slaves shows irregular reboots, without a trace, not even
>>> a shutdown message in the logs.
>>>
>>> Until now I have the following sudden reboots of one particular
>>> slave happen:
>>>    Nov. 16 14:09:41
>>>    Nov. 30 14:09:23
>>>    Dec. 28 14:09:34
>>>
>>> Each is exactly at the same time; this is rather peculiar, isn't it?
>>>
>>> Any idea what's going on here, or how to trace this problem?
>>
>>
>> What *else* is happening at (or immediately before) 14:09 on this machine??
>> For example is something rather intense occurring immediately beforehand?
>> I'm thinking power supply failure when it gets loaded beyond a certain
>> point...so, pursuant to that is there maybe a big log grep happening
>> beforehand, or some other event that stresses components, thus consuming
>> more power?
>
> Thank you Colin.
>
> What would be a good command to run, to find out how stressful the
> PC is right before the reboot? Is 'top' good enough? Or is there
> something better? 'ps auxw' for example?

That's a good question. I suspect there may be a wide spectrum of
opinions on that one. My own instinct would be to pipe the output of
ps -whatever-switches-you-like to a file, *then* append top output to
the same file. Just to be obsessive about it, also append
tail -[some number] /var/log/messages to the same file, and have cron
send it to you at some external address.
One day probably wouldn't show you anything, but an accumulation of data
*might* help you get to grips with conditions that immediately precede
the witching hour of 14:09.

> Since I don't know on what date it happens a next time, I will start
> a cron job each day at 14:08 to check how stressful the PC is. It will
> output the result of the job to disk.

Yes for sure, a daily cron job is clearly required here...but..
Opinions vary, but FWIW, I wouldn't read the job output on the local
disc. If this is serious enough you may wanna read it outside of the
cluster environment, as said above.

>> It has that funny; "I'll bet the PSU is on the way out" feeling to it,
>> but actually proving that can be tedious.
>
> I may also swap UPS between two slaves and see if the reboots are
> related to a shaky UPS. I don't want to replace the PSU yet :(.

Can't hurt, but think for a quick moment; if the box PSU is going down
and the UPS is also shaky, then you potentially have two problems and
not one. I'd (personally) take the step-by-step methodical approach.
First examine the box environment for some time until you can see what
immediately precedes the apparently spontaneous reboot, then focus on
external issues like the UPS. Eliminate one factor at a time, even if
you have innumerable items on your own inner list of possible suspects.

Keep us posted please. There have been a couple of instances of this
behaviour posted to the list recently; it would be interesting (as well
as instructive) to understand the proportion of cases in which the PSU
is ultimately proven to be the cause. I'd doubt the OS itself is to
blame in almost all cases. I mean, ffs it's FreeBSD.

Regards & HTH,
-Colin
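
PS: a minimal sketch of the "ps, then top, then tail, all appended to one
file" idea from earlier in this mail. It assumes Python 3 is installed on
the slave, which the thread never says; the function names, the default
output path and the exact switches are illustrative only. In particular
"top -b" assumes a top(1) with a batch mode that prints one display and
exits; if yours keeps running, the timeout cuts it off and a note is
logged instead. Having cron mail the file to an external address is left
out here.

```python
#!/usr/bin/env python3
"""Append a timestamped ps/top/tail snapshot to one file (illustrative sketch)."""
import subprocess
import sys
import time

# Assumed commands and switches; adjust to taste ("ps -whatever-switches-you-like").
DEFAULT_COMMANDS = (
    ("ps", "auxw"),
    ("top", "-b"),  # assumption: batch mode, one display, then exit
    ("tail", "-n", "50", "/var/log/messages"),
)


def run_command(argv, timeout=20):
    """Return the command's stdout+stderr, or a short note if it could not run."""
    try:
        result = subprocess.run(
            list(argv),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout,
        )
        return result.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "[could not run: {}]\n".format(exc)


def append_snapshot(path, commands=DEFAULT_COMMANDS):
    """Append one timestamped block per call, so data accumulates over the days."""
    with open(path, "a") as out:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        out.write("===== snapshot {} =====\n".format(stamp))
        for argv in commands:
            out.write("--- {} ---\n".format(" ".join(argv)))
            out.write(run_command(argv))
            out.write("\n")


if __name__ == "__main__":
    # Hypothetical default location; pass another path as the first argument.
    append_snapshot(sys.argv[1] if len(sys.argv) > 1 else "/var/tmp/snapshot.log")
```

Run daily from cron at 14:08, as Rob proposes, each run adds one block to
the file, so the run just before a reboot can be compared with quiet days.
On a diskless slave the file should live somewhere that survives the
reboot and can be read from outside the box.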