From owner-freebsd-arch@FreeBSD.ORG  Mon Jan 21 07:22:38 2013
Message-ID: <50FCECBD.9090002@mu.org>
Date: Sun, 20 Jan 2013 23:22:37 -0800
From: Alfred Perlstein <bright@mu.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7;
 rv:17.0) Gecko/20130107 Thunderbird/17.0.2
MIME-Version: 1.0
To: Ian Lepore <ian@FreeBSD.org>
Subject: Re: RFC: enhanced watchdog.
References: <201301190604.r0J64RbW009298@svn.freebsd.org>
 <50FA3D36.4080709@mu.org> <1358743064.32417.409.camel@revolution.hippie.lan>
In-Reply-To: <1358743064.32417.409.camel@revolution.hippie.lan>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "arch@freebsd.org" <arch@FreeBSD.org>

On 1/20/13 8:37 PM, Ian Lepore wrote:
> On Fri, 2013-01-18 at 22:29 -0800, Alfred Perlstein wrote:
>> We at iX are trying to enhance the watchdog and we think some of the
>> changes may benefit the community as a whole.
>>
>> We want to make it easy for developers to prototype watchdog
>> scripts in a "test-only" mode that only logs when the watchdog would
>> have failed.
>>
>> I have most of the code done, but could really use help on three things:
>>
>> 1) review
>> 2) suggestion for inserting the warning messages from the userland
>> watchdogd into the kernel message buffer.
>> 3) suggestion for logging/warning of pending death.
>>
>> In detail:
>> 1) The reason for review should be obvious, we want to make sure that
>> this works for everyone.
>> 2) The reason for inserting messages into the kernel log is because that
>> is the easiest place for us to recover the diagnostics when we do have a
>> crash due to watchdog.  Maybe there is a smarter thing to do?
> I've recently wished for a way that a sufficiently-credentialed userland
> process could, in effect, kernel-printf.  I've been burned a number of
> times by init(8) failing to start up for various reasons such as
> no /dev, and it has no way to say what's wrong.  It's surprisingly hard
> to figure out what the problem is.
>
> For your need, a possibility I guess would be to have the watchdog device
> do it for you, since you're already talking to it.  Who knows, maybe
> some special watchdog hardware would be able to do something useful with
> a short message.  I've worked with hardware that has a few registers
> designed to survive a reboot, for communicating with your reincarnated
> self; nothing big enough for arbitrary strings yet, but hardware just
> keeps getting cooler all the time.
I'm almost wondering if there's some kind of /dev/klog we should/could have?


>
>> 3) What is a good way to warn of impending death?  I was thinking of just
>> another thread in the process that would be signalled before the
>> watchdog script was run and would log when the timer is about to expire
>> or based on a configurable threshold.
>>
> SIGALRM that fires shortly before death?
That sounds great.  I'll look into that.
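
To make the idea concrete, here is a rough userland sketch of a SIGALRM that fires shortly before the timeout. It is only an illustration in Python, not the watchdogd code; the names arm_prewarning and disarm_prewarning and the margin parameter are made up for this example.

```python
import signal
import time


def arm_prewarning(timeout_s, margin_s, on_warn):
    """Ask for SIGALRM margin_s seconds before timeout_s would elapse.

    on_warn(margin_s) is called from the signal handler. All names here
    are hypothetical; this is not the watchdogd interface.
    """
    delay = timeout_s - margin_s
    if delay <= 0:
        raise ValueError("margin must be smaller than the timeout")
    signal.signal(signal.SIGALRM, lambda signum, frame: on_warn(margin_s))
    # ITIMER_REAL counts real time and delivers SIGALRM when it expires.
    signal.setitimer(signal.ITIMER_REAL, delay)
    return delay


def disarm_prewarning():
    """Cancel a pending warning, e.g. once the watchdog was patted in time."""
    signal.setitimer(signal.ITIMER_REAL, 0)


if __name__ == "__main__":
    # Pretend the watchdog expires in 0.5 s and warn 0.2 s ahead of that.
    arm_prewarning(0.5, 0.2,
                   lambda left: print("watchdog expires in", left, "s"))
    time.sleep(0.6)
    disarm_prewarning()
```

The handler would presumably log instead of print, and each successful pat would re-arm the timer with a fresh delay.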

>
>> Finally, there is some thought about adding a kernel daemon to the
>> watchdog facility that would allow us to strobe watchdogs with low max
>> values while our userland watchdog was polling the system.
>>
>> Why??? Well because the ICH driver has a max timeout of ~2 minutes.  We
>> really want to be able to leverage this watchdog, but also go higher
>> than this.  The way to do this is to drive the system almost like a step
>> up electrical relay.
>>
> I very much like this.  A new ARM SoC I'm about to start working with
> has a max 16 second watchdog, and I'm afraid things like firmware
> updaters might lock out userland for longer than that on such a wimpy
> chip.
>
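The relay idea can be modeled roughly as follows. This is a userland Python model of the logic only, not the proposed kernel daemon; relay_watchdog, pat_hw and last_checkin are hypothetical names, and patting at half the hardware maximum is an assumption chosen for the example.

```python
import time


def relay_watchdog(soft_timeout_s, hw_max_s, pat_hw, last_checkin,
                   now=time.monotonic, sleep=time.sleep):
    """Stretch a short hardware watchdog into a longer software timeout.

    Keeps patting the hardware while userland has checked in within
    soft_timeout_s, then stops so the hardware timer is left to expire.
    last_checkin() returns the time of the last userland check-in on the
    same clock as now(). Returns the number of pats issued.
    """
    if hw_max_s <= 0 or soft_timeout_s <= 0:
        raise ValueError("timeouts must be positive")
    interval = hw_max_s / 2.0  # assumption: pat well inside the hardware limit
    pats = 0
    while now() - last_checkin() < soft_timeout_s:
        pat_hw()
        pats += 1
        sleep(interval)
    return pats


if __name__ == "__main__":
    # Simulated clock: a 16 s hardware maximum stretched to a 60 s timeout.
    clock = [0.0]

    def fake_sleep(seconds):
        clock[0] += seconds

    count = relay_watchdog(60, 16, lambda: None, lambda: 0.0,
                           now=lambda: clock[0], sleep=fake_sleep)
    print("pats issued:", count)
```

One consequence of this shape is that the actual reset lands somewhat after the software deadline, by up to hw_max_s, because the hardware timer still has to run out after the last pat.
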
>> [... code ...]
> I skimmed through the code, but it's been a long day of reading code for
> me, so I'm not gonna pretend it was a thorough review.  The main thing
> that popped out at me was 'carp'.  Shouldn't a watchdog bark? :)
ha!


>
> I'm also curious why you chose CLOCK_UPTIME_FAST, which I'm not familiar
> with (gonna be reading a manpage in a minute).  Not knowing about some
> of the newer choices, I probably would've used CLOCK_MONOTONIC.
I unfortunately am a generalist and my clock-fu is weak.  I can look 
into switching to that.
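
Whichever clock wins, the deadline check itself should not care. A small sketch, in Python for illustration only and assuming a Unix system, with the clock id as a parameter so that swapping clocks is a one-line change (deadline_after and expired are made-up names; CLOCK_UPTIME is only exposed on some platforms, so the sketch falls back to CLOCK_MONOTONIC):

```python
import time

# Assumption: prefer CLOCK_UPTIME where the platform exposes it,
# otherwise use CLOCK_MONOTONIC.
CLOCK_ID = getattr(time, "CLOCK_UPTIME", time.CLOCK_MONOTONIC)


def deadline_after(seconds, clock_id=CLOCK_ID):
    """Return a deadline `seconds` from now on the given clock."""
    return time.clock_gettime(clock_id) + seconds


def expired(deadline, clock_id=CLOCK_ID):
    """True once the given clock has reached the deadline."""
    return time.clock_gettime(clock_id) >= deadline


if __name__ == "__main__":
    d = deadline_after(0.1)
    print("expired immediately:", expired(d))
    time.sleep(0.2)
    print("expired after sleeping:", expired(d))
```

Either clock is unaffected by someone setting the wall-clock time, which is the property the timeout logic needs.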

What would be the difference between the two in general?

-Alfred