From owner-freebsd-arch@FreeBSD.ORG Mon Jan 21 07:22:38 2013
Message-ID: <50FCECBD.9090002@mu.org>
Date: Sun, 20 Jan 2013 23:22:37 -0800
From: Alfred Perlstein
To: Ian Lepore
Cc: "arch@freebsd.org"
Subject: Re: RFC: enhanced watchdog.
References: <201301190604.r0J64RbW009298@svn.freebsd.org> <50FA3D36.4080709@mu.org> <1358743064.32417.409.camel@revolution.hippie.lan>
In-Reply-To: <1358743064.32417.409.camel@revolution.hippie.lan>
List-Id: Discussion related to FreeBSD architecture

On 1/20/13 8:37 PM, Ian Lepore wrote:
> On Fri, 2013-01-18 at 22:29 -0800, Alfred Perlstein wrote:
>> We at iX are trying to enhance the watchdog and we think some of the
>> changes may benefit the community as a whole.
>>
>> Basically we want to make it easy for developers to prototype watchdog
>> scripts in a "test-only" mode that basically logs if the watchdog had
>> failed.
>>
>> I have most of the code done, but could really use help on three things:
>>
>> 1) review
>> 2) suggestion for inserting the warning messages from the userland
>> watchdogd into the kernel message buffer.
>> 3) suggestion for logging/warning of pending death.
>>
>> In detail:
>> 1) The reason for review should be obvious, we want to make sure that
>> this works for everyone.
>> 2) The reason for inserting messages into the kernel log is because that
>> is the easiest place for us to recover the diagnostics when we do have a
>> crash due to watchdog. Maybe there is a smarter thing to do?
> I've recently wished for a way that a sufficiently-credentialed userland
> process could, in effect, kernel-printf. I've been burned a number of
> times by init(8) failing to start up for various reasons such as
> no /dev, and it has no way to say what's wrong. It's surprisingly hard
> to figure out what the problem is.
>
> For your need, a possibility I guess would be to have the watchdog device
> do it for you, since you're already talking to it. Who knows, maybe
> some special watchdog hardware would be able to do something useful with
> a short message. I've worked with hardware that has a few registers
> designed to survive a reboot, for communicating with your reincarnated
> self; nothing big enough for arbitrary strings yet, but hardware just
> keeps getting cooler all the time.

I'm almost wondering if there's some kind of /dev/klog we should/could have?

>
>> 3) What is a good way to warn of impending death? I was thinking of just
>> another thread in the process that would be signalled before the
>> watchdog script was run and would log when the timer is about to expire
>> or based on a configurable threshold.
>>
> SIGALRM that fires shortly before death?

That sounds great. I'll look into that.
>
>> Finally, there is some thought about adding a kernel daemon to the
>> watchdog facility that would allow us to strobe watchdogs with low max
>> values while our userland watchdog was polling the system.
>>
>> Why??? Well because the ICH driver has a max timeout of ~2 minutes. We
>> really want to be able to leverage this watchdog, but also go higher
>> than this. The way to do this is to drive the system almost like a step
>> up electrical relay.
>>
> I very much like this. A new ARM SoC I'm about to start working with
> has a max 16 second watchdog, and I'm afraid things like firmware
> updaters might lock out userland for longer than that on such a wimpy
> chip.
>
>> [... code ...]
> I skimmed through the code, but it's been a long day of reading code for
> me, so I'm not gonna pretend it was a thorough review. The main thing
> that popped out at me was 'carp'. Shouldn't a watchdog bark? :)

ha!

>
> I'm also curious why you chose CLOCK_UPTIME_FAST, which I'm not familiar
> with (gonna be reading a manpage in a minute). Not knowing about some
> of the newer choices, I probably would've used CLOCK_MONOTONIC.

I unfortunately am a generalist and my clock-fu is weak. I can look into
switching to that. What would be the difference between the two in
general?

-Alfred
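
To make the "step up relay" idea quoted above concrete, here is a rough sketch of the decision a kernel-side helper could make on each wakeup. It is an assumption about how such a relay might be structured, not the proposed implementation; struct wd_relay and both functions are hypothetical names, and times are plain seconds from some monotonic clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical relay state; all times are in seconds. */
struct wd_relay {
	uint64_t soft_timeout;	/* long timeout requested by userland */
	uint64_t hw_max;	/* largest timeout the hardware accepts */
	uint64_t last_pat;	/* monotonic time of the last userland pat */
};

/*
 * Decide whether the helper should strobe the hardware watchdog at
 * time `now`.  It keeps strobing only while userland has patted within
 * soft_timeout; after that it stops and lets the hardware expire.
 */
static bool
wd_relay_should_strobe(const struct wd_relay *r, uint64_t now)
{
	return (now - r->last_pat < r->soft_timeout);
}

/*
 * How often the helper wakes up: half the hardware maximum, so a
 * single late wakeup does not cause a spurious reset.
 */
static uint64_t
wd_relay_interval(const struct wd_relay *r)
{
	uint64_t half;

	assert(r->hw_max > 0);
	half = r->hw_max / 2;
	return (half > 0 ? half : 1);
}
```

For example, with hw_max = 120 (roughly the ICH limit mentioned above) and soft_timeout = 600, the helper would wake every 60 seconds and stop strobing 600 seconds after the last userland pat, so the reset would come no later than soft_timeout + hw_max after that pat.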