Date: Sun, 20 Jan 2013 23:22:37 -0800
From: Alfred Perlstein <bright@mu.org>
To: Ian Lepore <ian@FreeBSD.org>
Cc: "arch@freebsd.org" <arch@FreeBSD.org>
Subject: Re: RFC: enhanced watchdog.
Message-ID: <50FCECBD.9090002@mu.org>
In-Reply-To: <1358743064.32417.409.camel@revolution.hippie.lan>
References: <201301190604.r0J64RbW009298@svn.freebsd.org> <50FA3D36.4080709@mu.org> <1358743064.32417.409.camel@revolution.hippie.lan>

On 1/20/13 8:37 PM, Ian Lepore wrote:
> On Fri, 2013-01-18 at 22:29 -0800, Alfred Perlstein wrote:
>> We at iX are trying to enhance the watchdog and we think some of the
>> changes may benefit the community as a whole.
>>
>> Basically we want to make it easy for developers to prototype watchdog
>> scripts in a "test-only" mode that basically logs if the watchdog had
>> failed.
>>
>> I have most of the code done, but could really use help on three things:
>>
>> 1) review
>> 2) suggestion for inserting the warning messages from the userland
>> watchdogd into the kernel message buffer.
>> 3) suggestion for logging/warning of pending death.
>>
>> In detail:
>> 1) The reason for review should be obvious, we want to make sure that
>> this works for everyone.
>> 2) The reason for inserting messages into the kernel log is because that
>> is the easiest place for us to recover the diagnostics when we do have a
>> crash due to watchdog. Maybe there is a smarter thing to do?
> I've recently wished for a way that a sufficiently-credentialed userland
> process could, in effect, kernel-printf. I've been burned a number of
> times by init(8) failing to start up for various reasons such as
> no /dev, and it has no way to say what's wrong. It's surprisingly hard
> to figure out what the problem is.
>
> For your need, a possibility I guess would be to have the watchdog device
> do it for you, since you're already talking to it. Who knows, maybe
> some special watchdog hardware would be able to do something useful with
> a short message. I've worked with hardware that has a few registers
> designed to survive a reboot, for communicating with your reincarnated
> self; nothing big enough for arbitrary strings yet, but hardware just
> keeps getting cooler all the time.

I'm almost wondering if there's some kind of /dev/klog we should/could have?

>
>> 3) What is a good way to warn of impending death? I was thinking of just
>> another thread in the process that would be signalled before the
>> watchdog script was run and would log when the timer is about to expire
>> or based on a configurable threshold.
>>
> SIGALRM that fires shortly before death?

That sounds great. I'll look into that.

>
>> Finally, there is some thought about adding a kernel daemon to the
>> watchdog facility that would allow us to strobe watchdogs with low max
>> values while our userland watchdog was polling the system.
>>
>> Why??? Well because the ICH driver has a max timeout of ~2 minutes. We
>> really want to be able to leverage this watchdog, but also go higher
>> than this. The way to do this is to drive the system almost like a step
>> up electrical relay.
>>
> I very much like this. A new ARM SoC I'm about to start working with
> has a max 16 second watchdog, and I'm afraid things like firmware
> updaters might lock out userland for longer than that on such a wimpy
> chip.
>
>> [... code ...]
> I skimmed through the code, but it's been a long day of reading code for
> me, so I'm not gonna pretend it was a thorough review. The main thing
> that popped out at me was 'carp'. Shouldn't a watchdog bark? :)

ha!

>
> I'm also curious why you chose CLOCK_UPTIME_FAST, which I'm not familiar
> with (gonna be reading a manpage in a minute). Not knowing about some
> of the newer choices, I probably would've used CLOCK_MONOTONIC.

I unfortunately am a generalist and my clock-fu is weak. I can look
into switching to that. What would be the difference between the two in
general?

-Alfred
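
For readers following the "step up relay" idea quoted above, here is a
minimal sketch of the bookkeeping it seems to imply: a long software
deadline set by userland, and a periodic relay tick that re-arms a
short hardware watchdog only while that deadline has not passed. This
is not the code from the patch under review. The names relay_wd,
relay_wd_pat and relay_wd_tick are hypothetical, times are plain uptime
seconds, and it assumes the hardware accepts any timeout from 1 second
up to its maximum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: a long software deadline on top of a short hardware timer. */
struct relay_wd {
	uint64_t hw_max_s;   /* longest timeout the hardware accepts, e.g. 16 */
	uint64_t deadline_s; /* uptime (seconds) at which the software timeout lapses */
};

/* Userland checked in at now_s: push the software deadline out by timeout_s. */
static void
relay_wd_pat(struct relay_wd *wd, uint64_t now_s, uint64_t timeout_s)
{
	assert(wd != NULL);
	wd->deadline_s = now_s + timeout_s;
}

/*
 * Periodic relay tick. Returns the timeout (seconds) to program into the
 * hardware watchdog, or 0 when the software deadline has passed and the
 * relay should stop strobing so the hardware is left to expire.
 */
static uint64_t
relay_wd_tick(const struct relay_wd *wd, uint64_t now_s)
{
	uint64_t remaining;

	assert(wd != NULL);
	if (now_s >= wd->deadline_s)
		return (0);
	remaining = wd->deadline_s - now_s;
	/* Never ask the hardware for more than it supports. */
	return (remaining < wd->hw_max_s ? remaining : wd->hw_max_s);
}
```

With hw_max_s set to 16 and a 600 second software timeout, the tick
keeps returning 16 until the last 16 seconds, then shrinks toward the
deadline, then returns 0. The relay would have to tick more often than
hw_max_s for this to hold; how often, and what "stop strobing" means
for a given driver, is left open here.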