From: Jeff Roberson
To: arch@freebsd.org
Date: Fri, 15 Nov 2002 19:21:08 -0500 (EST)
Subject: Software Watchdog
Message-ID: <20021115191632.U22491-100000@mail.chesapeake.net>

Sean Kelly has implemented a software watchdog based on input from Peter
and me. It works through a simple watchdog daemon that checks in with the
kernel every so often. The kernel complains via hardclock() if the watchdog
times out. This will be very useful for debugging hard lockups because
hardclock() comes in through a fast interrupt, and there are few things
that will stop hardclock() from firing.

Below I have included some snippets from an email Sean sent me.

Here's what I've got so far:

1. Kernel watchdog
   a. Three sysctls
      i.   debug.watchdog.timeout: Number of seconds allowed to go without
           a reset
      ii.  debug.watchdog.reset: Upon read or write, resets the watchdog
           timer
      iii. debug.watchdog.enabled: When >0, perform watchdog checks.
   b. 'options WATCHDOG' or 'options INVARIANTS' to compile with watchdog
      code
   c. watchdog(4) manpage

2. Userland support
   a. /usr/sbin/watchdogd
      i.   Performs stat("/etc") test
      ii.  Awakens periodically and resets watchdog via the
           debug.watchdog.reset sysctl
      iii. Sets debug.watchdog.enabled=1 on start and
           debug.watchdog.enabled=0 on exit.
      iv.  Proper signal handling.
      v.   Writes pidfile in /var/run/watchdogd.pid
   b. watchdogd(8) manpage
   c. /etc/rc check for watchdogd_enabled="YES"
   d. /etc/rc.d/watchdogd rcNG script
   e. Addition of 'watchdogd_enabled="NO"' to /etc/defaults/rc.conf

I have a short TODO list as well:

* Deal with when ticks overflows (this will be pretty easy)
* Do multiple instances of interrupt and backtrace outputs a few seconds
  apart. (This will be pretty easy)
* Flesh out the watchdogd daemon to do more checks once I figure out what
  checks people advise it do. And by checks, I mean "test a, b, and c must
  not fail or I won't reset the watchdog."

What I have so far is available for viewing at
http://www.zombie.org/watchdog.diff

I believe this functionality will be invaluable for debugging 5.0. I'd
like to have this included as soon as the TODO list is covered and it gets
a proper review.

Comments?

Cheers,
Jeff
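
To illustrate the "ticks overflows" TODO item, here is a minimal C sketch of
a timeout check that keeps working when the tick counter wraps. It is not
taken from the diff. The struct, the function names, and WD_HZ are made up
for the example, and it assumes ticks is a signed int that wraps around.
The three fields only loosely mirror the three sysctls.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed clock rate in ticks per second; purely illustrative. */
#define WD_HZ 100

/* Hypothetical watchdog state, loosely mirroring the sysctls. */
struct watchdog {
	int enabled;	/* like debug.watchdog.enabled: check when > 0 */
	int timeout;	/* like debug.watchdog.timeout: seconds allowed */
	int last_reset;	/* tick count recorded at the last reset */
};

/* What a read or write of a reset sysctl might do. */
static void
watchdog_reset(struct watchdog *wd, int ticks)
{
	assert(wd != NULL);
	wd->last_reset = ticks;
}

/*
 * Return nonzero when more than 'timeout' seconds of ticks have passed
 * since the last reset. The subtraction is done in unsigned arithmetic,
 * which is defined modulo 2^N, so the elapsed count stays correct even
 * if 'ticks' has wrapped from INT_MAX to INT_MIN since the reset.
 */
static int
watchdog_expired(const struct watchdog *wd, int ticks)
{
	unsigned int elapsed;

	assert(wd != NULL);
	if (wd->enabled <= 0)
		return (0);
	elapsed = (unsigned int)ticks - (unsigned int)wd->last_reset;
	return (elapsed > (unsigned int)wd->timeout * WD_HZ);
}
```

A clock handler would call watchdog_expired() with the current tick count on
each pass and complain when it returns nonzero. The comparison stays valid
as long as the real elapsed time is less than one full wrap of the counter.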