Date:      Sat, 16 Jan 1999 07:33:03 +0100
From:      Eivind Eklund <eivind@FreeBSD.ORG>
To:        Archie Cobbs <archie@whistle.com>
Cc:        freebsd-current@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
Subject:   Re: Automated debug sanity checkers
Message-ID:  <19990116073302.B6405@bitbox.follo.net>
In-Reply-To: <199901160512.VAA07999@bubba.whistle.com>; from Archie Cobbs on Fri, Jan 15, 1999 at 09:12:07PM -0800
References:  <199901160512.VAA07999@bubba.whistle.com>

On Fri, Jan 15, 1999 at 09:12:07PM -0800, Archie Cobbs wrote:
> I was thinking about the DIAGNOSTICS replacement macros and
> had a random thought...
> 
> Suppose you're sitting in front of a ddb (or better yet gdb) prompt
> because your kernel has just crashed due to who knows what reason.
> What do you do to debug this? You start looking at variables,
> memory, etc for anything funny going on.
> 
> For example, several times we've spent hours going through a crash
> dump to find, for example, that a process was on two queues, or
> some mbuf was mangled, etc.
> 
> The thought is that it would be really easy to help automate this
> process, by doing the following:
> 
>  1. Define a new kernel option INCLUDE_SANITY_CHECKS (or whatever)

INVARIANT_SUPPORT.

Hey, I just happen to remember that somebody added this a couple of
days ago - hmm, could it have been me?  :-)

>  2. When this is defined, all the various FreeBSD kernel
>     submodules (VM, networking, device drivers, etc) would
>     include a function that exhaustively runs sanity checks --
>     ie, validations that all the assumptions in the code are true --
>     for that particular submodule. This means checking all queues,
>     flags, whatever.

I.e., invariants.

>  4. The function is linked into a linker set SANITY_SET(...) or whatever

I've not thought of that - that may be a good idea.
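
To make the idea concrete, here is a minimal userland sketch of such a
set.  It is only an illustration under loud assumptions: a plain static
array stands in for a real linker set (which would collect the entries
at link time), and every name in it (sanity_entry, sanity_set,
sanity_check_all, the toy_* state) is hypothetical, not an existing
kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical: each subsystem exports one checker that returns the
 * number of violated assumptions it found (0 means "looks sane"). */
typedef int (*sanity_fn)(void);

struct sanity_entry {
    const char *name;   /* subsystem name, for the report */
    sanity_fn   check;  /* exhaustive checker for that subsystem */
};

/* Toy state standing in for real kernel data structures. */
static int toy_queue_len = 3;
static int toy_queue_cap = 8;
static int toy_flags = 0x1;
#define TOY_VALID_FLAGS 0x3

static int toy_queue_sanity(void)
{
    int bad = 0;

    if (toy_queue_len < 0)
        bad++;
    if (toy_queue_len > toy_queue_cap)
        bad++;
    return bad;
}

static int toy_flags_sanity(void)
{
    /* Any bit outside the valid mask counts as one violation. */
    return (toy_flags & ~TOY_VALID_FLAGS) ? 1 : 0;
}

/* Stand-in for the linker set: a static table of checkers. */
static const struct sanity_entry sanity_set[] = {
    { "toy_queue", toy_queue_sanity },
    { "toy_flags", toy_flags_sanity },
};

/* Single entry point, the one you would call from the debugger.
 * Returns the total number of violations across all subsystems. */
int sanity_check_all(void)
{
    int total = 0;
    size_t i;

    for (i = 0; i < sizeof(sanity_set) / sizeof(sanity_set[0]); i++) {
        int bad;

        assert(sanity_set[i].check != NULL);
        bad = sanity_set[i].check();
        if (bad != 0)
            printf("sanity: %s: %d violation(s)\n",
                sanity_set[i].name, bad);
        total += bad;
    }
    return total;
}
```

With a real linker set each subsystem would add its own entry next to
its own code, so the table would never need to be edited by hand.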

> Then by simply calling this function from the debugger you can
> much more quickly narrow down on the problem (and hopefully fix
> it before you get tired and go to sleep :-)
> 
> Moreover, since the function is running post-mortem, it can do
> very detailed checks that would otherwise take way too long.
> E.g., check every mbuf, every queue entry, check the filesystem,
> etc. Basically a "fsck" for the kernel memory.

You do not only want to call this at post-mortem.  You often want to
selectively use this while the kernel is running.

Example: At one point (a year and a half or so ago), I was debugging the
tty driver in bisdn.  For some reason, it was crashing in various ways
at various times, with no sane reason - just garbage data.  I spent
quite a bit of time looking at this, finding no reason for the faults
- they "just happened", taking on average perhaps 4 hours under
load to trigger.

As I was getting more and more frustrated with attempting to shotgun
debug this, I went back to my normal mode of development - I wrote
invariants for all data structures in the vicinity.  When I added an
invariant for the clist structures (and checks of it all over the
place), I found that my "crash" (now an invariant-failure panic)
time went down to two minutes - and that it always failed the same way,
with the same stack backtrace, instead of crashing at various random
points.
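
As a rough illustration of the technique (not the actual clist code,
whose layout is different), here is a sketch of an invariant for a
small character ring, checked on entry to and exit from every
operation.  The struct and all function names are hypothetical, and
assert() stands in for what would be a panic in the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 16

/* Hypothetical character ring; NOT the real clist layout. */
struct ring {
    unsigned char buf[RING_SIZE];
    size_t head;    /* index of the next byte to read */
    size_t tail;    /* index of the next free slot */
    size_t count;   /* number of bytes currently stored */
};

/* Returns 1 if every assumption about the ring holds, 0 otherwise. */
int ring_invariant(const struct ring *r)
{
    if (r == NULL)
        return 0;
    if (r->head >= RING_SIZE || r->tail >= RING_SIZE)
        return 0;
    if (r->count > RING_SIZE)
        return 0;
    /* head, tail and count must agree with each other. */
    if ((r->head + r->count) % RING_SIZE != r->tail)
        return 0;
    return 1;
}

/* Append one byte; returns 0 on success, -1 if the ring is full. */
int ring_put(struct ring *r, unsigned char c)
{
    assert(ring_invariant(r));
    if (r->count == RING_SIZE)
        return -1;
    r->buf[r->tail] = c;
    r->tail = (r->tail + 1) % RING_SIZE;
    r->count++;
    assert(ring_invariant(r));
    return 0;
}

/* Remove one byte; returns the byte, or -1 if the ring is empty. */
int ring_get(struct ring *r)
{
    int c;

    assert(ring_invariant(r));
    if (r->count == 0)
        return -1;
    c = r->buf[r->head];
    r->head = (r->head + 1) % RING_SIZE;
    r->count--;
    assert(ring_invariant(r));
    return c;
}
```

The point of checking at both ends of each operation is that a
structure corrupted by somebody else (say, an interrupt arriving in
the middle of an update) is caught at the next call that touches it,
rather than whenever the garbage finally causes a crash.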

The reason for the bug turned out to be that both I and the
implementor of the driver had missed the change of spls from levels in
4.4BSD to masks in FreeBSD.  After I had seen the invariant failure, I
could see that something was being interrupted between two spls - and
after 3 minutes of reading the FreeBSD manpage and three lines of
changes I had something that worked.

That driver had been non-functional for at least three releases of
bisdn (and the userland code to handle it was not even there, which I
expect was due to this).  I further expect that somebody had tried
pretty hard to debug it, as they had spent the time to actually write
it.  The fact that I (who at that point had little experience with
the FreeBSD kernel) was able to debug that in a couple of hours
where others had spent more time and failed before me shows some of the
power of invariants for finding obscure bugs.

I would like to have invariants available for all significant data
structures, and I'm planning to write them up as I get time for it.

> Is this something that people would be motivated enough to make
> as "official" FreeBSD kernel good housekeeping policy?

I suspect a large number of us will use it, making it likely it will
sort of maintain itself.

Eivind.
