FreeBSD Mail Archives

Date:      Wed, 10 Dec 1997 03:54:50 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        henrich@crh.cl.msu.edu (Charles Henrich)
Cc:        eivind@yes.no, perhaps@yes.no, freebsd-current@freebsd.org
Subject:   Re: VM system info
Message-ID:  <199712100354.UAA07489@usr06.primenet.com>
In-Reply-To: <19971209145549.60899@crh.cl.msu.edu> from "Charles Henrich" at Dec 9, 97 02:55:49 pm

> > There are four ways to cope: (1) Ignore error; return OK, even though the
> > function failed to do its job.  (2) Return error code (3) Throw an
> > exception of some sort, e.g. longjmp().  (4) panic(), a la assert().
> 
> That depends greatly on the situation.  There is also a (5) that says
> take all given known information and continue onward, while logging
> the error.  In some cases it's obviously not possible where a routine
> is designed to have no return value.
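
The five options listed above can be sketched in C roughly as follows.
This is only an illustration, not code from the FreeBSD tree: the enum,
cope_with_failure(), recover_point and log_count are all hypothetical
names, and abort() stands in for a kernel panic().

```c
#include <assert.h>
#include <errno.h>
#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical names for the five ways to cope discussed above. */
enum cope {
	COPE_IGNORE,		/* (1) return OK even though we failed */
	COPE_RETURN_ERROR,	/* (2) return an error code */
	COPE_THROW,		/* (3) "exception" via longjmp() */
	COPE_PANIC,		/* (4) stop the machine */
	COPE_LOG_AND_CONTINUE	/* (5) log the error and carry on */
};

jmp_buf recover_point;		/* set by the caller with setjmp() */
int log_count = 0;		/* number of errors logged so far */

/* Called at the point where a failure has just been detected. */
int
cope_with_failure(enum cope how)
{
	switch (how) {
	case COPE_IGNORE:
		return 0;
	case COPE_RETURN_ERROR:
		return EIO;
	case COPE_THROW:
		longjmp(recover_point, 1);
		/* NOTREACHED */
	case COPE_PANIC:
		abort();	/* userland stand-in for panic() */
		/* NOTREACHED */
	case COPE_LOG_AND_CONTINUE:
		log_count++;
		fprintf(stderr, "error detected, continuing\n");
		return 0;
	}
	return EIO;
}
```

Note that (1) and (5) look identical to the caller; the only difference
is whether anyone can find out afterwards that something went wrong.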

[ ... ]

> I'm not arguing the "trap all errors as soon as possible" piece; I'm arguing
> about what you do when you detect one.  To shut down the machine is the worst
> solution.

I've recently identified (but not isolated) a bug in the FreeBSD network
code that can apparently spam the kernel stack of any process currently
in the kernel.

I have yet to track this down because all I can see is the side effect,
not the effect that results in the spamming.

Another engineer has identified the most probable place that the spam
occurred, simply because there's no place else that even looks vaguely
like it could result in what I'm seeing:

	o	In select(), selscan() got a page not present
		error when accessing obits[0].  This is not an
		error I can "ignore and log".  The select() was
		initiated by syslogd for input on its TCP (fd=3)
		and UDP (fd=4) ports.

	o	Apparently, something is spamming the contents
		of the kernel stack.  You can see this by going
		into kdb and examining the *ibits[3], *obits[3],
		and atv values and noting something that looks like
		a sockaddr with the following attributes:

		o	A sa_len of 0x20
		o	A sa_family of 0xff
		o	The MAC addr of a remote machine
		o	The MAC addr of the local machine
		o	A protocol value of 0800 (IP)

	o	There is (apparently) only one place in the kernel
		(a dereference of *eh members, where eh is an mdata(m...)
		of an mbuf) where this data could have originated.
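
To hunt for that pattern in a dump of a stack, something like the
following sketch could be used.  The byte layout is an assumption on my
part (sa_len, sa_family, then the remote MAC, the local MAC and the
protocol, packed back to back), and find_stray_header() is a
hypothetical helper, not an existing kernel function.  The remote MAC
is treated as a wildcard since it is not known in advance.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Assumed layout of the stray data:
 *   byte 0      sa_len    (0x20)
 *   byte 1      sa_family (0xff)
 *   bytes 2-7   remote MAC (not checked)
 *   bytes 8-13  local MAC
 *   bytes 14-15 protocol  (08 00)
 */
#define SIG_LEN 16

/* Return the offset of the first match in buf, or -1 if there is none. */
long
find_stray_header(const unsigned char *buf, size_t len,
    const unsigned char local_mac[6])
{
	size_t i;

	for (i = 0; i + SIG_LEN <= len; i++) {
		const unsigned char *p = buf + i;

		if (p[0] == 0x20 && p[1] == 0xff &&	/* sa_len, sa_family */
		    memcmp(p + 8, local_mac, 6) == 0 &&	/* local MAC */
		    p[14] == 0x08 && p[15] == 0x00)	/* protocol 0800 */
			return (long)i;
	}
	return -1;
}
```

A hit only says the pattern is present in the region that was scanned;
it says nothing about who wrote it there.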

The only fruitful approach is to check for a *eh < 0xf0000000, with an
assert that panics to stop the processor earlier in the problem.
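
A minimal sketch of such a check, written so it can be exercised in
userland.  The 0xf0000000 boundary is the value quoted above, taken
here as the assumed start of kernel addresses; check_kernel_ptr() and
sketch_panic() are hypothetical names, and a real kernel would call
panic() and not return.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed lowest valid kernel address (the value quoted above). */
#define KERNBASE_ASSUMED	((uintptr_t)0xf0000000u)

int panic_count = 0;	/* how many times the check has fired */

/* Userland stand-in for panic(): record and report, then return. */
void
sketch_panic(const char *msg, uintptr_t addr)
{
	panic_count++;
	fprintf(stderr, "panic: %s (addr=0x%lx)\n", msg,
	    (unsigned long)addr);
}

/*
 * Return 1 if addr is at or above the assumed kernel base.
 * Otherwise "panic" and return 0, so the caller stops before
 * dereferencing a pointer that cannot be a valid kernel address.
 */
int
check_kernel_ptr(uintptr_t addr)
{
	if (addr < KERNBASE_ASSUMED) {
		sketch_panic("eh is below the kernel base", addr);
		return 0;
	}
	return 1;
}
```

The point is to fire at the first bad pointer, while the culprit is
still on the stack, instead of much later in whichever process happens
to trip over the damage.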


This particular problem could result in random "non-fatal" corruption
of data in *your* kernel.  It's probably responsible for many "impossible"
situation type crashes (hint: random kernel stack stomping of a victim
process's stack is not a good thing).

If you can think of a way *other* than an assert to find this problem,
I'm open to suggestions.


> Let's think for a moment about the case if you're the computer system
> on an F-15 fighter jet, the last thing the pilot wants to see is
> "Panic, system halted" as he spirals to his death instead of the
> software attempting to cope as best as possible.

Probably he would be less happy with "missile launched" as he's landing
on a friendly aircraft carrier because of some cascade failure.

BTW, to handle that: you do a fast reset from ROM and hope the error
doesn't occur again.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199712100354.UAA07489>
