Date:      Sun, 23 Aug 1998 03:19:16 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        dyson@iquest.net
Cc:        tlambert@primenet.com, sos@FreeBSD.ORG, croot@btp1da.phy.uni-bayreuth.de, regnauld@deepo.prosa.dk, current@FreeBSD.ORG, smp@FreeBSD.ORG
Subject:   Re: softupdates and smp crash
Message-ID:  <199808230319.UAA21616@usr04.primenet.com>
In-Reply-To: <199808220523.AAA19739@dyson.iquest.net> from "John S. Dyson" at Aug 22, 98 00:23:34 am

> > I'm also not convinced that this is the only possible cause of
> > the problem; the VM code is hardly "assert" protected everywhere,
> > so diagnosing this thing is not trivial.  Look at the VM fixes
> > I recently did, which killed the bugs Karl Denninger was seeing
> > in 75% of the cases, leaving 25% of the cases "clustered" (in his
> > words), indicating a separate problem, in addition to the ones I
> > fixed, in a periodically executing code path.  I had suspected
> > that this would be the case when I made the fix, since it doesn't
> > account for the buggy behaviour I'm personally seeing.  8-(.
> > 
> I have to chime in here -- some of the "fixes" are work-arounds, and
> there are still underlying VM problems.  It might be "good enough"
> for 3.0, but I would suggest preparing for some rework to find the
> root cause for the problem.

Can you identify which of the "better fixes" are workarounds?

The two fixes I have done, and now have enough confidence in to
want them committed, are:


o	The "valid = 0 at wrong time" that you told me about.

o	The "setting the recorded size of a backing object to
	a page boundary instead of to the actual size".

You could argue that the second fix, which promiscuously sets the
vnode object size after instancing the object, is a workaround
which should be repaired by adding a "real_size" parameter to the
allocator, but the fact is that the setsize code path is not a
problem at the only time when it is called (i.e., it can't be called
at interrupt level as a result of a disk I/O completion interrupt);
so the window I noted has been analyzed, and is not there.  The code
is ugly, but it does the intended job, without side effects.
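
A minimal sketch of the distinction, for illustration only.  All names
here (struct backing_object, object_allocate, object_setsize,
object_allocate_sized) are hypothetical stand-ins and not the actual
FreeBSD VM code; the 4096-byte page size is likewise an assumption.  It
shows an allocator that records a page-rounded size, the "set the size
after instancing" repair, and the alternative of handing the allocator
a real size:

```c
#include <stddef.h>

#define PAGE_SIZE 4096u                      /* assumed page size */
#define PAGE_MASK ((size_t)PAGE_SIZE - 1)

/* Hypothetical stand-in for a vnode-backed object. */
struct backing_object {
    size_t size;                             /* recorded size, in bytes */
};

/* Round a byte count up to the next page boundary. */
static size_t round_to_page(size_t nbytes)
{
    return (nbytes + PAGE_MASK) & ~PAGE_MASK;
}

/* Allocator that records only a page-rounded size (the problem case). */
static void object_allocate(struct backing_object *obj, size_t nbytes)
{
    obj->size = round_to_page(nbytes);
}

/* Repair as described: overwrite the recorded size after instancing. */
static void object_setsize(struct backing_object *obj, size_t real_size)
{
    obj->size = real_size;
}

/* Alternative: pass the real size to the allocator directly. */
static void object_allocate_sized(struct backing_object *obj, size_t real_size)
{
    obj->size = real_size;
}
```

With a 10000-byte file, object_allocate() records 12288, while either
object_setsize() afterwards or object_allocate_sized() records 10000.
The two approaches differ only in whether there is a window between
allocation and the size being corrected.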


The other "fix", the "back up one", is indeed a kludge that happens
to work for some cases, but I would not want that one committed (I
explicitly posted that it should be tried as a diagnostic).

The only other changes packaged with the two real changes above are
panics in the diagnostic case, which are basically "asserts" that
map contents aren't being stomped on at page insertion, and lock
acquisition logging that was arguably missing in error.  I haven't
been able to get anyone to run with the "DIAGNOSTIC" flag to test
the first, or with "MAP_LOCK_DIAGNOSTIC" to test the second (but they
run without error here, where I can't trigger the failures at will).
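
For readers unfamiliar with the pattern, a diagnostic-only check of
this kind can be sketched roughly as follows.  This is not the patch
itself: struct page_map, map_insert_page, MAP_SLOTS and the userland
panic() below are hypothetical, and DIAGNOSTIC is defined in the
source only so the example is self-contained (in a kernel build it
would come from the configuration).

```c
#include <stdio.h>
#include <stdlib.h>

#define DIAGNOSTIC 1          /* assumed to be set by the build config */
#define MAP_SLOTS  8          /* hypothetical fixed-size map */

/* Hypothetical map of page slots; NULL means the slot is empty. */
struct page_map {
    void *slot[MAP_SLOTS];
};

/* Userland stand-in for a kernel panic: report and abort. */
static void panic(const char *msg)
{
    fprintf(stderr, "panic: %s\n", msg);
    abort();
}

/*
 * Insert a page at idx (caller guarantees idx < MAP_SLOTS).
 * With DIAGNOSTIC defined, refuse to silently overwrite an
 * existing entry; without it, the check compiles away.
 */
static void map_insert_page(struct page_map *m, size_t idx, void *page)
{
#ifdef DIAGNOSTIC
    if (m->slot[idx] != NULL)
        panic("map_insert_page: slot already occupied");
#endif
    m->slot[idx] = page;
}
```

Inserting into an empty slot proceeds normally; a second insertion at
the same index aborts under DIAGNOSTIC instead of stomping the entry.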


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
