Date:      Thu, 6 Mar 1997 17:15:18 -0500 (EST)
From:      Thomas David Rivers <ponds!rivers@dg-rtp.dg.com>
To:        ponds!lakes.water.net!rivers, ponds!lambert.org!terry
Cc:        ponds!freebsd.org!hackers
Subject:   Re: "dup alloc" - nope - kern/2875 wasn't it.
Message-ID:  <199703062215.RAA08636@lakes.water.net>

> 
> > > I guess it would be worth while to take out the printf's until you can
> > > isolate the printf's that "fix" the problem.  Then analyze the effects of
> > > the printfs serializing writes.
> > 
> >  My thinking exactly - I've now gone back to just a pristine kernel and
> > I'm trying to find a missing splbio()/splx(), or something along those
> > lines... so far, no luck...
> 
> 
> I am, of course, unable to duplicate your panics.

 If you have a spare disk lying around, you can try it there; others have
demonstrated it with MFS as well, so you may be able to reproduce it that way.

	trash the disk (e.g. copy a large file, as large as the partition,
		to the partition - or write a program that simply
		writes n 0xff's...)
	newfs the disk
	fsck the disk

 If you get any fsck errors; you've run into the problem...
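
 The "write a program" option above could be sketched roughly as below.
This is only an illustration: the function name fill_ff() and the 8192-byte
chunk size are assumptions, not anything from the original report. On a raw
device the total size would probably need to be a multiple of the sector
size.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write n bytes of 0xff to path, overwriting whatever is there.
 * Returns 0 on success, -1 on any open, write or close error.
 * (Hypothetical helper; name and chunk size are illustrative.)
 */
int
fill_ff(const char *path, unsigned long n)
{
	unsigned char buf[8192];
	unsigned long left = n;
	FILE *fp;

	memset(buf, 0xff, sizeof(buf));

	fp = fopen(path, "wb");
	if (fp == NULL)
		return (-1);

	while (left > 0) {
		/* Write a full chunk, or whatever remains at the end. */
		size_t chunk = left < sizeof(buf) ? (size_t)left : sizeof(buf);

		if (fwrite(buf, 1, chunk, fp) != chunk) {
			fclose(fp);
			return (-1);
		}
		left -= chunk;
	}

	/* fclose() flushes; report a failed flush as an error too. */
	return (fclose(fp) == 0 ? 0 : -1);
}
```

 Calling fill_ff() from a small main() with the partition's device node and
the partition size in bytes would cover the "trash the disk" step; newfs and
fsck are then run by hand as listed above.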

 But - it appears to be extremely timing dependent! (As you point out.)


> 
> I suggest you buckle down and do it the hard way; I'd help if I could
> duplicate the problem, or if my changes would not be seen as gratuitous,
> but I can't.  Without a problem fix resulting, there's no way I can
> prove that eliminating all possible race conditions is a Good Thing(tm)
> to those people who aren't getting bitten.

 Well, it is difficult to suggest to people that "oh yes, that system
that's been running fine for over a year does, in fact, have a bug
in it; you've just been lucky..."  I have a certain empathy for
that; especially when I was the only person in-the-entire-world
reporting the problem.  It's very easy to dismiss me as a nut with
bad hardware :-)   Now that other people have reported it; I'm hoping
to get more.  [I should quickly add here that I'm delighted with,
and grateful for, the response I have gotten, and I'm not complaining,
I'm just saying I could be easily seen as a "nut"...]

> 
> Here is what I suggest; effectively, you will be required to perform
> a full branch-path analysis of much of the code, by hand.  If you
> have a copy of BattleMap, you could use it some places, but since
> most kernel routines are not single-entry/single-exit, I would not
> recommend spending the $4000 or so for the software just for this
> problem, since it won't help much.

 Wow!  I was hoping not to have to do that for all (well most)
of the kernel.... 

 My approach will likely be to try and find items that appear to
cause a difference here; finding several such changes could help
triangulate on the problem...

 That is - If I change "this" the problem goes away, if I change "that"
the problem goes away; now what's common in the effects of "this" and
"that."

 Unfortunately, my changes thus far that actually affect the problem
are my printf()s to determine what the problem is; and the only
common effect is that they (presumably) alter timings in such a
way as to avoid the problem... not very useful.

 I'm trying to read through some of the code now, looking
for mis-matched splbio()/splx().  Or, something like that... I'm
just not (yet) educated enough to catch everything.  I've also noted
that some of these have been corrected in 2.2-GAMMA (i.e. vfs_subr.c
has splbio()/splx()'s in 2.2-GAMMA that it doesn't have in 2.1.6.1)
I'm guessing now that a missing one of these is the culprit....
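
 For illustration, a mismatched pair of the kind being searched for might
look like the userland sketch below. Everything here is a stand-in: the
splbio()/splx() bodies only mimic the usual "raise the level, return the old
one" / "restore the saved level" convention, and struct fakebuf, SPL_BIO,
grab_buggy(), grab_fixed() and spl_level() are hypothetical names. It is not
taken from the FreeBSD sources and is not a claim about where the actual bug
is.

```c
#include <stdio.h>

#define SPL_BIO	1		/* hypothetical "block I/O" level */

static int cur_spl = 0;		/* simulated current priority level */

/* Stand-in: raise to SPL_BIO and return the previous level. */
int
splbio(void)
{
	int old = cur_spl;

	if (cur_spl < SPL_BIO)
		cur_spl = SPL_BIO;
	return (old);
}

/* Stand-in: restore a level previously returned by splbio(). */
void
splx(int s)
{
	cur_spl = s;
}

/* Helper so callers can inspect the simulated level. */
int
spl_level(void)
{
	return (cur_spl);
}

struct fakebuf {
	int busy;		/* stand-in for a buffer's busy flag */
};

/* Mismatched: the early return leaves the level raised. */
int
grab_buggy(struct fakebuf *bp)
{
	int s = splbio();

	if (bp->busy)
		return (-1);	/* BUG: no splx(s) on this path */
	bp->busy = 1;
	splx(s);
	return (0);
}

/* Matched: every path out of the function goes through splx(). */
int
grab_fixed(struct fakebuf *bp)
{
	int s = splbio();
	int error = 0;

	if (bp->busy)
		error = -1;
	else
		bp->busy = 1;
	splx(s);
	return (error);
}
```

 The thing to look for when reading is the shape of grab_buggy(): a return
(or goto) between the splbio() and its splx(). The opposite case, touching
the buffer with no splbio() around it at all, has no such visual cue, which
is part of why it is harder to spot.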

 If someone were to detail exactly when you can futz with a
struct buf without being at splbio() it would help my reading....


	- Dave R. -





