Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 19 Mar 1999 23:30:44 -0600
From:      Russell Neeper <r-neeper@tamu.edu>
To:        Greg Lehey <grog@lemis.com>
Cc:        freebsd-current@FreeBSD.ORG
Subject:   Re: Help! (was: Crash while newfs'ing innocent vinum volume on fresh system.)
Message-ID:  <19990319233044.A10397@net.tamu.edu>
In-Reply-To: <19990318133123.S429@lemis.com>; from Greg Lehey on Thu, Mar 18, 1999 at 01:31:23PM %2B1030
References:  <19990317142107.A78437@matti.ee> <19990318133123.S429@lemis.com>

next in thread | previous in thread | raw e-mail | index | archive | help
I decided to give vinum a try a few days ago and ran into the same
problem as Vallo - after using it for a short period of time it caused
a kernel panic due to a page fault.

I spent some time with kgdb today and believe that I have found the bug.

On Thu, Mar 18, 1999 at 01:31:23PM +1030, Greg Lehey wrote:
> This is a problem I've seen before, but it completely baffles me.  The
> request passed to launch_requests (frame 10) has been deallocated.
> Some of the debug code I put in caught it:
> 
> (kgdb) p freeinfo[7]
> $2 = {
>   time = {
>     tv_sec = 921669613, 
>     tv_usec = 289712
>   }, 
>   seq = 24, 
>   size = 36, 
>   line = 174, 
>   address = 0xf0a3cb00 "ÞÀ­Þh\235\"ðÞÀ­ÞÀÈ£ðÞÀ­ÞÞÀ­ÞÞÀ­ÞÞÀ­ÞÞÀ­ÞÞÀ­ÞÞÀ­ÞÞÀ­ÞÞÀ­ÞÞÀ­ÞÞÀ­ÞÞÀ­Þ", 
>   file = "vinuminterrupt.c"
> }
> 
> This was called from freerq, which frees the complete request.  freerq
> is called from only four places: one on completion of the request
> (which in this case is just about to be started), one if the request
> is aborted (which also sets bp->b_error, which is not set here), once
> in a read context (which is not applicable here: it's a write), and
> once just before the call to launch_requests in frame 11:

The best that I can tell, the problem is with the first call that you
listed: "on completion of the request".  The function 'complete_rqe' is
called asynchronously by an interrupt at the completion of the I/O
request.

> So where is this coming from?  I'm completely baffled.  It doesn't
> happen to most people, though I have had reports of one or two other
> cases.  About the only clue is that the problem didn't occur when I
> removed the debug memory allocator, but I don't know whether it went
> away or into hiding.  I'd really like to find out what's going on
> here.

I think that removing the debug memory allocator just made it go into
hiding because it changed the timing of the code.  Freeing the request
structure in the interrupt routine is causing a race condition in the
function 'launch_requests'.  Interrupts must be disabled around any and
all code which refers to the request chain and this wasn't being done.  I
have created a patch that seems to fix the problem.  However, there could
be other places in the code that refers to the request chain without
disabling interrupts.  After looking at it for only a few hours, I'm not
familiar enough with it to tell.  Hope this helps.

Here's the patch:

diff -u vinum/vinumrequest.c vinum-mod/vinumrequest.c
--- vinum/vinumrequest.c        Thu Mar 18 20:21:46 1999
+++ vinum-mod/vinumrequest.c    Fri Mar 19 22:55:49 1999
@@ -258,13 +258,8 @@
            biodone(bp);
            freerq(rq);
            return -1;
-       } {                                                 /* XXX */
-           int result;
-           int s = splhigh();
-           result = launch_requests(rq, reviveok);         /* now start the requests if we can */
-           splx(s);
-           return result;
        }
+       return launch_requests(rq, reviveok);       /* now start the requests if we can */
     } else
        /*
         * This is a write operation.  We write to all
@@ -366,6 +361,7 @@
     if (debug & DEBUG_LASTREQS)
        logrq(loginfo_user_bpl, rq->bp, rq->bp);
 #endif
+    s = splbio();
     for (rqg = rq->rqg; rqg != NULL; rqg = rqg->next) {            /* through the whole request chain */
        rqg->active = rqg->count;                           /* they're all active */
        rq->active++;                                       /* one more active request group */
@@ -396,13 +392,13 @@
                    logrq(loginfo_rqe, rqe, rq->bp);
 #endif
                /* fire off the request */
-               s = splbio();
                (*bdevsw[major(rqe->b.b_dev)]->d_strategy) (&rqe->b);
-               splx(s);
            }
            /* XXX Do we need caching?  Think about this more */
        }
     }
+    splx(s);
+
     return 0;
 }

I remove the splhigh/splx from around the first call of launch_requests
because, as far as I can tell, it became redundant after adding
splbio/splx around the for loop in the launch_requests function.

------
Russell Neeper                           Texas A&M University
Russell-Neeper@tamu.edu             Computing & Information Services
                                             Network Group


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?19990319233044.A10397>