From owner-freebsd-questions Sat Oct 27 12:27:51 2001
Delivered-To: freebsd-questions@freebsd.org
Received: from chmls05.mediaone.net (chmls05.mediaone.net [24.147.1.143])
	by hub.freebsd.org (Postfix) with ESMTP id 9092437B408;
	Sat, 27 Oct 2001 12:27:31 -0700 (PDT)
Received: from aasp.net (h00045ad41936.ne.mediaone.net [24.60.36.208])
	by chmls05.mediaone.net (8.11.1/8.11.1) with ESMTP id f9RJRLN10883;
	Sat, 27 Oct 2001 15:27:21 -0400 (EDT)
Message-ID: <3BDB09F8.F6E0D3A0@aasp.net>
Date: Sat, 27 Oct 2001 15:24:40 -0400
From: Jules Gilbert
X-Mailer: Mozilla 4.77 [en]C-CCK-MCD NSCPCD477 (WinNT; U)
X-Accept-Language: en
MIME-Version: 1.0
To: freebsd-questions@freebsd.org
Cc: pg@eth1.com, wfaxon@gis.net, david@catwhisker.org, green@freebsd.org,
	mckusick@mckusick.com
Subject: panic: bqrelse: multiple ref .. thought fixed a year ago
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Hello folks:

We are having a big problem which is interfering with a whole lot of
things.

We are running FreeBSD 4.3 and we are seeing the infamous "bqrelse:
multiple refs" problem. The panic, then the dump, syncing disks...

We thought this was fixed over a year ago in vfs_bio.c.

This problem occurred most recently when I exited a remote ssh session.
The exit took several seconds, which made me believe something was
wrong. I then logged back in, and sure enough, we had a 'sh.core' dump
file (of zero size) and my running jobs had died. (The machine dumped.)

Later, I could not get in at all (of course the machine was dead at
that point), and the other machines doing NFS writes failed as well.

NFS structure IS:

This machine with this problem, call it PRIME1, NFS serves 6 other
FBSD4.3 machines as clients.
They ALL mount PRIME1's /mnt/public and all 6 write into this directory
with their own files.

So, this morning, searching the net, we found several references to
"bqrelse" but none of the references seemed to assert that the fix was
such-and-such.

Does a fix exist?

By the way, I maintain multiple FreeBSD boxes and do lots of NFS
activity, in addition to my occasional SSH login. I am willing to make
queues larger, change parameters, or whatever else it takes to make
this work.

Pls help us.

===================================================================

Our search results netted the following from July 2000:

Search Result 1

From: Kirk McKusick (mckusick@mckusick.com)
Subject: Re: Panic: bqrelse: multiple refs
Newsgroups: mailing.freebsd.current
Date: 2000/07/26

Date: Tue, 25 Jul 2000 11:47:03 -0400 (EDT)
From: Brian Fundakowski Feldman
To: Ollivier Robert
Cc: "FreeBSD Current Users' list", mckusick@mckusick.com
Subject: Panic: lockmgr: pid 5, not exclusive lock holder 0 unlocking
In-Reply-To: <20000725170455.F636@caerdonn.eurocontrol.fr>

On Tue, 25 Jul 2000, Ollivier Robert wrote:

> According to Brian Fundakowski Feldman:
> > Actually, I'm pretty certain this is the fix:
>
> Well it won't panic but isn't it putting the problem under the carpet?
> I agree the panic seems to be here temporarily but...

No, I'm really certain this isn't the case. You see, struct buf has a
b_lock that until recently was a plain, exclusive lockmgr lock. In
Kirk's last round of changes, he converted b_lock to be LK_CANRECURSE,
which means that the lock, while still an exclusive lock, may be
relocked multiple times by the same caller.

The panics are plain wrong. What's left is to determine what is the
proper thing to do in each of these cases, which I'm certain many
people already know (you see, I'm still a bit green ;).
What I am _almost_ sure about is that the right thing is just to remove
one of the locks and let it get freed back up the call chain. I'm
almost certain this is the case because if you are grabbing exclusive
locks and recursing upon them, your call chain is the only consumer,
and in a recursive-locking call chain you will have multiple symmetric
lock and unlock pairs.

Anything else horribly complicates things, and this makes me a good 95%
certain that this is exactly the right fix, not that it's sweeping any
true bugs under the carpet. Allowing recursive locks is pretty much the
only way to solve many of the problems here, because it's simply not
possible to support all code paths without allowing for this recursion.
The code would either be horribly complicated or non-functional.

I'm certain Kirk may be able to back me up here. It seems that the
cleanup is meant to make the locks recursive mostly to facilitate
correct/proper call chains, and that's consistent with my understanding
at least :) Indeed, if you look at the comment in brelse() from the
delta, you will see that allowing this very situation to occur, and
simply calling BUF_UNLOCK(), was planned for, and that the panic()s
were for debugging during the previous period, when b_locks weren't
LK_CANRECURSE.

As always, take what I say with a grain of salt since I'm definitely
not a VFS guru in any manner; I just happen to think I understand this
one :)

> --
> Ollivier ROBERT -=- Eurocontrol EEC/ITM -=- Ollivier.Robert@eurocontrol.fr
> The Postman hits! The Postman hits! You have new mail.

--
 Brian Fundakowski Feldman           \  FreeBSD: The Power to Serve!  /
 green@FreeBSD.org                    `------------------------------'

The above explanation is correct. When I made the change to allow
recursive buffer locks, I should have removed that panic (but forgot
that I had put it in there, sigh). I have just made the change on
freefall. Sorry for the problems caused by that change.
	Kirk McKusick

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-questions" in the body of the message
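
The recursive-lock behaviour described in the quoted message can be
sketched in userland C. This is only a minimal model under stated
assumptions: struct rlock, rlock_acquire(), rlock_release() and
nested_call_chain() are hypothetical names invented for illustration.
They are not the kernel's lockmgr interface and not the actual
vfs_bio.c code. The sketch shows an exclusive lock that the same holder
may take more than once, and a release path that drops one level per
call instead of treating the extra reference as an error.

```c
#include <assert.h>

/* Hypothetical userland model; NOT the kernel's struct lock or lockmgr(). */
#define NO_OWNER (-1)

struct rlock {
    int owner;  /* id of the current holder, or NO_OWNER */
    int depth;  /* how many times the holder has taken the lock */
};

void rlock_init(struct rlock *l)
{
    l->owner = NO_OWNER;
    l->depth = 0;
}

/* Returns 1 on success, 0 if a different owner already holds the lock. */
int rlock_acquire(struct rlock *l, int who)
{
    if (l->owner == NO_OWNER) {
        l->owner = who;
        l->depth = 1;
        return 1;
    }
    if (l->owner == who) {
        /* Still exclusive, but the same holder may recurse. */
        l->depth++;
        return 1;
    }
    return 0;
}

/*
 * Drops one level. Returns the remaining depth, or -1 if the caller is
 * not the holder. A depth above zero is not an error: the rest is
 * released further up the call chain.
 */
int rlock_release(struct rlock *l, int who)
{
    if (l->owner != who || l->depth == 0)
        return -1;
    if (--l->depth == 0)
        l->owner = NO_OWNER;
    return l->depth;
}

/* Models a call chain with symmetric lock/unlock pairs. */
int nested_call_chain(void)
{
    struct rlock l;
    int who = 7;                       /* arbitrary holder id */

    rlock_init(&l);
    assert(rlock_acquire(&l, who));    /* outer function locks */
    assert(rlock_acquire(&l, who));    /* inner function locks again */
    assert(rlock_release(&l, who) == 1); /* inner unlock: still held */
    assert(l.owner == who);
    assert(rlock_release(&l, who) == 0); /* outer unlock: now free */
    return l.owner == NO_OWNER;        /* 1 if the lock ended up free */
}
```

In this model the inner release leaves the lock held at depth 1, which
is the case the old panic would have flagged; the outer release then
frees it.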