Date: Sun, 26 Dec 2004 20:37:45 -0500
From: Bosko Milekic
To: Peter Holm
cc: current@freebsd.org
Subject: Re: panic: uma_zone_slab is looping
Message-ID: <20041227013745.GA5267@technokratis.com>

On Sun, Dec 26, 2004 at 11:56:51PM +0100, Peter Holm wrote:
> On Sun, Dec 26, 2004 at 01:17:38PM -0500, Bosko Milekic
wrote:
> > On Sun, Dec 26, 2004 at 05:11:53PM +0100, Peter Holm wrote:
> > >
> > > Yes, I think that I have verified your excellent analysis of the
> > > problem: http://www.holm.cc/stress/log/freeze04.html
> > >
> > > So, do you have any fix suggestions? :-)
> >
> > Not yet, because the problem is non-obvious from the trace.
> >
> > I need to know exactly when the UMA RCntSlabs zone recurses _first_,
> > and I need to confirm that it is an actual recursion.  I've looked at
> > the VM code and I don't see how/why recursion on the RCntSlabs zone
> > would happen.
> >
> > Please modify the printf code to look exactly like this:
> >
> >     if (keg->uk_flags & UMA_ZFLAG_INTERNAL && keg->uk_recurse != 0) {
> >         if ((zone == slabzone) || (zone == slabrefzone))
> >             panic("Zone %s forced to fail due to recurse non-null: %d\n",
> >                 zone->uz_name, keg->uk_recurse);
> >         return (NULL);
> >     }
> >
> > (You don't need to check any global counter -- the counter is imperfect
> > anyway -- because even a single recursion on slabzone or slabrefzone
> > should be illegal).
> >
> > I'd like to see the trace from the above panic, if possible.
>
> Here it is: http://www.holm.cc/stress/log/freeze05.html

I have checked the code here and looked at the possible code paths.
Unfortunately I have had to resort to guessing again, but I now believe
I have identified a problematic scenario.

Consider this particular timeline (time moves downward):
[I hope you can handle ASCII art]

By the way, the stack trace you show would correspond to that of
thread 2.  I refer to a frame number below.

thread 1 (t1)                        thread 2 (t2)
--------------------------------------------------------------------------
t1.a) Allocating from a zone, needs
      slab header from one of the
      slab header zones (either
      slabzone or slabrefzone).
      Let's assume it is slabzone,
      as in your trace above.  The
      allocation is performed with
      M_WAITOK.
                                     t2.a) Needs to allocate from a zone,
                                           and it needs a slab header too.
                                           The allocation will be
                                           performed with M_WAITOK.  Let's
                                           assume that the slab header
                                           zone we're allocating from is
                                           also slabzone.
t1.b) In uma_zone_slab(), has
      slabzone's keg lock,
      increments keg's uk_recurse.
      Enters slab_zalloc().
                                     t2.b) Blocks on zone lock.
t1.c) Drops zone lock to allocate
      from VM; uk_recurse for the
      slabzone is currently 1 (we
      incremented it in t1.b).
                                     t2.c) Takes zone lock for slabzone,
                                           now in uma_zone_slab()
                                           (Frame 11), and since
                                           uk_recurse is 1, it decides
                                           recursion happened.  Wants to
                                           return NULL even though the
                                           allocation was done with
                                           M_WAITOK.  Our panic is
                                           triggered.

I'll have to reserve some more time to think about this.  One way I
think it might be solvable would be to change the check that triggers
the NULL return so that it explicitly checks for the bucketzone, and
not for all UMA_ZFLAG_INTERNAL zones; I need to think this through a
little more.

Does the scenario seem likely to you?

Cheers,
-- 
Bosko Milekic
bmilekic@technokratis.com
bmilekic@FreeBSD.org
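
To make the timeline easier to reason about, here is a minimal
user-space sketch of it in C.  It is only an illustration under stated
assumptions, not the UMA code: struct toy_keg, the toy_* functions and
the enum are all hypothetical stand-ins, the two threads are replayed
sequentially rather than run concurrently, and the "narrowed" check is
just one possible reading of the bucketzone idea above, which has not
been thought through or tested.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are NOT the real UMA types or flags. */
enum toy_zone_kind { TOY_SLABZONE, TOY_SLABREFZONE, TOY_BUCKETZONE };

struct toy_keg {
	enum toy_zone_kind kind;
	bool internal;	/* stands in for UMA_ZFLAG_INTERNAL */
	int recurse;	/* stands in for uk_recurse */
};

/* The check as quoted: any internal zone with a non-zero counter fails. */
bool
toy_fails_current(const struct toy_keg *keg)
{
	return (keg->internal && keg->recurse != 0);
}

/* Assumed narrowing: only the bucket zone is failed on a non-zero counter. */
bool
toy_fails_narrowed(const struct toy_keg *keg)
{
	return (keg->kind == TOY_BUCKETZONE && keg->recurse != 0);
}

/* t1.b: thread 1 bumps the counter before entering the slab allocation. */
void
toy_enter(struct toy_keg *keg)
{
	keg->recurse++;
}

/* Thread 1 is back from the VM allocation and undoes its increment. */
void
toy_leave(struct toy_keg *keg)
{
	assert(keg->recurse > 0);
	keg->recurse--;
}

/*
 * Replay t1.b, t1.c and t2.c in order on a slabzone-like keg and report
 * whether thread 2 would be told to fail.  The counter is per keg, not
 * per thread, so thread 2 sees thread 1's increment.
 */
bool
toy_replay_t2_fails(bool (*check)(const struct toy_keg *))
{
	struct toy_keg slab = { TOY_SLABZONE, true, 0 };
	bool t2_fails;

	toy_enter(&slab);		/* t1.b */
	/* t1.c: t1 drops the lock here; t2.c: t2 takes it and checks. */
	t2_fails = check(&slab);
	toy_leave(&slab);		/* t1 finishes later */
	return (t2_fails);
}
```

In this sketch toy_replay_t2_fails(toy_fails_current) reports a failure
for thread 2 although no thread ever recursed, while
toy_replay_t2_fails(toy_fails_narrowed) does not.  Whether narrowing the
check is safe in the real allocator is exactly the part that still needs
thought.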