From: Julian Elischer
Date: Mon, 17 Dec 2007 10:26:58 -0800
To: Maxime Henrion
Cc: Gleb Smirnoff, net@FreeBSD.org
Subject: Re: Deadlock in the routing code
Message-ID: <4766BF72.7000005@elischer.org>
In-Reply-To: <20071217101009.GL71713@elvis.mu.org>
References: <20071213133817.GC71713@elvis.mu.org>
 <47617AF5.7070701@elischer.org>
 <20071214092539.GB14339@glebius.int.ru>
 <4762DD82.9070904@elischer.org>
 <20071217101009.GL71713@elvis.mu.org>
List-Id: Networking and TCP/IP with FreeBSD

Maxime Henrion wrote:
> Julian Elischer wrote:
>> Gleb Smirnoff wrote:
>>> On Thu, Dec 13, 2007 at 10:33:25AM -0800, Julian Elischer wrote:
>>> J> Maxime Henrion wrote:
>>> J> > Replying to myself on this one, sorry about that.
>>> J> > I said in my previous mail that I didn't know yet what process was
>>> J> > holding the lock of the rtentry that the routed process is dealing
>>> J> > with in rt_setgate(), and I just could verify that it is held by
>>> J> > the swi1: net thread.
>>> J> > So, in a nutshell:
>>> J> > - The routed process does its business on the routing socket, that
>>> J> >   ends up calling rt_setgate(). While in rt_setgate() it drops the
>>> J> >   lock on its rtentry in order to call rtalloc1(). At this point,
>>> J> >   the routed process hold the gateway route (rtalloc1() returns it
>>> J> >   locked), and it now tries to re-lock the original rtentry.
>>> J> > - At the same time, the swi net thread calls arpresolve() which ends
>>> J> >   up calling rt_check(). Then rt_check() locks the rtentry, and
>>> J> >   tries to lock the gateway route.
>>> J> > A classical case of deadlock with mutexes because of different locking
>>> J> > order. Now, it's not obvious to me how to fix it :-).
>>> J>
>>> J> On failure to re-lock, the routed call to rt_setgate should completely
>>> J> abort and restart from scratch, releasing all locks it has on the way
>>> J> out.
>>>
>>> Do you suggest mtx_trylock?
>> I think that would be the cleanest way..
>
> So, here's what I've got. I have yet to test it at all, I hope that
> I'll be able to do so today, or tomorrow. Any input appreciated.
>
> Cheers,
> Maxime
>

I think (from memory) this code is only called from user context, right?
If so, on failure to lock it could delay for 1 tick or so before
retrying. (I don't have the code in front of me right now.)

Otherwise I think that might do the job. More comments later.
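
To make the trylock-and-restart idea from this thread concrete, here is a
minimal userland sketch. It is not the patch under discussion and not
kernel code: it assumes POSIX pthread mutexes as a stand-in for the
kernel mutexes, uses sched_yield() as a rough stand-in for the "delay for
1 tick" suggestion, and every name in it (struct entry, set_gateway(),
check_route(), run_demo()) is hypothetical. One thread takes the gateway
lock and then only *tries* the original entry's lock, dropping everything
and starting over on failure; the other thread takes the two locks in the
opposite order, blocking, as rt_check() is described as doing.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for a route entry: a lock plus a counter. */
struct entry {
	pthread_mutex_t	lock;
	long		touched;	/* protected by lock */
};

struct pair {
	struct entry	rt, gw;		/* original entry and gateway entry */
	int		iterations;
};

/*
 * Models the rt_setgate() side after it has dropped the rt lock:
 * take gw first, then try rt without blocking. On failure, release
 * gw and restart from scratch so no lock is held while waiting.
 */
static void
set_gateway(struct entry *rt, struct entry *gw)
{
	for (;;) {
		pthread_mutex_lock(&gw->lock);
		if (pthread_mutex_trylock(&rt->lock) == 0)
			break;			/* both locks held */
		pthread_mutex_unlock(&gw->lock);
		sched_yield();			/* assumed back-off */
	}
	rt->touched++;
	gw->touched++;
	pthread_mutex_unlock(&rt->lock);
	pthread_mutex_unlock(&gw->lock);
}

/* Models the rt_check() side: rt first, then gw, both blocking. */
static void
check_route(struct entry *rt, struct entry *gw)
{
	pthread_mutex_lock(&rt->lock);
	pthread_mutex_lock(&gw->lock);
	rt->touched++;
	gw->touched++;
	pthread_mutex_unlock(&gw->lock);
	pthread_mutex_unlock(&rt->lock);
}

static void *
setter_thread(void *arg)
{
	struct pair *p = arg;
	int i;

	for (i = 0; i < p->iterations; i++)
		set_gateway(&p->rt, &p->gw);
	return (NULL);
}

static void *
checker_thread(void *arg)
{
	struct pair *p = arg;
	int i;

	for (i = 0; i < p->iterations; i++)
		check_route(&p->rt, &p->gw);
	return (NULL);
}

/* Runs both threads to completion; returns 1 if every update landed. */
int
run_demo(int iterations)
{
	struct pair p;
	pthread_t setter, checker;
	int rc;

	pthread_mutex_init(&p.rt.lock, NULL);
	pthread_mutex_init(&p.gw.lock, NULL);
	p.rt.touched = p.gw.touched = 0;
	p.iterations = iterations;

	rc = pthread_create(&setter, NULL, setter_thread, &p);
	assert(rc == 0);
	rc = pthread_create(&checker, NULL, checker_thread, &p);
	assert(rc == 0);
	(void)rc;
	pthread_join(setter, NULL);
	pthread_join(checker, NULL);

	pthread_mutex_destroy(&p.rt.lock);
	pthread_mutex_destroy(&p.gw.lock);
	return (p.rt.touched == 2L * iterations &&
	    p.gw.touched == 2L * iterations);
}
```

Calling run_demo() with some iteration count from a small main() should
return rather than hang. Swapping the pthread_mutex_trylock() call for a
blocking pthread_mutex_lock() brings back the opposite lock order
described above, and the two threads can then deadlock.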