From: John Baldwin <jhb@freebsd.org>
To: freebsd-arch@freebsd.org
Date: Thu, 3 Jan 2008 11:05:50 -0500
User-Agent: KMail/1.9.6
References: <18378.1196596684@critter.freebsd.dk>
	<200712271805.40972.jhb@freebsd.org> <477C1604.2030905@freebsd.org>
In-Reply-To: <477C1604.2030905@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200801031105.52354.jhb@freebsd.org>
Cc: Attilio Rao <attilio@freebsd.org>, arch@freebsd.org,
	Poul-Henning Kamp <phk@phk.freebsd.dk>,
	Andre Oppermann <andre@freebsd.org>, Robert Watson <rwatson@freebsd.org>
Subject: Re: New "timeout" api, to replace callout

On Wednesday 02 January 2008 05:53:56 pm Andre Oppermann wrote:
> John Baldwin wrote:
> > On Sunday 02 December 2007 07:53:18 am Andre Oppermann wrote:
> >> Poul-Henning Kamp wrote:
> >>> In message <4752998A.9030007@freebsd.org>, Andre Oppermann writes:
> >>>>  o TCP puts the timer into an allocated structure and upon close of the
> >>>>    session it has to be deallocated including stopping of all currently
> >>>>    running timers.
> >>>>    [...]
> >>>>     -> The timer facility should provide an atomic stop/remove call
> >>>>        that prevents any further callbacks upon return.  It should not
> >>>>        do a 'drain' where the callback may be run anyway.
> >>>>        Note: We hold the lock the callback would have to obtain.
> >>> It is my intent, that the implementation behind the new API will
> >>> only ever grab the specified lock when it calls the timeout function.
> >> This is the same for the current one and pretty much a given.
> >>
> >>> When you do a timeout_disable() or timeout_cleanup() you will be
> >>> sleeping on a mutex internal to the implementation, if the timeout
> >>> is currently executing.
> >> This is the problematic part.  We can't sleep in TCP when cleaning up
> >> the timer.  We're not always called from userland but from interrupt
> >> context.  And when calling the cleanup we currently hold the lock the
> >> callout wants to obtain.  We can't drop it either as the race would
> >> be back again.  What you describe here is the equivalent of
> >> callout_drain().  This is unfortunately unworkable in TCP's context.  The
> >> callout has to go away even if it is already pending and waiting on
> >> the lock.  Maybe that can only be solved by a flag in the lock saying
> >> "give up and go away".
> > 
> > The reason you need to do a drain is to allow for safe destroying of the
> > lock.  Specifically, drivers tend to do this:
> > 
> > 	FOO_LOCK(sc);
> > 	...
> > 	callout_stop(...);
> > 	FOO_UNLOCK(sc);
> > 	...
> > 	callout_drain(...);
> > 	...
> > 	mtx_destroy(&sc->foo_mtx);
> > 
> > If you don't have the drain and softclock is trying to acquire the backing
> > mutex while you have it held (before the callout_stop) then Bad Things can
> > happen if you don't do the drain.  Having the lock just "give up" doesn't
> > work either because if the memory containing the lock is free'd and
> > reinitialized such that it looks enough like a valid lock then softclock (or
> > its equivalent) will still try to obtain it.  Also, you need to do a drain so
> > it is safe to free the callout structure to prevent it from being recycled
> > and having weird races where it gets recycled and rescheduled but the timer
> > code thinks it has a pending stop for that pointer and so it aborts the wrong
> > instance of the timer, etc.
> 
> This is all well known.  ;)  What isn't known is that this (the
> sleep part) is a major problem for TCP due to being run from
> interrupt context.  Hence the request for some kind of busy-drain
> or other method to prevent the sleep.  A second, less severe problem
> is races while the lock is dropped during the sleep.  Here some
> other part of TCP may go into the tcpcb scheduled for destruction.

My point is that there isn't really a good way to fix this that doesn't
involve sleeping.  If you just spin, you may spin forever: netisr has a higher
priority than softclock (IIRC), so the softclock thread you are waiting on may
never get to run.  One option is to not destroy pcbs directly in interrupt
context, but instead to queue them and let a task on a taskqueue finish the
destruction in a context where it can sleep if necessary.
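
A rough userland sketch of that last option, to make the split between the
two contexts concrete.  Everything here is hypothetical and single-threaded:
struct pcb, pcb_defer_destroy(), pcb_destroy_task() and timer_drain() are
made-up names, timer_drain() merely stands in for callout_drain(), and the
queue locking and the actual taskqueue scheduling a kernel would need are
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tcpcb with an embedded timer. */
struct pcb {
	struct pcb *next;	/* link on the deferred-destroy queue */
	bool timer_active;	/* stand-in for a pending/running callout */
};

static struct pcb *destroy_head;	/* FIFO of pcbs awaiting destruction */
static struct pcb **destroy_tail = &destroy_head;
static int pending, destroyed;

/* Allocate a pcb with its (simulated) timer armed. */
static struct pcb *
pcb_alloc(void)
{
	struct pcb *p = calloc(1, sizeof(*p));

	if (p != NULL)
		p->timer_active = true;
	return (p);
}

/*
 * "Interrupt context" side: never blocks, only queues the pcb.
 * Assumption: a real kernel would hold a lock around the queue
 * and then schedule the task that runs pcb_destroy_task().
 */
static void
pcb_defer_destroy(struct pcb *p)
{
	p->next = NULL;
	*destroy_tail = p;
	destroy_tail = &p->next;
	pending++;
}

/* Stand-in for callout_drain(): the only step that might sleep. */
static void
timer_drain(struct pcb *p)
{
	p->timer_active = false;
}

/* "Task" side: runs where sleeping is allowed, so draining is safe. */
static void
pcb_destroy_task(void)
{
	struct pcb *p;

	while ((p = destroy_head) != NULL) {
		destroy_head = p->next;
		if (destroy_head == NULL)
			destroy_tail = &destroy_head;
		timer_drain(p);
		assert(!p->timer_active);	/* timer is stopped before free */
		free(p);
		pending--;
		destroyed++;
	}
}
```

The point of the split is that pcb_defer_destroy() does nothing that can
block, so it is safe to call with the pcb lock held or from interrupt
context, while the drain and the free only ever happen in
pcb_destroy_task().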

-- 
John Baldwin