Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 20 Jun 2016 19:45:53 +0200
From:      Julien Charbon <jch@freebsd.org>
To:        Gleb Smirnoff <glebius@FreeBSD.org>
Cc:        rrs@FreeBSD.org, hselasky@FreeBSD.org, net@FreeBSD.org, current@FreeBSD.org
Subject:   Re: panic with tcp timers
Message-ID:  <74bb31b7-a9f5-3d0c-eea0-681872e6f09b@freebsd.org>
In-Reply-To: <20160620095822.GJ1076@FreeBSD.org>
References:  <20160617045319.GE1076@FreeBSD.org> <1f28844b-b4ea-b544-3892-811f2be327b9@freebsd.org> <20160620073917.GI1076@FreeBSD.org> <1d18d0e2-3e42-cb26-928c-2989d0751884@freebsd.org> <20160620095822.GJ1076@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help

 Hi,

On 6/20/16 11:58 AM, Gleb Smirnoff wrote:
> On Mon, Jun 20, 2016 at 11:55:55AM +0200, Julien Charbon wrote:
> J> > On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
> J> > J> > Comparing stable/10 and head, I see two changes that could
> J> > J> > affect that:
> J> > J> > 
> J> > J> > - callout_async_drain
> J> > J> > - switch to READ lock for inp info in tcp timers
> J> > J> > 
> J> > J> > That's why you are in To, Julien and Hans :)
> J> > J> > 
> J> > J> > We continue investigating, and I will keep you updated.
> J> > J> > However, any help is welcome. I can share cores.
> J> > 
> J> > Now, spending some time with cores and adding a bunch of
> J> > extra CTRs, I have a sequence of events that lead to the
> J> > panic. In short, the bug is in the callout system. It seems
> J> > to be not relevant to the callout_async_drain, at least for
> J> > now. The transition to READ lock unmasked the problem, that's
> J> > why NetflixBSD 10 doesn't panic.
> J> > 
> J> > The panic requires heavy contention on the TCP info lock.
> J> > 
> J> > [CPU 1] the callout fires, tcp_timer_keep entered
> J> > [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
> J> > [CPU 2] schedules the callout
> J> > [CPU 2] tcp_discardcb called
> J> > [CPU 2] callout successfully canceled
> J> > [CPU 2] tcpcb freed
> J> > [CPU 1] unblocks... panic
> J> > 
> J> > When the lock was WLOCK, all contenders were resumed in a
> J> > sequence they came to the lock. Now, that they are readers,
> J> > once the lock is released, readers are resumed in a "random"
> J> > order, and this allows tcp_discardcb to go before the old
> J> > running callout, and this unmasks the panic.
> J> 
> J>  Highly interesting.  I should be able to reproduce that (will be useful
> J> for testing the corresponding fix).
> J> 
> J>  Fix proposal:  If callout_async_drain() returns 0 (fail) (instead of 1
> J> (success) here) when the callout cancellation is a success _but_ the
> J> callout is current running, that should fix it.
> J> 
> J>  For the history:  It comes back to my old callout question:
> J> 
> J>  Does _callout_stop_safe() is allowed to return 1 (success) even if the
> J> callout is still currently running;  a.k.a. it is not because you
> J> successfully cancelled a callout that the callout is not currently running.
> J> 
> J>  We did propose a patch to make _callout_stop_safe() returns 0 (fail)
> J> when the callout is currently running:
> J> 
> J> callout_stop() should return 0 when the callout is currently being
> J> serviced and indeed unstoppable
> J> https://reviews.freebsd.org/differential/changeset/?ref=62513&whitespace=ignore-most
> J> 
> J>  But this change impacted too many old code paths and was interesting
> J> only for TCP timers and thus was abandoned.
> 
> The fix I am working on now is doing exactly that. callout_reset must
> return 0 if the callout is currently running.
>
> What are the old paths impacted?

 Actually all the paths that check the callout_stop() return value AND
call both callout_reset() and callout_stop() AND use mpsafe callout().
And for each path, we would have to check our patch was ok (or not).

 Thus, if you only do the change in callout_async_drain() context, you
don't impact the "old" callout_stop() behavior and get the desired
behavior for the TCP timers.  Might be a good trade-off...

 My 2 cents.

--
Julien



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?74bb31b7-a9f5-3d0c-eea0-681872e6f09b>