Date: Mon, 20 Jun 2016 11:55:55 +0200
From: Julien Charbon <jch@freebsd.org>
To: Gleb Smirnoff <glebius@FreeBSD.org>, rrs@FreeBSD.org
Cc: hselasky@FreeBSD.org, net@FreeBSD.org, current@FreeBSD.org
Subject: Re: panic with tcp timers
Message-ID: <1d18d0e2-3e42-cb26-928c-2989d0751884@freebsd.org>
In-Reply-To: <20160620073917.GI1076@FreeBSD.org>
References: <20160617045319.GE1076@FreeBSD.org> <1f28844b-b4ea-b544-3892-811f2be327b9@freebsd.org> <20160620073917.GI1076@FreeBSD.org>

 Hi,

On 6/20/16 9:39 AM, Gleb Smirnoff wrote:
> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
> J> > Comparing stable/10 and head, I see two changes that could
> J> > affect that:
> J> >
> J> > - callout_async_drain
> J> > - switch to READ lock for inp info in tcp timers
> J> >
> J> > That's why you are in To, Julien and Hans :)
> J> >
> J> > We continue investigating, and I will keep you updated.
> J> > However, any help is welcome. I can share cores.
>
> Now, spending some time with cores and adding a bunch of
> extra CTRs, I have a sequence of events that lead to the
> panic. In short, the bug is in the callout system. It seems
> to be not relevant to the callout_async_drain, at least for
> now. The transition to READ lock unmasked the problem, that's
> why NetflixBSD 10 doesn't panic.
>
> The panic requires heavy contention on the TCP info lock.
>
> [CPU 1] the callout fires, tcp_timer_keep entered
> [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
> [CPU 2] schedules the callout
> [CPU 2] tcp_discardcb called
> [CPU 2] callout successfully canceled
> [CPU 2] tcpcb freed
> [CPU 1] unblocks... panic
>
> When the lock was WLOCK, all contenders were resumed in a
> sequence they came to the lock. Now, that they are readers,
> once the lock is released, readers are resumed in a "random"
> order, and this allows tcp_discardcb to go before the old
> running callout, and this unmasks the panic.

 Highly interesting. I should be able to reproduce that (it will be
useful for testing the corresponding fix).

 Fix proposal: if callout_async_drain() returns 0 (fail), instead of
1 (success) as it does here, when the callout cancellation succeeds
_but_ the callout is currently running, that should fix it.

 For the history: it comes back to my old callout question:

 Is _callout_stop_safe() allowed to return 1 (success) even if the
callout is still currently running? In other words, successfully
cancelling a callout does not mean that the callout is not currently
running.

 We did propose a patch to make _callout_stop_safe() return 0 (fail)
when the callout is currently running:

callout_stop() should return 0 when the callout is currently being
serviced and indeed unstoppable
https://reviews.freebsd.org/differential/changeset/?ref=62513&whitespace=ignore-most

 But this change impacted too many old code paths and was interesting
only for TCP timers, and thus was abandoned.

 My 2 cents.

--
Julien
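
 To make the sequence above and the fix proposal concrete, here is a
minimal userspace sketch in C. It is only a model: every model_* name
is hypothetical, nothing here is the kernel callout(9) code, and the
rule "free the tcpcb as soon as the stop reports success, otherwise
leave the free to the running handler" is a simplifying assumption
about the teardown path. It replays the CPU 1 / CPU 2 steps once with
a stop that returns 1 while the handler is still running, and once
with the proposed stop that returns 0 in that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a callout; NOT the kernel's struct callout. */
struct model_callout {
    bool pending;   /* scheduled, handler not started yet */
    bool running;   /* handler entered, e.g. blocked on the info lock */
};

/* Hypothetical stand-in for a tcpcb owning one keepalive timer. */
struct model_tcpcb {
    struct model_callout keep;
    bool freed;
};

static void model_schedule(struct model_callout *c) { c->pending = true; }

/* The timer fires: the handler starts and the callout is no longer pending. */
static void model_fire(struct model_callout *c)
{
    c->pending = false;
    c->running = true;
}

/* Behaviour described in the thread: report 1 whenever a pending
 * callout was cancelled, even if an older invocation still runs. */
static int model_stop_current(struct model_callout *c)
{
    assert(c != NULL);
    if (!c->pending)
        return 0;
    c->pending = false;
    return 1;
}

/* Proposed behaviour: still cancel the pending callout, but report 0
 * while the handler is running. */
static int model_stop_proposed(struct model_callout *c)
{
    int cancelled = model_stop_current(c);
    return (cancelled && !c->running) ? 1 : 0;
}

/* Assumed teardown rule: free at once only if the stop reported success;
 * otherwise the free is left to the running handler. */
static void model_discard(struct model_tcpcb *tp,
                          int (*stop)(struct model_callout *))
{
    if (stop(&tp->keep))
        tp->freed = true;
}

/* Replays the reported sequence; returns true if the blocked handler
 * would wake up on an already freed tcpcb (the panic case). */
bool model_replay(int (*stop)(struct model_callout *))
{
    struct model_tcpcb tp = { { false, false }, false };

    model_schedule(&tp.keep);
    model_fire(&tp.keep);       /* CPU 1: callout fires, blocks on lock */
    model_schedule(&tp.keep);   /* CPU 2: schedules the callout */
    model_discard(&tp, stop);   /* CPU 2: discard, stop, maybe free */
    return tp.keep.running && tp.freed;  /* CPU 1: unblocks */
}
```

 In this model, model_replay(model_stop_current) reports the
use-after-free while model_replay(model_stop_proposed) does not,
because the 0 return value tells the caller that a handler is still in
flight. Both stop variants cancel the pending callout; they differ only
in what they report.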