From owner-freebsd-stable@FreeBSD.ORG Sat Nov 26 16:57:11 2011 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3660E106564A; Sat, 26 Nov 2011 16:57:11 +0000 (UTC) (envelope-from lstewart@freebsd.org) Received: from lauren.room52.net (lauren.room52.net [210.50.193.198]) by mx1.freebsd.org (Postfix) with ESMTP id B805B8FC0C; Sat, 26 Nov 2011 16:57:10 +0000 (UTC) Received: from lstewart1.loshell.room52.net (ppp59-167-184-191.static.internode.on.net [59.167.184.191]) by lauren.room52.net (Postfix) with ESMTPSA id C01727E880; Sun, 27 Nov 2011 03:57:08 +1100 (EST) Message-ID: <4ED11A64.90306@freebsd.org> Date: Sun, 27 Nov 2011 03:57:08 +1100 From: Lawrence Stewart User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:7.0.1) Gecko/20111016 Thunderbird/7.0.1 MIME-Version: 1.0 To: Jeremy Chadwick References: <4ECEF6FD.5050006@freebsd.org> <4ED077BF.10205@freebsd.org> <20111126075647.GA33048@icarus.home.lan> In-Reply-To: <20111126075647.GA33048@icarus.home.lan> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=0.0 required=5.0 tests=UNPARSEABLE_RELAY autolearn=unavailable version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on lauren.room52.net Cc: kerbzo@gmail.com, freebsd-stable@freebsd.org, stb@lassitu.de, raul@turing.b2n.org, Steven Hartland , Kris Bauer , george+freebsd@m5p.com, FreeBSD Release Engineering Team Subject: Re: TCP Reassembly Issues X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 26 Nov 2011 16:57:11 -0000 Hi Jeremy, On 11/26/11 18:56, Jeremy Chadwick wrote: > On Sat, Nov 26, 2011 at 12:49:24AM -0600, Kris Bauer wrote: >> On Fri, Nov 25, 2011 at 11:23 PM, Lawrence Stewartwrote: >> >>> On 11/25/11 13:01, Lawrence Stewart wrote: [snip] >>>> Thanks Kris, Raul and Stefan for the reports, I'll look into this. >>>> >>> >>> I think I've got it - a stupid 1 line logic bug. My apologies for missing >>> it when I reviewed the patch which introduced the bug (patch was committed >>> to head as r226113, MFCed to stable/9 as r226228). >>> >>> Due to some miscommunication, the initial patch was committed to and MFCed >>> from head much later than it should have been in the 9.0 release cycle and >>> instead of being included in the BETAs, didn't make it in until 9.0-RC1 I >>> believe i.e. only RC1 and RC2 should be experiencing the issue. >>> >>> Could those who have reported the bug and are able to recompile their >>> kernel to test a patch please try the following and report back to the list: >>> >>> >>> http://people.freebsd.org/~lstewart/patches/misctcp/tcp_reass_plugzoneleak_10.x.r227986.patch >>> >>> The patch is against head r227986 but will apply and work correctly for >>> 9.0 as well. >>> >>> Cheers, >>> Lawrence >>> >> >> I have patched, recompiled, and rebooted. net.inet.tcp.reass.cursegments >> is no longer incrementing, and connectivity is holding steady. If anything >> changes over the next couple of hours, I'll be sure to report it; but all >> preliminary signs of the problem are gone. >> >> Thanks for all the help! > > Let's not be hasty in concluding everything is fixed. Why I'm a bit on > edge about this: I took the time to find the CVS commits that induced > this issue in the first place, and it seems there is some history. > > The commit that caused this problem to begin with was supposedly a fix > for a different problem: > > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/netinet/tcp_reass.c#rev1.375 The original patch you reference (equivalent to svn r226113 as noted in my previous email) was indeed for a separate problem. Unfortunately the fix introduced a new problem. > A week later, that commit went from HEAD/MAIN into RELENG_9: > > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/netinet/tcp_reass.c#rev1.374.2.2 > > Be sure to read the description of the problem that was being fixed in > the first place. I've also CC'd the original problem reporter, Steven > Hartland, because we're going to need him to try the above patch from > Lawrence to make sure there aren't other problems. Meaning: for all we > know, the above fix might work great for Kris but cause problems for > Steve. Even though my patch is a multi-line diff, it only effectively changes one thing - that the "te == NULL" condition must be true for both the case where "th->th_seq != tp->rcv_nxt" (current segment does not plug the hole) and where they are equal (current segment does plug the hole; a new case introduced in r226113). I can say with confidence based on the change in the logic that my patch is not a regression as far as Steven's original bug report is concerned. > This entire situation leads me to believe very few people are doing > quality testing of RELENG_9, yet we're already into 9.0-RC2. Please > don't tell me "that's exactly why you should be running RELENG_9!"; that > is completely backwards and I refuse to get into a flame war about it, > because it's this simple: 90%+ of those running FreeBSD on servers need > something that's stable, we can't risk wonkiness (especially of this > degree!) on systems taking production traffic. Did no one actually test > the change *thoroughly*? Imagine had this lay dormant until 9.0-RELEASE. The latter half is fair criticism, more comments below. The fact we're having this discussion now prior to 9.0 being released somewhat negates the assertion in the former part of your paragraph. > Lawrence: please don't take my comments personally or to mean "you broke > it and caused this mess!" It's meant to read more along the lines of All good, not taken personally. > "you committed a fix for something that broke other bits badly, but > nobody noticed this, including the original reporter of a different > problem? How/why?" You get the idea. Your concerns are valid. To clarify, I did not propose or commit the patch which introduced this bug (r226113). Generally speaking, it is a committer's responsibility to ensure that a patch which they commit has been sufficiently tested prior to commit. Normally the committer will solicit testing from the original problem reporter and do some testing themselves. I believe Steven tested Andre's patch and reported to the mailing list that it resolved his immediate problem. I was not privy to any other testing conducted by Andre, so can't comment further on that. As to how this could have been missed: TCP is impressively robust, capable of working even when it has both arms tied behind its back and is missing a leg. It may not work well, but will limp along all the same. People tend to notice and report scenarios where something is definitively broken far more readily than the more complicated case where it degrades to an unreasonable but still working state after a non-deterministic amount of time. Even if basic testing had been performed, it would be fairly easy to miss the effects of this bug, which only manifest after a relatively long time, except for those on the receiving end of large bandwidth-delay product, non-zero-loss TCP connections which would exacerbate the issue quicker. My involvement in all this was that I introduced the original bug r226113 was designed to solve, and reviewed a pre-cursor to, but did not test the r226113 patch. Hope I've provided some useful insight. Cheers, Lawrence