From: Lawrence Stewart
Date: Fri, 25 Jan 2013 00:28:34 +1100
To: John Baldwin
Cc: freebsd-net@freebsd.org
Subject: Re: Some questions about the new TCP congestion control code
Message-ID: <51013702.8040707@freebsd.org>
In-Reply-To: <201301151427.50932.jhb@freebsd.org>

On 01/16/13 06:27, John Baldwin wrote:
> On Tuesday, January 15, 2013 3:29:51 am Lawrence Stewart wrote:
>> Hi John,
>>
>> On 01/15/13 08:04, John Baldwin wrote:
>>> I was looking at TCP congestion control at work recently and noticed
>>> a few
>>
>> Poor you ;)
>>
>>> "odd" things in the current code. First, there is this chunk of code
>>> in cc_ack_received() in tcp_input.c:
>>>
>>> static void inline
>>> cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t type)
>>> {
>>> 	INP_WLOCK_ASSERT(tp->t_inpcb);
>>>
>>> 	tp->ccv->bytes_this_ack = BYTES_THIS_ACK(tp, th);
>>> 	if (tp->snd_cwnd == min(tp->snd_cwnd, tp->snd_wnd))
>>> 		tp->ccv->flags |= CCF_CWND_LIMITED;
>>> 	else
>>> 		tp->ccv->flags &= ~CCF_CWND_LIMITED;
>>>
>>> Due to hysterical raisins, snd_cwnd and snd_wnd are u_long values, not
>>> integers, so the call to min() results in truncation on 64-bit hosts.
>>
>> Good catch, but I don't think it matters in practice as neither
>> snd_cwnd nor snd_wnd will grow past the 32-bit boundary.
>
> I have a psychotic case using cc_cubic where it seems to grow without
> bound, though that is a bug in and of itself (and this change did not
> fix that issue). I ended up not using cc_cubic (more below) and haven't
> been able to track down the root cause of the delay. I can probably
> provide a test case to reproduce this if you are interested.

Hmm, I'd certainly be interested in hearing more about this issue with
cubic. If you think a test case is easy to come up with, please shoot it
through to me when you have the chance.

>>> It should probably be ulmin() instead. However, this line seems to be
>>> a really obfuscated way to just write:
>>>
>>> 	if (tp->snd_cwnd <= tp->snd_wnd)
>>
>> You are correct, though I'd argue the meaning of the existing code as
>> written is clearer compared to your suggested change.
>>
>>> If that is correct, I would vote for changing this to use the much
>>> simpler logic.
>>
>> Agreed. While I find the existing code slightly clearer in meaning,
>> it's not significant enough to warrant keeping it as is when your
>> suggested change is simpler, fixes a bug and achieves the same thing.
>> Happy for you to change it or I can do it if you prefer.
>
> I'll leave that to you, thanks.

Committed as r245783.
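For anyone who wants to see the truncation concretely, here's a quick
userland mock-up - the min() below mirrors the u_int version in
sys/libkern.h, and the window values are deliberately contrived, since
as noted snd_cwnd shouldn't normally get anywhere near 32 bits:

#include <stdio.h>

/* Mirrors sys/libkern.h's min(), which takes and returns u_int. */
static unsigned int
kern_min(unsigned int a, unsigned int b)
{

	return (a < b ? a : b);
}

int
main(void)
{
	unsigned long snd_cwnd = (1UL << 32) + 100;	/* contrived LP64 values */
	unsigned long snd_wnd = (1UL << 32) + 200;

	/* Both arguments truncate to u_int, so the old test gets it wrong... */
	printf("old test: snd_cwnd == min(snd_cwnd, snd_wnd) -> %d\n",
	    snd_cwnd == kern_min(snd_cwnd, snd_wnd));
	/* ...whereas the direct comparison (or ulmin()) is correct. */
	printf("new test: snd_cwnd <= snd_wnd -> %d\n",
	    snd_cwnd <= snd_wnd);
	return (0);
}

The first line prints 0 (connection wrongly treated as not cwnd-limited)
while the second prints 1.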
>>> Secondly, in the particular case I was investigating at work (restart
>>> of an idle connection), the newreno congestion control code in 8.x and
>>> later uses a different algorithm than in 7. Specifically, in 7 TCP
>>> would reuse the same logic used for an initial cwnd (honoring
>>> ss_fltsz). In 8 this no longer happens (instead, 2 is hardcoded). A
>>> guess at a possible fix might look something like this:
>>>
>>> Index: cc_newreno.c
>>> ===================================================================
>>> --- cc_newreno.c	(revision 243660)
>>> +++ cc_newreno.c	(working copy)
>>> @@ -169,8 +169,21 @@ newreno_after_idle(struct cc_var *ccv)
>>>  	if (V_tcp_do_rfc3390)
>>>  		rw = min(4 * CCV(ccv, t_maxseg),
>>>  		    max(2 * CCV(ccv, t_maxseg), 4380));
>>> +#if 1
>>>  	else
>>>  		rw = CCV(ccv, t_maxseg) * 2;
>>> +#else
>>> +	/* XXX: This is missing a lot of stuff that used to be in 7. */
>>> +#ifdef INET6
>>> +	else if ((isipv6 ? in6_localaddr(&CCV(ccv, t_inpcb->in6p_faddr)) :
>>> +	    in_localaddr(CCV(ccv, t_inpcb->inp_faddr))))
>>> +#else
>>> +	else if (in_localaddr(CCV(ccv, t_inpcb->inp_faddr)))
>>> +#endif
>>> +		rw = V_ss_fltsz_local * CCV(ccv, t_maxseg);
>>> +	else
>>> +		rw = V_ss_fltsz * CCV(ccv, t_maxseg);
>>> +#endif
>>>
>>>  	CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
>>>  }
>>>
>>> (But using the #else clause instead of the current #if 1 code.) Was
>>> this change in 8 intentional?
>>
>> It was. Unlike connection initialisation, which still honours ss_fltsz
>> in cc_conn_init(), restarting an idle connection based on ss_fltsz
>> seemed particularly dubious and as such was omitted from the refactored
>> code.
>>
>> The ultimate goal was to remove the ss_fltsz hack completely and
>> implement a smarter mechanism, but that hasn't quite happened yet. The
>> removal of ss_fltsz from 10.x without providing a replacement mechanism
>> is not ideal and should probably be addressed.
>>
>> I'm guessing you're not using rfc3390 because you want to override the
>> initial window based on specific local knowledge of the path between
>> sender and receiver?
>
> Correct, in 7.x we had cranked ss_fltsz up to a really high number to
> prevent the congestion window from collapsing when the connection was
> idle. We have a bit of a unique workload in that we are using TCP to
> reliably forward a latency-sensitive datagram stream across a WAN
> connection with high bandwidth and high RTT. Most congestion control
> seems tuned to bulk transfers rather than to this sort of use case. The
> solution we have settled on here is to add a new TCP socket option to
> disable idle handling, so that when an idle connection restarts it keeps
> its prior congestion window.

I'll respond to this in detail in the other thread.
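(For reference, the restart window the current code computes is modest:
with a 1460-byte MSS, the RFC 3390 branch in the patch above gives
min(4 * 1460, max(2 * 1460, 4380)) = min(5840, 4380) = 4380 bytes, i.e.
3 segments, and the non-3390 branch gives 2 segments - a long way below
what a cranked-up ss_fltsz allowed in 7.x.)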
> One other thing I noticed during this, which may or may not be odd, is
> that if you have a connection with TCP_NODELAY enabled and you fill
> your cwnd and then you get an ACK back for an earlier small segment
> (less than MSS), TCP will not send out a "short" segment for the amount
> of window space released. Instead, it will wait until a full MSS of
> space is available before sending a packet. I'm not sure if that is the
> correct behavior with TCP_NODELAY or if we should send "short" segments
> in that case.

We try fairly hard not to send runt segments irrespective of NODELAY,
but I would be happy to see that change. I'm not aware of any "correct
behaviour" we have to adhere to - I think it would be perfectly
reasonable to have a sysctl set the lowest number of bytes we'd be
willing to send a runt segment for, and then key off TCP_NODELAY to
decide whether we try hard to send a full MSS worth or send as soon as
that minimum amount of window is available.

Cheers,
Lawrence
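P.S. For concreteness, here is roughly the shape of the check I have in
mind - an untested sketch, with the helper, its name and the sysctl
value (runt_min, imagined as something like net.inet.tcp.runt_send_min)
entirely invented for illustration:

/*
 * Hypothetical helper for tcp_output(): decide whether to transmit now,
 * where "avail" is the usable send window in bytes.
 */
static int
tcp_should_send_now(u_long avail, u_int maxseg, int nodelay, u_int runt_min)
{

	if (avail >= maxseg)
		return (1);	/* A full segment fits: always send. */
	if (nodelay && avail >= runt_min)
		return (1);	/* NODELAY: send a runt once runt_min
				 * bytes of window have opened up. */
	return (0);		/* Otherwise wait for a full MSS. */
}

With runt_min defaulting to maxseg the behaviour would be unchanged;
setting it lower would let NODELAY connections use window space as soon
as an ACK releases it.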