From: Lawrence Stewart
Date: Fri, 25 Jan 2013 00:28:34 +1100
To: John Baldwin
Cc: freebsd-net@freebsd.org
Subject: Re: Some questions about the new TCP congestion control code
Message-ID: <51013702.8040707@freebsd.org>
In-Reply-To: <201301151427.50932.jhb@freebsd.org>

On 01/16/13 06:27, John Baldwin wrote:
> On Tuesday, January 15, 2013 3:29:51 am Lawrence Stewart wrote:
>> Hi John,
>>
>> On 01/15/13 08:04, John Baldwin wrote:
>>> I was looking at TCP congestion control at work recently and noticed
>>> a few
>>
>> Poor you ;)
>>
>>> "odd" things in the current code. First, there is this chunk of code
>>> in cc_ack_received() in tcp_input.c:
>>>
>>> static void inline
>>> cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t type)
>>> {
>>> 	INP_WLOCK_ASSERT(tp->t_inpcb);
>>>
>>> 	tp->ccv->bytes_this_ack = BYTES_THIS_ACK(tp, th);
>>> 	if (tp->snd_cwnd == min(tp->snd_cwnd, tp->snd_wnd))
>>> 		tp->ccv->flags |= CCF_CWND_LIMITED;
>>> 	else
>>> 		tp->ccv->flags &= ~CCF_CWND_LIMITED;
>>>
>>> Due to hysterical raisins, snd_cwnd and snd_wnd are u_long values, not
>>> integers, so the call to min() results in truncation on 64-bit hosts.
>>
>> Good catch, but I don't think it matters in practice as neither
>> snd_cwnd nor snd_wnd will grow past the 32-bit boundary.
>
> I have a psychotic case using cc_cubic where it seems to grow without
> bound, though that is a bug in and of itself (and this change did not
> fix that issue). I ended up not using cc_cubic (more below) and haven't
> been able to track down the root cause of the delay. I can probably
> provide a test case to reproduce this if you are interested.

Hmm, I'd certainly be interested in hearing more about this issue with
cubic. If you think a test case is easy to come up with, please shoot it
through to me when you have the chance.

>>> It should probably be ulmin() instead. However, this line seems to be
>>> a really obfuscated way to just write:
>>>
>>> 	if (tp->snd_cwnd <= tp->snd_wnd)
>>
>> You are correct, though I'd argue the meaning of the existing code as
>> written is clearer compared to your suggested change.
>>
>>> If that is correct, I would vote for changing this to use the much
>>> simpler logic.
>>
>> Agreed. While I find the existing code slightly clearer in meaning,
>> it's not significant enough to warrant keeping it as is when your
>> suggested change is simpler, fixes a bug and achieves the same thing.
>> Happy for you to change it or I can do it if you prefer.
>
> I'll leave that to you, thanks.

Committed as r245783.
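For anyone who wants to see the truncation concretely, here's a quick
userland mock-up - the min() below mirrors the u_int version in
sys/libkern.h, and the window values are deliberately contrived, since
as noted snd_cwnd shouldn't normally get anywhere near 32 bits:

#include <stdio.h>

/* Mirrors sys/libkern.h's min(), which takes and returns u_int. */
static unsigned int
kern_min(unsigned int a, unsigned int b)
{

	return (a < b ? a : b);
}

int
main(void)
{
	unsigned long snd_cwnd = (1UL << 32) + 100;	/* contrived LP64 values */
	unsigned long snd_wnd = (1UL << 32) + 200;

	/* Both arguments truncate to u_int, so the old test gets it wrong... */
	printf("old test: snd_cwnd == min(snd_cwnd, snd_wnd) -> %d\n",
	    snd_cwnd == kern_min(snd_cwnd, snd_wnd));
	/* ...whereas the direct comparison (or ulmin()) is correct. */
	printf("new test: snd_cwnd <= snd_wnd -> %d\n",
	    snd_cwnd <= snd_wnd);
	return (0);
}

The first line prints 0 (connection wrongly treated as not cwnd-limited)
while the second prints 1.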
>>> Secondly, in the particular case I was investigating at work (restart
>>> of an idle connection), the newreno congestion control code in 8.x and
>>> later uses a different algorithm than in 7. Specifically, in 7 TCP
>>> would reuse the same logic used for an initial cwnd (honoring
>>> ss_fltsz). In 8 this no longer happens (instead, 2 is hardcoded). A
>>> guess at a possible fix might look something like this:
>>>
>>> Index: cc_newreno.c
>>> ===================================================================
>>> --- cc_newreno.c	(revision 243660)
>>> +++ cc_newreno.c	(working copy)
>>> @@ -169,8 +169,21 @@ newreno_after_idle(struct cc_var *ccv)
>>>  	if (V_tcp_do_rfc3390)
>>>  		rw = min(4 * CCV(ccv, t_maxseg),
>>>  		    max(2 * CCV(ccv, t_maxseg), 4380));
>>> +#if 1
>>>  	else
>>>  		rw = CCV(ccv, t_maxseg) * 2;
>>> +#else
>>> +	/* XXX: This is missing a lot of stuff that used to be in 7. */
>>> +#ifdef INET6
>>> +	else if ((isipv6 ? in6_localaddr(&CCV(ccv, t_inpcb->in6p_faddr)) :
>>> +	    in_localaddr(CCV(ccv, t_inpcb->inp_faddr))))
>>> +#else
>>> +	else if (in_localaddr(CCV(ccv, t_inpcb->inp_faddr)))
>>> +#endif
>>> +		rw = V_ss_fltsz_local * CCV(ccv, t_maxseg);
>>> +	else
>>> +		rw = V_ss_fltsz * CCV(ccv, t_maxseg);
>>> +#endif
>>>
>>>  	CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
>>>  }
>>>
>>> (But using the #else clause instead of the current #if 1 code.) Was
>>> this change in 8 intentional?
>>
>> It was. Unlike connection initialisation, which still honours ss_fltsz
>> in cc_conn_init(), restarting an idle connection based on ss_fltsz
>> seemed particularly dubious and as such was omitted from the refactored
>> code.
>>
>> The ultimate goal was to remove the ss_fltsz hack completely and
>> implement a smarter mechanism, but that hasn't quite happened yet. The
>> removal of ss_fltsz from 10.x without providing a replacement mechanism
>> is not ideal and should probably be addressed.
>>
>> I'm guessing you're not using rfc3390 because you want to override the
>> initial window based on specific local knowledge of the path between
>> sender and receiver?
>
> Correct, in 7.x we had cranked ss_fltsz up to a really high number to
> prevent the congestion window from collapsing when the connection was
> idle. We have a bit of a unique workload in that we are using TCP to
> reliably forward a latency-sensitive datagram stream across a WAN
> connection with high bandwidth and high RTT. Most congestion control
> seems tuned to bulk transfers rather than to this sort of use case. The
> solution we have settled on here is to add a new TCP socket option to
> disable idle handling, so that when an idle connection restarts it keeps
> its prior congestion window.

I'll respond to this in detail in the other thread.
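(For reference, the restart window the current code computes is modest:
with a 1460-byte MSS, the RFC 3390 branch in the patch above gives
min(4 * 1460, max(2 * 1460, 4380)) = min(5840, 4380) = 4380 bytes, i.e.
3 segments, and the non-3390 branch gives 2 segments - a long way below
what a cranked-up ss_fltsz allowed in 7.x.)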
> One other thing I noticed during this, which may or may not be odd, is
> that if you have a connection with TCP_NODELAY enabled and you fill
> your cwnd and then you get an ACK back for an earlier small segment
> (less than MSS), TCP will not send out a "short" segment for the amount
> of window space released. Instead, it will wait until a full MSS of
> space is available before sending a packet. I'm not sure if that is the
> correct behavior with TCP_NODELAY or if we should send "short" segments
> in that case.

We try fairly hard not to send runt segments irrespective of NODELAY,
but I would be happy to see that change. I'm not aware of any "correct
behaviour" we have to adhere to - I think it would be perfectly
reasonable to have a sysctl set the lowest number of bytes we'd be
willing to send a runt segment for, and then key off TCP_NODELAY to
decide whether we try hard to send a full MSS worth or send as soon as
that minimum amount of window is available.

Cheers,
Lawrence
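P.S. For concreteness, here is roughly the shape of the check I have in
mind - an untested sketch, with the helper, its name and the sysctl
value (runt_min, imagined as something like net.inet.tcp.runt_send_min)
entirely invented for illustration:

/*
 * Hypothetical helper for tcp_output(): decide whether to transmit now,
 * where "avail" is the usable send window in bytes.
 */
static int
tcp_should_send_now(u_long avail, u_int maxseg, int nodelay, u_int runt_min)
{

	if (avail >= maxseg)
		return (1);	/* A full segment fits: always send. */
	if (nodelay && avail >= runt_min)
		return (1);	/* NODELAY: send a runt once runt_min
				 * bytes of window have opened up. */
	return (0);		/* Otherwise wait for a full MSS. */
}

With runt_min defaulting to maxseg the behaviour would be unchanged;
setting it lower would let NODELAY connections use window space as soon
as an ACK releases it.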