Date:      Fri, 25 Jan 2013 00:28:34 +1100
From:      Lawrence Stewart <lstewart@freebsd.org>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-net@freebsd.org
Subject:   Re: Some questions about the new TCP congestion control code
Message-ID:  <51013702.8040707@freebsd.org>
In-Reply-To: <201301151427.50932.jhb@freebsd.org>
References:  <201301141604.29864.jhb@freebsd.org> <50F5137F.1060207@freebsd.org> <201301151427.50932.jhb@freebsd.org>

On 01/16/13 06:27, John Baldwin wrote:
> On Tuesday, January 15, 2013 3:29:51 am Lawrence Stewart wrote:
>> Hi John,
>>
>> On 01/15/13 08:04, John Baldwin wrote:
>>> I was looking at TCP congestion control at work recently and noticed a few 
>>
>> Poor you ;)
>>
>>> "odd" things in the current code.  First, there is this chunk of code in 
>>> cc_ack_received() in tcp_input.c:
>>>
>>> static void inline
>>> cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t type)
>>> {
>>> 	INP_WLOCK_ASSERT(tp->t_inpcb);
>>>
>>> 	tp->ccv->bytes_this_ack = BYTES_THIS_ACK(tp, th);
>>> 	if (tp->snd_cwnd == min(tp->snd_cwnd, tp->snd_wnd))
>>> 		tp->ccv->flags |= CCF_CWND_LIMITED;
>>> 	else
>>> 		tp->ccv->flags &= ~CCF_CWND_LIMITED;
>>>
>>>
>>> Due to hysterical raisins, snd_cwnd and snd_wnd are u_long values, not 
>>> integers, so the call to min() results in truncation on 64-bit hosts.
>>
>> Good catch, but I don't think it matters in practice as neither snd_cwnd
>> or snd_wnd will grow past the 32-bit boundary.
> 
> I have a pathological case using cc_cubic where it seems to grow without bound,
> though that is a bug in and of itself (and this change did not fix that
> issue).  I ended up not using cc_cubic (more below) and haven't been able
> to track down the root cause of the delay.  I can probably provide a test case
> to reproduce this if you are interested.

Hmm, I'd certainly be interested in hearing more about this issue with
cubic. If you think a test case is easy to come up with, please shoot it
through to me when you have the chance.
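
Since that runaway-cwnd case is the one place snd_cwnd could plausibly
cross 2^32, here is a quick standalone illustration of the truncation
discussed above (a userland sketch, not kernel code, assuming min() and
ulmin() definitions along the lines of sys/libkern.h; only meaningful
on an LP64 host):

	#include <stdio.h>

	typedef unsigned int u_int;
	typedef unsigned long u_long;

	/* Mirrors the sys/libkern.h-style inlines. */
	static inline u_int min(u_int a, u_int b) { return (a < b ? a : b); }
	static inline u_long ulmin(u_long a, u_long b) { return (a < b ? a : b); }

	int
	main(void)
	{
		u_long snd_cwnd = (1UL << 32) + 1000;	/* runaway cwnd */
		u_long snd_wnd = 65535;			/* sane send window */

		/* min()'s u_int args truncate snd_cwnd to its low 32 bits (1000). */
		printf("min():   %lu\n", (u_long)min(snd_cwnd, snd_wnd));
		/* ulmin() compares the full 64-bit values and returns 65535. */
		printf("ulmin(): %lu\n", ulmin(snd_cwnd, snd_wnd));
		return (0);
	}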

>>> It should probably be ulmin() instead.  However, this line seems to be a really 
>>> obfuscated way to just write:
>>>
>>> 	if (tp->snd_cwnd <= tp->snd_wnd)
>>
>> You are correct, though I'd argue the meaning of the existing code as
>> written is clearer compared to your suggested change.
>>
>>> If that is correct, I would vote for changing this to use the much simpler 
>>> logic.
>>
>> Agreed. While I find the existing code slightly clearer in meaning, it's
>> not significant enough to warrant keeping it as is when your suggested
>> change is simpler, fixes a bug and achieves the same thing. Happy for
>> you to change it or I can do it if you prefer.
> 
> I'll leave that to you, thanks.

Committed as r245783.
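
For the archives, the simplified check in cc_ack_received() now amounts
to roughly the following (a sketch of the relevant lines, not the
verbatim committed diff):

	static void inline
	cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t type)
	{
		INP_WLOCK_ASSERT(tp->t_inpcb);

		tp->ccv->bytes_this_ack = BYTES_THIS_ACK(tp, th);
		/*
		 * cwnd is the limiting window iff it is no larger than the
		 * send window; no min()/ulmin() call and hence no 64-bit
		 * truncation to worry about.
		 */
		if (tp->snd_cwnd <= tp->snd_wnd)
			tp->ccv->flags |= CCF_CWND_LIMITED;
		else
			tp->ccv->flags &= ~CCF_CWND_LIMITED;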

>>> Secondly, in the particular case I was investigating at work (restart of an 
>>> idle connection), the newreno congestion control code in 8.x and later uses a 
>>> different algorithm than in 7.  Specifically, in 7 TCP would reuse the same 
>>> logic used for an initial cwnd (honoring ss_fltsz).  In 8 this no longer 
>>> happens (instead, 2 is hardcoded).  A guess at a possible fix might look 
>>> something like this:
>>>
>>> Index: cc_newreno.c
>>> ===================================================================
>>> --- cc_newreno.c	(revision 243660)
>>> +++ cc_newreno.c	(working copy)
>>> @@ -169,8 +169,21 @@ newreno_after_idle(struct cc_var *ccv)
>>>  	if (V_tcp_do_rfc3390)
>>>  		rw = min(4 * CCV(ccv, t_maxseg),
>>>  		    max(2 * CCV(ccv, t_maxseg), 4380));
>>> +#if 1
>>>  	else
>>>  		rw = CCV(ccv, t_maxseg) * 2;
>>> +#else
>>> +	/* XXX: This is missing a lot of stuff that used to be in 7. */
>>> +#ifdef INET6
>>> +	else if ((isipv6 ? in6_localaddr(&CCV(ccv, t_inpcb->in6p_faddr)) :
>>> +		in_localaddr(CCV(ccv, t_inpcb->inp_faddr))))
>>> +#else
>>> +	else if (in_localaddr(CCV(ccv, t_inpcb->inp_faddr)))
>>> +#endif
>>> +		rw = V_ss_fltsz_local * CCV(ccv, t_maxseg);
>>> +	else
>>> +		rw = V_ss_fltsz * CCV(ccv, t_maxseg);
>>> +#endif
>>>  
>>>  	CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
>>>  }
>>>
>>> (But using the #else clause instead of the current #if 1 code).  Was this 
>>> change in 8 intentional?
>>
>> It was. Unlike connection initialisation which still honours ss_fltsz in
>> cc_conn_init(), restarting an idle connection based on ss_fltsz seemed
>> particularly dubious and as such was omitted from the refactored code.
>>
>> The ultimate goal was to remove the ss_fltsz hack completely and
>> implement a smarter mechanism, but that hasn't quite happened yet. The
>> removal of ss_fltsz from 10.x without providing a replacement mechanism
>> is not ideal and should probably be addressed.
>>
>> I'm guessing you're not using rfc3390 because you want to override the
>> initial window based on specific local knowledge of the path between
>> sender and receiver?
> 
> Correct, in 7.x we had cranked ss_fltsz up to a really high number to prevent
> the congestion window from collapsing when the connection was idle.  We have
> a bit of a unique workload in that we are using TCP to reliably forward a
> latency-sensitive datagram stream across a WAN connection with high bandwidth
> and high RTT.  Most of congestion control seems tuned to bulk transfers rather
> than this sort of use case.  The solution we have settled on here is to add a
> new TCP socket option to disable idle handling so that when an idle connection
> restarts it keeps its prior congestion window.

I'll respond to this in detail in the other thread.
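
(For readers of the archive: a minimal sketch of what such a socket
option might gate in tcp_output().  The surrounding lines approximate
the existing idle-restart logic; the TF_NOIDLERESET flag and the socket
option that would set it are hypothetical names for illustration, not
existing code.)

	/*
	 * Restart-of-idle handling: normally the CC module gets a chance
	 * to shrink cwnd here.  A per-connection flag set via a new
	 * TCP-level socket option (hypothetical TF_NOIDLERESET) would
	 * let a connection keep its prior cwnd instead.
	 */
	idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una);
	if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur &&
	    (tp->t_flags & TF_NOIDLERESET) == 0)
		cc_after_idle(tp);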

> One other thing I noticed during this, which may or may not be odd, is that
> if you have a connection with TCP_NODELAY enabled and you fill your cwnd and
> then you get an ACK back for an earlier small segment (less than MSS), TCP
> will not send out a "short" segment for the amount of window space released.
> Instead, it will wait until a full MSS of space is available before sending
> a packet.  I'm not sure if that is the correct behavior with TCP_NODELAY or
> if we should send "short" segments in that case.

We try fairly hard not to send runt segments irrespective of NODELAY,
but I would be happy to see that change. I'm not aware of any "correct
behaviour" we have to adhere to. I think it would be perfectly
reasonable to have a sysctl set the lowest number of bytes we'd be
willing to send a runt segment for, and then key off TCP_NODELAY as to
whether we try hard to send a full MSS worth or send as soon as that
minimum number of bytes of window is available.
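
To make that concrete, a hypothetical sketch of the send decision (the
V_tcp_minsendsize sysctl is a made-up name for the proposed knob, and
"len" stands for whatever the usual tcp_output() logic has decided the
window currently permits; none of this is existing code):

	if (len >= tp->t_maxseg)
		goto send;		/* a full segment is always OK */
	if ((tp->t_flags & TF_NODELAY) && len >= V_tcp_minsendsize)
		goto send;		/* NODELAY: accept a runt above the floor */
	/* Otherwise hold off until a full MSS worth of window opens. */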

Cheers,
Lawrence


