Date:      Tue, 12 Aug 2014 05:03:15 -0700
From:      Adrian Chadd <adrian@freebsd.org>
To:        Vlad Zolotarov <vladz@cloudius-systems.com>
Cc:        FreeBSD Net <freebsd-net@freebsd.org>, Osv Dev <osv-dev@googlegroups.com>
Subject:   Re: TCP Rx window auto sizing relies on TCP timestamp option?
Message-ID:  <CAJ-VmokJQiSeH=tj2GTD=wwMR0jSMYMnz3Xs7UW8yVD5ShK_Lw@mail.gmail.com>
In-Reply-To: <53E9FF32.3010802@cloudius-systems.com>
References:  <53E8B424.2000904@cloudius-systems.com> <20140811170606.GV83475@funkthat.com> <53E9FF32.3010802@cloudius-systems.com>

The TL;DR is - yes, I bet it'd be nice to have. :)


-a

On 12 August 2014 04:49, Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>
> On Aug 11, 2014 8:06 PM, "John-Mark Gurney" <jmg@funkthat.com> wrote:
>>
>> Vlad Zolotarov wrote this message on Mon, Aug 11, 2014 at 15:16 +0300:
>> > Hi, I have a rather strange question about the TCP Rx window auto sizing
>> > implementation in the FreeBSD networking stack.
>> > When I looked at the FreeBSD code (hash
>> > 9abce0e567c9a5a0520cdd94d5c633c7baf9a184) I noticed that
>> > the above-mentioned feature will not be "enabled" if there isn't a TCP
>> > timestamp option present in the current TCP session:
>> >
>> > See sys/netinet/tcp_input.c: line 1813 in tcp_do_segment() function:
>> >
>> >                       if (V_tcp_do_autorcvbuf &&
>> >                           to.to_tsecr &&  /* <-- this is what I'm talking about */
>> >                           (so->so_rcv.sb_flags & SB_AUTOSIZE))
>> >
>> > So, if I read the code correctly, if there isn't a TS option (negotiated
>> > and thus present in every received packet) the receive socket buffer
>> > won't grow, thus preventing the growth of the Rx window.
>> > If that's the case, this is very strange, since the TS option is not
>> > guaranteed and, even more, in many cases it won't be present.
>> > For example in Linux this feature is disabled by default (controlled by
>> > /proc/sys/net/ipv4/tcp_timestamps).
>> > This is how I actually noticed the problem in the first place: I ran an
>> > iperf test where Linux was the initiator and transmitter (iperf -c) and
>> > the FreeBSD box was the receiver (iperf -s), and I noticed that the Rx
>> > window wasn't opening up because the Linux box hadn't negotiated the TS
>> > option in the SYN.
>> > As a result, the throughput numbers were significantly lower compared to
>> > Linux-to-Linux setup (Linux uses a Dynamic Right-Sizing (DRS) algorithm
>> > http://public.lanl.gov/radiant/pubs.html#DRS, which doesn't rely on TS).
>> >
>> > Could anybody comment on this, please?
>> > Did I miss anything?
>> > Is it true that FreeBSD assumes that TS option is always present and if
>> > not how can I cause an Rx Window to open up when TS option hasn't been
>> > negotiated?
>>
>> This means the receive buffer won't grow beyond the default of 64k...
>> But, as the comment says:
>>                  * On the receive side the socket buffer memory is only rarely
>>                  * used to any significant extent.  This allows us to be much
>>
>> The receive buffer will only get used if the application takes too long
>> to read its buffer, or it isn't currently waiting... If that's the
>> case, then the application should be fixed to be able to process the
>> data as quickly as it comes in...
>
> You are right about the Rx buffer, and as a result the Rx window will not
> grow beyond this value either.
>
> See the following lines:
>
> tcp_output.c: tcp_output():
>
> line 509:
>
>         recwin = sbspace(&so->so_rcv);
>
>
> line 1034:
>
>         /*
>          * According to RFC1323 the window field in a SYN (i.e., a <SYN>
>          * or <SYN,ACK>) segment itself is never scaled.  The <SYN,ACK>
>          * case is handled in syncache.
>          */
>         if (flags & TH_SYN)
>                 th->th_win = htons((u_short)
>                                 (min(sbspace(&so->so_rcv), TCP_MAXWIN)));
>         else
>                 th->th_win = htons((u_short)(recwin >> tp->rcv_scale));
>
>
> As a result the Tx window of the transmitter will not grow beyond 64K
> either, and that is a single full LSO/LRO frame.
> So this limits the transmitter to a single 64K LSO frame per RTT, since
> the receiver will "see" the new bytes only after the HW delivers them,
> which happens once all 64KB (a full LRO aggregation) have been received,
> and only then will it send an ACK.
>
> Now consider a 0.2ms RTT like the one I have on my setup, with 40Gbps
> ConnectX 3 NICs connected back to back.
> In this case the best throughput you'll ever get with a 64K window is
> 8*64K/0.2ms ~ 2.5Gbps, which is 1/16 of the line rate, and you need at
> least a 64K*16 ~ 1MB window to reach the line rate. And the higher the RTT,
> the larger the window needed. And this assumes the application frees the
> socket buffer immediately once data arrives, which of course may never be
> the case.
>
> I suppose use cases like above were exactly the motivation for Window
> Scaling option in RFC 1323.
>
>
>>
>> So, I don't see much of an issue w/ the code you pointed out, yes,
>> the receive buffer won't grow,
>
>> but there are options that you can set
>> (sysctl net.inet.tcp.recvspace) and SO_RCVBUF in the application that
>> will address it otherwise...
>
> Exactly! If there is no TS, it won't, and FreeBSD will not be able to
> utilize the network link.
> Frankly, I don't understand your advice. Do you suggest that each and every
> application manually configure its receive socket buffer size? Or that we
> increase the initial socket buffer globally, which is even worse?! And which
> value should we choose? As you can see above, the proper value depends on
> the RTT, and the RTT may change while the application runs due to a routing
> change. I doubt your suggestion is feasible.
>
> So my first question stands: doesn't the FreeBSD community think it would
> be beneficial for FreeBSD to use DRS (or a similar algorithm) when no TS
> option is negotiated?
>
> thanks,
> vlad
>
>
>>
>> Obviously setting the default too large will just waste memory...
>>
>> --
>>   John-Mark Gurney                              Voice: +1 415 225 5579
>>
>>      "All that I will do, has been done, All that I have, has not."
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"


