From: Vlad Zolotarov
Date: Tue, 12 Aug 2014 14:49:06 +0300
To: freebsd-net@freebsd.org
Cc: Osv Dev
Subject: Re: TCP Rx window auto sizing relies on TCP timestamp option?
Message-ID: <53E9FF32.3010802@cloudius-systems.com>
References: <53E8B424.2000904@cloudius-systems.com> <20140811170606.GV83475@funkthat.com>
In-Reply-To: <20140811170606.GV83475@funkthat.com>
List-Id: Networking and TCP/IP with FreeBSD

On Aug 11, 2014 8:06 PM, "John-Mark Gurney" wrote:
>
> Vlad Zolotarov wrote this message on Mon, Aug 11, 2014 at 15:16 +0300:
> > Hi, I have a rather strange question about the TCP Rx window auto sizing
> > implementation in the FreeBSD networking stack.
> > When I looked at the FreeBSD code (hash
> > 9abce0e567c9a5a0520cdd94d5c633c7baf9a184) I noticed that
> > the above-mentioned feature will not be "enabled" if there isn't a TCP
> > timestamp option present in the current TCP session:
> >
> > See sys/netinet/tcp_input.c: line 1813 in tcp_do_segment() function:
> >
> >     if (V_tcp_do_autorcvbuf &&
> >         to.to_tsecr &&        <-------- this is what I'm talking about
> >         (so->so_rcv.sb_flags & SB_AUTOSIZE))
> >
> > So, if I read the code correctly, if there isn't a TS option (negotiated
> > and thus present in every received packet) the receive socket buffer
> > won't grow, thus preventing the growth of the Rx window.
> > If that's the case this is very strange, since the TS option is not
> > guaranteed and, even more, in many cases it won't be present.
> > For example, in Linux this feature may be disabled (it is controlled by
> > /proc/sys/net/ipv4/tcp_timestamps).
> > This is how I actually noticed the problem in the first place: I ran an
> > iperf test where Linux was the initiator and transmitter (iperf -c) and a
> > FreeBSD box was the receiver (iperf -s), and I noticed that the Rx window
> > wasn't opening up because the Linux box hadn't negotiated the TS option
> > in the SYN.
> > As a result, the throughput numbers were significantly lower compared to
> > a Linux-to-Linux setup (Linux uses a Dynamic Right-Sizing (DRS) algorithm,
> > http://public.lanl.gov/radiant/pubs.html#DRS, which doesn't rely on TS).
> >
> > Could anybody comment on this, please?
> > Did I miss anything?
> > Is it true that FreeBSD assumes that the TS option is always present and,
> > if not, how can I cause the Rx window to open up when the TS option
> > hasn't been negotiated?
>
> This means the receive buffer won't grow beyond the default of 64k...
> But, as the comment says:
>  * On the receive side the socket buffer memory is only rarely
>  * used to any significant extent.  This allows us to be much
>
> The receive buffer will only get used if the application takes too long
> to read its buffer, or it isn't currently waiting...  If that's the
> case, then the application should be fixed to be able to process the
> data as quickly as it comes in...

You are right about the Rx buffer, and as a result the Rx window will not
grow beyond this value either. See the following lines:

tcp_output.c: tcp_output():

line 509:

    recwin = sbspace(&so->so_rcv);

line 1034:

    /*
     * According to RFC1323 the window field in a SYN (i.e., a <SYN>
     * or <SYN,ACK>) segment itself is never scaled.  The <SYN,ACK>
     * case is handled in syncache.
     */
    if (flags & TH_SYN)
        th->th_win = htons((u_short)
            (min(sbspace(&so->so_rcv), TCP_MAXWIN)));
    else
        th->th_win = htons((u_short)(recwin >> tp->rcv_scale));

As a result, the Tx window of the transmitter will not grow beyond 64K
either, and this is a single full LSO/LRO frame. So the transmitter will be
limited to a single 64K LSO frame per RTT, since the receiver will only
"see" the new bytes after they are delivered by the HW, which happens after
all 64KB (a full LRO aggregation) are received, and only then will it send
an ACK.

Now consider a 0.2ms RTT like I have on my setup, with 40Gbps ConnectX 3
NICs connected back to back. In this case the best throughput you'll ever
get with the 64K window is 8*64K/0.2ms ~ 2.6Gbps, which is about 1/16 of
the line rate, and you need at least a 64K*16 ~ 1MB window to reach the
line rate. The higher the RTT, the larger the window we'll need. And this
assumes the application frees the socket buffer immediately once data
arrives, which of course may never be the case.

I suppose use cases like the above were exactly the motivation for the
Window Scaling option in RFC 1323.

>
> So, I don't see much of an issue w/ the code you pointed out, yes,
> the receive buffer won't grow,
> but there are options that you can set
> (sysctl net.inet.tcp.recvspace) and SO_RCVBUF in the application that
> will address it otherwise...

Exactly! If there is no TS, it won't, and FreeBSD will not be able to
utilize the network link. Frankly, I don't understand your advice: do you
suggest that each and every application go and manually configure its
receive socket buffer size? Or that we increase the initial socket buffer
globally, which is even worse?! And which value should we choose? As you
can see above, the proper value depends on the RTT, and the RTT may change
while the application runs due to a routing change. I doubt your suggestion
is feasible.

So my first question stands: doesn't the FreeBSD community think it would
be beneficial for FreeBSD to use DRS (or a similar algorithm) when no TS
option has been negotiated?

thanks,
vlad

>
> Obviously setting the default too large will just waste memory...
>
> --
> John-Mark Gurney                              Voice: +1 415 225 5579
>
>      "All that I will do, has been done, All that I have, has not."
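
For readers who want to check the window arithmetic in the reply above (64K window, 0.2 ms RTT, 40 Gbps link), here is a minimal sketch in C. The helper names window_limited_bps and bdp_bytes are made up for illustration and are not part of the FreeBSD stack. The sketch assumes exactly one receive window of data is in flight per round trip and ignores header overhead, LRO batching and application read delays.

```c
#include <assert.h>

/*
 * Upper bound on throughput (bits per second) when the sender may have
 * at most window_bytes outstanding per round trip.
 * Hypothetical helper, for illustration only.
 */
static double window_limited_bps(double window_bytes, double rtt_seconds)
{
    assert(rtt_seconds > 0.0);
    return 8.0 * window_bytes / rtt_seconds;
}

/*
 * Bandwidth-delay product: the number of bytes that must be in flight
 * to keep a link of link_bps busy for one round trip.
 * Hypothetical helper, for illustration only.
 */
static double bdp_bytes(double link_bps, double rtt_seconds)
{
    assert(rtt_seconds > 0.0);
    return link_bps * rtt_seconds / 8.0;
}
```

With these definitions, window_limited_bps(64.0 * 1024.0, 0.0002) comes to roughly 2.62e9 bits per second, and bdp_bytes(40e9, 0.0002) comes to roughly 1,000,000 bytes, which matches the "about 1/16 of line rate" and "about 1MB window" estimates in the message.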