From owner-freebsd-net@FreeBSD.ORG  Fri Jan 25 15:37:44 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 585625F5;
 Fri, 25 Jan 2013 15:37:43 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net
 [IPv6:2001:470:1f10:75::2])
 by mx1.freebsd.org (Postfix) with ESMTP id 402733CD;
 Fri, 25 Jan 2013 15:37:43 +0000 (UTC)
Received: from pakbsde14.localnet (unknown [38.105.238.108])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id 52C18B98F;
 Fri, 25 Jan 2013 10:37:42 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-net@freebsd.org
Subject: Re: Some questions about the new TCP congestion control code
Date: Fri, 25 Jan 2013 09:00:38 -0500
User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p22; KDE/4.5.5; amd64; ; )
References: <201301141604.29864.jhb@freebsd.org> <51014150.50101@networx.ch>
 <5101C377.7010907@freebsd.org>
In-Reply-To: <5101C377.7010907@freebsd.org>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201301250900.38871.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Fri, 25 Jan 2013 10:37:42 -0500 (EST)
Cc: Lawrence Stewart <lstewart@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 25 Jan 2013 15:37:44 -0000

On Thursday, January 24, 2013 6:27:51 pm Lawrence Stewart wrote:
> On 01/25/13 01:12, Andre Oppermann wrote:
> > On 24.01.2013 14:28, Lawrence Stewart wrote:
> >> On 01/16/13 06:27, John Baldwin wrote:
> >>> One other thing I noticed which may or may not be odd during this is
> >>> that if you have a connection with TCP_NODELAY enabled and you fill
> >>> your cwnd and then you get an ACK back for an earlier small segment
> >>> (less than MSS), TCP will not send out a "short" segment for the
> >>> amount of window space released.  Instead, it will wait until a full
> >>> MSS of space is available before sending a packet.  I'm not sure if
> >>> that is the correct behavior with TCP_NODELAY or if we should send
> >>> "short" segments in that case.
> >>
> >> We try fairly hard not to send runt segments irrespective of NODELAY,
> >> but I would be happy to see that change. I'm not aware of any "correct
> >> behaviour" we have to adhere to - I think it would be perfectly
> >> reasonable to have a sysctl set the lowest number of bytes we'd be
> >> willing to send a runt segment for and then key off TCP_NODELAY as to
> >> whether we try hard to send an MSS worth or send as soon as we have the
> >> min number of bytes worth of window available.
> > 
> > This is classic silly window syndrome prevention applied to the CWND.
> 
> Yes, but I think we could provide knobs to relax the behaviour where the
> latency vs header/payload overhead tradeoff swings in favour of latency.
> 
> I guess, John, I should first ask if you know why you were only getting
> such small ACKs back? Were you sending full MSS segments in the first
> place or doing some sort of PUSH to try and expedite getting some
> smaller chunk of data to the other end which triggered a small segment
> and corresponding small ACK?

In general we only send very small segments: we have TCP_NODELAY on and
are effectively using the connection to forward small datagrams.  Thus, in
the usual case we have a lot of small segments (much smaller than MSS) in
flight.  Once we fill the congestion window, the stack waits for a full MSS
of window space to open up, and that takes several ACKs of the earlier
small segments.
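
To put a rough number on "several ACKs", here is a toy model in C.  It is
not FreeBSD's tcp_output() logic; the function name and parameters are
made up for illustration, and it assumes the window starts completely full
and that cwnd does not grow as the ACKs arrive.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model, not kernel code: count how many ACKs of
 * seg_len-byte segments must arrive before at least min_send bytes
 * of congestion window are free again, starting from a full window.
 */
static unsigned
acks_until_send(size_t seg_len, size_t min_send)
{
	size_t freed = 0;
	unsigned acks = 0;

	assert(seg_len > 0);
	while (freed < min_send) {
		freed += seg_len;	/* each ACK releases one small segment */
		acks++;
	}
	return (acks);
}
```

With 100-byte segments and a 1460-byte MSS, acks_until_send(100, 1460)
counts the ACKs the sender sits through under the current behavior, while
acks_until_send(100, 1) models sending a short segment as soon as any
space opens up.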

> > Sending a small segment when the window opens just a bit isn't going to
> > help much and
> 
> I wouldn't be game to make such a blanket statement - that very much
> depends on the situation. I think John's use case is relevant and we
> currently aren't very helpful towards it.

I think in the case of TCP_NODELAY, the user is explicitly asking for
lower latency.  Think about the "typical" use case for TCP_NODELAY, an
interactive shell session.  If you manage to build up a full cwnd of
backlog, you'd want the oldest characters to get out to the remote machine
as soon as possible instead of waiting for enough ACKs to free up a full
MSS.
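
For reference, this is how an application asks for that behavior.  The
setsockopt() and getsockopt() calls are the standard sockets API; the two
helper names are just for illustration.  Note that this only disables the
Nagle algorithm; as described in this thread, it does not change the wait
for a full MSS once the cwnd is full.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Enable TCP_NODELAY on a TCP socket; returns 0 on success, -1 on error. */
static int
enable_nodelay(int fd)
{
	int on = 1;

	return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/* Returns 1 if TCP_NODELAY is set, 0 if it is clear, -1 on error. */
static int
nodelay_is_set(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
		return (-1);
	return (val != 0);
}
```

A caller would typically run enable_nodelay() right after socket() or
accept() and check the return value before relying on the option.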

However, there might also be cases where it's useful to fall back to
throttling to a full MSS.  If you were to implement my suggestion and the
local sender started spamming a bunch of data, you could be stuck
permanently sending small frames if you always responded to each sub-MSS
ACK with an equivalently-sized frame.  It may be that when you fill your
cwnd this is the more common case, and that the better fix for my use case
is to have a more accurate cwnd (and thus the TCP_IGNOREIDLE change).  I
have no idea what other stacks do in this regard.
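
The overhead side of that tradeoff can be sketched with another toy model
in C.  Again this is not stack code: the function and its min_send
threshold are hypothetical (min_send plays the role of the "lowest number
of bytes we'd send a runt for" knob suggested earlier in the thread), and
it assumes every incoming ACK frees exactly ack_len bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: count the segments needed to drain `backlog`
 * bytes when each ACK frees ack_len bytes of window and the sender
 * transmits only once min_send bytes are free (or the remaining
 * backlog fits).  No segment is larger than mss.
 */
static unsigned
segments_to_drain(size_t backlog, size_t ack_len, size_t mss,
    size_t min_send)
{
	size_t avail = 0, len;
	unsigned sent = 0;

	assert(ack_len > 0 && mss > 0 && min_send > 0);
	while (backlog > 0) {
		avail += ack_len;	/* ACK for an earlier segment */
		if (avail < min_send && avail < backlog)
			continue;	/* keep waiting for more window */
		len = avail < backlog ? avail : backlog;
		if (len > mss)
			len = mss;
		backlog -= len;
		avail -= len;
		sent++;
	}
	return (sent);
}
```

For a 14600-byte backlog with 100-byte ACKs and a 1460-byte MSS, setting
min_send to 1 answers every ACK with a 100-byte frame, while setting it
to the MSS coalesces the same data into full-sized segments.  Comparing
the two return values shows the per-packet header cost of the
always-send-short policy.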

-- 
John Baldwin