From owner-freebsd-net@FreeBSD.ORG Sun Nov 30 23:11:21 2008 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9C32F1065675 for ; Sun, 30 Nov 2008 23:11:21 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) by mx1.freebsd.org (Postfix) with ESMTP id 1E2B68FC22 for ; Sun, 30 Nov 2008 23:11:20 +0000 (UTC) (envelope-from andre@freebsd.org) Received: (qmail 56766 invoked from network); 30 Nov 2008 21:34:25 -0000 Received: from localhost (HELO [127.0.0.1]) ([127.0.0.1]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 30 Nov 2008 21:34:25 -0000 Message-ID: <49331DA0.3070804@freebsd.org> Date: Mon, 01 Dec 2008 00:11:28 +0100 From: Andre Oppermann User-Agent: Thunderbird 1.5.0.14 (Windows/20071210) MIME-Version: 1.0 To: David Malone References: <200811291746.aa88825@walton.maths.tcd.ie> In-Reply-To: <200811291746.aa88825@walton.maths.tcd.ie> Content-Type: multipart/mixed; boundary="------------090402010905090904010200" Cc: Rui Paulo , freebsd-net@freebsd.org, Kevin Oberman , Venkat Venkatsubra Subject: Re: FreeBSD Window updates X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Nov 2008 23:11:21 -0000 This is a multi-part message in MIME format. --------------090402010905090904010200 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit David Malone wrote: > I've got an example extract tcpdump of this at the end of the mail > - here 6 ACKs are sent, 5 of which are pure window updates and > several are 2us apart! > > I think the easy option is to delete the code that generates explicit > window updates if the window moves by 2*MSS. We then should be doing > something similar to Linux. The other easy alternative would be to > add a sysclt that lets us generate an window update every N*MSS and > by default set it to something big, like 10 or 100. That should > effectively eliminate the updates during bulk data transfer, but > may still generate some window updates after a loss. The main problem of the pure window update test in tcp_output() is its complete ignorance of delayed ACKs. Second is the strict 4.4BSD adherence to sending an update for every window increase of >= 2*MSS. The third issue of sending a slew of window updates after having received a FIN (telling us the other end won't ever send more data) I have already fixed some moons ago. In my new-tcp work I've come across the window update logic some time ago and backchecked with relevant RFCs and other implementations. Attached is a compiling but otherwise untested backport of the new logic. -- Andre > Normal ACKing for driving congestion control shouldn't be impacted > by either of these suggested changes. > > David. > > 1227622713.276609 172.16.2.2.5002 > 172.16.1.51.39077: . ack 144097745 win 40798 (DF) > 1227622713.276611 172.16.2.2.5002 > 172.16.1.51.39077: . ack 144097745 win 40830 (DF) > 1227622713.276613 172.16.2.2.5002 > 172.16.1.51.39077: . ack 144097745 win 40862 (DF) > 1227622713.276615 172.16.2.2.5002 > 172.16.1.51.39077: . ack 144097745 win 40894 (DF) > 1227622713.276852 172.16.2.2.5002 > 172.16.1.51.39077: . ack 144097745 win 40926 (DF) > 1227622713.276855 172.16.2.2.5002 > 172.16.1.51.39077: . ack 144097745 win 40958 (DF) > 1227622713.296585 172.16.2.2.5002 > 172.16.1.51.39077: . ack 144100641 win 40956 (DF) --------------090402010905090904010200 Content-Type: text/plain; name="tcp_wu_fix-20081130.diff" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="tcp_wu_fix-20081130.diff" Index: tcp_output.c =================================================================== RCS file: /home/ncvs/src/sys/netinet/tcp_output.c,v retrieving revision 1.158 diff -u -p -r1.158 tcp_output.c --- tcp_output.c 27 Nov 2008 13:19:42 -0000 1.158 +++ tcp_output.c 30 Nov 2008 22:58:46 -0000 @@ -539,14 +539,32 @@ after_sack_rexmit: } /* - * Compare available window to amount of window - * known to peer (as advertised window less - * next expected input). If the difference is at least two - * max size segments, or at least 50% of the maximum possible - * window, then want to send a window update to peer. + * Compare available window to amount of window known to peer + * (as advertised window less next expected input) and decide + * if we have to send a pure window update segment. + * + * When a delayed ACK is scheduled, do nothing. It will update + * the window anyway in a few milliseconds. + * + * If the receive socket buffer has less than 1/4 of space + * available and if the difference is at least two max size + * segments, send an immediate window update to peer. + * + * Otherwise if the difference is 1/8 (or more) of the receive + * socket buffer, or at least 1/2 of the maximum possible window, + * then we send a window update too. + * * Skip this if the connection is in T/TCP half-open state. * Don't send pure window updates when the peer has closed * the connection and won't ever send more data. + * + * See RFC793, Section 3.7, page 43, Window Management Suggestions + * See RFC1122: Section 4.2.3.3, When to Send a Window Update + * + * Note: We are less aggressive with sending window update than + * recommended in RFC1122. This is fine with todays large socket + * buffers and will not stall the peer. In addition we piggy back + * window update on regular ACKs and sends. */ if (recwin > 0 && !(tp->t_flags & TF_NEEDSYN) && !TCPS_HAVERCVDFIN(tp->t_state)) { @@ -554,14 +572,24 @@ after_sack_rexmit: * "adv" is the amount we can increase the window, * taking into account that we are limited by * TCP_MAXWIN << tp->rcv_scale. + * + * NB: adv must be equal or larger than the smallest + * unscaled window increment. */ long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) - (tp->rcv_adv - tp->rcv_nxt); - if (adv >= (long) (2 * tp->t_maxseg)) - goto send; - if (2 * adv >= (long) so->so_rcv.sb_hiwat) - goto send; + if (!(tp->t_flags & TF_DELACK) && + adv >= (long)0x1 << tp->rcv_scale) { + if (recwin <= (long)(so->so_rcv.sb_hiwat / 4) && + adv >= (long)(2 * tp->t_maxseg)) + goto send; + if (adv >= (long)(so->so_rcv.sb_hiwat / 8) && + adv >= (long)tp->t_maxseg) + goto send; + if (2 * adv >= (long)so->so_rcv.sb_hiwat) + goto send; + } } /* --------------090402010905090904010200--