From owner-freebsd-net@FreeBSD.ORG Fri Jan 20 09:22:52 2006 Return-Path: X-Original-To: net@FreeBSD.org Delivered-To: freebsd-net@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id EA62F16A41F for ; Fri, 20 Jan 2006 09:22:51 +0000 (GMT) (envelope-from Alexander@Leidinger.net) Received: from www.ebusiness-leidinger.de (jojo.ms-net.de [84.16.236.246]) by mx1.FreeBSD.org (Postfix) with ESMTP id DBA0843D72 for ; Fri, 20 Jan 2006 09:22:46 +0000 (GMT) (envelope-from Alexander@Leidinger.net) Received: from Andro-Beta.Leidinger.net (p54A5F632.dip.t-dialin.net [84.165.246.50]) (authenticated bits=0) by www.ebusiness-leidinger.de (8.13.1/8.13.1) with ESMTP id k0K9FcxR072681 for ; Fri, 20 Jan 2006 10:15:38 +0100 (CET) (envelope-from Alexander@Leidinger.net) Received: from localhost (localhost [127.0.0.1]) by Andro-Beta.Leidinger.net (8.13.3/8.13.3) with ESMTP id k0K9Mhd6007663 for ; Fri, 20 Jan 2006 10:22:43 +0100 (CET) (envelope-from Alexander@Leidinger.net) Received: from pslux.cec.eu.int (pslux.cec.eu.int [158.169.9.14]) by webmail.leidinger.net (Horde MIME library) with HTTP; Fri, 20 Jan 2006 10:22:43 +0100 Message-ID: <20060120102243.bzzq2uig0kgwksso@netchild.homeip.net> X-Priority: 3 (Normal) Date: Fri, 20 Jan 2006 10:22:43 +0100 From: Alexander Leidinger To: net@FreeBSD.org MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Disposition: inline Content-Transfer-Encoding: 7bit User-Agent: Internet Messaging Program (IMP) H3 (4.0.3) / FreeBSD-4.11 X-Virus-Scanned: by amavisd-new Cc: Subject: In case you haven't noticed this: John Nagle about fixing a problem in TCP X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Jan 2006 09:22:52 -0000 Hi, I just found this in the comments on slashdot (http://developers.slashdot.org/comments.pl?sid=174457&cid=14515105) ---snip--- The trouble with the Nagle algorithm I really should fix the bad interaction between the "Nagle algorithm" and "delayed ACKs". Both ideas went into TCP around the same time, and the interaction is terrible. That fixed timer for ACKs is all wrong. Here's the real problem, and its solution. The concept behind delayed ACKs is to bet, when receiving some data from the net, that the local application will send a reply very soon. So there's no need to send an ACK immediately; the ACK can be piggybacked on the next data going the other way. If that doesn't happen, after a 500ms delay, an ACK is sent anyway. The concept behind the Nagle algorithm is that if the sender is doing very tiny writes (like single bytes, from Telnet), there's no reason to have more than one packet outstanding on the connection. This prevents slow links from choking with huge numbers of outstanding tinygrams. Both are reasonable. But they interact badly in the case where an application does two or more small writes to a socket, then waits for a reply. (X-Windows is notorious for this.) When an application does that, the first write results in an immediate packet send. The second write is held up until the first is acknowledged. But because of the delayed ACK strategy, that acknowledgement is held up for 500ms. This adds 500ms of latency to the transaction, even on a LAN. The real problem is that 500ms unconditional delay. (Why 500ms? That was a reasonable response time for a time-sharing system of the 1980s.) As mentioned above, delaying an ACK is a bet that the local application will reply to the data just received. Some apps, like character echo in Telnet servers, do respond every time. Others, like X-Windows "clients" (really servers, but X is backwards about this), only reply some of the time. TCP has no strategy to decide whether it's winning or losing those bets. That's the real problem. The right answer is that TCP should keep track of whether delayed ACKs are "winning" or "losing". A "win" is when, before the 500ms timer runs out, the application replies. Any needed ACK is then coalesced with the next outgoing data packet. A "lose" is when the 500ms timer runs out and the delayed ACK has to be sent anyway. There should be a counter in TCP, incremented on "wins", and reset to 0 on "loses". Only when the counter exceeds some number (5 or so), should ACKs be delayed. That would eliminate the problem automatically, and the need to turn the "Nagle algorithm" on and off. So that's the proper fix, at the TCP internals level. But I haven't done TCP internals in years, and really don't want to get back into that. If anyone is working on TCP internals for Linux today, I can be reached at the e-mail address above. This really should be fixed, since it's been annoying people for 20 years and it's not a tough thing to fix. The user-level solution is to avoid write-write-read sequences on sockets. write-read-write-read is fine. write-write-write is fine. But write-write-read is a killer. So, if you can, buffer up your little writes to TCP and send them all at once. Using the standard UNIX I/O package and flushing write before each read usually works. John Nagle ---snip--- I've looked at the webpage which is connected to this Slashdot user and they have a "Patents" page. There a "John Nagle" is listed as the inventor of some patents. Bye, Alexander. -- http://www.Leidinger.net Alexander @ Leidinger.net: PGP ID = B0063FE7 http://www.FreeBSD.org netchild @ FreeBSD.org : PGP ID = 72077137 "Oh no, not again." -- A bowl of petunias on it's way to certain death.