From owner-freebsd-hackers Wed Nov 28 21:42:45 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from exuma.irbs.com (exuma.irbs.com [216.86.160.252])
    by hub.freebsd.org (Postfix) with ESMTP id C9C2237B417
    for ; Wed, 28 Nov 2001 21:42:39 -0800 (PST)
Received: by exuma.irbs.com (Postfix, from userid 2500)
    id 1CE0A17406; Thu, 29 Nov 2001 00:42:34 -0500 (EST)
Date: Thu, 29 Nov 2001 00:42:34 -0500
From: John Capo
To: freebsd-hackers@freebsd.org
Subject: Re: FreeBSD performing worse than Linux?
Message-ID: <20011129004234.A16101@exuma.irbs.com>
Reply-To: jc@irbs.com
References: <20011128153817.T61580@monorchid.lemis.com> <15364.38174.938500.946169@caddis.yogotech.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2i
In-Reply-To: <15364.38174.938500.946169@caddis.yogotech.com>; from nate@yogotech.com on Wed, Nov 28, 2001 at 12:41:18AM -0700
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

I started noticing some TCP weirdness when I moved my bandwidth stats site from my office to my colo facility last week. The colo is five miles away by road and 1200 miles away by network. Netscape would stop for seconds at a time while loading the graph images but there was no consistency. Worked properly sometimes and sometimes not.

I also noticed a delay when dumping the contents of my spam reject db with a perl program. Output would pause for a second, start for a second, pause for a second, and so on. Piping the perl script to cat produces continuous output.

I dismissed this behavior as network oddity since the web sites on my machines seemed to be running just fine. Now this thread comes along and I realize there is something wrong so I did a little testing.

find / -print on one of my servers in an ssh session will fill the pipe to my office, 256K frame, and run nicely, then get into the starting and stopping mode after a good amount of data has been sent. find / -print | dd obs=1 will screw up within a few seconds and stay that way.

Netstat in another ssh session shows data ready to go:

tcp4       0  15928  server.22              client.4427            ESTABLISHED

This is a fragment from a dump on the server side while running find / -print | dd obs=1:

21:41:46.328381 client.4427 > server.22: . ack 11249 win 17328 (DF) [tos 0x10]
21:41:46.335863 client.4427 > server.22: . ack 11345 win 17328 (DF) [tos 0x10]
21:41:46.342216 client.4427 > server.22: . ack 11441 win 17328 (DF) [tos 0x10]
21:41:46.396051 client.4427 > server.22: . ack 11489 win 17376 (DF) [tos 0x10]
21:41:46.418208 client.4427 > server.22: . ack 11489 win 17376 (DF) [tos 0x10]
21:41:47.460903 server.22 > client.4427: . 11489:12937(1448) ack 144 win 17376 (DF) [tos 0x10]
21:41:47.569133 client.4427 > server.22: . ack 12937 win 15928 (DF) [tos 0x10]
21:41:49.001039 client.4427 > server.22: P 144:192(48) ack 12937 win 17376 (DF) [tos 0x10]
21:41:49.001073 server.22 > client.4427: . 28049:29497(1448) ack 192 win 17328 (DF) [tos 0x10]
21:41:49.001085 server.22 > client.4427: P 29497:30313(816) ack 192 win 17328 (DF) [tos 0x10]
21:41:49.109131 client.4427 > server.22: . ack 12937 win 17376 (DF) [tos 0x10]

It's been a while since I have had to analyze TCP dumps but it looks to me like the server received an ack at 21:41:47.569133 for byte 12937 but the server did not resume transmission till the duplicate ack at 21:41:49.001039. The starting and stopping continues every few seconds. The only other interesting thing I see is the client sending duplicate acks for byte 11489.

Running netstat -p tcp -s on the server shows a retransmit timeout for each output pause.

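The gaps in the dump fragment above can be picked out mechanically rather than by eye. What follows is only a minimal Python sketch, not part of the original test setup. It assumes the old-style tcpdump text format shown in the fragment. The names TRACE, parse, find_stalls and find_duplicate_acks, and the 0.5 second threshold, are made up for illustration.

```python
import re
from datetime import datetime

# Trace lines copied from the server-side dump fragment in this message.
TRACE = """\
21:41:46.328381 client.4427 > server.22: . ack 11249 win 17328 (DF) [tos 0x10]
21:41:46.335863 client.4427 > server.22: . ack 11345 win 17328 (DF) [tos 0x10]
21:41:46.342216 client.4427 > server.22: . ack 11441 win 17328 (DF) [tos 0x10]
21:41:46.396051 client.4427 > server.22: . ack 11489 win 17376 (DF) [tos 0x10]
21:41:46.418208 client.4427 > server.22: . ack 11489 win 17376 (DF) [tos 0x10]
21:41:47.460903 server.22 > client.4427: . 11489:12937(1448) ack 144 win 17376 (DF) [tos 0x10]
21:41:47.569133 client.4427 > server.22: . ack 12937 win 15928 (DF) [tos 0x10]
21:41:49.001039 client.4427 > server.22: P 144:192(48) ack 12937 win 17376 (DF) [tos 0x10]
21:41:49.001073 server.22 > client.4427: . 28049:29497(1448) ack 192 win 17328 (DF) [tos 0x10]
21:41:49.001085 server.22 > client.4427: P 29497:30313(816) ack 192 win 17328 (DF) [tos 0x10]
21:41:49.109131 client.4427 > server.22: . ack 12937 win 17376 (DF) [tos 0x10]
"""

# Assumption: every line looks like the ones above
# (time, src > dst:, flags, optional start:end(len), ack N, win N).
LINE_RE = re.compile(
    r"^(?P<ts>\d\d:\d\d:\d\d\.\d{6}) (?P<src>\S+) > (?P<dst>\S+): "
    r"(?P<flags>\S+) (?:(?P<start>\d+):(?P<end>\d+)\((?P<len>\d+)\) )?"
    r"ack (?P<ack>\d+) win (?P<win>\d+)"
)


def parse(trace):
    """Turn the text dump into a list of small dicts, skipping unmatched lines."""
    packets = []
    for line in trace.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        packets.append({
            "time": datetime.strptime(m["ts"], "%H:%M:%S.%f"),
            "src": m["src"],
            "len": int(m["len"]) if m["len"] else 0,
            "ack": int(m["ack"]),
        })
    return packets


def find_stalls(packets, threshold=0.5):
    """Return (gap_seconds, packet_after_gap) for every gap above threshold."""
    stalls = []
    for prev, cur in zip(packets, packets[1:]):
        gap = (cur["time"] - prev["time"]).total_seconds()
        if gap > threshold:
            stalls.append((round(gap, 6), cur))
    return stalls


def find_duplicate_acks(packets, sender="client.4427"):
    """Return ack numbers repeated in consecutive zero-length segments from sender."""
    dups = []
    last_ack = None
    for p in packets:
        if p["src"] != sender or p["len"] != 0:
            continue  # only pure acks from the chosen side are compared
        if p["ack"] == last_ack:
            dups.append(p["ack"])
        last_ack = p["ack"]
    return dups


if __name__ == "__main__":
    pkts = parse(TRACE)
    for gap, pkt in find_stalls(pkts):
        print("gap of %.3f s before packet from %s" % (gap, pkt["src"]))
    print("repeated acks:", find_duplicate_acks(pkts))
```

On this fragment the sketch flags two gaps, of roughly 1.04 and 1.43 seconds, and repeated acks for 11489 and 12937. It compares time-of-day stamps only, so a capture that crosses midnight would need the date as well.
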
Full TCP stats:

    689765 packets sent
        208566 data packets (90677298 bytes)
        1046 data packets (1187590 bytes) retransmitted
        1 resend initiated by MTU discovery
        292504 ack-only packets (21123 delayed)
        0 URG only packets
        11551 window probe packets
        139170 window update packets
        36928 control packets
    906752 packets received
        167629 acks (for 90004170 bytes)
        10803 duplicate acks
        0 acks for unsent data
        706255 packets (792771342 bytes) received in-sequence
        468 completely duplicate packets (5045 bytes)
        15 old duplicate packets
        10 packets with some dup. data (202 bytes duped)
        480 out-of-order packets (241868 bytes)
        6 packets (6 bytes) of data after window
        6 window probes
        3812 window update packets
        33 packets received after close
        2 discarded for bad checksums
        0 discarded for bad header offset fields
        0 discarded because packet too short

There are no ip errors.

I see exactly the same behavior on 3 -stable machines running kernels from late October and early November. Another -stable machine with a kernel from late September does pause but not as consistently as the later kernel machines do. The client machine is running a kernel from early November.

Fxp cards nailed at 100Mb/s full duplex in all machines connected to a Cisco 2924 with all ports nailed at 100Mb/s full duplex. I am not seeing any link level errors on the machines or the switch. The pauses occur with or without newreno.

Another difference between the machine that works better and the others that don't is that the ones that reliably hang are SMP machines. Setting machdep.smp_active=0 does not change anything. Same test works fine on SMP machines in my office with kernels from the same time period.

This is interesting: the same test in an ssh session from a 4.3-BETA machine to the same server pauses very briefly every minute or so, but that could be a true dropped packet. I do see the retransmit counter on the server increment at the same rate. Same results with a W98 putty session running in vmware on a -stable machine.

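To put the counters from the full TCP stats in proportion, here is a small Python sketch of the arithmetic. It is only illustrative. The dictionary keys and the helper names percent and summarize are invented here, and the numbers are copied from the netstat output quoted in this message. Those counters cover every connection since boot, not just the find test, so the ratios are a rough indication at best.

```python
# Counters copied from the "netstat -p tcp -s" output in this message.
# Key names are hypothetical labels chosen for this sketch.
STATS = {
    "data_packets": 208566,
    "data_bytes": 90677298,
    "retransmitted_packets": 1046,
    "retransmitted_bytes": 1187590,
    "acks": 167629,
    "duplicate_acks": 10803,
}


def percent(part, whole):
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


def summarize(stats):
    """Express the retransmit and duplicate-ack counters as percentages."""
    return {
        "retransmitted_packets_pct": percent(
            stats["retransmitted_packets"], stats["data_packets"]),
        "retransmitted_bytes_pct": percent(
            stats["retransmitted_bytes"], stats["data_bytes"]),
        "duplicate_acks_pct": percent(
            stats["duplicate_acks"], stats["acks"]),
    }


if __name__ == "__main__":
    for name, value in summarize(STATS).items():
        print("%-28s %6.2f%%" % (name, value))
```

With these counters the retransmitted data packets come to about 0.5% of the data packets sent, the retransmitted bytes to about 1.3% of the data bytes, and the duplicate acks to about 6.4% of the acks received.
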
Something is borked.

John Capo