From owner-freebsd-net@FreeBSD.ORG  Sun Aug 17 19:15:20 2008
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 26DB61065671
	for <freebsd-net@freebsd.org>; Sun, 17 Aug 2008 19:15:20 +0000 (UTC)
	(envelope-from freebsd@chrisbuechler.com)
Received: from mail.livebsd.com (mail.livebsd.com [69.64.6.14])
	by mx1.freebsd.org (Postfix) with SMTP id D80348FC16
	for <freebsd-net@freebsd.org>; Sun, 17 Aug 2008 19:15:19 +0000 (UTC)
	(envelope-from freebsd@chrisbuechler.com)
Received: (qmail 92550 invoked by uid 89); 17 Aug 2008 19:15:18 -0000
Received: from unknown (HELO ?10.0.64.15?) (74.130.92.110)
	by 172.29.29.14 with SMTP; 17 Aug 2008 19:15:18 -0000
Message-ID: <48A878C6.9000001@chrisbuechler.com>
Date: Sun, 17 Aug 2008 15:15:18 -0400
From: Chris Buechler <freebsd@chrisbuechler.com>
User-Agent: Thunderbird 2.0.0.14 (Windows/20080421)
MIME-Version: 1.0
To: freebsd-net@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: repeatable scp stalls from 7.0 to 7.0
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 17 Aug 2008 19:15:20 -0000

I've been seeing pretty frequent and repeatable scp stalls between two 
FreeBSD 7.0 servers (7.0-RELEASE-p2 to be exact) on a 100 Mb LAN. 
They're two HP servers, an Opteron 275 and a dual Xeon 3.4 (don't recall 
the models but I can get them if it's relevant) using the onboard bge(4) 
cards. The client side (builder7) SCPs a file to the server side 
(hosting7) about 20 times a day. The stall happens about 2-4 times a 
week or so, and has happened ever since we put these two boxes online in 
their current functions. Initially they ran the original 7.0 release, 
prior to the TCP fix in June. The behavior has been the same both before 
and after that fix. Aside from this, there are no apparent network 
issues with either of the boxes.

Since we had nothing to go on other than scp sessions going to "stalled" 
(no relevant logs), I set up a tcpdump on each end, filtering on the TCP 
port 22 traffic between these hosts and grabbing only the first 100 bytes 
of each frame to avoid chewing up too much disk space. When it happened 
again I split the tail end of each capture out into its own file with 
editcap, 4.2-4.3 MB each.
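
For anyone wanting to reproduce the capture setup described above, here is 
a minimal sketch of how such a tcpdump invocation could be assembled. It 
is not the exact command that was run: the interface name bge0, the output 
filename, and the helper name build_tcpdump_argv are assumptions made for 
illustration. Only the 100-byte snap length, the port 22 filter, and the 
two host addresses come from this message.

```python
import shlex


def build_tcpdump_argv(iface, host_a, host_b, outfile, snaplen=100, port=22):
    """Return a tcpdump argv list limited to one TCP port between two hosts.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # BPF filter: only traffic between the two hosts on the given TCP port.
    capture_filter = f"host {host_a} and host {host_b} and tcp port {port}"
    return [
        "tcpdump",
        "-i", iface,          # capture interface (bge0 is an assumption)
        "-s", str(snaplen),   # keep only the first snaplen bytes of each frame
        "-w", outfile,        # write raw packets to a pcap file
        capture_filter,
    ]


if __name__ == "__main__":
    argv = build_tcpdump_argv(
        "bge0", "172.29.29.170", "172.29.29.181", "scp-stall.pcap"
    )
    # Print the command line instead of running it; tcpdump needs root.
    print(shlex.join(argv))
```

A 100-byte snap length still covers the Ethernet, IP and TCP headers, so 
the advertised window of every segment remains visible in the capture.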

http://chrisbuechler.com/temp/lastcut-hosting7.pcap - server end, 
capture taken on host but destination IP is a jail
http://chrisbuechler.com/temp/lastcut-builder7.pcap - client end, 
connection is initiated from the host, no jails involved.

The TCP window advertised in the ACKs from server to client starts 
shrinking [1] until it reaches 0. From that point on, everything the 
server (172.29.29.181) sends back to the client (172.29.29.170) 
advertises a window of 0. Restarting the scp makes it work again. It 
doesn't happen every time, only somewhere around 2-3% of transfers. I 
don't see any cause for the shrinking window in those captures, but 
maybe I'm missing something.

[1] lastcut-hosting7.pcap frame #21298; lastcut-builder7.pcap frame #25088
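
To locate the zero-window segments in a capture without stepping through 
it by hand, something like the following could be used. This is a rough 
sketch under stated assumptions, not a tool that was used for the analysis 
above: it assumes a classic (non-pcapng) pcap file with Ethernet II 
framing and untagged IPv4, and the function name zero_window_frames is 
made up for illustration. It looks at the raw 16-bit window field only, 
which is enough here because a window of 0 is 0 whatever scale factor was 
negotiated.

```python
import socket
import struct


def zero_window_frames(pcap_bytes, src_ip):
    """Return 1-based frame numbers of TCP segments from src_ip with window 0.

    Hypothetical helper; assumes classic pcap, Ethernet II, untagged IPv4.
    """
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"          # file written on a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"          # file written on a big-endian host
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    (linktype,) = struct.unpack_from(endian + "I", pcap_bytes, 20)
    if linktype != 1:
        raise ValueError("only Ethernet (linktype 1) captures are handled")

    src = socket.inet_aton(src_ip)
    hits = []
    offset = 24               # skip the 24-byte global header
    frame_no = 0
    while offset + 16 <= len(pcap_bytes):
        # Per-record header: ts_sec, ts_usec, captured length, original length.
        _, _, incl_len, _ = struct.unpack_from(endian + "IIII", pcap_bytes, offset)
        offset += 16
        pkt = pcap_bytes[offset:offset + incl_len]
        offset += incl_len
        frame_no += 1

        # Need Ethernet (14) + minimal IPv4 (20) and an IPv4 ethertype.
        if len(pkt) < 34 or pkt[12:14] != b"\x08\x00":
            continue
        ihl = (pkt[14] & 0x0F) * 4            # IPv4 header length in bytes
        if pkt[23] != 6 or pkt[26:30] != src:  # protocol 6 is TCP
            continue
        tcp = 14 + ihl
        if len(pkt) < tcp + 16:
            continue
        if pkt[tcp + 13] & 0x04:               # skip RSTs, window is not meaningful
            continue
        (window,) = struct.unpack_from("!H", pkt, tcp + 14)
        if window == 0:
            hits.append(frame_no)
    return hits


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as fh:
        print(zero_window_frames(fh.read(), sys.argv[2]))
```

Run against lastcut-hosting7.pcap with 172.29.29.181 as the source 
address, this would list the frames in which the server advertises a zero 
window, which can then be compared with the frame numbers in [1].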

These are both very stock boxes, GENERIC kernels, no significant changes 
in sysctl or anything else. I'm not sure where to go from here; any 
assistance in resolving this would be appreciated.

cheers,
Chris