From owner-freebsd-current@FreeBSD.ORG  Mon Nov  2 15:52:36 2009
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4B5D51065695
	for <freebsd-current@freebsd.org>; Mon,  2 Nov 2009 15:52:36 +0000 (UTC)
	(envelope-from weldon@excelsusphoto.com)
Received: from mx0.excelsus.net (emmett.excelsus.com [74.93.113.252])
	by mx1.freebsd.org (Postfix) with ESMTP id F13E78FC21
	for <freebsd-current@freebsd.org>; Mon,  2 Nov 2009 15:52:35 +0000 (UTC)
Received: (qmail 62060 invoked by uid 89); 2 Nov 2009 15:52:33 -0000
Received: from unknown (HELO localhost) (127.0.0.1)
	by localhost.excelsus.com with SMTP; 2 Nov 2009 15:52:33 -0000
Date: Mon, 2 Nov 2009 10:52:33 -0500 (EST)
From: Weldon S Godfrey 3 <weldon@excelsusphoto.com>
X-X-Sender: weldon@emmett.excelsus.com
To: freebsd-current@freebsd.org
Message-ID: <alpine.BSF.2.00.0911020747560.80499@emmett.excelsus.com>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII
X-Mailman-Approved-At: Mon, 02 Nov 2009 17:39:32 +0000
Subject: FreeBSD 8.0 - network stack crashes?
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 02 Nov 2009 15:52:36 -0000


Up until yesterday, we had been running FreeBSD-CURRENT from 12/08.  A
couple of months ago we started to see some very odd network behavior.
Something happens to the stack that causes processes accessing the network
to just hang.  After the problem happens, you usually (but not always)
can't ssh in.  You can never ssh or telnet out, and nothing can access
the NFS shares on the server.  You can ping everything from the server,
and pinging hostnames that are not cached or in the hosts table still
resolves, but ssh fails even when you use just the IP address.  You can't
even do a route add.  When you try to ssh out or do a route add from the
box, the process just hangs.  You can't Ctrl-C it at all; it hangs
forever.  There is nothing in dmesg or messages to indicate an issue.  I
have tried bringing the interfaces down and back up.  In CURRENT-12/08,
that may allow things to work for about 30 seconds.

We upgraded to 8.0-RC2 yesterday and, at first, the problem appeared to
happen a lot more often.  We suspected that was related to the increase
in network performance.  At least in 8.0-RC2, I did see a large number of
input errors with netstat -in on the heavily loaded interface before it
started locking up.  I have replaced the ethernet cable and moved ports.
The Catalyst 3650 never records any errors.  Once our load kicked in this
morning, the problem recurred within about 5 minutes.


One change in this upgrade: we switched from NFS v2 to v3.  When we
downgraded to the previous OS, we stayed at v3.  The problem was just
about as bad with v3 on the 12/08 OS.

We went back to RC2 with NFS v2 and it appeared to stabilize to a degree.
It ran for about an hour and a half, and then the issue came up again.

We are currently back on the 12/08 version using NFS v2 and watching things.

We are using a Dell PowerEdge 2950 III.  The problem happens both with the
onboard NICs using the bce driver and with an Intel card using the em
driver.

I am hunting down any MTU/duplex/speed problems that could cause it
(haven't found any so far).  Of course, problems on the network
shouldn't (ideally) freak out the network stack on the server.  I don't
know how to troubleshoot this further on the server, since nothing shows
up in the logs and there are no panics, cores, etc.

Any help is appreciated.

Thanks,

Weldon