From: Weldon S Godfrey 3
Date: Mon, 2 Nov 2009 10:52:33 -0500 (EST)
To: freebsd-current@freebsd.org
Subject: FreeBSD 8.0 - network stack crashes?

Up until yesterday, we had been running FreeBSD-CURRENT of 12/08. A couple of months ago we started to see some very odd network behavior. Something happens to the stack that causes processes accessing the network to just hang.

After the problem happens, you usually (but not always) can't ssh in. You can never ssh or telnet out, and nothing can access the NFS shares on the server. You can ping everything from the server. You can't even do a route add, and you can't ssh even if you use just the IP address (although pinging hostnames that it doesn't have cached or in the hosts table does resolve). When you try to ssh out or do a route add from the box, the process just hangs.
You can't Ctrl-C it at all; it hangs forever. There is nothing in dmesg or messages to indicate an issue. I have tried bringing the interfaces down and up. In CURRENT-12/08, that may allow things to work for about 30 seconds.

We upgraded to 8.0-RC2 yesterday and, at first, the problem appeared to happen a lot more often. We assumed that was related to the increase in network performance. At least in 8.0-RC2, I did see a large number of input errors with netstat -in on the heavily loaded interface before the lockups started. I have replaced the Ethernet cable and moved ports. The Catalyst 3650 never records any errors. This morning, the problem would recur within about 5 minutes once our load kicked in.

One change in this upgrade: we switched from NFS v2 to v3. When we downgraded to the previous OS, we stayed at v3. The problem was just about as bad with v3 on the 12/08 OS. We went back to RC2 with NFS v2 and it appeared to stabilize to a degree. It ran for about an hour and a half, and then the issue came up again. We are currently back on the 12/08 version using NFS v2 and watching things.

We are using a Dell PowerEdge 2950-iii. The problem happens both with the onboard NICs using the bce driver and with an Intel card using the em driver. I am hunting down any MTU/duplex/speed problems that could cause it (I haven't found any so far). Of course, problems on the network shouldn't (ideally) freak out the network stack on the server.

I don't know how to troubleshoot this further on the server, since nothing shows up in the logs and there are no panics, cores, etc.

Any help is appreciated.

Thanks,

Weldon