From owner-freebsd-current@FreeBSD.ORG Sun Sep 12 16:25:54 2004
Date: Sun, 12 Sep 2004 12:25:49 -0400 (EDT)
From: Andre Guibert de Bruet
To: Robert Watson
cc: current@FreeBSD.ORG
Subject: Re: 6-CURRENT Network stack issues w/SMP? (Was: Re: TreeListfailed: Network write failure: ChannelMux.ProtocolError)
Message-ID: <20040912110720.D84468@alpha.siliconlandmark.com>
List-Id: Discussions about the use of FreeBSD-current

On Sun, 12 Sep 2004, Robert Watson wrote:

> On Sun, 12 Sep 2004, Andre Guibert de Bruet wrote:
>> On Sun, 12 Sep 2004, Kris Kennaway wrote:
>>> On Sun, Sep 12, 2004 at 02:42:03AM -0400, Andre Guibert de Bruet wrote:
>>>
>>>>> I've also noticed data corruption in the form of failed CRCs (And hence
>>>>> dropped SSH connections) while transferring large amounts of data via SSH
>>>>> over gige to a machine on its subnet. These problems started occuring
>>>>> after the giant-less networking megacommit. Older kernels check out
>>>>> without any such issues.
>>>
>>> Does it go away if you turn off debug.mpsafenet? If not, it's
>>> probably not related to that commit.
>>
>> Setting debug.mpsafenet to 0 allows the SSH transfers to complete. The
>> MD5 checksums and sizes match. Where do we go from here?
>
> I think I'd look at the following next:
>
> - Does your network interface driver support checksum offload? If so,
>   what happens if you disable that?

It appears that it does, based on the options field reported by ifconfig:

nge0: flags=108843 mtu 1500
        options=13

I can still reproduce the problem after passing -rxcsum and -txcsum while
bringing the interface up.

> - Is the network interface driver marked as INTR_MPSAFE and/or not
>   IFF_NEEDSGIANT. If either, try setting the driver to run with Giant by
>   removing INTR_MPSAFE and adding IFF_NEEDSGIANT.

dev/nge/if_nge.c has the interface marked as IFF_NEEDSGIANT, with no trace
of INTR_MPSAFE. My dmesg confirms this: "nge0: [GIANT-LOCKED]"

> After that I think we want to try and produce a non-SSH reproduction
> scenario using a very simple test program...

Attempting to bring a local FreeBSD repo up-to-date causes the issue to
manifest itself. If portupgrade is run and execs a fetch for a large
tarball from a fast mirror (100KB/s+), the problem manifests itself as
well.

I cannot yet make any conclusive determination, but preliminary pattern
analysis seems to indicate that large bursts of network traffic on this
gige interface aid the reproduction of this condition. The machine in
question acts as a dns resolver for my small home network and appears to
handle light amounts of traffic without any issues.

Thanks for the help,
Andy

| Andre Guibert de Bruet | Enterprise Software Consultant >
| Silicon Landmark, LLC. | http://siliconlandmark.com/    >
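
For the non-SSH reproduction suggested above, something along these lines
might serve as a starting point. It is only a rough sketch under stated
assumptions, not a tested reproducer for this bug: it pushes a known byte
pattern over a plain TCP socket in one large burst and compares a digest
computed on each end. The function names, the 64 KB chunk size, the 8 MB
default, and the use of SHA-256 (rather than the MD5 used for the SSH
transfers) are all choices made for illustration.

```python
import hashlib
import socket
import threading

CHUNK = 64 * 1024  # assumed chunk size, chosen only for illustration


def pattern_chunks(total_bytes, chunk=CHUNK):
    """Yield a deterministic byte pattern totalling total_bytes."""
    block = bytes(range(256)) * (chunk // 256)
    sent = 0
    while sent < total_bytes:
        piece = block[: min(chunk, total_bytes - sent)]
        sent += len(piece)
        yield piece


def serve_once(listener, result):
    """Accept one connection and hash everything received until EOF."""
    conn, _ = listener.accept()
    digest = hashlib.sha256()
    received = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            digest.update(data)
            received += len(data)
    result["digest"] = digest.hexdigest()
    result["bytes"] = received


def send_pattern(host, port, total_bytes):
    """Send the pattern as fast as possible; return the sender-side digest."""
    digest = hashlib.sha256()
    with socket.create_connection((host, port)) as sock:
        for piece in pattern_chunks(total_bytes):
            sock.sendall(piece)
            digest.update(piece)
    return digest.hexdigest()


def loopback_check(total_bytes=8 * 1024 * 1024):
    """Self-test: run both ends in one process over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    result = {}
    receiver = threading.Thread(target=serve_once, args=(listener, result))
    receiver.start()
    sent_digest = send_pattern("127.0.0.1", port, total_bytes)
    receiver.join()
    listener.close()
    return sent_digest, result["digest"], result["bytes"]


if __name__ == "__main__":
    sent, got, count = loopback_check()
    print(count, "bytes:", "match" if sent == got else "MISMATCH")
```

The loopback run only checks the program itself and never touches nge0. To
exercise the gige interface, serve_once() would need to run on one machine
and send_pattern() on another host on the same subnet, with a transfer size
large enough to produce the kind of burst described above.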