From: Christian Gusenbauer
To: John Baldwin
Cc: freebsd-fs@freebsd.org, yongari@freebsd.org
Subject: [SOLVED] Re: 9.1-stable crashes while copying data from a NFS mounted directory
Date: Mon, 4 Feb 2013 17:05:31 +0100

On Thursday 24 January 2013 23:21:50 John Baldwin wrote:
> On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote:
> > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote:
> > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
> > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
> > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
> > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
> > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:
> > > > > > > > Hi!
> > > > > > > >
> > > > > > > > I'm using 9.1 stable svn revision 245605 and I get the panic
> > > > > > > > below if I execute the following commands (as single user):
> > > > > > > >
> > > > > > > > # swapon -a
> > > > > > > > # dumpon /dev/ada0s3b
> > > > > > > > # mount -u /
> > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up
> > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt
> > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp
> > > > > > > >
> > > > > > > > then the system panics almost immediately. I'll attach the
> > > > > > > > stack trace.
> > > > > > > >
> > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a 1Gbit
> > > > > > > > network, maybe that's the cause for the panic, because the
> > > > > > > > bcopy (see stack frame #15) fails.
> > > > > > > >
> > > > > > > > Any clues?
> > > > > > >
> > > > > > > I tried a similar operation with the nfs mount of rsize=32768
> > > > > > > and mtu 6144, but the machine runs HEAD and em instead of age.
> > > > > > > I was unable to reproduce the panic on the copy of the 5GB
> > > > > > > file from nfs mount.
> > > > >
> > > > > Hmmm, I did a quick test. If I do not change the MTU, so just
> > > > > configuring age0 with
> > > > >
> > > > > # ifconfig age0 inet 192.168.2.2 up
> > > > >
> > > > > then I can copy all files from the mounted directory without any
> > > > > problems, too. So it's probably age0 related?
> > > >
> > > > From your backtrace and the buffer printout, I see somewhat strange
> > > > thing. The buffer data address is 0xffffff8171418000, while kernel
> > > > faulted at the attempt to write at 0xffffff8171413000, which is
> > > > lower than the buffer data pointer, at the attempt to bcopy to the
> > > > buffer.
> > > >
> > > > The other data suggests that there were no overflow of the data from
> > > > the server response. So it might be that mbuf_len(mp) returned
> > > > negative number? I am not sure is it possible at all.
> > > >
> > > > Try this debugging patch, please. You need to add INVARIANTS etc to
> > > > the kernel config.
> > > >
> > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
> > > > index efc0786..9a6bda5 100644
> > > > --- a/sys/fs/nfs/nfs_commonsubs.c
> > > > +++ b/sys/fs/nfs/nfs_commonsubs.c
> > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
> > > >  				}
> > > >  				mbufcp = NFSMTOD(mp, caddr_t);
> > > >  				len = mbuf_len(mp);
> > > > +				KASSERT(len > 0, ("len %d", len));
> > > >  			}
> > > >  			xfer = (left > len) ? len : left;
> > > >  #ifdef notdef
> > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
> > > >  			uiop->uio_resid -= xfer;
> > > >  		}
> > > >  		if (uiop->uio_iov->iov_len <= siz) {
> > > > +			KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
> > > > +			    uiop->uio_iovcnt));
> > > >  			uiop->uio_iovcnt--;
> > > >  			uiop->uio_iov++;
> > > >  		} else {
> > > >
> > > > I thought that server have returned too long response, but it seems
> > > > to be not the case from your data. Still, I think the patch below
> > > > might be due.
> > > >
> > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c b/sys/fs/nfsclient/nfs_clrpcops.c
> > > > index be0476a..a89b907 100644
> > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c
> > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c
> > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct ucred *cred,
> > > >  			NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
> > > >  			eof = fxdr_unsigned(int, *tl);
> > > >  		}
> > > > -		NFSM_STRSIZ(retlen, rsize);
> > > > +		NFSM_STRSIZ(retlen, len);
> > > >  		error = nfsm_mbufuio(nd, uiop, retlen);
> > > >  		if (error)
> > > >  			goto nfsmout;
> > >
> > > I applied your patches and now I get a
> > >
> > > panic: len -4
> > > cpuid = 1
> > > KDB: enter: panic
> > > Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94%
> >
> > This means that the age driver either produced corrupted mbuf chain,
> > or filled wrong negative value into the mbuf len field. I am quite
> > certain that the issue is in the driver.
> >
> > I added the net@ to Cc:, hopefully you could get help there.
>
> And I've cc'd Pyun who has written most of this driver and is likely the
> one most familiar with its handling of jumbo frames.

Hi All!

I was in contact with Pyun. We quickly found out that it is indeed a driver
problem. Pyun solved it and will commit the fix within the next few days.

There's only one (minor) open issue, and I cannot tell whether it really is
one: Konstantin sent me an initial patch for the NFS code where he added a
KASSERT(uiop->uio_iovcnt > 1), which triggers even with Pyun's fix. Without
that assert my tests show no problem at all. So is this a problem?

Thanks guys (especially Pyun) for helping & fixing!

Ciao,
Christian.
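
For reference, the failure mode behind "panic: len -4" can be modeled in user space. The sketch below is only an illustration under stated assumptions, not the kernel code: struct seg and copy_chain() are hypothetical names standing in for an mbuf chain and for the copy loop in nfsm_mbufuio(). It applies the same "xfer = (left > len) ? len : left" selection as the quoted code, and it rejects a non-positive segment length up front, which is the condition the KASSERT(len > 0) in the debugging patch checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one mbuf in a chain: data pointer plus signed length. */
struct seg {
	const char *data;
	int len;
};

/*
 * Copy up to `want` bytes from `nseg` segments into `dst`.
 * Returns the number of bytes copied, or -1 if a segment reports a
 * non-positive length (assumption: such a chain is treated as corrupt).
 */
static int
copy_chain(const struct seg *segs, size_t nseg, char *dst, int want)
{
	int copied = 0;

	assert(want >= 0);
	for (size_t i = 0; i < nseg && copied < want; i++) {
		int len = segs[i].len;
		int left = want - copied;
		int xfer;

		/* Same condition as KASSERT(len > 0, ...) in the debugging patch. */
		if (len <= 0)
			return (-1);

		/* Same selection as in the quoted code. */
		xfer = (left > len) ? len : left;
		memcpy(dst + copied, segs[i].data, (size_t)xfer);
		copied += xfer;
	}
	return (copied);
}
```

With a second segment whose length is -4 (the value from the panic message), copy_chain() returns -1 instead of handing a negative length to the copy routine. Without the check, xfer would become -4 and the running offset would move backwards, which is at least consistent with the report of a write below the buffer's data address.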