Date: Thu, 1 May 2014 11:20:57 -0700
From: David Wolfskill
To: Kirk McKusick
Subject: Re: SU+J: 185 processes in state "suspfs" for >8 hrs. ... not good, right?
Message-ID: <20140501182057.GJ1120@albert.catwhisker.org>
References: <20140501161856.GH1120@albert.catwhisker.org> <201405011651.s41GphgX089174@chez.mckusick.com>
In-Reply-To: <201405011651.s41GphgX089174@chez.mckusick.com>
Cc: fs@freebsd.org
Reply-To: fs@freebsd.org, David Wolfskill
List-Id: Filesystems

On Thu, May 01, 2014 at 09:51:43AM -0700, Kirk McKusick wrote:
> ...
>
> The following fix for related problems was made to head and MFC'ed
> to stable/10 but not stable/9.
>
> *** stable/9/sys/ufs/ffs/ffs_vnops.c	2014-03-05 08:51:48.000000000 -0800
> --- stable/9/sys/ufs/ffsffs_vnops.c	2014-05-01 09:41:35.000000000 -0700
> ***************
> *** 258,266 ****
>   			continue;
>   		if (bp->b_lblkno > lbn)
>   			panic("ffs_syncvnode: syncing truncated data.");
> ! 		if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL))
>   			continue;
> - 		BO_UNLOCK(bo);
>   		if ((bp->b_flags & B_DELWRI) == 0)
>   			panic("ffs_fsync: not dirty");
>   		/*
> --- 258,274 ----
>   			continue;
>   		if (bp->b_lblkno > lbn)
>   			panic("ffs_syncvnode: syncing truncated data.");
> ! 		if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL) == 0) {
> ! 			BO_UNLOCK(bo);
> ! 		} else if (wait != 0) {
> ! 			if (BUF_LOCK(bp,
> ! 			    LK_EXCLUSIVE | LK_SLEEPFAIL | LK_INTERLOCK,
> ! 			    BO_LOCKPTR(bo)) != 0) {
> ! 				bp->b_vflags &= ~BV_SCANNED;
> ! 				goto next;
> ! 			}
> ! 		} else
>   			continue;
>   		if ((bp->b_flags & B_DELWRI) == 0)
>   			panic("ffs_fsync: not dirty");
>   		/*
>
> The associated comment is:
>
>   If we fail to do a non-blocking acquire of a buf lock while doing a
>   waiting sync pass we need to do a blocking acquire and restart.
>   Another thread, typically the buf daemon, may have this buf locked and
>   if we don't wait we can fail to sync the file.  This lead to a great
>   variety of softdep panics and deadlocks because we rely on all
>   dependencies being flushed before proceeding in several cases.
>
> Let me know if it helps your problem. If it does, I will MFC it to 9.
> There have been several other fixes made to SU+J that are more likely
> to be the cause of your problem, but they are not easily back-ported
> to stable/9. So if this does not fix your problem my only suggestions
> are to turn off journaling or move to running on stable/10.
> ...

Hrrrmmm... Looks as if the above reflects stable/10's r251171 (in
particular, "Convert the bufobj lock to rwlock.") -- stable/9 doesn't
seem to know about BO_LOCKPTR(), and gcc makes some assumptions.

That doesn't turn out well.

I think that migrating to stable/10 might make more sense than figuring
out how to fix this, especially if there are other causes of the
observed failure that are fixed in stable/10.

Thanks....

Peace,
david
-- 
David H. Wolfskill				david@catwhisker.org
Taliban: Evil cowards with guns afraid of truth from a 14-year old girl.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.
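
For readers who want the gist of the quoted patch without the kernel
context: the comment describes trying a non-blocking lock first and, only
on a waiting sync pass, falling back to a blocking acquire followed by a
restart. The sketch below is a rough userspace analogue using POSIX
mutexes, not the ffs_syncvnode() code itself. The function name
lock_for_sync() and the enum values are hypothetical, and the
lock-then-unlock step only approximates what LK_SLEEPFAIL does in the
kernel.

```c
#include <pthread.h>

/* Hypothetical result codes for this sketch; these are not FreeBSD names. */
enum lock_result {
	LOCK_ACQUIRED,	/* caller now holds the lock */
	LOCK_RESTART,	/* we waited for the holder; caller should rescan */
	LOCK_SKIPPED	/* lock was busy and this is a non-waiting pass */
};

/*
 * Try the lock without blocking.  On a waiting pass (wait != 0), fall
 * back to a blocking acquire, release the lock again, and tell the
 * caller to restart its scan.  On a non-waiting pass, skip the item.
 */
static enum lock_result
lock_for_sync(pthread_mutex_t *m, int wait)
{
	if (pthread_mutex_trylock(m) == 0)
		return (LOCK_ACQUIRED);
	if (wait == 0)
		return (LOCK_SKIPPED);
	/* Sleep until the current holder releases the lock ... */
	if (pthread_mutex_lock(m) != 0)
		return (LOCK_SKIPPED);
	/* ... then drop it and have the caller rescan from a known state. */
	pthread_mutex_unlock(m);
	return (LOCK_RESTART);
}
```

The pre-patch behaviour corresponds to returning LOCK_SKIPPED even when
wait is non-zero, which is how a buffer held by another thread could be
left unsynced on a waiting pass.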