From: Pawel Jakub Dawidek <pawel@dawidek.net>
To: Warner Losh
Cc: svn-src-projects@freebsd.org, src-committers@freebsd.org
Date: Wed, 12 Jan 2011 09:05:38 +0100
Subject: Re: svn commit: r217284 - projects/graid/head/sys/geom/raid

On Tue, Jan 11, 2011 at 08:52:34PM +0000, Warner Losh wrote:
> Author: imp
> Date: Tue Jan 11 20:52:34 2011
> New Revision: 217284
> URL: http://svn.freebsd.org/changeset/base/217284
> 
> Log:
>   Implement range locking.  When a range of a volume is locked, any
>   writes to that range will be deferred.  Range locking either succeeds
>   right away, or is deferred until all the in-flight writes complete.
>   Once a range is locked, write requests are queued.  These requests are
>   resumed when the range is unlocked.  This is intended to be used while
>   rebuilding a drive, or when doing bad-sector recovery via writing disk
>   sectors that fail to read properly.  Writes to areas of the volume
>   that aren't locked are unaffected.
> 
>   Users of this facility should ensure that they lock small ranges for
>   short periods, to keep too many requests from queueing up (possibly
>   forcing a resource shortage that may prevent the range from being
>   unlocked).  I'm not sure how to trigger this problem, nor what
>   remediation is necessary to prevent/reduce it.

IIRC UFS with SU can generate a big I/O flood under load.

>   All the hooks are in place.  Nothing that I'm checking in uses this
>   facility yet, but in testing so far nothing is affected by it being
>   there unused.
> 
>   Also, create a callback for the RAID1 transform so I can start using
>   it for bad sector recovery and then rebuilds.
[...]
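Just to check that I read the intended usage correctly: I imagine a
rebuild would walk the volume in small chunks, roughly like the sketch
below.  Note that g_raid_lock_range(), the v_mediasize field and
rebuild_copy_chunk() are only my guesses at names; g_raid_unlock_range()
is the one from your patch:

	/*
	 * Rebuild in small chunks, so that writes deferred behind the
	 * lock cannot pile up for long.  64k is just an example size.
	 */
	#define	RCHUNK	(64 * 1024)

	off_t off, len;

	for (off = 0; off < vol->v_mediasize; off += len) {
		len = MIN(RCHUNK, vol->v_mediasize - off);
		/* Assumed counterpart of g_raid_unlock_range(). */
		if (g_raid_lock_range(vol, off, len) != 0)
			break;
		/* Writes to [off, off + len) are deferred from here... */
		rebuild_copy_chunk(vol, off, len);	/* hypothetical helper */
		/* ...and queued writes for the range resume here. */
		g_raid_unlock_range(vol, off, len);
	}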
> +static int
> +g_raid_bio_overlaps(const struct bio *bp, off_t off, off_t len)
> +{
> +	/*
> +	 * 5 cases:
> +	 * (1) bp entirely below                 NO
> +	 * (2) bp entirely above                 NO
> +	 * (3) bp start below, but end in range  YES
> +	 * (4) bp entirely within                YES
> +	 * (5) bp starts within, ends above      YES
> +	 *
> +	 * lock range 10-19 (offset 10 length 10)
> +	 * (1) 1-5: first if kicks it out
> +	 * (2) 30-35: second if kicks it out
> +	 * (3) 5-15: passes both ifs
> +	 * (4) 12-14: passes both ifs
> +	 * (5) 19-20: passes both
> +	 */
> +
> +	if (bp->bio_offset + bp->bio_length - 1 < off)
> +		return (0);
> +	if (bp->bio_offset < off + len - 1)
> +		return (0);
> +	return (1);
> +}

I think the second if should be:

	if (bp->bio_offset > off + len - 1)

(greater than, not less than), or, to make it more consistent with the
first if:

	if (off + len - 1 < bp->bio_offset)

As the code stands, cases (3) and (4) return 0 and case (2) returns 1,
the opposite of what the comment says.  You could also simplify it a
bit further by treating the lock range as half-open, though it probably
looks less obvious:

	static int
	g_raid_bio_overlaps(const struct bio *bp, off_t lstart, off_t lend)
	{
		off_t bstart, bend;

		bstart = bp->bio_offset;
		bend = bstart + bp->bio_length;

		return (bend > lstart && bstart < lend);
	}

	if (g_raid_bio_overlaps(bp, lp->l_offset, lp->l_offset + lp->l_length))
		/* do something */

> +int
> +g_raid_unlock_range(struct g_raid_volume *vol, off_t off, off_t len)
> +{
> +	struct g_raid_lock *lp, *tmp;
> +	struct g_raid_softc *sc;
> +	struct bio *bp;
> +
> +	sc = vol->v_softc;
> +	LIST_FOREACH_SAFE(lp, &vol->v_locks, l_next, tmp) {

No need to use LIST_FOREACH_SAFE() here, as after LIST_REMOVE() you
never continue the loop.

> +		if (lp->l_offset == off && lp->l_length == len) {
> +			LIST_REMOVE(lp, l_next);
> +			/* XXX
> +			 * Right now we just put them all back on the queue
> +			 * and hope for the best.  We hope this because any
> +			 * locked ranges will go right back on this list
> +			 * when the worker thread runs.
> +			 */
> +			mtx_lock(&sc->sc_queue_mtx);
> +			while ((bp = bioq_takefirst(&sc->sc_queue)) != NULL)
> +				bioq_disksort(&sc->sc_queue, bp);

Look out for bioq_disksort(), as it can mess up your ordering in some
cases.  A file system should never send two overlapping writes, so for
file system I/O bioq_disksort() is safe, but in your case imagine a
sync write or a "healing" write overlapping with a regular write:

	regular write: offset=2048, length=2048
	sync write:    offset=0, length=131072

bioq_disksort() will change the order based on offset.  Not sure if it
affects you, as when syncing you should lock the range before you read
and unlock it after the write is done, but it is something to keep in
mind.

> +			free(lp, M_RAID);

Because you will use range locking all the time (not only during sync,
as "healing" writes might occur at any time), you may consider using
UMA.  I'm sure you know that, and also that keeping range locks on a
list won't scale, so you will probably need a faster structure to
implement this in the end.  I can only warn you that I implemented
range locking on a list as well, with the intention of optimizing it
later, but it never happened:)  Although in my case range locking
happens only during synchronization, as I don't implement "I/O
healing".

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
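PS. By using UMA I mean something like the minimal sketch below; a
dedicated zone makes the constant allocation and freeing of struct
g_raid_lock cheaper than malloc(9).  The zone name and where exactly
you create and destroy it are of course up to you:

	#include <vm/uma.h>

	static uma_zone_t g_raid_lock_zone;

	/* Once, e.g. at class/module initialization time: */
	g_raid_lock_zone = uma_zcreate("g_raid_lock",
	    sizeof(struct g_raid_lock), NULL, NULL, NULL, NULL,
	    UMA_ALIGN_PTR, 0);

	/* Instead of malloc(..., M_RAID, ...): */
	lp = uma_zalloc(g_raid_lock_zone, M_WAITOK | M_ZERO);

	/* Instead of free(lp, M_RAID): */
	uma_zfree(g_raid_lock_zone, lp);

	/* And on teardown: */
	uma_zdestroy(g_raid_lock_zone);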