From: Pawel Jakub Dawidek <pawel@dawidek.net>
To: Warner Losh
Cc: svn-src-projects@freebsd.org, src-committers@freebsd.org
Date: Wed, 12 Jan 2011 09:05:38 +0100
Subject: Re: svn commit: r217284 - projects/graid/head/sys/geom/raid

On Tue, Jan 11, 2011 at 08:52:34PM +0000, Warner Losh wrote:
> Author: imp
> Date: Tue Jan 11 20:52:34 2011
> New Revision: 217284
> URL: http://svn.freebsd.org/changeset/base/217284
> 
> Log:
>   Implement range locking.  When a range of a volume is locked, any
>   writes to that range will be deferred.  Range locking either succeeds
>   right away, or is deferred until all the in-flight writes complete.
>   Once a range is locked, write requests are queued.  These requests are
>   resumed when the range is unlocked.  This is intended to be used while
>   rebuilding a drive, or when doing bad-sector recovery via writing disk
>   sectors that fail to read properly.  Writes to areas of the volume
>   that aren't locked are unaffected.
> 
>   Users of this facility should ensure that they lock small ranges for
>   short periods, to keep too many requests from queueing up (possibly
>   forcing a resource shortage that may prevent the range from being
>   unlocked).  I'm not sure how to trigger this problem, nor what
>   remediation is necessary to prevent/reduce it.

IIRC UFS with SU can generate a big I/O flood under load.

>   All the hooks are in place.  Nothing that I'm checking in uses this
>   facility yet, but in testing so far nothing is affected by it being
>   there unused.
> 
>   Also, create a callback for the RAID1 transform so I can start using
>   it for bad sector recovery and then rebuilds.
[...]
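Just to check that I read the intended usage correctly: I imagine a
rebuild would walk the volume in small chunks, roughly like the sketch
below.  Note that g_raid_lock_range(), the v_mediasize field and
rebuild_copy_chunk() are only my guesses at names; g_raid_unlock_range()
is the one from your patch:

	/*
	 * Rebuild in small chunks, so that writes deferred behind the
	 * lock cannot pile up for long.  64k is just an example size.
	 */
	#define	RCHUNK	(64 * 1024)

	off_t off, len;

	for (off = 0; off < vol->v_mediasize; off += len) {
		len = MIN(RCHUNK, vol->v_mediasize - off);
		/* Assumed counterpart of g_raid_unlock_range(). */
		if (g_raid_lock_range(vol, off, len) != 0)
			break;
		/* Writes to [off, off + len) are deferred from here... */
		rebuild_copy_chunk(vol, off, len);	/* hypothetical helper */
		/* ...and queued writes for the range resume here. */
		g_raid_unlock_range(vol, off, len);
	}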
> +static int
> +g_raid_bio_overlaps(const struct bio *bp, off_t off, off_t len)
> +{
> +	/*
> +	 * 5 cases:
> +	 * (1) bp entirely below                 NO
> +	 * (2) bp entirely above                 NO
> +	 * (3) bp start below, but end in range  YES
> +	 * (4) bp entirely within                YES
> +	 * (5) bp starts within, ends above      YES
> +	 *
> +	 * lock range 10-19 (offset 10 length 10)
> +	 * (1) 1-5: first if kicks it out
> +	 * (2) 30-35: second if kicks it out
> +	 * (3) 5-15: passes both ifs
> +	 * (4) 12-14: passes both ifs
> +	 * (5) 19-20: passes both
> +	 */
> +
> +	if (bp->bio_offset + bp->bio_length - 1 < off)
> +		return (0);
> +	if (bp->bio_offset < off + len - 1)
> +		return (0);
> +	return (1);
> +}

I think the second if should be:

	if (bp->bio_offset > off + len - 1)

(greater than, not less than), or, to make it more consistent with the
first if:

	if (off + len - 1 < bp->bio_offset)

As the code stands, cases (3) and (4) return 0 and case (2) returns 1,
the opposite of what the comment says.  You could also simplify it a
bit further by treating the lock range as half-open, though it probably
looks less obvious:

	static int
	g_raid_bio_overlaps(const struct bio *bp, off_t lstart, off_t lend)
	{
		off_t bstart, bend;

		bstart = bp->bio_offset;
		bend = bstart + bp->bio_length;

		return (bend > lstart && bstart < lend);
	}

	if (g_raid_bio_overlaps(bp, lp->l_offset, lp->l_offset + lp->l_length))
		/* do something */

> +int
> +g_raid_unlock_range(struct g_raid_volume *vol, off_t off, off_t len)
> +{
> +	struct g_raid_lock *lp, *tmp;
> +	struct g_raid_softc *sc;
> +	struct bio *bp;
> +
> +	sc = vol->v_softc;
> +	LIST_FOREACH_SAFE(lp, &vol->v_locks, l_next, tmp) {

No need to use LIST_FOREACH_SAFE() here, as after LIST_REMOVE() you
never continue the loop.

> +		if (lp->l_offset == off && lp->l_length == len) {
> +			LIST_REMOVE(lp, l_next);
> +			/* XXX
> +			 * Right now we just put them all back on the queue
> +			 * and hope for the best.  We hope this because any
> +			 * locked ranges will go right back on this list
> +			 * when the worker thread runs.
> +			 */
> +			mtx_lock(&sc->sc_queue_mtx);
> +			while ((bp = bioq_takefirst(&sc->sc_queue)) != NULL)
> +				bioq_disksort(&sc->sc_queue, bp);

Look out for bioq_disksort(), as it can mess up your ordering in some
cases.  A file system should never send two overlapping writes, so for
file system I/O bioq_disksort() is safe, but in your case imagine a
sync write or a "healing" write overlapping with a regular write:

	regular write: offset=2048, length=2048
	sync write:    offset=0, length=131072

bioq_disksort() will change the order based on offset.  Not sure if it
affects you, as when syncing you should lock the range before you read
and unlock it after the write is done, but it is something to keep in
mind.

> +			free(lp, M_RAID);

Because you will use range locking all the time (not only during sync,
as "healing" writes might occur at any time), you may consider using
UMA.  I'm sure you know that, and also that keeping range locks on a
list won't scale, so you will probably need a faster structure to
implement this in the end.  I can only warn you that I implemented
range locking on a list as well, with the intention of optimizing it
later, but it never happened:)  Although in my case range locking
happens only during synchronization, as I don't implement "I/O
healing".

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
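PS. By using UMA I mean something like the minimal sketch below; a
dedicated zone makes the constant allocation and freeing of struct
g_raid_lock cheaper than malloc(9).  The zone name and where exactly
you create and destroy it are of course up to you:

	#include <vm/uma.h>

	static uma_zone_t g_raid_lock_zone;

	/* Once, e.g. at class/module initialization time: */
	g_raid_lock_zone = uma_zcreate("g_raid_lock",
	    sizeof(struct g_raid_lock), NULL, NULL, NULL, NULL,
	    UMA_ALIGN_PTR, 0);

	/* Instead of malloc(..., M_RAID, ...): */
	lp = uma_zalloc(g_raid_lock_zone, M_WAITOK | M_ZERO);

	/* Instead of free(lp, M_RAID): */
	uma_zfree(g_raid_lock_zone, lp);

	/* And on teardown: */
	uma_zdestroy(g_raid_lock_zone);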