Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 9 Dec 2010 00:53:52 +0200
From:      Kostik Belousov <kostikbel@gmail.com>
To:        Kirk McKusick <mckusick@mckusick.com>
Cc:        freebsd-fs@freebsd.org, pjd@freebsd.org, Oliver Fromme <olli@lurza.secnetix.de>
Subject:   Re: TRIM support for UFS?
Message-ID:  <20101208225352.GK33073@deviant.kiev.zoral.com.ua>
In-Reply-To: <201012082223.oB8MNqGV020879@chez.mckusick.com>
References:  <20101208214909.GH33073@deviant.kiev.zoral.com.ua> <201012082223.oB8MNqGV020879@chez.mckusick.com>

next in thread | previous in thread | raw e-mail | index | archive | help

--DfnuYBTqzt7sVGu3
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Wed, Dec 08, 2010 at 02:23:52PM -0800, Kirk McKusick wrote:
> > Date: Wed, 8 Dec 2010 23:49:09 +0200
> > From: Kostik Belousov <kostikbel@gmail.com>
> > To: Julian Elischer <julian@freebsd.org>
> > Cc: freebsd-fs@freebsd.org, pjd@freebsd.org,
> >         Oliver Fromme <olli@lurza.secnetix.de>,
> >         Kirk McKusick <mckusick@mckusick.com>
> > Subject: Re: TRIM support for UFS?
> >=20
> > On Wed, Dec 08, 2010 at 09:07:56AM -0800, Julian Elischer wrote:
> > >
> > > From:    Kirk McKusick <mckusick@chez.mckusick.com>
> > > Date:    Wed, 21 May 2008 13:19:18 -0700
> > > To:      Poul-Henning Kamp <phk@critter.freebsd.dk>
> > > Subject: UFS and BIO_DELETE
> > > X-URL: http://WWW.McKusick.COM/
> > > Reply-To: Kirk McKusick <mckusick@McKusick.COM>
> > >
> > > I enclose below my proposed patch to add BIO_DELETE to UFS
> > > (goes at the end of ffs_blkfree). As I have no way to test
> > > it, I am wondering if you could let me know if it works.
> > >
> > > Also, I am thinking of only enabling it for filesystems mounted
> > > with a new flag requesting the behavior since the geteblk is a
> > > rather expensive call for the usual no-op case.
> > >
> > > I did look at just allocating a `struct bio' as a local variable
> > > and using that, but it looked like I needed to also come up with a
> > > `producer' and/or `consumer' if I wanted to pass it to GEOM directly,
> > > so in the end I went with this more expensive solution. If there
> > > is an easy way to just pass a bio structure to GEOM, I would much
> > > prefer that approach.
> > >
> > > 	~Kirk
> > >
> > >
> > > *** ffs_alloc.c	Wed May 21 20:11:04 2008
> > > --- ffs_alloc.c.new	Wed May 21 20:10:50 2008
> > > ***************
> > > *** 1945,1950 ****
> > > --- 1945,1962 ----
> > >   	ACTIVECLEAR(fs, cg);
> > >   	UFS_UNLOCK(ump);
> > >   	bdwrite(bp);
> > > + 	/*
> > > + 	 * Request that the block be cleared.
> > > + 	 */
> > > + 	bp =3D geteblk(size);
> > > + 	bp->b_iocmd =3D BIO_DELETE;
> > > + 	bp->b_vp =3D devvp;
> > > + 	bp->b_blkno =3D fsbtodb(fs, bno);
> > > + 	bp->b_offset =3D dbtob(bp->b_blkno);
> > > + 	bp->b_iooffset =3D bp->b_offset;
> > > + 	bp->b_bcount =3D size;
> > > + 	BUF_KERNPROC(bp);
> > > + 	BO_STRATEGY(&devvp->v_bufobj, bp);
> > >   }
> > >=20
> > >   #ifdef INVARIANTS
> > >
> > The UFS devvp contains the pointer to struct g_consumer in
> > devpp->v_bufobj->bo_private, so g_alloc_bio() and g_io_request() can be
> > used to start BIO_DELETE command without fake bufer allocation.
> > g_io_strategy() may be used as the sample.
>=20
> Thanks for that update to moderize the patch!
>=20
> > But, I have a question about the patch. The bitmap block in ffs_blkfree=
()
> > is handled with delayed write, by bdwrite() call. I think that it is
> > quite possible for BIO_DELETE to be executed before the bitmap block
> > is written and on-disk bit is set in bitmap. Would not this allow
> > for the blocks to be deleted before the bitmap is updated,
> > and potentially before on-disk pointers are cleared that points to
> > the blocks ? SU just rolls back the unfinished dependencies on write,
> > but BIO_DELETE cannot.
>=20
> It is safe to do the BIO_DELETE here because at this point we have
> progressed through the file delete to the point where the inode that
> claimed this block has been updated on the disk with a NULL pointer.
> The only step that remains is to update the on-disk bitmap to reflect
> that the block is available. If the bitmap write is not done before
> a system crash the only action that will be taken with respect to
> the freed block is that either background fsck will restore it to
> the bitmap (SU) or the journal will restore it to the bitmap (SUJ).
> There is no case where either SU or SUJ will put it back into the
> file once we have reached this point.
Thanks for the explanation. On the other hand, can my scenario be real
for async mounts ?

>=20
> > Also, shouldn't the BIO_DELETE only be issued for real devices,
> > and not the snapshots ?
>=20
> They should also be issued for snapshots. The only time that snapshots
> give up blocks is when they are deleted. The blocks that they give up
> at that time are all copies of blocks that were later changed in the
> filesystem. With the snapshot gone there will be no remaining references
> to those blocks and they will be made available for reallocation. As
> such, it would be helpful if they had been TRIMMED.
>=20
> > diff --git a/sys/ufs/ffs/ffs_alloc.c b/sys/ufs/ffs/ffs_alloc.c
> > index b740bbb..4ff57ae 100644
> > --- a/sys/ufs/ffs/ffs_alloc.c
> > +++ b/sys/ufs/ffs/ffs_alloc.c
> > @@ -83,6 +83,8 @@ __FBSDID("$FreeBSD$");
> >=20
> >  #include <security/audit/audit.h>
> >=20
> > +#include <geom/geom.h>
> > +
> >  #include <ufs/ufs/dir.h>
> >  #include <ufs/ufs/extattr.h>
> >  #include <ufs/ufs/quota.h>
> > @@ -1850,6 +1852,7 @@ ffs_blkfree(ump, fs, devvp, bno, size, inum, deph=
d)
> >  	u_int cg;
> >  	u_int8_t *blksfree;
> >  	struct cdev *dev;
> > +	struct bio *bip;
> >=20
> >  	cg =3D dtog(fs, bno);
> >  	if (devvp->v_type =3D=3D VREG) {
> > @@ -1962,6 +1965,16 @@ ffs_blkfree(ump, fs, devvp, bno, size, inum, dep=
hd)
> >  		softdep_setup_blkfree(UFSTOVFS(ump), bp, bno,
> >  		    numfrags(fs, size), dephd);
> >  	bdwrite(bp);
> > +
> > +	if (devvp->v_type !=3D VREG) {
> > +		bip =3D g_alloc_bio();
> > +		bip->bio_cmd =3D BIO_DELETE;
> > +		bip->bio_offset =3D dbtob(fsbtodb(fs, bno));
> > +		bip->bio_done =3D g_destroy_bio;
> > +		bip->bio_length =3D size;
> > +		g_io_request(bip,
> > +		    (struct g_consumer *)devvp->v_bufobj.bo_private);
> > +	}
> >  }
> >=20
> >  #ifdef INVARIANTS
>=20
> The above patch looks good though I would do it unconditionally
> (e.g., also for snapshots). It seems sensible to make it conditional
> on a sysctl variable so that folks could experiment with it more
> easily. And I would leave it off by default as non-SSD disks are
> unlikely to benefit from it. If it does prove useful for SSD disks
> then I would make it conditional on a filesystem flag so that it
> is possible to enable it on a filesystem-by-filesystem basis (e.g.,
> on for filesystems on the SSD and off for the rest).

I think it is better to have a flag in superblock instead of the mount
option.=20

diff --git a/sbin/dumpfs/dumpfs.c b/sbin/dumpfs/dumpfs.c
index 38c05f6..fa562dc 100644
--- a/sbin/dumpfs/dumpfs.c
+++ b/sbin/dumpfs/dumpfs.c
@@ -253,9 +253,11 @@ dumpfs(const char *name)
 		printf("fs_flags expanded ");
 	if (fsflags & FS_NFS4ACLS)
 		printf("nfsv4acls ");
+	if (fsflags & FS_TRIM)
+		printf("trim ");
 	fsflags &=3D ~(FS_UNCLEAN | FS_DOSOFTDEP | FS_NEEDSFSCK | FS_INDEXDIRS |
 		     FS_ACLS | FS_MULTILABEL | FS_GJOURNAL | FS_FLAGS_UPDATED |
-		     FS_NFS4ACLS | FS_SUJ);
+		     FS_NFS4ACLS | FS_SUJ | FS_TRIM);
 	if (fsflags !=3D 0)
 		printf("unknown flags (%#x)", fsflags);
 	putchar('\n');
diff --git a/sbin/tunefs/tunefs.c b/sbin/tunefs/tunefs.c
index b2ea602..0ed713d 100644
--- a/sbin/tunefs/tunefs.c
+++ b/sbin/tunefs/tunefs.c
@@ -82,11 +82,13 @@ int
 main(int argc, char *argv[])
 {
 	char *avalue, *jvalue, *Jvalue, *Lvalue, *lvalue, *Nvalue, *nvalue;
+	char *tvalue;
 	const char *special, *on;
 	const char *name;
 	int active;
 	int Aflag, aflag, eflag, evalue, fflag, fvalue, jflag, Jflag, Lflag;
 	int lflag, mflag, mvalue, Nflag, nflag, oflag, ovalue, pflag, sflag;
+	int tflag;
 	int svalue, Sflag, Svalue;
 	int ch, found_arg, i;
 	const char *chg[2];
@@ -101,7 +103,8 @@ main(int argc, char *argv[])
 	evalue =3D fvalue =3D mvalue =3D ovalue =3D svalue =3D Svalue =3D 0;
 	active =3D 0;
 	found_arg =3D 0;		/* At least one arg is required. */
-	while ((ch =3D getopt(argc, argv, "Aa:e:f:j:J:L:l:m:N:n:o:ps:S:")) !=3D -=
1)
+	while ((ch =3D getopt(argc, argv, "Aa:e:f:j:J:L:l:m:N:n:o:ps:S:t:"))
+	    !=3D -1)
 		switch (ch) {
=20
 		case 'A':
@@ -268,6 +271,18 @@ main(int argc, char *argv[])
 			Sflag =3D 1;
 			break;
=20
+		case 't':
+			found_arg =3D 1;
+			name =3D "trim";
+			tvalue =3D optarg;
+			if (strcmp(tvalue, "enable") !=3D 0 &&
+			    strcmp(tvalue, "disable") !=3D 0) {
+				errx(10, "bad %s (options are %s)",
+				    name, "`enable' or `disable'");
+			}
+			tflag =3D 1;
+			break;
+
 		default:
 			usage();
 		}
@@ -493,6 +508,24 @@ main(int argc, char *argv[])
 			sblock.fs_avgfpdir =3D svalue;
 		}
 	}
+	if (tflag) {
+		name =3D "issue TRIM to the disk";
+ 		if (strcmp(tvalue, "enable") =3D=3D 0) {
+			if (sblock.fs_flags & FS_TRIM)
+				warnx("%s remains unchanged as enabled", name);
+			else {
+ 				sblock.fs_flags |=3D FS_TRIM;
+ 				warnx("%s set", name);
+			}
+ 		} else if (strcmp(tvalue, "disable") =3D=3D 0) {
+			if ((~sblock.fs_flags & FS_TRIM) =3D=3D FS_TRIM)
+				warnx("%s remains unchanged as disabled", name);
+			else {
+ 				sblock.fs_flags &=3D ~FS_TRIM;
+ 				warnx("%s cleared", name);
+			}
+ 		}
+	}
=20
 	if (sbwrite(&disk, Aflag) =3D=3D -1)
 		goto err;
@@ -1011,12 +1044,13 @@ out:
 void
 usage(void)
 {
-	fprintf(stderr, "%s\n%s\n%s\n%s\n%s\n",
+	fprintf(stderr, "%s\n%s\n%s\n%s\n%s\n%s\n",
 "usage: tunefs [-A] [-a enable | disable] [-e maxbpg] [-f avgfilesize]",
 "              [-J enable | disable] [-j enable | disable]",=20
 "              [-L volname] [-l enable | disable] [-m minfree]",
 "              [-N enable | disable] [-n enable | disable]",
-"              [-o space | time] [-p] [-s avgfpdir] special | filesystem");
+"              [-o space | time] [-p] [-s avgfpdir] [-t enable | disable]",
+"              special | filesystem");
 	exit(2);
 }
=20
@@ -1035,6 +1069,8 @@ printfs(void)
 		(sblock.fs_flags & FS_SUJ)? "enabled" : "disabled");
 	warnx("gjournal: (-J)                                     %s",
 		(sblock.fs_flags & FS_GJOURNAL)? "enabled" : "disabled");
+	warnx("trim: (-p)                                         %s",=20
+		(sblock.fs_flags & FS_TRIM)? "enabled" : "disabled");
 	warnx("maximum blocks per file in a cylinder group: (-e)  %d",
 	      sblock.fs_maxbpg);
 	warnx("average file size: (-f)                            %d",
diff --git a/sys/ufs/ffs/ffs_alloc.c b/sys/ufs/ffs/ffs_alloc.c
index b740bbb..ed39aa3 100644
--- a/sys/ufs/ffs/ffs_alloc.c
+++ b/sys/ufs/ffs/ffs_alloc.c
@@ -83,6 +83,8 @@ __FBSDID("$FreeBSD$");
=20
 #include <security/audit/audit.h>
=20
+#include <geom/geom.h>
+
 #include <ufs/ufs/dir.h>
 #include <ufs/ufs/extattr.h>
 #include <ufs/ufs/quota.h>
@@ -1850,6 +1852,7 @@ ffs_blkfree(ump, fs, devvp, bno, size, inum, dephd)
 	u_int cg;
 	u_int8_t *blksfree;
 	struct cdev *dev;
+	struct bio *bip;
=20
 	cg =3D dtog(fs, bno);
 	if (devvp->v_type =3D=3D VREG) {
@@ -1962,6 +1965,16 @@ ffs_blkfree(ump, fs, devvp, bno, size, inum, dephd)
 		softdep_setup_blkfree(UFSTOVFS(ump), bp, bno,
 		    numfrags(fs, size), dephd);
 	bdwrite(bp);
+
+	if (fs->fs_flags & FS_TRIM) {
+		bip =3D g_alloc_bio();
+		bip->bio_cmd =3D BIO_DELETE;
+		bip->bio_offset =3D dbtob(fsbtodb(fs, bno));
+		bip->bio_done =3D g_destroy_bio;
+		bip->bio_length =3D size;
+		g_io_request(bip,
+		    (struct g_consumer *)devvp->v_bufobj.bo_private);
+	}
 }
=20
 #ifdef INVARIANTS
diff --git a/sys/ufs/ffs/fs.h b/sys/ufs/ffs/fs.h
index ba98ed3..13d9ede 100644
--- a/sys/ufs/ffs/fs.h
+++ b/sys/ufs/ffs/fs.h
@@ -417,6 +417,7 @@ CTASSERT(sizeof(struct fs) =3D=3D 1376);
 #define FS_FLAGS_UPDATED 0x0080	/* flags have been moved to new location */
 #define FS_NFS4ACLS	0x0100	/* file system has NFSv4 ACLs enabled */
 #define FS_INDEXDIRS	0x0200	/* kernel supports indexed directories */
+#define	FS_TRIM		0x0400	/* issue BIO_DELETE for deleted blocks */
=20
 /*
  * Macros to access bits in the fs_active array.

--DfnuYBTqzt7sVGu3
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (FreeBSD)

iEYEARECAAYFAk0ADH8ACgkQC3+MBN1Mb4hV0gCglghMNOcFLRx0OWWGMIFNtzZU
g18AoJEJFYVO8ggUEbTTLLNHbrb4bfyF
=NflN
-----END PGP SIGNATURE-----

--DfnuYBTqzt7sVGu3--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20101208225352.GK33073>