From owner-freebsd-fs@FreeBSD.ORG Thu Dec 9 10:34:18 2010 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 055B8106564A; Thu, 9 Dec 2010 10:34:18 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id DBD638FC27; Thu, 9 Dec 2010 10:34:16 +0000 (UTC) Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id oB9AYCGS097299 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 9 Dec 2010 12:34:12 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.4/8.14.4) with ESMTP id oB9AYCQU016138; Thu, 9 Dec 2010 12:34:12 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.4/8.14.4/Submit) id oB9AYBCI016137; Thu, 9 Dec 2010 12:34:11 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 9 Dec 2010 12:34:11 +0200 From: Kostik Belousov To: Kirk McKusick Message-ID: <20101209103411.GL33073@deviant.kiev.zoral.com.ua> References: <20101208225352.GK33073@deviant.kiev.zoral.com.ua> <201012082356.oB8NuxwX040914@chez.mckusick.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="wA9WyeW1yVBM2Q32" Content-Disposition: inline In-Reply-To: <201012082356.oB8NuxwX040914@chez.mckusick.com> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-3.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00, DNS_FROM_OPENWHOIS autolearn=no version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: freebsd-fs@freebsd.org, pjd@freebsd.org, Oliver Fromme Subject: Re: TRIM support for UFS? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 09 Dec 2010 10:34:18 -0000 --wA9WyeW1yVBM2Q32 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Dec 08, 2010 at 03:56:59PM -0800, Kirk McKusick wrote: > > Date: Thu, 9 Dec 2010 00:53:52 +0200 > > From: Kostik Belousov > > To: Kirk McKusick > > Cc: Julian Elischer , freebsd-fs@freebsd.org, > > pjd@freebsd.org, Oliver Fromme > > Subject: Re: TRIM support for UFS? > >=20 > > On Wed, Dec 08, 2010 at 02:23:52PM -0800, Kirk McKusick wrote: > > > > > > It is safe to do the BIO_DELETE here because at this point we have > > > progressed through the file delete to the point where the inode that > > > claimed this block has been updated on the disk with a NULL pointer. > >=20 > > Thanks for the explanation. On the other hand, can my scenario be real > > for async mounts ? >=20 > Async mounts are not compatible with SU or SUJ. You are running the > filesystem in a mode where there is no guarantee of recovery after > a crash. So yes, you may TRIM blocks that are still claimed by on-disk > inodes. But that is likely the least of your worries. Since async makes > no claims about filesystem consistency after a crash, adding one more > way for it to break does not seem like a big deal :-) >=20 > > > The above patch looks good though I would do it unconditionally > > > (e.g., also for snapshots). It seems sensible to make it conditional > > > on a sysctl variable so that folks could experiment with it more > > > easily. And I would leave it off by default as non-SSD disks are > > > unlikely to benefit from it. If it does prove useful for SSD disks > > > then I would make it conditional on a filesystem flag so that it > > > is possible to enable it on a filesystem-by-filesystem basis (e.g., > > > on for filesystems on the SSD and off for the rest). > >=20 > > I think it is better to have a flag in superblock instead of the mount > > option. > > > > Most of diff deleted... > >=20 > > diff --git a/sys/ufs/ffs/fs.h b/sys/ufs/ffs/fs.h > > index ba98ed3..13d9ede 100644 > > --- a/sys/ufs/ffs/fs.h > > +++ b/sys/ufs/ffs/fs.h > > @@ -417,6 +417,7 @@ CTASSERT(sizeof(struct fs) =3D=3D 1376); > > #define FS_FLAGS_UPDATED 0x0080 /* flags have been moved to new locati= on */ > > #define FS_NFS4ACLS 0x0100 /* file system has NFSv4 ACLs enabled */ > > #define FS_INDEXDIRS 0x0200 /* kernel supports indexed directories */ > > +#define FS_TRIM 0x0400 /* issue BIO_DELETE for deleted blocks */ > >=20 > > /* > > * Macros to access bits in the fs_active array. >=20 > I approve of this change and am happy to see it (finally) get added. >=20 > > On Wed, 8 Dec 2010 at 23:33:27 +0100, Pawel Jakub Dawidek wrote: > >> On Wed, Dec 08, 2010 at 02:23:52PM -0800, Kirk McKusick wrote: > >> The above patch looks good though I would do it unconditionally > >> (e.g., also for snapshots). It seems sensible to make it conditional > >> on a sysctl variable so that folks could experiment with it more > >> easily. And I would leave it off by default as non-SSD disks are > >> unlikely to benefit from it. If it does prove useful for SSD disks > >> then I would make it conditional on a filesystem flag so that it > >> is possible to enable it on a filesystem-by-filesystem basis (e.g., > >> on for filesystems on the SSD and off for the rest). > >=20 > > We already have a flag for disks: DISKFLAG_CANDELETE, which tells if the > > disk support TRIM or not. Next we should add BIO_GETATTR attribute for > > DISK class to return true if TRIM is supported. This way UFS can ask if > > TRIM is supported on mount and don't bother sending BIO_DELETE if it is > > not supported. >=20 > Clearly asking the underlying device is the most automated way to decide > whether to do TRIM. Once it is possible to ask the underlying disk if it > supports TRIM, your change could be modified to do the querry to the disk > and set the FS_TRIM flag accordingly each time that the filesystem is > mounted. If it is easy to add a BIO_GETATTR attribute for the DISK class, > then we can skip the intermediate stage of using tunefs to enable/disable= it. I still think that TRIM should be opt-in, at least for the first time. Below is the BIO_GETATTR change integrated. Now, both FS_TRIM should be set and device shall report the GEOM::candelete attribute as true to have BIO_DELETE issued. geom_disk and md(4) report GEOM::candelete. Also, I updated tunefs(8) manpage, missed in the previous version. diff --git a/sbin/dumpfs/dumpfs.c b/sbin/dumpfs/dumpfs.c index 38c05f6..fa562dc 100644 --- a/sbin/dumpfs/dumpfs.c +++ b/sbin/dumpfs/dumpfs.c @@ -253,9 +253,11 @@ dumpfs(const char *name) printf("fs_flags expanded "); if (fsflags & FS_NFS4ACLS) printf("nfsv4acls "); + if (fsflags & FS_TRIM) + printf("trim "); fsflags &=3D ~(FS_UNCLEAN | FS_DOSOFTDEP | FS_NEEDSFSCK | FS_INDEXDIRS | FS_ACLS | FS_MULTILABEL | FS_GJOURNAL | FS_FLAGS_UPDATED | - FS_NFS4ACLS | FS_SUJ); + FS_NFS4ACLS | FS_SUJ | FS_TRIM); if (fsflags !=3D 0) printf("unknown flags (%#x)", fsflags); putchar('\n'); diff --git a/sbin/tunefs/tunefs.8 b/sbin/tunefs/tunefs.8 index a883cd4..c62af94 100644 --- a/sbin/tunefs/tunefs.8 +++ b/sbin/tunefs/tunefs.8 @@ -28,7 +28,7 @@ .\" @(#)tunefs.8 8.2 (Berkeley) 12/11/93 .\" $FreeBSD$ .\" -.Dd March 6, 2010 +.Dd December 9, 2010 .Dt TUNEFS 8 .Os .Sh NAME @@ -51,6 +51,7 @@ .Op Fl p .Op Fl s Ar avgfpdir .Op Fl S Ar size +.Op Fl t Cm enable | disable .Ar special | filesystem .Sh DESCRIPTION The @@ -143,6 +144,10 @@ Specify the expected number of files per directory. .It Fl S Ar size Specify the softdep journal size in bytes. The minimum is 4M. +.It Fl t Cm enable | disable +Turn on/off the TRIM enable flag. +If enabled, and if the underlaying device supports BIO_DELETE +command, kernel will delete the freed blocks. .El .Pp At least one of the above flags is required. diff --git a/sbin/tunefs/tunefs.c b/sbin/tunefs/tunefs.c index b2ea602..0ed713d 100644 --- a/sbin/tunefs/tunefs.c +++ b/sbin/tunefs/tunefs.c @@ -82,11 +82,13 @@ int main(int argc, char *argv[]) { char *avalue, *jvalue, *Jvalue, *Lvalue, *lvalue, *Nvalue, *nvalue; + char *tvalue; const char *special, *on; const char *name; int active; int Aflag, aflag, eflag, evalue, fflag, fvalue, jflag, Jflag, Lflag; int lflag, mflag, mvalue, Nflag, nflag, oflag, ovalue, pflag, sflag; + int tflag; int svalue, Sflag, Svalue; int ch, found_arg, i; const char *chg[2]; @@ -101,7 +103,8 @@ main(int argc, char *argv[]) evalue =3D fvalue =3D mvalue =3D ovalue =3D svalue =3D Svalue =3D 0; active =3D 0; found_arg =3D 0; /* At least one arg is required. */ - while ((ch =3D getopt(argc, argv, "Aa:e:f:j:J:L:l:m:N:n:o:ps:S:")) !=3D -= 1) + while ((ch =3D getopt(argc, argv, "Aa:e:f:j:J:L:l:m:N:n:o:ps:S:t:")) + !=3D -1) switch (ch) { =20 case 'A': @@ -268,6 +271,18 @@ main(int argc, char *argv[]) Sflag =3D 1; break; =20 + case 't': + found_arg =3D 1; + name =3D "trim"; + tvalue =3D optarg; + if (strcmp(tvalue, "enable") !=3D 0 && + strcmp(tvalue, "disable") !=3D 0) { + errx(10, "bad %s (options are %s)", + name, "`enable' or `disable'"); + } + tflag =3D 1; + break; + default: usage(); } @@ -493,6 +508,24 @@ main(int argc, char *argv[]) sblock.fs_avgfpdir =3D svalue; } } + if (tflag) { + name =3D "issue TRIM to the disk"; + if (strcmp(tvalue, "enable") =3D=3D 0) { + if (sblock.fs_flags & FS_TRIM) + warnx("%s remains unchanged as enabled", name); + else { + sblock.fs_flags |=3D FS_TRIM; + warnx("%s set", name); + } + } else if (strcmp(tvalue, "disable") =3D=3D 0) { + if ((~sblock.fs_flags & FS_TRIM) =3D=3D FS_TRIM) + warnx("%s remains unchanged as disabled", name); + else { + sblock.fs_flags &=3D ~FS_TRIM; + warnx("%s cleared", name); + } + } + } =20 if (sbwrite(&disk, Aflag) =3D=3D -1) goto err; @@ -1011,12 +1044,13 @@ out: void usage(void) { - fprintf(stderr, "%s\n%s\n%s\n%s\n%s\n", + fprintf(stderr, "%s\n%s\n%s\n%s\n%s\n%s\n", "usage: tunefs [-A] [-a enable | disable] [-e maxbpg] [-f avgfilesize]", " [-J enable | disable] [-j enable | disable]",=20 " [-L volname] [-l enable | disable] [-m minfree]", " [-N enable | disable] [-n enable | disable]", -" [-o space | time] [-p] [-s avgfpdir] special | filesystem"); +" [-o space | time] [-p] [-s avgfpdir] [-t enable | disable]", +" special | filesystem"); exit(2); } =20 @@ -1035,6 +1069,8 @@ printfs(void) (sblock.fs_flags & FS_SUJ)? "enabled" : "disabled"); warnx("gjournal: (-J) %s", (sblock.fs_flags & FS_GJOURNAL)? "enabled" : "disabled"); + warnx("trim: (-p) %s",=20 + (sblock.fs_flags & FS_TRIM)? "enabled" : "disabled"); warnx("maximum blocks per file in a cylinder group: (-e) %d", sblock.fs_maxbpg); warnx("average file size: (-f) %d", diff --git a/sys/dev/md/md.c b/sys/dev/md/md.c index 6c3484e..c294d1b 100644 --- a/sys/dev/md/md.c +++ b/sys/dev/md/md.c @@ -713,11 +713,12 @@ md_kthread(void *arg) } mtx_unlock(&sc->queue_mtx); if (bp->bio_cmd =3D=3D BIO_GETATTR) { - if (sc->fwsectors && sc->fwheads && + if ((sc->fwsectors && sc->fwheads && (g_handleattr_int(bp, "GEOM::fwsectors", sc->fwsectors) || g_handleattr_int(bp, "GEOM::fwheads", - sc->fwheads))) + sc->fwheads))) || + g_handleattr_int(bp, "GEOM::candelete", 1)) error =3D -1; else error =3D EOPNOTSUPP; diff --git a/sys/geom/geom_disk.c b/sys/geom/geom_disk.c index 25d2e3b..eda702c 100644 --- a/sys/geom/geom_disk.c +++ b/sys/geom/geom_disk.c @@ -297,6 +297,9 @@ g_disk_start(struct bio *bp) } while (bp2 !=3D NULL); break; case BIO_GETATTR: + if (g_handleattr_int(bp, "GEOM::candelete", + (dp->d_flags & DISKFLAG_CANDELETE) !=3D 0)) + break; if (g_handleattr_int(bp, "GEOM::fwsectors", dp->d_fwsectors)) break; else if (g_handleattr_int(bp, "GEOM::fwheads", dp->d_fwheads)) diff --git a/sys/ufs/ffs/ffs_alloc.c b/sys/ufs/ffs/ffs_alloc.c index b740bbb..926229c 100644 --- a/sys/ufs/ffs/ffs_alloc.c +++ b/sys/ufs/ffs/ffs_alloc.c @@ -83,6 +83,8 @@ __FBSDID("$FreeBSD$"); =20 #include =20 +#include + #include #include #include @@ -1850,6 +1852,7 @@ ffs_blkfree(ump, fs, devvp, bno, size, inum, dephd) u_int cg; u_int8_t *blksfree; struct cdev *dev; + struct bio *bip; =20 cg =3D dtog(fs, bno); if (devvp->v_type =3D=3D VREG) { @@ -1962,6 +1965,16 @@ ffs_blkfree(ump, fs, devvp, bno, size, inum, dephd) softdep_setup_blkfree(UFSTOVFS(ump), bp, bno, numfrags(fs, size), dephd); bdwrite(bp); + + if (ump->um_candelete) { + bip =3D g_alloc_bio(); + bip->bio_cmd =3D BIO_DELETE; + bip->bio_offset =3D dbtob(fsbtodb(fs, bno)); + bip->bio_done =3D g_destroy_bio; + bip->bio_length =3D size; + g_io_request(bip, + (struct g_consumer *)devvp->v_bufobj.bo_private); + } } =20 #ifdef INVARIANTS diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c index 72f40da..f578382 100644 --- a/sys/ufs/ffs/ffs_vfsops.c +++ b/sys/ufs/ffs/ffs_vfsops.c @@ -895,6 +895,21 @@ ffs_mountfs(devvp, mp, td) mp->mnt_stat.f_mntonname); #endif } + if ((fs->fs_flags & FS_TRIM) !=3D 0) { + size =3D sizeof(int); + if (g_io_getattr("GEOM::candelete", cp, &size, + &ump->um_candelete) =3D=3D 0) { + if (!ump->um_candelete) + printf( +"WARNING: %s: TRIM flag on fs but disk does not support TRIM\n", + mp->mnt_stat.f_mntonname); + } else { + printf( +"WARNING: %s: TRIM flag on fs but cannot get whether disk supports TRIM\n", + mp->mnt_stat.f_mntonname); + ump->um_candelete =3D 0; + } + } =20 ump->um_mountp =3D mp; ump->um_dev =3D dev; diff --git a/sys/ufs/ffs/fs.h b/sys/ufs/ffs/fs.h index ba98ed3..13d9ede 100644 --- a/sys/ufs/ffs/fs.h +++ b/sys/ufs/ffs/fs.h @@ -417,6 +417,7 @@ CTASSERT(sizeof(struct fs) =3D=3D 1376); #define FS_FLAGS_UPDATED 0x0080 /* flags have been moved to new location */ #define FS_NFS4ACLS 0x0100 /* file system has NFSv4 ACLs enabled */ #define FS_INDEXDIRS 0x0200 /* kernel supports indexed directories */ +#define FS_TRIM 0x0400 /* issue BIO_DELETE for deleted blocks */ =20 /* * Macros to access bits in the fs_active array. diff --git a/sys/ufs/ufs/ufsmount.h b/sys/ufs/ufs/ufsmount.h index 03edd73..b13db40 100644 --- a/sys/ufs/ufs/ufsmount.h +++ b/sys/ufs/ufs/ufsmount.h @@ -95,6 +95,7 @@ struct ufsmount { time_t um_itime[MAXQUOTAS]; /* inode quota time limit */ char um_qflags[MAXQUOTAS]; /* quota specific flags */ int64_t um_savedmaxfilesize; /* XXX - limit maxfilesize */ + int um_candelete; /* devvp supports TRIM */ int (*um_balloc)(struct vnode *, off_t, int, struct ucred *, int, struct = buf **); int (*um_blkatoff)(struct vnode *, off_t, char **, struct buf **); int (*um_truncate)(struct vnode *, off_t, int, struct ucred *, struct thr= ead *); --wA9WyeW1yVBM2Q32 Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (FreeBSD) iEYEARECAAYFAk0AsKMACgkQC3+MBN1Mb4gTWgCgr4ImFVQhYac2hVOHZOnTneXv VR0AoKXYQX99T79attXilSYn6/kjX86+ =RLhi -----END PGP SIGNATURE----- --wA9WyeW1yVBM2Q32--