Date:        Thu, 30 Oct 2003 11:07:20 -0800
From:        Ken Marx <kmarx@vicor.com>
To:          Don Lewis <truckman@FreeBSD.org>
Cc:          mckusick@beastie.mckusick.com
Subject:     Re: 4.8 ffs_dirpref problem
Message-ID:  <3FA16168.2010209@vicor.com>
In-Reply-To: <200310300641.h9U6fWeF031328@gw.catspoiler.org>
References:  <200310300641.h9U6fWeF031328@gw.catspoiler.org>
Don Lewis wrote:
> On 29 Oct, Ken Marx wrote:
>
>>Don Lewis wrote:
>>
>>>I think the real problem is the following code in ffs_dirpref():
>>>
>>>	avgifree = fs->fs_cstotal.cs_nifree / fs->fs_ncg;
>>>	avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg;
>>>	avgndir = fs->fs_cstotal.cs_ndir / fs->fs_ncg;
>>>[snip]
>>>	maxndir = min(avgndir + fs->fs_ipg / 16, fs->fs_ipg);
>>>	minifree = avgifree - fs->fs_ipg / 4;
>>>	if (minifree < 0)
>>>		minifree = 0;
>>>	minbfree = avgbfree - fs->fs_fpg / fs->fs_frag / 4;
>>>	if (minbfree < 0)
>>>		minbfree = 0;
>>>[snip]
>>>	prefcg = ino_to_cg(fs, pip->i_number);
>>>	for (cg = prefcg; cg < fs->fs_ncg; cg++)
>>>		if (fs->fs_cs(fs, cg).cs_ndir < maxndir &&
>>>		    fs->fs_cs(fs, cg).cs_nifree >= minifree &&
>>>		    fs->fs_cs(fs, cg).cs_nbfree >= minbfree) {
>>>			if (fs->fs_contigdirs[cg] < maxcontigdirs)
>>>				return ((ino_t)(fs->fs_ipg * cg));
>>>		}
>>>	for (cg = 0; cg < prefcg; cg++)
>>>		if (fs->fs_cs(fs, cg).cs_ndir < maxndir &&
>>>		    fs->fs_cs(fs, cg).cs_nifree >= minifree &&
>>>		    fs->fs_cs(fs, cg).cs_nbfree >= minbfree) {
>>>			if (fs->fs_contigdirs[cg] < maxcontigdirs)
>>>				return ((ino_t)(fs->fs_ipg * cg));
>>>		}
>>>
>>>If the file system is more than 75% full, minbfree will be zero, which
>>>will allow new directories to be created in cylinder groups that have
>>>no free blocks for either the directory itself, or for any files
>>>created in that directory.  If this happens, allocating the blocks for
>>>the directory and its files will require ffs_alloc() to do an
>>>expensive search across the cylinder groups for each block.  It looks
>>>to me like minbfree needs to equal, or at least be a lot closer to,
>>>avgbfree.
>
> Actually, I think the expensive search will only happen for the first
> block in each file (and the other blocks will be allocated in the same
> cylinder group), but if you are creating tons of files that are only
> one block long ...
>
>>>A similar situation exists with minifree.  Please note that the
>>>fallback algorithm uses the condition:
>>>	fs->fs_cs(fs, cg).cs_nifree >= avgifree
>>
>>Interesting. We (Vicor) will defer to experts here, but are very
>>willing to test anything you come up with.
>
> You might try the lightly tested patch below.  It tweaks the dirpref
> algorithm so that cylinder groups with free space >= 75% of the
> average free space and free inodes >= 75% of the average number of
> free inodes are candidates for allocating the directory.  It will not
> choose a cylinder group that does not have at least one free block
> and one free inode.
>
> It also decreases maxcontigdirs as the free space decreases, so that a
> cluster of directories is less likely to cause the cylinder group to
> overflow.  I think it would be better to tune maxcontigdirs
> individually for each cylinder group, based on the free space in that
> cylinder group, but that is more complex ...
>
> Index: sys/ufs/ffs/ffs_alloc.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_alloc.c,v
> retrieving revision 1.64.2.2
> diff -u -r1.64.2.2 ffs_alloc.c
> --- sys/ufs/ffs/ffs_alloc.c	21 Sep 2001 19:15:21 -0000	1.64.2.2
> +++ sys/ufs/ffs/ffs_alloc.c	30 Oct 2003 06:01:38 -0000
> @@ -696,18 +696,18 @@
>  	 * optimal allocation of a directory inode.
>  	 */
>  	maxndir = min(avgndir + fs->fs_ipg / 16, fs->fs_ipg);
> -	minifree = avgifree - fs->fs_ipg / 4;
> -	if (minifree < 0)
> -		minifree = 0;
> -	minbfree = avgbfree - fs->fs_fpg / fs->fs_frag / 4;
> -	if (minbfree < 0)
> -		minbfree = 0;
> +	minifree = avgifree - avgifree / 4;
> +	if (minifree < 1)
> +		minifree = 1;
> +	minbfree = avgbfree - avgbfree / 4;
> +	if (minbfree < 1)
> +		minbfree = 1;
>  	cgsize = fs->fs_fsize * fs->fs_fpg;
>  	dirsize = fs->fs_avgfilesize * fs->fs_avgfpdir;
>  	curdirsize = avgndir ? (cgsize - avgbfree * fs->fs_bsize) / avgndir : 0;
>  	if (dirsize < curdirsize)
>  		dirsize = curdirsize;
> -	maxcontigdirs = min(cgsize / dirsize, 255);
> +	maxcontigdirs = min((avgbfree * fs->fs_bsize) / dirsize, 255);
>  	if (fs->fs_avgfpdir > 0)
>  		maxcontigdirs = min(maxcontigdirs,
>  		    fs->fs_ipg / fs->fs_avgfpdir);
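To make the >75%-full failure mode concrete, here's a quick userland
sketch of the old vs. patched threshold computations. All of the
geometry and average numbers below are invented stand-ins for a nearly
full filesystem; nothing is read from a real superblock:

/*
 * Userland sketch of the old vs. patched minifree/minbfree
 * computations from ffs_dirpref().  All values are assumptions
 * for illustration only.
 */
#include <stdio.h>

int
main(void)
{
	int fs_ipg = 11264;	/* assumed inodes per cylinder group */
	int fs_fpg = 94384;	/* assumed fragments per cylinder group */
	int fs_frag = 8;	/* assumed fragments per block */
	int avgifree = 10000;	/* assumed: most inodes still free... */
	int avgbfree = 250;	/* ...but very few free blocks per cg */
	int minifree, minbfree;

	/* Old 4.8 code: slack is a fixed fraction of total cg capacity. */
	minifree = avgifree - fs_ipg / 4;
	if (minifree < 0)
		minifree = 0;
	minbfree = avgbfree - fs_fpg / fs_frag / 4;
	if (minbfree < 0)
		minbfree = 0;
	printf("old: minifree=%d minbfree=%d\n", minifree, minbfree);

	/*
	 * Patched code: thresholds are 75% of the current averages and
	 * never drop below 1, so a cg with cs_nbfree == 0 can no longer
	 * pass the ">= minbfree" test in the allocation loops.
	 */
	minifree = avgifree - avgifree / 4;
	if (minifree < 1)
		minifree = 1;
	minbfree = avgbfree - avgbfree / 4;
	if (minbfree < 1)
		minbfree = 1;
	printf("new: minifree=%d minbfree=%d\n", minifree, minbfree);

	/*
	 * Prints:
	 *	old: minifree=7184 minbfree=0
	 *	new: minifree=7500 minbfree=188
	 * With the old code, a completely full cg is still a candidate
	 * for a new directory, and every block written there falls back
	 * to ffs_alloc()'s slow cross-cg search.
	 */
	return (0);
}

The exact numbers don't matter; the point is that the old slack term is
a quarter of the whole cylinder group, which swamps avgbfree once the
filesystem is past roughly 75% full.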
Thanks Don,

re:
> cylinder group), but if you are creating tons of files that are only
> one block long ...

Not terribly scientific, but when our test bogs down, it's often in a
directory with 6400 1-block files. So, your comment seems plausible.

Anyway - I just tested your patch. Again, unloaded system, repeatedly
untarring a 1.5GB tar file, starting at 97% capacity, and:

	tunefs: average file size: (-f)                        49152
	tunefs: average number of files in a directory: (-s)    1500
	...

(There's a little sketch after the timing tables showing how these two
settings feed the patched maxcontigdirs computation.)

Takes about 74 system secs per 1.5GB untar:
-------------------------------------------
/dev/da0s1e 558889580 497843972 16334442    97%  6858407 63316311   10%   /raid
      119.23 real         1.28 user        73.09 sys
/dev/da0s1e 558889580 499371100 14807314    97%  6879445 63295273   10%   /raid
      111.69 real         1.32 user        73.65 sys
/dev/da0s1e 558889580 500898228 13280186    97%  6900483 63274235   10%   /raid
      116.67 real         1.44 user        74.19 sys
/dev/da0s1e 558889580 502425356 11753058    98%  6921521 63253197   10%   /raid
      114.73 real         1.25 user        75.01 sys
/dev/da0s1e 558889580 503952484 10225930    98%  6942559 63232159   10%   /raid
      116.95 real         1.30 user        74.10 sys
/dev/da0s1e 558889580 505479614  8698800    98%  6963597 63211121   10%   /raid
      115.29 real         1.39 user        74.25 sys
/dev/da0s1e 558889580 507006742  7171672    99%  6984635 63190083   10%   /raid
      114.01 real         1.16 user        74.04 sys
/dev/da0s1e 558889580 508533870  5644544    99%  7005673 63169045   10%   /raid
      119.95 real         1.32 user        75.05 sys
/dev/da0s1e 558889580 510060998  4117416    99%  7026711 63148007   10%   /raid
      114.89 real         1.33 user        74.66 sys
/dev/da0s1e 558889580 511588126  2590288    99%  7047749 63126969   10%   /raid
      114.91 real         1.58 user        74.64 sys
/dev/da0s1e 558889580 513115254  1063160   100%  7068787 63105931   10%   /raid
tot:  1161.06 real        13.45 user       742.89 sys

Compares pretty favorably to our naive, retro 4.4 dirpref hack, which
averages in the mid-high 60's:
--------------------------------------------------------------------
/dev/da0s1e 558889580 497843952 16334462    97%  6858406 63316312   10%   /raid
      110.19 real         1.42 user        65.54 sys
/dev/da0s1e 558889580 499371080 14807334    97%  6879444 63295274   10%   /raid
      105.47 real         1.47 user        65.09 sys
/dev/da0s1e 558889580 500898208 13280206    97%  6900482 63274236   10%   /raid
      110.17 real         1.48 user        64.98 sys
/dev/da0s1e 558889580 502425336 11753078    98%  6921520 63253198   10%   /raid
      131.88 real         1.49 user        71.20 sys
/dev/da0s1e 558889580 503952464 10225950    98%  6942558 63232160   10%   /raid
      111.61 real         1.62 user        67.47 sys
/dev/da0s1e 558889580 505479594  8698820    98%  6963596 63211122   10%   /raid
      131.36 real         1.67 user        90.79 sys
/dev/da0s1e 558889580 507006722  7171692    99%  6984634 63190084   10%   /raid
      115.34 real         1.49 user        65.61 sys
/dev/da0s1e 558889580 508533850  5644564    99%  7005672 63169046   10%   /raid
      110.26 real         1.39 user        65.26 sys
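And since we ran with those tunefs settings, here's roughly what they
do to maxcontigdirs under the old and patched formulas. Only the two
tunefs numbers come from our setup; the block/frag sizes, cg geometry,
avgbfree, and avgndir below are assumed stand-ins, not measured from
/raid:

/*
 * Userland sketch of the maxcontigdirs computation before and after
 * the patch.  fs_avgfilesize/fs_avgfpdir match the tunefs settings
 * used in the test above; everything else is an assumed stand-in.
 */
#include <stdio.h>

#define	min(a, b)	((a) < (b) ? (a) : (b))

int
main(void)
{
	int fs_avgfilesize = 49152;	/* tunefs -f, as above */
	int fs_avgfpdir = 1500;		/* tunefs -s, as above */
	int fs_bsize = 16384;		/* assumed 16K blocks */
	int fs_fsize = 2048;		/* assumed 2K frags */
	int fs_fpg = 94384;		/* assumed frags per cg */
	int fs_ipg = 23552;		/* assumed inodes per cg */
	int avgbfree = 250;		/* assumed: fs ~98% full */
	int avgndir = 100;		/* assumed dirs per cg */
	int cgsize, dirsize, curdirsize, oldmax, newmax;

	cgsize = fs_fsize * fs_fpg;		/* ~184MB per cg */
	dirsize = fs_avgfilesize * fs_avgfpdir;	/* ~70MB expected per dir */
	curdirsize = avgndir ?
	    (cgsize - avgbfree * fs_bsize) / avgndir : 0;
	if (dirsize < curdirsize)
		dirsize = curdirsize;

	/* Old: scaled to the cg's total size. */
	oldmax = min(cgsize / dirsize, 255);
	/* New: scaled to the cg's remaining free space. */
	newmax = min((avgbfree * fs_bsize) / dirsize, 255);
	if (fs_avgfpdir > 0) {
		oldmax = min(oldmax, fs_ipg / fs_avgfpdir);
		newmax = min(newmax, fs_ipg / fs_avgfpdir);
	}

	printf("old maxcontigdirs = %d\n", oldmax);	/* prints 2 */
	printf("new maxcontigdirs = %d\n", newmax);	/* prints 0 */
	/*
	 * On a nearly full fs the patched value drops to 0, so the
	 * "fs_contigdirs[cg] < maxcontigdirs" test never passes and
	 * dirpref stops clustering new directories into cgs that have
	 * no room left for their files.
	 */
	return (0);
}

With free space to spare, the two formulas give about the same answer;
it's only as the filesystem fills up that the patched one backs off.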
/dev/da0s1e 558889580 510060978  4117436    99%  7026710 63148008   10%   /raid
      116.15 real         1.51 user        65.47 sys
/dev/da0s1e 558889580 511588106  2590308    99%  7047748 63126970   10%   /raid
      112.74 real         1.37 user        65.01 sys
/dev/da0s1e 558889580 513115234  1063180   100%  7068786 63105932   10%   /raid
tot:  1158.36 real        15.01 user       686.57 sys

Without either change, we'd expect timings of 5-20 minutes when things
are going poorly.

Happy to test further if you have tweaks to your patch, or things you'd
like us to test in particular, e.g., load, newfs, etc.

k.
--
Ken Marx, kmarx@vicor-nb.com
As a company we must not put the cart before the horse and set up weekly
meetings on the solution space.
		- http://www.bigshed.com/cgi-bin/speak.cgi