Date:      Wed, 29 Sep 2010 15:30:13 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        Brandon Gooch <jamesbrandongooch@gmail.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: ext2fs now extremely slow
Message-ID:  <201009291530.13434.jhb@freebsd.org>
In-Reply-To: <AANLkTim_0mPVZeP1b2z38apvskaLeKnjCvVwH0BK9dAN@mail.gmail.com>
References:  <20100929031825.L683@besplex.bde.org> <201009290917.05269.jhb@freebsd.org> <AANLkTim_0mPVZeP1b2z38apvskaLeKnjCvVwH0BK9dAN@mail.gmail.com>

On Wednesday, September 29, 2010 2:50:16 pm Brandon Gooch wrote:
> On Wed, Sep 29, 2010 at 8:17 AM, John Baldwin <jhb@freebsd.org> wrote:
> > On Wednesday, September 29, 2010 12:16:55 am Aditya Sarawgi wrote:
> >> On Wed, Sep 29, 2010 at 09:14:57AM +1000, Bruce Evans wrote:
> >> > On Wed, 29 Sep 2010, Bruce Evans wrote:
> >> >
> >> > > On Wed, 29 Sep 2010, Bruce Evans wrote:
> >> > >
> >> > >> For benchmarks on ext2fs:
> >> > >>
> >> > >> Under FreeBSD-~5.2 rerun today:
> >> > >> untar:     59.17 real
> >> > >> tar:       19.52 real
> >> > >>
> >> > >> Under -current run today:
> >> > >> untar:    101.16 real
> >> > >> tar:      172.03 real
> >> > >>
> >> > >> So, -current is 8.8 times slower for tar, but only 1.7 times slower for
> >> > >> untar.
> >> > >> ...
> >> > >> So it seems that only 1 block in every 8 is used, and there is a seek
> >> > >> after every block.  This asks for an 8-fold reduction in throughput,
> >> > >> and it seems to have got that and a bit more for reading although not
> >> > >> for writing.  Even (or especially) with perfect hardware, it must give
> >> > >> an 8-fold reduction.  And it is likely to give more, since it defeats
> >> > >> vfs clustering by making all runs of contiguous blocks have length 1.
> >> > >>
> >> > >> Simple sequential allocation should be used unless the allocation policy
> >> > >> and implementation are very good.
> >> > >
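The "1 block in every 8" layout described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the ext2fs allocator: the names `layout_stats` and `layout_with_stride` are hypothetical, and the model simply advances the physical block number by a fixed stride for each logical block of one file.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model, not the ext2fs allocator: place nblocks logical blocks of one
 * file on disk, advancing the physical block number by `stride` each time.
 * stride == 1 models simple sequential allocation; stride == 8 models the
 * "1 block in every 8" layout.
 */
struct layout_stats {
    size_t runs;    /* number of maximal runs of contiguous blocks */
    size_t span;    /* physical blocks covered from first to last */
};

static struct layout_stats
layout_with_stride(size_t nblocks, size_t stride)
{
    struct layout_stats st = { 0, 0 };
    size_t prev = 0, pbn = 0, i;

    assert(stride >= 1);
    for (i = 0; i < nblocks; i++) {
        pbn = i * stride;
        /* A new run starts at the first block or after any gap. */
        if (i == 0 || pbn != prev + 1)
            st.runs++;
        prev = pbn;
    }
    if (nblocks > 0)
        st.span = pbn + 1;
    return st;
}
```

For a 64-block file, a stride of 1 yields a single run covering 64 blocks, while a stride of 8 yields 64 runs of length 1 spread over 505 blocks. In this model clustering has nothing to merge, and a reader has to cover roughly 8 times the distance.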
> >> > > This work a bit better after zapping the 8-fold way:
> >> >    Things
> >> > > ...
> >> > > This gives an improvement of:
> >> > >
> >> > > untar:    101.16 real -> 63.46
> >> > > tar:      172.03 real -> 50.70
> >> > >
> >> > > Now -current is only 1.1 times slower for untar and 2.6 times slower for
> >> > > tar.
> >> > >
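The slowdown factors quoted above follow directly from the elapsed times. A minimal sketch of the arithmetic, with a hypothetical helper name `slowdown` that does not appear in the thread:

```c
#include <assert.h>

/*
 * Slowdown factor of a measured elapsed time relative to a baseline.
 * Hypothetical helper for illustration; both times are in seconds.
 */
static double
slowdown(double baseline_secs, double current_secs)
{
    assert(baseline_secs > 0.0);
    return current_secs / baseline_secs;
}
```

With the numbers from the thread, `slowdown(19.52, 172.03)` is about 8.8 (tar) and `slowdown(59.17, 101.16)` is about 1.7 (untar). After removing the 8-block skip, `slowdown(59.17, 63.46)` is about 1.1 and `slowdown(19.52, 50.70)` is about 2.6.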
> >> > > There must be a problem with bpref for things to have been so bad.  There
> >> > > is some point to leaving a gap of 7 blocks for expansion, but the gap was
> >> > > left even between blocks in a single file.
> >> > > ...
> >> > > I haven't tried the bde_blkpref hack in the above.  It should kill bpref
> >> > > completely so that there is no jump between lbn0 and lbn1, and break
> >> > > cylinder group based allocation even better.  Setting bde_blkpref to 1
> >> > > restores the bug that was present in ext2fs in FreeBSD between 1995 and
> >> > > 2010.  This bug gave sequential allocation starting at the beginning of
> >> > > the disk in almost all cases, so map searches were slow and early groups
> >> > > filled up before later groups were used at all.
> >> >
> >> > Tried this (patch repeated below), and it gave essentially the same
> >> > speed as old versions.
> >> >
> >> > The main problem seems to be that the `goal' variables aren't initialized.
> >> > After restoring bits verbatim from an old version, things seem to work as
> >> > expected:
> >> >
> >> > % Index: ext2_alloc.c
> >> > % ===================================================================
> >> > % RCS file: /home/ncvs/src/sys/fs/ext2fs/ext2_alloc.c,v
> >> > % retrieving revision 1.2
> >> > % diff -u -2 -r1.2 ext2_alloc.c
> >> > % --- ext2_alloc.c  1 Sep 2010 05:34:17 -0000       1.2
> >> > % +++ ext2_alloc.c  28 Sep 2010 21:08:42 -0000
> >> > % @@ -1,2 +1,5 @@
> >> > % +int bde_blkpref = 0;
> >> > % +int bde_alloc8 = 0;
> >> > % +
> >> > %  /*-
> >> > %   *  modified for Lites 1.1
> >> > % @@ -117,4 +120,8 @@
> >> > %                                                   ext2_alloccg);
> >> > %          if (bno > 0) {
> >> > % +         /* set next_alloc fields as done in block_getblk */
> >> > % +         ip->i_next_alloc_block = lbn;
> >> > % +         ip->i_next_alloc_goal = bno;
> >> > % +
> >> > %                  ip->i_blocks += btodb(fs->e2fs_bsize);
> >> > %                  ip->i_flag |= IN_CHANGE | IN_UPDATE;
> >> >
> >> > The only things that changed recently in this block were the 4 deleted
> >> > lines and 4 lines with tabs corrupted to spaces.  Perhaps an editing
> >> > error.
> >> >
> >> > % @@ -542,6 +549,12 @@
> >> > %      then set the goal to what we thought it should be
> >> > %   */
> >> > % +if (bde_blkpref == 0) {
> >> > %   if(ip->i_next_alloc_block == lbn && ip->i_next_alloc_goal != 0)
> >> > %           return ip->i_next_alloc_goal;
> >> > % +} else if (bde_blkpref == 1) {
> >> > % + if(ip->i_next_alloc_block == lbn)
> >> > % +         return ip->i_next_alloc_goal;
> >> > % +} else
> >> > % + return 0;
> >> > %
> >> > %   /* now check whether we were provided with an array that basically
> >> >
> >> > Not needed now.
> >> >
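The three-way `bde_blkpref` switch in the hunk above is easier to follow when written out with braces. The following is only a restatement of that hunk as a standalone sketch: `toy_inode` and `toy_blkpref` are hypothetical names, the struct models just the two inode fields the patch touches, and the later heuristics of the real function (the part reached when none of the branches returns) are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two inode fields touched by the patch. */
struct toy_inode {
    long next_alloc_block;  /* logical block the goal applies to */
    long next_alloc_goal;   /* preferred physical block, 0 if unset */
};

/*
 * Returns true and stores a preference in *pref when the switch decides
 * the answer; returns false when the caller would fall through to the
 * later heuristics of the real function, which are not modelled here.
 */
static bool
toy_blkpref(const struct toy_inode *ip, long lbn, int bde_blkpref, long *pref)
{
    assert(ip != NULL && pref != NULL);
    if (bde_blkpref == 0) {
        /* Current behaviour: trust the goal only if it was set. */
        if (ip->next_alloc_block == lbn && ip->next_alloc_goal != 0) {
            *pref = ip->next_alloc_goal;
            return true;
        }
    } else if (bde_blkpref == 1) {
        /* Older behaviour: return the goal even if it is still 0. */
        if (ip->next_alloc_block == lbn) {
            *pref = ip->next_alloc_goal;
            return true;
        }
    } else {
        /* Any other value: never express a preference. */
        *pref = 0;
        return true;
    }
    return false;
}
```

The only difference between modes 0 and 1 in this sketch is whether an unset goal of 0 is returned as if it were a real preference.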
> >> > % @@ -662,4 +675,5 @@
> >> > %    * block.
> >> > %    */
> >> > % +if (bde_alloc8 == 0) {
> >> > %   if (bpref)
> >> > %           start = dtogd(fs, bpref) / NBBY;
> >> > % @@ -679,4 +693,5 @@
> >> > %           }
> >> > %   }
> >> > % +}
> >> > %
> >> > %   bno = ext2_mapsearch(fs, bbp, bpref);
> >> >
> >> > The code to skip to the next 8-block boundary should be removed permanently.
> >> > After fixing the initialization, it doesn't generate holes inside files but
> >> > it still generates holes between files.  The holes are quite large with
> >> > 4K-blocks.
> >> >
> >> > Benchmark results with just the initialization of `goal' variables restored:
> >> >
> >> > %%%
> >> > ext2fs-1024-1024:
> >> > tarcp /f srcs:                 78.79 real         0.31 user         4.94 sys
> >> > tar cf /dev/zero srcs:         24.62 real         0.19 user         1.82 sys
> >> > ext2fs-1024-1024-as:
> >> > tarcp /f srcs:                 52.07 real         0.26 user         4.95 sys
> >> > tar cf /dev/zero srcs:         24.80 real         0.10 user         1.93 sys
> >> > ext2fs-4096-4096:
> >> > tarcp /f srcs:                 74.14 real         0.34 user         3.96 sys
> >> > tar cf /dev/zero srcs:         33.82 real         0.10 user         1.19 sys
> >> > ext2fs-4096-4096-as:
> >> > tarcp /f srcs:                 53.54 real         0.36 user         3.87 sys
> >> > tar cf /dev/zero srcs:         33.91 real         0.14 user         1.15 sys
> >> > %%%
> >> >
> >> > The much larger holes between the files are apparently responsible for the
> >> > decreased speed with 4K-blocks.  1K-blocks are really too small, so 4K-blocks
> >> > should be faster.
> >> >
> >> > Benchmark results with the fix and bde_alloc8 = 1.
> >> >
> >> > ext2fs-1024-1024:
> >> > tarcp /f srcs:                 71.60 real         0.15 user         2.04 sys
> >> > tar cf /dev/zero srcs:         22.34 real         0.05 user         0.79 sys
> >> > ext2fs-1024-1024-as:
> >> > tarcp /f srcs:                 46.03 real         0.14 user         2.02 sys
> >> > tar cf /dev/zero srcs:         21.97 real         0.05 user         0.80 sys
> >> > ext2fs-4096-4096:
> >> > tarcp /f srcs:                 59.66 real         0.13 user         1.63 sys
> >> > tar cf /dev/zero srcs:         19.88 real         0.07 user         0.46 sys
> >> > ext2fs-4096-4096-as:
> >> > tarcp /f srcs:                 37.30 real         0.12 user         1.60 sys
> >> > tar cf /dev/zero srcs:         19.93 real         0.05 user         0.49 sys
> >> >
> >> > Bruce
> >>
> >> Hi,
> >>
> >> I see what you are saying. The gap of 8 blocks between the files
> >> is due to the old preallocation, which used to allocate an additional
> >> 8 blocks in advance for a particular inode when allocating a block
> >> for it. The gap between blocks of the same file shouldn't be there
> >> either. Both of these cases should be removed. I will look into this
> >> during this week. The slowness is also due to lack of preallocation
> >> in the new code.
> >
> > One of the GSoC students worked on a patch to add preallocation back to
> > ext2fs this summer.  Would you be interested in reviewing and/or testing
> > that patch?  (I've attached it).  Here is his original e-mail:
> >
> > <quote>
> > Hi all,
> >
> > Attached is a patch that implements a preallocation algorithm in
> > ext2fs. I implemented this algorithm during FreeBSD SoC 2010.
> >
> > This patch implements the in-memory ext2/3 block preallocation algorithm
> > based on reservation windows. It uses an RB-tree to index block
> > allocation requests and reserves a number of blocks for each file that
> > has requested a block. When a file requests a block, the allocator
> > looks for a block that is in the same cylinder group as the inode, is
> > not inside any other reservation window in the RB-tree, and is followed
> > by some contiguous free blocks. It stores this block's position and the
> > length of the contiguous free run in a data structure and inserts that
> > structure into the RB-tree. When the file requests a block again, the
> > allocator looks up the corresponding structure in the RB-tree. If it
> > finds one, the next free block is allocated to the file directly.
> > Otherwise, it searches for a new block again.
> >
> > I have run some benchmarks to test this algorithm. Please review them on
> > the wiki page (http://wiki.freebsd.org/SOC2010ZhengLiu). The performance
> > is better when the number of threads is smaller than 4. When the number
> > of threads is greater than 4, the performance improves a little.
> >
> > Please test it.
> >
> >
> > Thanks and best regards,
> >
> > lz
> > </quote>
> 
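The reservation-window idea described in the quoted message can be sketched in a few lines of C. This is a toy model under stated assumptions, not the code from the patch: a small fixed array stands in for the RB-tree, the group has 64 blocks, each window reserves 8 blocks, and every name here (`rsv_window`, `group`, `rsv_alloc`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 64  /* blocks in the toy "cylinder group" (assumed) */
#define MAXWIN  8   /* maximum number of reservation windows (assumed) */
#define WINLEN  8   /* blocks reserved per window (assumed) */

struct rsv_window {
    int owner;      /* file id, or -1 if the slot is unused */
    int next;       /* next block to hand out */
    int end;        /* one past the last reserved block */
};

struct group {
    bool used[NBLOCKS];
    struct rsv_window win[MAXWIN];  /* array stands in for the RB-tree */
};

static void
group_init(struct group *g)
{
    int i;

    for (i = 0; i < NBLOCKS; i++)
        g->used[i] = false;
    for (i = 0; i < MAXWIN; i++)
        g->win[i].owner = -1;
}

/* True if block b is still reserved by a window of another file. */
static bool
reserved_by_other(const struct group *g, int b, int owner)
{
    int i;

    for (i = 0; i < MAXWIN; i++)
        if (g->win[i].owner != -1 && g->win[i].owner != owner &&
            b >= g->win[i].next && b < g->win[i].end)
            return true;
    return false;
}

/* Start of the first run of WINLEN free, unreserved blocks, or -1. */
static int
find_free_run(const struct group *g, int owner)
{
    int b, len = 0;

    for (b = 0; b < NBLOCKS; b++) {
        if (g->used[b] || reserved_by_other(g, b, owner))
            len = 0;
        else if (++len == WINLEN)
            return b - WINLEN + 1;
    }
    return -1;
}

/* Allocate one block for file `owner`; returns its number or -1. */
static int
rsv_alloc(struct group *g, int owner)
{
    struct rsv_window *w = NULL;
    int i, start;

    assert(owner >= 0);
    /* Find this file's window; otherwise remember the first free slot. */
    for (i = 0; i < MAXWIN; i++) {
        if (g->win[i].owner == owner) {
            w = &g->win[i];
            break;
        }
        if (w == NULL && g->win[i].owner == -1)
            w = &g->win[i];
    }
    if (w == NULL)
        return -1;      /* no window slot available */
    if (w->owner != owner || w->next >= w->end) {
        /* No usable window: search for a new run and reserve it. */
        start = find_free_run(g, owner);
        if (start < 0)
            return -1;
        w->owner = owner;
        w->next = start;
        w->end = start + WINLEN;
    }
    g->used[w->next] = true;
    return w->next++;
}
```

In this model, two files that allocate alternately still each receive contiguous blocks: the first file gets 0, 1, 2, ... and the second gets 8, 9, 10, ..., because each search skips blocks inside the other file's window. When a window is used up, the next request falls back to a fresh search.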
> Wow, this is really awesome! What are the chances of this code being
> committed before a 9.0 release (assuming we have enough user testing)?

Good if it gets testing and review.  He also worked on read-only support for 
ext4 (in a second patch).  Both patches were posted to this list (fs@) several 
weeks ago.

-- 
John Baldwin


