Date:      Wed, 12 Jun 2013 09:39:07 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        fs@FreeBSD.org
Subject:   Re: missed clustering for small block sizes in cluster_wbuild()
Message-ID:  <20130612085648.L836@besplex.bde.org>
In-Reply-To: <20130612053543.X900@besplex.bde.org>
References:  <20130607044845.O24441@besplex.bde.org> <20130611063446.GJ3047@kib.kiev.ua> <20130612053543.X900@besplex.bde.org>

On Wed, 12 Jun 2013, Bruce Evans wrote:

> On Tue, 11 Jun 2013, Konstantin Belousov wrote:
>
>> On Fri, Jun 07, 2013 at 05:28:11AM +1000, Bruce Evans wrote:
>>> I think this is best fixed by removing the check above and
>>> checking here.  Then back out of the changes.  I don't know this code
>>> well enough to write the backing out easily.
>> 
>> Could you test this, please ?
>
> It works in limited testing.

> ...
> - there were a lot of contiguous dirty buffers, and this loop happily built
>  up a cluster with 17 pages, though mnt_iosize_max was only 16 pages.
>  Perhaps the extra page is necessary if the part of the buffer to be
>  written starts at a nonzero offset, but there was no offset in the case
>  that I observed (can there be one, and if so, is it limited to an offset
>  within the first page?  The general case needs 16 extra 4K pages to write
>  a 64K-block when the offset of the area to be written is 64K-512).

I now remember a bit more about how this works.  There is only a limited
amount of offsetting.  The buffer might not be page-aligned relative to
the start of the disk.  Then not all of the first page in the buffer may
be accessed (via this buffer) for i/o.  The first page is mapped at
bp->b_kvabase, but disk drivers must only access data starting at
bp->b_data, which is offset from bp->b_kvabase in the misaligned case.

I think this is the only relevant complication.  When misaligned buffers
are merged into a cluster buffer, they must all have the same misalignment
and size for the merge to work.  1 "extra" page, but no more, is always
required in the misaligned case to reach the full mnt_iosize_max.
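
To put a number on the "extra" page, here is a sketch of the page
arithmetic (my own illustration, not kernel code, and the helper name is
made up; it only uses the real struct buf fields b_kvabase, b_data and
b_bcount and the howmany() macro from <sys/param.h>):

#include <sys/param.h>      /* PAGE_SIZE, howmany() */
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/buf.h>

/*
 * Pages spanned by the driver-visible part of a buffer whose data
 * starts at bp->b_data, possibly at a nonzero offset within the
 * first page of the kva at bp->b_kvabase.
 */
static int
buf_io_pages(struct buf *bp)
{
    vm_offset_t off;

    off = (vm_offset_t)(bp->b_data - bp->b_kvabase);
    /* off < PAGE_SIZE in the misaligned case described above. */
    return (howmany(off + bp->b_bcount, PAGE_SIZE));
}

With off == 0 and b_bcount == 64K this gives 16 pages; any nonzero off
pushes it to 17, i.e. the 1 extra page.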

msdosfs buffers may even be misaligned if they have size 64K!  All
msdosfs clusters may be misaligned if they have size >= PAGE_SIZE!
This is not good for performance, but should work.  Misalignment used
to be the usual case, since msdosfs metadata before the data clusters
tends to have an odd size in sectors and when the cluster size is >=
PAGE_SIZE the misalignment is preserved.  FreeBSD newfs_msdos shouldn't
produce misaligned buffers, but other systems' utilities might.  This
may also cause problems with the MAXBSIZE limit of 64K.  If it is a
hard limit on b_kvasize, then misaligned buffers of this size won't
be allowed.  If it is only a limit on b_bcount, then there may be
fragmentation problems.
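
To see where the misalignment comes from, here is a throwaway userland
calculation of where the FAT data area starts (the numbers are only
plausible examples I made up, not from a real BPB):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096

int
main(void)
{
    /* Illustrative FAT layout values, not from a real filesystem. */
    uint32_t bytes_per_sector = 512;
    uint32_t reserved_sectors = 1;      /* boot sector only */
    uint32_t nfats = 2, sectors_per_fat = 145;
    uint32_t root_dir_sectors = 32;     /* 512 * 32-byte entries */
    uint32_t data_start_sector, data_start_bytes;

    data_start_sector = reserved_sectors + nfats * sectors_per_fat +
        root_dir_sectors;
    data_start_bytes = data_start_sector * bytes_per_sector;
    printf("data area starts at sector %u (byte %u): %s\n",
        data_start_sector, data_start_bytes,
        data_start_bytes % PAGE_SIZE == 0 ? "page-aligned" : "misaligned");
    return (0);
}

Here the data area starts at sector 323 (an odd number of sectors), i.e.
1536 bytes past a page boundary, and when the cluster size is a multiple
of PAGE_SIZE every data cluster inherits that 1536-byte misalignment.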

> ...
> I think it would work and fix other bugs to check (tbp->b_bcount +
> bp->b_bcount <= vp->v_mount->mnt_iosize_max) up front.  Drivers should
> be able to handle an i/o size of b_bcount however many pages that
> takes.  There must be a limit on the number of b_pages, but it seems to
> be non-critical, and the limit on b_bcount implies a page limit of
> (mnt_iosize_max / PAGE_SIZE) rounded in some way and possibly increased
> by 1 or doubled to account for offsets.  If mnt_iosize_max is not a
> multiple of PAGE_SIZE, then the limit expressed in pages doesn't even
> allow covering mnt_iosize_max, since the rounding down discards part
> of a page.
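
Concretely, the up-front check I mean is no more than the following
(a sketch only, not the patch being tested; the helper name is made up
and the real test would be inline in cluster_wbuild()'s scan loop):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/buf.h>
#include <sys/mount.h>
#include <sys/vnode.h>

/*
 * Would merging the candidate tbp into the cluster buffer bp push
 * the transfer past the device's i/o size limit?
 */
static int
cluster_bcount_full(struct vnode *vp, struct buf *bp, struct buf *tbp)
{
    return (tbp->b_bcount + bp->b_bcount > vp->v_mount->mnt_iosize_max);
}

If this is true, tbp is simply not merged (or the merge is backed out),
independently of how many pages that b_bcount happens to take.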

I'm now trying the b_bcount check and not doing any backout later (just
printing debugging info when the backout case is reached).  The backout
case is still reached even with the b_bcount check.  This is in the
misaligned case.  The misaligned case shouldn't break clustering, since
it is quite common.  It happens whenever the block size is small and the
start of the cluster is misaligned relative to the start of the disk.
If the block size is larger, then all blocks may be misaligned.

> [read-before-write fix for msdosfs and generic problems with read-b4-write]
> ...  Then I noticed another problem.  MAXPHYS is twice mnt_iosize_max,
> so the cluster size is only mnt_iosize_max = DFLTPHYS = 64K.  This
> apparently interacts badly with vfs.read_max = 256 512-blocks.  I think
> it breaks read-ahead.  Throughput drops by a factor of 4 for read-before-
> write relative to direct writes (not counting the factor of 2 for the
> doubled i/o from the reads), although all the i/o sizes are 64K.
> Increasing vfs.read_max by just 16 fixes this.  The throughput drop
> is then only 10-20% (there must be some drop for the extra seeks).
> I'm not sure if extra read-ahead is good or bad here.  More read-ahead
> in read-before-write reduces seeks, but it may also break drives'
> caching and sequential heuristics.  My drives are old and have small
> caches and are very sensitive to the i/o pattern for read-before-write.

I confirmed that this has something to do with the drive.  After reaching
a quiescent pattern with "dd bs=1k count=1024k conv=notrunc" for almost-
contiguous files (and 1k < fs block size, and fs = msdosfs with MAXPHYS
read-before-write), reads and writes alternate, with reads some constant
distance ahead of writes.  The distance depends on vfs.read_max.  It is
sometimes a multiple of 128 512-blocks, but often not.  My drives don't
like some fixed distances.  I don't understand their pattern.  They seem
to prefer non-power-of-2 differences.  Turning off read-ahead by setting
vfs.read_max to 0 gives the worst performance (reduced by another power
of 2).  The levels of reduced performance are quantized: one level at
7 times slower, one level at 4 times slower, and one level at 10-20%
slower.
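
For reference, the read-ahead distances that the settings above
correspond to, as I read the numbers (plain userland arithmetic,
nothing kernel-specific):

#include <stdio.h>

int
main(void)
{
    int cluster_bytes = 64 * 1024;      /* mnt_iosize_max */
    int read_max_blks[] = { 0, 256, 256 + 16 };
    int i, bytes;

    for (i = 0; i < 3; i++) {
        bytes = read_max_blks[i] * 512;
        printf("read_max %3d -> %6d bytes = %d.%02d clusters\n",
            read_max_blks[i], bytes, bytes / cluster_bytes,
            bytes % cluster_bytes * 100 / cluster_bytes);
    }
    return (0);
}

256 512-blocks is exactly 2 of the 64K clusters, while 256 + 16 is
2 clusters plus 8K, no longer a power-of-2 multiple of the cluster
size, which fits the drives' apparent preference for non-power-of-2
differences.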

Bruce


