Date:      Wed, 14 Apr 2010 16:38:28 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Andriy Gapon <avg@FreeBSD.org>
Cc:        arch@FreeBSD.org, Rick Macklem <rmacklem@uoguelph.ca>
Subject:   Re: (in)appropriate uses for MAXBSIZE
Message-ID:  <20100414144336.L12587@delplex.bde.org>
In-Reply-To: <4BC34402.1050509@freebsd.org>
References:  <4BBEE2DD.3090409@freebsd.org> <Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca> <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org> <4BC34402.1050509@freebsd.org>

On Mon, 12 Apr 2010, Andriy Gapon wrote:

> on 11/04/2010 05:56 Bruce Evans said the following:
>> On Fri, 9 Apr 2010, Andriy Gapon wrote:
> [snip]
>>> I have lightly tested this under qemu.
>>> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s.
>>> I removed size > MAXBSIZE check in getblk (see a parallel thread
>>> "panic: getblk: size(%d) > MAXBSIZE(%d)").
>>
>> Did you change the other known things that depend on this?  There is the
>> b_pages limit of MAXPHYS bytes which should be checked for in another
>> way
>
> I changed the check the way I described in the parallel thread.

I didn't notice anything there about checking MAXPHYS instead of MAXBSIZE.
Was an explicit check needed?  (An implicit check would probably have
worked: most clients were limited by the MAXBSIZE check, and the pbuf
client always uses MAXPHYS or DFLTPHYS.)

>> and the soft limits for hibufspace and lobufspace which only matter
>> under load conditions.
>
> And what should these be?
> hibufspace and lobufspace seem to be auto-calculated.  One thing that I noticed
> and that was a direct cause of the problem described below, is that difference
> between hibufspace and lobufspace should be at least the maximum block size
> allowed in getblk() (perhaps it should be strictly equal to that value?).
> So in my case I had to make that difference MAXPHYS.

Hard to say.  They are mostly heuristics which only matter under heavy
loads.  You can change the defaults using sysctl, but it is even harder
to know what changes might be good without knowing the details of the
implementation.
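For reference, the auto-tuning is approximately this (quoting bufinit()
from memory, so treat the exact expressions as approximate):

        /*
         * Approximate sketch of the auto-tuning in bufinit().  The gap
         * between hibufspace and lobufspace is what has to cover the
         * largest buffer that getblk() can be asked for, which is why
         * you had to grow it to MAXPHYS after removing the MAXBSIZE
         * limit.
         */
        maxbufspace = (long)nbuf * BKVASIZE;
        hibufspace = lmax(3 * maxbufspace / 4, maxbufspace - MAXBSIZE * 10);
        lobufspace = hibufspace - MAXBSIZE;

The corresponding sysctls are vfs.hibufspace and vfs.lobufspace.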

>>> And I bumped MAXPHYS to 1MB.
>>> ...
>>> But I observed reading process (dd bs=1m on avgfs) spending a lot of
>>> time sleeping
>>> on needsbuffer in getnewbuf.  needsbuffer value was VFS_BIO_NEED_ANY.
>>> Apparently there was some shortage of free buffers.
>>> Perhaps some limits/counts were incorrectly auto-tuned.
>>
>> This is not surprising, since even 64K is 4 times too large to work
>> well.  Buffer sizes of larger than BKVASIZE (16K) always cause
>> fragmentation of buffer kva.  ...
>
> So, BKVASIZE is the best read size from the point of view of buffer space usage?

It is the best buffer size, which is almost independent of the best read
size.  First, userland reads will be re-blocked into file-system-block-size
reads...
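The re-blocking looks roughly like this (a much-simplified loop in the
style of ffs_read(); names and details are approximate and most error
handling is omitted):

        /*
         * Whatever size the user asked for, the loop walks the file
         * one logical block at a time, so each iteration works on at
         * most one fs-block-sized buffer.
         */
        while (uio->uio_resid > 0) {
                lbn = lblkno(fs, uio->uio_offset);
                size = blksize(fs, ip, lbn);    /* <= fs block size */
                error = bread(vp, lbn, size, NOCRED, &bp);
                if (error)
                        break;
                xfersize = MIN(size - blkoff(fs, uio->uio_offset),
                    uio->uio_resid);
                error = uiomove((char *)bp->b_data +
                    blkoff(fs, uio->uio_offset), xfersize, uio);
                brelse(bp);
                if (error)
                        break;
        }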

> E.g. a single MAXBSIZE=64K read results in a single 64K GEOM read requests, but
> leads to buffer space map fragmentation, because of size > BKVASIZE.
> On the other hand, four sequential reads of BKVASIZE=16K bytes are perfect from
> buffer space point of view (no fragmentation potential) but they result in 4 GEOM

Clustering occurs above geom, so geom only sees small requests for small
files, random accesses, and buggy cases for sequential accesses to large
files where the bugs give partial randomness.

E.g., a single 64K read from userland normally gives 4 16K ffs blocks
in the buffer cache.  Clustering turns these into 1 128K block in a
pbuf (64K for the amount read now and 64K for read-ahead; there may
be more read-ahead but it would go in another pbuf).  geom then sees
the 128K (MAXPHYS) block.  Most device drivers still only support i/o's
of size <= DFLTPHYS, but geom confuses the clustering code into producing
clusters larger than what the device actually supports, by advertising
support for MAXPHYS (v_mount->mnt_iosize_max).  So geom normally turns
the 128K request into 2 64K requests.  Clustering finishes by splitting
the completed 128K request back into 8 16K buffers (4 for use now and
4 later for read-ahead).

OTOH, the first block of 4 sequential reads of 16K produces the same
128K block at the geom level, modulo bugs in the read-ahead.  This now
consists of 1 block of normal read and 7 blocks of read-ahead, instead
of 4 and 4.  Then the next 3 blocks are found in the buffer cache as
read-ahead instead of being read from the disk (actually, this is
insignificantly different from the first case after ffs splits up the
64K into 4 times 16K).
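
Putting rough numbers on both cases in one place (assuming 16K ffs
blocks, MAXPHYS = 128K and a DFLTPHYS = 64K driver limit):

        single 64K read from userland:
            syscall:         1 x 64K
            ffs buffers:     4 x 16K
            cluster (pbuf):  1 x 128K  (64K read now + 64K read-ahead)
            geom -> driver:  2 x 64K   (driver limited to DFLTPHYS)

        four sequential 16K reads from userland:
            syscalls:        4 x 16K
            ffs buffers:     4 x 16K   (1 read from disk, 3 found as
                                        read-ahead)
            cluster (pbuf):  1 x 128K  (16K read now + 112K read-ahead,
                                        modulo the read-ahead bugs)
            geom -> driver:  2 x 64K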

So the block size makes almost no difference at the syscall level
(512-blocks take significantly more CPU but improve latency, while
huge blocks take significantly less CPU but significantly worsen
latency).

The file system block size makes only secondary differences:
- clustering only works to turn small logical i/o's into large physical
   ones when sequential blocks are allocated sequentially, but always
   allocating blocks sequentially is hard to do and using large file
   system blocks reduces the loss when the allocation is not sequential
- large file system blocks also reduce the amount of work that clustering
   has to do to reblock.  This benefit is much smaller than the previous
   one.
- the buffer cache is only designed to handle medium-sized blocks well.
   With 512-blocks, it can only hold 1/32 as much as with 16K-blocks,
   so it will thrash 32 times as much with the former.  Now that the
   thrashing is to VMIO instead of to the disk, this only wastes CPU.
   With any block size larger than BKVASIZE, the buffer cache may become
   fragmented, depending on the combination of block sizes.  Mixed
   combinations are the worst, and the system doesn't do anything to
   try to avoid them.  The worst case is a buffer cache full of 512-blocks,
   with getblk() wanting to allocate a 64K-block.  Then it needs to
   wait for 32 contiguous blocks to become free, or forcibly free some,
   or move some...
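
The 1/32 factor is just buffer-header arithmetic.  A toy userland
calculation (the numbers are hypothetical; nbuf and the kva budget are
really derived from physical memory at boot):

        #include <stdio.h>

        /*
         * Toy model: the buffer cache has a fixed number of headers
         * (nbuf) and roughly nbuf * BKVASIZE of kva budgeted for them.
         * Blocks smaller than BKVASIZE run out of headers first (poor
         * coverage); blocks larger than BKVASIZE run out of kva first,
         * which is where the fragmentation problems start.
         */
        #define BKVASIZE        (16 * 1024)

        int
        main(void)
        {
                long nbuf = 1000;               /* hypothetical */
                long kva = nbuf * BKVASIZE;     /* budgeted buffer kva */
                long sizes[] = { 512, 16 * 1024, 64 * 1024 };

                for (int i = 0; i < 3; i++) {
                        long bsize = sizes[i];
                        long fit = kva / bsize;
                        long usable = fit < nbuf ? fit : nbuf;

                        printf("bsize %6ld: %5ld buffers, %8ld bytes cached\n",
                            bsize, usable, usable * bsize);
                }
                return (0);
        }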


> I/O requests.
> The thing is that a single read requires a single contiguous virtual address space
> chunk.  Would it be possible to take the best of both worlds by somehow allowing a
> single large I/O request to work with several buffers (with b_kvasize == BKVASIZE)
> in a iovec-like style?
> Have I just reinvented the wheel? :)
> Probably not, because an answer to my question is probably 'not (without lots of
> work in lots of places)' as well.

Separate buffers already partly provide this, and combined with command
queuing in the hardware they provide it completely, in perhaps a better
way than can be done in software.

vfs clustering attempts much less but is still complicated.  It mainly wants
to convert buffers that have contiguous disk addresses into a super-buffer
that has contiguous virtual memory and combine this with read-ahead, to
reduce the number of i/o's.  All drives less than 10 years old benefit
only marginally from this, since the same cases that vfs clustering can
handle are also easy for drive clustering, caching and read-ahead/write-
behind (especially the latter) to handle even better, so I occasionally
try turning off vfs clustering to see if it makes a difference;
unfortunately it still seems to help on all drives, including even
reducing total CPU usage despite its own large CPU usage.

> I see that breadn() certainly doesn't work that way.  As I understand, it works
> like bread() for one block plus starts something like 'asynchronous breads()' for
> a given count of other blocks.

Usually breadn() isn't called, but clustering reads to the end of the current
cluster or maybe the next cluster.  breadn() was designed when reading
ahead a single cluster was helpful.  Now, drives read ahead a whole track
or similar, probably hundreds of sectors, so reading ahead a single sector
is almost useless.  It doesn't even reduce the number of i/o's unless it is
clustered with the previous i/o.
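To make "usually breadn() isn't called" concrete: the choice is made in
the file system's read routine.  Roughly, and from memory (so treat the
names and argument lists as approximate), ffs_read() does:

        if (lblktosize(fs, nextlbn) >= ip->i_size) {
                /* Last block of the file: plain bread(), nothing ahead. */
                error = bread(vp, lbn, size, NOCRED, &bp);
        } else if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0) {
                /* Normal case: let clustering build the large request. */
                error = cluster_read(vp, ip->i_size, lbn, size, NOCRED,
                    blkoffset + uio->uio_resid, seqcount, &bp);
        } else if (seqcount > 1) {
                /* Clustering disabled: 1 block of read-ahead via breadn(). */
                int nextsize = blksize(fs, ip, nextlbn);

                error = breadn(vp, lbn, size, &nextlbn, &nextsize, 1,
                    NOCRED, &bp);
        } else {
                error = bread(vp, lbn, size, NOCRED, &bp);
        }

This is also where turning clustering off for the experiments mentioned
above happens: mounting with -o noclusterr sets MNT_NOCLUSTERR and falls
back to the breadn()/bread() paths.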

> I am not sure about details of how cluster_read() works, though.
> Could you please explain the essence of it?

See above.  It is essentially the old hack of reading ahead a whole
track in software, done in a sophisticated way but with fewer attempts
to satisfy disk geometry timing requirements.  Long ago, everything
was so slow that sequential reads done from userland could not keep
up with even a floppy disk, but sequential i/o's done from near the
driver could, even with i/o's of only 1 sector.  I've only ever seen
this working well for floppy disks.  For hard disks, the i/o's need
to be multi-sector, and needed to be related to the disk geometry
(handle full tracks and don't keep results from intermediate sectors
that are not needed yet iff doing so wouldn't thrash the cache).  Now,
it is unreasonable to try to know the disk geometry, and vfs clustering
doesn't try.  Fortunately, this is not much needed, since newer drives
have their own track caches which, although they don't form a complete
replacement for vfs clustering (see above), reduce the losses to extra
non-physical reads.  Similarly for another problem with vfs: all buffers
and their clustering are file (vnode) based, which almost forces missing
intermediate sectors when reading a file, but a working device-level
(track or similar) cache in the drive mostly compensates for not having
one in the OS.

Bruce


