Date:      Tue, 7 Jul 2009 01:54:14 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Alexander Motin <mav@freebsd.org>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: DFLTPHYS vs MAXPHYS
Message-ID:  <20090707011217.O43961@delplex.bde.org>
In-Reply-To: <4A50F619.4020101@FreeBSD.org>
References:  <4A4FAA2D.3020409@FreeBSD.org> <20090705100044.4053e2f9@ernst.jennejohn.org> <4A50667F.7080608@FreeBSD.org> <20090705223126.I42918@delplex.bde.org> <4A50BA9A.9080005@FreeBSD.org> <20090706005851.L1439@besplex.bde.org> <4A50DEE8.6080406@FreeBSD.org> <20090706034250.C2240@besplex.bde.org> <4A50F619.4020101@FreeBSD.org>

On Sun, 5 Jul 2009, Alexander Motin wrote:

> Bruce Evans wrote:

>> My results (MAXPHYS is 64K, transfer rate 50MB/S, under FreeBSD-~5.2
>> de-geomed):
>> 
>> regular file:
>> 
>> block size    %idle
>> ----------    -----
>> 1M            87
>> 16K           91
>> 4K            88 (?)
>> 512           72 (?)
>> 
>> disk file:
>> 
>> block size    %idle
>> ----------    -----
>> 1M            96
>> 64K           96
>> 32K           93
>> 16K           87
>> 8K            82 (firmware can't keep up and rate drops to 37MB/S)
>> 
>> In the case of the regular file, almost all i/o is clustered so the driver
>> sees mainly the cluster size (driver max size of 64K before geom).  Upper
>> layers then do a good job of only adding a few percent CPU when
>> declustering to 16K fs-blocks.
>
> In these tests you got almost only the negative side of the effect, as
> you have said, due to cache misses.

No, I got negative and positive for the regular file (due to cache misses
for large block sizes and too many transactions for very small block sizes
(< 16K)), and only positive for the disk file (due to cache misses not
being tested).

> Do you really have a CPU with such a small L2 cache?
> Some kind of P3 or old Celeron?

It is 1M, as stated, on an A64 (not stated).  Since the disk file case
uses a pbuf, it only thrashes about half as much cache as the regular
file, provided the used part of the pbuf data is small compared with
the cache size.  I forgot to test with a user buffer size of 2M.

> But with 64K MAXPHYS you just didn't get any
> benefit from using a bigger block size.

MAXPHYS is 128K.  The ata driver has a limit of 64K, so anything larger
than 64K wouldn't do much except increase cache misses.  In physio(),
a larger size would just cause physio() to ask the driver to read 64K at
a time.  My claim is partly that 64K is such a large size that the extra
CPU caused by splitting up into 64K blocks is insignificant.

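(For concreteness, the splitting amounts to something like the sketch
below.  The names here are made up, and the real physio() works through
pbufs and struct buf, so this only shows the clamping, not the mechanism:)

#include <stddef.h>

#define SI_IOSIZE_MAX	(64 * 1024)	/* driver's limit, e.g. ata's 64K */

/* Stand-in for handing one clamped transfer to the driver. */
static int
do_one_transfer(char *p, size_t n)
{

	(void)p; (void)n;
	return (0);
}

/*
 * Read `len' bytes into a user buffer, at most SI_IOSIZE_MAX at a time,
 * so a huge bs given to dd still reaches the driver in 64K pieces.
 */
static int
chunked_read(char *ubuf, size_t len)
{
	size_t chunk, done;
	int error;

	for (done = 0; done < len; done += chunk) {
		chunk = len - done;
		if (chunk > SI_IOSIZE_MAX)
			chunk = SI_IOSIZE_MAX;
		if ((error = do_one_transfer(ubuf + done, chunk)) != 0)
			return (error);
	}
	return (0);
}

int
main(void)
{
	static char buf[256 * 1024];

	return (chunked_read(buf, sizeof(buf)));
}
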
Here are better results for the disk file test, with cache accesses and
misses counted by perfmon:

% dd if=/dev/ad2 of=/dev/null bs=16384 count=16384
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.857302 secs (55264313 bytes/sec)
% 146378905
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.782373 secs (56130180 bytes/sec)
% 946562
% dd if=/dev/ad2 of=/dev/null bs=32768 count=8192
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.715802 secs (56922546 bytes/sec)
% 79404995
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.749098 secs (56523463 bytes/sec)
% 640427
% dd if=/dev/ad2 of=/dev/null bs=65536 count=4096
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.740766 secs (56622802 bytes/sec)
% 45633277
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.882316 secs (54981173 bytes/sec)
% 424469

Cache misses are minimized here using a user buffer size of 64K.

% dd if=/dev/ad2 of=/dev/null bs=131072 count=2048
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.873972 secs (55075298 bytes/sec)
% 42296347
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.940565 secs (54332946 bytes/sec)
% 497104
% dd if=/dev/ad2 of=/dev/null bs=262144 count=1024
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.982193 secs (53878976 bytes/sec)
% 38617107
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.715697 secs (56923816 bytes/sec)
% 522888
% dd if=/dev/ad2 of=/dev/null bs=524288 count=512
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.957179 secs (54150849 bytes/sec)
% 37115853
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.923855 secs (54517338 bytes/sec)
% 521308
% dd if=/dev/ad2 of=/dev/null bs=1048576 count=256
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.707334 secs (57024946 bytes/sec)
% 36526303

Cache accesses are minimized here using a user buffer size of 1M.

% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.715655 secs (56924319 bytes/sec)
% 541909
% dd if=/dev/ad2 of=/dev/null bs=2097152 count=128
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.715631 secs (56924610 bytes/sec)
% 36628946
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.707306 secs (57025284 bytes/sec)
% 534541

Cache misses are only increased a little here with a user buffer size
of 2M.  I can't explain this.  Maybe I misremember my CPU's cache size.

% dd if=/dev/ad2 of=/dev/null bs=4194304 count=64
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.965433 secs (54060837 bytes/sec)
% 37688487
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.740570 secs (56625145 bytes/sec)
% 2443717

Cache misses increased by a factor of 5 going from user buffer size
2M to 4M.

% dd if=/dev/ad2 of=/dev/null bs=8388608 count=32
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 5.056997 secs (53081988 bytes/sec)
% 39425354
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.907099 secs (54703493 bytes/sec)
% 589090
% dd if=/dev/ad2 of=/dev/null bs=16777216 count=16
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.998672 secs (53701354 bytes/sec)
% 49361807
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.732208 secs (56725202 bytes/sec)
% 603496
% dd if=/dev/ad2 of=/dev/null bs=33554432 count=8
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.965315 secs (54062119 bytes/sec)
% 61536416
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.882041 secs (54984269 bytes/sec)
% 3947985
% dd if=/dev/ad2 of=/dev/null bs=67108864 count=4
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.857003 secs (55267715 bytes/sec)
% 78234741
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.931896 secs (54428448 bytes/sec)
% 8580752
% dd if=/dev/ad2 of=/dev/null bs=134217728 count=2
% # s/kx-dc-accesses 
% 268435456 bytes transferred in 4.815146 secs (55748145 bytes/sec)
% 124758517
% # s/kx-dc-misses 
% 268435456 bytes transferred in 4.865137 secs (55175312 bytes/sec)
% 13808781

Cache misses increased by another factor of 5 going from user buffer
size 4M to 128M.  I can't explain why there are as many as 13.8 million
-- I would have expected only 2*256M/64 = 8M (one miss per 64-byte line,
with the 256M touched twice), and I would have expected to see counts
near that in more of the cases.  8 million cache misses in only 4.8
seconds is a lot, and you would get that many in only 1.3 seconds at
200MB/S.  Of course, 128M is a silly buffer size, but I would expect
the cache effects to show up at about half the L2 size under more
realistic loads.

Cache accesses varied significantly, between 146 million (block size
16384), 37 million (block size 1M) and 125 million (block size 128M).
I can only partly explain this.  I think the minimum number is
2*256M/16 = 32M (for fetching from L2 to L1 16 bytes at a time).  The
larger counts are near 2*256M/4 = 128M, which might result from fetching
4 bytes at a time or from thrashing causing the equivalent.
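
(The expected counts above, spelled out.  This assumes the 64-byte line
size and 16-byte L2-to-L1 fetch size used in the estimates, which may
not match the real hardware:)

#include <stdio.h>

int
main(void)
{
	long long nbytes = 256LL * 1024 * 1024;	/* bytes transferred */

	/* One miss per 64-byte line, with the data touched twice. */
	printf("expected misses:   %lld\n", 2 * nbytes / 64);	/* 8M */
	/* One L1 access per 16-byte fetch from L2. */
	printf("expected accesses: %lld\n", 2 * nbytes / 16);	/* 32M */
	/* 4-byte fetches would give counts near the larger observed ones. */
	printf("4-byte accesses:   %lld\n", 2 * nbytes / 4);	/* 128M */
	return (0);
}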

Bruce
