Date:      Mon, 6 Jul 2009 04:32:11 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Alexander Motin <mav@freebsd.org>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: DFLTPHYS vs MAXPHYS
Message-ID:  <20090706034250.C2240@besplex.bde.org>
In-Reply-To: <4A50DEE8.6080406@FreeBSD.org>
References:  <4A4FAA2D.3020409@FreeBSD.org> <20090705100044.4053e2f9@ernst.jennejohn.org> <4A50667F.7080608@FreeBSD.org> <20090705223126.I42918@delplex.bde.org> <4A50BA9A.9080005@FreeBSD.org> <20090706005851.L1439@besplex.bde.org> <4A50DEE8.6080406@FreeBSD.org>

On Sun, 5 Jul 2009, Alexander Motin wrote:

> Bruce Evans wrote:
>> I was thinking more of transfers to userland.  Increasing user buffer
>> sizes above about half the L2 cache size guarantees busting the L2
>> cache, if the application actually looks at all of its data.  If the
>> data is read using read(), then the L2 cache will be busted twice (or
>> a bit less with nontemporal copying), first by copying out the data
>> and then by looking at it.  If the data is read using mmap(), then the
>> L2 cache will only be busted once.  This effect has always been very
>> noticeable using dd.  Larger buffer sizes are also bad for latency.
> ...
> How can I reproduce that dd experiment? I have my system running with MAXPHYS
> of 512K and here is what I get:

I used a regular file with the same size as main memory (1G), and for
today's test, not quite dd, but a program that throws away the data
(so as to avoid the overhead of write syscalls) and prints status info
in a more suitable form than even dd's ^T.
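
The reading side of such a program can be as simple as the following
sketch (a minimal, hypothetical reconstruction, not the exact program
used for these numbers, which also printed periodic status):

/*
 * Read a file sequentially with a given block size, throw the data
 * away, and report the transfer rate at the end.
 */
#include <sys/types.h>
#include <sys/time.h>

#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	struct timeval start, end;
	char *buf;
	ssize_t n;
	size_t bsize;
	off_t total;
	double secs;
	int fd;

	if (argc != 3)
		errx(1, "usage: %s file bsize", argv[0]);
	bsize = strtoul(argv[2], NULL, 0);
	if ((buf = malloc(bsize)) == NULL)
		err(1, "malloc");
	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "open");
	gettimeofday(&start, NULL);
	total = 0;
	while ((n = read(fd, buf, bsize)) > 0)
		total += n;	/* data is simply thrown away */
	if (n == -1)
		err(1, "read");
	gettimeofday(&end, NULL);
	secs = (end.tv_sec - start.tv_sec) +
	    (end.tv_usec - start.tv_usec) / 1e6;
	printf("%jd bytes in %.3f secs (%.0f bytes/sec)\n",
	    (intmax_t)total, secs, total / secs);
	return (0);
}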

Your results show that physio() behaves quite differently from reading
a regular file.  I see similar behaviour for input from a disk file.

> # dd if=/dev/ada0 of=/dev/null bs=512k count=1000
> 1000+0 records in
> 1000+0 records out
> 524288000 bytes transferred in 2.471564 secs (212128024 bytes/sec)

512MB would be too small with buffering for a regular file, but should
be OK with a disk file.

> # dd if=/dev/ada0 of=/dev/null bs=256k count=2000
> 2000+0 records in
> 2000+0 records out
> 524288000 bytes transferred in 2.666643 secs (196609752 bytes/sec)
> # dd if=/dev/ada0 of=/dev/null bs=128k count=4000
> 4000+0 records in
> 4000+0 records out
> 524288000 bytes transferred in 2.759498 secs (189993969 bytes/sec)
> # dd if=/dev/ada0 of=/dev/null bs=64k count=8000
> 8000+0 records in
> 8000+0 records out
> 524288000 bytes transferred in 2.718900 secs (192830927 bytes/sec)
>
> CPU load instead grows from 10% at 512K to 15% at 64K. Maybe the thrashing
> effect will only be noticeable at block sizes comparable to the cache size,
> but modern CPUs have megabytes of cache.

I used systat -v to estimate the load.  Its average jumps around more than I
like, but I don't have anything better.  Sys time from dd and others is even
more useless than it used to be since lots of the i/o runs in threads and
the system doesn't know how to charge the application for thread time.

My results (MAXPHYS is 64K, transfer rate 50MB/S, under FreeBSD-~5.2
de-geomed):

regular file:

block size    %idle
----------    -----
1M            87
16K           91
4K            88 (?)
512           72 (?)

disk file:

block size    %idle
----------    -----
1M            96
64K           96
32K           93
16K           87
8K            82 (firmware can't keep up and rate drops to 37MB/S)

In the case of the regular file, almost all i/o is clustered so the driver
sees mainly the cluster size (driver max size of 64K before geom).  Upper
layers then do a good job of only adding a few percent CPU when declustering
to 16K fs-blocks.

In the case of the disk file, I can't explain why the overhead is so low
(~0.5% intr 3.5% sys) for large block sizes.  Uncached copies on the
test machine go at 850MB/S, so copying out 50MB/S should take 50/850 =
1/17 of the CPU, or about 5.9%.

Another difference with the disk file test is that physio() uses a single
pbuf so the test doesn't thrash the buffer cache's memory.  dd of a large
regular file will thrash the L2 cache even if the user buffer size is small,
but still goes faster with a smaller user buffer since the user buffer
stays cached.
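
To illustrate the read()-vs-mmap() point from the top of this message:
an mmap() reader touches the data in place, so it crosses the cache
once instead of twice.  A minimal sketch (hypothetical, not one of the
programs measured above):

/*
 * Map a file and touch one byte per page.  A real consumer would look
 * at all of the data, but one byte per page is enough to fault every
 * page in, and there is no copyout into a user buffer.
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	struct stat sb;
	char *p;
	volatile char sink;	/* keeps the loop from being optimized away */
	off_t i;
	long pgsz;
	int fd;

	if (argc != 2)
		errx(1, "usage: %s file", argv[0]);
	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "open");
	if (fstat(fd, &sb) == -1)
		err(1, "fstat");
	p = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	pgsz = getpagesize();
	sink = 0;
	for (i = 0; i < sb.st_size; i += pgsz)
		sink += p[i];
	munmap(p, (size_t)sb.st_size);
	close(fd);
	return (0);
}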

Faster disks will of course want larger block sizes.  I'm still surprised
that this makes more difference to CPU than throughput.  Maybe it doesn't
really, but the measurement becomes differently accurate when the CPU
becomes more loaded.  At 100% load there would be nowhere to hide things
like speculative cache fetches.

Bruce


