Date:      Sat, 20 Mar 2010 12:17:33 -0600
From:      Scott Long <scottl@samsco.org>
To:        Matthew Dillon <dillon@apollo.backplane.com>
Cc:        Alexander Motin <mav@freebsd.org>, FreeBSD-Current <freebsd-current@freebsd.org>, freebsd-arch@freebsd.org
Subject:   Re: Increasing MAXPHYS
Message-ID:  <891E2580-8DE3-4B82-81C4-F2C07735A854@samsco.org>
In-Reply-To: <201003201753.o2KHrH5x003946@apollo.backplane.com>
References:  <4BA4E7A9.3070502@FreeBSD.org> <201003201753.o2KHrH5x003946@apollo.backplane.com>

On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote:
>
> :All above I have successfully tested last months with MAXPHYS of 1MB on
> :i386 and amd64 platforms.
> :
> :So my questions are:
> :- does somebody know any issues denying increasing MAXPHYS in HEAD?
> :- are there any specific opinions about value? 512K, 1MB, MD?
> :
> :-- 
> :Alexander Motin
>
>    (nswbuf * MAXPHYS) of KVM is reserved for pbufs, so on i386 you
>    might hit up against KVM exhaustion issues in unrelated subsystems.
>    nswbuf typically maxes out at around 256.  For i386 1MB is probably
>    too large (256M of reserved KVM is a lot for i386).  On amd64 there
>    shouldn't be a problem.
>

Yes, this needs to be addressed.  I've never gotten a clear answer from
VM people like Peter Wemm and Alan Cox on what should be done.
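
For concreteness, the reservation Matt describes works out roughly as follows.  This is a back-of-the-envelope sketch, not kernel code; the 256 figure is the typical nswbuf cap he mentions, not a measured value:

    /* Rough arithmetic only: KVA reserved for pbufs scales as
       nswbuf * MAXPHYS. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long nswbuf  = 256;              /* typical cap, per above */
        const unsigned long maxphys = 1024UL * 1024UL;  /* proposed 1MB */

        printf("pbuf KVA reservation: %lu MB\n", (nswbuf * maxphys) >> 20);
        return 0;
    }

That comes out to 256MB of reserved KVA, versus 32MB with today's 128K MAXPHYS, which is why this is a non-issue on amd64 but a real constraint on i386.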

>    Diminishing returns get hit pretty quickly with larger MAXPHYS values.
>    As long as the I/O can be pipelined the reduced transaction rate
>    becomes less interesting when the transaction rate is less than a
>    certain level.  Off the cuff I'd say 2000 tps is a good basis for
>    considering whether it is an issue or not.  256K is actually quite
>    a reasonable value.  Even 128K is reasonable.
>

I agree completely.  I did quite a bit of testing on this in 2008 and 2009.
I even added some hooks into CAM to support this, and I thought that I had
discussed this extensively with Alexander at the time.  Guess it was yet another
wasted conversation with him =-(  I'll repeat it here for the record.

What I call the silly-i/o-test, filling a disk up with the dd command, yields
performance improvements up to a MAXPHYS of 512K.  Beyond that the gain is
negligible, and it actually starts running into contention on the VM page
queues lock.  There is some work underway to break down this lock, so it's
worth revisiting in the future.

For the non-silly-i/o-test, where I do real file i/o using various sequential and
random patterns, there was a modest improvement up to 256K, and a slight
improvement up to 512K.  This surprised me as I figured that most filesystem
i/o would be in UFS block-sized chunks.  Then I realized that the UFS clustering
code was actually taking advantage of the larger I/O's.  The improvement really
depends on the workload, of course, and I wouldn't expect it to be noticeable
for most people unless they're running something like a media server.
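
To put rough numbers on the clustering effect (the 16K UFS block size below is my assumption for illustration, not the configuration used in those tests):

    /* Illustrative only: how many UFS blocks the cluster code can
       coalesce into one physical transfer at a given MAXPHYS. */
    #define HYPOTHETICAL_UFS_BSIZE  (16 * 1024)
    #define BLOCKS_PER_IO(maxphys)  ((maxphys) / HYPOTHETICAL_UFS_BSIZE)
    /* 128K -> 8 blocks per transfer, 256K -> 16, 512K -> 32 */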

Besides the nswbuf sizing problem, there is a real problem that a lot of drivers
have incorrectly assumed over the years that MAXPHYS and DFLTPHYS are
particular values, and they've sized their data structures accordingly.  Before
these values are changed, an audit needs to be done OF EVERY SINGLE
STORAGE DRIVER.  No exceptions.  This isn't a case of changing MAXPHYS
in the ata driver, testing that your machine boots, and then committing the
change to source control.  Some drivers will have non-obvious restrictions
based on the number of SG elements allowed in a particular command format.
MPT comes to mind (its multi-message SG code seems to be broken when I tried
testing large MAXPHYS on it), but I bet that there are others.
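
To illustrate the kind of restriction I mean (the names and segment counts below are hypothetical, not taken from MPT or any other real driver):

    /* Sketch of the SG-list sizing pitfall: a driver that hard-codes its
       scatter/gather list for a 128K MAXPHYS silently runs out of segments
       when MAXPHYS grows, unless it chains additional SG messages. */
    #include <sys/param.h>      /* MAXPHYS, PAGE_SIZE, howmany() */

    #define HYPOTHETICAL_SGES_PER_CMD  33   /* 128K / 4K pages, +1 for misalignment */

    /* Worst-case segments needed for one maximal transfer: */
    #define SGES_NEEDED  (howmany(MAXPHYS, PAGE_SIZE) + 1)

    #if SGES_NEEDED > HYPOTHETICAL_SGES_PER_CMD
    #error "SG list too small for MAXPHYS; command format needs SG chaining"
    #endif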

Windows has a MAXPHYS equivalent of 1M.  Linux has an equivalent of an
odd number less than 512k.  For the purpose of benchmarking against these
OS's, having comparable capabilities is essential; Linux easily beats FreeBSD
in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD
typically stomps Linux in real I/O because of vastly better latency and caching
algorithms).  I'm fine with raising MAXPHYS in production once the problems
are addressed.


>    Nearly all the issues I've come up against in the last few years have
>    been related more to pipeline algorithms breaking down and less with
>    I/O size.  The cluster_read() code is especially vulnerable to
>    algorithmic breakdowns when fast media (such as a SSD) is involved.
>    e.g.  I/Os queued from the previous cluster op can create stall
>    conditions in subsequent cluster ops before they can issue new I/Os
>    to keep the pipeline hot.
>

Yes, this is another very good point.  It's time to start really figuring out
what SSD means for FreeBSD I/O.
means for FreeBSD I/O.

Scott
