Date:      Mon, 22 Mar 2010 10:27:15 -0600
From:      Scott Long <scottl@samsco.org>
To:        Alexander Sack <pisymbol@gmail.com>
Cc:        FreeBSD-Current <freebsd-current@freebsd.org>, freebsd-arch@freebsd.org
Subject:   Re: Increasing MAXPHYS
Message-ID:  <50456989-F196-4907-A170-85806A73D25F@samsco.org>
In-Reply-To: <3c0b01821003220852r61ca0ae3o95bea1c23ddc34d9@mail.gmail.com>
References:  <4BA4E7A9.3070502@FreeBSD.org> <4BA6517C.3050509@FreeBSD.org> <20100322124018.7430f45e@ernst.jennejohn.org> <201003220839.12907.jhb@freebsd.org> <3c0b01821003220852r61ca0ae3o95bea1c23ddc34d9@mail.gmail.com>

On Mar 22, 2010, at 9:52 AM, Alexander Sack wrote:
> On Mon, Mar 22, 2010 at 8:39 AM, John Baldwin <jhb@freebsd.org> wrote:
>> On Monday 22 March 2010 7:40:18 am Gary Jennejohn wrote:
>>> On Sun, 21 Mar 2010 19:03:56 +0200
>>> Alexander Motin <mav@FreeBSD.org> wrote:
>>>
>>>> Scott Long wrote:
>>>>> Are there non-CAM drivers that look at MAXPHYS, or that silently
>>>>> assume that MAXPHYS will never be more than 128k?
>>>>
>>>> That is a question.
>>>>
>>>
>>> I only did a quick&dirty grep looking for MAXPHYS in /sys.
>>>
>>> Some drivers redefine MAXPHYS to be 512KiB.  Some use their own local
>>> MAXPHYS which is usually 128KiB.
>>>
>>> Some look at MAXPHYS to figure out other things; the details escape me.
>>>
>>> There's one driver which actually uses 100*MAXPHYS for something, but I
>>> didn't check the details.
>>>
>>> Lots of them were non-CAM drivers AFAICT.
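
The redefinition pattern Gary describes usually looks something like the
sketch below (schematic, not copied from any particular driver; the FOO_
name is hypothetical):

/* Variant 1: outright override of the kernel constant. */
#undef MAXPHYS
#define MAXPHYS		(512 * 1024)	/* this driver "knows better" */

/* Variant 2: a private limit that silently mirrors the old 128k default. */
#define FOO_MAXIO	(128 * 1024)

Variant 2 is the subtler one for this thread: unless the driver also
advertises that limit upward (e.g. via its DMA tag or maxio), raising
MAXPHYS can hand it transfers larger than it was built for.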
>>
>> The problem is the drivers that _don't_ reference MAXPHYS.  The driver
>> author at the time "knew" that MAXPHYS was 128k, so he did the
>> MAXPHYS-dependent calculation and just put the result in the driver
>> (e.g. only supporting up to 32 segments (32 4k pages == 128k) in a bus
>> dma tag, passing the magic number to bus_dma_tag_create() without
>> documenting that the '32' was derived from 128k or what the actual
>> hardware limit on nsegments is).  These cannot be found by a simple
>> grep; they require manually inspecting each driver.
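
To make John's example concrete, here is a minimal sketch of the fix (the
driver, softc, and FOO_HW_MAX_SEGS limit are hypothetical; bus_dma_tag_create()
and its argument order are the stock busdma API).  The anti-pattern is passing
a literal '32' as nsegments; the sketch derives it and records the real
hardware bound instead:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bus.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <machine/bus.h>

#define	FOO_HW_MAX_SEGS	128	/* hypothetical: the chip's true S/G limit */

struct foo_softc {
	struct mtx	sc_mtx;
	bus_dma_tag_t	sc_buf_dmat;
};

static int
foo_dma_setup(device_t dev, struct foo_softc *sc)
{
	/*
	 * Derive nsegments from MAXPHYS instead of hard-coding '32'
	 * (32 x 4k pages == 128k); the "+ 1" covers a buffer that is
	 * not page-aligned, and the clamp documents the hardware limit.
	 */
	int nsegs = min(howmany(MAXPHYS, PAGE_SIZE) + 1, FOO_HW_MAX_SEGS);

	return (bus_dma_tag_create(bus_get_dma_tag(dev),
	    1, 0,			/* alignment, boundary */
	    BUS_SPACE_MAXADDR,		/* lowaddr */
	    BUS_SPACE_MAXADDR,		/* highaddr */
	    NULL, NULL,			/* filter, filterarg */
	    MAXPHYS,			/* maxsize */
	    nsegs,			/* nsegments, no magic '32' */
	    BUS_SPACE_MAXSIZE_32BIT,	/* maxsegsz */
	    0,				/* flags */
	    busdma_lock_mutex, &sc->sc_mtx,
	    &sc->sc_buf_dmat));
}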
>
> 100% awesome comment.  On another kernel I was guilty of this crime
> myself (though I did have a nice comment above the #define).
>
> This has been a great thread, since our application really needs some
> of the optimizations being thrown around here.  We have found in
> real-life performance testing that we are almost always either
> controller bound (adding more disks to spread IOPs has little to no
> effect on throughput in large array configurations; we suspect we are
> hitting the RAID controller's firmware limitations) or tps bound.
> Either way, I never thought going from 128k -> 256k per transaction
> would have a dramatic effect on throughput (but I never verified).
>
> Back to HBAs: AFAIK, every modern iteration of the most popular HBAs
> can easily do way more than a 128k scatter/gather I/O.  Do you guys
> know of any *modern* HBA (from within the last 3-4 years) that cannot
> do more than 128k at a shot?

I/O larger than 64K is broken in MPT at the moment.  The hardware can do
it, and the driver thinks it can do it, but it fails.  AAC hardware
traditionally cannot, but maybe the firmware has been improved in the
past few years.  I know that there are other low-performance devices
that can't do more than 64 or 128K, but none are coming to mind at the
moment.  Still, it shouldn't be a universal assumption that all hardware
can do big I/O's.

Another consideration is that some hardware can do big I/O's, but not
very efficiently.  Not all DMA engines are created equal, and moving to
compound commands and excessively long S/G lists can be a pessimization.
For example, MFI hardware does a hinted prefetch on the segment list,
but once you exceed a certain limit, that prefetch doesn't work anymore
and the firmware has to take the slow path to execute the I/O.  I
haven't quantified this penalty yet, but it's something that should be
thought about.
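
If that penalty turns out to matter, a driver can advertise a smaller
per-command limit rather than blindly exporting MAXPHYS.  A minimal sketch
(FOO_FAST_SGES and its value are assumptions, not a measured MFI number;
cpi->maxio is the stock CAM field that upper layers use to bound request
size):

#include <sys/param.h>
#include <sys/systm.h>
#include <cam/cam.h>
#include <cam/cam_ccb.h>

#define	FOO_FAST_SGES	64	/* assumed firmware prefetch sweet spot */

/*
 * Cap the advertised I/O size at what the firmware can prefetch
 * efficiently, even if MAXPHYS would allow more; the disk layer
 * picks this up and sizes requests accordingly.
 */
static void
foo_fill_pathinq(struct ccb_pathinq *cpi)
{
	cpi->maxio = min(MAXPHYS, FOO_FAST_SGES * PAGE_SIZE);
}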

>
> In other words, I've always thought the limit was kernel-imposed and
> not what the memory controller on the card can do (I certainly never
> got the impression, talking with some of the IHVs over the years, that
> they were designing their hardware for a 128k limit - I sure hope
> not!).

You'd be surprised at the engineering compromises and handicaps that are
committed at IHVs because of misguided marketers.

Scott
