Date:      Wed, 15 Oct 2008 04:08:24 -0600
From:      Scott Long <scottl@samsco.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        svn-src-head@FreeBSD.org, Marius Strobl <marius@FreeBSD.org>, src-committers@FreeBSD.org, svn-src-all@FreeBSD.org
Subject:   Re: svn commit: r183896 - head/sys/dev/bge
Message-ID:  <48F5C118.3040209@samsco.org>
In-Reply-To: <20081015184833.N43215@delplex.bde.org>
References:  <200810142028.m9EKShoL015514@svn.freebsd.org> <48F5053D.7070705@samsco.org> <20081015184833.N43215@delplex.bde.org>

Bruce Evans wrote:
> On Tue, 14 Oct 2008, Scott Long wrote:
> 
>> Marius Strobl wrote:
>>> Author: marius
>>> Date: Tue Oct 14 20:28:42 2008
>>> New Revision: 183896
>>> URL: http://svn.freebsd.org/changeset/base/183896
>>>
>>> Log:
>>>   Use bus_{read,write}_4(9) instead of bus_space_{read,write}_4(9)
>>>   in order to get rid of the bus space handle and tag in the softc.
>>
>> Has anyone looked at the generated code from this interface switch and
>> compared it to what was getting generated previously?  Way back when,
> 
> I just looked at the source code.  This seems to be only a small
> pessimization since it involves just one extra indirection and
> related loss of optimization possibilities: (sc->sc_bt, sc->sc_bh)
> becomes r = sc->sc_br; (r->r_bustag, r->r_bushandle).  In theory,
> the compiler could optimize by caching the tag and handle in registers
> in either case so that only 1 extra indirection is needed per function,
> but I've never seen that being done enough to make much difference.
> 
> However, some drivers (e.g., ata) were too stupid to cache sc_bt and
> sc_bh, and converting these to use bus_nonspace*() was a relatively large
> optimization: r = sc->sc_br; (rman_get_bustag(r), rman_get_bushandle(r))
> became r = sc->sc_br; (r->r_bustag, r->r_bushandle).  Since
> rman_get_bustag() and rman_get_bushandle() have never been inline,
> calling both of them
> on every bus space access gave enormous code space and instruction
> count bloat and corresponding (relatively tiny) runtime bloat.  The
> instructions normally run so much faster than the i/o that you can do
> hundreds or thousands of them per i/o before noticing the runtime
> bloat, and the instruction count bloat here is only about a factor of 20.

It'll be a long time before disk i/o is fast enough to saturate the CPU
to the point where this issue matters.  Even the fastest SAS hardware
that I've written a driver for struggles to get past 100,000
transactions/sec before saturating the attached disks, leaving plenty
of CPU to spare.
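
For reference, the three access patterns being compared look roughly
like this (a sketch only; "foo_softc", FOO_CSR and foo_write_csr() are
hypothetical, though the sc_bt/sc_bh/sc_br field names follow Bruce's
example):

/*
 * Sketch, not real driver code.  A real driver would pull in
 * <sys/param.h>, <sys/bus.h>, <sys/rman.h> and <machine/bus.h>.
 */
#define FOO_CSR		0x10		/* made-up register offset */

struct foo_softc {
	bus_space_tag_t		sc_bt;	/* old style: cached tag */
	bus_space_handle_t	sc_bh;	/* old style: cached handle */
	struct resource		*sc_br;	/* new style: just the resource */
};

static void
foo_write_csr(struct foo_softc *sc, uint32_t val)
{
	/* 1. bus_space with the tag/handle cached in the softc: */
	bus_space_write_4(sc->sc_bt, sc->sc_bh, FOO_CSR, val);

	/* 2. bus_write_4() as in r183896: one extra indirection, since
	 *    the tag and handle are fetched through sc->sc_br: */
	bus_write_4(sc->sc_br, FOO_CSR, val);

	/* 3. The pattern Bruce objects to: two non-inline rman_get_*()
	 *    calls on every single register access: */
	bus_space_write_4(rman_get_bustag(sc->sc_br),
	    rman_get_bushandle(sc->sc_br), FOO_CSR, val);
}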

> 
>> including <machine/bus_memio.h> made bus_space_read|write_4() compile
>> into a direct memory access on machines that supported it.  The dubious
>> removal of bus_memio.h and bus_pio.h took away that benefit, and I'm
>> afraid that it's only getting worse now.  Bus writes to card memory are
>> still very important to high-performance devices and shouldn't be
>> pessimized in the name of simpler-looking C code.
> 
> I hate the bloat too, but it rarely matters (see above).  I mainly noticed
> it using ddb.  Where nice drivers are optimized to use a single instruction
> per i/o (or perhaps 2 with minimal bus space bloat), the ones that use
> rman_get*() on every i/o take 20+, and with a primitive debugger like ddb
> it is painful to even skip over all the function calls.

It's really the opportunity for gratuitous D-cache line misses that
worries me, and to a lesser extent the I-cache bloat, though that's
probably minimal.  In the past I've been able to demonstrate
considerably better micro-performance, and modestly better
macro-performance, in the bge and em drivers by optimizing cache
efficiency in busdma.

> 
> Which devices have fast enough i/o for the extra indirection to matter?
> bge tries hard to avoid all PCI access in time-critical code.  I think
> we reduced the PCI accesses to 1 PCI write per interrupt.  Most device
> accesses for bge use host memory which is DMAed to/from by the hardware,
> so bus space doesn't apply.  This type of access seems to be the best
> or only way for a PCI device to go fast enough (though it is still too
> slow on at least i386 since the DMA always causes cache misses).
> 
> Bruce

I'm thinking of 10Gb ethernet drivers, which have already been shown to
saturate the host CPU long before wire bandwidth is filled.  I'm only
familiar with cxgb and mxge; cxgb does a single bus write per
transaction, while mxge cheats and aliases the card memory registers via
rman_get_virtual(), making single-instruction accesses automatic =-)
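
Roughly, that shortcut looks like the following (a sketch, not mxge's
actual code; the softc field, FOO_DOORBELL and foo_ring_doorbell() are
made up).  It only works for memory-mapped BARs, and it bypasses the
bus_space barrier/byte-order machinery, so the driver has to take care
of those itself:

#define FOO_DOORBELL	0x40		/* made-up register offset */

static void
foo_ring_doorbell(struct foo_softc *sc, uint32_t val)
{
	volatile uint32_t *regs;

	/* Virtual address of the mapped BAR; a real driver would cache
	 * this once at attach time rather than look it up per call: */
	regs = (volatile uint32_t *)rman_get_virtual(sc->sc_mem);

	/* The register access is now a plain store -- typically a
	 * single instruction on x86: */
	regs[FOO_DOORBELL / sizeof(uint32_t)] = htole32(val);
}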

Scott


