Date:      Tue, 26 Jun 2001 00:58:43 +1000 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Matt Dillon <dillon@earth.backplane.com>
Cc:        Mikhail Teterin <mi@aldan.algebra.com>, jlemon@FreeBSD.ORG, cvs-committers@FreeBSD.ORG, cvs-all@FreeBSD.ORG
Subject:   Re: Inline optimized bzero (was Re: cvs commit: src/sys/netinet tcp_subr.c)
Message-ID:  <Pine.BSF.4.21.0106260024430.8175-100000@besplex.bde.org>
In-Reply-To: <200106241549.f5OFn6J78347@earth.backplane.com>

On Sun, 24 Jun 2001, Matt Dillon wrote:

> :I benchmarked the following version using lmbench2:
> :
> :#define	bzero(p, n) ({						\
> :	if (__builtin_constant_p(n) && (n) <= 16)		\
> :		__builtin_memset((p), 0, (n));			\
> :	else							\
> :		(bzero)((p), (n));				\
> :})
> :
> :The results were uninteresting: essentially no change.  lmbench2 is a
> :micro-benchmark, so it tends to show larger improvements for micro-
> :optimizations than can be expected in normal use.
> 
>     I wouldn't expect lmbench to be useful here.

I would expect the opposite.  If the bzero()s in the networking code don't
show up in the network latency benchmarks, where would they show up?  ISTR
a Linux hacker who made lmbench1 go faster for Linux saying that the
bzero() at the start of the FreeBSD tcp_input() is a really stupid thing
to do.  But I think even completely eliminating it would be just another
micro-optimization, worth 1% in favourable cases, so you need 10 more like
it to give a useful speedup.
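
For anyone who wants to see which path a given call takes, here is a
userland sketch of the macro quoted above.  The names bzero_sketch,
bzero_outofline and outofline_calls are made up for illustration (the
real macro wraps the kernel bzero()), and the counter exists only so
that the dispatch can be observed.  It relies on gcc's statement
expressions and __builtin_constant_p(), so it is gcc/clang-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counter: how often the out-of-line path was taken. */
static unsigned long outofline_calls;

/* Stand-in for the real out-of-line bzero(); not the kernel routine. */
static void
bzero_outofline(void *p, size_t n)
{
	assert(p != NULL || n == 0);
	outofline_calls++;
	memset(p, 0, n);
}

/*
 * Same shape as the macro under discussion: sizes that are compile-time
 * constants and <= 16 go to the builtin, everything else is a call.
 */
#define	bzero_sketch(p, n) ({					\
	if (__builtin_constant_p(n) && (n) <= 16)		\
		__builtin_memset((p), 0, (n));			\
	else							\
		bzero_outofline((p), (n));			\
})
```

A call like bzero_sketch(buf, 16) should never bump the counter, while
a size read from a volatile variable always should, since
__builtin_constant_p() cannot see through it.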

> :One point that I noticed after writing my original reply: the gcc
> :builtins depend on misaligned accesses not trapping.  This is reasonable
> :on i386's, although it is broken if alignment checking is enabled
> :(but other things are broken, e.g., copying of structs essentially
> :uses the builtin memcpy and does misaligned copies for some structs
> 
>     I added an alignment check to my bzerol() inline and it blew it up...
>     it added 6ns to the loop, which is fine, but it blew up the constant
>     optimization and wound up adding a switch table and a dozen
>     instructions inline (hundreds of bytes!).

Yes, it's clear that alignment is not worth doing in the kernel.  Userland
is different -- the application might have turned on alignment checking,
or it might be poorly behaved and pass a lot of unaligned buffers.  gcc
is primarily a userland compiler, so it's a little surprising that its
builtins don't worry about alignment.
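
To make "alignment checking" concrete, the following is a rough sketch
of a bzero that never issues a misaligned word store.  It is not the
check that was added to bzerol() or i586_bzero(); the names
ptr_is_aligned and bzero_aligned_sketch are hypothetical.  It clears
bytes until the pointer is 4-byte aligned, then whole words, then the
tail.  The extra loop and branches are the sort of overhead being
measured above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if p is a multiple of align; align must be a power of two. */
static int
ptr_is_aligned(const void *p, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (((uintptr_t)p & (align - 1)) == 0);
}

static void
bzero_aligned_sketch(void *p, size_t n)
{
	unsigned char *cp = p;
	const uint32_t zero = 0;

	/* Head: byte stores until cp reaches a 4-byte boundary. */
	while (n > 0 && !ptr_is_aligned(cp, sizeof(zero))) {
		*cp++ = 0;
		n--;
	}
	/*
	 * Body: aligned 4-byte stores.  memcpy() of a fixed size keeps
	 * this free of aliasing problems; compilers normally turn it
	 * into a single store.
	 */
	while (n >= sizeof(zero)) {
		memcpy(cp, &zero, sizeof(zero));
		cp += sizeof(zero);
		n -= sizeof(zero);
	}
	/* Tail: the remaining 0-3 bytes. */
	while (n > 0) {
		*cp++ = 0;
		n--;
	}
}
```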

>     I added alignment checks to i586_bzero but it ate 20nS.  Also,
>     it should be noted that i586_bzero() as it currently stands does not
>     do any alignment checks either - it checks only the size argument,
>     it doesn't check the base pointer.

Neither does generic_bzero().  i586_bzero() just turns itself into
generic_bzero() for small sizes.  I'm fairly sure that I benchmarked
this, and came to the conclusion that there is nothing significantly
better than "rep stosl" when the size isn't known at compile time.  In
particular, lots of jumps as in i486_bzero() are actively bad.  This may
be P5-specific (branch prediction is not very good on original Pentiums).
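
For reference, the jump-free string-store approach looks roughly like
this as gcc inline asm: "rep stosl" for the whole words, then
"rep stosb" for the 0-3 leftover bytes.  This is a userland sketch
from memory, not a copy of generic_bzero(); the name bzero_stos_sketch
is hypothetical, the direction flag is assumed to be clear as the ABI
requires, and on non-x86 targets it just falls back to memset().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static void
bzero_stos_sketch(void *p, size_t n)
{
	assert(p != NULL || n == 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	size_t words = n >> 2;		/* number of 4-byte stores */
	size_t tail = n & 3;		/* leftover bytes */

	/* Store %eax (zero) to (%edi), count in %ecx; no branches. */
	__asm__ volatile("rep stosl"
	    : "+D" (p), "+c" (words)
	    : "a" (0)
	    : "memory");
	/* p now points just past the words; finish byte by byte. */
	__asm__ volatile("rep stosb"
	    : "+D" (p), "+c" (tail)
	    : "a" (0)
	    : "memory");
#else
	/* Assumption: on other CPUs just use the library routine. */
	memset(p, 0, n);
#endif
}
```

Note that nothing here looks at the alignment of p; misaligned word
stores simply work on i386 unless alignment checking is enabled.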

Bruce





