Date:      Tue, 26 Jun 2001 00:12:43 +1000 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Matt Dillon <dillon@earth.backplane.com>
Cc:        Peter Wemm <peter@wemm.org>, Mikhail Teterin <mi@aldan.algebra.com>, jlemon@FreeBSD.org, cvs-committers@FreeBSD.org, cvs-all@FreeBSD.org
Subject:   Re: kernel size w/ optimized bzero() & patch set (was Re: Inline optimized bzero (was Re: cvs commit: src/sys/netinet tcp_subr.c)) 
Message-ID:  <Pine.BSF.4.21.0106252337370.7918-100000@besplex.bde.org>
In-Reply-To: <200106250134.f5P1YsN01440@earth.backplane.com>

On Sun, 24 Jun 2001, Matt Dillon wrote:

[Peter Wemm wrote]
> :Just think.. This new ``improved'' bzero code can now fill up all 4K of L1
> :instruction cache on most of my systems, and most of my 8K L1 instruction
> :cache on >= coppermine cpus.  I'm impressed.  Those microbenchmarks had
> 
>     Huh?  Peter, you obviously haven't been listening.  I strongly recommend
>     that you review the last few postings I've made.  The suggested bzero
>     code certainly does NOT in any way blow up the L1 cache, and I think
>     I'm pretty clear on that.  I wouldn't be doing it if it did.

It was an intermediate version that blew up the cache.  I have been trying
slightly different versions, and found that gcc's builtin version doesn't
make all that much difference to the code size in either direction.  With
the following version of bzero:

#define	bzero(p, n) ({						\
	/* Inline only constant sizes up to the cutoff X. */	\
	if (__builtin_constant_p(n) && (n) <= X)		\
		__builtin_memset((p), 0, (n));			\
	else							\
		/* Parens defeat macro recursion. */		\
		(bzero)((p), (n));				\
})

for X = 0, 4, 8, 12, 16, 20, 32 and "infinity", the kernel sizes were:

   text	   data	    bss	    dec	    hex	filename
1962434	 151436	 349824	2463694	 2597ce	kernel.4
1962442	 151436	 349824	2463702	 2597d6	kernel.8
1962446	 151436	 349824	2463706	 2597da	kernel.12
1962466	 151436	 349824	2463726	 2597ee	kernel.0
1962802	 151436	 349824	2464062	 25993e	kernel.16
1962866	 151436	 349824	2464126	 25997e	kernel.20
1963538	 151436	 349824	2464798	 259c1e	kernel.32
1964098	 151436	 349824	2465358	 259e4e	kernel.infinity

Summary: it's hard for the inline version to be smaller; even when it
only needs to do a single store-immediate operation (the X = 4 kernel),
the kernel is only 32 bytes smaller than the X = 0 one, which always
calls the function and so must push 2 args, do the call, and clean up
the stack.  This is presumably due to increased register pressure in
the inlined versions.
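
To make the tradeoff concrete, here is a minimal sketch of the two
cases (hypothetical callers, assuming the macro above is in scope;
the exact expansion depends on the compiler and flags):

#include <sys/types.h>

char buf[16];

/*
 * Constant size: __builtin_constant_p(n) is true, so with X >= 16
 * gcc may expand this to a handful of inline store instructions.
 */
void
clearbuf(void)
{
	bzero(buf, sizeof(buf));
}

/*
 * Variable size: __builtin_constant_p(n) is false, so this always
 * pushes the two args, calls the real bzero() and cleans up.
 */
void
clearvar(char *p, size_t n)
{
	bzero(p, n);
}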

OTOH, the recent uninlining of the mbuf macros somehow reduced the
size of my standard kernel by more than 5% (more than 100K).  It also
reduced the compilation time by more than 10%.  Kernel compilation
times are still 65% larger than in RELENG_3 for kernels with essentially
the same options (this is using -current's compiler; they are 85%
larger using RELENG_3's compiler).
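
The mechanism is easy to see in miniature (a toy counter with
hypothetical names, not the actual mbuf code): the macro body is
duplicated at every call site, while the uninlined version costs one
copy of the body plus a small call sequence per site.

struct counter {
	int	c_count;
	int	c_overflows;
};

/* Old style: this whole body expanded at every call site. */
#define	COUNTER_BUMP(c) do {					\
	if (++(c)->c_count == 0)				\
		(c)->c_overflows++;				\
} while (0)

/* Uninlined: one copy of the body; each call site is just a call. */
void
counter_bump(struct counter *c)
{
	if (++c->c_count == 0)
		c->c_overflows++;
}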

> :better be damn good, because it may end up the only thing that the system
> :will do well now since all this excessive inlining looks like it is blowing
> :the L1 cache out the door.
> :
> :(I also apply the same complaint to the vm/* inlines).
> 
>     And you are just as wrong.  The few functions inlined in vm/* are inlined
>     mainly because (A) they are called with constant arguments, which means

Some seem to have rotted a bit, e.g., _vm_map_lock_upgrade():
adding an mtx_lock() to anything bloats it in both space and time.
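
For example (a hypothetical inline, not the actual vm_map code): once
the body takes a mutex, every call site expands to the whole locking
sequence as well.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

struct thing {
	struct mtx	t_mtx;
	int		t_state;
};

/*
 * The mtx_lock()/mtx_unlock() pair is duplicated at every call
 * site, so a former one-liner now bloats each of its callers.
 */
static __inline void
thing_upgrade(struct thing *t)
{
	mtx_lock(&t->t_mtx);
	t->t_state++;
	mtx_unlock(&t->t_mtx);
}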

Bruce

