Date:      Sat, 24 Dec 2011 09:37:53 +0000
From:      Alexander Best <arundel@freebsd.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-current@freebsd.org, freebsd-arch@freebsd.org
Subject:   Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?
Message-ID:  <20111224093753.GA12377@freebsd.org>
In-Reply-To: <20111224160050.T1141@besplex.bde.org>
References:  <20111223235642.GA37495@freebsd.org> <20111224160050.T1141@besplex.bde.org>

--Kj7319i9nmIyA2yE
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Sat Dec 24 11, Bruce Evans wrote:
> On Fri, 23 Dec 2011, Alexander Best wrote:
> 
> >is -mpreferred-stack-boundary=2 really necessary for i386 builds any 
> >longer?
> >i built GENERIC (including modules) with and without that flag. the results
> >are:
> 
> The same as it has always been.  It avoids some bloat.
> 
> >1654496	bytes with the flag set
> >vs.
> >1654952	bytes with the flag unset
> 
> I don't believe this.  GENERIC is enormously bloated, so it has size
> more like 16MB than 1.6MB.  Even a savings of 4K instead of 456 bytes
> is hard to believe.  I get a savings of 9K (text) in a 5MB kernel.
> Changing the default target arch from i386 to pentium-undocumented has
> reduced the text space savings a little, since the default for passing
> args is now to preallocate stack space for them and store to this,
> instead of to push them; this preallocation results in more functions
> needing to allocate some stack space explicitly, and when some is
> allocated explicitly, the text space cost for this doesn't depend on
> the size of the allocation.
> 
> Anyway, the savings are mostly from avoiding cache misses from
> sparse allocation on stacks.
> 
> Also, FreeBSD-i386 hasn't been programmed to support aligned stacks:
> - KSTACK_PAGES on i386 is 2, while on amd64 it is 4.  Using more
>   stack might push something over the edge
> - not much care is taken to align the initial stack or to keep the
>   stack aligned in calls from asm code.  E.g., any alignment for
>   mi_startup() (and thus proc0?) is accidental.  This may result
>   in perfect alignment or perfect misalignment.  Hopefully, more
>   care is taken with thread startup.  For gcc, the alignment is
>   done bogusly in main() in userland, but there is no main() in
>   the kernel.  The alignment doesn't matter much (provided the
>   perfect misalignment is still to a multiple of 4), but when it
>   matters, the random misalignment that results from not trying to
>   do it at all is better than perfect misalignment from getting it
>   wrong.  With 4-byte alignment, the only cases that it helps are
>   with 64-bit variables.
> 
> >the gcc(1) man page states the following:
> >
> >"
> >This extra alignment does consume extra stack space, and generally
> >increases code size.  Code that is sensitive to stack space usage,
> >such as embedded systems and operating system kernels, may want to
> >reduce the preferred alignment to -mpreferred-stack-boundary=2.
> >"
> >
> >the comment in sys/conf/kern.mk however sorta suggests that the default
> >alignment of 4 bytes might improve performance.
> 
> The default stack alignment is 16 bytes, which unimproves performance.

maybe the part of the comment in sys/conf/kern.mk which mentions that a stack
alignment of 16 bytes might improve micro benchmark results should be removed.
this would prevent people (like me) from thinking that a stack alignment of
4 bytes is a compromise between size and efficiency. it isn't! currently a
stack alignment of 16 bytes has no advantages over one of 4 bytes on i386,
so specifying -mpreferred-stack-boundary=2 on i386 is absolutely mandatory.
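
for illustration only, here is a rough sketch of the arithmetic behind the
flag. -mpreferred-stack-boundary=N requests an alignment of 2^N bytes, so
N=2 gives 4 bytes and N=4 gives 16 bytes. the helper names (boundary_bytes,
round_up, frame_padding) are made up for this example and do not exist in the
tree, and the model is simplified: it only rounds a frame size up and ignores
the return address, saved registers and whatever else the compiler does.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: -mpreferred-stack-boundary=N asks for an
 * alignment of 2^N bytes.
 */
static size_t
boundary_bytes(unsigned n)
{
	assert(n < 8 * sizeof(size_t));
	return ((size_t)1 << n);
}

/* Round size up to the next multiple of align (a power of two). */
static size_t
round_up(size_t size, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((size + align - 1) & ~(align - 1));
}

/*
 * Simplified model: padding bytes added when a frame of the given
 * size is rounded up to the boundary selected by N.
 */
static size_t
frame_padding(size_t size, unsigned n)
{
	return (round_up(size, boundary_bytes(n)) - size);
}
```

in this model a 20 byte frame needs no padding with N=2 and 12 bytes of
padding with N=4, which is the worst case for a frame that is already a
multiple of 4 bytes.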

please see the attached patch, which also introduces a line break in order to
describe the stack alignment issue in a paragraph of its own.

cheers.
alex

> 
> clang handles stack alignment correctly (only does it when it is needed)
> so it doesn't need a -mpreferred-stack-boundary option and doesn't
> always break without alignment in main().  Well, at least it used to,
> IIRC.  Testing it now shows that it does the necessary andl of the
> stack pointer for __aligned(32), but for __aligned(16) it now assumes
> that the stack is aligned by the caller.  So it now needs
> -mpreferred-stack-boundary=2, but doesn't have it.  OTOH, clang doesn't
> do the andl in main() like gcc does (unless you put a dummy __aligned(32)
> there), but requires crt to pass an aligned stack.
> 
> Bruce
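
a small sketch of the over-aligned local case described above, as an
assumption-laden illustration rather than a reproduction of Bruce's test. it
uses the raw __attribute__((aligned(32))) spelling instead of the __aligned()
macro from sys/cdefs.h so that it builds outside the tree, and the function
names are made up for the example. with gcc or clang the compiler is expected
to realign the stack itself for such a variable, whatever alignment the caller
provided.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the address of a 32-byte-aligned local
 * modulo 32.  The compiler has to realign the stack in the prologue
 * (or otherwise guarantee the alignment) for this to be 0.
 */
static uintptr_t
local_misalignment(void)
{
	volatile char buf[32] __attribute__((aligned(32)));

	buf[0] = 0;
	return ((uintptr_t)&buf[0] % 32);
}

/* Abort if the over-aligned local did not end up 32-byte aligned. */
static void
check_aligned_local(void)
{
	assert(local_misalignment() == 0);
}
```

compiling this with -S and looking for an andl (or andq on amd64) on the stack
pointer in local_misalignment() is one way to see whether the realignment is
emitted.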

--Kj7319i9nmIyA2yE
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="kern.mk.diff"

Index: /usr/src/sys/conf/kern.mk
===================================================================
--- /usr/src/sys/conf/kern.mk	(revision 228845)
+++ /usr/src/sys/conf/kern.mk	(working copy)
@@ -30,12 +30,12 @@
 # On i386, do not align the stack to 16-byte boundaries.  Otherwise GCC 2.95
 # and above adds code to the entry and exit point of every function to align the
 # stack to 16-byte boundaries -- thus wasting approximately 12 bytes of stack
-# per function call.  While the 16-byte alignment may benefit micro benchmarks,
-# it is probably an overall loss as it makes the code bigger (less efficient
-# use of code cache tag lines) and uses more stack (less efficient use of data
-# cache tag lines).  Explicitly prohibit the use of FPU, SSE and other SIMD
-# operations inside the kernel itself.  These operations are exclusively
-# reserved for user applications.
+# per function call.  This makes the code bigger (less efficient use of code
+# cache tag lines) and uses more stack (less efficient use of data cache tag
+# lines).
+# Explicitly prohibit the use of FPU, SSE and other SIMD operations inside the
+# kernel itself.  These operations are exclusively reserved for user
+# applications.
 #
 # gcc:
 # Setting -mno-mmx implies -mno-3dnow

--Kj7319i9nmIyA2yE--