Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 4 Nov 2014 06:03:29 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Hans Petter Selasky <hps@selasky.org>, Mateusz Guzik <mjguzik@gmail.com>, Mateusz Guzik <mjg@freebsd.org>, jmallett@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, svn-src-head@freebsd.org, Julian Elischer <julian@freebsd.org>
Subject:   Re: svn commit: r274017 - head/sys/kern
Message-ID:  <20141104045159.E1605@besplex.bde.org>
In-Reply-To: <20141103100847.GK53947@kib.kiev.ua>
References:  <201411030746.sA37kpPu037113@svn.freebsd.org> <54573AEE.9010602@freebsd.org> <54573B87.7000801@freebsd.org> <54573CD2.1000702@selasky.org> <20141103092132.GH29497@dft-labs.eu> <20141103100847.GK53947@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 3 Nov 2014, Konstantin Belousov wrote:

> On Mon, Nov 03, 2014 at 10:21:32AM +0100, Mateusz Guzik wrote:
>> On Mon, Nov 03, 2014 at 09:29:06AM +0100, Hans Petter Selasky wrote:
>>> On 11/03/14 09:23, Julian Elischer wrote:
>>>> On 11/3/14, 4:21 PM, Julian Elischer wrote:
>>>>> On 11/3/14, 3:46 PM, Mateusz Guzik wrote:
>>>>>> Author: mjg
>>>>>> Date: Mon Nov  3 07:46:51 2014
>>>>>> New Revision: 274017
>>>>>> URL: https://svnweb.freebsd.org/changeset/base/274017
>>>>>>
>>>>>> Log:
>>>>>>   Provide an on-stack temporary buffer for small ioctl requests.
>>>>> I'm not sure I like this. We don't know how many more levels
>>>>> of stack may be needed.
>>>>> I know that machines are getting faster with more memory,
>>>>> but the current move towards bloating out the
>>>> ... "bloating out the stack" ...
>>>>> worries me.  we started out with a single page of stack (SHARED
>>>>> with the U-area!). I think we are now at several pages.. I forget, is
>>>>> it 8?
>>>>> I'm open to being persuaded but I think we need to have a discussion
>>>>> about stack usage. We used to say that anything greater that, say
>>>>> 64 bytes should probably be allocated.
>>>
>>> I think this patch can give a benefit for the USB stack and CUSE4BSD,
>>> because it does frequent IOCTLs. Regarding the stack usage, maybe
>>> this general purpose optimisation can be allocated from the thread
>>> structure?
>>
>> I was considering something like that. The idea is that frequently
>> allocating threads could allocate, say, 1KB from malloc and use that
>> to satisfy their needs. This definitely helps offenders with longer
>> lifespan, but does not help short-lived ones.
> Allocating 1KB by malloc() consumes that memory for the whole
> lifespan of the thread, and the memory sit unused except some
> short periods.  For the stack allocation which you did, at least
> the memory is reused when needed (i.e. when syscalls other than
> ioctl(2) are executing).

Indeed, the top level of a syscall can allocate a fixed amount in the
thread struct, but this is little different from allocating a fixed
amount on the stack, except with in the thread struct it always costs,
while on the stack it is free provided you don't expand the stack to
match.

This optimization is probably minor, but reminds me of other syscalls
that would benefit using a single largish allocation up front:
- all vfs calls that start with namei().  They allocate PATH_MAX (1K)
   bytes and more.  An 8K stack has plently to spare after allocating
   1K.  However, if malloc() is used then the PATH_MAX limit shouldn't
   exist.   Only a limit to prevent denial of service is needed.
- send()/recv() begin with getsockaddr() which uses plain malloc()
   (not even a uma zone) to allocate a small struct.  The overhead from
   this is significant for small packets if there is not too much bloat
   elsewhere.

> That said, you could slightly improve the code by using
> __builtin_alloc() instead of unconditionally reserving memory on stack.
   __builtin_alloca()
> This will simultaneosly reduce the stack usage, since you only allocate
> as much as needed, and also avoid stack space waste when fallback to
> malloc() is required anyway.

Why not just use a C99 VLA?  It doesn't require any compiler magic except
being a C99 compiler (I haven't seen any of those yet, but some approximate
C99 for VLAs).

You (kib) didn't like using alloca() when I tried to talk you into using
it to get aligned structs for SSE long ago.  __builtin_alloca() has more
chance of forcing the alignment than VLA,s but this is fragile.  FreeBSD
still uses -mpreferred-stack-boundary on i386 with gcc.  This option is
broken (unsupported) with clang, but is less needed since clang does
alignment better.

A quick check shows that alloca and __builtin_alloca() but not arrays
(variable or fixed size) with __aligned(16) work with gcc-4.2.1.  gcc
aligns the stack to a 16-byte boundary using andl with %esp on every
alloca() call even without -mpreferred-stack-boundary=2, but for objects
on the stack aligned using __aligned(16) it depends on its pessimization
of the caller aligning things to a 16-byte boundary, and
-mpreferred-stack-boundary=2 gives only 4-byte alignment.   Alignments
of more than 16 ae just broken in gcc, except possibly using
-mpreferred-stack-boundary=5 to get 32-bit alignment, etc.

Everything seems to work perfectly with clang and arrays, but not using
alloca() or __builtin_alloca():

 	char *s = __builtin_alloca(127);

This gives 16-byte alignment with gcc in cases where it works, but with
clang it gives perfect misalignment (start with an aligned address and
subtract 127 from it).  By choosing the allocation amount to be a
multiple of 16 or that plus a suitable multiple of 4, a final alignment
of 16 can be arranged for clang too.

 	char s[127];

I think all compilers align this.

 	char s[127] __aligned(N);

For N = 8, 16, 32, ..., clang generates the necessary andl with %esp
to align things.  For N = 4, no andl is necessary.  gcc-4.2.1 is just
broken for N > 4 with -mpreferred-stack-boundary=2 and for N > 16 by
default, since it never does the necessary andl's.  (clang also does
"andl $-8,%esp" as necessary to align for int64_t and double
variables, but interestingly, generates no andl for long double
variables.  Its default stack boundary is just 4, and that is best --
only align as required in the few functions that have 64-bit variables
or __aligned() directives.)

So if you need 4-byte alignment, then __builtin_alloca() is good enough
for both gcc and clang.  4-byte alignment is usually enough on i386,
but is less than what malloc() gives.  Otherwise, using __builtin_alloca()
is best for gcc-4.2.1 and using a an array or VLA with __aligned(16) is
best for clang.

The u_char arrays with no alignment statements for select() and now
ioctl() are technically broken since they are not required to be
aligned but are used to hold objects that are required to be aligned
(even the select bits are accessed as fd_mask = u_long objects).  They
only workd because the compiler optimizes large arrays by giving them
more alignment than strictly required.

How do the __builtin_alloca()'s in your FP code (recently synced to i386
where alignment is more difficult by jhb) work?  I think they give
16-byte alignment in all cases, but some cases seem to need 64-byte
alignment and do this by hand.  Old code seems to be little changed,
so I think it still does 16-byte alignment by hand.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20141104045159.E1605>