Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 5 May 2018 12:58:27 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        John Baldwin <jhb@freebsd.org>
Cc:        Matt Macy <mmacy@freebsd.org>, src-committers@freebsd.org,  svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r333242 - head/sys/kern
Message-ID:  <20180505114700.G1307@besplex.bde.org>
In-Reply-To: <3408582.QRuzgOyxgv@ralph.baldwin.cx>
References:  <201805040651.w446p2iB010839@repo.freebsd.org> <3408582.QRuzgOyxgv@ralph.baldwin.cx>

next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, 4 May 2018, John Baldwin wrote:

> On Friday, May 04, 2018 06:51:02 AM Matt Macy wrote:
>> Author: mmacy
>> Date: Fri May  4 06:51:01 2018
>> New Revision: 333242
>> URL: https://svnweb.freebsd.org/changeset/base/333242
>>
>> Log:
>>   `dup1_processes -t 96 -s 5` on a dual 8160
>>
>>   x dup_before
>>   + dup_after
>>   +------------------------------------------------------------+
>>   |             x                                            + |
>>   |x    x   x   x                                         ++ ++|
>>   |   |____AM___|                                          |AM||
>>   +------------------------------------------------------------+
>>       N           Min           Max        Median           Avg        Stddev
>>   x   5  1.514954e+08 1.5230351e+08 1.5206157e+08 1.5199371e+08     341205.71
>>   +   5 1.5494336e+08 1.5519569e+08 1.5511982e+08 1.5508323e+08     96232.829
>>   Difference at 95.0% confidence
>>           3.08952e+06 +/- 365604
>>           2.03266% +/- 0.245071%
>>           (Student's t, pooled s = 250681)
>
> The log doesn't quite describe what the change is though and why it results
> in this change.  bcopy -> memcpy to permit using the compiler builtin I understand,
> but using memcpy instead of a structure copy seems rather odd as I would expect
> the compiler to treat a structure copy as the same as __builtin_memcpy().

It is obvious what it does, and that it can't possibly give the claimed change,
and shouldn't have passed any review.

>>   Reported by:	mjg@
>>   MFC after:	1 week
>>
>> Modified:
>>   head/sys/kern/kern_descrip.c
>>
>> Modified: head/sys/kern/kern_descrip.c
>> ==============================================================================
>> --- head/sys/kern/kern_descrip.c	Fri May  4 04:05:07 2018	(r333241)
>> +++ head/sys/kern/kern_descrip.c	Fri May  4 06:51:01 2018	(r333242)
>> @@ -1503,7 +1503,7 @@ filecaps_copy(const struct filecaps *src, struct filec
>>
>>  	if (src->fc_ioctls != NULL && !locked)
>>  		return (false);
>> -	*dst = *src;
>> +	memcpy(dst, src, sizeof(*src));
>>  	if (src->fc_ioctls == NULL)
>>  		return (true);
>>

This seems to have no effect except on 32-bit arches.  The size of struct
filecaps seems to be 36 on i386 and 40 on amd64.  Thus it is automatically
inlined in all cases that I checked.  memcpy() in the kernel used to be
and extern function, so it was just slower.  But now it is a macro that
reduces to __builtin_memcpy().  The implementation of this macro is broken
-- it is missing parentheses for args.

>> @@ -1512,7 +1512,7 @@ filecaps_copy(const struct filecaps *src, struct filec
>>
>>  	size = sizeof(src->fc_ioctls[0]) * src->fc_nioctls;
>>  	dst->fc_ioctls = malloc(size, M_FILECAPS, M_WAITOK);
>> -	bcopy(src->fc_ioctls, dst->fc_ioctls, size);
>> +	memcpy(dst->fc_ioctls, src->fc_ioctls, size);
>>  	return (true);
>>  }

This is at best a very minor optimization.  bcopy() was changed to a macro
that uses __builtin_memmove() just before memcpy() was changed to use
__builtin_memcpy().  (The first change has more technical complications
and was copied from a similar change for bzero() and is has smaller bugs
-- it is not missing parentheses for args, but is missing them for the
nested bcopy() call like for the nested bzero() call (this looks like an
endlessly recursive call and parentheses should be used to clarify that
it isn't).)

memcpy() might tbe faster than bcopy() since it doesn't have to work for
overlapped copies.  But when both reduce to builtins, in cases like the
above the compiler should be able to see that the copies don't overlap
and further reduce to inlined memcpy() in both cases.

This shows another reason why it is wrong to not reduce bcopy() to
__builtin_memmove() in all cases -- it prevents the compiler further reducing
to memcpy() in the non-inlined case.  bcopy() and memcpy() are almost
equally pessimal on x86, but bcopy() is inherently slower.

Unfortunately, there are compiler bugs in this area.  -ffreestanding -mno-sse
-m32 breaks __builtin_memmove() with both clang and gcc for a struct copy of
size 36.  For a struct copy, the result is just bad code:
- gcc-4.2.1 generates 9 pairs of movl's through integer registers.  Too many
   I think.
- clang generates "rep movsl".  This is much slower than even the large code
   generated by gcc.
__builtin_memcpy() instead of __builtin_memmove() generates the same not so
good code as for a struct copy.

Adjusting the above analysis for this gives:
- changing the struct copy to memcpy() has no effect
- using __builtin_memmove() to implement bcopy() suffers from compiler bugs.
   The compiler not only fails to reduce to __builtin_memcpy() in most cases
- the second change above does have an effect since it bypasses the compiler
   bugs.

Further examples of the compiler bugs:
- __builtin_memmove() with size 36 does work with clang on amd64 with
   -ffreestanding -mno-sse (but no -m32).  So the kernel implementation
   works OK on amd64 with clang.  I thought that the limit was 32 for
   all of these functions.  The limit should obviously be larger with
   larger registers including AVX.  In these tests, I didn't use any
   special inlining flags.  The kernel uses expanded inlining flags
   which might make a difference.  The -Winline flag to detect failed
   inlining was recently broken (removed).  There should be a similar
   flag to detect using builtins that don't work.
- __builtin_memmove() doesn't work with gcc-4.2.1 for any size on either
   amd64 or i386 with or without special flags (it reduces to memmove().
   So the kernel implementation doesn't works on either amd64 or i386 with
   gcc.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20180505114700.G1307>