Date: Sat, 5 May 2018 12:58:27 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: John Baldwin <jhb@freebsd.org> Cc: Matt Macy <mmacy@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r333242 - head/sys/kern Message-ID: <20180505114700.G1307@besplex.bde.org> In-Reply-To: <3408582.QRuzgOyxgv@ralph.baldwin.cx> References: <201805040651.w446p2iB010839@repo.freebsd.org> <3408582.QRuzgOyxgv@ralph.baldwin.cx>
next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, 4 May 2018, John Baldwin wrote: > On Friday, May 04, 2018 06:51:02 AM Matt Macy wrote: >> Author: mmacy >> Date: Fri May 4 06:51:01 2018 >> New Revision: 333242 >> URL: https://svnweb.freebsd.org/changeset/base/333242 >> >> Log: >> `dup1_processes -t 96 -s 5` on a dual 8160 >> >> x dup_before >> + dup_after >> +------------------------------------------------------------+ >> | x + | >> |x x x x ++ ++| >> | |____AM___| |AM|| >> +------------------------------------------------------------+ >> N Min Max Median Avg Stddev >> x 5 1.514954e+08 1.5230351e+08 1.5206157e+08 1.5199371e+08 341205.71 >> + 5 1.5494336e+08 1.5519569e+08 1.5511982e+08 1.5508323e+08 96232.829 >> Difference at 95.0% confidence >> 3.08952e+06 +/- 365604 >> 2.03266% +/- 0.245071% >> (Student's t, pooled s = 250681) > > The log doesn't quite describe what the change is though and why it results > in this change. bcopy -> memcpy to permit using the compiler builtin I understand, > but using memcpy instead of a structure copy seems rather odd as I would expect > the compiler to treat a structure copy as the same as __builtin_memcpy(). It is obvious what it does, and that it can't possibly give the claimed change, and shouldn't have passed any review. >> Reported by: mjg@ >> MFC after: 1 week >> >> Modified: >> head/sys/kern/kern_descrip.c >> >> Modified: head/sys/kern/kern_descrip.c >> ============================================================================== >> --- head/sys/kern/kern_descrip.c Fri May 4 04:05:07 2018 (r333241) >> +++ head/sys/kern/kern_descrip.c Fri May 4 06:51:01 2018 (r333242) >> @@ -1503,7 +1503,7 @@ filecaps_copy(const struct filecaps *src, struct filec >> >> if (src->fc_ioctls != NULL && !locked) >> return (false); >> - *dst = *src; >> + memcpy(dst, src, sizeof(*src)); >> if (src->fc_ioctls == NULL) >> return (true); >> This seems to have no effect except on 32-bit arches. The size of struct filecaps seems to be 36 on i386 and 40 on amd64. Thus it is automatically inlined in all cases that I checked. memcpy() in the kernel used to be and extern function, so it was just slower. But now it is a macro that reduces to __builtin_memcpy(). The implementation of this macro is broken -- it is missing parentheses for args. >> @@ -1512,7 +1512,7 @@ filecaps_copy(const struct filecaps *src, struct filec >> >> size = sizeof(src->fc_ioctls[0]) * src->fc_nioctls; >> dst->fc_ioctls = malloc(size, M_FILECAPS, M_WAITOK); >> - bcopy(src->fc_ioctls, dst->fc_ioctls, size); >> + memcpy(dst->fc_ioctls, src->fc_ioctls, size); >> return (true); >> } This is at best a very minor optimization. bcopy() was changed to a macro that uses __builtin_memmove() just before memcpy() was changed to use __builtin_memcpy(). (The first change has more technical complications and was copied from a similar change for bzero() and is has smaller bugs -- it is not missing parentheses for args, but is missing them for the nested bcopy() call like for the nested bzero() call (this looks like an endlessly recursive call and parentheses should be used to clarify that it isn't).) memcpy() might tbe faster than bcopy() since it doesn't have to work for overlapped copies. But when both reduce to builtins, in cases like the above the compiler should be able to see that the copies don't overlap and further reduce to inlined memcpy() in both cases. This shows another reason why it is wrong to not reduce bcopy() to __builtin_memmove() in all cases -- it prevents the compiler further reducing to memcpy() in the non-inlined case. bcopy() and memcpy() are almost equally pessimal on x86, but bcopy() is inherently slower. Unfortunately, there are compiler bugs in this area. -ffreestanding -mno-sse -m32 breaks __builtin_memmove() with both clang and gcc for a struct copy of size 36. For a struct copy, the result is just bad code: - gcc-4.2.1 generates 9 pairs of movl's through integer registers. Too many I think. - clang generates "rep movsl". This is much slower than even the large code generated by gcc. __builtin_memcpy() instead of __builtin_memmove() generates the same not so good code as for a struct copy. Adjusting the above analysis for this gives: - changing the struct copy to memcpy() has no effect - using __builtin_memmove() to implement bcopy() suffers from compiler bugs. The compiler not only fails to reduce to __builtin_memcpy() in most cases - the second change above does have an effect since it bypasses the compiler bugs. Further examples of the compiler bugs: - __builtin_memmove() with size 36 does work with clang on amd64 with -ffreestanding -mno-sse (but no -m32). So the kernel implementation works OK on amd64 with clang. I thought that the limit was 32 for all of these functions. The limit should obviously be larger with larger registers including AVX. In these tests, I didn't use any special inlining flags. The kernel uses expanded inlining flags which might make a difference. The -Winline flag to detect failed inlining was recently broken (removed). There should be a similar flag to detect using builtins that don't work. - __builtin_memmove() doesn't work with gcc-4.2.1 for any size on either amd64 or i386 with or without special flags (it reduces to memmove(). So the kernel implementation doesn't works on either amd64 or i386 with gcc. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20180505114700.G1307>