Date:      Mon, 13 Feb 2017 01:27:16 -0800
From:      Mark Millard <markmi@dsl-only.net>
To:        Alexandre Martins <alexandre.martins@stormshield.eu>
Cc:        freebsd-arm <freebsd-arm@freebsd.org>, Ian Lepore <ian@freebsd.org>
Subject:   Re: bcopy/memmove optimization broken ? [looks like you are correct to me, I give supporting detail]
Message-ID:  <8E5F8A15-2F79-4015-B93B-975D27308782@dsl-only.net>
In-Reply-To: <7424243.zp5tqGREgJ@pc-alex>
References:  <5335118.oK1KXXDaG5@pc-alex> <25360EAB-3079-4037-9FB5-B7781ED40FA6@dsl-only.net> <7424243.zp5tqGREgJ@pc-alex>

On 2017-Feb-13, at 12:28 AM, Alexandre Martins <alexandre.martins@stormshield.eu> wrote:

> On Friday, 10 February 2017 at 18:11:23, Mark Millard wrote:
>> On 2017-Feb-10, at 7:51 AM, Alexandre Martins <alexandre.martins@stormshield.eu> wrote:
>>> Hi all
>>>
>>> I see in the kernel code of bcopy/memmove an optimization if the copied
>>> regions don't overlap:
>>>
>>> https://svnweb.freebsd.org/base/head/sys/arm/arm/support.S?view=annotate#l403
>>>
>>> I'm a newbie in ARM assembly, so sorry if I'm wrong, but:
>>> - R0 is the dst (1st parameter)
>>> - R1 is the src (2nd parameter)
>>>
>>> So:
>>> subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */
>>> subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */
>>> cmp r3, r2 /* if (r3 < len) we have an overlap */
>>> bcc PIC_SYM(_C_LABEL(memcpy), PLT)
>>>
>>> It seems to be inverted. Should it be this instead?
>>> subcs r3, r0, r1 /* if (dst > src) r3 = dst - src */
>>> subcc r3, r1, r0 /* if (src > dsr) r3 = src - dst */
>>> cmp r3, r2 /* if (r3 < len) we have an overlap */
>>> bcs PIC_SYM(_C_LABEL(memcpy), PLT)
>>>
>>>
>>> Best regards
>>
>> I did not expect something so central that has been
>> around so long to look wrong but. . .
>>
>> Going through the supporting details of the original
>> code, based on my looking around, here is what I found:
>>
>> #include <string.h>
>> void *memmove(void *dest, const void *src, size_t n);
>>
>> So I'd expect (c'ish notation):
>> r0==dest
>> r1==src
>> r2==n
>>
>> Then for (the comments vs. the code are being challenged):
>> (The comments do seem to give the intent correctly.)
>>
>> cmp     r0, r1
>> RETeq           /* Bail now if src/dst are the same */
>> subcc   r3, r0, r1      /* if (dst > src) r3 = dst - src */
>> subcs   r3, r1, r0      /* if (src > dsr) r3 = src - dst */
>> cmp     r3, r2          /* if (r3 < len) we have an overlap */
>> bcc     PIC_SYM(_C_LABEL(memcpy), PLT)
>> . . .
>>
>> cmp r0,r1 should result in condition code (c'ish notation):
>>
>> eq=(r0==r1)
>> cc=(r0<r1)
>> cs=(r0>=r1)
>>
>> (Only the r0 position has no immediate-value alternative.)
>>
>>
>> subcc r3,r0,r1 is: if (cc) r3=r0-r1 // no condition code change
>> In other words: if (dst<src) r3=dst-src
>>
>> So it does not match the test listed in the comment as
>> far as I can see. And in (mathematical) integer arithmetic
>> the r3 result would be negative for dst-src. For
>> non-negative arithmetic (natural or whole): undefined.
>>
>>
>> subcs r3,r1,r0 is: if (cs) r3=r1-r0 // no condition code change
>> In other words: if (dst>=src) r3=src-dst
>>
>> So it does not match the test listed in the comment as
>> far as I can see. And in (mathematical) integer arithmetic
>> the r3 result would be nonpositive for src-dst. But the
>> earlier RETeq prevents the zero case, so: negative. For
>> non-negative arithmetic (natural or whole): undefined.
>>
>>
>> If it were only a normal mathematical context, r3=-abs(dst-src)
>> would be a summary of the two-sub-instruction sequence as it
>> is, from what I can tell.
>>
>> For this purpose the summary should be: r3=abs(dst-src), given
>> dst!=src. There is no need to wander outside normal
>> mathematical non-negative arithmetic in the process either.
>>
>> Your code change would have the right summary and use only
>> normal mathematical rules from what I can tell:
>>
>> cmp     r0, r1
>> RETeq           /* Bail now if src/dst are the same */
>> subcs r3, r0, r1 /* if (dst > src) r3 = dst - src */
>> subcc r3, r1, r0 /* if (src > dsr) r3 = src - dst */
>> cmp r3, r2 /* if (r3 < len) we have an overlap */
>> bcs PIC_SYM(_C_LABEL(memcpy), PLT)
>> . . .
>>
>> subcs r3,r0,r1 is: if (cs) r3=r0-r1 // no condition code change
>> In other words: if (dst>=src) r3=dst-src.
>> Given the prior RETeq, that is effectively: if (dst>src) r3=dst-src.
>> And that matches the comments and would produce a positive result
>> for r3, matching the normal mathematical result.
>>
>> subcc r3,r1,r0 is: if (cc) r3=r1-r0 // no condition code change
>> In other words: if (dst<src) r3=src-dst
>> And that matches the comments and would produce a positive result
>> for r3, matching the normal mathematical result.
>>
>> Overall summary of the two updated sub?? instructions:
>> r3=abs(dst-src), given dst!=src.
>>
>> And that would make for an appropriate comparison to n (i.e., to r2).
>>
>> It appears to have been as it is now since r143175, when memmove was
>> added to sys/arm/arm/support.S (2005-Apr-12).
>>
>>
>> ===
>> Mark Millard
>> markmi@dsl-only.net
>
>
>
> Thank you for this deep analysis!
>
> I also made some benchmarks. It seems that the "Xscale" version of
> memcpy/memmove is slower than the standard "ARM" one on my platform (armada388).
>
> I made the change by undefining _ARM_ARCH_5E:
> #define _ARM_ARCH_5E => Xscale version
> #undef _ARM_ARCH_5E => "ARM" version
>
> Here are my results:
>
> Block size: 2048
> memcpy (Kernel ARM)      :   1028.7 MB/s
> memmove (Kernel ARM)     :    616.5 MB/s
> memcpy (Kernel xscale)   :    920.1 MB/s
> memmove (Kernel xscale)  :    618.8 MB/s
>
> Block size: 128
> memcpy (Kernel ARM)      :   1018.5 MB/s
> memmove (Kernel ARM)     :    668.4 MB/s
> memcpy (Kernel xscale)   :    825.9 MB/s
> memmove (Kernel xscale)  :    668.6 MB/s
>
> Block size: 64
> memcpy (Kernel ARM)      :    892.9 MB/s
> memmove (Kernel ARM)     :    667.2 MB/s
> memcpy (Kernel xscale)   :    721.2 MB/s
> memmove (Kernel xscale)  :    668.2 MB/s
>
> Block size: 32
> memcpy (Kernel ARM)      :    620.6 MB/s
> memmove (Kernel ARM)     :    634.6 MB/s
> memcpy (Kernel xscale)   :    504.9 MB/s
> memmove (Kernel xscale)  :    634.5 MB/s
>
> Block size: 16
> memcpy (Kernel ARM)      :    471.8 MB/s
> memmove (Kernel ARM)     :    464.5 MB/s
> memcpy (Kernel xscale)   :    254.5 MB/s
> memmove (Kernel xscale)  :    464.7 MB/s
>
> As you can see, the smaller the memcpy size, the faster the standard (ARM)
> version of memcpy is compared to the Xscale one.
>
> In addition, the libc version suffers from the same problem, but is 15% more
> efficient.
>
> What can I do to help you on this point?
>
> Best regards
> --
> Alexandre Martins
> STORMSHIELD

I recommend submitting your original discovery to bugzilla:
https://bugs.freebsd.org/bugzilla/enter_bug.cgi

I can submit the original find if you are not going to.
(I would likely not submit a benchmark report since I
have not done such benchmarking or analysis producing
expected-performance estimates.)


I'm not a FreeBSD committer, nor an arm expert, and I'm not working
on such issues as such. I just decided to look up and study enough
material to confirm or deny what you had written, in part because
the notation looked like it could be easy to get the condition
code suffixes and argument order mismatched for sub??. (An
alternate fix is to reverse the operands in the two sub??
instructions but leave the cc and cs as they are -- but then
the comments would also need to be updated to track the change.)
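
To double-check the two variants, here is a small C model of what the
instruction pair computes after "cmp r0, r1" (this is only my sketch of
the condition-code behavior, with made-up helper names; it is not the
kernel code itself):

#include <assert.h>
#include <stdint.h>

/*
 * After "cmp r0, r1" (r0 = dst, r1 = src), ARM leaves:
 *   cs (carry set)   when dst >= src (unsigned)
 *   cc (carry clear) when dst <  src (unsigned)
 * Each helper models one variant of the two conditional subtracts.
 */

/* Current code: subcc r3,r0,r1 ; subcs r3,r1,r0 */
static uint32_t r3_current(uint32_t dst, uint32_t src)
{
	return (dst < src) ? dst - src : src - dst;	/* wraps: -(abs diff) */
}

/* Suffix-swap fix (your patch): subcs r3,r0,r1 ; subcc r3,r1,r0 */
static uint32_t r3_swap_suffixes(uint32_t dst, uint32_t src)
{
	return (dst >= src) ? dst - src : src - dst;	/* abs diff */
}

/* Operand-swap alternative: subcc r3,r1,r0 ; subcs r3,r0,r1 */
static uint32_t r3_swap_operands(uint32_t dst, uint32_t src)
{
	return (dst < src) ? src - dst : dst - src;	/* abs diff */
}

int main(void)
{
	uint32_t dst = 0x2000, src = 0x2040;	/* dst != src, as RETeq guarantees */

	/* Both fixes agree and give abs(dst - src). */
	assert(r3_swap_suffixes(dst, src) == 0x40);
	assert(r3_swap_operands(dst, src) == 0x40);

	/* The current pairing gives the two's-complement negation instead. */
	assert(r3_current(dst, src) == (uint32_t)-0x40);
	return (0);
}

Both fixed variants produce abs(dst - src) for any dst != src, which is
what the later cmp against r2 (len) needs; the current pairing produces
the wrapped negation.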

I added Ian L. as a CC: because he is knowledgeable, active,
and a committer. If I messed up in studying the material,
he would likely catch my mistakes. If you (and my confirmation)
are right, then he could likely fix the code in svn as well.


As the decision about when to call the code that can
deal with overlapping memory regions is wrong, the code
that should only be used for non-overlapping regions likely
would end up handling some overlapping regions and so would operate
incorrectly in at least some cases.

In other words, I think the bug is worse than just an
example of being sub-optimal: the code is wrong from what
I can tell. (I've no clue if the code is ever put to use
for any bad cases.)
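
For reference, the decision the comments describe amounts to the
following (a C sketch of the intended dispatch only, with a hypothetical
helper name; the real routine is the assembly in support.S):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Take the memcpy fast path only when the regions cannot overlap,
 * i.e. when abs(dst - src) >= len; otherwise copy in an order that
 * is safe for the overlap.
 */
static void *
memmove_sketch(void *dst, const void *src, size_t len)
{
	uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
	unsigned char *dp;
	const unsigned char *sp;

	if (d == s || len == 0)
		return (dst);			/* same buffer or nothing to copy */

	if (((d > s) ? d - s : s - d) >= len)	/* r3 = abs(dst - src) */
		return (memcpy(dst, src, len));	/* no overlap: fast path */

	dp = dst;
	sp = src;
	if (d < s) {				/* dst below src: copy forward */
		while (len--)
			*dp++ = *sp++;
	} else {				/* dst above src: copy backward */
		dp += len;
		sp += len;
		while (len--)
			*--dp = *--sp;
	}
	return (dst);
}

With the current subcc/subcs pairing the value compared against len is
not that absolute difference, so the branch does not implement this
decision as written.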


It is true that, for the non-overlapping cases that are
sent to the more general, slower, overlap-capable
code, the performance would likely be worse than for the code
designed only for non-overlapping regions.


As for your benchmarks:

I'm not sure if you benchmarked the original code vs.
your corrected code. The corrected code would be the
most interesting (presuming you [and my confirmation]
are correct).

If you submit to bugzilla, I'd suggest any benchmark
reports be submitted separately from the original
issue with the sub?? instructions.

The 32-bit ARMs that I have access to are both Cortex-A7
based: a BPI-M3 and an RPI2, so armv7. I've not tried such
benchmarking on them. And I'm not likely to, at least not
any time soon.

===
Mark Millard
markmi@dsl-only.net



