Date:      Fri, 28 Mar 2003 18:44:06 +1100 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Dag-Erling Smørgrav <des@ofug.org>
Cc:        cvs-all@FreeBSD.org
Subject:   Re: Checksum/copy
Message-ID:  <20030328174850.M6165@gamplex.bde.org>
In-Reply-To: <xzpr88sv3ss.fsf@flood.ping.uio.no>
References:  <Pine.BSF.4.21.0303260956250.27748-100000@root.org> <20030327180247.D1825@gamplex.bde.org> <20030327212647.GA64029@walton.maths.tcd.ie> <xzpr88sv3ss.fsf@flood.ping.uio.no>

On Thu, 27 Mar 2003, Dag-Erling Smørgrav wrote:

> David Malone <dwmalone@maths.tcd.ie> writes:
> > On Thu, Mar 27, 2003 at 09:57:35AM +0100, des@ofug.org wrote:
> > > Might it be a good idea to have separate b{copy,zero} implementations
> > > for special purposes like pmap_{copy,zero}_page?
> > We do have a i686_pagezero already, which seems to be used in
> > pmap_zero_page - I guess it may not be well tuned to modern processors,
> > as it is almost 5 years old.

Indeed.

> i686_pagezero uses 'rep stosl' after an initial 'rep scasl' to check
> if the page was already zero (which is a pessimization unless we zero
> a lot of pages that are already zeroed).  SSE can do far better than
> that.

Even integer instructions can do significantly better than scasl/stosl
on "686"s (PentiumPro and similar CPUs).  Implementation bugs in
i686_pagezero() include:
- scasl is one of the slowest ways to read memory, at least on old
  Celerons (I believe PPro's have similar timing for string operations).
  It is a bit slower than lodsl, which is about 3.5 times slower than
  a lightly unrolled movl loop for the L1-cached case and about 2 times
  slower for the uncached case.
- the code apparently intends to check 16 words at a time, but due to
  getting a comparison backwards it actually zeros everything else as
  soon as it finds a nonzero word, with extra obfuscations and
  pessimizations when it is within 16 words of the end.
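
A minimal C model of what the second item describes. This is a sketch of
the effective behaviour, not a transcription of the assembly:
old_pagezero_model and PAGE_WORDS are hypothetical names, and the
1024-word page size is taken from the "movl $1024, %ecx" in the routine.
It returns the number of words stored, so the cost of one stray nonzero
word is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS	1024	/* 4096-byte page / 4-byte words */

/*
 * Hypothetical model of the old routine's effective behaviour: scan for
 * the first nonzero word, then store zeros from there to the end of the
 * page.  Returns the number of words stored.
 */
size_t
old_pagezero_model(uint32_t *page)
{
	size_t i, stores;

	assert(page != NULL);
	for (i = 0; i < PAGE_WORDS && page[i] == 0; i++)
		;			/* stands in for "repe scasl" */
	stores = PAGE_WORDS - i;
	for (; i < PAGE_WORDS; i++)
		page[i] = 0;		/* stands in for "rep stosl" */
	return (stores);
}
```

For a page whose only nonzero word is word 10, this model stores 1014
words, even though the rest of the page was already zero.
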
Implementation non-bugs include using stosl to do the zeroing.  Unlike
lodsl and scasl, stosl is actually useful for its intended purpose on
"686"s.

Instead of fixing the comparison and any other logic bugs, I rewrote the
function using orl instead of scasl, and simpler logic (ignore the changes
for the previous function in the same hunk).

%%%
Index: support.s
===================================================================
RCS file: /home/ncvs/src/sys/i386/i386/support.s,v
retrieving revision 1.93
diff -u -2 -r1.93 support.s
--- support.s	22 Sep 2002 04:45:20 -0000	1.93
+++ support.s	22 Sep 2002 09:51:27 -0000
@@ -333,70 +337,58 @@
 	movl	%edx,%edi
 	xorl	%eax,%eax
-	shrl	$2,%ecx
 	cld
+	shrl	$2,%ecx
 	rep
 	stosl
 	movl	12(%esp),%ecx
 	andl	$3,%ecx
-	jne	1f
-	popl	%edi
-	ret
-
-1:
+	je	1f
 	rep
 	stosb
+1:
 	popl	%edi
 	ret
-#endif /* I586_CPU && defined(DEV_NPX) */
+#endif /* I586_CPU && DEV_NPX */

+#ifdef I686_CPU
 ENTRY(i686_pagezero)
-	pushl	%edi
-	pushl	%ebx
-
-	movl	12(%esp), %edi
+	movl	4(%esp), %edx
 	movl	$1024, %ecx
-	cld

 	ALIGN_TEXT
 1:
-	xorl	%eax, %eax
-	repe
-	scasl
-	jnz	2f
+	movl	(%edx), %eax
+	orl	4(%edx), %eax
+	orl	8(%edx), %eax
+	orl	12(%edx), %eax
+	orl	16(%edx), %eax
+	orl	20(%edx), %eax
+	orl	24(%edx), %eax
+	orl	28(%edx), %eax
+	jne	2f
+
+	addl	$32, %edx
+	subl	$32/4, %ecx
+	jne	1b

-	popl	%ebx
-	popl	%edi
 	ret

 	ALIGN_TEXT
-
 2:
-	incl	%ecx
-	subl	$4, %edi
+	movl	$0, (%edx)
+	movl	$0, 4(%edx)
+	movl	$0, 8(%edx)
+	movl	$0, 12(%edx)
+	movl	$0, 16(%edx)
+	movl	$0, 20(%edx)
+	movl	$0, 24(%edx)
+	movl	$0, 28(%edx)
+
+	addl	$32, %edx
+	subl	$32/4, %ecx
+	jne	1b

-	movl	%ecx, %edx
-	cmpl	$16, %ecx
-
-	jge	3f
-
-	movl	%edi, %ebx
-	andl	$0x3f, %ebx
-	shrl	%ebx
-	shrl	%ebx
-	movl	$16, %ecx
-	subl	%ebx, %ecx
-
-3:
-	subl	%ecx, %edx
-	rep
-	stosl
-
-	movl	%edx, %ecx
-	testl	%edx, %edx
-	jnz	1b
-
-	popl	%ebx
-	popl	%edi
 	ret
+#endif /* I686_CPU */

 /* fillw(pat, base, cnt) */
%%%

Implementation notes: using orl might not be best (due to pipelining issues).
Using movl instead of stosl might not be best (I used it to simplify the
logic and reduce initialization overheads).
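
The logic of the rewritten loop can be sketched in C as follows.  This is
an illustrative userland model, not the kernel code: pagezero_sketch,
PAGE_WORDS and BLOCK_WORDS are hypothetical names, and the return value
(the number of 32-byte blocks rewritten) is added only to make the
behaviour observable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS	1024	/* 4096-byte page / 4-byte words */
#define BLOCK_WORDS	8	/* 32 bytes, matching the unrolled loop */

/*
 * Hypothetical C model: OR the eight words of each 32-byte block
 * together and store zeros only into blocks where the result is
 * nonzero.  Returns the number of blocks that had to be rewritten.
 */
size_t
pagezero_sketch(uint32_t *page)
{
	size_t blk, i, dirty = 0;

	assert(page != NULL);
	for (blk = 0; blk < PAGE_WORDS; blk += BLOCK_WORDS) {
		uint32_t acc = 0;

		/* acc is nonzero iff at least one word in the block is. */
		for (i = 0; i < BLOCK_WORDS; i++)
			acc |= page[blk + i];
		if (acc == 0)
			continue;	/* already zero: read only, no stores */
		for (i = 0; i < BLOCK_WORDS; i++)
			page[blk + i] = 0;
		dirty++;
	}
	return (dirty);
}
```

With nonzero words only at indices 10 and 1023, two blocks are rewritten
and the other 126 are read but not written.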

This hasn't been tested recently.  I've had it disabled in pmap.c for
as long as I can remember, to prepare for complete testing (my pmap.c
just uses bzero()).

The importance of optimizing this function can be gauged by the number of
people who have noticed that it never worked right and the number of
commits to make it work right.

Zeroing pages is not completely unimportant, however.  The pagezero task
takes about 5% of the time for a makeworld here.  Most of this time is
"free" here since pagezero can run while the system is waiting for disks,
and I don't run much else while doing makeworld benchmarks.  However, it
is not free time under different/heavier loads.

Bruce


