From owner-cvs-all@FreeBSD.ORG Thu Mar 27 22:41:44 2003
Date: Fri, 28 Mar 2003 17:04:21 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: "Greg 'groggy' Lehey"
cc: cvs-src@FreeBSD.org, Mike Silbersack, src-committers@FreeBSD.org,
    cvs-all@FreeBSD.org, Nate Lawson
Subject: Re: Checksum/copy (was: Re: cvs commit: src/sys/netinet ip_output.c)

On Fri, 28 Mar 2003, Greg 'groggy' Lehey wrote:

> On Thursday, 27 March 2003 at 19:07:15 +1100, Bruce Evans wrote:
> > On Wed, 26 Mar 2003, Mike Silbersack wrote:
> >> On my Mobile Celeron, a for (i = 0; i < max; i++) array[i]=0 runs
> >> faster than bzero. :(
> >
> > Saved data from my benchmarks show that bzero (stosl) was OK on
> > 486's, poor on original Pentiums, OK on K6-1's, best by far on
> > second generation Celerons (ones like PII) and poor on Athlon XP's
> > (but not as relatively bad as on original Pentiums).
>
> What happened to i686_bzero?  I was sure that years ago one existed,
> but now all the machines I use (i686 class) use generic_bzero.

I nuked it in:

%%%
RCS file: /home/ncvs/src/sys/i386/i386/support.s,v
Working file: support.s
head: 1.93
...
----------------------------
revision 1.40
date: 1996/10/09 18:16:17;  author: bde;  state: Exp;  lines: +291 -60
...
Removed old, dead i586_bzero() and i686_bzero().  Read-before-write is
usually bad for i586's.  It doubles the memory traffic unless the data
is already cached, and data is (or should be) very rarely cached for
large bzero()s (the system should prefer uncached pages for cleaning),
and the amount of data handled by small bzero()s is relatively small
in the kernel.
...
----------------------------
%%%

"i686" basically means "second generation Pentium"
(PentiumPro/PII/Celeron); later x86's are mostly handled better using
CPU features instead of a 1-dimensional class number.
Hand-"optimized" bzero's are especially pessimal for this class of
CPU.  The log message is mainly about PentiumPro's; later models
aren't as bad.
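To make the log's read-before-write point concrete: the idea (only
the idea; the actual removed code was assembler) is to touch each
cache line before storing to it, so that the stores hit the cache.
In C it is roughly this (a sketch, with a name of my own):

%%%
#include <stddef.h>

/*
 * Read-before-write zeroing: read one word of each 32-byte (i586)
 * cache line to pull the line in, then store to it.  When the buffer
 * is not already cached this doubles the memory traffic: the line is
 * read from memory and later written back, instead of just written.
 * Assumes buf is int-aligned and len is a multiple of 32.
 */
static void
rbw_bzero(void *buf, size_t len)
{
	volatile int *p = buf;
	size_t i;

	for (i = 0; i < len / sizeof(int); i += 8) {
		(void)p[i];		/* pull the line into the cache */
		p[i + 0] = 0; p[i + 1] = 0; p[i + 2] = 0; p[i + 3] = 0;
		p[i + 4] = 0; p[i + 5] = 0; p[i + 6] = 0; p[i + 7] = 0;
	}
}
%%%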
E.g., on a Celeron 400 MHz overclocked to 6*75MHz:

[bzero times, 4K buffer]
zero0: 2169427140 B/s ( 46095 us) (stosl)
zero1: 1178408485 B/s ( 84860 us) (unroll 16)
zero2: 1180481213 B/s ( 84711 us) (unroll 16 preallocate)
zero3: 1564647390 B/s ( 63912 us) (unroll 32)
zero4: 1287279636 B/s ( 77683 us) (unroll 32 preallocate)
zero5: 1482553913 B/s ( 67451 us) (unroll 64)
zero6: 1469029028 B/s ( 68072 us) (unroll 64 preallocate)
zero7: 1774492387 B/s ( 56354 us) (fstl)
zero8:  888397008 B/s (112562 us) (movl)
zero9: 1179409162 B/s ( 84788 us) (unroll 8)
zeroA: 2125122067 B/s ( 47056 us) (generic_bzero)
zeroB: 1575245644 B/s ( 63482 us) (i486_bzero)
zeroC:  960381695 B/s (104125 us) (i586_bzero)
zeroD: 1289637018 B/s ( 77541 us) (i686_pagezero)

[bzero times, 8M buffer]
zero0: 140685510 B/s ( 698750 us) (stosl)
zero1: 141949085 B/s ( 692530 us) (unroll 16)
zero2: 142107500 B/s ( 691758 us) (unroll 16 preallocate)
zero3: 141911380 B/s ( 692714 us) (unroll 32)
zero4: 141969995 B/s ( 692428 us) (unroll 32 preallocate)
zero5: 141955645 B/s ( 692498 us) (unroll 64)
zero6: 141986195 B/s ( 692349 us) (unroll 64 preallocate)
zero7: 141935968 B/s ( 692594 us) (fstl)
zero8: 142159904 B/s ( 691503 us) (movl)
zero9: 142006295 B/s ( 692251 us) (unroll 8)
zeroA: 140841519 B/s ( 697976 us) (generic_bzero)
zeroB: 142013476 B/s ( 692216 us) (i486_bzero)
zeroC: 141868782 B/s ( 692922 us) (i586_bzero)
zeroD: 360165750 B/s ( 272941 us) (i686_pagezero)
zeroE: 140712494 B/s ( 698616 us) (bzero (stosl))

The best hand-"optimized" versions using integer registers are only
about 12.5% slower than generic_bzero for buffers that fit in the L1
cache, and all bzero methods except i686_pagezero() have the same
speed for buffers that don't fit in any cache.  i686_pagezero() is
faster if the buffer is already all zeros and otherwise slower (the
above time is for all zeros).  The version of i686_pagezero() in the
kernel is especially pessimal (see another reply in this thread).

I didn't try hard to use MMX registers.  In simple tests, 64-bit
memory accesses provided no benefits, at least in the uncached case,
probably for the same reason that 64-bit memory accesses via the FPU
provide no benefits: I believe all writes go through write buffers in
the CPU, and these worked poorly on PentiumPro's and mediocrely on
PII/Celerons.  They work much better on more modern CPUs, as they
must to keep up with increases in memory bandwidth.

Write bandwidth for the PentiumPro family is also limited by
read-before-write (a store miss reads the line into the cache before
it is written).  This more than halves the write bandwidth for large
cache-busting bzero's like the 8MB ones above, and the halving can be
seen in the benchmarks: the main memory bandwidth is approx 360MB/sec
on this system, and i686_pagezero() achieves it since it just reads
the buffer to verify that it is all zeros (optimized read bandwidth
tests that just throw the data away run at the same speed).
Read-before-write halves the maximum write bandwidth to 180MB/sec.
In practice, the write bandwidth is limited to 140MB/sec (slower than
on Pentium I systems with a main memory bandwidth of 180MB/sec! --
these can get near the max for both read and write).

Benefits from SSE for bzeroing and bcopying, if any, would probably
come more from bypassing caches and/or not doing read-before-write
(SSE instructions give control over this) than from operating on
wider data.  I'm dubious about practical benefits.  Obviously it is
not useful to bust the cache when bzeroing 8MB of data, but real
programs and OS's mostly operate on smaller buffers.
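The cache/read-before-write control I mean is the non-temporal store
family (movntps and friends).  A sketch using the compiler intrinsics
(my example, not code from the tree; it assumes a 16-byte aligned
buffer and a length that is a multiple of 64):

%%%
#include <stddef.h>
#include <xmmintrin.h>

/*
 * Zero a buffer with SSE non-temporal stores.  movntps combines in
 * the write buffers and goes straight to memory: no read-before-
 * write, and no displacing of useful cache lines.
 * Assumes buf is 16-byte aligned and len is a multiple of 64.
 */
static void
nt_bzero(void *buf, size_t len)
{
	__m128 zero = _mm_setzero_ps();
	float *p = buf;
	size_t i;

	for (i = 0; i < len / sizeof(float); i += 16) {
		_mm_stream_ps(p + i +  0, zero);
		_mm_stream_ps(p + i +  4, zero);
		_mm_stream_ps(p + i +  8, zero);
		_mm_stream_ps(p + i + 12, zero);
	}
	_mm_sfence();	/* NT stores are weakly ordered; fence them */
}
%%%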
It is negatively useful not to put bzero'ed data in the (L[1-2])
cache if the data will be used soon, and it is generally hard to
predict whether it will be used soon.

Bruce