Date: Sun, 19 Feb 2006 00:29:17 +1100 (EST)
From: Bruce Evans
To: Andrew Gallatin
Cc: freebsd-amd64@FreeBSD.org
Subject: Re: non-temporal copyin/copyout?
Message-ID: <20060218232213.F59482@delplex.bde.org>
In-Reply-To: <17397.58669.457047.277510@grasshopper.cs.duke.edu>
References: <17397.58669.457047.277510@grasshopper.cs.duke.edu>

On Fri, 17 Feb 2006, Andrew Gallatin wrote:

> Has anybody considered using non-temporal copies for the in-kernel
> bcopy on amd64?

Yes.  It's probably a small pessimization since large bcopys are (or
should be) rare.  If you really mean copyin/copyout as in the subject
line, then things are less clear.

> A quick test in userspace shows that for large copies, an adapted
> pagecopy (from amd64/amd64/support.S) more than doubles bcopy
> bandwidth from 1.2GB/s to 2.5GB/s on my Athlon64 X2 3800+.

Is this with 5+GHz memory or with slower memory with the source cached?
I've seen 1.7GB/s in non-quick tests in user space with PC3200 memory
overclocked slightly.  This is almost twice as fast as using the best
nontemporal copy method (which gives 0.9GB/s on the same machine).

> I'm bringing this up because I've noticed that FreeBSD 10GbE
> performance is far below Solaris/amd64 and linux/x86_64 when using the
> PCI-e 10GbE adaptor that I'm doing drivers for.  For example, Solaris
> can receive a netperf TCP stream at 9.75Gb/sec while using only 47%
> CPU as measured by vmstat.  (eg, it is using a little less than a
> single core).  In contrast, FreeBSD is limited to 7.7Gb/sec, and uses
> nearly 90% CPU.  When profiling with hwpmc, I see a profile which
> shows up to 70% of the time is spent in copyout.

The problem with always using nontemporal copies is that they might be
much slower if the target is already cached (which would often be the
case, for example, if the same small buffer is used repeatedly), and
they would be slower if the application actually uses the data soon
enough after reading it that it doesn't become uncached (if it was
cached as a side effect of the copy).

I once thought that movnt* doesn't take any advantage of cached data.
Testing on AthlonXP showed that this isn't much of a problem -- repeated
movnt{q,ps}'s to the same small buffer go almost as fast (to within
about 10%, with half the extra overhead for prefetchnta) as the best
temporal method, provided the target buffer is read into the cache first
(otherwise temporal copies are limited to the bandwidth of main memory,
which is 3 to 5 times slower on my test machines).

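For readers who have not used these instructions, a nontemporal copy of
the general kind being measured here can be sketched in user-space C
roughly as follows.  This is only an illustration under stated
assumptions: it is not the pagecopy code from amd64/amd64/support.S and
not what was benchmarked above.  The helper name nt_copy is
hypothetical, and it uses the SSE2 movntdq store (through the
_mm_stream_si128 intrinsic) rather than the movnt{q,ps} forms named
above.  Where SSE2 is not available it simply falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper (not FreeBSD code): copy len bytes using
 * nontemporal (streaming) stores when SSE2 is available.
 */
static void *
nt_copy(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t head;

	/*
	 * movntdq needs a 16-byte-aligned destination, so copy a short
	 * head with ordinary stores until d is aligned.
	 */
	head = (16 - ((uintptr_t)d & 15)) & 15;
	if (head > len)
		head = len;
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

	/* Main loop: unaligned load, streaming store that bypasses the cache. */
	for (; len >= 16; d += 16, s += 16, len -= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_stream_si128((__m128i *)d, v);
	}

	/* Order the weakly-ordered streaming stores before later stores. */
	_mm_sfence();

	/* Copy the remaining tail (fewer than 16 bytes) normally. */
	memcpy(d, s, len);
#else
	/* Assumption: no SSE2, so there is nothing nontemporal to use. */
	memcpy(dst, src, len);
#endif
	return (dst);
}
```

Timing this against memcpy() with the destination pre-read and not
pre-read is the sort of comparison described above; the numbers will
depend heavily on the CPU and memory.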
However, on my Athlon64 and sledge's Opteron, movnt{q,ps,i} is limited
to the speed of main memory whether or not the target buffer is
pre-read.

If it weren't for the Athlon64 behaviour, then using nontemporal copies
for all larger copyin/outs would probably be best.  "Large" wouldn't
need to be very large for the 10% overhead to be a reasonable tradeoff.
If the target were cached then the copy would go 10% slower (than very
fast), and if the target weren't cached but the data weren't actually
nontemporal (because the application actually uses it soon), then the
copy would go as fast as possible and the cost of reading the data into
the cache would be paid later, giving about the same total cost as
reading it as part of the copy.

With the Athlon64 behaviour, I think nontemporal copies should only be
used in cases where it is known that the copies really are nontemporal.
We use them for page copying now because this is (almost) known.  For
copyout(), it would certainly be known only for copies that are so
large that they can't fit in the L2 cache.  copyin() might be
different, since it might often be known that the data will be DMA'ed
out by a driver and need never be cached.

Bruce
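
The policy in the last paragraph could be written as a small predicate
along the following lines.  This is a sketch only, not existing kernel
code: the function name, the known_nontemporal flag, and the 512KB
figure are all assumptions made for illustration, and a real
implementation would have to find the L2 size of the running CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed L2 cache size, for illustration only; real CPUs differ. */
#define	ASSUMED_L2_BYTES	((size_t)512 * 1024)

/*
 * Hypothetical policy helper: use a nontemporal copy only when the copy
 * is known to be nontemporal.  That is the case if the caller says so
 * (for example, a copyin() whose data will only be DMA'ed out), or if
 * the copy is too large to fit in the L2 cache at all.
 */
static bool
want_nontemporal(size_t len, size_t l2_bytes, bool known_nontemporal)
{
	if (known_nontemporal)
		return (true);
	return (len > l2_bytes);
}
```

With these assumptions, a 4KB copyout() would stay on the ordinary
temporal path, while a copy of several megabytes would take the
nontemporal one.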