From: Andrew Gallatin
Date: Sat, 18 Feb 2006 17:02:34 -0500 (EST)
To: Bruce Evans
Cc: Andrew Gallatin, freebsd-amd64@FreeBSD.org
Subject: Re: non-temporal copyin/copyout?
Message-ID: <17399.39290.13815.777894@grasshopper.cs.duke.edu>
In-Reply-To: <20060218232213.F59482@delplex.bde.org>
References: <17397.58669.457047.277510@grasshopper.cs.duke.edu> <20060218232213.F59482@delplex.bde.org>

Bruce Evans writes:
> On Fri, 17 Feb 2006, Andrew Gallatin wrote:
>
> > Has anybody considered using non-temporal copies for the in-kernel
> > bcopy on amd64?
>
> Yes.
> It's probably a small pessimization since large bcopys are (or
> should be) rare.  If you really mean copyin/copyout as in the subject
> line, then things are less clear.

Yes, copyin/copyout is what I really meant.

> > A quick test in userspace shows that for large copies, an adapted
> > pagecopy (from amd64/amd64/support.S) more than doubles bcopy
> > bandwidth from 1.2GB/s to 2.5GB/s on my Athlon64 X2 3800+.
>
> Is this with 5+GHz memory or with slower memory with the source cached?
> I've seen 1.7GB/s in non-quick tests in user space with PC3200 memory
> overclocked slightly.  This is almost twice as fast as using the best
> nontemporal copy method (which gives 0.9GB/s on the same machine).

This is a "DFI Lanparty UTnF4 Ultra-D" with an Nforce 4 chipset, and two
256 MB sticks of PC3200 RAM.  The timings I mention above closely match
the lmbench "bcopy" benchmark for large buffers (> L2 cache) when run
on FreeBSD vs when run on Solaris (which uses a non-temporal bcopy even
in userspace).

<....>

> With the Athlon64 behaviour, I think nontemporal copies should only be
> used in cases where it is known that the copies really are nontemporal.
> We use them for page copying now because this is (almost) known.  For
> copyout(), it would be certainly known only for copies that are so large
> that they can't fit in the L2 cache.  copyin() might be different, since
> it might often be known that the data will be DMA'ed out by a driver and
> need never be cached.

I think you could make arguments for doing a non-temporal copy for
both copyin and copyout when the size exceeds some tunable threshold.
Solaris even uses a fixed threshold, and I believe the threshold is
quite small (128 bytes).  See
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/intel/ia32/ml/copy.s

Maybe I'm being naive, but I would assume that most bulk data, both
copied in and copied out, should never be accessed by the kernel in a
high performance system.
Most Gigabit or better, and many 100Mb network drivers do
checksum offloading on both send and receive, so there is no need for
the kernel to touch any data which is copied in or out for network
sends or receives.  Further, I can imagine a network server (like a
userspace nfs server or samba) turning around and writing data to disk
which it received via a socket read without ever looking at the
buffer.  I don't know the storage system as well as the networking
system, but unless a disk driver is using PIO, I don't think the data
is ever touched by the kernel.

This is all academic, as I don't know enough about x86_64 asm to
implement any of this.  But I have an ideal testbed if anybody would
be inclined to implement it.

Drew
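
For anyone following the thread, the tunable-threshold idea can be
roughed out in userspace C with SSE2 intrinsics.  This is only a sketch
under stated assumptions: it is not the kernel's copyin/copyout, not
Solaris's copy.s, and not the adapted pagecopy from the test above.  The
names copy_maybe_nt and nt_copy_threshold are hypothetical, and the
256 KB default is an arbitrary placeholder that would need measuring on
real hardware.  Without SSE2 it simply falls back to memcpy().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical tunable: copies at or above this size bypass the cache. */
static size_t nt_copy_threshold = 256 * 1024;

/*
 * Copy len bytes from src to dst (the buffers must not overlap).
 * Small copies go through memcpy(); large ones use non-temporal
 * (streaming) stores when SSE2 is available.
 */
static void *
copy_maybe_nt(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (len < nt_copy_threshold)
		return (memcpy(dst, src, len));

#if defined(__SSE2__)
	/* Streaming stores need a 16-byte aligned destination. */
	size_t head = (size_t)(-(uintptr_t)d & 15);
	if (head > len)
		head = len;
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

	/* Move 16 bytes per iteration; the source may be unaligned. */
	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_stream_si128((__m128i *)d, v);
		d += 16;
		s += 16;
		len -= 16;
	}

	/* Make the streaming stores visible before any later stores. */
	_mm_sfence();
#endif

	/* Remaining tail bytes (or the whole copy without SSE2). */
	memcpy(d, s, len);
	return (dst);
}
```

Timing copy_maybe_nt() against plain memcpy() on buffers larger than the
L2 cache, with nt_copy_threshold varied, would be one way to look for
the crossover point on a given machine.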