From: Matthew Dillon <dillon@apollo.backplane.com>
Date: Sun, 5 Jul 2009 18:14:20 -0700 (PDT)
To: freebsd-arch@freebsd.org
Message-Id: <200907060114.n661EK68065706@apollo.backplane.com>
References: <4A4FAA2D.3020409@FreeBSD.org> <20090705100044.4053e2f9@ernst.jennejohn.org> <4A50667F.7080608@FreeBSD.org> <20090705223126.I42918@delplex.bde.org> <4A50BA9A.9080005@FreeBSD.org> <20090706005851.L1439@besplex.bde.org> <4A50DEE8.6080406@FreeBSD.org> <20090706034250.C2240@besplex.bde.org> <4A50F619.4020101@FreeBSD.org>
Subject: Re: DFLTPHYS vs MAXPHYS

I think MAXPHYS, or the equivalent, is still used somewhat in the clustering code. The number of buffers the clustering code decides to chain together dictates the impact on the actual device. The relevance here has very little to do with cache smashing and more to do with optimizing disk seeks (or network latency). There is no best value for this. It is only marginally more interesting for a network interface because most links still run with absurdly small MTUs (even 9000+ is absurdly small). It is entirely uninteresting for a SATA or other modern disk link.

For linear transfers you only need a value sufficiently large to reduce the impact of command overhead on the cpu and achieve the device's maximum linear transfer rate. For example, doing a dd with bs=512 versus bs=32k. It runs on a curve and there will generally be very little additional bang for the buck beyond 64K for a linear transfer (assuming read-ahead and NCQ to reduce inter-command latency).

For random and semi-random transfers a larger buffer size has two impacts. The first is a negative impact on seek times: a random seek-read of 16K is faster than a random seek-read of 64K, which is faster than a random seek-read of 512K. I did a ton of testing with HAMMER and it just didn't make much sense to go beyond 128K, frankly, but neither does it make sense to use something really tiny like 8K. 32K-128K seems to be the sweet spot. The second is a positive impact on reducing the total number of seeks *IF* you have reasonable cache locality of reference. There is no correct value; it depends heavily on the access pattern. A random access pattern with very little locality of reference will benefit from a smaller block size, while a random access pattern with high locality of reference will benefit from a larger block size. That's all there is to it.
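To make that curve concrete, a minimal userland sketch along these lines (the block sizes, the 64MB working set, and the use of O_DIRECT below are arbitrary choices, and it assumes the target is much larger than the largest block size) reads a file or raw device sequentially and then at random offsets with a range of block sizes, printing MB/s for each:

/*
 * Minimal block-size benchmark sketch.  The constants and the size table
 * are arbitrary; adjust to taste.
 */
#include <sys/types.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define TOTAL_BYTES	(64UL * 1024 * 1024)	/* data moved per test point */

static double
now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (ts.tv_sec + ts.tv_nsec / 1e9);
}

static void
run_test(int fd, off_t devsize, size_t bsize, int random_io)
{
	void *buf;
	off_t off;
	size_t done;
	double t0, t1;

	/* Aligned buffer so O_DIRECT reads behave. */
	if (posix_memalign(&buf, 4096, bsize) != 0)
		errx(1, "posix_memalign");
	t0 = now();
	off = 0;
	for (done = 0; done < TOTAL_BYTES; done += bsize) {
		if (random_io)
			off = (off_t)(arc4random() % (devsize / bsize)) * bsize;
		if (pread(fd, buf, bsize, off) < 0)
			err(1, "pread");
		off += bsize;
	}
	t1 = now();
	printf("%-10s bs=%-7zu %8.1f MB/s\n",
	    random_io ? "random" : "sequential", bsize,
	    TOTAL_BYTES / (t1 - t0) / (1024.0 * 1024.0));
	free(buf);
}

int
main(int argc, char **argv)
{
	static const size_t sizes[] = { 8192, 32768, 65536, 131072, 524288 };
	off_t devsize;
	size_t i;
	int fd;

	if (argc != 2)
		errx(1, "usage: blkbench <device-or-file>");
	/* O_DIRECT so the buffer cache doesn't hide the device behavior. */
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0)
		err(1, "open");
	devsize = lseek(fd, 0, SEEK_END);
	if (devsize <= 0)
		errx(1, "could not determine size of %s", argv[1]);
	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		run_test(fd, devsize, sizes[i], 0);
	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		run_test(fd, devsize, sizes[i], 1);
	close(fd);
	return (0);
}

On a typical drive the sequential numbers flatten out well before the random numbers stop changing with block size, which is the curve being described.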
I have a fairly negative opinion of trying to tune block size to cpu caches. I don't think it matters nearly as much as tuning it to the seek/locality-of-reference performance curve, and I don't feel that contrived linear tests are all that interesting since they don't really reflect real-life workloads. On-drive caching has an impact too, but that's another conversation. Vendors have been known to intentionally degrade drive cache performance on consumer drives versus commercial drives. In testing HAMMER I've often hit limitations that seem contrived by the vendors: a smaller block size would have been enough to get the locality of reference, but I wind up having to use a larger one because the drive cache doesn't behave sanely.

--

The DMA ability of modern devices and device drivers is pretty much moot, as no self-respecting disk controller chipset will be limited to a measly 64K max transfer any more. AHCI certainly has no issue doing in excess of a megabyte. The limit is something like 65535 chained entries for AHCI. I forget what the spec says exactly, but it's basically more than we'd ever really need. Nobody should really care about the performance of a chipset that is limited to a 64K max transfer.

As long as the cluster code knows what the device can do and the filesystem doesn't try to use a larger block size than the device is capable of in a single BIO, the cluster code will make up the difference for any device-based limitations.

					-Matt
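As a rough illustration of that last point, a sketch of how an upper layer can carve one large logical transfer into device-sized pieces; issue_bio(), MAXPHYS_FALLBACK, and the 64K limit are made-up placeholders, not the actual cluster code:

/*
 * Split one large logical transfer into chunks no bigger than what the
 * device can take in a single transaction, roughly what the cluster code
 * does on the filesystem's behalf.  All names here are placeholders.
 */
#include <stdint.h>
#include <stdio.h>

#define MAXPHYS_FALLBACK	(128 * 1024)	/* used if the driver reports no limit */

/* Stub standing in for handing one chunk to the device driver. */
static void
issue_bio(uint64_t offset, void *data, size_t len)
{
	(void)data;
	printf("bio: offset=%ju len=%zu\n", (uintmax_t)offset, len);
}

/* Break a large transfer into pieces the device is willing to accept. */
static void
split_transfer(uint64_t offset, void *data, size_t len, size_t dev_max_io)
{
	char *p = data;
	size_t chunk, max_io;

	max_io = dev_max_io != 0 ? dev_max_io : MAXPHYS_FALLBACK;
	while (len > 0) {
		chunk = len < max_io ? len : max_io;
		issue_bio(offset, p, chunk);
		offset += chunk;
		p += chunk;
		len -= chunk;
	}
}

int
main(void)
{
	static char buf[1024 * 1024];

	/* A 1MB logical transfer against a device limited to 64K per command. */
	split_transfer(0, buf, sizeof(buf), 64 * 1024);
	return (0);
}

Run as written, the example emits sixteen 64K bios for the 1MB buffer; a device with a larger advertised limit would see proportionally fewer, larger chunks.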