Date: Sun, 20 Dec 2009 12:13:46 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
To: Harti Brandt
Cc: Ulrich Spörlein, Hans Petter Selasky, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP

On Sat, 19 Dec 2009, Harti Brandt wrote:

> To be honest, I'm lost now. Couldn't we just use the largest atomic type
> for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
> the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
> thread?
>
> Are the 5-6 atomic operations really that costly given the many
> operations done on an IP packet?
> Are they more costly than a heavyweight sync for each ++ or +=?

Frequent writes to the same cache line across multiple cores are
remarkably expensive, as they trigger the cache coherency protocol
(mileage may vary). For example, a single non-atomically incremented
counter cut gettimeofday() to 1/6th of its normal performance on an
8-core system when parallel system calls were made across all cores. On
many current systems, the cost of an "atomic" operation is now fairly
reasonable as long as the cache line is held exclusively by the current
CPU. However, if we can avoid them that has value, as we update quite a
few global stats on the way through the network stack.

> Or we could use the PCPU stuff, use just ++ and += for modifying the
> statistics (32bit) and do the 32->64 bit stuff for all platforms with a
> kernel thread per CPU (do we have this?). Between that thread and the
> sysctl we could use a heavy sync.

The current short-term plan is to do this, but without a syncer thread:
we'll just aggregate the results when they need to be reported, in the
sysctl path. How best to scale to 64-bit counters is an interesting
question, but one we can address after per-CPU stats are in place, which
address an immediate performance (rather than statistics accuracy)
concern.

> Using 32 bit stats may fail if you put in several 10GBit/s adapters
> into a machine and do routing at link speed, though. This might
> overflow the IP input/output byte counter (which we don't have yet)
> too fast.

For byte counters, assuming one 10gbps stream, a 32-bit counter wraps in
about three seconds. Systems processing 40gbps are now quite realistic,
although typically workloads of that sort will be distributed over 16+
cores and using multiple 10gbps NICs.
My thinking is that we get the switch to per-CPU stats done in 9.x in
the next month sometime, and also get it merged to 8.x a month or so
later (I merged the wrapper macros necessary to do that before 8.0, but
didn't have time to fully evaluate the performance implications of the
implementation switch). There are two known problem areas here:

(1) The cross-product issue with virtual network stacks

(2) The cross-product issue with network interfaces for per-interface
    stats

I propose to ignore (1) for now by simply having only vnet0 use per-CPU
stats, and other vnets use single-instance per-vnet stats. We can solve
the larger problem there at a future date.

I don't have a good proposal for (2) -- the answer may be using DPCPU
memory, but that will require us to support more dynamic allocation
ranges, which may add cost. (Right now, the DPCPU allocator relies on
relatively static allocations over time.) This means that, for now, we
may also ignore that issue and leave interface counters as-is. This is
probably a good idea because we also need to deal with multi-queue
interfaces better, and perhaps the stats should be per-queue rather
than per-ifnet, which may itself help address the cache line issue.
Robert N M Watson
Computer Laboratory
University of Cambridge

Date: Sun, 20 Dec 2009 14:19:34 +0100 (CET)
From: Harti Brandt <Hartmut.Brandt@dlr.de>
To: Robert Watson
Subject: Re: network statistics in SMP

On Sun, 20 Dec 2009, Robert Watson wrote:

RW>On Sat, 19 Dec 2009, Harti Brandt wrote:
RW>
RW>> To be honest, I'm lost now.
Couldn't we just use the largest atomic type
RW>> for the given platform and atomic_inc/atomic_add/atomic_fetch and
RW>> handle the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as
RW>> a kernel thread?
RW>>
RW>> Are the 5-6 atomic operations really that costly given the many
RW>> operations done on an IP packet? Are they more costly than a
RW>> heavyweight sync for each ++ or +=?
RW>
RW>Frequent writes to the same cache line across multiple cores are
RW>remarkably expensive, as they trigger the cache coherency protocol
RW>(mileage may vary). For example, a single non-atomically incremented
RW>counter cut performance of gettimeofday() to 1/6th performance on an
RW>8-core system when parallel system calls were made across all cores.
RW>On many current systems, the cost of an "atomic" operation is now
RW>fairly reasonable as long as the cache line is held exclusively by
RW>the current CPU. However, if we can avoid them that has value, as we
RW>update quite a few global stats on the way through the network stack.

Hmm. I'm not sure that gettimeofday() is comparable to forwarding an IP
packet. I would expect that a single increment is a good percentage of
the entire processing (in terms of number of operations) for
gettimeofday(), while in IP forwarding this is somewhere in the noise
floor. In the simplest case the packet is acted upon by the receiving
driver, the IP input function, the IP output function and the sending
driver -- not talking about IP filters, firewalls, tunnels, dummynet
and what else. The relative cost of the increment should be much less.
But I may be wrong, of course.

RW>> Or we could use the PCPU stuff, use just ++ and += for modifying
RW>> the statistics (32bit) and do the 32->64 bit stuff for all
RW>> platforms with a kernel thread per CPU (do we have this?). Between
RW>> that thread and the sysctl we could use a heavy sync.
RW>
RW>The current short-term plan is to do this but without a syncer
RW>thread: we'll just aggregate the results when they need to be
RW>reported, in the sysctl path. How best to scale to 64-bit counters
RW>is an interesting question, but one we can address after per-CPU
RW>stats are in place, which address an immediate performance (rather
RW>than statistics accuracy) concern.

Well, the user side of our statistics is in a very bad shape and I have
problems handling this in the SNMP daemon. Just a number of examples:

interface statistics:
 - they use u_long, so are either 32-bit or 64-bit depending on the
   platform
 - a number of required statistics are missing
 - send drops are somewhere else and are 'int'
 - statistics are embedded into struct ifnet (bad for ABI stability)
   and not versioned
 - accessed together with other unrelated information via sysctl()

IPv4 statistics:
 - also u_long (hence different size on the platforms)
 - a lot of fields required by SNMP are missing
 - not versioned
 - accessed via sysctl()
 - per-interface statistics totally missing

IPv6 statistics:
 - u_quad_t! so they are subject to race conditions on 32-bit
   platforms and, maybe?, on 64-bit platforms
 - a lot of fields required by SNMP are missing
 - not versioned
 - accessed via sysctl(); per-interface statistics via ioctl()

Ethernet statistics:
 - u_long
 - some fields missing
 - implemented in only 3! drivers; some drivers use the corresponding
   field for something else
 - not versioned

I think TCP and UDP statistics are in equally bad shape.

I would really like to sort that out before any kind of ABI freeze
happens. Ideally all the statistics would be accessible via sysctl(),
have a version number and have all or most of the required statistics,
with a simple way to add new fields without breaking anything. Also the
field sizes (64 vs. 32 bit) should be correct on the kernel - user
interface.
My current feeling after reading this thread is that the low-level
kernel side stuff is probably out of what I could do with the time I
have, and would sidetrack me too far from the work on bsnmp. What I
would like to do is fix the kernel/user interface and let the people
that know how to do it handle the low-level side. I would really not
like to have to deal with a changing user/kernel interface in current
if we go in several steps with the kernel stuff.

RW>> Using 32 bit stats may fail if you put in several 10GBit/s
RW>> adapters into a machine and do routing at link speed, though. This
RW>> might overflow the IP input/output byte counter (which we don't
RW>> have yet) too fast.
RW>
RW>For byte counters, assuming one 10gbps stream, a 32-bit counter
RW>wraps in about three seconds. Systems processing 40gbps are now
RW>quite realistic, although typically workloads of that sort will be
RW>distributed over 16+ cores and using multiple 10gbps NICs.
RW>
RW>My thinking is that we get the switch to per-CPU stats done in 9.x
RW>in the next month sometime, and also get it merged to 8.x a month or
RW>so later (I merged the wrapper macros necessary to do that before
RW>8.0 but didn't have time to fully evaluate the performance
RW>implications of the implementation switch).

I will try to come up with a patch for the kernel/user interface in the
mean time. This will be for 9.x only, obviously.

RW>There are two known problem areas here:
RW>
RW>(1) The cross-product issue with virtual network stacks
RW>(2) The cross-product issue with network interfaces for
RW>    per-interface stats
RW>
RW>I propose to ignore (1) for now by simply having only vnet0 use
RW>per-CPU stats, and other vnets use single-instance per-vnet stats.
RW>We can solve the larger problem there at a future date.

This sounds reasonable if we wrap all the statistics stuff into macros
and/or functions.
RW>I don't have a good proposal for (2) -- the answer may be using
RW>DPCPU memory, but that will require us to support more dynamic
RW>allocation ranges, which may add cost. (Right now, the DPCPU
RW>allocator relies on relatively static allocations over time.) This
RW>means that, for now, we may also ignore that issue and leave
RW>interface counters as-is. This is probably a good idea because we
RW>also need to deal with multi-queue interfaces better, and perhaps
RW>the stats should be per-queue rather than per-ifnet, which may
RW>itself help address the cache line issue.

Doesn't this help for output only? For the input statistics there will
still be per-ifnet statistics.

An interesting question from the SNMP point of view is what happens to
the statistics if you move around interfaces between vimages. In any
case it would be good if we could abstract from all the complications
while going kernel->userland.

harti

Date: Sun, 20 Dec 2009 13:46:54 +0000
From: "Robert N. M.
Watson" <rwatson@FreeBSD.org>
To: Harti Brandt
Subject: Re: network statistics in SMP

On 20 Dec 2009, at 13:19, Harti Brandt wrote:

> RW>Frequent writes to the same cache line across multiple cores are
> RW>remarkably expensive, as they trigger the cache coherency protocol
> RW>(mileage may vary). For example, a single non-atomically
> RW>incremented counter cut performance of gettimeofday() to 1/6th
> RW>performance on an 8-core system when parallel system calls were
> RW>made across all cores. On many current systems, the cost of an
> RW>"atomic" operation is now fairly reasonable as long as the cache
> RW>line is held exclusively by the current CPU. However, if we can
> RW>avoid them that has value, as we update quite a few global stats on
> RW>the way through the network stack.
>
> Hmm. I'm not sure that gettimeofday() is comparable to forwarding an
> IP packet. I would expect that a single increment is a good percentage
> of the entire processing (in terms of number of operations) for
> gettimeofday(), while in IP forwarding this is somewhere in the noise
> floor. In the simplest case the packet is acted upon by the receiving
> driver, the IP input function, the IP output function and the sending
> driver. Not talking about IP filters, firewalls, tunnels, dummynet and
> what else. The relative cost of the increment should be much less.
> But, I may be wrong of course.

If processing is occurring on multiple CPUs -- for example, you are
receiving UDP from two ithreads -- then 4-8 cache lines being contended
due to stats is a lot. Our goal should be (for 9.0) to avoid having any
contended cache lines in the common case when processing independent
streams on different CPUs.

> I would really like to sort that out before any kind of ABI freeze
> happens. Ideally all the statistics would be accessible per sysctl(),
> have a version number and have all or most of the required statistics
> with a simple way to add new fields without breaking anything. Also
> the field sizes (64 vs. 32 bit) should be correct on the kernel - user
> interface.
>
> My current feeling after reading this thread is that the low-level
> kernel side stuff is probably out of what I could do with the time I
> have and would sidetrack me too far from the work on bsnmp. What I
> would like to do is to fix the kernel/user interface and let the
> people that know how to do it handle the low-level side.
>
> I would really not like to have to deal with a changing user/kernel
> interface in current if we go in several steps with the kernel stuff.

I think we should treat the statistics gathering and statistics
reporting interfaces as entirely separable problems. Statistics are
updated far more frequently than they are queried, so making the query
process a bit more expensive (reformatting from an efficient 'update'
format to an application-friendly 'report' format) should be fine.
One question to think about is whether or not simply cross-CPU
summaries are sufficient, or whether we actually also want to be able
to directly monitor per-CPU statistics at the IP layer. The former
would maintain the status quo, making per-CPU behavior purely part of
the 'update' step; the latter would change the 'report' format as well.
I've been focused primarily on 'update', but at least for my work it
would be quite helpful to have per-CPU stats in the 'report' format as
well.

> I will try to come up with a patch for the kernel/user interface in
> the mean time. This will be for 9.x only, obviously.

Sounds good -- and the kernel stats capture can "grow into" the full
report format as it matures.

> Doesn't this help for output only? For the input statistics there
> still will be per-ifnet statistics.

Most ifnet-layer stats should really be per-queue, both for input and
output, which may help.

> An interesting question from the SNMP point of view is, what happens
> to the statistics if you move around interfaces between vimages. In
> any case it would be good if we could abstract from all the
> complications while going kernel->userland.

At least for now, the interface is effectively recreated when it moves
vimage, and only the current vimage is able to monitor it. That could
be considered a bug, but it might also be a simplifying assumption or
even a feature. Likewise, it's worth remembering that the ifnet index
space is per-vimage.
Robert

Date: Sun, 20 Dec 2009 15:18:11 +0100 (CET)
From: Harti Brandt <Hartmut.Brandt@dlr.de>
To: "Robert N. M. Watson" <rwatson@FreeBSD.org>
Subject: Re: network statistics in SMP

On Sun, 20 Dec 2009, Robert N. M.
Watson wrote:

RNMW>On 20 Dec 2009, at 13:19, Harti Brandt wrote:
RNMW>
RNMW>> RW>Frequent writes to the same cache line across multiple cores
RNMW>> RW>are remarkably expensive, as they trigger the cache coherency
RNMW>> RW>protocol (mileage may vary). For example, a single
RNMW>> RW>non-atomically incremented counter cut performance of
RNMW>> RW>gettimeofday() to 1/6th performance on an 8-core system when
RNMW>> RW>parallel system calls were made across all cores. On many
RNMW>> RW>current systems, the cost of an "atomic" operation is now
RNMW>> RW>fairly reasonable as long as the cache line is held
RNMW>> RW>exclusively by the current CPU. However, if we can avoid them
RNMW>> RW>that has value, as we update quite a few global stats on the
RNMW>> RW>way through the network stack.
RNMW>>
RNMW>> Hmm. I'm not sure that gettimeofday() is comparable to forwarding
RNMW>> an IP packet. I would expect that a single increment is a good
RNMW>> percentage of the entire processing (in terms of number of
RNMW>> operations) for gettimeofday(), while in IP forwarding this is
RNMW>> somewhere in the noise floor. In the simplest case the packet is
RNMW>> acted upon by the receiving driver, the IP input function, the IP
RNMW>> output function and the sending driver. Not talking about IP
RNMW>> filters, firewalls, tunnels, dummynet and what else. The relative
RNMW>> cost of the increment should be much less. But, I may be wrong of
RNMW>> course.
RNMW>
RNMW>If processing is occurring on multiple CPUs -- for example, you are
RNMW>receiving UDP from two ithreads -- then 4-8 cache lines being
RNMW>contended due to stats is a lot. Our goal should be (for 9.0) to
RNMW>avoid having any contended cache lines in the common case when
RNMW>processing independent streams on different CPUs.
RNMW>
RNMW>> I would really like to sort that out before any kind of ABI
RNMW>> freeze happens.
RNMW>> Ideally all the statistics would be accessible per sysctl(), have
RNMW>> a version number and have all or most of the required statistics
RNMW>> with a simple way to add new fields without breaking anything.
RNMW>> Also the field sizes (64 vs. 32 bit) should be correct on the
RNMW>> kernel - user interface.
RNMW>>
RNMW>> My current feeling after reading this thread is that the
RNMW>> low-level kernel side stuff is probably out of what I could do
RNMW>> with the time I have and would sidetrack me too far from the work
RNMW>> on bsnmp. What I would like to do is to fix the kernel/user
RNMW>> interface and let the people that know how to do it handle the
RNMW>> low-level side.
RNMW>>
RNMW>> I would really not like to have to deal with a changing
RNMW>> user/kernel interface in current if we go in several steps with
RNMW>> the kernel stuff.
RNMW>
RNMW>I think we should treat the statistics gathering and statistics
RNMW>reporting interfaces as entirely separable problems. Statistics are
RNMW>updated far more frequently than they are queried, so making the
RNMW>query process a bit more expensive (reformatting from an efficient
RNMW>'update' format to an application-friendly 'report' format) should
RNMW>be fine.
RNMW>
RNMW>One question to think about is whether or not simply cross-CPU
RNMW>summaries are sufficient, or whether we actually also want to be
RNMW>able to directly monitor per-CPU statistics at the IP layer. The
RNMW>former would maintain the status quo making per-CPU behavior purely
RNMW>part of the 'update' step; the latter would change the 'report'
RNMW>format as well. I've been focused primarily on 'update', but at
RNMW>least for my work it would be quite helpful to have per-CPU stats
RNMW>in the 'report' format as well.

No problem. I can even add that in a private SNMP MIB if it seems
useful.

RNMW>
RNMW>> I will try to come up with a patch for the kernel/user interface
RNMW>> in the mean time. This will be for 9.x only, obviously.
RNMW>
RNMW>Sounds good -- and the kernel stats capture can "grow into" the
RNMW>full report format as it matures.
RNMW>
RNMW>> Doesn't this help for output only? For the input statistics there
RNMW>> still will be per-ifnet statistics.
RNMW>
RNMW>Most ifnet-layer stats should really be per-queue, both for input
RNMW>and output, which may help.

As far as I can see, currently the driver just calls if_input, which is
the interface-dependent input function. There seems to be no
driver-independent abstraction of input queues. (The hatm driver I
wrote several years ago has two input queues in hardware corresponding
to 4 (or 8?) interrupt queues, but somewhere in the driver you put all
of this through the single if_input hook.) Or is there something I'm
missing?

RNMW>> An interesting question from the SNMP point of view is, what
RNMW>> happens to the statistics if you move around interfaces between
RNMW>> vimages. In any case it would be good if we could abstract from
RNMW>> all the complications while going kernel->userland.
RNMW>
RNMW>At least for now, the interface is effectively recreated when it
RNMW>moves vimage, and only the current vimage is able to monitor it.
RNMW>That could be considered a bug but it might also be a simplifying
RNMW>assumption or even a feature. Likewise, it's worth remembering that
RNMW>the ifnet index space is per-vimage.

I was already thinking about how to fit the vimage stuff into the SNMP
model. The simplest way is to run one SNMP daemon per vimage. Next comes
having one daemon that has one context per vimage. Bsnmpd does its own
mapping of system ifnet indexes to SNMP interface indexes, because the
allocation of system ifnet indexes does not fit the RFC requirements.
This means it will detect when an interface is moved away from a vimage
and comes back later. If the kernel statistics are stable over these
movements, there is no need to declare a counter discontinuity via SNMP.
On the other hand, these operations are probably seldom enough ...

harti

Date: Sun, 20 Dec 2009 14:35:26 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
To: Harti Brandt
Subject: Re: network statistics in SMP

On Sun, 20 Dec 2009, Harti Brandt wrote:

> RNMW>> I will try to come up with a patch for the kernel/user
> RNMW>> interface in the mean time. This will be for 9.x only,
> RNMW>> obviously.
> RNMW>
> RNMW>Sounds good -- and the kernel stats capture can "grow into" the
> RNMW>full report format as it matures.
> RNMW>
> RNMW>> Doesn't this help for output only? For the input statistics
> RNMW>> there still will be per-ifnet statistics.
> RNMW>
> RNMW>Most ifnet-layer stats should really be per-queue, both for
> RNMW>input and output, which may help.
>
> As far as I can see currently the driver just calls if_input which is
> the interface-dependent input function. There seems to be no
> driver-independent abstraction of input queues. (The hatm driver I
> wrote several years ago has two input queues in hardware corresponding
> to 4 (or 8?) interrupt queues, but somewhere in the driver you put all
> of this through the single if_input hook). Or is there something I'm
> missing?

You're not missing anything; it's the code for what I describe that's
missing :-). Adding an ifnet-layer abstraction for input and output
queues (if only to hold stats in a cross-driver way) is something
that's come up at the last devsummit or two, and something we need to
sort out for 9.0.

> I was already thinking about how to fit the vimage stuff into the SNMP
> model. The simplest way is to run one SNMP daemon per vimage. Next
> comes having one daemon that has one context per vimage. Bsnmpd does
> its own mapping of system ifnet indexes to SNMP interface indexes,
> because the allocation of system ifnet indexes does not fit the RFC
> requirements. This means it will detect when an interface is moved
> away from a vimage and comes back later. If the kernel statistics are
> stable over these movements, there is no need to declare a counter
> discontinuity via SNMP. On the other hand these operations are
> probably seldom enough ...

For a system with thousands of virtual network stacks, it would be nice
to avoid requiring one process per vimage, but given the way we
currently link processes to vimages, arranging that is currently
awkward.
We're also well past the point where a 16-bit integer can describe what
is required out of our interface system; perhaps we just bite the
bullet and roll to 32-bit ifnet indexes, but also give each interface a
uuid (or the like) that is stable over vimage moves.

Robert N M Watson
Computer Laboratory
University of Cambridge

Date: Mon, 21 Dec 2009 07:17:56 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Harti Brandt
Subject: Re: network statistics in SMP
List-Help: List-Subscribe: , X-List-Received-Date: Sun, 20 Dec 2009 20:18:01 -0000

On Sat, 19 Dec 2009, Harti Brandt wrote:

> On Sun, 20 Dec 2009, Bruce Evans wrote:
> BE>...
> BE>I don't see why using atomic or locks for just the 64 bit counters is good.
> BE>We will probably end up with too many 64-bit counters, especially if they
> BE>don't cost much when not read.
>
> On a 32-bit arch when reading a 32-bit value on one CPU while the other CPU is
> modifying it, the read will probably always be correct, given the variable is
> correctly aligned.

We assume this. Otherwise the per-CPU counter optimization wouldn't work so well.

> On a 64-bit arch when reading a 64-bit value on one CPU
> while the other one is adding to it, do I always get the correct value? I'm
> not sure about this, which is why I put atomic_*() there, assuming that they
> will make this correct.

You have to use the PCPU_INC()/PCPU_GET() interface and not worry about this. It must supply any necessary atomicness on an arch-specific basis.

> The idea is (for 32-bit platforms):
>
> struct pcpu_stats {
>     uint32_t in_bytes;
>     uint32_t in_packets;
> };
>
> struct pcpu_hc_stats {
>     uint64_t hc_in_bytes;
>     uint64_t hc_in_packets;
> };
>
> /* driver; IP stack; ... */
> ...
> pcpu_stats->in_bytes += bytes;
> pcpu_stats->in_packets++;
> ...
>
> /* per CPU kernel thread for 32-bit arch */
> lock(pcpu_hc_stats);
> ...
> val = pcpu_stats->in_bytes;
> if ((uint32_t)pcpu_hc_stats->hc_in_bytes > val)
>     pcpu_hc_stats->hc_in_bytes += 0x100000000;
> pcpu_hc_stats->hc_in_bytes = (pcpu_hc_stats->hc_in_bytes &
>     0xffffffff00000000ULL) | val;

Why not just `pcpu_hc_stats->hc_in_bytes |= val' for the second statement?

> ...
> unlock(pcpu_hc_stats);

> BE>> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
> BE>> machine and do routing at link speed, though. This might overflow the IP
> BE>> input/output byte counter (which we don't have yet) too fast.
> BE>
> BE>Not with a mere 10GB/S.
That's ~1GB/S so it takes 4 seconds to overflow > BE>a 32-bit byte counter. A bit counter would take a while to overflow too. > BE>Are there any faster incrementors? TSCs also take O(1) seconds to overflow, > BE>and timecounter logic depends on no timecounter overflowing much faster > BE>than that. > > If you have 4 10GBit/s adapters each operating full-duplex at link speed you > wrap in under 0.5 seconds, maybe even faster if you have some kind of tunnels > where each packet counts several times. But I suppose this will be not so easy > with IA32 to implement :-) I was only thinking of per-interface counters, but we can handle the 64-bit fixup for thousands of these if we can handle thousands of fixup threads each running thousands of times per second. Actually, thousands of adaptors would need thousands of CPUS to handle and each per-CPU counter would be limited by what 1 CPU could handle so we're back nearer to the original 4 seconds than to 4/4000. Bruce From owner-freebsd-arch@FreeBSD.ORG Mon Dec 21 11:06:51 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3753910656A4 for ; Mon, 21 Dec 2009 11:06:51 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 0C7BD8FC1A for ; Mon, 21 Dec 2009 11:06:51 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.3/8.14.3) with ESMTP id nBLB6oWA004017 for ; Mon, 21 Dec 2009 11:06:50 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.3/8.14.3/Submit) id nBLB6olq004015 for freebsd-arch@FreeBSD.org; Mon, 21 Dec 2009 11:06:50 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 21 Dec 2009 11:06:50 GMT Message-Id: <200912211106.nBLB6olq004015@freefall.freebsd.org> 
X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-arch@FreeBSD.org Cc: Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 21 Dec 2009 11:06:51 -0000 Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total. From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 10:58:50 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8F1251065696; Fri, 25 Dec 2009 10:58:50 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-fx0-f227.google.com (mail-fx0-f227.google.com [209.85.220.227]) by mx1.freebsd.org (Postfix) with ESMTP id EE61D8FC19; Fri, 25 Dec 2009 10:58:49 +0000 (UTC) Received: by fxm27 with SMTP id 27so8778359fxm.3 for ; Fri, 25 Dec 2009 02:58:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:sender:message-id:date:from :user-agent:mime-version:to:subject:x-enigmail-version:content-type :content-transfer-encoding; bh=dOGAq0fldWTFVfl3MvbKRfDkuvsbY7wkE6GaZL8ERHg=; b=dwcbHs0bSEmbDz+AsLd/7eq7p7R52mMaMOb6EBkh3RtYthc+vC5HX6XUk1rkv5RhpR f6AioFXme82Mr/Pgy6tMkSL91YTtb+xagYB13AdzswxQZjxdzH8ZRZAkNmRnozEoY2vA 
G7vt+wSXnSJNJ03sX9mhshUXEj2WPGLYU9CnA= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=sender:message-id:date:from:user-agent:mime-version:to:subject :x-enigmail-version:content-type:content-transfer-encoding; b=omM1nlgF+hviwCCxtgDgzXupprgNaE+hw+AGdFQC52Ezh82cmYYgnU+UNNvaDR2laB cqK+pYjJbm1YLpWq0i5Mm9HcSnIRwpgMlMS45ByFAVTMprgs/xx+O2UOg28QxXFu7fQm 2rvEfJMFyYleL0umo6Udyw4nFJAIx69SQdhoE= Received: by 10.223.95.72 with SMTP id c8mr6014885fan.73.1261738728801; Fri, 25 Dec 2009 02:58:48 -0800 (PST) Received: from mavbook.mavhome.dp.ua (pc.mavhome.dp.ua [212.86.226.226]) by mx.google.com with ESMTPS id 16sm3145902fxm.8.2009.12.25.02.58.47 (version=SSLv3 cipher=RC4-MD5); Fri, 25 Dec 2009 02:58:48 -0800 (PST) Sender: Alexander Motin Message-ID: <4B349ABF.2070800@FreeBSD.org> Date: Fri, 25 Dec 2009 12:58:07 +0200 From: Alexander Motin User-Agent: Thunderbird 2.0.0.23 (X11/20090901) MIME-Version: 1.0 To: freebsd-arch@freebsd.org, FreeBSD-Current X-Enigmail-Version: 0.96.0 Content-Type: text/plain; charset=KOI8-R Content-Transfer-Encoding: 7bit Cc: Subject: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 10:58:50 -0000

Hi.

Recently WD released the first series of ATA disks with an increased physical sector size. It makes writes that do not match 4K blocks inefficient there. So I propose to get back to the question of optimal FS block alignment. This topic is also important for most RAIDs of a striped nature, such as RAID0/3/5/..., and for flash drives with a simple controller (such as MMC/SD cards). As I have none of those WD disks yet, I have made a series of tests with RAID0, made by geom_stripe, to check the general idea.
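The 4K-physical-sector problem described above reduces to a simple alignment check; a small sketch (illustrative helpers, not a FreeBSD API, assuming 512-byte logical sectors on a drive with 4 KB physical sectors):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Does a partition starting at a given 512-byte LBA line up with 4 KB
 * physical sectors?  The classic fdisk default of starting a slice at
 * sector 63 fails this; rounding up to a multiple of 8 sectors fixes it.
 */
static int
is_4k_aligned(uint64_t lba)
{
	return (lba * 512) % 4096 == 0;	/* equivalently: lba % 8 == 0 */
}

static uint64_t
round_up_4k(uint64_t lba)
{
	return (lba + 7) & ~(uint64_t)7;	/* next multiple of 8 sectors */
}
```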
I've tested the most descriptive case: 2-disk RAID0 with 16K stripe, 16K FS block and many 16K random I/Os (reads in this test, to avoid FS locking). I had the same load pattern, but with writes, on my busy disk-bound MySQL servers, so it is quite realistic.

Test one, default partitioning.

%gstripe label -s 16384 data /dev/ada1 /dev/ada2
%fdisk -I /dev/stripe/data
%disklabel -w /dev/stripe/datas1
%disklabel /dev/stripe/datas1
# /dev/stripe/datas1:
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a: 1250274611       16    unused        0     0
  c: 1250274627        0    unused        0     0  # "raw" part, don't edit
%diskinfo -v /dev/stripe/datas1a
/dev/stripe/datas1a
        512             # sectorsize
        640140600832    # mediasize in bytes (596G)
        1250274611      # mediasize in sectors
        16384           # stripesize
        7680            # stripeoffset
        77825           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.

As you can see, fdisk aligned the partition to the "track length" of 63 sectors, and disklabel added an offset of 16 sectors. As a result, the file system will start at a quite odd place in the RAID stripe. I created a UFS file system, pre-wrote a 4GB file and ran tests (raidtest was patched to generate only 16K requests):

%raidtest test -d /mnt/qqq -n 1
Requests per second: 112
%raidtest test -d /mnt/qqq -n 64
Requests per second: 314

Before each test the FS was unmounted to flush caches.

Test two, FS manually aligned with disklabel.

%disklabel /dev/stripe/datas1
# /dev/stripe/datas1:
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a: 1250274578       33    unused        0     0
  c: 1250274627        0    unused        0     0  # "raw" part, don't edit
%diskinfo -v /dev/stripe/datas1a
/dev/stripe/datas1a
        512             # sectorsize
        640140583936    # mediasize in bytes (596G)
        1250274578      # mediasize in sectors
        16384           # stripesize
        0               # stripeoffset
        77825           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.

File system aligned with stripe.
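The stripeoffset values diskinfo reports for the two layouts follow from simple arithmetic; a sketch (hypothetical helper, assuming 512-byte sectors and the 16K stripe used here, with the partition starting at slice start plus label offset, both in sectors):

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of the partition start within its RAID stripe. */
static uint64_t
stripe_offset(uint64_t slice_start, uint64_t label_off, uint64_t stripesize)
{
	return ((slice_start + label_off) * 512) % stripesize;
}
```

With the fdisk slice at sector 63, a label offset of 16 lands 7680 bytes into a stripe, while an offset of 33 lands exactly on a stripe boundary, matching the two diskinfo outputs.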
%raidtest test -d /mnt/qqq -n 1
Requests per second: 133
%raidtest test -d /mnt/qqq -n 64
Requests per second: 594

The difference is quite significant. Unaligned RAID0 access causes two disks to be involved in its handling, while an aligned one leaves one of the disks free for another request, doubling performance. As we now have a mechanism for reporting stripe size and offset for any partition to user level, it should be easy to make disk partitioning and file system creation tools use it automatically. Stripe size/offset reporting is now supported by the ada and mmcsd disk drivers and most GEOM modules. It would be nice to fetch that info from hardware RAIDs also, where possible. -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 11:27:43 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0EA2A10656A4; Fri, 25 Dec 2009 11:27:43 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-fx0-f227.google.com (mail-fx0-f227.google.com [209.85.220.227]) by mx1.freebsd.org (Postfix) with ESMTP id 6AA618FC1B; Fri, 25 Dec 2009 11:27:42 +0000 (UTC) Received: by fxm27 with SMTP id 27so8788129fxm.3 for ; Fri, 25 Dec 2009 03:27:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:sender:message-id:date:from :user-agent:mime-version:to:cc:subject:references:in-reply-to :x-enigmail-version:content-type:content-transfer-encoding; bh=wkz+a8gEXMEDZkF74xrcCKTqK2ep/dI9sOSVAH0mAs8=; b=JWU99z8Z6Q+WNtUvChyGlbjrkNoEHbLPbgiPjKFUGuMOllANhI7SoKq18kbX6gG6pz b8dXdzPU0PZtkZf4r3DAEJAR0hYwdXeuNLsmYhELaAPV2/K9OgdpwTGMiLhBFZ7gjXU8 i66kRYzD2MA/AeDHQjGwIL/Ri30mL+IErhGm0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:x-enigmail-version:content-type :content-transfer-encoding;
b=Rd28odWfAhYh6BHS5lFK5sW841UQcOzXPCu6a255Vsvq9ECc7dZsOvJUKNGsd+HgEL M9X5NEuCs+kCkyOERr9GwppzMvY8+LhjwfXelzis/eJt+FHSfexxnJ90WrM2t5ByrCvL Axujpk5+JZPjLOBs7HwTAm2+ISeq/ewsxJn2w= Received: by 10.223.14.150 with SMTP id g22mr10842718faa.14.1261740461308; Fri, 25 Dec 2009 03:27:41 -0800 (PST) Received: from mavbook.mavhome.dp.ua (pc.mavhome.dp.ua [212.86.226.226]) by mx.google.com with ESMTPS id 14sm3159895fxm.7.2009.12.25.03.27.39 (version=SSLv3 cipher=RC4-MD5); Fri, 25 Dec 2009 03:27:40 -0800 (PST) Sender: Alexander Motin Message-ID: <4B34A183.7000909@FreeBSD.org> Date: Fri, 25 Dec 2009 13:26:59 +0200 From: Alexander Motin User-Agent: Thunderbird 2.0.0.23 (X11/20090901) MIME-Version: 1.0 To: Thomas Backman References: <4B349ABF.2070800@FreeBSD.org> <469FFFC8-514B-41B9-AEEC-E4B7AB6CB886@exscape.org> In-Reply-To: <469FFFC8-514B-41B9-AEEC-E4B7AB6CB886@exscape.org> X-Enigmail-Version: 0.96.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: FreeBSD-Current , freebsd-arch@freebsd.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 11:27:43 -0000 Thomas Backman wrote: > On Dec 25, 2009, at 11:58 AM, Alexander Motin wrote: >> Recently WD released first series of ATA disks with increased physical >> sector size. It makes writes not matching with 4K blocks inefficient >> there. > They don't expose this to the OS, though (not by default, anyway), but chop it up into 8 512-byte sectors for compatibility reasons. > Just thought I'd point that out - I'm not even sure if you can get them to *not* do the compatibility thing and expose 4k-sized sectors. The latest ATA-8 specification allows a drive to report both logical (512B) and physical (4KB) sector sizes. The ada driver is able to fetch and report that info to GEOM.
If these drives are not reporting it yet (are you really sure?), it is only a question of their firmware. -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 11:38:09 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A8EAE106566B for ; Fri, 25 Dec 2009 11:38:09 +0000 (UTC) (envelope-from serenity@exscape.org) Received: from ch-smtp01.sth.basefarm.net (ch-smtp01.sth.basefarm.net [80.76.149.212]) by mx1.freebsd.org (Postfix) with ESMTP id 653748FC0A for ; Fri, 25 Dec 2009 11:38:09 +0000 (UTC) Received: from c83-253-248-99.bredband.comhem.se ([83.253.248.99]:58571 helo=mx.exscape.org) by ch-smtp01.sth.basefarm.net with esmtp (Exim 4.68) (envelope-from ) id 1NO8FN-0007Ak-46; Fri, 25 Dec 2009 12:22:18 +0100 Received: from [192.168.1.5] (macbookpro [192.168.1.5]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by mx.exscape.org (Postfix) with ESMTPSA id C0FFE1F57F8; Fri, 25 Dec 2009 12:22:14 +0100 (CET) Mime-Version: 1.0 (Apple Message framework v1077) Content-Type: text/plain; charset=us-ascii From: Thomas Backman In-Reply-To: <4B349ABF.2070800@FreeBSD.org> Date: Fri, 25 Dec 2009 12:22:08 +0100 Content-Transfer-Encoding: quoted-printable Message-Id: <469FFFC8-514B-41B9-AEEC-E4B7AB6CB886@exscape.org> References: <4B349ABF.2070800@FreeBSD.org> To: Alexander Motin X-Mailer: Apple Mail (2.1077) X-Originating-IP: 83.253.248.99 X-Scan-Result: No virus found in message 1NO8FN-0007Ak-46.
X-Scan-Signature: ch-smtp01.sth.basefarm.net 1NO8FN-0007Ak-46 71b17cad7b06e00831995badaaabf3d8 Cc: FreeBSD-Current , freebsd-arch@freebsd.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 11:38:09 -0000 On Dec 25, 2009, at 11:58 AM, Alexander Motin wrote: > Hi. > > Recently WD released first series of ATA disks with increased physical > sector size. It makes writes not matching with 4K blocks inefficient > there. They don't expose this to the OS, though (not by default, anyway), but chop it up into 8 512-byte sectors for compatibility reasons. Just thought I'd point that out - I'm not even sure if you can get them to *not* do the compatibility thing and expose 4k-sized sectors. I'm sure your work is important for other setups, though, as proved. :) Regards, Thomas From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 11:39:26 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2054D106566B; Fri, 25 Dec 2009 11:39:26 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id D8DA08FC16; Fri, 25 Dec 2009 11:39:25 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 9983C7E9A3; Fri, 25 Dec 2009 11:39:24 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.3/8.14.3) with ESMTP id nBPBe2S0027358; Fri, 25 Dec 2009 11:40:02 GMT (envelope-from phk@critter.freebsd.dk) To: Alexander Motin From: "Poul-Henning Kamp" In-Reply-To: Your message of "Fri, 25 Dec 2009 12:58:07 +0200."
<4B349ABF.2070800@FreeBSD.org> Date: Fri, 25 Dec 2009 11:40:02 +0000 Message-ID: <27357.1261741202@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: FreeBSD-Current , freebsd-arch@FreeBSD.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 11:39:26 -0000 In message <4B349ABF.2070800@FreeBSD.org>, Alexander Motin writes: >The difference is quite significant. Unaligned RAID0 access causes two >disks involved in its handling, while aligned one leaves one of disks >free for another request, doubling performance. You will find RAID5 writes to be an even better test: Optimal filesystem block-size is a RAID5 stripe width, and if you do not get the offset right you instantly lose at least 50% of your write bandwidth. My practical experience says often more like 75% is lost. >As we have now mechanism for reporting stripe size and offset for any >partition to user-level, it should be easy to make disk partitioning and >file system creation tools to use it automatically. For MBRs there are compat requirement worries: slices must be track-aligned for strict compat with (old?) funky BIOSes. BSDlabel has no such fine details, so that is probably the best place to align to stripe offsets. Be aware that stripe-widths may be ridiculously large: you should not use them as blocksizes, just make sure that blocksizes divide cleanly into them. >Stripe size/offset reporting now supported by ada and mmcsd disk drivers >and most of GEOM modules. It would be nice to fetch that info from >hardware RAIDs also, where possible. Indeed. Good work, keep at it!
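phk's two rules above (align the label to the stripe, and make sure the FS block size divides cleanly into the stripe width) can be sketched as small arithmetic helpers. This is an illustrative sketch, not any partitioning tool's actual API:

```c
#include <assert.h>
#include <stdint.h>

/* Round a partition start sector up so it lands on a stripe boundary. */
static uint64_t
align_to_stripe(uint64_t start_lba, uint64_t stripesize, uint64_t sectorsize)
{
	uint64_t spb = stripesize / sectorsize;	/* sectors per stripe */

	return (start_lba + spb - 1) / spb * spb;
}

/* A block size is usable only if it divides the stripe width cleanly. */
static int
blocksize_ok(uint64_t stripesize, uint64_t fs_bsize)
{
	return fs_bsize != 0 && stripesize % fs_bsize == 0;
}
```

With the 16K stripe and 512-byte sectors from this thread, the classic start sector 63 rounds up to 64, and a 16K (or smaller power-of-two) FS block passes the divisibility check while a 32K block does not.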
Poul-Henning -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 11:44:44 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E40C31065670; Fri, 25 Dec 2009 11:44:44 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id A80688FC19; Fri, 25 Dec 2009 11:44:44 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 9945F7E995; Fri, 25 Dec 2009 11:44:43 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.3/8.14.3) with ESMTP id nBPBjLS0027417; Fri, 25 Dec 2009 11:45:21 GMT (envelope-from phk@critter.freebsd.dk) To: Thomas Backman From: "Poul-Henning Kamp" In-Reply-To: Your message of "Fri, 25 Dec 2009 12:22:08 +0100." <469FFFC8-514B-41B9-AEEC-E4B7AB6CB886@exscape.org> Date: Fri, 25 Dec 2009 11:45:20 +0000 Message-ID: <27416.1261741520@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Alexander Motin , FreeBSD-Current , freebsd-arch@freebsd.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 11:44:45 -0000 In message <469FFFC8-514B-41B9-AEEC-E4B7AB6CB886@exscape.org>, Thomas Backman writes: >On Dec 25, 2009, at 11:58 AM, Alexander Motin wrote: >They don't expose this to the OS, though (not by default, anyway), but >chop it up into 8 512-byte sectors for compatibility reasons.
>Just thought I'd point that out - I'm not even sure if you can get them >to *not* do the compatibility thing and expose 4k-sized sectors. While that is true, it is worth noting that the same Windows-compat idiocy is what doomed the world to RAID5 instead of RAID3. The recent article in Queue Magazine shows how deeply ingrained the 512-byte mindset has become: the author goes to great lengths to praise RAID6 and higher for their ability to have multiple-bit ECC without ever recognizing (author not knowing?) that RAID3 has had this ability from day one. UFS runs incredibly well on 4k blocks, and we should exploit that to the fullest extent, and if we really want to jerk chains, we should push RAID3 in 4+2 and 8+3 configs aggressively; it performs great, both under read and write, and Windows cannot do it. Poul-Henning PS: Merry X-mas everybody! -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 14:03:08 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B439010656C1; Fri, 25 Dec 2009 14:03:08 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-fx0-f227.google.com (mail-fx0-f227.google.com [209.85.220.227]) by mx1.freebsd.org (Postfix) with ESMTP id 1E7898FC28; Fri, 25 Dec 2009 14:03:07 +0000 (UTC) Received: by fxm27 with SMTP id 27so8840276fxm.3 for ; Fri, 25 Dec 2009 06:03:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:sender:message-id:date:from :user-agent:mime-version:to:cc:subject:references:in-reply-to :x-enigmail-version:content-type:content-transfer-encoding; bh=R2BJlZNHt3AdKWRU5CjVdwHNBknRyMYX3YkeFXpqZFg=; b=PVkmPU+1PctRe78u4jal5YkdiqjvgBzbQqt7gNRmhqrP1vJvWEzs8cjFhyuRZnvIpE jzVXYfdZo44LiCwmkKetWPICSZUw9U7/ZJ0Jd2UgW+chClU+6LtFS624z4Wv+l+NOU1v pmTaohyXCHS60CEvbEWgMY6rZvhdYX31rlDnU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:x-enigmail-version:content-type :content-transfer-encoding; b=EgdinLOY749Ydsq4tWqSZI7SixhvO/l5hOp9pC/wD8c6xdE1sg9fv5Oq2CwkHm8E34 9y/GAj1W0lAuSHVisZY6uDPfitqieBido6kokHAQi3z3utFLC9g3NorozDR8E4aHUIyR A0Gm5Zw4iEhmvaztvw36Ease7ivY2rmDpZOxM= Received: by 10.223.62.11 with SMTP id v11mr4553173fah.60.1261749787066; Fri, 25 Dec 2009 06:03:07 -0800 (PST) Received: from mavbook.mavhome.dp.ua (pc.mavhome.dp.ua [212.86.226.226]) by mx.google.com with ESMTPS id 13sm3185428fxm.5.2009.12.25.06.03.05 (version=SSLv3 cipher=RC4-MD5); Fri, 25 Dec 2009 06:03:06 -0800 (PST) Sender: Alexander Motin Message-ID: <4B34C619.7070505@FreeBSD.org> Date: Fri, 25 Dec 2009 16:03:05 +0200 From: Alexander Motin User-Agent: Thunderbird 2.0.0.23 (X11/20091212) MIME-Version: 1.0 To: 
Poul-Henning Kamp References: <27357.1261741202@critter.freebsd.dk> In-Reply-To: <27357.1261741202@critter.freebsd.dk> X-Enigmail-Version: 0.96.0 Content-Type: text/plain; charset=KOI8-R Content-Transfer-Encoding: 7bit Cc: FreeBSD-Current , freebsd-arch@FreeBSD.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 14:03:08 -0000

Poul-Henning Kamp wrote:
> In message <4B349ABF.2070800@FreeBSD.org>, Alexander Motin writes:
>> The difference is quite significant. Unaligned RAID0 access causes two
>> disks involved in its handling, while aligned one leaves one of disks
>> free for another request, doubling performance.
>
> You will find RAID5 writes to be an even better test: Optimal filesystem
> block-size is a RAID5 stripe width, and if you do not get the offset
> right you instantly lose at least 50% of your write bandwidth. My
> practical experience says often more like 75% is lost.

Sure, I just had no trusted RAID5 nearby to benchmark. Actually, with RAID5 the situation is even more complicated, as there are actually two optimal transaction sizes:

- First is the stripe size - the amount of data written sequentially to one disk. If you are not aligned with it, it gives the same results as I have just shown.
- Second is the row size - stripe size * number of data disks. You may freely read less information than a full row, but a short write causes the RAID to handle a read-modify-write scenario. If you have 3 disks and no battery-backed cache - you will definitely lose. But if there are 15 disks and a good cache, I believe the ability to execute multiple requests independently in parallel will compensate for the penalty.
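The two RAID5 sizes described above can be put into a small sketch (illustrative helpers, not a driver API): the row is the stripe size times the number of data disks, and a write avoids the read-modify-write path only when it covers whole rows exactly.

```c
#include <assert.h>
#include <stdint.h>

/* RAID5: one disk's worth of each row is parity, the rest holds data. */
static uint64_t
row_size(uint64_t stripesize, unsigned ndisks)
{
	return stripesize * (ndisks - 1);
}

/* A write triggers read-modify-write unless it covers whole rows. */
static int
needs_rmw(uint64_t off, uint64_t len, uint64_t row)
{
	return (off % row) != 0 || (len % row) != 0;
}
```

For a 3-disk array with a 16K stripe the row is 32K, so a lone 16K write always pays the RMW penalty, while an aligned 32K write does not.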
Also, with 15 disks it would be impractical to increase the FS block size, as in that case the OS will have to do that read-modify-write instead of the controller, and you may lose even more. With RAID5 I think the best practice would be to align the FS to the stripe size and instruct it to write data in maximal bursts, in the best case a full row at a time. -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 17:30:59 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0381210656A3; Fri, 25 Dec 2009 17:30:59 +0000 (UTC) (envelope-from gpalmer@freebsd.org) Received: from noop.in-addr.com (mail.in-addr.com [IPv6:2001:470:8:162::1]) by mx1.freebsd.org (Postfix) with ESMTP id CF14B8FC1D; Fri, 25 Dec 2009 17:30:58 +0000 (UTC) Received: from gjp by noop.in-addr.com with local (Exim 4.54 (FreeBSD)) id 1NOE09-000LYr-79; Fri, 25 Dec 2009 12:30:57 -0500 Date: Fri, 25 Dec 2009 12:30:57 -0500 From: Gary Palmer To: Alexander Motin Message-ID: <20091225173057.GA75881@in-addr.com> References: <4B349ABF.2070800@FreeBSD.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4B349ABF.2070800@FreeBSD.org> Cc: FreeBSD-Current , freebsd-arch@freebsd.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 17:30:59 -0000 On Fri, Dec 25, 2009 at 12:58:07PM +0200, Alexander Motin wrote: > Hi. > > Recently WD released first series of ATA disks with increased physical > sector size. It makes writes not matching with 4K blocks inefficient > there. So I propose to get back to the question of optimal FS block > alignment.
This topic is also important for most of RAIDs having striped > nature, such as RAID0/3/5/... and flash drives with simple controller > (such as MMC/SD cards). This is also a critical issue on certain SAN systems. NetApp, I suspect as a result of them layering a virtual LUN on top of another filesystem (WAFL), is very sensitive to filesystem alignment on the LUN. If the I/Os to the LUN are not 4k aligned, performance suffers a serious hit. I'm not sure which other SAN vendors suffer similar alignment restrictions. Regards, Gary From owner-freebsd-arch@FreeBSD.ORG Fri Dec 25 18:18:25 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 51948106568F; Fri, 25 Dec 2009 18:18:25 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 14C748FC15; Fri, 25 Dec 2009 18:18:24 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 229B57E831; Fri, 25 Dec 2009 18:18:24 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.3/8.14.3) with ESMTP id nBPIJ1tH028376; Fri, 25 Dec 2009 18:19:01 GMT (envelope-from phk@critter.freebsd.dk) To: Alexander Motin From: "Poul-Henning Kamp" In-Reply-To: Your message of "Fri, 25 Dec 2009 16:03:05 +0200."
<4B34C619.7070505@FreeBSD.org> Date: Fri, 25 Dec 2009 18:19:01 +0000 Message-ID: <28375.1261765141@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: FreeBSD-Current , freebsd-arch@FreeBSD.org Subject: Re: File system blocks alignment X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Dec 2009 18:18:25 -0000 In message <4B34C619.7070505@FreeBSD.org>, Alexander Motin writes: >Poul-Henning Kamp wrote: >- Second is a row size - stripe size * number of data disks. You may >freely read less information than full row, but short write cause RAID >to handle read-modify-write scenario. There is a far worse scenario: a stripe-spanning write forces an RMW cycle over two different RAID5 stripes. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
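The stripe-spanning case is easy to see with a little arithmetic; a sketch (illustrative helper): a write smaller than one stripe but not stripe-aligned touches two stripes, paying phk's worst-case RMW cost in each.

```c
#include <assert.h>
#include <stdint.h>

/* Number of stripes a write of len bytes at byte offset off touches. */
static uint64_t
stripes_touched(uint64_t off, uint64_t len, uint64_t stripesize)
{
	if (len == 0)
		return 0;
	return (off + len - 1) / stripesize - off / stripesize + 1;
}
```

With a 16K stripe, a 16K write at offset 0 stays inside one stripe, while the same write at offset 8K spans two.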