Date: Sun, 20 Dec 2009 14:19:34 +0100 (CET)
From: Harti Brandt
To: Robert Watson
Cc: Ulrich Spörlein, Hans Petter Selasky, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP

On Sun, 20 Dec 2009, Robert Watson wrote:

RW>On Sat, 19 Dec 2009, Harti Brandt wrote:
RW>
RW>> To be honest, I'm lost now. Couldn't we just use the largest atomic
RW>> type for the given platform and atomic_inc/atomic_add/atomic_fetch
RW>> and handle the 32->64 bit stuff (for IA32) as I do it in bsnmp, but
RW>> as a kernel thread?
RW>>
RW>> Are the 5-6 atomic operations really that costly given the many
RW>> operations done on an IP packet? Are they more costly than a
RW>> heavyweight sync for each ++ or +=?
RW>
RW>Frequent writes to the same cache line across multiple cores are
RW>remarkably expensive, as they trigger the cache coherency protocol
RW>(mileage may vary). For example, a single non-atomically incremented
RW>counter cut the performance of gettimeofday() to 1/6th on an 8-core
RW>system when parallel system calls were made across all cores. On many
RW>current systems, the cost of an "atomic" operation is now fairly
RW>reasonable as long as the cache line is held exclusively by the current
RW>CPU. However, if we can avoid them that has value, as we update quite a
RW>few global stats on the way through the network stack.

Hmm. I'm not sure that gettimeofday() is comparable to forwarding an IP
packet. I would expect a single increment to be a good percentage of the
entire processing (in terms of the number of operations) for
gettimeofday(), while in IP forwarding it is somewhere in the noise
floor. Even in the simplest case the packet is acted upon by the
receiving driver, the IP input function, the IP output function and the
sending driver, not to mention IP filters, firewalls, tunnels, dummynet
and whatever else. The relative cost of the increment should be much
lower. But I may be wrong, of course.
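Just so we are talking about the same thing, here is roughly the
difference as I understand it. This is a sketch only; the function and
variable names are made up and this is not what is (or will be) in the
tree:

  /*
   * Variant 1: one shared counter, updated with the largest atomic
   * type the platform supports natively (u_long: 32 bit on IA32,
   * 64 bit on amd64).  Every CPU that forwards a packet writes the
   * same cache line, which ping-pongs between the cores.
   */
  static u_long ip_forwarded;

  static __inline void
  ip_stat_inc_atomic(void)
  {
          atomic_add_long(&ip_forwarded, 1);
  }

  /*
   * Variant 2: per-CPU counters, plain ++ without atomics.  A real
   * implementation would put these into the PCPU/DPCPU area so that
   * each counter lives in a cache line owned by its CPU; a flat array
   * like this still suffers from false sharing and is only for
   * illustration.
   */
  static uint32_t ip_forwarded_pcpu[MAXCPU];

  static __inline void
  ip_stat_inc_pcpu(void)
  {
          ip_forwarded_pcpu[curcpu]++;
  }

  /*
   * The reader sums the per-CPU values (and widens them to 64 bit)
   * only when the statistic is actually requested, e.g. in the sysctl
   * handler.  The result may be slightly stale, which is fine for
   * statistics.  Handling the wrap of the 32-bit per-CPU counters is
   * the separate problem discussed in this thread.
   */
  static uint64_t
  ip_stat_read(void)
  {
          uint64_t sum;
          u_int i;

          sum = 0;
          for (i = 0; i < MAXCPU; i++)
                  sum += ip_forwarded_pcpu[i];
          return (sum);
  }

With the second variant the ++ in the forwarding path touches only
memory owned by the local CPU, which is, as far as I understand it, the
whole point of the per-CPU approach.
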
RW>> Or we could use the PCPU stuff, use just ++ and += for modifying the
RW>> statistics (32bit) and do the 32->64 bit stuff for all platforms with
RW>> a kernel thread per CPU (do we have this?). Between that thread and
RW>> the sysctl we could use a heavy sync.
RW>
RW>The current short-term plan is to do this, but without a syncer thread:
RW>we'll just aggregate the results when they need to be reported, in the
RW>sysctl path. How best to scale to 64-bit counters is an interesting
RW>question, but one we can address after per-CPU stats are in place,
RW>which address an immediate performance (rather than statistics
RW>accuracy) concern.

Well, the user side of our statistics is in a very bad shape and I have
problems handling this in the SNMP daemon. Just a number of examples:

interface statistics:
- they use u_long, so are either 32-bit or 64-bit depending on the
  platform
- a number of required statistics is missing
- send drops are somewhere else and are 'int'
- statistics are embedded into struct ifnet (bad for ABI stability) and
  not versioned
- accessed together with other unrelated information via sysctl()

IPv4 statistics:
- also u_long (hence different size depending on the platform)
- a lot of fields required by SNMP are missing
- not versioned
- accessed via sysctl()
- per-interface statistics totally missing

IPv6 statistics:
- u_quad_t! so they are subject to race conditions on 32-bit platforms
  and, maybe, on 64-bit platforms
- a lot of fields required by SNMP are missing
- not versioned
- accessed via sysctl(); per-interface statistics via ioctl()

Ethernet statistics:
- u_long
- some fields missing
- implemented in only 3(!) drivers; some drivers use the corresponding
  field for something else
- not versioned

I think TCP and UDP statistics are in equally bad shape.

I would really like to sort that out before any kind of ABI freeze
happens. Ideally all the statistics would be accessible via sysctl(),
carry a version number and contain all or most of the required fields,
with a simple way to add new fields without breaking anything. Also the
field sizes (64 vs. 32 bit) should be correct at the kernel/user
interface.

My current feeling after reading this thread is that the low-level
kernel side is probably beyond what I can do in the time I have and
would sidetrack me too far from the work on bsnmp. What I would like to
do is fix the kernel/user interface and let the people who know how to
do it handle the low-level side. I would really not like to have to deal
with a changing user/kernel interface in current if we go in several
steps with the kernel stuff.

RW>> Using 32 bit stats may fail if you put several 10GBit/s adapters
RW>> into a machine and do routing at link speed, though. This might
RW>> overflow the IP input/output byte counter (which we don't have yet)
RW>> too fast.
RW>
RW>For byte counters, assuming one 10gbps stream, a 32-bit counter wraps
RW>in about three seconds. Systems processing 40gbps are now quite
RW>realistic, although typically workloads of that sort will be
RW>distributed over 16+ cores and using multiple 10gbps NICs.
RW>
RW>My thinking is that we get the switch to per-CPU stats done in 9.x in
RW>the next month sometime, and also get it merged to 8.x a month or so
RW>later (I merged the wrapper macros necessary to do that before 8.0 but
RW>didn't have time to fully evaluate the performance implications of the
RW>implementation switch).

I will try to come up with a patch for the kernel/user interface in the
meantime.
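To make it a bit more concrete what I mean by a versioned, fixed-size
interface, something along these lines (the structure, field and sysctl
names below are invented, this is not a concrete proposal):

  /*
   * Illustration only -- names are made up.
   */
  #include <stdint.h>

  #define IPSTAT_VERSION  1

  struct ipstat_v1 {
          uint32_t  ips_version;    /* IPSTAT_VERSION */
          uint32_t  ips_len;        /* size the kernel filled in */
          uint64_t  ips_total;      /* packets received */
          uint64_t  ips_delivered;  /* packets delivered to upper layers */
          uint64_t  ips_forwarded;  /* packets forwarded */
          uint64_t  ips_ibytes;     /* bytes received */
          uint64_t  ips_obytes;     /* bytes sent */
          /*
           * New fields are only ever appended.  Userland compares
           * ips_len with its own idea of the structure size and can so
           * deal with both older and newer kernels.
           */
  };

The sysctl handler would sum the per-CPU counters into such a structure
on demand, so the fixed 64-bit fields cost nothing in the packet path,
and the layout is identical on 32- and 64-bit platforms. Userland
(bsnmpd, netstat) would fetch it with something like

  sysctlbyname("net.inet.ip.stats_v1", &st, &len, NULL, 0);

where, again, the OID name is made up.
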
This will be for 9.x only, obviously.

RW>There are two known areas of problem here:
RW>
RW>(1) The cross-product issue with virtual network stacks
RW>(2) The cross-product issue with network interfaces for per-interface
RW>    stats
RW>
RW>I propose to ignore (1) for now by simply having only vnet0 use per-CPU
RW>stats, and other vnets use single-instance per-vnet stats. We can solve
RW>the larger problem there at a future date.

This sounds reasonable if we wrap all the statistics stuff into macros
and/or functions (see the sketch in the PS below).

RW>I don't have a good proposal for (2) -- the answer may be using DPCPU
RW>memory, but that will require us to support more dynamic allocation
RW>ranges, which may add cost. (Right now, the DPCPU allocator relies on
RW>relatively static allocations over time.) This means that, for now, we
RW>may also ignore that issue and leave interface counters as-is. This is
RW>probably a good idea because we also need to deal with multi-queue
RW>interfaces better, and perhaps the stats should be per-queue rather
RW>than per-ifnet, which may itself help address the cache line issue.

Doesn't this help for output only? For the input statistics there will
still be per-ifnet statistics. An interesting question from the SNMP
point of view is what happens to the statistics if you move interfaces
around between vimages. In any case it would be good if we could
abstract away all these complications on the way from the kernel to
userland.

harti
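PS: With "wrap into macros" I mean something like the sketch below. It
is purely illustrative (I haven't looked at the wrapper macros you
merged before 8.0, and the names here are made up); the point is only
that callers never touch the counters directly, so vnet0 can use per-CPU
storage while the other vnets keep a single per-vnet instance, without
changing a single line in ip_input()/ip_output():

  /*
   * Purely illustrative -- the macro and variable names are made up and
   * this assumes the usual VIMAGE environment (curvnet, curcpu, the V_
   * prefix) without showing the declarations.
   */
  #ifdef VIMAGE
  #define IPSTAT_INC(field) do {                                  \
          if (IS_DEFAULT_VNET(curvnet))                           \
                  /* vnet0: per-CPU counters, plain increment */  \
                  V_ipstat_pcpu[curcpu].field++;                  \
          else                                                    \
                  /* other vnets: one shared instance per vnet */ \
                  V_ipstat.field++;                               \
  } while (0)
  #else
  #define IPSTAT_INC(field)       (V_ipstat.field++)
  #endif

  /* A caller in the forwarding path would then simply do: */
          IPSTAT_INC(ips_forwarded);

The read side would go through a similar macro or function, so the
kernel->userland export does not need to know which storage scheme a
particular vnet uses.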