From owner-freebsd-arch@FreeBSD.ORG Sun Dec 20 14:18:15 2009
Date: Sun, 20 Dec 2009 15:18:11 +0100 (CET)
From: Harti Brandt
To: "Robert N. M. Watson"
Cc: Ulrich Spörlein, Hans Petter Selasky, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
Message-ID: <20091220150003.W54492@beagle.kn.op.dlr.de>
In-Reply-To: <5230C2B2-57A5-4982-928A-43756BF8C1C4@FreeBSD.org>
Reply-To: Harti Brandt
List-Id: Discussion related to FreeBSD architecture

On Sun, 20 Dec 2009, Robert N. M.
Watson wrote:

RNMW> On 20 Dec 2009, at 13:19, Harti Brandt wrote:
RNMW>
RNMW>> RW> Frequent writes to the same cache line across multiple cores are
RNMW>> RW> remarkably expensive, as they trigger the cache coherency protocol
RNMW>> RW> (mileage may vary). For example, a single non-atomically incremented
RNMW>> RW> counter cut gettimeofday() to 1/6th performance on an 8-core system
RNMW>> RW> when parallel system calls were made across all cores. On many
RNMW>> RW> current systems, the cost of an "atomic" operation is now fairly
RNMW>> RW> reasonable as long as the cache line is held exclusively by the
RNMW>> RW> current CPU. However, if we can avoid them that has value, as we
RNMW>> RW> update quite a few global stats on the way through the network
RNMW>> RW> stack.
RNMW>>
RNMW>> Hmm. I'm not sure that gettimeofday() is comparable to forwarding an
RNMW>> IP packet. I would expect that a single increment is a good percentage
RNMW>> of the entire processing (in terms of number of operations) for
RNMW>> gettimeofday(), while in IP forwarding it is somewhere in the noise
RNMW>> floor. In the simplest case the packet is acted upon by the receiving
RNMW>> driver, the IP input function, the IP output function and the sending
RNMW>> driver -- not to mention IP filters, firewalls, tunnels, dummynet and
RNMW>> whatever else. The relative cost of the increment should be much
RNMW>> less. But I may be wrong, of course.
RNMW>
RNMW> If processing is occurring on multiple CPUs -- for example, you are
RNMW> receiving UDP from two ithreads -- then 4-8 cache lines being contended
RNMW> due to stats is a lot. Our goal should be (for 9.0) to avoid having any
RNMW> contended cache lines in the common case when processing independent
RNMW> streams on different CPUs.
RNMW>
RNMW>> I would really like to sort that out before any kind of ABI freeze.
RNMW>> Ideally all the statistics would be accessible via sysctl(), have a
RNMW>> version number, and have all or most of the required statistics with
RNMW>> a simple way to add new fields without breaking anything. Also the
RNMW>> field sizes (64 vs. 32 bit) should be correct on the kernel/user
RNMW>> interface.
RNMW>>
RNMW>> My current feeling after reading this thread is that the low-level
RNMW>> kernel-side stuff is probably out of what I could do with the time I
RNMW>> have and would sidetrack me too far from the work on bsnmp. What I
RNMW>> would like to do is to fix the kernel/user interface and let the
RNMW>> people who know how to do it handle the low-level side.
RNMW>>
RNMW>> I would really not like to have to deal with a changing user/kernel
RNMW>> interface in current if we go in several steps with the kernel stuff.
RNMW>
RNMW> I think we should treat the statistics gathering and statistics
RNMW> reporting interfaces as entirely separable problems. Statistics are
RNMW> updated far more frequently than they are queried, so making the
RNMW> query process a bit more expensive (reformatting from an efficient
RNMW> 'update' format to an application-friendly 'report' format) should be
RNMW> fine.
RNMW>
RNMW> One question to think about is whether simple cross-CPU summaries are
RNMW> sufficient, or whether we actually also want to be able to directly
RNMW> monitor per-CPU statistics at the IP layer. The former would maintain
RNMW> the status quo, making per-CPU behavior purely part of the 'update'
RNMW> step; the latter would change the 'report' format as well. I've been
RNMW> focused primarily on 'update', but at least for my work it would be
RNMW> quite helpful to have per-CPU stats in the 'report' format as well.

No problem. I can even add that in a private SNMP MIB if it seems useful.

RNMW>> I will try to come up with a patch for the kernel/user interface in
RNMW>> the mean time. This will be for 9.x only, obviously.
RNMW> Sounds good -- and the kernel stats capture can "grow into" the full
RNMW> report format as it matures.
RNMW>
RNMW>> Doesn't this help for output only? For the input statistics there
RNMW>> will still be per-ifnet statistics.
RNMW>
RNMW> Most ifnet-layer stats should really be per-queue, both for input and
RNMW> output, which may help.

As far as I can see, the driver currently just calls if_input, which is
the interface-dependent input function. There seems to be no
driver-independent abstraction of input queues. (The hatm driver I wrote
several years ago has two input queues in hardware corresponding to 4 (or
8?) interrupt queues, but somewhere in the driver you put all of this
through the single if_input hook.) Or is there something I'm missing?

RNMW>> An interesting question from the SNMP point of view is what happens
RNMW>> to the statistics if you move interfaces around between vimages. In
RNMW>> any case it would be good if we could abstract from all the
RNMW>> complications while going kernel->userland.
RNMW>
RNMW> At least for now, the interface is effectively recreated when it
RNMW> moves vimage, and only the current vimage is able to monitor it. That
RNMW> could be considered a bug, but it might also be a simplifying
RNMW> assumption or even a feature. Likewise, it's worth remembering that
RNMW> the ifnet index space is per-vimage.

I was already thinking about how to fit the vimage stuff into the SNMP
model. The simplest way is to run one SNMP daemon per vimage. Next comes
having one daemon with one context per vimage. Bsnmpd does its own
mapping of system ifnet indexes to SNMP interface indexes, because the
allocation of system ifnet indexes does not fit the RFC requirements.
This means it will detect when an interface is moved away from a vimage
and comes back later. If the kernel statistics are stable across these
movements, there is no need to declare a counter discontinuity via SNMP.
On the other hand, these operations are probably seldom enough ...

harti