From: "Robert N. M. Watson" <rwatson@FreeBSD.org>
Date: Sun, 20 Dec 2009 13:46:54 +0000
To: Harti Brandt
Cc: Ulrich Spörlein, Hans Petter Selasky, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
In-Reply-To: <20091220134738.V46221@beagle.kn.op.dlr.de>
Message-Id: <5230C2B2-57A5-4982-928A-43756BF8C1C4@FreeBSD.org>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
 <200912151313.28326.jhb@freebsd.org>
 <20091219112711.GR55913@acme.spoerlein.net>
 <200912191244.17803.hselasky@c2i.net>
 <20091219232119.L1555@besplex.bde.org>
 <20091219164818.L1741@beagle.kn.op.dlr.de>
 <20091220134738.V46221@beagle.kn.op.dlr.de>

On 20 Dec 2009, at 13:19, Harti Brandt wrote:

> RW>Frequent writes to the same cache line across multiple cores are
> RW>remarkably expensive, as they trigger the cache coherency protocol
> RW>(mileage may vary). For example, a single non-atomically incremented
> RW>counter cut the performance of gettimeofday() to 1/6th on an 8-core
> RW>system when parallel system calls were made across all cores. On many
> RW>current systems, the cost of an "atomic" operation is now fairly
> RW>reasonable as long as the cache line is held exclusively by the
> RW>current CPU. However, if we can avoid them that has value, as we
> RW>update quite a few global stats on the way through the network stack.
>
> Hmm. I'm not sure that gettimeofday() is comparable to forwarding an IP
> packet. I would expect that a single increment is a good percentage of
> the entire processing (in terms of number of operations) for
> gettimeofday(), while in IP forwarding it is somewhere in the noise
> floor. In the simplest case the packet is acted upon by the receiving
> driver, the IP input function, the IP output function and the sending
> driver -- not to mention IP filters, firewalls, tunnels, dummynet and
> whatever else. The relative cost of the increment should be much less.
> But I may be wrong, of course.

If processing is occurring on multiple CPUs -- for example, you are
receiving UDP via two ithreads -- then 4-8 cache lines being contended
due to stats is a lot. Our goal should be (for 9.0) to avoid having any
contended cache lines in the common case when processing independent
streams on different CPUs.
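To make the contended-cache-line point concrete, here is a minimal sketch
of per-CPU counters in plain C. MAXCPU, struct ipstat_pcpu and
ipstat_inc_total() are made-up names for illustration, not the real
kernel API; the point is only that the hot path increments a counter in a
cache line owned by the local CPU, with no atomics and no sharing.

    /*
     * Illustrative sketch only: per-CPU statistics to avoid contended
     * cache lines.  All names here are placeholders.
     */
    #include <stdint.h>

    #define MAXCPU          64
    #define CACHE_LINE_SIZE 64

    struct ipstat_pcpu {
            uint64_t ips_total;     /* packets received */
            uint64_t ips_delivered; /* packets handed to upper layers */
            uint64_t ips_forwarded; /* packets forwarded */
    } __attribute__((aligned(CACHE_LINE_SIZE)));   /* one line per CPU */

    static struct ipstat_pcpu ipstat[MAXCPU];

    /*
     * Hot path: a plain increment of the local CPU's copy.  No atomic
     * instruction and no write to a cache line shared with other CPUs.
     */
    static inline void
    ipstat_inc_total(int cpu)
    {
            ipstat[cpu].ips_total++;
    }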
> I would really like to sort that out before any kind of ABI freeze
> happens. Ideally all the statistics would be accessible via sysctl(),
> have a version number, and include all or most of the required
> statistics, with a simple way to add new fields without breaking
> anything. Also the field sizes (64 vs. 32 bit) should be correct at the
> kernel/user interface.
>
> My current feeling after reading this thread is that the low-level
> kernel side is probably beyond what I can do with the time I have, and
> would sidetrack me too far from the work on bsnmp. What I would like to
> do is fix the kernel/user interface and let the people who know how to
> do it handle the low-level side.
>
> I would really not like to have to deal with a changing user/kernel
> interface in current if we go in several steps with the kernel stuff.

I think we should treat the statistics-gathering and statistics-reporting
interfaces as entirely separable problems. Statistics are updated far
more frequently than they are queried, so making the query process a bit
more expensive (reformatting from an efficient 'update' format to an
application-friendly 'report' format) should be fine.

One question to think about is whether cross-CPU summaries alone are
sufficient, or whether we actually also want to be able to directly
monitor per-CPU statistics at the IP layer. The former would maintain the
status quo, making per-CPU behaviour purely part of the 'update' step;
the latter would change the 'report' format as well. I've been focused
primarily on 'update', but at least for my work it would be quite helpful
to have per-CPU stats in the 'report' format as well.

> I will try to come up with a patch for the kernel/user interface in the
> mean time. This will be for 9.x only, obviously.

Sounds good -- and the kernel stats capture can "grow into" the full
report format as it matures.

> Doesn't this help for output only? For the input statistics there will
> still be per-ifnet statistics.

Most ifnet-layer stats should really be per-queue, both for input and
output, which may help.

> An interesting question from the SNMP point of view is what happens to
> the statistics if you move interfaces around between vimages. In any
> case it would be good if we could abstract away from all the
> complications while going kernel->userland.

At least for now, the interface is effectively recreated when it moves
vimage, and only the current vimage is able to monitor it. That could be
considered a bug, but it might also be a simplifying assumption or even a
feature. Likewise, it's worth remembering that the ifnet index space is
per-vimage.

Robert
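A matching sketch of the 'report' side described above, reusing the
placeholder ipstat[] array and MAXCPU from the earlier sketch: the
per-CPU 'update' copies are only folded into an application-friendly
structure when userland actually asks, so the query path pays the cost
rather than the packet path. Again, the names are illustrative, not the
real sysctl code.

    /*
     * Illustrative sketch only: fold the per-CPU "update" copies into a
     * single "report" structure at query time.
     */
    struct ipstat_report {
            uint64_t ips_total;
            uint64_t ips_delivered;
            uint64_t ips_forwarded;
    };

    static void
    ipstat_collect(struct ipstat_report *rp)
    {
            int cpu;

            rp->ips_total = rp->ips_delivered = rp->ips_forwarded = 0;
            for (cpu = 0; cpu < MAXCPU; cpu++) {
                    rp->ips_total     += ipstat[cpu].ips_total;
                    rp->ips_delivered += ipstat[cpu].ips_delivered;
                    rp->ips_forwarded += ipstat[cpu].ips_forwarded;
            }
            /* A sysctl handler would copy *rp out to userland here. */
    }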