Date: Sun, 20 Dec 2009 12:13:46 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
To: Harti Brandt
Cc: Ulrich Spörlein, Hans Petter Selasky, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP

On Sat, 19 Dec 2009, Harti Brandt wrote:

> To be honest, I'm lost now.  Couldn't we just use the largest atomic type
> for the given platform and atomic_inc/atomic_add/atomic_fetch, and handle
> the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
> thread?
>
> Are the 5-6 atomic operations really that costly given the many operations
> done on an IP packet?  Are they more costly than a heavyweight sync for
> each ++ or +=?

Frequent writes to the same cache line across multiple cores are remarkably
expensive, as they trigger the cache coherency protocol (mileage may vary).
For example, a single non-atomically incremented counter cut gettimeofday()
performance to 1/6th on an 8-core system when parallel system calls were
made across all cores.  On many current systems, the cost of an "atomic"
operation is now fairly reasonable as long as the cache line is held
exclusively by the current CPU.  However, if we can avoid them, that has
value, as we update quite a few global stats on the way through the network
stack.

> Or we could use the PCPU stuff, use just ++ and += for modifying the
> statistics (32 bit), and do the 32->64 bit stuff for all platforms with a
> kernel thread per CPU (do we have this?).  Between that thread and the
> sysctl we could use a heavy sync.

The current short-term plan is to do this, but without a syncer thread:
we'll just aggregate the results when they need to be reported, in the
sysctl path (a rough sketch of the pattern is appended at the end of this
mail).  How best to scale to 64-bit counters is an interesting question, but
one we can address after per-CPU stats are in place, which address an
immediate performance (rather than statistics accuracy) concern.

> Using 32-bit stats may fail if you put several 10 Gbit/s adapters into a
> machine and do routing at link speed, though.  This might overflow the IP
> input/output byte counter (which we don't have yet) too fast.
For byte counters, assuming a single 10 Gb/s stream, a 32-bit counter wraps
in about three seconds.  Systems processing 40 Gb/s are now quite realistic,
although workloads of that sort will typically be distributed over 16+ cores
and multiple 10 Gb/s NICs.  (Quick wrap-time arithmetic is appended below.)

My thinking is that we get the switch to per-CPU stats done in 9.x in the
next month sometime, and also get it merged to 8.x a month or so later (I
merged the wrapper macros necessary to do that before 8.0, but didn't have
time to fully evaluate the performance implications of the implementation
switch).  There are two known problem areas here:

(1) The cross-product issue with virtual network stacks.

(2) The cross-product issue with network interfaces for per-interface stats.

I propose to ignore (1) for now by simply having only vnet0 use per-CPU
stats, and other vnets use single-instance per-vnet stats.  We can solve the
larger problem there at a future date.

I don't have a good proposal for (2) -- the answer may be using DPCPU
memory, but that will require us to support more dynamic allocation ranges,
which may add cost.  (Right now, the DPCPU allocator relies on relatively
static allocations over time.)  This means that, for now, we may also ignore
that issue and leave interface counters as-is.  This is probably a good idea
because we also need to deal with multi-queue interfaces better, and perhaps
the stats should be per-queue rather than per-ifnet, which may itself help
address the cache line issue.

Robert N M Watson
Computer Laboratory
University of Cambridge
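P.S. For concreteness, here is a rough, self-contained sketch of the per-CPU
counter pattern discussed above: plain (non-atomic) increments into a
CPU-local, cache-line-aligned slot on the hot path, with the slots folded
into a single total only when a reader -- the sysctl handler in the real
stack -- asks for it.  This is an illustration of the idea, not the code
that would go into the tree; MAXCPU_SKETCH, stat_cpu(), and the field names
are all placeholders.

/*
 * Simplified userland illustration of per-CPU statistics counters.
 * Each CPU owns one cache-line-aligned slot; increments are plain stores
 * with no atomics or locks, and totals are computed only at read time.
 */
#include <stdint.h>
#include <stdio.h>

#define MAXCPU_SKETCH   8               /* placeholder for the real CPU count */
#define CACHE_LINE      64

struct ipstat_slot {
        uint64_t ips_total;             /* packets received */
        uint64_t ips_delivered;         /* packets delivered upward */
} __attribute__((aligned(CACHE_LINE))); /* avoid false sharing between CPUs */

static struct ipstat_slot ipstat_pcpu[MAXCPU_SKETCH];

/*
 * Placeholder for "which CPU am I on"; the kernel would use the current
 * CPU id while migration is prevented (e.g. under a critical section).
 */
static int
stat_cpu(void)
{

        return (0);
}

/* Hot path: unsynchronized increment of the local CPU's slot. */
static void
ipstat_inc_total(void)
{

        ipstat_pcpu[stat_cpu()].ips_total++;
}

/* Report path (the sysctl handler in the real stack): fold the slots. */
static uint64_t
ipstat_sum_total(void)
{
        uint64_t sum;
        int i;

        sum = 0;
        for (i = 0; i < MAXCPU_SKETCH; i++)
                sum += ipstat_pcpu[i].ips_total;
        return (sum);
}

int
main(void)
{
        int i;

        for (i = 0; i < 1000; i++)
                ipstat_inc_total();
        printf("total packets: %ju\n", (uintmax_t)ipstat_sum_total());
        return (0);
}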
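And the quick arithmetic behind the wrap times mentioned above, with the
line rates chosen purely as examples:

/*
 * How long until a 32-bit byte counter wraps at a given line rate?
 * 2^32 bytes at 10 Gb/s (1.25e9 bytes/s) is roughly 3.4 seconds; at
 * 40 Gb/s it is under a second.
 */
#include <stdio.h>

int
main(void)
{
        const double rates_gbps[] = { 1.0, 10.0, 40.0 };
        const double counter_max = 4294967296.0;        /* 2^32 bytes */
        int i;

        for (i = 0; i < (int)(sizeof(rates_gbps) / sizeof(rates_gbps[0])); i++) {
                double bytes_per_sec = rates_gbps[i] * 1e9 / 8.0;

                printf("%5.1f Gb/s: 32-bit byte counter wraps in %.2f s\n",
                    rates_gbps[i], counter_max / bytes_per_sec);
        }
        return (0);
}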