From owner-freebsd-arch@FreeBSD.ORG Sat Dec 19 20:01:35 2009
Date: Sat, 19 Dec 2009 21:01:35 +0100 (CET)
From: Harti Brandt
To: Bruce Evans
Cc: freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
In-Reply-To: <20091220032452.W2429@besplex.bde.org>
Message-ID: <20091219204217.D1741@beagle.kn.op.dlr.de>

On Sun, 20 Dec 2009, Bruce Evans wrote:

BE>On Sat, 19 Dec 2009, Harti Brandt wrote:
BE>
BE>> On Sun, 20 Dec 2009, Bruce Evans wrote:
BE>>
BE>> [... complications]
BE>>
BE>> To be honest, I'm lost now.
BE>> Couldn't we just use the largest atomic type
BE>> for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
BE>> the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
BE>> thread?
BE>
BE>That's probably best (except without the atomic operations) (like I said
BE>originally. I tried to spell out the complications to make it clear that
BE>they would be too much except for incomplete ones).
BE>
BE>> Are the 5-6 atomic operations really that costly given the many operations
BE>> done on an IP packet? Are they more costly than a heavyweight sync for
BE>> each ++ or +=?
BE>
BE>rwatson found that even non-atomic operations are quite costly, since
BE>at least on amd64 and i386, ones that write (or any access?) the same
BE>address (or cache line?) apparently involve much the same hardware
BE>activity (cache snoop?) as atomic ones implemented by locking the bus.
BE>I think this is mostly historical -- it should be necessary to lock the
BE>bus to get the slow version. Per-CPU counters give separate addresses
BE>and also don't require the bus lock. I don't like the complexity for
BE>per-CPU counters but don't use big SMP systems enough to know what the
BE>locks cost in real applications.
BE>
BE>> Or we could use the PCPU stuff, use just ++ and += for modifying the
BE>> statistics (32bit) and do the 32->64 bit stuff for all platforms with a
BE>> kernel thread per CPU (do we have this?). Between that thread and the
BE>> sysctl we could use a heavy sync.
BE>
BE>I don't like the squillions of threads in FreeBSD-post-4, but this seems
BE>to need its own one and there isn't one yet AFAIK. I think a thread is
BE>only needed for the 32-bit stuff (since aggregation has to use the
BE>current values and it shouldn't have to ask a thread to sum them). The
BE>thread should maintain only the high 32 or 33 bits of the 64-bit counters.
BE>Maybe there should be a thread per CPU (ugh) with per-CPU extra bits so
BE>that these bits can be accessed without locking.
BE>The synchronization is still interesting.
BE>
BE>> Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
BE>> largest atomic type for the platform, handle the aggregation and (on IA32)
BE>> the 32->64 bit stuff in a kernel thread.
BE>
BE>I don't see why using atomic or locks for just the 64 bit counters is good.
BE>We will probably end up with too many 64-bit counters, especially if they
BE>don't cost much when not read.

On a 32-bit arch, when one CPU reads a 32-bit value while another CPU is
modifying it, the read will probably always be correct, provided the
variable is correctly aligned. On a 64-bit arch, when reading a 64-bit
value on one CPU while another one is adding to it, do I always get the
correct value? I'm not sure about this, which is why I put atomic_*()
there, assuming that they will make this correct.

The idea is (for 32-bit platforms):

	struct pcpu_stats {
		uint32_t in_bytes;
		uint32_t in_packets;
	};

	struct pcpu_hc_stats {
		uint64_t hc_in_bytes;
		uint64_t hc_in_packets;
	};

	/* driver; IP stack; ... */
	...
	pcpu_stats->in_bytes += bytes;
	pcpu_stats->in_packets++;
	...

	/* per CPU kernel thread for 32-bit arch */
	lock(pcpu_hc_stats);
	...
	val = pcpu_stats->in_bytes;
	/* low word went backwards: the 32-bit counter wrapped */
	if ((uint32_t)pcpu_hc_stats->hc_in_bytes > val)
		pcpu_hc_stats->hc_in_bytes += 0x100000000ULL;
	pcpu_hc_stats->hc_in_bytes =
	    (pcpu_hc_stats->hc_in_bytes & 0xffffffff00000000ULL) | val;
	...
	unlock(pcpu_hc_stats);

	/* sysctl */
	memset(&stats, 0, sizeof(stats));
	foreach(cpu) {
		lock(pcpu_hc_stats(cpu));
		...
		stats.in_bytes += pcpu_hc_stats(cpu)->hc_in_bytes;
		...
		unlock(pcpu_hc_stats(cpu));
	}
	copyout(stats);

On 64-bit archs we can go without the locks and the thread, given that we
can reliably read the 64-bit per-CPU numbers (can we?).

BE>I just thought of another implementation to reduce reads: trap on
BE>overflow and handle all the complications in the trap handler, or
BE>just set a flag to tell the fixup thread to run and normally don't
BE>run the fixup thread.
BE>This seems to not quite work -- arranging
BE>for the trap would be costly (needs "into" instruction on i386?).
BE>Similarly for explicit tests for wraparound (PCPU_INC() could be a
BE>function call that does the test and handles wraparound in a fully
BE>locked fashion. We don't care that this code executes slowly since
BE>it rarely executes, but we care that the test pessimizes the usual
BE>case).
BE>
BE>There is also "lock cmpxchg8b" on i386. I think this can be used in a
BE>loop to implement atomic 64-bit ops (?). Simpler, but slower in
BE>PCPU_INC(). I prefer a function call version of PCPU_INC() to this.
BE>That should be faster in the usual case and only much larger if we
BE>have too many 64-bit counters.
BE>
BE>> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
BE>> machine and do routing at link speed, though. This might overflow the IP
BE>> input/output byte counter (which we don't have yet) too fast.
BE>
BE>Not with a mere 10GB/S. That's ~1GB/S so it takes 4 seconds to overflow
BE>a 32-bit byte counter. A bit counter would take a while to overflow too.
BE>Are there any faster incrementors? TSCs also take O(1) seconds to overflow,
BE>and timecounter logic depends on no timecounter overflowing much faster
BE>than that.

If you have 4 10GBit/s adapters each operating full-duplex at link speed,
you wrap in under 0.5 seconds, maybe even faster if you have some kind of
tunnels where each packet counts several times. But I suppose this will
not be so easy to implement on IA32 :-)

harti
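
For reference, the 32->64 bit fixup step discussed in this message can be
written as a small standalone C function. This is only a sketch:
extend32() is a hypothetical name, not an existing FreeBSD interface, and
it assumes the fixup runs often enough that the 32-bit counter wraps at
most once between two calls.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a FreeBSD API): fold the current value of a
 * 32-bit fast-path counter into its 64-bit shadow copy.
 *
 * Assumption: the 32-bit counter wrapped at most once since the previous
 * call. If it can wrap twice between calls, the result is too small.
 */
static uint64_t
extend32(uint64_t hc, uint32_t val)
{
	/* The low word went backwards, so the 32-bit counter wrapped. */
	if ((uint32_t)hc > val)
		hc += UINT64_C(0x100000000);

	/* Keep the high 32 bits and take the low 32 bits from val. */
	return ((hc & UINT64_C(0xffffffff00000000)) | val);
}
```

Starting from a shadow value of 0xfffffff0, a new 32-bit reading of 0x10
is treated as one wrap and yields 0x100000010. The fixup interval has to
stay well below the wrap time of the fastest counter, which is what the
0.5 second estimate above is about.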