Date: Wed, 8 Sep 2004 18:44:11 -0400 (EDT)
From: Robert Watson <robert@fledge.watson.org>
To: Gerrit Nagelhout
cc: current@freebsd.org, slong@freebsd.org
Subject: Re: FreeBSD 5.3 Bridge performance take II

On Tue, 7 Sep 2004, Gerrit Nagelhout wrote:

> From these numbers, the uma locks seem to get called twice for every
> packet, but have no collisions.  All other locks have significant
> collision problems resulting in a lot of overhead.

UMA grabs per-CPU cache locks to check the per-CPU cache, which is where
common-case allocations and frees are serviced.  This is something I'm
currently exploring in a Perforce branch by associating caches with
additional objects/entities, such as threads and interfaces.  The per-CPU
locks serve a number of functions, however, and there are trade-offs in
looking at weaker synchronization models.  Here are some of the things
these locks do:

- Synchronize access to the per-CPU uma_cache for the zone in the event
  that the caller is preempted, leaving the cache consistent in the view
  of the thread, not the CPU.

- Synchronize access in the event a thread migrates between CPUs, so that
  the thread can continue to reference the cache from another CPU and
  finish up whatever it's doing "safely".

- Allow global access to the per-CPU uma_caches for the purposes of
  draining and statistics collection.

- Allow global access to the per-CPU uma_caches for the purposes of
  destroying a UMA zone.

In the rwatson_umaperthread branch, I've started to associate struct
uma_cache structures with threads.  Since caches are "per-zone", I allow
threads to register for zones of interest; these caches are hung off of
struct thread, and must be explicitly registered and released.  While
this approach might not be desirable in the long term, it allowed me to
experiment at low implementation cost.  In particular, for ithreads and
netisrs, I'm running with caches for each of the mbuf-related zones:
mbufs, clusters, and packets.  In practice, this eliminates mutex
acquisition for mbuf allocation and free in the forwarding and bridging
paths, and halves the number of operations when interacting with user
threads (as they don't have the caches set up).  It also allows me to
maintain these properties in the presence of preemption, CPU migration,
and load balancing.
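To make the shape of this concrete, here is a minimal sketch of the kind
of structure and fast path described above.  The names (struct
uma_tcache, td_umacache, the uma_tcache_* functions, and the magazine
size) are illustrative only, not the actual rwatson_umaperthread API:

    /*
     * Hedged sketch of a per-thread front-end cache for one UMA zone.
     * struct uma_tcache, td_umacache, and uma_tcache_*() are
     * hypothetical names, not the branch's real interface.
     */
    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <sys/proc.h>
    #include <vm/uma.h>

    #define UTC_MAGSIZE     16

    struct uma_tcache {
            uma_zone_t       utc_zone;               /* zone we front */
            void            *utc_objs[UTC_MAGSIZE];  /* tiny magazine */
            int              utc_count;              /* cached objects */
    };

    /*
     * A thread (e.g., an ithread or the netisr) registers interest in
     * a zone up front; the cache hangs off struct thread and is only
     * ever touched by its owning thread.
     */
    static void
    uma_tcache_register(struct thread *td, uma_zone_t zone,
        struct uma_tcache *utc)
    {

            utc->utc_zone = zone;
            utc->utc_count = 0;
            td->td_umacache = utc;  /* hypothetical struct thread field */
    }

    /*
     * Fast path: no mutex at all.  Because only curthread dereferences
     * td_umacache, the cache stays consistent across preemption and CPU
     * migration -- it follows the thread, not the CPU.
     */
    static void *
    uma_tcache_alloc(struct thread *td, uma_zone_t zone)
    {
            struct uma_tcache *utc = td->td_umacache;

            if (utc != NULL && utc->utc_zone == zone && utc->utc_count > 0)
                    return (utc->utc_objs[--utc->utc_count]);
            return (uma_zalloc(zone, M_NOWAIT));  /* locked per-CPU path */
    }

The flip side, discussed below, is that nothing outside the owning thread
can safely drain such a cache or tear its zone down.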
I've not yet done a lot of performance measurement, since the if_em
interfaces on the boxes I'm testing with wedge under high packet queue
depth; at lower depths, using the UDP_RR test in netperf, I see a several
percent drop in processing latency.  Without testing under higher volume,
though, it's hard to reason about the overall benefits.  I hope to be
able to start doing more effective performance testing on this in the
near future.  I had hoped for slightly better improvements from removing
those mutex operations; switching to direct dispatch in the network
stack, in contrast, has a far more dramatic effect on processing latency.
There are some immediate downsides, however:

- The caches can no longer be accessed safely from other threads, so we
  can't drain per-thread caches except in the context of the owning
  thread.

- Since my experimental model maintains the notion that caches are kept
  per-zone and not across UMA, threads have to notify UMA in advance as
  to which memory types are particularly important to them.  This is easy
  for ithreads on network drivers and for the netisr, but harder in the
  general case (which also matters :-).

- Removing zones is now harder, since global access to caches is
  restricted in the current model.

My interest in per-thread caches was to explore ways to reduce the cost
of zone allocation without modifying our synchronization model.  It has
been proposed that a better way to achieve the same results is to lower
the cost of entering critical sections, which would have the effect of
pinning the thread to the current CPU (preventing migration) and also
preventing preemption.  Right now, our critical section cost is quite
high (no measurements on hand), suggesting that using locks on per-CPU
structures doesn't actually put us in a worse situation.  Moving to
critical sections would also complicate tearing down UMA zones (etc.); in
the per-thread UMA cache model, I gloss over this (since it's
experimentation) by simply declaring that zones which support per-thread
caching can't be destroyed.

One nice thing about this experimental code is that I hope it will allow
us to reason more effectively about the extent to which improving per-CPU
data structures improves efficiency -- I can now much more easily say
"OK, what happens if we eliminate the cost of locking for commonplace
mbuf allocation/free?".  I've also started looking at per-interface
caches based on the same model, stuffing per-interface uma caches into
struct ifnet; this has some similar limitations (but also some similar
benefits).  Since there are additional costs associated with more
extensive use of critical sections (such as the impact on timely
preemption, load balancing, etc.), we should be in a better position to
do a useful comparison as work is done to improve the performance of our
critical section primitives.

BTW, right now my primary areas of optimization and work focus in the
stack for the next few weeks are:

- Lowering (and where possible eliminating) the costs associated with
  entropy harvesting in the interrupt and network paths.  Right now it's
  somewhat scary how much work is done there.  If you're not already
  disabling harvesting of entropy on interrupts and in network
  processing, you really want to for performance reasons; an example
  follows below.
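The knobs for this are sysctls in 5.x; assuming the stock random(4)
device, disabling the interrupt and network harvesting points looks
roughly like the following (a sketch; double-check the knob names against
random(4) on your revision):

    # disable entropy harvesting in the interrupt and network paths
    sysctl kern.random.sys.harvest.interrupt=0
    sysctl kern.random.sys.harvest.ethernet=0
    sysctl kern.random.sys.harvest.point_to_point=0

The same assignments (minus the leading "sysctl") can go in
/etc/sysctl.conf to persist across reboots.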
I have changes in the pipeline that halve the number of mutex operations
during harvesting of entropy, and that reduce the number of mutex
operations during entropy processing in the Yarrow thread from O(4N) to
O(4).  Also, there's some "hard work" going on for CPUs without cycle
counters due to timing information collection.  I'd like to eliminate the
entropy harvesting point in ether_input() -- it strikes me as both
redundant (it's called many times in close succession) and incorrect (it
processes the wrong data).

- Running additional traces on the network processing path using KTR to
  identify weaknesses in performance: in particular, to look at context
  switching (especially gratuitous wakeups, poorly timed thrashing, and
  delays in processing), mutex acquisition/drop, excess or gratuitous
  memory allocation, inefficient memory copies, etc.

- Spending additional time on IPv6 locking and safety in an MPSAFE
  kernel.

- Re-working BPF locking (and other aspects of its behavior) due to
  reported bugs, locking weaknesses, etc.

- Continuing work on KAME IPSEC locking.

- Measuring contention in the pcbinfo locking models used by several
  protocols, and starting to identify locking strategies that mitigate
  that contention (and hopefully also lower its cost).  I'm thinking of
  changing the reference model for so_pcb pointers into per-protocol
  pcbs, since they currently tend to point at fairly heavyweight locking
  models, but I need to do some more research first.

- Starting to explore models for processing packets in sets for
  somewhat-indirect dispatch.  We do some fairly inefficient things, such
  as fragmenting a datagram into many packets and passing them one by one
  into network processing; we do the same in other places as well, such
  as when crossing layer boundaries.

Things I hope to see others working on (:-) include optimizing
synchronization primitives (mutexes, wakeup/sleep events, critical
sections, etc.), performing similar sorts of analysis to the above, and
spending time on driver locking to see how efficiency can be improved.  I
have also measured substantial contention between the send and receive
paths under heavy processing, but I'm not very familiar with our
non-synthetic network interface drivers.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Principal Research Scientist, McAfee Research