From owner-freebsd-current@FreeBSD.ORG Thu Sep  9 08:55:02 2004
Date: Thu, 9 Sep 2004 01:55:01 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200409090855.i898t1R8072488@apollo.backplane.com>
To: Robert Watson
cc: scottl@freebsd.org
cc: Gerrit Nagelhout
cc: current@freebsd.org
Subject: Re: FreeBSD 5.3 Bridge performance take II

:% One nice thing about using this experimental code is that I hope it will
:% allow us to reason more effectively about the extent to which improving
:% per-cpu data structures improves efficiency -- I can now much more
:% easily say "OK, what happens if eliminate the cost of locking for common
:% place mbuf allocation/free".  I've also started looking at per-interface
:% caches based on the same model, which has some similar limitations (but
:% also some similar benefits), such as stuffing per-interface uma caches
:% in struct ifnet.
:
:I.e., using per-thread UMA caches is a 30-60 minute hack that allows me to
:explore and measure the performance benefits (and costs) of several
:different approaches, including per-cpu, per-thread, and per-data
:structure/object caching without doing the full implementation up front.
:Per-thread caching, for example, can simulate the effects of
:non-preemption and mutex avoidance in micro-benchmarking, although in the
:general case under macro-benchmark perspective it suffers from a number of
:problems (including the draining, balancing, and extra storage cost
:issues).  I didn't attempt to address these problems under the assumption
:that the current implementation is a tool for exploring performance, not
:something to actually use.
:

    Well, I see some major problems with this avenue of development.  I
    see this as an end-run around existing, broken (or perceived to be
    broken) APIs.  If you don't believe that the slab allocator has severe
    performance issues then the slab allocator (aka malloc()) should
    simply be used directly.  If you do believe that the slab allocator
    has severe performance issues then the correct solution is to FIRST
    FIX THE SLAB ALLOCATOR.

    Until the slab allocator is fixed the system-wide overhead will skew
    the results from any other optimization tests you try to make.  i.e.
    results from other optimizations may appear to be less effective
    simply due to being washed out by the slab allocator's overhead.
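
    To make the discussion concrete, the per-thread front-end cache being
    talked about above boils down to something like the following.  This is
    a minimal, hypothetical sketch, not the actual UMA patch and not
    FreeBSD's struct thread; the names td_cache, TD_CACHE_DEPTH and
    OBJ_SIZE are made up, the C11 _Thread_local stub stands in for a field
    hung off the real thread structure, and malloc()/free() stand in for
    the slab allocator:

/*
 * Hypothetical sketch of a per-thread object cache: a small LIFO free
 * list hung off the thread itself, so the fast path needs no mutex and
 * no critical section.  All names are made up for illustration.
 */
#include <stdlib.h>

#define TD_CACHE_DEPTH	32	/* objects kept per thread */
#define OBJ_SIZE	256	/* assumes one fixed object size (e.g. mbufs) */

struct td_cache {
	void	*tc_obj[TD_CACHE_DEPTH];
	int	 tc_count;
};

/* Stand-in for a per-thread field; nothing ever locks it. */
static _Thread_local struct td_cache td_cache;

void *
td_obj_alloc(void)
{
	struct td_cache *tc = &td_cache;

	if (tc->tc_count > 0)
		return (tc->tc_obj[--tc->tc_count]);	/* lock-free hit */
	return (malloc(OBJ_SIZE));	/* miss: fall through to the allocator */
}

void
td_obj_free(void *obj)
{
	struct td_cache *tc = &td_cache;

	if (tc->tc_count < TD_CACHE_DEPTH) {
		tc->tc_obj[tc->tc_count++] = obj;	/* lock-free */
		return;
	}
	free(obj);	/* cache full: no draining or balancing in this sketch */
}

    The catch, which the quoted text already concedes, is that every thread
    carries its own storage and an object freed by one thread can never be
    reused by another, even on the same cpu -- the draining, balancing, and
    extra storage problems mentioned above, plus the locality issue
    discussed below.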
    In the same respect, the idea that a per-thread memory cache is going
    to be more efficient than a per-cpu memory cache implies that it is
    too expensive to implement the locking required for a per-cpu memory
    cache vs a per-thread memory cache.  If that is the implication, the
    solution is to fix the required locking.  Frankly, it should be no
    more expensive than a critical section, and a critical section to
    access per-cpu data should be no more than a nanosecond or two more
    expensive than access to a per-thread data structure.

    Per-thread caching APIs have major design hurdles to overcome.  I've
    already listed a few of them, but there are many, many more.  For
    example, locality of reference may seem to be a slam dunk, but you
    actually get better locality of reference with a per-cpu cache,
    especially when a thread migrates between cpus, but also because the
    cache is able to take advantage of and reuse a very recently used
    chunk of data that might have been freed by some other thread on the
    same cpu... so you get it even after a context switch (and you don't
    get that with a per-thread cache).  This is a case that can occur
    quite often, especially when interrupt threads are shipping data to
    protocol threads.

    I could go on and on, but I am not going to because I think it damn
    well ought to be obvious.  It seems clear to me that it makes little
    sense to spend time on a per-thread memory cache when a per-cpu memory
    cache is, just from an algorithmic point of view, going to be far more
    effective, far easier to manage, able to support a larger
    (deterministic) hysteresis without creating too much non-deterministic
    slop, and so on and so forth.

    If this is just an experiment, and therefore something that will never
    be committed, then I still don't understand why you are even wasting
    time working on it when you could be fixing the slab allocator
    instead.  IMHO, a per-thread memory cache would not even come close to
    characterizing the performance benefits of a mutexless per-cpu cache.
    While the base overhead is similar, the side effects and management
    requirements are going to be very different.  If the experiment is
    supposed to characterize the potential performance improvement over
    the slab allocator, then I believe it is already too flawed to be an
    accurate measure of that.

    Now, does this mean that caches in front of the slab allocator are
    bad?  No, it doesn't.  I use front-end caches in several places in
    DragonFly and you definitely want to do the same thing in FreeBSD.
    The difference, I think, is that we only use front-end caches for
    performance-critical subsystems, and when we do, we implement the code
    directly in the subsystem.  We depend on our core slab allocator (i.e.
    malloc()) far more than you seem to want to depend on yours.  FreeBSD
    seems to depend on its UMA type-stable zone allocator for the same
    thing, but that has a lot of extra, unnecessary overhead.  The MBUF
    allocator is a good example of this.  You guys are making several
    extra procedural calls in the mbuf allocation path that we aren't.

    The way I see it, if it's important enough to require its own
    front-end cache over simply using the slab allocator, then you might
    as well go whole hog.  But it is also important to make your slab
    allocator fast so there are fewer situations where you feel that you
    need to bypass it.  A front-end cache is not supposed to be a
    workaround for a slow slab allocator; it is supposed to be an
    ultra-optimized implementation that reduces overhead in a critical
    subsystem.
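
    For comparison, the per-cpu flavor argued for above is nearly the same
    amount of code; the only addition on the fast path is the critical
    section.  Again a minimal, hypothetical sketch -- crit_enter(),
    crit_exit() and mycpuid() are stand-ins for whatever primitives the
    kernel actually provides (stubbed out here so the fragment is
    self-contained), and malloc()/free() stand in for the slab allocator:

/*
 * Hypothetical sketch of a per-cpu object cache in front of the slab
 * allocator.  The fast path takes no mutex; it only needs to prevent
 * preemption and migration while it touches the current cpu's free list.
 */
#include <stdlib.h>

#define MAXCPU		 16
#define PCPU_CACHE_DEPTH 32	/* deterministic per-cpu hysteresis */
#define OBJ_SIZE	 256	/* assumes one fixed object size (e.g. mbufs) */

struct pcpu_cache {
	void	*pc_obj[PCPU_CACHE_DEPTH];
	int	 pc_count;
};

static struct pcpu_cache pcpu_cache[MAXCPU];

/* Stubs: in a kernel these would disable preemption and read the cpu id. */
static void crit_enter(void) { }
static void crit_exit(void) { }
static int  mycpuid(void) { return (0); }

void *
pcpu_obj_alloc(void)
{
	struct pcpu_cache *pc;
	void *obj = NULL;

	crit_enter();			/* no mutex, just no preemption */
	pc = &pcpu_cache[mycpuid()];	/* per-cpu pointer fetched once */
	if (pc->pc_count > 0)
		obj = pc->pc_obj[--pc->pc_count];	/* LIFO: cache-hot */
	crit_exit();

	if (obj == NULL)
		obj = malloc(OBJ_SIZE);	/* miss: fall through to the slab */
	return (obj);
}

void
pcpu_obj_free(void *obj)
{
	struct pcpu_cache *pc;

	crit_enter();
	pc = &pcpu_cache[mycpuid()];
	if (pc->pc_count < PCPU_CACHE_DEPTH) {
		pc->pc_obj[pc->pc_count++] = obj;
		obj = NULL;
	}
	crit_exit();

	if (obj != NULL)
		free(obj);		/* overflow goes back to the slab */
}

    The LIFO reuse is what gives the locality win described above: an
    object just freed by an interrupt thread on a cpu is the first thing
    handed out to the protocol thread that runs next on the same cpu, and
    a thread that migrates simply starts using the new cpu's list.  The
    critical section is the only extra per-path cost being argued about
    here.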
:In doing so, my hope was to identify which areas will offer the most
:immediate performance benefits, be it simply cutting down on costly
:operations (such as the entropy harvesting code for Yarrow which appears
:to have found its way into our interrupt path), rethinking locking
:strategies, optimizing out/coalescing locking, optimizing out excess
:memory allocation, optimizing synchronization primitives with the same
:semantics, changing synchronization assumptions to offer weaker/stronger
:semantics, etc.
:...
:
:Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
:robert@fledge.watson.org      Principal Research Scientist, McAfee Research

    Well, I guess identifying the problem areas is good, but it isn't
    rocket science.  It should be glaringly obvious without requiring all
    that much actual testing.  One big problem you face is that many of
    the performance issues in FreeBSD do not come from any single
    subsystem, but from overheads sprinkled throughout the entire
    codebase.  Taken singly, these overheads are not significant.  A
    single mutex could be argued to be not all that significant... but
    having to obtain and release 11 mutexes in a code path *IS*
    significant.  A single memory allocation might not be significant,
    but having to make three or four in a critical path can be.  And so
    on, and so forth.

    Testing performance with little tweaks here and there is not going to
    give you any worthwhile results, IMHO.  Just fixing one thing isn't
    going to solve the performance problem.  You have to fix the entire
    path.  It isn't JUST the mutex overhead that's the problem.  It's the
    mutex overhead, the scheduler overhead, all the myriad calls you have
    to make to pin the thread or enter a critical section (if you have to
    do it too often)... it's the atomic-access requirement on the per-cpu
    %fs:globaldata data which makes it impossible to cache per-cpu data.
    It's the 4BSD/ULE scheduler being used to schedule kernel threads,
    it's the thread switching overhead, the microtime calls in the switch
    path, the coding requirements to deal with preemption, cpu migration,
    giant lock handling, the uma zone allocator's callback API, and a
    dozen other things.  Taken singly, these items produce a fractional
    degradation.  Taken singly, these items aren't necessarily even
    'bad'.  Taken together, you have... well, you have the situation you
    find yourself in now.

    To really fix the problem you need to be willing to clean up ALL of
    these subsystems.  Well, first you need to recognize that they all
    need to be cleaned up, and I gather that some FreeBSD developers
    still don't recognize that as the problem.  Once you recognize that
    it's a problem, you then need to go and do the work.  Cleaning up the
    subsystems is only the first step... it gives you a nice, solid,
    reasonably high-performing base to work with.  You still have to deal
    with the higher-level critical-path issues once you've cleaned up the
    subsystems.  This is where things like front-end caches really show
    their stuff.

					-Matt
					Matthew Dillon