Date: Wed, 8 Sep 2004 21:45:27 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200409090445.i894jRei071606@apollo.backplane.com>
To: Robert Watson
Cc: Gerrit Nagelhout, current@freebsd.org, slong@freebsd.org
Subject: Re: FreeBSD 5.3 Bridge performance take II

:In the rwatson_umaperthread branch, what I've done is started to associate
:struct uma_cache structures with threads.  Since caches are "per-zone", I
:allow threads to register for zones of interest; these caches are hung off
:of struct thread, and must be explicitly registered and released.  While
:..
:
:In practice, this eliminates mutex acquisition for mbuf allocation and
:free in the forwarding and bridging paths, and halves the number of
:operations when interacting with user threads (as they don't have the
:..
:
:My interest in looking at per-thread caches was to explore ways in which
:to reduce the cost of zone allocation without making modifications to our
:synchronization model.  It has been proposed that a better way to achieve
:...
:
:Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
:robert@fledge.watson.org      Principal Research Scientist, McAfee Research

    I would recommend against per-thread caches.  Instead, make the per-cpu
    caches actually *be* per-cpu (that is, not require a mutex).  This is
    what I do in DragonFly's Slab allocator.  For the life of me I just
    don't understand why one would spend so much effort creating a per-cpu
    caching subsystem and then slap a mutex right smack in the middle of
    the critical allocation and deallocation paths.

    Non-critical operations, such as high-level zone management, can be
    done passively (in DragonFly's case through IPI messaging which, when
    I get to it, can be queued passively rather than actively), or by a
    helper thread which migrates to the cpu whose cache it needs to
    operate on, does its stuff, then migrates to the next cpu, or by any
    number of other clever mechanisms, none of which require a brute-force
    mutex to access the data.

    I use this cpu migration trick for a number of things in DragonFly.
    Jeff and I use it for wildcard pcb registration (which is replicated
    across cpus).  The thread list sysctl code collects per-cpu thread
    data by iterating through the cpus (migrating the thread to each cpu
    to collect the data and then ending up on the cpu it began on before
    returning to user mode).
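    A rough sketch of what "actually be per-cpu" means for the allocation
    fast path (names here are illustrative only, not actual DragonFly or
    FreeBSD code: crit_enter()/crit_exit() stand in for whatever blocks
    preemption and cpu migration, mycpuid for the current cpu's index, and
    the MAXCPU/slow-path pieces are made up for the example):

	/*
	 * Sketch only: crit_enter()/crit_exit() and mycpuid are assumed
	 * kernel primitives (block preemption/migration, current cpu id).
	 */
	#define MAXCPU	16			/* illustrative sizing only */

	struct pcpu_cache {
		void	*freelist;		/* free objects, this cpu only */
		int	 count;
	};

	struct zone {
		struct pcpu_cache cache[MAXCPU];
		/* ... global depot, object size, etc ... */
	};

	void	*zone_alloc_slow(struct zone *);	/* refill from depot */

	void *
	zone_alloc(struct zone *z)
	{
		struct pcpu_cache *pc;
		void *obj;

		crit_enter();			/* no preemption, no migration */
		pc = &z->cache[mycpuid];	/* this cpu's private cache */
		obj = pc->freelist;
		if (obj != NULL) {
			pc->freelist = *(void **)obj;	/* pop head */
			pc->count--;
			crit_exit();
			return (obj);		/* common case: no mutex at all */
		}
		crit_exit();
		/* refill path runs off the critical path; may use IPIs or locks */
		return (zone_alloc_slow(z));
	}

    The point is that the fast path only ever touches the cache of the cpu
    it is currently running on, so a critical section is all the
    synchronization it needs.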
    Basically, any non-critical-path operation can use this trick in order
    to allow the real critical path -- the actual packet traffic -- to
    operate without mutexes.

    So, instead of adding more hacks, please just *fix* the slab allocator
    in FreeBSD-5.  You will find that a lot of things you were
    contemplating writing additional subsystems for will suddenly work
    (and work very efficiently) by just calling the slab allocator
    directly.

    The problem with per-thread caching is that you greatly increase the
    amount of waste in the system.  If you have 50 threads, each with its
    own per-thread cache and a hysteresis of, say, 32 allocations, you
    wind up with 50*32 = 1600 allocations worth of potential waste.  In
    the per-cpu case the slop is a lot more deterministic (since the
    number of cpus is a fixed, known quantity).

    Another problem with per-thread caching is that it greatly reduces
    performance in certain common allocation cases... in particular, the
    case where data is allocated by one subsystem (say, an interrupt
    thread) and freed by another subsystem (say, a protocol thread or
    other consumer).  This sort of problem is a lot easier to fix with a
    per-cpu cache organization (see the free-path sketch below) and a lot
    harder to fix with a per-thread cache organization.

					-Matt
					Matthew Dillon
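    As an illustrative counterpart to the zone_alloc() sketch above (same
    made-up names; the hysteresis limit and slow-path helper are assumed
    for the example), a per-cpu free path shows why the "allocated in one
    subsystem, freed in another" case is easy to handle: the object simply
    goes onto the cache of whatever cpu the freeing thread happens to be
    running on, regardless of which thread allocated it, and any surplus is
    handed back to the global depot off the critical path.

	#define PCPU_CACHE_MAX	32	/* illustrative hysteresis limit */

	void	zone_free_slow(struct zone *, struct pcpu_cache *);

	void
	zone_free(struct zone *z, void *obj)
	{
		struct pcpu_cache *pc;

		crit_enter();			/* no preemption, no migration */
		pc = &z->cache[mycpuid];
		*(void **)obj = pc->freelist;	/* push onto this cpu's list */
		pc->freelist = obj;
		if (++pc->count > PCPU_CACHE_MAX)
			zone_free_slow(z, pc);	/* return surplus to depot */
		crit_exit();
	}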