Message-ID: <412B618F.2000305@criticalmagic.com>
Date: Tue, 24 Aug 2004 11:41:03 -0400
From: Richard Coleman
To: Robert Watson
cc: current@FreeBSD.org
Subject: Re: Running the network stack without Giant -- change in default coming
List-Id: Discussions about the use of FreeBSD-current

Very very cool. It's exciting to see many of the long term FreeBSD
projects coming together like this.

Richard Coleman
rcoleman@criticalmagic.com

Robert Watson wrote:
> For some time, one of the major goals of the FreeBSD Project has been
> to allow the network stack to run in parallel on multiple processors
> at a time. Per my July 19, 2004 post to the freebsd-current mailing
> list, much of this support has now been merged to the FreeBSD
> 5-CURRENT branch (and now 6-CURRENT), with the intent of shipping
> this support in 5.3.
> And, per that post, it's now possible to run
> large parts of the network stack in this manner through the use of a
> system tunable at boot, debug.mpsafenet. This can result in a variety
> of performance benefits, especially on SMP, by improving concurrency
> and reducing latency. While it presents a "first cut" locking
> strategy, these benefits are still pretty tangible, and the resulting
> system is an excellent starting architecture for a broad range of
> performance work.
>
> Right now, that tunable "debug.mpsafenet" defaults to off (0) in the
> 5-CURRENT and 6-CURRENT branches. However, this will shortly change
> in 6-CURRENT to on (1), as most commonly exercised parts of the
> network stack are now ready for testing in this environment. Some
> caveats before I go into the details as to how to determine whether
> this is right for you:
>
> - While we've been doing pretty heavy testing in MPSAFE
>   configurations, the nature of multiprocessor development and adapting
>   code for MP safety means that it's unlikely this will "just work" for
>   every last person who tries it. However, it appears to work well in
>   a broad variety of environments and with fairly strenuous testing.
>
> - We've focussed primarily on getting mainstream network
>   configurations to run without Giant: this means that less mainstream
>   subsystems (parts of IPv6, some netgraph nodes, IPX, etc) are
>   currently unsafe without the Giant lock turned on. Less mainstream
>   network devices, even if the device drivers are not able to run
>   without the Giant lock, are able to operate without Giant over the
>   remainder of the stack due to compatibility code. This code comes
>   with a performance penalty beyond just running with the Giant lock,
>   so there is a strong motivation to complete locking for these
>   straggling drivers.
>
> - You may run into hard to diagnose problems.
>   We'd like to try to
>   diagnose them anyway, but if you start to experience new problems,
>   you'll want to go read the Handbook chapter on preparing kernel bug
>   reports and diagnosing problems. You'll also want to be prepared to
>   run the system with INVARIANTS and WITNESS turned on. The first step
>   in debugging will be to try running with Giant turned back on by
>   changing the debug.mpsafenet flag and seeing if the problem can be
>   reproduced. Details below.
>
> - Not all workloads will experience a performance benefit -- some,
>   for various reasons, will get worse. However, several interesting
>   performance loads get measurably better. If you don't see an
>   improvement, or you see things get worse, please don't be surprised
>   -- you may want to look at some of the suggestions I make below on
>   ways to make the results more predictable. Generally, you shouldn't
>   see substantial performance degradation, if any, but it can't be
>   ruled out, especially due to outstanding scheduler issues that are
>   being worked on.
>
> - We can and will destroy your data. We don't mean to, because we
>   like your data (and you!), and we try not to, but this is, after all,
>   operating system development, and comes with risks.
>
> With this in mind, now is a good time to increase exposure for these
> changes, because they will become the default in the near future.
>
> Here's some technical information on how to get started:
>
> (1) Determine if all of the stack components you will operate with
> are MPSAFE. For common configurations, answering the following
> questions will help you decide this:
>
> - Are you actively using IPv6, IPX, ATM, or KAME IPSEC? If you
>   answered yes to any of these questions, it is not yet safe for you to
>   run without Giant. Note that most use of IPv6 is safe, but there are
>   some areas (multicast) that are not entirely safe yet.
>
> - Are you using Netgraph? If yes, it may be that you are not yet
>   able to run without Giant.
>   The framework and many nodes are MPSAFE,
>   but some remain that are not. It is worth giving it a try, but you
>   may experience panics, etc, especially in MP configurations.
>
> - Are you using SLIP or kernel PPP (not to be confused with user ppp,
>   which is what most FreeBSD users use with modems)? If so, there are
>   experimental patches to make SLIP safe, but out of the box you may
>   see lock assertion failures. We are working to resolve this issue.
>
> - Are you using any physical network interfaces other than the
>   following: ath, bge, dc, em, ep, fxp, rl, sis, xl, wi? If so, you
>   may see a performance drop.
>
> NOTE: Do you maintain a network interface driver? Is it not on this
> list? Shame on you! Or maybe shame on me for not listing it, even
> though it should work. Drop me a private e-mail with any questions
> or comments. Please update the busdma driver status web page with
> your driver's status.
>
> (2) If you are comfortable that you are using an MPSAFE-supported
> configuration, then you can use the following tunable in loader.conf
> to disable the Giant lock over the network stack on your system:
>
>     debug.mpsafenet="1"
>
> Note that this is a boot-time only flag; you can inspect the setting
> with a sysctl, but it cannot currently be changed at runtime. You
> will need to reboot for the change to take effect.
>
> Once the default has changed, it will be necessary to explicitly
> disable Giant-free networking if that is the desired operating mode.
> Specifically, you will need to place the following in loader.conf to
> get that mode of operation:
>
>     debug.mpsafenet="0"
>
> Some notes:
>
> On SMP-centric performance measurements, such as local UNIX domain
> socket use by MySQL on MP systems, I've observed 30%-40% performance
> improvements by disabling Giant (some details below).
> My recommended configuration for testing out the impact of disabling
> Giant on MP systems is:
>
> - Running with adaptive mutexes (now the default) and with
>   ADAPTIVE_GIANT (also now the default) appears to make a big
>   difference.
>
> - Try disabling HTT. In my workloads, which tend to pound the
>   kernel, HTT appears to hurt quite a bit. Obviously, the
>   effectiveness of HTT depends on the instruction mix, so this may not
>   be for you. Builds, for example, may benefit.
>
> - Pick one of ULE and 4BSD, and then try the other. I found 4BSD
>   helped a lot for MySQL, but I've seen other benchmarks with quite
>   different results.
>
> - For stability purposes with MySQL, I currently have to disable
>   PREEMPTION (currently the default), as the MySQL benchmarks I use are
>   pretty thread-centric and trigger preemption-related bugs with the
>   kernel threading bits. Recent work-arounds committed should resolve
>   this but I have not yet run stability tests.
>
> - If you want to measure performance, make sure to disable
>   INVARIANTS, INVARIANTS_SUPPORT, WITNESS, etc. Also, confirm that the
>   userland malloc debugging features are disabled, as they add cost to
>   each free() operation. I believe we now have a handbook with a
>   variety of recommendations on performance measurement, such as
>   disabling various daemons (such as dhclient, etc). For latency
>   measurements, PREEMPTION is generally desired, subject to stability.
>
> - To increase parallelism, especially for inbound packet paths on
>   multiple interfaces, set the sysctl/tunable net.isr.enable=1, which
>   enables direct dispatch in network interface ithreads, rather than
>   deferring to the netisr thread. If each interface is assigned a
>   different ithread, their inbound processing paths can run in
>   parallel, as well as with loopback traffic running in the global
>   netisr thread.
>   We have additional work to do here in terms of
>   increasing the chances of parallel dispatch, etc, and it could be
>   that in some environments this is not a useful setting. I'd be
>   interested in learning about the environments where a negative
>   performance impact is measured.
>
> Some notes on bug reporting:
>
> - Make sure to identify that you are running with debug.mpsafenet on.
>   If the problem is reproducible, make sure to indicate if it goes
>   away or persists when you disable debug.mpsafenet. This will help to
>   distinguish network stack problems which are (and are not) a result
>   of this work.
>
> - If you appear to be experiencing a hang/deadlock, please try
>   running with WITNESS. I'd actually like to see most people running
>   with WITNESS for a bit to shake out lock order issues, as I've
>   introduced a lot of lock orders. If experiencing lock order reversals,
>   please include the full console warning including stack trace and any
>   warning messages prior to the trace identifying locks, etc. If
>   dropped to DDB, "show locks" is useful.
>
> - INVARIANTS is also considered good. Even if you aren't running with
>   WITNESS, do run with INVARIANTS. Note that there is a measurable
>   performance hit for doing so.
>
> - If you experience a hang, see if you can get into DDB -- if you are
>   having problems getting in using a console break, try a serial
>   console. When debugging, collect at minimum DDB 'ps' output, along
>   with traces of interesting processes. Typically interesting will be
>   processes that appear to be involved in the hang, etc. Obviously,
>   this requires some intuition about what causes the hang and I can't
>   offer hard and fast rules here. NMI, SW_WATCHDOG, and MP_WATCHDOG
>   can all increase the chances of getting to DDB even in hard hangs.
>
> - Experimenting with debug.mpsafenet=1 and UP is also interesting,
>   not just SMP. With PREEMPTION turned on, it may result in lower
>   latency and/or lower throughput. Or not.
>   Regardless, it's
>   interesting -- you don't have to have SMP to give it a spin.
>
> FYI, while results can and will vary, I was pleased to observe moving
> from a UP->MP speedup of 1.07 on a dual-processor box to a speedup of
> 1.42 with the supersmack benchmark using 11 workers and 1000 select
> transactions with MySQL. For reference, that was with the 4BSD
> scheduler and adaptive mutexes. For loopback netperf with TCP and
> UDP, I observed no change in performance (well, 1% better for UDP RR,
> but basically no change). Note that the MySQL benchmark here is
> basically a UNIX domain socket IPC test, and so real world databases
> will give pretty different results since they won't be pure IPC. The
> results appear to be very sensitive to the choice of scheduler, and
> for a variety of reasons I've preferred 4BSD during recent testing
> (not least, better results in terms of throughput).
>
> There are a lot of people who have been working on this for quite
> some time -- I can't thank them all here, but I will point at the
> netperf web page as a place to look for ongoing patches, change logs,
> and some credits:
>
>     http://www.watson.org/~robert/freebsd/netperf/
>
> The hard work and contributions of these many developers over several
> years are finally coming to fruition! I try to keep the page up to
> date about once a week or so as I drop new patch sets. There's also an
> RSS feed on the change log, which is fairly technical but might be
> interesting to some readers.
>
> Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> robert@fledge.watson.org      Principal Research Scientist, McAfee Research
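
For anyone wanting to check which mode a machine is set up for before
rebooting, here is a minimal sketch in Python that reads loader.conf text
and reports the debug.mpsafenet setting described in the quoted message.
The helper name mpsafenet_setting and its default argument are
hypothetical and not part of FreeBSD; only the tunable name and its
"0"/"1" values come from the message. It also assumes that a later
assignment overrides an earlier one, which may not match every setup.

```python
import re

# Hypothetical helper, not part of FreeBSD. Only the tunable name
# debug.mpsafenet and its "0"/"1" values come from the quoted message.
_TUNABLE = re.compile(r'^\s*debug\.mpsafenet\s*=\s*"?([01])"?\s*(?:#.*)?$')


def mpsafenet_setting(loader_conf_text, default=None):
    """Return 1 or 0 for the last debug.mpsafenet line found, else default."""
    value = default
    for line in loader_conf_text.splitlines():
        match = _TUNABLE.match(line)
        if match:
            # Assumption: a later assignment overrides an earlier one.
            value = int(match.group(1))
    return value


# Commented-out lines are ignored; the active line selects Giant-free mode.
sample = 'autoboot_delay="3"\n# debug.mpsafenet="0"\ndebug.mpsafenet="1"\n'
print(mpsafenet_setting(sample))
```

Passing the kernel's built-in default as the default argument (0 today,
1 once the change described above lands in 6-CURRENT) gives the effective
mode when loader.conf does not mention the tunable at all.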