From owner-freebsd-current@FreeBSD.ORG  Tue Aug 24 15:38:54 2004
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id 8D4C916A4CE; Tue, 24 Aug 2004 15:38:54 +0000 (GMT)
Received: from saturn.criticalmagic.com (saturn.criticalmagic.com
	[64.74.124.105])	by mx1.FreeBSD.org (Postfix) with ESMTP
	id 4BDED43D3F; Tue, 24 Aug 2004 15:38:54 +0000 (GMT)
	(envelope-from rcoleman@criticalmagic.com)
Received: from [172.16.0.202] (c-24-99-11-35.atl.client2.attbi.com
	[24.99.11.35])
	by saturn.criticalmagic.com (Postfix) with ESMTP id 58EF33BD21;
	Tue, 24 Aug 2004 11:38:52 -0400 (EDT)
Message-ID: <412B618F.2000305@criticalmagic.com>
Date: Tue, 24 Aug 2004 11:41:03 -0400
From: Richard Coleman <rcoleman@criticalmagic.com>
User-Agent: Mozilla Thunderbird 0.7.3 (Windows/20040803)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Robert Watson <rwatson@FreeBSD.org>
References: <Pine.NEB.3.96L.1040824102109.89999C-100000@fledge.watson.org>
In-Reply-To: <Pine.NEB.3.96L.1040824102109.89999C-100000@fledge.watson.org>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
cc: current@FreeBSD.org
Subject: Re: Running the network stack without Giant -- change in default
 coming
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
X-List-Received-Date: Tue, 24 Aug 2004 15:38:54 -0000

Very very cool.  It's exciting to see many of the long-term
FreeBSD projects coming together like this.

Richard Coleman
rcoleman@criticalmagic.com

Robert Watson wrote:
> For some time, one of the major goals of the FreeBSD Project has been
> to allow the network stack to run in parallel on multiple processors
> at a time.  Per my July 19, 2004 post to the freebsd-current mailing
> list, much of this support has now been merged to the FreeBSD
> 5-CURRENT branch (and now 6-CURRENT), with the intent of shipping
> this support in 5.3.  And, per that post, it's now possible to run
> large parts of the network stack in this manner through the use of a
> system tunable at boot, debug.mpsafenet. This can result in a variety
> of performance benefits, especially on SMP, by improving concurrency
> and reducing latency.  While it presents a "first cut" locking
> strategy, these benefits are still pretty tangible, and the resulting
> system is an excellent starting architecture for a broad range of
> performance work.
> 
> Right now, that tunable "debug.mpsafenet" defaults to off (0) in the 
> 5-CURRENT and 6-CURRENT branches.  However, this will shortly change
> in 6-CURRENT to on (1), as most commonly exercised parts of the
> network stack are now ready for testing in this environment.  Some
> caveats before I go into the details as to how to determine whether
> this is right for you:
> 
> - While we've been doing pretty heavy testing in MPSAFE
> configurations, the nature of multiprocessor development and adapting
> code for MP safety means that it's unlikely this will "just work" for
> every last person who tries it.  However, it appears to work well in
> a broad variety of environments and with fairly strenuous testing.
> 
> - We've focussed primarily on getting mainstream network
> configurations to run without Giant: this means that less mainstream
> subsystems (parts of IPv6, some netgraph nodes, IPX, etc) are
> currently unsafe without the Giant lock turned on.  Less mainstream
> network devices, even if their device drivers cannot run without the
> Giant lock, can still operate with the rest of the stack running
> without Giant, thanks to compatibility code.  This code comes
> with a performance penalty beyond just running with the Giant lock,
> so there is a strong motivation to complete locking for these
> straggling drivers.
> 
> - You may run into hard to diagnose problems.  We'd like to try to 
> diagnose them anyway, but if you start to experience new problems, 
> you'll want to go read the Handbook chapter on preparing kernel bug 
> reports and diagnosing problems.  You'll also want to be prepared to
> run the system with INVARIANTS and WITNESS turned on.  The first step
> in debugging will be to try running with Giant turned back on by
> changing the debug.mpsafenet flag and seeing if the problem can be
> reproduced. Details below.
> 
> - Not all workloads will experience a performance benefit -- some,
> for various reasons, will get worse.  However, several interesting 
> performance loads get measurably better.  If you don't see an 
> improvement, or you see things get worse, please don't be surprised
> -- you may want to look at some of the suggestions I make below on
> ways to make the results more predictable.  Generally, you shouldn't
> see substantial performance degradation, if any, but it can't be
> ruled out, especially due to outstanding scheduler issues that are
> being worked on.
> 
> - We can and will destroy your data.  We don't mean to, because we
> like your data (and you!), and we try not to, but this is, after all,
>  operating system development, and comes with risks.
> 
> With this in mind, now is a good time to increase exposure for these
>  changes, because they will become the default in the near future.
> 
> Here's some technical information on how to get started:
> 
> (1) Determine if all of the stack components you will operate with
> are MPsafe.  For common configurations, answering the following
> questions will help you decide this:
> 
> - Are you actively using IPv6, IPX, ATM, or KAME IPSEC?  If you
> are using any of these, it is not yet safe for you to
> run without Giant.  Note that most use of IPv6 is safe, but there are
> some areas (multicast) that are not entirely safe yet.
> 
> - Are you using Netgraph?  If yes, it may be that you are not yet
> able to run without Giant.  The framework and many nodes are MPSAFE,
> but some remain that are not.  It is worth giving it a try, but you
> may experience panics, etc, especially in MP configurations.
> 
> - Are you using SLIP or kernel PPP (not to be confused with user ppp,
> which is what most FreeBSD users use with modems)?  If so, there are
> experimental patches to make SLIP safe, but out of the box you may
> see lock assertion failures.  We are working to resolve this issue.
> 
> - Are you using any physical network interfaces other than the 
> following: ath, bge, dc, em, ep, fxp, rl, sis, xl, wi?  If so, you
> may see a performance drop.
> 
> NOTE: Do you maintain a network interface driver?  Is it not on this
> list?  Shame on you!  Or maybe shame on me for not listing it, even
> though it should work.  Drop me a private e-mail with any questions
> or comments.  Please update the busdma driver status web page with
> your driver's status.
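
As a rough illustration of the interface check in (1), here is a minimal
Python sketch.  The driver list is copied from the post; the helper name and
the assumption that an interface name is a driver name followed by a unit
number (em0, fxp1) are mine and not part of the announcement.

```python
import re

# Drivers the post lists as able to run without Giant (as of August 2004).
GIANT_FREE_DRIVERS = {"ath", "bge", "dc", "em", "ep", "fxp", "rl", "sis", "xl", "wi"}


def interfaces_needing_giant(interface_names):
    """Return the interfaces whose driver is not on the list above.

    Hypothetical helper: assumes names look like <driver><unit>, e.g. "em0".
    """
    flagged = []
    for name in interface_names:
        driver = re.sub(r"\d+$", "", name)  # strip the trailing unit number
        if driver not in GIANT_FREE_DRIVERS:
            flagged.append(name)
    return flagged
```

Feeding it the names printed by `ifconfig -l` should work, but note that
pseudo-interfaces such as lo0 get flagged too, so filter the input down to
physical interfaces first.
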
> 
> (2) If you are comfortable that you are using an MPSAFE-supported 
> configuration, then you can use the following tunable in loader.conf 
> to disable the Giant lock over the network stack on your system:
> 
> debug.mpsafenet="1"
> 
> Note that this is a boot-time only flag; you can inspect the setting 
> with a sysctl, but it cannot currently be changed at runtime.  You 
> will need to reboot for the change to take effect.
> 
> Once the default has changed, it will be necessary to explicitly 
> disable Giant-free networking if that is the desired operating mode.
>  Specifically, you will need to place the following in loader.conf to
>  get that mode of operation:
> 
> debug.mpsafenet="0"
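
To double-check which mode a given loader.conf would select, something like
the following Python sketch may help.  It is only an approximation of how the
loader reads the file: I am assuming plain name="value" lines, '#' comments,
and that the last assignment wins.  The function name is hypothetical, and
`default` stands for whatever the kernel's built-in default is on your branch.

```python
def mpsafenet_setting(loader_conf_text, default):
    """Return the debug.mpsafenet value a loader.conf text would select.

    Assumptions (mine, not from the post): simple name="value" lines,
    '#' starts a comment, and the last assignment wins.
    """
    value = default
    for line in loader_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue
        name, _, raw = line.partition("=")
        if name.strip() == "debug.mpsafenet":
            value = int(raw.strip().strip('"'))
    return value
```

After a reboot, the value actually in effect can be read back with the sysctl
mentioned in the post.
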
> 
> Some notes:
> 
> On SMP-centric performance measurements, such as local UNIX domain
> socket use by MySQL on MP systems, I've observed 30%-40% performance
> improvements by disabling Giant (some details below).  My recommended
> configuration for testing out the impact of disabling Giant on MP
> systems is:
> 
> - Running with adaptive mutexes (now the default) and with
> ADAPTIVE_GIANT (also now the default) appears to make a big
> difference.
> 
> - Try disabling HTT.  In my workloads, which tend to pound the
> kernel, HTT appears to hurt quite a bit.  Obviously, the
> effectiveness of HTT depends on the instruction mix, so this may not
> be for you.  Builds, for example, may benefit.
> 
> - Pick one of ULE and 4BSD, and then try the other.  I found 4BSD
> helped a lot for MySQL, but I've seen other benchmarks with quite
> different results.
> 
> - For stability purposes with MySQL, I currently have to disable 
> PREEMPTION (currently the default), as the MySQL benchmarks I use are
>  pretty thread-centric and trigger preemption-related bugs with the 
> kernel threading bits.  Recent work-arounds committed should resolve 
> this but I have not yet run stability tests.
> 
> - If you want to measure performance, make sure to disable
> INVARIANTS, INVARIANTS_SUPPORT, WITNESS, etc.  Also, confirm that the
> userland malloc debugging features are disabled, as they add cost to
> each free() operation.  I believe we now have a handbook with a
> variety of recommendations on performance measurement, such as
> disabling various daemons (such as dhclient, etc).  For latency
> measurements, PREEMPTION is generally desired, subject to stability.
> 
> - To increase parallelism, especially for inbound packet paths on
> multiple interfaces, set the sysctl/tunable net.isr.enable=1, which
> enables direct dispatch in network interface ithreads, rather than
> deferring to the netisr thread.  If each interface is assigned a
> different ithread, their inbound processing paths can run in
> parallel, as well as with loopback traffic running in the global
> netisr thread.  We have additional work to do here in terms of
> increasing the chances of parallel dispatch, etc, and it could be
> that in some environments this is not a useful setting. I'd be interested in
> learning about the environments where a negative performance impact
> is measured.
> 
> Some notes on bug reporting:
> 
> - Make sure to identify that you are running with debug.mpsafenet on.
> If the problem is reproducible, make sure to indicate if it goes
> away or persists when you disable debug.mpsafenet.  This will help to
>  distinguish network stack problems which are (and are not) a result
> of this work.
> 
> - If you appear to be experiencing a hang/deadlock, please try
> running with WITNESS.  I'd actually like to see most people running
> with WITNESS for a bit to shake out lock order issues, as I've
> introduced a lot of new lock orders.  If experiencing lock order reversals,
> please include the full console warning including stack trace and any
> warning messages prior to the trace identifying locks, etc.  If
> dropped to DDB, "show locks" is useful.
> 
> - INVARIANTS is also considered good.  Even if you aren't running with
> WITNESS, do run with INVARIANTS.  Note that there is a measurable 
> performance hit for doing so.
> 
> - If you experience a hang, see if you can get into DDB -- if you are
>  having problems getting in using a console break, try a serial
> console. When debugging, please capture at minimum the DDB 'ps' output,
> along with traces of interesting processes.  Typically interesting will be
> processes that appear to be involved in the hang, etc.  Obviously,
> this requires some intuition about what causes the hang and I can't
> offer hard and fast rules here.  NMI, SW_WATCHDOG, and MP_WATCHDOG
> can all increase the chances of getting to DDB even in hard hangs.
> 
> - Experimenting with debug.mpsafenet=1 and UP is also interesting,
> not just SMP.  With PREEMPTION turned on, it may result in lower
> latency and/or lower throughput.  Or not.  Regardless, it's
> interesting -- you don't have to have SMP to give it a spin.
> 
> FYI, while results can and will vary, I was pleased to observe moving
> from a UP->MP speedup of 1.07 on a dual-processor box to a speedup of
> 1.42 with the supersmack benchmark using 11 workers and 1000 select
> transactions with MySQL.  For reference, that was with the 4BSD
> scheduler and adaptive mutexes.  For loopback netperf with TCP and
> UDP, I observed no change in performance (well, 1% better for UDP RR,
> but basically no change).  Note that the MySQL benchmark here is
> basically a UNIX domain socket IPC test, and so real world databases
> will give pretty different results since they won't be pure IPC.  The
> results appear to be very sensitive to the choice of scheduler, and
> for a variety of reasons I've preferred 4BSD during recent testing
> (not least, better results in terms of throughput).
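
For anyone wanting to relate the two speedup figures to the 30%-40% range
mentioned earlier, here is a small worked calculation in Python.  It rests on
my assumption that both UP->MP speedups (1.07 and 1.42, which I read as with
and without Giant) are measured against the same uniprocessor baseline, so
the baseline cancels out.  The function name is made up for illustration.

```python
def mp_gain(speedup_before, speedup_after):
    """Relative MP throughput gain implied by two UP->MP speedup figures.

    Assumes (my assumption) that both speedups share the same uniprocessor
    baseline, so the ratio of the speedups equals the ratio of MP throughput.
    """
    return speedup_after / speedup_before - 1.0


# Figures quoted in the post: 1.07 before, 1.42 after.
print(f"{mp_gain(1.07, 1.42):.1%}")  # prints 32.7%
```

Under that assumption the dual-processor result improves by roughly a third,
which sits inside the 30%-40% range quoted for the MySQL tests.
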
> 
> There are a lot of people who have been working on this for quite
> some time -- I can't thank them all here, but I will point at the
> netperf web page as a place to look for ongoing patches, change logs,
> and some credits:
> 
> http://www.watson.org/~robert/freebsd/netperf/
> 
> The hard work and contributions of these many developers over several
> years are finally coming to fruition!  I try to keep the page up to date
> about once a week or so as I drop new patch sets.  There's also an
> RSS feed on the change log, which is fairly technical but might be
> interesting to some readers.
> 
> Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects 
> robert@fledge.watson.org      Principal Research Scientist, McAfee Research