From: "Gerrit Nagelhout" <gnagelhout@sandvine.com>
To: freebsd-current@freebsd.org, "Scott Long", "Robert Watson"
Cc: Richard Legault, Ed Maste, Alex Hoff
Date: Mon, 6 Sep 2004 16:15:38 -0400
Subject: FreeBSD 5.3 Bridge performance take II

Hi,

I have just finished some profiling and analysis of the FREEBSD_5_BP code
running a standard 4-port ethernet bridge (not netgraph). On the upside,
some of the features such as the netperf stuff, MUTEX_PROFILING and UMA are
very cool, and (I think) give the potential for a really fast bridge (or
similar application).
However, the current performance is still rather poor compared to 4.x, but I
think that with the groundwork now in place, some minor changes, and a couple
of new features, it can be made much, much faster. I would like to discuss
some possible optimizations (I will suggest some below); we are willing to
take on some of them and give the code back to FreeBSD. Hopefully these
changes can be made on RELENG_5 to be used by 5.4.

The tests that I have run so far have focused on the difference between
running in polling mode (dual 2.8GHz Xeon, 2 2-port em NICs) versus interrupt
mode (with debug.mpsafenet=1, and no INVARIANTS/WITNESS or anything like
that). In both setups I actually get similar throughput (300kpps total in and
out, divided evenly over the 4 ports). I think it should be possible to get
>> 1Mpps bridging on this platform.

In the polling case, there is still only one active thread, and the limiting
factor seems to be simply the number of mutexes (11 per packet according to
MUTEX_PROFILING), and the overhead from UMA, bus_dma, etc.

With polling disabled, I think the problems come from the fact that
PREEMPTION was disabled (I can't even boot with it on) and from some
sub-optimal mutex usage resulting in a lot of collisions, even though in
theory all 4 cores should be able to run simultaneously.

Here is a sample profile (while in polling mode). The cpu idle, halt, etc.
entries simply indicate that 3 of the cores have nothing to do, but it does
give a pretty good sense of where all the time is being spent. There are
definitely a lot of cycles going to UMA, mutexes, etc. (This profile only
shows the top functions and has the call tree disabled, i.e. only
interrupt-based profiling, because the test slows down too much otherwise.)
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 18.4      10.25     10.25                            cpu_idle_default [1]
 13.8      17.94      7.69                            cpu_idle [2]
  6.5      21.57      3.63                            critical_exit [3]
  6.5      25.17      3.61                            _mtx_lock_spin [4]
  5.0      27.95      2.78                            uma_zalloc_arg [5]
  4.6      30.52      2.56                            cpu_halt [6]
  4.4      32.94      2.43                            uma_zfree_arg [7]
  3.9      35.12      2.18                            maybe_preempt [8]
  3.2      36.91      1.79                            bridge_in [9]
  2.8      38.46      1.55                            em_process_receive_interrupts [10]
  2.6      39.89      1.43                            _bus_dmamap_load_buffer [11]
  2.3      41.19      1.30                            bdg_forward [12]
  2.3      42.48      1.29                            mb_free_ext [13]
  1.8      43.49      1.01                            malloc_type_freed [14]
  1.7      44.44      0.95                            ether_input [15]
  1.7      45.39      0.94                            em_start [16]
  1.7      46.33      0.94                            _bus_dmamap_sync [17]
  1.5      47.18      0.84                            em_start_locked [18]
  1.2      47.85      0.68                            malloc_type_zone_allocated [19]
  1.2      48.52      0.67                            __mcount [20]
  1.2      49.17      0.65                            mb_ctor_pack [21]
  1.1      49.80      0.63                            em_encap [22]
  1.1      50.39      0.59                            free [23]
  1.0      50.94      0.56                            bus_dmamap_load_mbuf [24]
  0.9      51.46      0.51                            generic_bzero [25]
  0.9      51.96      0.50                            m_freem [26]
  0.8      52.42      0.46                            generic_bcopy [27]
  0.7      52.79      0.38                            em_get_buf [28]
  0.6      53.13      0.34                            em_clean_transmit_interrupts [29]
  0.5      53.42      0.29                            bus_dmamap_load [30]
  0.4      53.66      0.24                            m_adj [31]
  0.4      53.90      0.23                            malloc [32]
  0.4      54.11      0.22                            bus_dmamap_create [33]
  0.2      54.24      0.12                            bus_dmamem_free [35]
  0.2      54.35      0.11                            mb_dtor_pack [36]
  0.2      54.45      0.10                            em_tx_cb [37]
  0.2      54.54      0.09                            em_receive_checksum [38]
  0.1      54.61      0.08                            em_dmamap_cb [39]
  0.1      54.69      0.07                            m_tag_delete_chain [40]
  0.1      54.75      0.07                            _bus_dmamap_unload [41]
  0.1      54.82      0.06                            em_poll [42]
  0.1      54.88      0.06                            em_transmit_checksum_setup [43]
  0.1      54.93      0.05                            bus_dmamap_destroy [44]
  0.1      54.97      0.04                            _mtx_lock_sleep [47]
  0.1      55.00      0.03                            if_start [49]
  0.1      55.03      0.03                            bus_dmamap_load_uio [50]
  0.1      55.07      0.03    75189     0.00     0.00  netisr_poll [51]
  0.1      55.10      0.03                            em_smartspeed [52]
  0.1      55.13      0.03                            ithread_loop [34]

Here are the (top) results of the mutex profiling (these are basically all
the locks that get called once or twice per packet):

   max      total   count  avg  cnt_hold  cnt_lock  name
 24344   37552473  309134  121    151712    101781  if_em.c:956 (em5)           (1)
 31578   10548396  309131   34     44233     81751  if_em.c:3432 (em4)          (2)
   460    5813698  620705    9        16        79  uma_core.c:1800 (UMA pcpu)  (3)
   428    4304975  619846    6        26        24  uma_core.c:2206 (UMA pcpu)  (4)
   445    3129168  309127   10     30828     28115  bridge.c:1201 (em5)         (5)
   462    3125131  309127   10    125294    122560  bridge.c:816 (bridge)       (6)
   489    2815715  309134    9     14610     20050  if_em.c:926 (em5)           (7)
   450    2573019  309170    8     94471    101577  kern_malloc.c:185 (devbuf)  (8)
   419    2113089  309275    6     67982     65871  kern_malloc.c:210 (devbuf)  (9)

The line numbers will be close to the RELENG_5_BP code but not exactly the
same because of some local modifications, so here are descriptions of the
mutexes involved:

1) em_start (used for transmit)
2) em_process_receive_interrupts (re-lock just after if_input)
3) uma_zalloc_arg (per-CPU lock)
4) uma_zfree_arg (per-CPU lock)
5) bdg_forward (IFQ_HANDOFF)
6) bridge_in (global bridge lock)
7) em_start_locked (IF_DEQUEUE)
8) malloc_type_zone_allocated
9) malloc_type_freed

From these numbers, the UMA locks seem to get called twice for every packet,
but have no collisions. All the other locks have significant collision
problems, resulting in a lot of overhead.

Based on these stats, I have come up with the following observations and
suggestions that I would like to discuss.

As discussed before, there is a significant cost associated with every mutex.
I'd like to get down to less than one mutex per packet (on average) through
this path.
Some of the possibilities to do this are:

- Implement workQs of packets (also suggested by Robert Watson in the past).
This will reduce the mutexes in numbers 1, 2, 5, 6 & 7 above, because it
should be possible to take the lock once for a queue of packets instead of
once per packet (see the sketch after these suggestions).

- Implement device-level caching for the UMA mbuf zones. If a driver could
allocate one bucket of mbufs at a time, no locking would be required per
allocation. The same goes for the free side: if you can allocate an empty
bucket, fill it up, and then return it, only a couple of mutexes are required
per bucket. This would also reduce the function-call overhead for every
packet. This change should actually get rid of most of the remaining mutex
overhead.

I think that one of the major reasons that polling with one thread had about
the same performance as interrupts with 4 threads/cores is that some of the
mutexes are held far too long, thus reducing parallelism. The biggest culprit
is the em driver. First of all, there is only one global lock for the driver,
but there should be no reason that the rx & tx paths couldn't run
simultaneously. If we set up something like:

    EM_TX_LOCK()
    EM_TX_UNLOCK()
    EM_RX_LOCK()
    EM_RX_UNLOCK()
    EM_LOCK()    { EM_TX_LOCK(); EM_RX_LOCK(); }
    EM_UNLOCK()  { EM_TX_UNLOCK(); EM_RX_UNLOCK(); }

this driver will run much faster. Even within the receive and transmit
functions, the mutexes are held for a long time. It should be possible to
code in such a way that the mutex is released before trying to free or
allocate an mbuf. This should reduce the holding time, and thus the
collisions, a lot.

When overloading the bridge in interrupt mode, the system becomes completely
unresponsive (I can't even get into ddb) until the packet source is removed.
This is highly undesirable behaviour, but it is currently the only way to use
multiple kernel threads to handle the workload. Extending polling to use
multiple threads instead of one should work around this problem. This is a
bit of a design effort in itself, and probably worthy of a separate
discussion. We are certainly willing to give this a shot (hopefully with some
external input).

The latest-generation Xeons (Nocona) have a couple of new features that are
very useful for optimizing code. One of them is the ability to prefetch a
cache line for which the page is not yet in the TLB. It should be possible to
strategically sprinkle a few prefetches in the code and get a big performance
boost. This is probably pretty platform specific, though, so I don't know how
to do this in general: it will only benefit some platforms (I don't know
about AMD/alpha), and may slightly hurt some others.

In terms of cache efficiency, I am not sure that using the UMA mbuf packet
zone is the best way to go. To be able to put a cluster on a DMA descriptor,
you currently need to read the mbuf header to get its pointer. It may be more
efficient to have local caches of just clusters and just mbufs. To allocate a
cluster you then only need to read the bucket array, and can add the cluster
to the descriptor without having anything but the array itself in cache. Once
the packet is filled up, it can be coupled to an mbuf header. The other
advantage of this is that the pointers for both are always easily available
in an array, so they lend themselves well to s/w prefetching.
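To make the workQ idea from the first suggestion above a bit more concrete,
here is a rough sketch of the kind of helper I have in mind on the transmit
side. The function name and the batch/limit parameters are made up for
illustration; it simply drains an ifqueue using the existing
IF_LOCK/_IF_DEQUEUE/IF_UNLOCK macros so that the queue mutex is taken once
per batch instead of once per packet:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/if_var.h>

    /*
     * Hypothetical helper: pull up to 'limit' packets off an ifqueue
     * while holding the queue mutex only once for the whole batch.
     * Returns the number of mbufs stored in 'batch'.
     */
    static int
    ifq_dequeue_batch(struct ifqueue *ifq, struct mbuf **batch, int limit)
    {
            int n = 0;

            IF_LOCK(ifq);                   /* one lock ...             */
            while (n < limit) {
                    _IF_DEQUEUE(ifq, batch[n]);  /* unlocked dequeues   */
                    if (batch[n] == NULL)
                            break;
                    n++;
            }
            IF_UNLOCK(ifq);                 /* ... one unlock per batch */
            return (n);
    }

With something like this, em_start_locked() could pull, say, 32 packets at a
time and then encapsulate them without touching the queue mutex again, so the
per-packet IF_DEQUEUE cost in (7) collapses to one lock per batch. The same
batching idea applies to the handoff and bridge locks in (1), (5) and (6).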
The choice of scheduler, and the use of PREEMPTION, will probably make a bit
of a difference for these tests too, but I did not do much experimentation
because I couldn't even boot with the ULE scheduler & PREEMPTION enabled. I
suspect that preemption will help quite a bit when there are mutex
collisions.

This is all I have for now. As I mentioned previously, I'd like to generate
some discussion on some of these points, as well as hear ideas for additional
optimizations. We will definitely implement some of these features ourselves,
but would much rather give the code back and make this a "cooperative
effort".

Also, I haven't done any testing on the netgraph side of things yet, but that
will probably be next on the list.

Comments?

Thanks,

Gerrit Nagelhout