From: "Gerrit Nagelhout" <gnagelhout@sandvine.com>
To: freebsd-current@freebsd.org, "Scott Long", "Robert Watson"
Cc: Richard Legault, Ed Maste, Alex Hoff
Date: Mon, 6 Sep 2004 16:15:38 -0400
Subject: FreeBSD 5.3 Bridge performance take II

Hi,

I have just finished some profiling and analysis of the FREEBSD_5_BP code
running a standard 4-port ethernet bridge (not netgraph). On the upside,
some of the features such as the netperf stuff, MUTEX_PROFILING and UMA are
very cool, and (I think) give the potential for a really fast bridge (or
similar application).
However, the current performance is still rather poor compared to 4.x, but I
think that with the groundwork now in place, some minor changes, and a couple
of new features, it can be made much, much faster. I would like to discuss
some possible optimizations (I will suggest some below); we are willing to
take on some of them and give the code back to FreeBSD. Hopefully these
changes can be made on RELENG_5 to be used by 5.4.

The tests that I have run so far have focused on the difference between
running in polling mode (dual 2.8GHz Xeon, 2 2-port em NICs) versus interrupt
mode (with debug.mpsafenet=1, and no INVARIANTS/WITNESS or anything like
that). In both setups I actually get similar throughput (300kpps total in and
out, divided evenly over the 4 ports). I think it should be possible to get
>> 1Mpps bridging on this platform.

In the polling case, there is still only one active thread, and the limiting
factor seems to be simply the number of mutexes (11 per packet according to
MUTEX_PROFILING), and the overhead from UMA, bus_dma, etc.

With polling disabled, I think the problems come from the fact that
PREEMPTION was disabled (I can't even boot with it on) and from some
sub-optimal mutex usage resulting in a lot of collisions, even though in
theory all 4 cores should be able to run simultaneously.

Here is a sample profile (while in polling mode). The cpu idle, halt, etc.
entries simply indicate that 3 of the cores have nothing to do, but it does
give a pretty good sense of where all the time is being spent. There are
definitely a lot of cycles going to UMA, mutexes, etc. (This profile only
shows the top functions and has the call tree disabled, i.e. only
interrupt-based profiling, because the test slows down too much otherwise.)
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 18.4      10.25     10.25                            cpu_idle_default [1]
 13.8      17.94      7.69                            cpu_idle [2]
  6.5      21.57      3.63                            critical_exit [3]
  6.5      25.17      3.61                            _mtx_lock_spin [4]
  5.0      27.95      2.78                            uma_zalloc_arg [5]
  4.6      30.52      2.56                            cpu_halt [6]
  4.4      32.94      2.43                            uma_zfree_arg [7]
  3.9      35.12      2.18                            maybe_preempt [8]
  3.2      36.91      1.79                            bridge_in [9]
  2.8      38.46      1.55                            em_process_receive_interrupts [10]
  2.6      39.89      1.43                            _bus_dmamap_load_buffer [11]
  2.3      41.19      1.30                            bdg_forward [12]
  2.3      42.48      1.29                            mb_free_ext [13]
  1.8      43.49      1.01                            malloc_type_freed [14]
  1.7      44.44      0.95                            ether_input [15]
  1.7      45.39      0.94                            em_start [16]
  1.7      46.33      0.94                            _bus_dmamap_sync [17]
  1.5      47.18      0.84                            em_start_locked [18]
  1.2      47.85      0.68                            malloc_type_zone_allocated [19]
  1.2      48.52      0.67                            __mcount [20]
  1.2      49.17      0.65                            mb_ctor_pack [21]
  1.1      49.80      0.63                            em_encap [22]
  1.1      50.39      0.59                            free [23]
  1.0      50.94      0.56                            bus_dmamap_load_mbuf [24]
  0.9      51.46      0.51                            generic_bzero [25]
  0.9      51.96      0.50                            m_freem [26]
  0.8      52.42      0.46                            generic_bcopy [27]
  0.7      52.79      0.38                            em_get_buf [28]
  0.6      53.13      0.34                            em_clean_transmit_interrupts [29]
  0.5      53.42      0.29                            bus_dmamap_load [30]
  0.4      53.66      0.24                            m_adj [31]
  0.4      53.90      0.23                            malloc [32]
  0.4      54.11      0.22                            bus_dmamap_create [33]
  0.2      54.24      0.12                            bus_dmamem_free [35]
  0.2      54.35      0.11                            mb_dtor_pack [36]
  0.2      54.45      0.10                            em_tx_cb [37]
  0.2      54.54      0.09                            em_receive_checksum [38]
  0.1      54.61      0.08                            em_dmamap_cb [39]
  0.1      54.69      0.07                            m_tag_delete_chain [40]
  0.1      54.75      0.07                            _bus_dmamap_unload [41]
  0.1      54.82      0.06                            em_poll [42]
  0.1      54.88      0.06                            em_transmit_checksum_setup [43]
  0.1      54.93      0.05                            bus_dmamap_destroy [44]
  0.1      54.97      0.04                            _mtx_lock_sleep [47]
  0.1      55.00      0.03                            if_start [49]
  0.1      55.03      0.03                            bus_dmamap_load_uio [50]
  0.1      55.07      0.03    75189     0.00     0.00  netisr_poll [51]
  0.1      55.10      0.03                            em_smartspeed [52]
  0.1      55.13      0.03                            ithread_loop [34]

Here are the (top) results of the mutex profiling (these are basically all
the locks that get called once or twice per packet):

   max      total   count  avg  cnt_hold  cnt_lock  name
 24344   37552473  309134  121    151712    101781  if_em.c:956 (em5)           (1)
 31578   10548396  309131   34     44233     81751  if_em.c:3432 (em4)          (2)
   460    5813698  620705    9        16        79  uma_core.c:1800 (UMA pcpu)  (3)
   428    4304975  619846    6        26        24  uma_core.c:2206 (UMA pcpu)  (4)
   445    3129168  309127   10     30828     28115  bridge.c:1201 (em5)         (5)
   462    3125131  309127   10    125294    122560  bridge.c:816 (bridge)       (6)
   489    2815715  309134    9     14610     20050  if_em.c:926 (em5)           (7)
   450    2573019  309170    8     94471    101577  kern_malloc.c:185 (devbuf)  (8)
   419    2113089  309275    6     67982     65871  kern_malloc.c:210 (devbuf)  (9)

The line numbers will be close to the RELENG_5_BP code but not exactly the
same because of some local modifications, so here are descriptions of the
mutexes involved:

1) em_start (used for transmit)
2) em_process_receive_interrupts (re-lock just after if_input)
3) uma_zalloc_arg (per-CPU lock)
4) uma_zfree_arg (per-CPU lock)
5) bdg_forward (IFQ_HANDOFF)
6) bridge_in (global bridge lock)
7) em_start_locked (IF_DEQUEUE)
8) malloc_type_zone_allocated
9) malloc_type_freed

From these numbers, the UMA locks seem to get called twice for every packet,
but have no collisions. All the other locks have significant collision
problems, resulting in a lot of overhead.

Based on these stats, I have come up with the following observations and
suggestions that I would like to discuss.

As discussed before, there is a significant cost associated with every mutex.
I'd like to get down to less than one mutex per packet (on average) through
this path.
Some of the possibilities to do this are:

- Implement workQs of packets (also suggested by Robert Watson in the past).
This will reduce the mutexes in numbers 1, 2, 5, 6 & 7 above, because it
should be possible to take the lock once for a queue of packets instead of
once per packet (see the sketch after these suggestions).

- Implement device-level caching for the UMA mbuf zones. If a driver could
allocate one bucket of mbufs at a time, no locking would be required per
allocation. The same goes for the free side: if you can allocate an empty
bucket, fill it up, and then return it, only a couple of mutexes are required
per bucket. This would also reduce the function-call overhead for every
packet. This change should actually get rid of most of the remaining mutex
overhead.

I think that one of the major reasons that polling with one thread had about
the same performance as interrupts with 4 threads/cores is that some of the
mutexes are held far too long, thus reducing parallelism. The biggest culprit
is the em driver. First of all, there is only one global lock for the driver,
but there should be no reason that the rx & tx paths couldn't run
simultaneously. If we set up something like:

    EM_TX_LOCK()
    EM_TX_UNLOCK()
    EM_RX_LOCK()
    EM_RX_UNLOCK()
    EM_LOCK()    { EM_TX_LOCK(); EM_RX_LOCK(); }
    EM_UNLOCK()  { EM_TX_UNLOCK(); EM_RX_UNLOCK(); }

this driver will run much faster. Even within the receive and transmit
functions, the mutexes are held for a long time. It should be possible to
code in such a way that the mutex is released before trying to free or
allocate an mbuf. This should reduce the holding time, and thus the
collisions, a lot.

When overloading the bridge in interrupt mode, the system becomes completely
unresponsive (I can't even get into ddb) until the packet source is removed.
This is highly undesirable behaviour, but it is currently the only way to use
multiple kernel threads to handle the workload. Extending polling to use
multiple threads instead of one should work around this problem. This is a
bit of a design effort in itself, and probably worthy of a separate
discussion. We are certainly willing to give this a shot (hopefully with some
external input).

The latest-generation Xeons (Nocona) have a couple of new features that are
very useful for optimizing code. One of them is the ability to prefetch a
cache line for which the page is not yet in the TLB. It should be possible to
strategically sprinkle a few prefetches in the code and get a big performance
boost. This is probably pretty platform specific, though, so I don't know how
to do this in general: it will only benefit some platforms (I don't know
about AMD/alpha), and may slightly hurt some others.

In terms of cache efficiency, I am not sure that using the UMA mbuf packet
zone is the best way to go. To be able to put a cluster on a DMA descriptor,
you currently need to read the mbuf header to get its pointer. It may be more
efficient to have local caches of just clusters and just mbufs. To allocate a
cluster you then only need to read the bucket array, and can add the cluster
to the descriptor without having anything but the array itself in cache. Once
the packet is filled up, it can be coupled to an mbuf header. The other
advantage of this is that the pointers for both are always easily available
in an array, so they lend themselves well to s/w prefetching.
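To make the workQ idea from the first suggestion above a bit more concrete,
here is a rough sketch of the kind of helper I have in mind on the transmit
side. The function name and the batch/limit parameters are made up for
illustration; it simply drains an ifqueue using the existing
IF_LOCK/_IF_DEQUEUE/IF_UNLOCK macros so that the queue mutex is taken once
per batch instead of once per packet:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/if_var.h>

    /*
     * Hypothetical helper: pull up to 'limit' packets off an ifqueue
     * while holding the queue mutex only once for the whole batch.
     * Returns the number of mbufs stored in 'batch'.
     */
    static int
    ifq_dequeue_batch(struct ifqueue *ifq, struct mbuf **batch, int limit)
    {
            int n = 0;

            IF_LOCK(ifq);                   /* one lock ...             */
            while (n < limit) {
                    _IF_DEQUEUE(ifq, batch[n]);  /* unlocked dequeues   */
                    if (batch[n] == NULL)
                            break;
                    n++;
            }
            IF_UNLOCK(ifq);                 /* ... one unlock per batch */
            return (n);
    }

With something like this, em_start_locked() could pull, say, 32 packets at a
time and then encapsulate them without touching the queue mutex again, so the
per-packet IF_DEQUEUE cost in (7) collapses to one lock per batch. The same
batching idea applies to the handoff and bridge locks in (1), (5) and (6).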
The choice of scheduler, and the use of PREEMPTION, will probably make a bit
of a difference for these tests too, but I did not do much experimentation
because I couldn't even boot with the ULE scheduler & PREEMPTION enabled. I
suspect that preemption will help quite a bit when there are mutex
collisions.

This is all I have for now. As I mentioned previously, I'd like to generate
some discussion on some of these points, as well as hear ideas for additional
optimizations. We will definitely implement some of these features ourselves,
but would much rather give the code back and make this a "cooperative
effort".

Also, I haven't done any testing on the netgraph side of things yet, but that
will probably be next on the list.

Comments?

Thanks,

Gerrit Nagelhout