From: Andre Oppermann <andre@freebsd.org>
Date: Thu, 19 Apr 2012 23:20:00 +0200
To: Luigi Rizzo
Cc: current@freebsd.org, net@freebsd.org
Subject: Re: Some performance measurements on the FreeBSD network stack
Message-ID: <4F908180.6010408@freebsd.org>
In-Reply-To: <20120419204622.GA94904@onelab2.iet.unipi.it>

On 19.04.2012 22:46, Luigi Rizzo wrote:
> On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
>> On 19.04.2012 15:30, Luigi Rizzo wrote:
>>> I have been running some performance tests on UDP sockets,
>>> using the netsend program in tools/tools/netrate/netsend
>>> and instrumenting the source code and the kernel to return at
>>> various points of the path. Here are some results which
>>> I hope you find interesting.
>>
>> Jumping over very interesting analysis...
>>
>>> - the next expensive operation, consuming another 100ns,
>>> is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
>>> seems to scale decently, at least with 4 cores. The copyin() is
>>> relatively inexpensive (not reported in the data below, but
>>> disabling it saves only 15-20ns for a short packet).
>>>
>>> I have not followed the details, but the allocator calls the zone
>>> allocator, there is at least one critical_enter()/critical_exit()
>>> pair, and the highly modular architecture invokes long chains of
>>> indirect function calls both on allocation and release.
>>>
>>> It might make sense to keep a small pool of mbufs attached to the
>>> socket buffer instead of going to the zone allocator.
>>> Or defer the actual encapsulation to
>>> (*so->so_proto->pr_usrreqs->pru_send)(), which is called inline anyway.
>>
>> The UMA mbuf allocator is certainly not perfect but rather good.
>> It has a per-CPU cache of mbufs that are very fast to allocate
>> from. Once it has used them it needs to refill from the global
>> pool, which may happen from time to time and show up in the averages.
>
> Indeed I was pleased to see no difference between 1 and 4 threads.
> This also suggests that the global pool is accessed very seldom,
> and for short times, otherwise you'd see the effect with 4 threads.

Robert did the per-CPU mbuf allocator pools a few years ago.
Excellent engineering.
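To make the point concrete, the fast path is conceptually like the sketch
below. This is only a simplified illustration of the per-CPU cache idea,
not the actual UMA code; pcpu_mbuf_cache and mbuf_cache_refill() are
made-up names.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/pcpu.h>
    #include <sys/mbuf.h>

    /* One small free list per CPU; simplified, not UMA's real layout. */
    struct pcpu_mbuf_cache {
        struct mbuf *head;
        int          count;
    };

    static struct pcpu_mbuf_cache pcpu_mbuf_cache[MAXCPU];

    static struct mbuf *mbuf_cache_refill(void);    /* made-up slow path */

    static struct mbuf *
    mbuf_alloc_fast(void)
    {
        struct pcpu_mbuf_cache *c;
        struct mbuf *m;

        critical_enter();               /* no preemption, no migration */
        c = &pcpu_mbuf_cache[curcpu];
        if ((m = c->head) != NULL) {
            c->head = m->m_next;        /* CPU-local pop, no lock needed */
            c->count--;
            critical_exit();
            return (m);
        }
        critical_exit();
        /* Cache empty: refill from the (locked) global zone. */
        return (mbuf_cache_refill());
    }

The critical section is what keeps the thread on its CPU while it touches
the CPU-local list.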
> What might be moderately expensive are the critical_enter()/critical_exit()
> calls around individual allocations.

Can't get away from those, as a thread must not migrate away while
manipulating the per-CPU mbuf pool.

> The allocation happens while the code has already an exclusive
> lock on so->snd_buf so a pool of fresh buffers could be attached
> there.

Ah, it is not necessary to hold the snd_buf lock there while doing the
allocate+copyin. With soreceive_stream() (which is experimental and not
enabled by default) I did just that for the receive path. It's quite a
significant gain there. IMHO it's better to resolve the locking order
than to juggle yet another mbuf sink.

> But the other consideration is that one could defer the mbuf allocation
> to a later time when the packet is actually built (or anyway
> right before the thread returns).
> What I envision (and this would fit nicely with netmap) is the following:
> - have a (possibly readonly) template for the headers (MAC+IP+UDP)
> attached to the socket, built on demand, and cached and managed
> with similar invalidation rules as used by fastforward;

That would require cross-pointing the rtentry and whatnot again.
We want to get away from that to untangle the (locking) mess that
eventually results from it.

> - possibly extend the pru_send interface so one can pass down the uio
> instead of the mbuf;
> - make an opportunistic buffer allocation in some place downstream,
> where the code already has an x-lock on some resource (could be
> the snd_buf, the interface, ...) so the allocation comes for free.

ETOOCOMPLEXOVERTIME.

>>> - another big bottleneck is the route lookup in ip_output()
>>> (between entries 51 and 56). Not only does it eat another
>>> 100ns+ on an empty routing table, but it also
>>> causes huge contention when multiple cores
>>> are involved.
>>
>> This is indeed a big problem. I'm working (rough edges remain) on
>> changing the routing table locking to an rmlock (read-mostly) which
>
> I was wondering: is there a way (and/or any advantage) to use the
> fastforward code to look up the route for locally sourced packets?

No. The main advantage/difference of fastforward is the short code
path and processing to completion.
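To make the rmlock point above concrete: route lookups take a cheap,
scalable shared lock and only the rare table change takes the exclusive
side. A minimal sketch of the rmlock(9) API, not the actual routing table
patch; rt_rmlock and rt_lookup_locked() are made-up names:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/rmlock.h>
    #include <sys/socket.h>
    #include <net/route.h>

    static struct rmlock rt_rmlock;     /* hypothetical table lock */

    static struct rtentry *rt_lookup_locked(struct sockaddr *);  /* made up */

    static void
    rt_table_init(void)
    {
        rm_init(&rt_rmlock, "rtable");
    }

    /* Route lookups (the common case) take the shared side. */
    static struct rtentry *
    rt_lookup_shared(struct sockaddr *dst)
    {
        struct rm_priotracker tracker;
        struct rtentry *rt;

        rm_rlock(&rt_rmlock, &tracker);
        rt = rt_lookup_locked(dst);
        rm_runlock(&rt_rmlock, &tracker);
        return (rt);
    }

    /* Table changes (rare) take the exclusive side. */
    static void
    rt_table_update(void)
    {
        rm_wlock(&rt_rmlock);
        /* ... modify the radix tree ... */
        rm_wunlock(&rt_rmlock);
    }

--
Andre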