From owner-freebsd-net@FreeBSD.ORG Tue Jan 1 00:43:26 2008 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2A22B16A468 for ; Tue, 1 Jan 2008 00:43:26 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au [211.29.132.185]) by mx1.freebsd.org (Postfix) with ESMTP id BEDFB13C455 for ; Tue, 1 Jan 2008 00:43:25 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c211-30-219-213.carlnfd3.nsw.optusnet.com.au (c211-30-219-213.carlnfd3.nsw.optusnet.com.au [211.30.219.213]) by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id m010hIF9014397 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 1 Jan 2008 11:43:20 +1100 Date: Tue, 1 Jan 2008 11:43:18 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Julian Elischer In-Reply-To: <4779697A.4050806@elischer.org> Message-ID: <20080101105918.I9594@delplex.bde.org> References: <4779697A.4050806@elischer.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: FreeBSD Net Subject: Re: m_freem() X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Jan 2008 00:43:26 -0000 On Mon, 31 Dec 2007, Julian Elischer wrote: > m_freem() would be a perfect candidate for an inline function. > (or even macro). > in the case where m is null, no function call would even be made... > The whole function is only 2 lines, and it gets called once for every packet > :-) On the contrary, m_freem() is a large function that is fairly unsuitable for inlining. I just happened to count that it usually takes 180 instructions in -current with no INVARIANTS etc. (down from 245 instructions in ~5.2). Further counting gave 112 and 132 instructions for it (180 was for ttcp udp input packets and 112 and 132 are for ping packets in 2 directions). m_freem() is only one statement, but that statement consists mainly of a function call to a function that is inline (m_free()). m_free() sometimes calls m_free_ext(), which is not inline, and usually calls uma_zfree(), which is inline, but which is just a wrapper for uma_zfree_arg(), which is not inline. uma_zfree_arg() is very large and thus very unsuitable for inlining. I didn't check for [nested] inlining of its internals at the source level. At runtime it usually calls the non-inline-function m_dtor_mbuf() which calls the non-inline function m_tag_delete_chain(); then it calls critical_enter() and critical_exit(). critical_exit() is fairly large and sometimes calls thread_lock(), mi_switch() and thread_unlock(), but usually doesn't. So the non-inline part of the call chain is usually: m_freem() uma_zfree_arg() # the following is just 1 short path through this m_dtor_mbuf() m_tag_delete_chain() critical_enter() critical_exit() [Pause to recover from a double fault panic in critical*(). critical*() or kdb is non-reeantrant somehwere, so tracing through critical_*() or one of its callers in order to count instructions tends to cause panics.] All this is too large to inline. Inlining only the top level of it would only make a tiny difference. It might make a positive or negative difference, depending on whether the reduced instruction count has a larger effect than the increased cache pressure. Generally I think it is bogus to inline at the top level. Here inlining at the top level may win in 2 ways: - by avoiding the function call to the next level (and thus all function calls) in the usual case. I think this doesn't happen here. I think it is the usual case for the m_free_ext() call in m_free(), so inlining m_free() is a clear win. - by improving branch prediction. With a branch in a non-inline function, it may be mispredicted often because different classes of callers make it go in different ways. With branch the distributed in callers by inlining, it can be predicted perfectly in individual callers that don't change its direction often and/or change its direction in predictable ways. On Athlon CPUs, mispredicting a single branch costs the same several function calls provided the implicit branches for all the function calls are not mispredicted. Too much inlining is still bad. Apart from busting icaches, it can bust branch prediction caches -- with enough distribution of branches, all branches will be mispredicted. The m_freem() wrapper currently limits the icache bloat from the m_free() inline. In RELENG_4, both m_free() and m_freem() are non-inline and non-macro. That may be why networking in RELENG_4 is so much more efficient than in -current ;-). (Actually it makes little difference.) Bruce