From owner-freebsd-net@FreeBSD.ORG Mon Sep 27 13:14:33 2010
Message-ID: <4CA098BA.2010106@freebsd.org>
Date: Mon, 27 Sep 2010 15:14:34 +0200
From: Andre Oppermann
To: Luigi Rizzo
Cc: FreeBSD Net
Subject: Re: mbuf changes
References: <4C9DA26D.7000309@freebsd.org> <4C9DB0C3.5010601@freebsd.org>
 <20100925163010.GA76213@onelab2.iet.unipi.it>
 <4CA09451.7010401@freebsd.org>
 <20100927131836.GA99909@onelab2.iet.unipi.it>
In-Reply-To: <20100927131836.GA99909@onelab2.iet.unipi.it>

On 27.09.2010 15:18, Luigi Rizzo wrote:
> On Mon, Sep 27, 2010 at 02:55:45PM +0200, Andre Oppermann wrote:
> ...
>>> my idea was to have an extra field in the mbuf to tell how much room
>>> should be reserved/used for metadata (such as mtags) after
>>> the payload area so you don't need to change the allocator, and
>>> possibly can even modify this on an existing mbuf.
>>> Almost always mbufs have spare room (e.g. incoming pkts have all
>>> data in the cluster and mostly empty mdata; outgoing, except
>>> for rare cases, tend to be in a similar situation).
>>> So this approach would allow us to take an already allocated
>>> mbuf and put the mtag in the spare area after the data.
>>
>> For incoming data this approach could work as usually 2K mbuf clusters
>> are used and they have trailing space available, or rather the normal
>> mbuf referencing the cluster has its own data section unused.
>>
>> When trailing space should be used, M_TRAILINGSPACE() needs modifications
>> and a full tree audit is required to make sure that all mbuf consumers are
>> correctly using it and not some own version that directly assumes certain
>> mbuf sizes, etc.  A lot of work.
>>
>> For locally generated mbufs and socket buffers we try to use the mbufs to
>> their maximal extent.  When the socket buffer data is packetized it normally
>> is referenced, and then we get the normal mbuf with its data portion unused.
>> So that could work.
>>
>> A complication is the m_tag_free() field and function which puts the memory
>> deallocation into the hands of the mtag user.  That means all mtag consumers
>> have to be made aware of provided storage w/o having to return the memory
>> directly to the memory allocator (malloc/UMA).
>>
>> So the only way I realistically see is to make use of the mbuf's unused
>> data portion when it has external storage attached to it.  This should
>> probably cover about 98% of all cases.  The rest has to malloc() the mtag
>> storage as usual.
>
> so it wouldn't be bad -- i cannot judge the numbers, but definitely
> it would work for all incoming traffic, plus all tcp data packets
> (as the payload is in the cluster), plus all pure acks (which are small),
> plus all UDP above some 200 bytes...

Yes, about that.

>> I could whip up a prototype for review in the next weeks.
>
> I seem to remember that jeffr had already something done in Perforce.

That's a more general overhaul of the way mbufs are structured and
allocated with UMA.  I'm not sure it provides for the mtag issue.
Will check though.

-- 
Andre
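
The scheme discussed in this thread (place the tag in the mbuf's unused internal data area when the payload lives in external storage, otherwise fall back to malloc()) can be illustrated with a small userland sketch. Everything in it is an assumption made for illustration: struct toy_mbuf, struct toy_mtag, TOY_MLEN, toy_tag_alloc() and toy_tag_free() are hypothetical names, and none of this is the FreeBSD mbuf or m_tag layout or API. The inline_storage flag stands in for the m_tag_free() concern raised above: the free path has to know that inline storage must not be handed back to the allocator.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_MLEN 224	/* hypothetical size of the internal data area */

/* Toy stand-in for an mbuf; NOT the real struct mbuf. */
struct toy_mbuf {
	int	has_ext;	/* nonzero: payload is in an external cluster */
	size_t	int_used;	/* bytes of int_data already given to tags */
	_Alignas(max_align_t) unsigned char int_data[TOY_MLEN];
};

/* Toy stand-in for an mbuf tag header; the tag payload follows it. */
struct toy_mtag {
	int	inline_storage;	/* 1: carved from int_data, 0: from malloc() */
	size_t	len;		/* payload length in bytes */
};

/* Round n up so the next carved tag stays suitably aligned. */
static size_t
round_up(size_t n)
{
	size_t a = _Alignof(max_align_t);

	return ((n + a - 1) / a * a);
}

/*
 * Allocate a tag with len payload bytes.  Use the unused internal data
 * area only if the mbuf has external storage and enough room is left;
 * otherwise allocate from the heap as usual.
 */
struct toy_mtag *
toy_tag_alloc(struct toy_mbuf *m, size_t len)
{
	size_t need = round_up(sizeof(struct toy_mtag) + len);
	struct toy_mtag *t;

	if (m->has_ext && need <= TOY_MLEN - m->int_used) {
		t = (struct toy_mtag *)(void *)(m->int_data + m->int_used);
		m->int_used += need;
		t->inline_storage = 1;
	} else {
		t = malloc(sizeof(*t) + len);
		if (t == NULL)
			return (NULL);
		t->inline_storage = 0;
	}
	t->len = len;
	return (t);
}

/* Free a tag; inline tags go away together with their mbuf. */
void
toy_tag_free(struct toy_mtag *t)
{
	if (!t->inline_storage)
		free(t);
}
```

With this toy, a small tag on an mbuf that has has_ext set is carved from int_data, while an oversized tag, or any tag on an mbuf without external storage, takes the malloc() path. A real implementation would also have to deal with the M_TRAILINGSPACE() audit and the existing m_tag_free() consumers mentioned above, which this sketch does not attempt.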