Message-ID: <4CA09451.7010401@freebsd.org>
Date: Mon, 27 Sep 2010 14:55:45 +0200
From: Andre Oppermann
To: Luigi Rizzo
Cc: FreeBSD Net
References: <4C9DA26D.7000309@freebsd.org> <4C9DB0C3.5010601@freebsd.org> <20100925163010.GA76213@onelab2.iet.unipi.it>
In-Reply-To: <20100925163010.GA76213@onelab2.iet.unipi.it>
Subject: Re: mbuf changes

On 25.09.2010 18:30, Luigi Rizzo wrote:
> On Sat, Sep 25, 2010 at 10:20:19AM +0200, Andre Oppermann wrote:
>> On 25.09.2010 09:19, Julian Elischer wrote:
>>> over the last few years there has been a bit of talk about some changes
>>> people want to see in mbufs
>>> for 9.x
>>> extra fields, changes in the way things are done, etc.
>>>
>>> If you are one of these people, pipe up now..
>>>
>>> to get the ball rolling..
>>>
>>> * Add a field for the current FIB..
>>> currently this is 4 bits stolen from
>>> the flags.
>>> what would be a good width: 8,12,16,24,32 bits?
>>> this would allow setfib to use numbers greater than 16 (the current max)
>>
>> 16 bits for 65535 FIB's should be sufficient. More than that seems really
>> excessive.
>>
>>> * Preallocating some room for some number of tags before we start
>>> allocating
>>> (expensively) new ones.
>>
>> Within the mbuf? Or at external and attached mbuf allocation time? Tags
>> are variable width and such not really suitable for pre-allocation.
>
> my idea was to have an extra field in the mbuf to tell how much room
> should be reserved/used for metadata (such as mtags) after
> the payload area so you don't need to change the allocator, and
> possibly can even modify this on an existing mbuf.
> Almost always mbufs have spare room (e.g. incoming pkts have all
> data in the cluster and mostly empty mdata; outgoing, except
> for rare cases, tend to be in a similar situation.
> So this approach would allow to take an already allocated
> mbuf and put the mtag in the spare area after the data.

For incoming data this approach could work, as usually 2K mbuf clusters
are used and they have trailing space available, or rather the normal
mbuf referencing the cluster leaves its own data section unused. If
trailing space is to be used, M_TRAILINGSPACE() needs modifications and
a full tree audit is required to make sure that all mbuf consumers are
using it correctly and not some version of their own that directly
assumes certain mbuf sizes, etc. A lot of work.

For locally generated mbufs and socket buffers we try to use the mbufs
to their maximal extent. When the socket buffer data is packetized it is
normally referenced, and then we get the normal mbuf with its data
portion unused. So that could work.

A complication is the m_tag_free() field and function, which puts the
memory deallocation into the hands of the mtag user.
That means all mtag consumers have to be made aware of the provided
storage, so that they do not return that memory directly to the memory
allocator (malloc/UMA).

So the only way I realistically see is to make use of the mbuf's unused
data portion when it has external storage attached to it. This should
probably cover about 98% of all cases. The rest has to malloc() the mtag
storage as usual.

I could whip up a prototype for review in the next few weeks.

>>> * dynamically working out what the front padding size should be.. per
>>> session.. i.e.
> ...
>> We already have "max_linkhdr" that specifies how much space is left
>
> the issue is that this is global (kern.ipc.max_linkhdr)
> but perhaps it would be good if we could make it per-socket
> so either we set it with a setsockopt, or the system can
> adjust back the value for specific sockets once it detects
> that it needs extra room
> (if you make the default too large, the useful room in the mdata
> area becomes too small unless perhaps we move to 512-bytes mbufs

For most protocols it is only necessary to leave enough space in the
mbuf to fit their header. In the case of TCP this is currently 100
bytes. That leaves 156 bytes of the 256-byte mbuf for the mbuf header
and prepend space. TCP can easily be modified to allocate an mbuf
cluster when the prepend space is too large. IIRC UDP already gets this
right.

The setsockopt route is suboptimal because it requires the application
to know about all the encapsulation steps that may happen to its
packets. A callback to update the prepend value is also very difficult
because it has to be implemented for every protocol.

Taking into consideration that prepending and encapsulation are limited
to some reasonable amount (you don't want half of your MTU sized packet
to be additional encapsulation headers), I'd say the global max_linkhdr
is sufficient.

--
Andre
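
The mtag storage scheme described above (use the mbuf's unused internal
data area when external storage is attached, otherwise malloc() as
usual) might look roughly like the user-space sketch below. This is only
an illustration under stated assumptions, not the prototype mentioned in
the mail and not FreeBSD kernel code: every sk_* name is hypothetical,
SK_MLEN is an arbitrary stand-in for the size of the internal data area,
and the per-tag free hook merely mirrors the role that m_tag_free plays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_MLEN 224	/* hypothetical size of the mbuf-internal data area */

/* Hypothetical tag header; free_fn plays the role of m_tag_free. */
struct sk_tag {
	struct sk_tag *next;
	void (*free_fn)(struct sk_tag *);
	uint16_t len;			/* payload bytes after the header */
};

/* Hypothetical, heavily simplified stand-in for struct mbuf. */
struct sk_mbuf {
	int has_ext;			/* packet data lives in a cluster */
	size_t tag_used;		/* internal bytes handed out to tags */
	struct sk_tag *tags;		/* singly linked tag chain */
	union {
		void *align;		/* forces pointer alignment */
		unsigned char bytes[SK_MLEN];
	} dat;				/* unused when has_ext is set */
};

/* Embedded tags live inside the mbuf: nothing goes back to malloc. */
static void sk_tag_free_embedded(struct sk_tag *t) { (void)t; }

/* Fallback tags were obtained from malloc() and are returned to it. */
static void sk_tag_free_malloc(struct sk_tag *t) { free(t); }

/* Round up to pointer size so consecutive embedded tags stay aligned. */
static size_t sk_roundup(size_t n)
{
	size_t a = sizeof(void *);

	return ((n + a - 1) / a * a);
}

/* Allocate a tag, preferring the unused internal data area. */
struct sk_tag *sk_tag_alloc(struct sk_mbuf *m, uint16_t len)
{
	size_t need = sk_roundup(sizeof(struct sk_tag) + len);
	struct sk_tag *t;

	if (m->has_ext && need <= SK_MLEN - m->tag_used) {
		t = (struct sk_tag *)(void *)(m->dat.bytes + m->tag_used);
		m->tag_used += need;
		t->free_fn = sk_tag_free_embedded;
	} else {
		t = malloc(need);
		if (t == NULL)
			return (NULL);
		t->free_fn = sk_tag_free_malloc;
	}
	t->len = len;
	t->next = m->tags;
	m->tags = t;
	return (t);
}

/* Nonzero if the tag was carved out of the mbuf's own data area. */
int sk_tag_is_embedded(const struct sk_tag *t)
{
	return (t->free_fn == sk_tag_free_embedded);
}

/* Release every tag through its own hook, then reclaim the area. */
void sk_tag_delete_chain(struct sk_mbuf *m)
{
	struct sk_tag *t = m->tags, *next;

	while (t != NULL) {
		next = t->next;
		t->free_fn(t);
		t = next;
	}
	m->tags = NULL;
	m->tag_used = 0;
}
```

The point of the per-tag hook in this sketch is that a consumer never
calls free() itself, so it does not need to know which kind of storage
it was given. An mbuf without external storage, or a tag too large for
the remaining area, simply takes the malloc() path.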