Date: Tue, 29 Oct 2013 22:25:34 +0100
From: Andre Oppermann <andre@freebsd.org>
To: Navdeep Parhar, Luigi Rizzo
Cc: Randall Stewart, "freebsd-net@freebsd.org"
Subject: Re: MQ Patch.
Message-ID: <527027CE.5040806@freebsd.org>
In-Reply-To: <527022AC.4030502@FreeBSD.org>

On 29.10.2013 22:03, Navdeep Parhar wrote:
> On 10/29/13 13:41, Andre Oppermann wrote:
>> Let me jump in here and explain roughly the ideas/path I'm exploring
>> in creating and eventually implementing a big picture for drivers,
>> queues, queue management, various QoS and so on:
>>
>> Situation: We're still mostly based on the old 4.4BSD IFQ model with
>> a couple of work-arounds (sndring, drbr), and the bit-rotten ALTQ we
>> have in tree isn't helpful at all.
>>
>> Steps:
>>
>> 1. take the soft-queuing method out of the ifnet layer and make it
>>    a property of the driver, so that the upper stack (or actually the
>>    protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit)
>>    without any queuing at that point.  It is then up to the driver
>>    to decide how it multiplexes multi-core access to its queue(s)
>>    and how they are configured.
>
> It would work out much better if the kernel were aware of the number
> of tx queues of a multiq driver and explicitly selected one in
> if_transmit.  The driver has no information on the CPU affinity etc.
> of the applications generating the traffic; the kernel does.  In
> general, the kernel has a much better "global view" of the system,
> and some of the stuff currently in the drivers really should move up
> into the stack.

I've been thinking a lot about this and have come to the preliminary
conclusion that the upper stack should not tell the driver which queue
to use.  There are far too many possible approaches, each performing
better or worse depending on the use case.  We also have a big problem
with cores vs. queues mismatches either way (more cores than queues or
more queues than cores, though the latter is much less of a problem).
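To make that boundary a bit more concrete, here is a rough, untested
sketch of what a driver's (*if_transmit) looks like once all queuing
lives below the ifnet layer (essentially approach a) below).  The "xx"
driver, its softc layout and the xx_ring_* helpers are made up for
illustration; only the (*if_transmit) signature and the flowid field
in the mbuf packet header are real:

/*
 * Rough sketch only: "xx" is a made-up driver.  xx_ring_enqueue() and
 * xx_ring_kick() stand in for the real descriptor-ring and doorbell
 * handling, which is hardware specific and not shown.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

struct xx_ring {
        struct mtx       lock;          /* protects this DMA ring */
        /* descriptors, producer/consumer indices, ... */
};

struct xx_softc {
        struct xx_ring  *rings;         /* N hardware DMA rings */
        int              nrings;
};

static int      xx_ring_enqueue(struct xx_ring *, struct mbuf *);
static void     xx_ring_kick(struct xx_ring *);

static int
xx_transmit(struct ifnet *ifp, struct mbuf *m)
{
        struct xx_softc *sc = ifp->if_softc;
        struct xx_ring *r;

        /*
         * Pick a DMA ring from the flow hash the stack left in the
         * mbuf packet header; each ring has its own lock.
         */
        r = &sc->rings[m->m_pkthdr.flowid % sc->nrings];

        mtx_lock(&r->lock);
        if (xx_ring_enqueue(r, m) != 0) {       /* ring full */
                mtx_unlock(&r->lock);
                m_freem(m);
                return (ENOBUFS);
        }
        xx_ring_kick(r);                        /* tell the hardware */
        mtx_unlock(&r->lock);
        return (0);
}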
For now I see these primary multi-hardware-queue approaches being
implemented first:

a) The driver's (*if_transmit) takes the flowid from the mbuf header
   and selects one of the N hardware DMA rings based on it.  Each of
   the DMA rings is protected by its own lock.  The assumption here is
   that with enough DMA rings the contention on each of them will be
   relatively low, and that ideally a flow and its ring more or less
   stick to the core that sends most of the packets into that flow.
   Of course it is a statistical certainty that some bouncing will be
   going on.

b) The driver assigns the DMA rings to particular cores, which can
   then drive them lockless through a critnest++ (a critical section).
   The driver's (*if_transmit) looks up the core it was called on and
   pushes the traffic out on that core's DMA ring.  The problem is the
   upper stack's affinity, which is not guaranteed.  This has two
   consequences: packets of the same flow may be reordered because the
   protocol's send function happens to be called from a different core
   the second time, or the driver's (*if_transmit) has to switch to
   the right core to complete the transmit for this flow if the upper
   stack migrated/bounced around.  It is rather difficult to ensure
   full affinity from userspace down through the upper stack and then
   to the driver.  (See the P.S. below for a rough sketch of this
   approach.)

c) Non-multi-queue capable hardware uses a kernel-provided set of
   functions to manage the contention for the single resource of a
   DMA ring.

The point here is that the driver is the right place to make these
decisions, because the upper stack lacks knowledge of (and shouldn't
have to care about) the actually available hardware and its
capabilities.  All the necessary information is available to the
driver as well, through the appropriate mbuf header fields and the
core it is called on.

--
Andre
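P.S.: For b) the shape would be roughly the following, reusing the
made-up xx_softc/xx_ring and helpers from the sketch earlier in this
mail and additionally assuming the driver allocated one DMA ring per
core.  It deliberately ignores the reordering/affinity problem
described above; this is an illustration, not a finished
implementation:

#include <sys/pcpu.h>           /* curcpu */

static int
xx_transmit_percpu(struct ifnet *ifp, struct mbuf *m)
{
        struct xx_softc *sc = ifp->if_softc;
        struct xx_ring *r;
        int error;

        /*
         * critical_enter() bumps td_critnest (the "critnest++"
         * above), so the thread can neither be preempted nor migrate
         * while it owns this core's ring; with one ring per core no
         * lock is needed.
         */
        critical_enter();
        r = &sc->rings[curcpu];         /* assumes nrings == #cores */
        error = xx_ring_enqueue(r, m);
        if (error == 0)
                xx_ring_kick(r);
        critical_exit();

        if (error != 0)
                m_freem(m);
        return (error);
}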