Date: Tue, 29 Oct 2013 22:25:34 +0100
From: Andre Oppermann <andre@freebsd.org>
To: Navdeep Parhar, Luigi Rizzo
Cc: Randall Stewart, "freebsd-net@freebsd.org"
Subject: Re: MQ Patch.
Message-ID: <527027CE.5040806@freebsd.org>
In-Reply-To: <527022AC.4030502@FreeBSD.org>

On 29.10.2013 22:03, Navdeep Parhar wrote:
> On 10/29/13 13:41, Andre Oppermann wrote:
>> Let me jump in here and explain roughly the ideas/path I'm exploring
>> in creating and eventually implementing a big picture for drivers,
>> queues, queue management, various QoS and so on:
>>
>> Situation: We're still mostly based on the old 4.4BSD IFQ model with
>> a couple of work-arounds (sndring, drbr), and the bit-rotten ALTQ we
>> have in tree isn't helpful at all.
>>
>> Steps:
>>
>> 1. take the soft-queuing method out of the ifnet layer and make it
>>    a property of the driver, so that the upper stack (or actually the
>>    protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit)
>>    without any queuing at that point.  It is then up to the driver
>>    to decide how it multiplexes multi-core access to its queue(s)
>>    and how they are configured.
>
> It would work out much better if the kernel were aware of the number
> of tx queues of a multiq driver and explicitly selected one in
> if_transmit.  The driver has no information on the CPU affinity etc.
> of the applications generating the traffic; the kernel does.  In
> general, the kernel has a much better "global view" of the system,
> and some of the stuff currently in the drivers really should move up
> into the stack.

I've been thinking a lot about this and have come to the preliminary
conclusion that the upper stack should not tell the driver which queue
to use.  There are far too many possible approaches, each performing
better or worse depending on the use case.  We also have a big problem
with cores vs. queues mismatches either way (more cores than queues or
more queues than cores, though the latter is much less of a problem).
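To make that boundary a bit more concrete, here is a rough, untested
sketch of what a driver's (*if_transmit) looks like once all queuing
lives below the ifnet layer (essentially approach a) below).  The "xx"
driver, its softc layout and the xx_ring_* helpers are made up for
illustration; only the (*if_transmit) signature and the flowid field
in the mbuf packet header are real:

/*
 * Rough sketch only: "xx" is a made-up driver.  xx_ring_enqueue() and
 * xx_ring_kick() stand in for the real descriptor-ring and doorbell
 * handling, which is hardware specific and not shown.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

struct xx_ring {
        struct mtx       lock;          /* protects this DMA ring */
        /* descriptors, producer/consumer indices, ... */
};

struct xx_softc {
        struct xx_ring  *rings;         /* N hardware DMA rings */
        int              nrings;
};

static int      xx_ring_enqueue(struct xx_ring *, struct mbuf *);
static void     xx_ring_kick(struct xx_ring *);

static int
xx_transmit(struct ifnet *ifp, struct mbuf *m)
{
        struct xx_softc *sc = ifp->if_softc;
        struct xx_ring *r;

        /*
         * Pick a DMA ring from the flow hash the stack left in the
         * mbuf packet header; each ring has its own lock.
         */
        r = &sc->rings[m->m_pkthdr.flowid % sc->nrings];

        mtx_lock(&r->lock);
        if (xx_ring_enqueue(r, m) != 0) {       /* ring full */
                mtx_unlock(&r->lock);
                m_freem(m);
                return (ENOBUFS);
        }
        xx_ring_kick(r);                        /* tell the hardware */
        mtx_unlock(&r->lock);
        return (0);
}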
For now I see these primary multi-hardware-queue approaches being
implemented first:

a) The driver's (*if_transmit) takes the flowid from the mbuf header
   and selects one of the N hardware DMA rings based on it.  Each of
   the DMA rings is protected by its own lock.  The assumption here is
   that with enough DMA rings the contention on each of them will be
   relatively low, and that ideally a flow and its ring more or less
   stick to the core that sends most of the packets into that flow.
   Of course it is a statistical certainty that some bouncing will be
   going on.

b) The driver assigns the DMA rings to particular cores, which can
   then drive them lockless through a critnest++ (a critical section).
   The driver's (*if_transmit) looks up the core it was called on and
   pushes the traffic out on that core's DMA ring.  The problem is the
   upper stack's affinity, which is not guaranteed.  This has two
   consequences: packets of the same flow may be reordered because the
   protocol's send function happens to be called from a different core
   the second time, or the driver's (*if_transmit) has to switch to
   the right core to complete the transmit for this flow if the upper
   stack migrated/bounced around.  It is rather difficult to ensure
   full affinity from userspace down through the upper stack and then
   to the driver.  (See the P.S. below for a rough sketch of this
   approach.)

c) Non-multi-queue capable hardware uses a kernel-provided set of
   functions to manage the contention for the single resource of a
   DMA ring.

The point here is that the driver is the right place to make these
decisions, because the upper stack lacks knowledge of (and shouldn't
have to care about) the actually available hardware and its
capabilities.  All the necessary information is available to the
driver as well, through the appropriate mbuf header fields and the
core it is called on.

--
Andre
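P.S.: For b) the shape would be roughly the following, reusing the
made-up xx_softc/xx_ring and helpers from the sketch earlier in this
mail and additionally assuming the driver allocated one DMA ring per
core.  It deliberately ignores the reordering/affinity problem
described above; this is an illustration, not a finished
implementation:

#include <sys/pcpu.h>           /* curcpu */

static int
xx_transmit_percpu(struct ifnet *ifp, struct mbuf *m)
{
        struct xx_softc *sc = ifp->if_softc;
        struct xx_ring *r;
        int error;

        /*
         * critical_enter() bumps td_critnest (the "critnest++"
         * above), so the thread can neither be preempted nor migrate
         * while it owns this core's ring; with one ring per core no
         * lock is needed.
         */
        critical_enter();
        r = &sc->rings[curcpu];         /* assumes nrings == #cores */
        error = xx_ring_enqueue(r, m);
        if (error == 0)
                xx_ring_kick(r);
        critical_exit();

        if (error != 0)
                m_freem(m);
        return (error);
}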