From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 23:35:36 2013
From: Andre Oppermann <andre@freebsd.org>
Date: Wed, 30 Oct 2013 00:35:07 +0100
To: Navdeep Parhar, Luigi Rizzo
Cc: Randall Stewart, freebsd-net@freebsd.org
Subject: Re: MQ Patch.
Message-ID: <5270462B.8050305@freebsd.org>
In-Reply-To: <5270309E.5090403@FreeBSD.org>
List-Id: Networking and TCP/IP with FreeBSD

On 29.10.2013 23:03, Navdeep Parhar wrote:
> On 10/29/13 14:25, Andre Oppermann wrote:
>> On 29.10.2013 22:03, Navdeep Parhar wrote:
>>> On 10/29/13 13:41, Andre Oppermann wrote:
>>>> Let me jump in here and explain roughly the ideas/path I'm exploring
>>>> in creating and eventually implementing a big picture for drivers,
>>>> queues, queue management, various QoS and so on.
>>>>
>>>> Situation: we're still mostly based on the old 4.4BSD IFQ model; the
>>>> couple of work-arounds (sndring, drbr) and the bit-rotten ALTQ we
>>>> have in tree aren't helpful at all.
>>>>
>>>> Steps:
>>>>
>>>> 1. Take the soft-queuing method out of the ifnet layer and make it
>>>>    a property of the driver, so that the upper stack (or actually the
>>>>    protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit)
>>>>    without any queuing at that point.  It is then up to the driver
>>>>    to decide how it multiplexes multi-core access to its queue(s)
>>>>    and how they are configured.
>>>
>>> It would work out much better if the kernel was aware of the number of
>>> tx queues of a multiq driver and explicitly selected one in if_transmit.
>>> The driver has no information on the CPU affinity etc. of the
>>> applications generating the traffic; the kernel does.  In general, the
>>> kernel has a much better "global view" of the system, and some of the
>>> stuff currently in the drivers really should move up into the stack.
>>
>> I've been thinking a lot about this and have come to the preliminary
>> conclusion that the upper stack should not tell the driver which queue
>> to use.  There are way too many possible approaches which, depending on
>> the use case, perform better or worse.  Also, we have a big problem with
>> cores vs. queues mismatches either way (more cores than queues or more
>> queues than cores, though the latter is much less of a problem).
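[To make step 1 above concrete before the variants below: under that model
the transmit path reduces to roughly the following.  This is only a
schematic sketch with invented mydrv_* names, not a patch; the point is
simply that the upper layer stops queuing and all multiplexing policy
moves below (*if_transmit).]

#include <sys/param.h>
#include <sys/socket.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

struct mydrv_softc;                     /* hypothetical per-device state */
struct mydrv_ring;                      /* hypothetical per-ring state */
static struct mydrv_ring *mydrv_select_ring(struct mydrv_softc *,
    struct mbuf *);
static int mydrv_ring_enqueue(struct mydrv_ring *, struct mbuf *);

/* Upper layer side: no IFQ_ENQUEUE/if_start dance any more. */
static int
upper_layer_output(struct ifnet *ifp, struct mbuf *m)
{

        return ((*ifp->if_transmit)(ifp, m));
}

/*
 * Driver side: (*if_transmit) multiplexes onto its own rings however it
 * sees fit (see the a/b/c variants below).
 */
static int
mydrv_transmit(struct ifnet *ifp, struct mbuf *m)
{
        struct mydrv_softc *sc = ifp->if_softc;

        return (mydrv_ring_enqueue(mydrv_select_ring(sc, m), m));
}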
>>
>> For now I see these primary multi-hardware-queue approaches to be
>> implemented first:
>>
>> a) The driver's (*if_transmit) takes the flowid from the mbuf header
>>    and selects one of the N hardware DMA rings based on it.  Each of
>>    the DMA rings is protected by a lock.  The assumption here is that
>>    with enough DMA rings the contention on each of them will be
>>    relatively low, and ideally a flow and its ring more or less stick
>>    to the core that sends lots of packets into that flow.  Of course
>>    it is a statistical certainty that some bouncing will be going on.
>>
>> b) The driver assigns the DMA rings to particular cores, which can
>>    then drive them locklessly through a critnest++.  The driver's
>>    (*if_transmit) looks up the core it got called on and pushes the
>>    traffic out on that DMA ring.  The problem is the actual upper
>>    stack's affinity, which is not guaranteed.  This has two
>>    consequences: there may be reordering of packets of the same flow,
>>    because the protocol's send function happens to be called from a
>>    different core the second time; or the driver's (*if_transmit) has
>>    to switch to the right core to complete the transmit for this flow
>>    if the upper stack migrated/bounced around.  It is rather difficult
>>    to assure full affinity from userspace down through the upper stack
>>    and then to the driver.
>>
>> c) Non-multi-queue capable hardware uses a kernel-provided set of
>>    functions to manage the contention for the single resource of a
>>    DMA ring.
>>
>> The point here is that the driver is the right place to make these
>> decisions, because the upper stack lacks (and shouldn't care about)
>> the actually available hardware and its capabilities.  All necessary
>> information is available to the driver as well, through the
>> appropriate mbuf header fields and the core it is called on.
>
> I mildly disagree with most of this, specifically with the part that
> the driver is the right place to make these decisions.  But you did say
> this was a "preliminary conclusion" so there's hope yet ;-)

I've mostly arrived at this conclusion as the least evil place to do it,
because of the complexity that would otherwise hit the ifnet boundary.
Having to deal in one place with simple cards that have only a single DMA
ring and with high-end cards that support 64 times 8 QoS WFQ class DMA
rings is messy to abstract properly.  Also, supporting API/ABI forward
and backward compatibility would likely be nightmarish.

The driver isn't really making the decision; it is acting upon the mbuf
header information (flowid, qoscos) and using it, together with its
intimate knowledge of the hardware capabilities, to get a hopefully close
to optimal result.

The holy grail, so to speak, would be to run the entire stack with full
affinity up and down.  That is certainly possible, provided the
application is fully aware of it as well.  In typical mixed-load cases
this is unlikely, and the application(s) float around.  A full-affinity
stack would then have to switch to the right core when the kernel is
entered, which has its own drawbacks again.  However, nothing in the new
implementations should prevent us from running the stack in full affinity
mode.

> Let's wait till you have an early implementation and we are all able to
> experiment with it.  To be continued...
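Fair enough.  In the meantime, to give a feel for the direction, variant
a) from above boils down to roughly the sketch below.  Everything named
mydrv_* is invented for illustration, and the details (the M_FLOWID
check, the fallback, the locking granularity) will certainly look
different in real drivers.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/socket.h>
#include <sys/mbuf.h>
#include <sys/pcpu.h>
#include <net/if.h>
#include <net/if_var.h>

struct mydrv_ring {
        struct mtx      r_mtx;          /* protects this DMA ring */
        /* descriptor ring state ... */
};

struct mydrv_softc {
        struct mydrv_ring *sc_rings;    /* N rings, one lock each */
        u_int              sc_nrings;
};

static int mydrv_ring_enqueue(struct mydrv_ring *, struct mbuf *);

static int
mydrv_transmit(struct ifnet *ifp, struct mbuf *m)
{
        struct mydrv_softc *sc = ifp->if_softc;
        struct mydrv_ring *r;
        u_int idx;
        int error;

        if (m->m_flags & M_FLOWID)              /* stack provided a flowid */
                idx = m->m_pkthdr.flowid % sc->sc_nrings;
        else
                idx = curcpu % sc->sc_nrings;   /* fall back to current CPU */
        r = &sc->sc_rings[idx];

        mtx_lock(&r->r_mtx);                    /* contention spread over N rings */
        error = mydrv_ring_enqueue(r, m);       /* descriptor setup, doorbell */
        mtx_unlock(&r->r_mtx);

        /*
         * Variant b) would instead pick the ring by curcpu inside
         * critical_enter()/critical_exit() (the critnest++ mentioned
         * above) and drive a strictly per-core ring without the mutex.
         */
        return (error);
}

Nothing above the driver has to know whether there is one ring or
sixty-four behind (*if_transmit); keeping that choice down in the driver
is the whole point.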
By all means feel free to bring up your own ideas and experiences from
other implementations as well, either in public or in private.  I'm more
than happy to discuss and include other ideas.  In the end the cold hard
numbers and the suitability for a general-purpose OS will decide.

My goal is to be good to very good in more than 90% of all common use
cases, while providing all the necessary knobs (be it in the form of KLDs
with a well-defined API) to push particular workloads to the full 99.9%.

-- 
Andre