From: Andre Oppermann <andre@freebsd.org>
Date: Tue, 29 Oct 2013 21:41:47 +0100
To: Luigi Rizzo
Cc: Randall Stewart, freebsd-net@freebsd.org
Subject: Re: MQ Patch.
Message-ID: <52701D8B.8050907@freebsd.org>

Let me jump in here and explain roughly the ideas/path I'm exploring in
creating and eventually implementing a big picture for drivers, queues,
queue management, various QoS and so on.

Situation: we're still mostly based on the old 4.4BSD IFQ model, and the
couple of work-arounds we have in tree (sndring, drbr) plus the bit-rotten
ALTQ aren't helpful at all.

Steps:

1. Take the soft-queuing method out of the ifnet layer and make it a
property of the driver, so that the upper stack (or actually the protocol
L3/L2 mapping/encapsulation layer) calls (*if_transmit) without any queuing
at that point. It is then up to the driver to decide how it multiplexes
multi-core access to its queue(s) and how they are configured. Some
hardware supports multiple queues, and some even supports WFQ models among
these queues in hardware; in that case any soft-queue layer would be
omitted. For the other cases the kernel will provide one or two proven and
optimized soft-queue and multi-writer access implementations for the
drivers to use. Drivers should avoid having their own soft-queue
implementations, but they can if they really want to.

2. Make flowids (or hashes) an integral part of the network stack. The
mbuf header fully supports it. If the hardware provides a flowid (Toeplitz,
for example), use it; otherwise compute a hash a bit up the stack for
incoming packets. Outgoing packets get their hash based on the inpcb or
whatever. In- and outbound directions are totally separate and don't have
to use the same hash; it only has to be constant within a flow. In theory
it could be randomly chosen at flow setup time (e.g. tcp connect). This way
the load can be distributed among multiple hw queues, or among interfaces
in the case of lagg(4), with a single mbuf header lookup. When we can make
sure that every packet has a flowid, many things become possible and even
easy. Again, drivers should not invent their own software implementations;
they should rely on the kernel to provide it.
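To make step 2 concrete, here is a minimal, self-contained sketch (an
illustration, not code from the patch or the tree) of how a driver could
pick a TX ring from a per-packet flowid, with a cheap software hash as a
fallback when the NIC supplies none. All names here (my_pkt, my_select_txq,
MY_NTXQ, fallback_hash) are invented; in the real stack the id would live
in m->m_pkthdr.flowid.

/*
 * Sketch only: one cheap lookup maps a per-flow id onto a hardware TX
 * ring, so packets of one flow never reorder across rings.
 */
#include <stdint.h>

#define MY_NTXQ 8		/* number of hardware TX rings, driver-specific */

struct my_pkt {
	uint32_t flowid;	/* hw Toeplitz hash, or stack-computed */
	int	 flowid_valid;	/* did anyone set it? */
	uint32_t saddr, daddr;	/* addresses, for the fallback hash */
	uint16_t sport, dport;	/* ports, for the fallback hash */
};

/* Cheap software fallback when the NIC did not supply a hash. */
static uint32_t
fallback_hash(const struct my_pkt *p)
{
	uint32_t h;

	h = p->saddr ^ p->daddr ^ ((uint32_t)p->sport << 16 | p->dport);
	h ^= h >> 16;
	h *= 0x85ebca6b;	/* mixing constant borrowed from murmur3 */
	h ^= h >> 13;
	return (h);
}

/* Constant per flow: the same flow always lands on the same ring. */
static int
my_select_txq(struct my_pkt *p)
{
	if (!p->flowid_valid) {
		p->flowid = fallback_hash(p);
		p->flowid_valid = 1;
	}
	return (p->flowid % MY_NTXQ);
}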
3. Make QoS/CoS an integral part of the network stack. The first step is
done with the qoscos field in the mbuf header. It is eight bits wide and
its use/semantics haven't been fully established yet. However, the idea is
to have a classifier tag the packet when it enters the network stack,
either by coming in on an interface or by being generated within the
stack. The qoscos tag can be taken from layer 2 information (vlan header)
or chosen based on more complex rules through a packet filter such as ipfw,
pf or ipf. There won't be any separate classifier as in ALTQ anymore; this
is also the path OpenBSD has taken. Depending on the ingress/egress
encapsulation, the range of qos/cos information may be more limited than
the 8 bits we have in the mbuf header. In that case the larger range has
to be mapped into the smaller range by putting neighboring bins together
(a rough sketch of such a mapping follows after step 4). This is how it is
done in routers and routing switches by various vendors. The administrator
decides how the mapping is done and where it is taken from.

4. Adjust the stack and drivers to do all of the above and to optimally
make use of the hardware capabilities. If the hardware supports multi-queue
and SP/WFQ at once (e.g. ixgbe(4)), then there is no need for any
soft-queuing. Otherwise the various queuing and queue management
disciplines will hook into (*if_transmit) and do their magic before the
packet reaches the DMA ring. To reach this level a bit of infrastructure
work has to be done first; for example, the DMA ring depth needs to be
adjustable through a generic mechanism for all drivers, and the new ALTQ
should be able to hook into the driver's TX completion interrupt to clock
out the packets.
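The bin mapping mentioned in step 3 might look like the minimal sketch
below (illustrative only: the 8-bit qoscos field is the proposal above, the
3-bit range is the 802.1Q PCP, and the function names are invented):

/*
 * Collapse a (proposed) 8-bit mbuf qoscos value into the 3-bit 802.1Q
 * PCP range by merging neighboring bins, and expand it again on ingress.
 * An administrator-supplied table could replace the plain shift.
 */
#include <stdint.h>

static inline uint8_t
qoscos_to_pcp(uint8_t qoscos)
{
	/* 256 bins -> 8 bins: values 0..31 share PCP 0, and so on. */
	return (qoscos >> 5);
}

static inline uint8_t
pcp_to_qoscos(uint8_t pcp)
{
	/* 8 bins -> 256 bins: place each class mid-bin (16, 48, ...). */
	return ((uint8_t)(pcp << 5) | 0x10);
}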
This should give a rough outline of the path(s) to be explored in the next
weeks.

-- 
Andre

On 29.10.2013 20:58, Luigi Rizzo wrote:
> my short, top-post comment is that I'd rather see some more
> coordination with Andre, and especially some high-level README
> or other form of documentation explaining the architecture
> you have in mind before this goes in.
>
> To expand my point of view (and please do not read me as negative,
> i am trying to be constructive, to avoid future troubles, and i
> volunteer to help with the design and implementation):
>
> (i'll omit issues re. style and unrelated patches in the diff
> because they are premature)
>
> 1. Having multiple separate software queues attached to a physical
> queue makes sense only if we have a clear and documented plan for
> scheduling traffic from these queues into the hw one. Otherwise it
> ends up being just another confusing hack that makes it difficult
> to reason about device drivers.
>
> We already have something similar now (the drbr queue on top, used
> in some cases when the hw ring overflows) and the ALTQ hooks, and
> without documentation this does not seem to improve the current
> situation.
>
> 2. QoS is not just priority scheduling or AQM a la RED/CoDel/PI,
> but a coherent framework where you can classify/partition traffic
> into separate queues and apply one of several queue management
> (taildrop/RED/CoDel/whatever) and scheduling (which queue to serve
> next) policies in an efficient way.
>
> Linux mostly gets this right (they even support hierarchical
> schedulers).
>
> Dummynet has a reasonable architecture, although not hierarchical,
> and it operates at the IP level (or possibly at layer 2), which is
> probably too high (but not necessarily). We can also recycle the
> components, i.e. the classifier in ipfw and the scheduling
> algorithms. I am happy to help on this.
>
> ALTQ is too old, complex, inefficient and unmaintained to be
> considered.
>
> And i cannot comment on your code because you don't really explain
> what you want to do and how. CoDel/PI are only queue management,
> not QoS; and strict priority is just one (and probably the worst)
> policy one can have.
>
> One comment i can make, however, is that 256 queues are way too few
> for a proper system. You need the number to be dynamic and much
> larger (e.g. using the flowid as a key).
>
> So, to conclude: i fully support any plan to design something that
> lets us implement scheduling (and QoS, if you want to call it this
> way) in a reasonable way, but what is in your patch now does not
> really seem to improve the current situation in any way.
>
> cheers
> luigi
>
> On Tue, Oct 29, 2013 at 11:30 AM, Andre Oppermann
> <andre@freebsd.org> wrote:
>
> > On 29.10.2013 11:50, Randall Stewart wrote:
> > > Hi:
> > >
> > > As discussed at vBSDcon with andre/emaste and gnn, I am sending
> > > this patch out to all of you ;-)
> >
> > I wasn't at vBSDcon but it's good that you're sending it (again). ;)
> >
> > > I have previously sent it to gnn, andre, jhb, rwatson, and
> > > several other of the usual suspects (as gnn put it) and received
> > > dead silence.
> >
> > Sorry 'bout that. Too many things going on recently.
> >
> > > What does this patch do?
> > >
> > > Well, it adds the ability to do multi-queue at the driver level.
> > > Basically any driver that uses the new interface gets under it N
> > > queues (default is 8) for each physical transmit ring it has.
> > > The driver picks up its queue 0 first, then queue 1 .. up to the
> > > max.
> >
> > To make sure I understand this correctly: there are 8 soft-queues
> > for each real transmit ring, correct? And the driver will dequeue
> > the lowest-numbered queue for as long as there are packets in it.
> > Like a hierarchical strict queuing discipline.
> >
> > This is prone to head-of-line blocking and starvation by higher
> > priority queues. It may become a big problem under adverse traffic
> > patterns.
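(To make the starvation point concrete, here is a toy model of the
lowest-queue-first dequeue loop described above; illustrative only, not
code from the patch. If queue 0 keeps receiving packets, queues 1..N-1 are
never served.)

/*
 * Strict-priority dequeue across N soft-queues: queue 0 is drained
 * first, then queue 1, and so on.  A persistently busy queue 0 starves
 * everything below it.
 */
#include <stddef.h>

#define NQUEUES 8

struct pkt { struct pkt *next; };

struct softq { struct pkt *head; };

/* Next packet to hand to the hardware ring, or NULL if all empty. */
static struct pkt *
strict_prio_dequeue(struct softq q[NQUEUES])
{
	for (int i = 0; i < NQUEUES; i++) {
		if (q[i].head != NULL) {
			struct pkt *p = q[i].head;
			q[i].head = p->next;
			return (p);	/* queues i+1..N-1 not even looked at */
		}
	}
	return (NULL);
}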
> > > This allows you to prioritize packets. Also in here is the start
> > > of some work I will be doing for AQM.. think either PI or CoDel ;-)
> > >
> > > Right now that's pretty simple and is just (in a few drivers) the
> > > ability to limit the amount of data on the ring… which can help
> > > reduce buffer bloat. That needs to be refined into a lot more.
> >
> > We actually have two queues, the soft-queue and the hardware ring,
> > which both can be rather large, leading to various issues as you
> > mention.
> >
> > I've started work on an FF contract to rethink the whole IFQ*
> > model and to propose and benchmark different approaches, and after
> > that to convert all drivers in the tree to the chosen model(s) and
> > get rid of the legacy. In general the choice of model will be done
> > in the driver and no longer by the ifnet layer. One or (most
> > likely) more optimized models will be provided by the kernel for
> > drivers to choose from. The idea is that most, if not all, drivers
> > use these standard kernel-provided models to avoid code
> > duplication. However, as the pace of new features is quite high,
> > drivers retain full discretion to choose and experiment with their
> > own ways of dealing with it, under the assumption that once a new
> > model has been found it is later moved to the kernel side and
> > subsequently used by other drivers as well.
> >
> > > This work is donated by Adara Networks and has been discussed in
> > > several of the past vendor summits.
> > >
> > > I plan on committing this before the IETF unless I hear major
> > > objections.
> >
> > There seem to be a couple of whitespace issues, where first there
> > is a tab and then actual whitespace for the second indent, and
> > others all over the place.
> >
> > There seem to be a number of unrelated changes in
> > sys/dev/cesa/cesa.c, sys/dev/fdt/fdt_common.c,
> > sys/dev/fdt/simplebus.c, sys/kern/subr_bus.c and
> > usr.sbin/ofwdump/ofwdump.c.
> >
> > It would be good to separate out the soft multi-queue changes from
> > the ring depth changes and do each in at least one commit.
> >
> > There are two separate changes to sys/dev/oce/: one is a renaming
> > of the lock macros, the other is the change to drbr.
> >
> > The changes to sys/kern/subr_bufring.c are not style compliant,
> > and we normally don't use Linux "wb()" barriers in FreeBSD native
> > code. The atomic_* operations should be used instead.
> >
> > Why would we need a multi-consumer dequeue?
> >
> > The new bufring functions do, on a first glance, seem to be safe
> > on architectures with a more relaxed memory ordering / cache
> > coherency model than x86.
> >
> > The atomic dance in a number of drbr_* functions doesn't seem to
> > make much sense, and a single spin-lock may result in fewer atomic
> > operations and bus lock cycles.
> >
> > There is a huge amount of include pollution in sys/net/drbr.h,
> > which we are currently trying to get rid of and to avoid in the
> > future.
> >
> > I like the general conceptual approach but the implementation
> > feels bumpy and not (yet) ready for prime time. In any case I'd
> > like to take the conceptual parts forward for the FF-sponsored
> > IFQ* rework.
> >
> > --
> > Andre
>
> --
> -----------------------------------------+-------------------------------
> Prof. Luigi RIZZO, rizzo@iet.unipi.it  . Dip. di Ing. dell'Informazione
> http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
> TEL +39-050-2211611                    . via Diotisalvi 2
> Mobile +39-338-6809875                 . 56122 PISA (Italy)
> -----------------------------------------+-------------------------------
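As an aside on the bufring/barrier remark above: the usual alternative to
an explicit Linux-style write barrier is to publish the producer index with
release semantics. A minimal sketch using C11 atomics follows; the kernel
analogues would be the atomic(9) release/acquire operations, and the ring
layout here is invented for illustration, not taken from subr_bufring.c.

/*
 * Illustrative single-producer ring enqueue: the slot write must become
 * visible before the new tail index is published, which the release
 * store guarantees without a separate barrier.
 */
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 256			/* power of two so the index mask works */

struct ring {
	void        *slot[RING_SIZE];
	atomic_uint  tail;		/* producer index, published with release */
	atomic_uint  head;		/* consumer index, advanced by the consumer */
};

static bool
ring_enqueue(struct ring *r, void *item)
{
	unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);

	if (t - h == RING_SIZE)
		return (false);		/* ring is full */
	r->slot[t & (RING_SIZE - 1)] = item;
	/* Release store: orders the slot write before the new tail value. */
	atomic_store_explicit(&r->tail, t + 1, memory_order_release);
	return (true);
}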