Date:      Wed, 30 Oct 2013 22:24:07 +0100
From:      Andre Oppermann <andre@freebsd.org>
To:        Adrian Chadd <adrian@freebsd.org>
Cc:        "freebsd-net@freebsd.org" <net@freebsd.org>, Luigi Rizzo <rizzo@iet.unipi.it>, Navdeep Parhar <np@freebsd.org>, Randall Stewart <rrs@lakerest.net>
Subject:   Re: MQ Patch.
Message-ID:  <527178F7.1070800@freebsd.org>
In-Reply-To: <CAJ-VmonW=LQ32_XNP0GnQ=gehLO0Lf8APPHF5jpT-SjRGSw7MQ@mail.gmail.com>
References:  <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>	<526FFED9.1070704@freebsd.org>	<CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>	<52701D8B.8050907@freebsd.org>	<527022AC.4030502@FreeBSD.org>	<527027CE.5040806@freebsd.org>	<5270309E.5090403@FreeBSD.org>	<5270462B.8050305@freebsd.org>	<CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>	<5270F101.6020701@freebsd.org> <CAJ-VmonW=LQ32_XNP0GnQ=gehLO0Lf8APPHF5jpT-SjRGSw7MQ@mail.gmail.com>

On 30.10.2013 18:48, Adrian Chadd wrote:
> On 30 October 2013 04:44, Andre Oppermann <andre@freebsd.org> wrote:
>
>>> We can't assume the hardware has deep queues _and_ we can't just hand
>>> packets to the DMA engine.
>>
>>>
>>>
>>> Why?
>>>
>>> Because once you've pushed it into the transmit ring, you can't
>>> guarantee / impose any ordering on things. You can't guarantee that
>>> you can abort a frame that has been queued because it now breaks the
>>> queue rules.
>>>
>>> That's why we don't want to just have a light wrapper around hardware
>>> transmit queues. We give up way too much useful control.
>>
>>
>> The stack can't possibly know about all these differences in current
>> and future technologies and requirements.  That's why this decision
>> should be pushed into the L3/L2 mapping/encapsulation and driver layer.
>
> That's why you split it.
>
> You allow the upper layers (things like altq) to track things like
> per-IP, per-traffic-class traffic and tag things appropriately.

Any QoS scheme is split into two distinct steps: a) the classifier;
b) the queuing and packet scheduler.

The classification is taken out of ifnet/IFQ* entirely and done a)
through a packet filter (ipfw, pf, ipf); b) from the PCB if the packet
is locally generated; or c) on ingress from the VLAN or IP header.  The
last, for example, is typical of MPLS networks, where classification
happens only at the edges, and is the way all the brand-name routers
work, with the option of doing a) as well.
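
A minimal sketch of case c) in C, assuming IPv4 (the helper name and
placement are made up, not the actual ipfw/pf code):

#include <sys/param.h>
#include <sys/mbuf.h>
#include <netinet/in.h>
#include <netinet/ip.h>

/*
 * Map an ingress IPv4 packet to a class index from its DSCP field.
 * Case b) would take the class from the PCB instead.
 */
static int
qos_classify_ipv4(struct mbuf **mp)
{
        struct ip *ip;

        if ((*mp)->m_len < sizeof(struct ip) &&
            (*mp = m_pullup(*mp, sizeof(struct ip))) == NULL)
                return (-1);            /* chain was freed by m_pullup() */
        ip = mtod(*mp, struct ip *);
        return (ip->ip_tos >> 2);       /* DSCP is the upper 6 TOS bits */
}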

The queuing and scheduling happen after the L3/L2 mapping/encapsulation
and before the packets are put onto the DMA ring.  Please note that this
is somewhat independent of additional pre-DMA queuing as in ieee80211
and comes before it.

> You then let some software queue implement the queue discipline and
> only drain frames to the hardware at a rate that's fast enough to keep
> up with the hardware, and no faster.

For a QoS queue/scheduler to be fully effective the DMA ring should
be as small as reasonable: just deep enough to keep the interface
busy, and no deeper.  All queuing then happens in software with
appropriately sized queues.
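
In driver terms the idea looks roughly like this (a sketch only; all
xx_* names are made up, and xx_sched_dequeue() is sketched further
below):

/*
 * Refill the DMA ring from the software scheduler only up to a small
 * watermark.  A late-arriving high-priority packet then still wins at
 * the next scheduler run instead of sitting behind a deep hardware
 * queue.
 */
#define XX_TX_WATERMARK 32              /* barely enough to stay busy */

static void
xx_txq_drain(struct xx_txring *txr)
{
        struct mbuf *m;

        mtx_assert(&txr->mtx, MA_OWNED);
        while (txr->inflight < XX_TX_WATERMARK &&
            (m = xx_sched_dequeue(txr->sched)) != NULL) {
                xx_encap(txr, m);       /* place onto the DMA ring */
                txr->inflight++;        /* decremented at TX completion */
        }
}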

> Why?
>
> Because if you have new traffic come along from a new client, it may
> be higher priority than the traffic queued to the hardware. But it
> may be at the same QoS level as what's currently queued, or map to
> the same physical queue.

When a packet has been handed to the DMA ring there's no stopping it
anymore and the order is fixed.  That's why in a QoS setup the DMA
ring should be as small as it can be to barely keep the interface
busy.  Everything else happens in software and is subject to packet
scheduler decisions.  If a higher-priority packet arrives before the
next packet scheduler run, it will be dequeued first (subject to WFQ
or other fair scheduling disciplines to prevent total starvation).
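
A simplified DRR-style dequeue illustrates the starvation protection;
the per-class quantum acts as the weight (again a sketch with made-up
names, not code from any existing scheduler):

static struct mbuf *
xx_sched_dequeue(struct xx_sched *s)
{
        struct xx_class *cl;
        struct mbuf *m;
        int i, n;

        for (n = 0; n < s->nclasses; n++) {
                i = s->next;
                s->next = (s->next + 1) % s->nclasses;
                cl = &s->classes[i];
                if (cl->qlen == 0) {
                        cl->deficit = 0;        /* idle classes save nothing */
                        continue;
                }
                cl->deficit += cl->quantum;     /* weight, in bytes per round */
                m = xx_class_head(cl);          /* peek at the head packet */
                if (m->m_pkthdr.len <= cl->deficit) {
                        cl->deficit -= m->m_pkthdr.len;
                        return (xx_class_dequeue(cl));
                }
                /* Head packet exceeds this round's credit; next round. */
        }
        return (NULL);
}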

You may find this presentation I did some time back at SWINOG helpful:
http://www.networx.ch/Understanding%20QoS%20by%20Andre%20Oppermann%20-%2020090402.pdf

When QoS is active there can be only one active DMA ring per interface
unless the hardware supports the necessary scheduling discipline among
the DMA rings.  Most multi DMA ring NICs employ a simple round-robin
algorithm on a per-packet basis.  With TSO these packets can be very
large.  Any such multi DMA ring setup would render any software QoS
attempts futile.  Hence only one DMA ring can be used/active with QoS.

As far as I'm aware the only NIC that officially supports multiple
DMA rings with WFQ among them is the Intel one driven by ixgbe(4).
Other 10G cards may support it but their datasheets are not public.

> So yes, we do need that split for a lot of cases. There will be
> bare-metal cases for very low latency, but if we implement the
> correct queue API here it'll just collapse down to either NULL, or
> just the existing software queue in front of the DMA rings to avoid
> locking overhead.

The L3/L2 mapping/encapsulation step may or may not need any locking
depending on what it has to do.  However its locking requirements may
be totally different from the DMA ring protection.

If there is no QoS enabled/active on an interface, the packet goes
straight through to the driver after the L3/L2 step.  If there are
multiple DMA rings, the driver looks at the flowid field in the mbuf
header and selects one of them.  These DMA rings naturally have to be
protected by a (spin) lock to prevent concurrent access by multiple
cores.  As long as the DMA rings are sufficiently deep, software
queuing doesn't happen unless there is contention.
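
In code that fast path looks roughly like this (the xx_* names are
made up; M_FLOWID, drbr_enqueue() and mtx_trylock() are the real
kernel primitives, usual network headers assumed):

static int
xx_transmit(struct ifnet *ifp, struct mbuf *m)
{
        struct xx_softc *sc = ifp->if_softc;
        struct xx_txring *txr;
        int error, i = 0;

        /* Spread flows across the rings using the stack-set flowid. */
        if (m->m_flags & M_FLOWID)
                i = m->m_pkthdr.flowid % sc->num_tx_rings;
        txr = &sc->tx_rings[i];

        if (mtx_trylock(&txr->mtx)) {
                /* Uncontended: put the packet straight onto the DMA ring. */
                error = xx_encap(txr, m);
                mtx_unlock(&txr->mtx);
                return (error);
        }
        /* Contended: park it in the per-ring software queue; the
         * current lock holder drains that queue before unlocking. */
        return (drbr_enqueue(ifp, txr->br, m));
}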

-- 
Andre



