Date: Wed, 30 Oct 2013 22:24:07 +0100
From: Andre Oppermann <andre@freebsd.org>
To: Adrian Chadd <adrian@freebsd.org>
Cc: "freebsd-net@freebsd.org" <net@freebsd.org>, Luigi Rizzo <rizzo@iet.unipi.it>, Navdeep Parhar <np@freebsd.org>, Randall Stewart <rrs@lakerest.net>
Subject: Re: MQ Patch.
Message-ID: <527178F7.1070800@freebsd.org>
In-Reply-To: <CAJ-VmonW=LQ32_XNP0GnQ=gehLO0Lf8APPHF5jpT-SjRGSw7MQ@mail.gmail.com>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com> <5270F101.6020701@freebsd.org>
On 30.10.2013 18:48, Adrian Chadd wrote:
> On 30 October 2013 04:44, Andre Oppermann <andre@freebsd.org> wrote:
>
>>> We can't assume the hardware has deep queues _and_ we can't just hand
>>> packets to the DMA engine.
>>>
>>> Why?
>>>
>>> Because once you've pushed it into the transmit ring, you can't
>>> guarantee / impose any ordering on things. You can't guarantee that
>>> you can abort a frame that has been queued because it now breaks the
>>> queue rules.
>>>
>>> That's why we don't want to just have a light wrapper around hardware
>>> transmit queues. We give up way too much useful control.
>>
>> The stack can't possibly know about all these differences in current
>> and future technologies and requirements. That's why this decision
>> should be pushed into the L3/L2 mapping/encapsulation and driver layer.
>
> That's why you split it.
>
> You allow the upper layers (things like altq) to track things like
> per-IP, per-traffic-class traffic and tag things appropriately.

Any QoS scheme is split into two distinct steps: a) the classifier;
and b) the queuing and packet scheduler.

Classification is taken entirely out of ifnet/IFQ* and is done
a) through a packet filter (ipfw, pf, ipf); b) from the PCB if the
packet is locally generated; or c) for ingress packets, from a VLAN
or IP header. The last, for example, is typical in MPLS networks,
where classification happens only at the edges; that is also how all
brand-name routers work, with the option of doing a) as well.

The queuing and scheduling happen after L3/L2 mapping/encapsulation
and before the packets are put onto the DMA ring. Please note that
this is somewhat independent of additional pre-DMA queuing as in
ieee80211 and comes before it.

> You then let some software queue implement the queue discipline and
> only drain frames to the hardware at a rate that's fast enough to keep
> up with the hardware, and no faster.
For a QoS queue/scheduler to be fully effective the DMA ring should be
as small as reasonably possible while still keeping the interface busy,
but no larger. All queuing then happens in software with appropriately
sized queues.

> Why?
>
> Because if you have new traffic come along from a new client, it may
> be higher priority than the traffic queued to the hardware. But it's
> at the same QoS level as what's currently queued to the hardware, or
> maps to the same physical queue.

Once a packet has been handed to the DMA ring there is no stopping it
anymore and its order is fixed. That's why in a QoS setup the DMA ring
should be as small as it can be while barely keeping the interface
busy. Everything else happens in software and is subject to packet
scheduler decisions. If a higher-priority packet arrives before the
next packet scheduler run it will be dequeued first (subject to WFQ or
other fair scheduling disciplines to prevent total starvation).

You may find this presentation I did some time back at SWINOG helpful:
http://www.networx.ch/Understanding%20QoS%20by%20Andre%20Oppermann%20-%2020090402.pdf

When QoS is active there can be only one active DMA ring per interface
unless the hardware supports the necessary scheduling discipline among
the DMA rings. Most multi-DMA-ring NICs employ a simple round-robin
algorithm among the rings on a per-packet basis, and with TSO these
packets can be very large. Any such multi-ring setup would render any
software QoS attempt futile; hence only one DMA ring can be used/active
with QoS. As far as I'm aware the only NIC that officially supports
multiple DMA rings including WFQ among them is the Intel ixgbe(4).
Other 10G cards may support it but their datasheets are not public.

> So yes, we do need that split for a lot of cases. There will be
> bare-metal cases for highly low latency but if we implement the
> correct queue API here it'll just collapse down to either NULL, or
> just the existing software queue in front of the DMA rings to avoid
> locking overhead.
The L3/L2 mapping/encapsulation step may or may not need any locking
depending on what it has to do. However, its locking requirements may
be totally different from those protecting the DMA ring.

If no QoS is enabled/active on an interface, the packet goes straight
through to the driver after the L3/L2 step. If there are multiple DMA
rings the driver looks at the flowid field in the mbuf header and
selects one of the DMA rings. These DMA rings naturally have to be
protected by a (spin) lock to prevent concurrent access by multiple
cores. Unless there is contention, software queuing doesn't happen and
the DMA rings are sufficiently deep.

-- 
Andre