Date:      Wed, 30 Oct 2013 06:00:56 +0100
From:      Luigi Rizzo <rizzo@iet.unipi.it>
To:        Adrian Chadd <adrian@freebsd.org>, Andre Oppermann <andre@freebsd.org>, Navdeep Parhar <np@freebsd.org>, Randall Stewart <rrs@lakerest.net>, "freebsd-net@freebsd.org" <net@freebsd.org>
Subject:   [long] Network stack -> NIC flow (was Re: MQ Patch.)
Message-ID:  <20131030050056.GA84368@onelab2.iet.unipi.it>
In-Reply-To: <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
References:  <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>

On Tue, Oct 29, 2013 at 06:43:21PM -0700, Adrian Chadd wrote:
> Hi,
> 
> We can't assume the hardware has deep queues _and_ we can't just hand
> packets to the DMA engine.
> [Adrian explains why]

I have the feeling that the various folks who stepped into this
discussion have completely different (and orthogonal) goals,
and as such these goals should be discussed separately.

Below is the architecture I have in mind and how I would implement it
(and it would be extremely simple, since we have most of the pieces
in place).

It would be useful if people could discuss what problem they are
addressing before coming up with patches.

---

The architecture I think we should pursue is this (which happens to be
what Linux implements, and also what dummynet implements, except
that the output is to a dummynet pipe or to ether_output() or to
ip_output() depending on the configuration):

   1. multiple (one per core) concurrent transmitters t_c

	which use ether_output_frame() to send to

   2. multiple disjoint queues Q_i
	(one per traffic group; there can be *a lot*, say 10^6)

	which are scheduled with a scheduler S
        (iterate step 2 for hierarchical schedulers)
	and

   3. eventually feed ONE transmit ring R_j on the NIC.
	Once a packet reaches R_j, for all practical purposes
	it is on the wire. We cannot intercept extractions,
	we cannot interfere with the scheduler in the NIC in
	case of multiqueue NICs. The most we can do (and should,
	as in Linux) is notify the owner of the packet once its
	transmission is complete.
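
To fix ideas, here is a minimal C sketch of this path (all names are
hypothetical; none of this is an existing API). A sender classifies a
packet into its queue Q_i, and the scheduler S drains the queues into
the ring R_j while the ring has room:

	struct pkt;
	struct txq;                     /* Q_i: per-class software queue */
	struct nic_ring;                /* R_j: hardware transmit ring */
	struct sched;                   /* S: picks which Q_i to serve */

	struct txq  *classify(struct pkt *p);             /* -> Q_i */
	int          q_enqueue(struct txq *q, struct pkt *p);
	struct pkt  *sched_next(struct sched *s);         /* S picks a packet */
	int          ring_has_room(struct nic_ring *r);
	void         ring_transmit(struct nic_ring *r, struct pkt *p);

	/* step 1: each transmitter t_c runs this concurrently */
	static int
	xmit_path(struct sched *s, struct nic_ring *r, struct pkt *p)
	{
		struct txq *q = classify(p);      /* step 2: pick Q_i */

		if (q_enqueue(q, p) != 0)
			return (-1);              /* queue policy dropped it */
		/* steps 2 -> 3: feed the ring while it has room */
		while (ring_has_room(r) && (p = sched_next(s)) != NULL)
			ring_transmit(r, p);      /* now it is "on the wire" */
		return (0);
	}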

Just to set the terminology:
QUEUE MANAGEMENT POLICY refers to decisions that we make when we INSERT
	or EXTRACT packets from a queue WITHOUT LOOKING AT OTHER QUEUES.
	This is what implements DROPTAIL (also improperly called FIFO),
	RED, CODEL. Note that for CODEL you need to intercept extractions
	from the queue, whereas DROPTAIL and RED only act on
	insertions.

SCHEDULER is the entity which decides which queue to serve among
	the many possible ones. It is called on INSERTIONS and
	EXTRACTIONS from a queue, and passes packets to the NIC's queue.
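
In C terms, the distinction could be captured by two small (hypothetical)
interfaces: a queue management policy only gets hooks on its own queue,
and only a policy like CODEL needs the extraction hook, whereas the
scheduler's only job is to pick which queue to serve:

	struct queue;
	struct pkt;

	struct queue_policy {
		/* called on insertion: DROPTAIL and RED decide drops here */
		int          (*enqueue)(struct queue *q, struct pkt *p);
		/* called on extraction: CODEL needs this hook */
		struct pkt  *(*dequeue)(struct queue *q);
	};

	struct sched_ops {
		/* called on insertions and extractions: pick the next
		 * queue to serve, without touching the queues' contents */
		struct queue *(*pick)(void *sched_state);
	};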

The decision on which queue and ring (Q_i and R_j) to use should be made
by a classifier at the beginning of step 2 (or once per iteration,
if using a hierarchical scheduler). Of course they can be precomputed
(e.g. with annotations in the mbuf coming from the socket).

Now when it comes to implementing the above, we have three
cases (or different optimization levels, if you like)

-- 1. THE SIMPLE CASE ---

In the simplest possible case we can let the NIC do everything.
Necessary conditions are:
- queue management policies acting only on insertions
  (e.g. DROPTAIL or RED or similar);
- # of traffic classes <= # of NIC rings
- scheduling policy S equal to the one implemented in the NIC
  (trivial case: one queue, one ring, no scheduler)

All these cases match exactly what the hardware provides, so we can just
use the NIC ring(s) without extra queue(s), and possibly use something
like buf_ring to manage insertions (but note that insertions into
an empty queue will end up requiring a lock; and I think the
same happens even now with the extra drbr queue in front of the ring).
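
A userspace model of that locking pattern (hypothetical names, not
buf_ring's actual API, and glossing over the recheck a real driver
needs after draining): producers enqueue without the ring lock, and
only the one that finds the ring idle takes the lock to start the
hardware:

	#include <pthread.h>
	#include <stdatomic.h>

	struct pkt;

	struct nic_ring {
		pthread_mutex_t lock;      /* protects descriptors/doorbell */
		atomic_int      pending;   /* enqueued but not yet posted */
	};

	/* placeholders: lock-free software queue and the hardware side
	 * (hw_drain() would decrement 'pending' as it posts packets) */
	void swq_enqueue(struct nic_ring *r, struct pkt *p);
	void hw_drain(struct nic_ring *r);

	static void
	simple_transmit(struct nic_ring *r, struct pkt *p)
	{
		swq_enqueue(r, p);              /* lock-free, buf_ring-style */
		/* only the 0 -> 1 transition needs the ring lock */
		if (atomic_fetch_add(&r->pending, 1) == 0) {
			pthread_mutex_lock(&r->lock);
			hw_drain(r);
			pthread_mutex_unlock(&r->lock);
		}
	}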


-- 2. THE INTERMEDIATE CASE ---

If we do not care about a scheduler but want a more complex QUEUE
MANAGEMENT, such as CODEL, that acts on extractions, we _must_
implement an intermediate queue Q_i before the NIC ring.  This is
our only chance to act on extractions from the queue (which CODEL
requires).  Note that we DO NOT NEED to create multiple queues for
each ring.
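
A drastically simplified sketch of why the extraction hook matters
(this is the flavour of CODEL, not the real control law; all names are
made up): the policy looks at how long the packet it is about to hand
to the NIC has been sitting in Q_i, and starts dropping when the
sojourn time stays above a target for too long:

	#include <stdint.h>
	#include <stdlib.h>

	struct pkt {
		uint64_t    enq_ns;        /* stamped on insertion */
		struct pkt *next;
	};

	struct iqueue {                    /* intermediate queue Q_i */
		struct pkt *head, *tail;
		uint64_t    above_since;   /* 0 = sojourn time below target */
	};

	#define TARGET_NS   (5   * 1000000ULL)    /* 5 ms */
	#define INTERVAL_NS (100 * 1000000ULL)    /* 100 ms */

	/* called on extraction, i.e. when the NIC ring has room */
	static struct pkt *
	iqueue_dequeue(struct iqueue *q, uint64_t now)
	{
		struct pkt *p;

		while ((p = q->head) != NULL) {
			q->head = p->next;
			if (q->head == NULL)
				q->tail = NULL;
			if (now - p->enq_ns < TARGET_NS) {
				q->above_since = 0;      /* queue is healthy */
				return (p);
			}
			if (q->above_since == 0)
				q->above_since = now;
			if (now - q->above_since < INTERVAL_NS)
				return (p);     /* above target, not yet for long */
			free(p);                /* persistently above: drop */
		}
		return (NULL);
	}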

-- 3. THE COMPLETE CASE ---

This is when the scheduler we want (DRR, WFQ variants, PRIORITY...)
is not implemented in the NIC, or we have more traffic classes than
the NIC has rings. In this case we need to invoke this extra
block before passing packets to the NIC.
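
For concreteness, this is roughly what one such scheduler, DRR (which
dummynet provides), boils down to. Stand-alone sketch with hypothetical
names, using a plain array of queues instead of an active list:

	#include <stddef.h>

	struct pkt {
		size_t      len;           /* frame length in bytes */
		struct pkt *next;
	};

	struct drr_queue {
		struct pkt *head;          /* FIFO of backlogged packets */
		size_t      quantum;       /* bytes per round, ~ weight */
		size_t      deficit;       /* credit carried across rounds */
	};

	/* one round: hand up to a quantum's worth from each queue to R_j */
	static void
	drr_round(struct drr_queue *q, int nqueues, void (*xmit)(struct pkt *))
	{
		for (int i = 0; i < nqueues; i++, q++) {
			if (q->head == NULL)
				continue;              /* not backlogged */
			q->deficit += q->quantum;      /* new opportunity */
			while (q->head != NULL && q->head->len <= q->deficit) {
				struct pkt *p = q->head;

				q->head = p->next;
				q->deficit -= p->len;
				xmit(p);               /* post to the NIC ring */
			}
			if (q->head == NULL)
				q->deficit = 0;   /* idle queues keep no credit */
		}
	}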

Remember that dummynet implements exactly #3, and it comes with a
set of pretty efficient schedulers (I have made extensive measurements
on them; see the links to papers on my research page,
http://info.iet.unipi.it/~luigi/research.html).
They are by no means a performance bottleneck (scheduling takes
50..200ns depending on the circumstances) in the cases where
it matters to have a scheduler, that is, when the sender is
faster than the NIC, which in turn only happens with large packets,
which take at least 1..30us each to get through.

--- IMPLEMENTATION ---

Apart from ALTQ (which is very slow, has inefficient schedulers,
and I don't think anybody wants to maintain), and with the exception
of dummynet, which I'll discuss later, at the moment FreeBSD does not
support schedulers in the tx path of the device driver.

So we can only deal with cases 1 and 2, and for them the software
queue + ring suffices to implement any QUEUE MANAGEMENT policy
(though at the moment we do not implement any).

If we want to support the generic case (#3), we should do the following:

1. device drivers export a function to transmit on an individual ring,
  basically the current if_transmit(), and a hook to play with the
  corresponding queue lock (the scheduler needs to run under a lock,
  and we may as well use the ring lock for that).
  Note that ether_output_frame() does not always need to
  call the scheduler: if a packet enters a non-empty queue, we are done.
  
2. device drivers also export the number of tx queues, and
  some (advisory) information on queue status

3. ether_output_frame() runs the classifier (if needed), invokes
  the scheduler (if needed) and possibly falls through into if_transmit()
  for the specific ring.

4. on transmit completions (*_txeof(), typically), a callback invokes
  the scheduler to feed the NIC ring with more packets.
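
As a sketch of the shape this driver-side interface could take (all
names are hypothetical, nothing like this exists in the tree today):

	struct ifnet;
	struct mbuf;

	struct if_ring_ops {
		/* step 1: transmit on one specific ring, plus its lock */
		int   (*ring_transmit)(struct ifnet *ifp, int ring,
			    struct mbuf *m);
		void  (*ring_lock)(struct ifnet *ifp, int ring);
		void  (*ring_unlock)(struct ifnet *ifp, int ring);
		/* step 2: how many rings, and advisory occupancy */
		int   (*ring_count)(struct ifnet *ifp);
		int   (*ring_avail)(struct ifnet *ifp, int ring);
	};

	/*
	 * step 4: called from *_txeof() when a ring drains, so the
	 * scheduler can push more packets (essentially the old if_tx_rdy()).
	 */
	void	sched_tx_ready(struct ifnet *ifp, int ring);

Step 3 is then just ether_output_frame() running the classifier,
running the scheduler under ring_lock(), and calling ring_transmit()
while ring_avail() says there is room.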

I mentioned dummynet: it already implements ALL of this,
including the completion callback in #4. There is a hook
in ether_output_frame(), and the hook was called (up to 8.0,
I believe) if_tx_rdy(). You can see what it does in
RELENG_4, sys/netinet/ip_dummynet.c :: if_tx_rdy()

http://svnweb.freebsd.org/base/stable/4/sys/netinet/ip_dummynet.c?revision=123994&view=markup

if_tx_rdy() does not exist anymore because almost nobody used it,
but it is trivial to reimplement, and can be called by device drivers
when *_txeof() finds that it is running low on packets _and_ the
specific NIC needs the "complete" scheduling (case #3).
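
The driver-side change is small; roughly (the softc layout, field names
and thresholds below are invented, only the shape matters):

	struct ifnet;

	struct foo_softc {
		struct ifnet *ifp;
		int           tx_free[8];   /* free descriptors per ring */
		int           tx_hiwat;     /* "ring is getting empty" mark */
	};

	/* hypothetical scheduler-side hooks, as sketched earlier */
	int   sched_has_backlog(struct ifnet *ifp, int ring);
	void  sched_tx_ready(struct ifnet *ifp, int ring);

	static void
	foo_txeof(struct foo_softc *sc, int ring)
	{
		/* ... reclaim completed descriptors, update tx_free[ring] ... */
		if (sc->tx_free[ring] >= sc->tx_hiwat &&
		    sched_has_backlog(sc->ifp, ring))
			sched_tx_ready(sc->ifp, ring);
	}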

The way it worked in dummynet (I think I used it on 'tun' and 'ed')
is also documented in the manpage:
define a pipe whose bandwidth is set as the device name instead
of a number. Then you can attach a scheduler to the pipe, queues
to the scheduler, and you are done.  Example:

    // this is the scheduler's configuration
	ipfw pipe 10 config bw 'em2' sched 
	ipfw sched 10 config type drr // deficit round robin
	ipfw queue 1 config weight 30 sched 10 // important
	ipfw queue 2 config weight 5 sched 10 // less important
	ipfw queue 3 config weight 1 sched 10 // who cares...

    // and this is the classifier, which you can skip if the
    // packets are already pre-classified.
    // The infrastructure is already there to implement per-interface
    // configurations.
	ipfw add queue 1 ip from any to any src-port 53
	ipfw add queue 2 ip from any to any src-port 22
	ipfw add queue 3 ip from any to any

Now, surely we can replace the implementation of packet queues in
dummynet, from the current TAILQ to something resembling buf_ring, to
improve write parallelism; a bit of glue code is needed to attach
per-interface ipfw instances to each interface, and some smarts are
needed in the configuration commands to figure out when we can
bypass everything or not.

But this seems to me a much more viable approach to achieve proper QoS
support in our architecture.

cheers
luigi



