Date: Sun, 30 Jul 2006 20:23:13 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
To: Sam Leffler
Cc: arch@freebsd.org, net@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
In-Reply-To: <44CCFC2C.20402@errno.com>
Message-ID: <20060730200929.J16341@fledge.watson.org>

On Sun, 30 Jul 2006, Sam Leffler wrote:

> I have a fair amount of experience with the linux model and it works ok.
> The main complication I've seen is that when a driver needs to process
> multiple queues of packets, things get more involved.  This is seen in
> 802.11 drivers, where there are two q's, one for data frames and one for
> management frames.  With the current scheme you have two separate queues,
> and the start method handles prioritization by polling the mgt q before
> the data q.  If instead the packet is passed to the start method, then it
> needs to be tagged in some way so that it's prioritized properly.
> Otherwise you end up with multiple start methods, one per type of packet.
> I suspect this will be ok, but the end result will be that we'll need to
> add a priority field to mbufs (unless we pass it as an arg to the start
> method).
>
> All this is certainly doable, but I think just replacing one mechanism
> with the other (as you specified) is insufficient.

Hmm.  This is something that I had overlooked.  I was loosely aware that
the if_sl code made use of multiple queues, but was under the impression
that the classification to queues occurred purely in the SLIP code.
Indeed, it does, but structurally, SLIP is split over the link layer
(if_output) and the driver layer (if_start), which I had forgotten.  I
take it from your comments that 802.11 also does this, which I was not
aware of.

I'm a little uncomfortable with our current m_tag model, as it requires a
significant number of additional allocations and frees for each packet, as
well as walking linked lists.  It's fine for occasional discretionary use
(i.e., MAC labels), but I worry about cases where it is used with every
packet, and we start seeing moderately non-zero numbers of tags on every
packet.  I think I would be more comfortable with an explicit queue
identifier argument to if_start, where the link layer and driver layer
agree on how to identify queues.  The two-queue pattern you describe is
sketched below.
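For concreteness, here is a minimal sketch of the two-queue start method
Sam describes, polling a private management queue ahead of if_snd.  The
names (xx_softc, sc_mgtq, xx_hwenqueue) are hypothetical stand-ins, not
code from any actual driver:

	#include <sys/param.h>
	#include <sys/mbuf.h>
	#include <sys/socket.h>
	#include <net/if.h>
	#include <net/if_var.h>

	/* Hypothetical softc with a private management-frame queue. */
	struct xx_softc {
		struct ifqueue	sc_mgtq;	/* management frames */
		/* ... hardware state ... */
	};

	/* Stub: hand a frame to the hardware. */
	static void	xx_hwenqueue(struct xx_softc *, struct mbuf *);

	/*
	 * The start method drains the management queue before the
	 * data queue, so management frames are prioritized.
	 */
	static void
	xx_start(struct ifnet *ifp)
	{
		struct xx_softc *sc = ifp->if_softc;
		struct mbuf *m;

		for (;;) {
			IF_DEQUEUE(&sc->sc_mgtq, m);	/* mgt q first */
			if (m == NULL) {
				IF_DEQUEUE(&ifp->if_snd, m); /* then data */
				if (m == NULL)
					break;
			}
			xx_hwenqueue(sc, m);
		}
	}

With a single start method per packet handed down, this polling step goes
away, which is exactly why the packet (or the call) then needs to carry
the queue identity.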
As a straw man, how would the following strike you:

	int if_startmbuf(struct ifnet *ifp, struct mbuf *m, int ifqid);

where for most link layers, the value would be zero, but for some link
layer/driver combinations, it would identify the specific queue to which
the link layer believes the mbuf should be assigned, if implemented?

>> Attached is a patch that maintains the current if_start, but adds
>> if_startmbuf.  If a device driver implements if_startmbuf and the
>> global sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf
>> path in the driver will be used.  Otherwise, if_start is used.  I have
>> modified the if_em driver to implement if_startmbuf also.  If there is
>> no packet backlog in the if_snd queue, it places the packet directly
>> in the transmit descriptor ring.  If there is a backlog, it uses the
>> if_snd queue protected by the driver mutex, rather than a separate ifq
>> mutex.
>>
>> In some basic local micro-benchmarks, I saw a 5% improvement in UDP
>> 0-byte payload PPS on UP, and a 10% improvement on SMP.  I saw a 1.7%
>> performance improvement in the bulk serving of 1k files over HTTP.
>> These are only micro-benchmarks, and reflect a configuration in which
>> the CPU is unable to keep up with the output rate of the 1gbps
>> ethernet card in the device, so reductions in host CPU usage are
>> immediately visible in increased output as the CPU is better able to
>> keep up with the network hardware.  Other configurations are also of
>> interest, especially ones in which the network device is unable to
>> keep up with the CPU, resulting in more queueing.
>>
>> Conceptual review, as well as benchmarking, etc., would be most
>> welcome.
>
> Why is the startmbuf knob global and not per-interface?  Seems like you
> want to convert drivers one at a time?

I may have under-described what I have implemented.  The decision is
currently made based on two factors: a global frob, and a per-interface
definition of if_startmbuf being non-zero.  The global frob is intended
to make it easy to benchmark the difference.  I should modify the patch
so that the global frob doesn't override the driver back to if_start in
the event that if_startmbuf is defined and if_start isn't.  The global
frob is intended to be removed in the long run, and I intend for us to
continue to support both the old and new start methods for the
foreseeable future, since I don't intend to update every device driver we
have to the new method, at least not personally :-).

> FWIW, the original model was driven by the expectation that you could
> raise the spl so the tx path was entirely synchronized from above.
> With the SMPng work we're synchronizing transfer through each control
> layer.  If the driver softc lock (or similar) were exposed to upper
> layers, we could possibly return to the "lock the tx path" model we had
> before and eliminate all the locking your changes target.  But that
> would be a big layering violation and would add significant contention
> in the SMP case.

In some ways, what I propose comes to much the same thing: the change I
propose basically delegates the queueing and synchronization decisions to
the device driver, which might choose to use the lock already in the ifq,
to use its own lock, or to use some other synchronization strategy.  In
the case of if_em, I've implemented a bypass of software queueing
entirely in the common case, but in the event that the hardware ring
backs up, we still fall back to the if_snd queue, only we lock it using
the device driver's transmit-path mutex.  The shape of that change is
sketched below.
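A rough sketch of that if_startmbuf shape, not the actual patch: the
helper names (xx_softc, xx_ring_has_room, xx_encap) are hypothetical, and
error handling is elided:

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/mutex.h>
	#include <sys/mbuf.h>
	#include <sys/socket.h>
	#include <net/if.h>
	#include <net/if_var.h>

	/* Hypothetical softc; tx_mtx covers the ring and the fallback. */
	struct xx_softc {
		struct mtx	tx_mtx;
		/* ... descriptor ring state ... */
	};

	/* Stubs: ring occupancy check and descriptor setup. */
	static int	xx_ring_has_room(struct xx_softc *);
	static void	xx_encap(struct xx_softc *, struct mbuf *);

	static int
	xx_startmbuf(struct ifnet *ifp, struct mbuf *m, int ifqid)
	{
		struct xx_softc *sc = ifp->if_softc;

		/*
		 * A single-queue NIC ignores ifqid; multi-queue
		 * hardware would use it to select a descriptor ring.
		 */
		(void)ifqid;

		mtx_lock(&sc->tx_mtx);
		if (_IF_QLEN(&ifp->if_snd) == 0 && xx_ring_has_room(sc)) {
			/*
			 * Common case: no software backlog, so place
			 * the packet directly in the transmit ring.
			 */
			xx_encap(sc, m);
		} else {
			/*
			 * Ring backed up: fall back to if_snd,
			 * protected by the driver's own tx mutex
			 * rather than a separate ifq mutex.
			 */
			_IF_ENQUEUE(&ifp->if_snd, m);
		}
		mtx_unlock(&sc->tx_mtx);
		return (0);
	}

Note that the explicit ifqid argument is what lets the link layer pass
queue identity down without an m_tag allocation per packet.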
Delegating the synchronization down the stack comes with risks, as device
driver writers will inevitably take liberties; on the other hand, it
appears that devices are quite diverse, and those liberties have
advantages.

> I think the key observation is that most network hardware today takes
> packets directly from private queues, so the fast path needs to push
> things down to those queues w/ minimal overhead.  This includes devices
> that implement QoS in h/w w/ multiple queues.

Yes -- however, you're right that the link layer needs to be able to pass
more information down.  I'd like it to be able to do so without an m_tag
allocation, though, which suggests (as you point out) an explicit
argument to if_startmbuf.

Thanks,

Robert N M Watson
Computer Laboratory
University of Cambridge