Date: Thu, 26 Oct 2006 15:15:35 +1000 (EST)
From: Bruce Evans
To: Doug Ambrisko
Cc: freebsd-net, Scott Long, John Polstra
Subject: Re: em network issues
Message-ID: <20061026134024.Y6316@delplex.bde.org>
In-Reply-To: <200610251818.k9PIIe7p062530@ambrisko.com>
References: <200610251818.k9PIIe7p062530@ambrisko.com>

On Wed, 25 Oct 2006, Doug Ambrisko wrote:

> John Polstra writes:
> | On 19-Oct-2006 Scott Long wrote:
> | > The performance measurements that Andre and I did early this year showed
> | > that the INTR_FAST handler provided a very large benefit.
> |
> | I'm trying to understand why that's the case.  Is it because an
> | INTR_FAST interrupt doesn't have to be masked and unmasked in the
> | APIC?  I can't see any other reason for much of a performance
> | difference in that driver.  With or without INTR_FAST, you've got
> | the bulk of the work being done in a background thread -- either the
> | ithread or the taskqueue thread.  It's not clear to me that it's any
> | cheaper to run a task than it is to run an ithread.
> | ...

The answer was given indirectly in a previous reply: its cost is
insignificantly different, but it gives more parallelism, so in some
configurations (mainly ones with more CPU than I/O) it gives more I/O
bandwidth.  ("it" == just INTR_FAST vs normal interrupts with an
identical interrupt handler.)

> Something that we've fixed locally in at least one version is:
> 1) Limit the loop in em_intr to 3 iterations.
> 2) Pass a valid value to em_process_receive_interrupts/em_rxeof,
>    a good value like 100 instead of -1, since this is the count of
>    how many times to iterate over the rx stuff.  Seems this got
>    lost in some change of APIs.
> 3) In em_process_receive_interrupts/em_rxeof, always decrement
>    the count on every run through the loop.  If you notice,
>    count is an int that starts at the passed-in value of -1.
>    It then does count-- until count==0.  Doing -1, -2, -3
>    takes a while until the int rolls over to 0.  Passing 100
>    limits it more :-)  So this can run 3 * 100 versus
>    infinite * int roll-over, assuming we don't skip a count--.
>
> Doing these changes made our multiple em-based machines a lot happier
> when slammed with traffic, without starving other things that shared
> interrupts, like other em cards (especially in 4.X).
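As a rough illustration of points 2 and 3 in the quoted list, here is a
minimal, self-contained sketch of that loop shape.  It is not the actual
em(4)/em_rxeof code; rx_desc_ready() and process_rx_desc() are made-up
stand-ins for the driver's descriptor test and per-descriptor work, and
the demo caps itself so it terminates:

#include <stdio.h>

/*
 * Made-up stand-ins for "is another rx descriptor ready?" and the
 * per-descriptor work; here they just simulate a constant flood of
 * received packets.
 */
static int
rx_desc_ready(void)
{
	return (1);
}

static void
process_rx_desc(void)
{
}

/*
 * Same loop shape as described in point 3: keep processing while
 * descriptors are ready and count has not reached 0, decrementing
 * count on every pass.
 */
static long
rxeof_sketch(int count)
{
	long passes = 0;

	while (count != 0 && rx_desc_ready()) {
		process_rx_desc();
		count--;
		passes++;
		if (passes >= 1000)	/* cap the demo so it terminates */
			break;
	}
	return (passes);
}

int
main(void)
{
	/*
	 * With count = 100 the loop stops after 100 passes.  With
	 * count = -1 it goes -1, -2, -3, ... and only reaches 0 again
	 * after the int rolls over, i.e. roughly 2^32 passes under
	 * constant traffic -- effectively unbounded, hence the demo
	 * cap above.
	 */
	printf("count=100: %ld passes\n", rxeof_sketch(100));
	printf("count=-1:  %ld passes (hit the demo cap)\n", rxeof_sketch(-1));
	return (0);
}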
This basically works in the opposite way, by giving less parallelism,
so in some configurations (mainly ones with less CPU than I/O) it
gives less I/O bandwidth and thus frees some CPU for doing something
other than I/O.

I don't understand why simple limits on the counts make much
difference.  Returning prematurely from the interrupt handler is
normally a bad idea, since unless you have disabled interrupts you get
another interrupt that returns you to the interrupt handler
immediately, and the only effect of returning immediately is the extra
cost of the return and reentry.  The hardware should refrain from
generating a new interrupt if you handle enough rx+tx descriptors, but
if you can't keep up then this won't help for more than one burst in
bursty traffic, and with bursty traffic that can be kept up with on
average, there will eventually be enough CPU for other things.

> Interrupt handlers
> should have limits on how long they should be able to run, then let
> someone else go.

SWI threads or tasks plus scheduling of tasks should give this
automatically in -current.  I think this is the right way to go once
you accept the large overhead of context switches to get to ithreads
(I don't accept this :-).  However, ithreads have fixed priorities,
and em's interrupt task has the highest priority that is actually
used, so the em task is rarely, if ever, preempted in -current.

> We use this in 6.X as well and haven't had any problems
> with our configs that use this.  We haven't tested much without these
> since we need to fix other issues and this is now a non-issue for us.
> ...

6.X has a fast interrupt handler for em, so it is quite different from
6.1.

> I haven't pushed this more since I first found issue 1 and the concept
> was rejected ... my machine hung in the interrupt spin loop :-(

~5.2 has 3 hard-coded calls to the rx+tx interrupt sub-handlers, but
this seems to be bogus.  em_intr() in -current (for the case where
DEVICE_POLLING is configured but not enabled) does the natural check
of a status register to determine whether it should keep calling the
sub-handlers, and has no limit on the number of calls except for the
status register.  This should only hang if there is always more em I/O
than CPU, or the status register is broken.  Without DEVICE_POLLING,
the reasons for the hang would be slightly different.

Bruce
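To illustrate the handler shape described above (rechecking a status
register rather than using a fixed iteration count), here is a minimal,
compilable sketch.  It is not the actual em_intr(); the register, its
CAUSE_RX/CAUSE_TX bits and the rx/tx passes are simulated stand-ins, not
the real e1000 cause register or driver code:

#include <stdint.h>
#include <stdio.h>

/*
 * Made-up cause bits and a simulated cause register standing in for
 * the hardware's interrupt cause register.
 */
#define	CAUSE_RX	0x01
#define	CAUSE_TX	0x02

static uint32_t pending = CAUSE_RX | CAUSE_TX;

/*
 * Reading the simulated register returns the pending causes and clears
 * them, loosely like a read-to-acknowledge cause register.
 */
static uint32_t
read_cause(void)
{
	uint32_t cause = pending;

	pending = 0;
	return (cause);
}

static void
rx_pass(void)
{
	printf("rx pass\n");
}

static void
tx_pass(void)
{
	printf("tx pass\n");
}

/*
 * Loop shape described above: keep calling the rx/tx sub-handlers only
 * while the status register still reports pending work; there is no
 * hard-coded iteration limit, and the loop ends as soon as the
 * register reads back zero.
 */
static void
intr_sketch(void)
{
	uint32_t cause;

	while ((cause = read_cause()) != 0) {
		if (cause & CAUSE_RX)
			rx_pass();
		if (cause & CAUSE_TX)
			tx_pass();
	}
}

int
main(void)
{
	intr_sketch();
	return (0);
}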