Date: Wed, 13 Nov 2002 16:50:06 -0800
From: Terry Lambert
To: David Gilbert
Cc: dolemite@wuli.nu, freebsd-hackers@freebsd.org
Subject: Re: [hackers] Re: Netgraph could be a router also.

David Gilbert wrote:
> Terry> The problem is that they don't tell me about where you are
> Terry> measuring your packets-per-second rate, or how it's being
> Terry> measured, or whether the interrupt or processing load is high
> Terry> enough to trigger livelock, or not, or the size of the packet.
> Terry> And is that a unidirectional or bidirectional rate?  UDP?
>
> Terry> I guess I could guess with 200kpps:
>
> Terry> 100mbit/s / 200kp/s = 500 bytes per packet
>
> Terry> ...and that's an absolute top end.  Somehow, I think the
> Terry> packets are smaller.  Bidirectionally, not FDX, we're talking
> Terry> 250 bytes per packet maximum theoretical throughput.
>
> Well... I have all those stats, but I wasn't wanting to type that
> much.  IIRC, we normally test with 80 byte packets ... they can be
> UDP or TCP ... we're testing the routing.  The box has two
> interfaces and we measure the number of PPS that get to the box on
> the other side.
>
> Without polling patches, the single processor box definitely
> experiences livelock.  Interestingly, the degree of livelock is
> fairly motherboard dependent.  We have tested many cards and so far
> fxp's are our best performers.

You keep saying this, and I keep finding it astounding.  8-).

My best performance has always been with Tigon III cards, which are
the only cards that seem to be able to keep up with the full Gigabit
rate with small packets.

It helps to know up to what layer you are pushing the data through
the kernel.  You seem to be saying that it's happening at Layer 3 --
the IP layer.  If that's the case, it makes things a bit easier.

The livelock is actually expected to be motherboard dependent; I
would personally also expect it to be RAM size dependent.  It's
possible to deal with it via latency reduction in at least two
places, I think (TLB overhead, and pool retention time overhead for
in-transit mbufs).

I still don't quite know about the zebra involvement, or whether it's
starving at that point, or not.
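As a sanity check on the arithmetic, here is a trivial
back-of-the-envelope calculator for the theoretical pps ceiling.
It's only a sketch: it assumes the "80 byte packet" is the whole IP
datagram and plain Ethernet framing (14 byte header, 4 byte FCS,
preamble and inter-frame gap); if your generator counts bytes
differently, adjust accordingly:

#include <stdio.h>

/*
 * Theoretical packets-per-second ceiling for a given link rate and
 * IP datagram size, assuming plain Ethernet framing: 14 byte header
 * plus 4 byte FCS per frame (64 byte minimum frame), plus 8 bytes of
 * preamble/SFD and 12 bytes of inter-frame gap on the wire.
 */
static double
max_pps(double bits_per_sec, unsigned ip_bytes)
{
	unsigned frame = ip_bytes + 14 + 4;	/* header + FCS */

	if (frame < 64)				/* pad to minimum frame */
		frame = 64;
	return (bits_per_sec / ((frame + 8 + 12) * 8));
}

int
main(void)
{
	printf("100Mb/s, 80 byte IP packets:   %8.0f pps\n",
	    max_pps(100e6, 80));
	printf("100Mb/s, minimum size packets: %8.0f pps\n",
	    max_pps(100e6, 46));
	printf("1Gb/s,   80 byte IP packets:   %8.0f pps\n",
	    max_pps(1e9, 80));
	return (0);
}

The minimum-frame case is the familiar ~148.8kpps wire-rate ceiling
for 100Mbit Ethernet; with 80 byte datagrams you top out somewhere
around 106kpps per direction on a 100Mbit link.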
> >> One of the largest problems we've found with GigE adapters on
> >> FreeBSD is that their pps ability (never mind the volume of data)
> >> is less than half that of the fxp driver.
>
> Terry> I've never found this to be the case, using the right
> Terry> hardware, and a combination of hard and soft interrupt
> Terry> coalescing.  You'd have to tell me what hardware you are
> Terry> using for me to be able to stare at the driver.  My personal
> Terry> hardware recommendation in this regard would be the Tigon
> Terry> III, assuming that the packet size was 1/3 to 1/6th the MTU,
> Terry> as you implied by your numbers.
>
> We were using the Intel, which apparently was a mistake.  We had a
> couple of others, too, but they were disappointing.  I can get their
> driver name later.

OK.  My personal experience with Intel is that the ability to support
a (relatively) larger number of VIPs is the nicest thing about the
card, but that it's less nice in almost every other measure.

> Terry> Personally, I would *NOT* use polling, particularly if you
> Terry> were using user space processing with Zebra, since any load
> Terry> at all would push you to the point of starving the user space
> Terry> process for CPU time; it's not really worth it (IMO) to do
> Terry> the work necessary to go to weighted fair share queueing for
> Terry> scheduling, if it came to that.
>
> The polling patches made zebra happy, actually.  Under livelock,
> zebra would stop sending bgp hello packets.  Under polling, we could
> pass the 150k+ packets and still have user time to run bgp.

The polling is a partial fix.  It deals with the lowest level of
livelock, but it should still be possible to livelock the thing;
there are basically three livelock boundaries in the standard FreeBSD
stack (or four, if you have interrupt threads).  They are:

1)	PCI bus monopolized by DMA transfers from the card

	- Card dependent; depends on whether the card writes over its
	  own ring without acknowledgement, or not.  If it does not,
	  then quenching interrupt processing also quenches DMA; if it
	  does, then you are screwed.  Looks like the fxp hardware
	  quenches.

2)	Interrupt processing overhead is so high that interrupt
	processing monopolizes the CPU

	- This is your 1 CPU case; the overhead goes up with data
	  rate, so this probably explains why the 100Mbit cards have
	  been kinder to you than the 1G cards.  The problem is that
	  if you spend all your time processing interrupts, you have
	  no time to run NETISR, or applications.  Polling fixes this
	  by stopping the interrupt processing until an explicit
	  restart, so NETISR has an opportunity to run.

3)	Interrupt processing *plus* NETISR processing monopolizes the
	CPU

	- This basically starves user space applications for CPU time.
	  This is a common case, even in the polling case, unless you
	  hack the scheduler.  Even so, you do not achieve optimum
	  loading, and you have to manually tune the "time off" ratio
	  for interrupts.  I expect that in addition to Luigi's
	  polling, you are also using his scheduler changes?

LRP can actually deal with all three of these; it deals with #1 the
same way polling does.  It deals with #2 by eliminating the NETISR
altogether, and it deals with #3 by providing the hooks needed to
make interrupt reenabling dependent on the depth of the queue to user
space.  It also helps by eliminating the normal quantum/2 latency
that's introduced into protocol processing by the NETISR code.

The patches I have, which you could maybe try, are an implementation
of non-rescon-based LRP for -current.
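To give you the flavor of the #3 fix without handing you a kernel
patch, here is a toy user space sketch of the queue-depth feedback
idea.  It is *not* the LRP code; every name and constant in it is
made up for illustration:

#include <stdio.h>

/*
 * Toy illustration (not kernel code) of queue-depth feedback:
 * receive processing is quenched when the queue toward user space
 * fills, and only re-enabled once the application has drained it
 * below a low-water mark, so an input flood cannot starve the
 * consumer.
 */
#define QUEUE_LIMIT	256	/* stop accepting input above this */
#define QUEUE_LOWAT	 64	/* re-enable once drained below this */

int
main(void)
{
	int queued = 0;		/* packets waiting for the application */
	int rx_enabled = 1;	/* whether we accept new packets */
	int consumed = 0, dropped = 0;
	int tick, i;

	for (tick = 0; tick < 10000; tick++) {
		/* Offered load: 20 packets per tick (an overload). */
		for (i = 0; i < 20; i++) {
			if (!rx_enabled) {
				dropped++;	/* card drops; host idle */
				continue;
			}
			if (++queued >= QUEUE_LIMIT)
				rx_enabled = 0;	/* quench receive side */
		}

		/* The application gets a fixed share of each tick. */
		for (i = 0; i < 8 && queued > 0; i++) {
			queued--;
			consumed++;
		}

		if (!rx_enabled && queued < QUEUE_LOWAT)
			rx_enabled = 1;		/* re-enable on drain */
	}
	printf("consumed %d, dropped %d, still queued %d\n",
	    consumed, dropped, queued);
	return (0);
}

The point is that the "application" keeps getting its share of every
tick no matter how hard the input floods, because re-enabling receive
processing is driven by how far the queue toward user space has
drained, rather than by a manually tuned "time off" ratio.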
> >> But we haven't tested every driver.  The Intel GigE cards were
> >> especially disappointing.
>
> Terry> Have you tried the Tigon III, with Bill Paul's driver?
>
> Terry> If so, did you include the polling patches that I made
> Terry> against the if_ti driver, and posted to -net, when you tested
> Terry> it?
>
> Terry> Do you have enough control over the load clients that you can
> Terry> ramp the load up until *just before* the performance starts
> Terry> to tank?  If so, what's the high point of the curve on the
> Terry> Gigabit, before it tanks (and it will)?
>
> We need new switches, actually, but we'll be testing this soon.

Without similar patches, you will probably find most 1G cards very
disappointing.  The problem is that you will hit livelock, and
polling is not supported for all cards.  My patches to the Tigon
driver were to add polling support for it.

> Terry> Frankly, I am not significantly impressed by the Click and
> Terry> other code.  If all you are doing is routing, and everything
> Terry> runs in a fixed amount of time at interrupt, it's fine, but
> Terry> it quickly gets less fine as you move away from that setup.
>
> Terry> If you are running Zebra, you really don't want Click.
>
> I've had that feeling.  A lot of people seem to be working on click,
> but it seems to abstract things that I don't see as needing
> abstracting.

Zebra's an application.  8-).  Source of the problem.

> Terry> If you can gather enough statistics to graph the drop-off
> Terry> curve, so it's possible to see why the problems you are
> Terry> seeing are happening, then I can probably provide you some
> Terry> patches that will increase performance for you.  It's
> Terry> important to know if you are livelocking, or if you are
> Terry> running out of mbufs, or if it's a latency issue you are
> Terry> facing, or if we are talking about context switch overhead,
> Terry> instead, etc..
>
> We're definitely livelocking with the fxps.  I'd be interested in
> your patches for the GigE drivers.

The if_ti patches to add polling support are:

http://docs.freebsd.org/cgi/getmsg.cgi?fetch=407328+0+archive/2002/freebsd-net/20021013.freebsd-net

The patches I'm really interested in having you try, though, are the
ones for LRP support in FreeBSD -current.  If you have a testing
setup that can benchmark them, then you can prove them out relative
to the current code.  If you can't measure a difference, though, then
there's really no need to pursue them.

-- 
Terry
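P.S.: Here's a toy sketch of what I mean by finding the high point of
the drop-off curve from a load ramp.  The (offered, delivered) pairs
in it are made up, not measurements; substitute whatever your load
clients actually report:

#include <stdio.h>

/*
 * Toy sketch of locating the "high point of the curve" from a load
 * ramp: feed in (offered pps, delivered pps) pairs and report the
 * peak, and the point past which throughput falls off.  The sample
 * numbers below are invented for illustration.
 */
struct sample {
	double offered;		/* generator rate, packets/sec */
	double delivered;	/* rate seen on the far side, packets/sec */
};

int
main(void)
{
	static const struct sample run[] = {
		{  50000,  50000 }, { 100000,  99000 }, { 150000, 146000 },
		{ 200000, 171000 }, { 250000, 120000 }, { 300000,  60000 },
	};
	int n = sizeof(run) / sizeof(run[0]);
	int i, peak = 0;

	for (i = 1; i < n; i++)
		if (run[i].delivered > run[peak].delivered)
			peak = i;

	printf("peak: %.0f pps delivered at %.0f pps offered\n",
	    run[peak].delivered, run[peak].offered);
	if (peak + 1 < n)
		printf("tanks past %.0f pps offered\n", run[peak].offered);
	return (0);
}

The interesting numbers are the peak, and the first offered load
where delivery falls off; that's the region where it's worth working
out whether you're livelocking, out of mbufs, or something else
entirely.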