Date:      Wed, 10 Oct 2001 00:06:04 -0600
From:      "Kenneth D. Merry" <ken@kdm.org>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        current@FreeBSD.ORG
Subject:   Re: Why do soft interrupt coelescing?
Message-ID:  <20011010000604.A19388@panzer.kdm.org>
In-Reply-To: <3BC34FC2.6AF8C872@mindspring.com>; from tlambert2@mindspring.com on Tue, Oct 09, 2001 at 12:28:02PM -0700
References:  <3BBF5E49.65AF9D8E@mindspring.com> <20011006144418.A6779@panzer.kdm.org> <3BC00ABC.20ECAAD8@mindspring.com> <20011008231046.A10472@panzer.kdm.org> <3BC34FC2.6AF8C872@mindspring.com>

On Tue, Oct 09, 2001 at 12:28:02 -0700, Terry Lambert wrote:
> "Kenneth D. Merry" wrote:
> [ ... soft interrupt coalescing ... ]
> 
> > As you say above, this is actually a good thing.  I don't see how this ties
> > into the patch to introduce some sort of interrupt coalescing into the
> > ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
> > on the board to do what you want.
> 
> I have tweaked all tunables on the board, and I have not gotten
> anywhere near the increased performance.
> 
> The limit on how far you can push this is based on how much
> RAM you can have on the card, and the limits to coalescing.
> 
> Here's the reason: when you receive packets on the board, they
> get DMA'ed into the ring.  No matter how large the ring is, it
> won't matter if the ring is not being emptied asynchronously
> relative to it being filled.
> 
> In the case of a full-on receiver livelock situation, the ring
> contents will be continuously overwritten.  This is actually
> what happens when you put a ti card into a machine with a
> slower processor, and hit it hard.

eh?  The card won't write past the point that has been acked by the kernel.
If the kernel hasn't acked the packets and one of the receive rings fills
up, the card will hold off on sending packets up to the kernel.
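The hold-off behavior described here is standard producer/consumer ring discipline: the card (producer) never advances past the host's ack (consumer) index.  A minimal sketch, with illustrative names (`rx_prod`, `rx_cons`, `RING_SIZE`) that are not the actual Tigon firmware variables:

```c
/* Sketch of a receive ring where the card holds off rather than
 * overwriting un-acked descriptors.  Indices are illustrative. */
#include <assert.h>

#define RING_SIZE 64            /* power of two, as on real rings */

static unsigned rx_prod;        /* card's producer index */
static unsigned rx_cons;        /* host's consumer (ack) index */

/* Card side: place one packet, or return 0 (hold off / drop) if
 * advancing the producer would catch the consumer. */
static int ring_produce(void)
{
    unsigned next = (rx_prod + 1) % RING_SIZE;
    if (next == rx_cons)
        return 0;               /* ring full: don't spam the ring */
    rx_prod = next;
    return 1;
}

/* Host side: ack one packet, freeing a slot for the card. */
static int ring_consume(void)
{
    if (rx_cons == rx_prod)
        return 0;               /* ring empty */
    rx_cons = (rx_cons + 1) % RING_SIZE;
    return 1;
}
```

One slot is deliberately kept empty so that "full" and "empty" are distinguishable without a separate count.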

> In the case of interrupt processing, where you jam the data up
> through ether input at interrupt time, the buffer will be able
> to potentially overrun, as well.  Admittedly, you can spend a
> huge percentage of your CPU time in interrupt processing, and
> if your CPU is fast enough, unload the queue very quickly.
> 
> But if you then look at doing this for multiple gigabit cards
> at the same time, you quickly reach the limits... and you
> spend so much of your time in interrupt processing, that you
> spend none running NETISR.

I agree that you can end up spending large portions of your time doing
interrupt processing, but I haven't seen instances of "buffer overrun", at
least not in the kernel.  The case where you'll see a "buffer overrun", at
least with the ti(4) driver, is when you have a sender that's faster than
the receiver.

So the receiver can't process the data in time and the card just drops
packets.

That's a different situation from the card spamming the receive ring over
and over again, which is what you're describing.  I've never seen that
happen, and if it does actually happen, I'd be interested in seeing
evidence.

> So you have moved your livelock up one layer.
> 
> 
> In any case, doing the coalescing on the board delays the
> packet processing until that number of packets has been
> received, or a timer expires.  The timer latency must be
> increased proportionally to the maximum number of packets
> that you coalesce into a single interrupt.
> 
> In other words, you do not interleave your I/O when you
> do this, and the bursty conditions that result in your
> coalescing window ending up full or close to full are
> the conditions under which you should be attempting the
> maximum concurrency you can possibly attain.
> 
> Basically, in any case where the load is high enough to
> trigger the hardware coalescing, the ring would need to
> be the next power of two larger to ensure that the end
> does not overwrite the beginning of the ring.
> 
> In practice, the firmware on the card does not support
> this, so what you do instead is push through a couple of
> packets that may have been corrupted by DMA occurring after
> the fact -- in other words, you drop packets.
> 
> This is arguably "correct", in that it permits you to shed
> load, _but_ the DMAs still occur into your rings; it would
> be much better if the load were shed by the card firmware,
> based on some knowledge of ring depth instead (RED Queueing),
> since this would leave the bus clear for other traffic (e.g.
> communication with main memory to provide network content for
> the cards for, e.g., an NFS server).
> 
> Without hacking firmware, the best you can do is to ensure
> that you process as much of all the traffic as you possibly
> can, and that means avoiding livelock.

Uhh, the Tigon firmware *does* drop packets when there is no more room in
the proper receive ring on the host side.  It doesn't spam things.

What gives you that idea?  You've really got some strange ideas about what
goes on with that board.  Why would someone design firmware so obviously
broken?

> [ ... LRP ... ]
> 
> > That sounds cool, but I still don't see how this ties into the patch you
> > sent out.
> 
> OK.  LRP removes NETISR entirely.
> 
> This is the approach Van Jacobson stated he used in his
> mythical TCP/IP stack, which we may never see.
> 
> What this does is push the stack processing down to the
> interrupt time for the hardware interrupt.  This is a
> good idea, in that it avoids the livelock for the NETISR
> never running because you are too busy taking hardware
> interrupts to be able to do any stack processing.
> 
> The way this ties into the patch is that doing the stack
> processing at interrupt time increases the per-packet
> ether input processing overhead.
> 
> What this means is that you get more benefit from the soft
> interrupt coalescing than you otherwise would get, when
> you are doing LRP.
> 
> But, you do get *some* benefit from doing it anyway, even
> if your ether input processing is light: so long as it is
> non-zero, you get benefit.
> 
> Note that LRP itself is not a panacea for livelock, since
> it just moves the scheduling problem from the IRQ<->NETISR
> scheduling into the NETISR<->process scheduling.  You end
> up needing to implement weighted fair share or other code
> to ensure that the user space process is permitted to run,
> so you end up monitoring queue depth or something else,
> and deciding not to reenable interrupts on the card until
> you hit a low water mark, indicating processing has taken
> place (see the papers by Druschel et. al. and Floyd et. al.).

It sounds like it might be better handled with better scheduling in
combination with interrupt coalescing in the hardware.

Is burdening the hardware interrupt handler even more the best solution?
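The low-water-mark scheme described above (don't re-enable the card's interrupt until the backlog has drained, so user processes get CPU time) can be sketched as a budgeted interrupt handler.  All names here (`BUDGET`, `rx_pending`, `rx_intr`) are illustrative, not ti(4) or LRP code:

```c
/* Sketch of soft interrupt coalescing with a low-water mark for
 * re-enabling hardware interrupts (anti-livelock).  Illustrative only. */
#include <assert.h>

#define BUDGET    16   /* max packets drained per hard interrupt */
#define LOW_WATER  4   /* re-enable interrupts only below this depth */

static int rx_pending;          /* packets waiting in the ring */
static int intr_enabled = 1;

/* Interrupt handler: drain at most BUDGET packets, then decide
 * whether to re-enable the hardware interrupt or keep polling. */
static int rx_intr(void)
{
    int done = 0;

    while (rx_pending > 0 && done < BUDGET) {
        rx_pending--;           /* "process" one packet */
        done++;
    }
    /* Only re-enable once the backlog is nearly drained; otherwise
     * stay in polled mode so scheduled work can make progress. */
    intr_enabled = (rx_pending < LOW_WATER);
    return done;
}
```

The budget bounds time spent at interrupt level per invocation, which is the scheduling half of the trade-off being debated here.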

> > > > It isn't terribly clear what you're doing in the patch, since it isn't a
> > > > context diff.
> > >
> > > It's a "cvs diff" output.  You could always check out a sys
> > > tree, apply it, and then cvs diff -c (or -u or whatever your
> > > favorite option is) to get a diff more to your tastes.
> > 
> > As Peter Wemm pointed out, we can't use non-context diffs safely without
> > the exact time, date and branch of the source files.  This introduces an
> > additional burden for no real reason other than you neglected to use -c or
> > -u with cvs diff.
> 
> I was chewed on before for context diffs.  As I said before,
> I can provide them, if that's the current coin of the realm; it
> doesn't matter to me.

That should be the default mode -- as Peter pointed out, non-context diffs
will work far less often and are far more difficult to decipher than
context diffs.

> [ ... jumbogram autonegotiation ... ]
> 
> > > I believe it was the implementation of the length field.  I
> > > would have to get more information from the person who did
> > > the interoperability testing for the autonegotiation (which
> > > failed between the Tigon II and the Intel Gigabit cards).  I
> > > can assure you anecdotally, however, that autonegotiation
> > > _did_ fail.
> > 
> > I would believe that autonegotiation (i.e. 10/100/1000) might fail,
> > especially if you're using 1000BaseT Tigon II boards.  However, I
> > would like more details on the failure.  It's entirely possible
> > that it could be fixed in the firmware, probably without too much
> > trouble.
> 
> Possibly.
> 
> The problem I have is that you simply can't use jumbograms
> in a commercial product, if they can't be autonegotiated,
> or you will burn all your profit in technical support calls
> very quickly.

The MTU isn't something that gets negotiated -- it's something that is set
per-network-segment.

Properly functioning routers will either fragment packets (that don't have
the DF bit set) or send back ICMP responses telling the sender their packet
should be smaller.
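That per-segment rule reduces to a simple decision at each forwarding hop: forward, fragment, or signal the sender.  A sketch, with illustrative constants:

```c
/* Sketch of the per-hop MTU decision described above: a router
 * forwards a packet that fits, fragments one that doesn't (if DF is
 * clear), or sends ICMP "fragmentation needed" (if DF is set). */
#include <assert.h>

#define MTU_JUMBO 9000          /* illustrative jumbo-frame segment */
#define MTU_STD   1500          /* standard Ethernet segment */

enum fwd_action { FWD_FORWARD, FWD_FRAGMENT, FWD_ICMP_TOOBIG };

/* Decide what happens to a packet of 'len' bytes leaving on a link
 * with the given MTU; 'df' is the Don't Fragment bit. */
static enum fwd_action forward(int len, int mtu, int df)
{
    if (len <= mtu)
        return FWD_FORWARD;
    return df ? FWD_ICMP_TOOBIG : FWD_FRAGMENT;
}
```

The ICMP case is what lets path MTU discovery work without any end-to-end "negotiation" of jumbo frames.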

> > I find it somewhat hard to believe that Intel would ship a gigabit
> > board that didn't interoperate with the board that until recently
> > was probably the predominant gigabit board out there.
> 
> Intel can autonegotiate with several manufacturers, and the
> Tigon III.  It can interoperate with the Tigon II, if you
> statically configure it for jumbograms.

This sounds screwed up; you don't "autonegotiate" jumbo frames, at least on
the NIC/ethernet level.  You set the MTU to whatever the maximum is for
that network, and then you let your routing equipment handle the rest.

If you're complaining that the Intel board won't accept jumbo frames from
the Alteon board by default (i.e. without setting the MTU to 9000) that's
no surprise at all.  Why should a board accept a large packet when its MTU
is set to something smaller?

I don't think there's any requirement that a card accept packets larger
than its configured MTU.

> A big problem with jumbograms is that there are a number of
> cards with an 8k limit before they can't offload checksum
> processing.

That's not a problem with jumbo frames, but rather the implementation.
You have boards with 8K DMA FIFOs (where they do the checksumming), and
sure you're going to have that problem.  It isn't an issue with jumbo
frames in and of themselves.

> Another interesting thing is that it is often a much better
> idea to negotiate an 8k MTU for jumbograms.  The reason for
> this is that it fits evenly into 4 mbuf clusters.

Yeah, that's nice.  FreeBSD's TCP stack will send 8KB payloads
automatically if your MTU is configured for something more than that.

That comes in handy for zero copy implementations.
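The arithmetic behind the 8k-MTU point: with FreeBSD's 2 KB mbuf clusters (MCLBYTES), an 8192-byte payload fills exactly four clusters with no slack, while a 9000-byte jumbogram needs a fifth cluster and wastes most of it.  A sketch:

```c
/* Sketch of why an 8 KB MTU packs cleanly into mbuf clusters while
 * a 9000-byte jumbogram does not.  MCLBYTES matches FreeBSD's
 * standard 2 KB cluster size; the helpers are illustrative. */
#include <assert.h>

#define MCLBYTES 2048           /* standard mbuf cluster size */

/* Clusters needed to hold 'payload' bytes (round up). */
static int clusters_needed(int payload)
{
    return (payload + MCLBYTES - 1) / MCLBYTES;
}

/* Bytes of cluster space left unused by 'payload'. */
static int bytes_wasted(int payload)
{
    return clusters_needed(payload) * MCLBYTES - payload;
}
```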

> There are actually some good arguments in there for having
> non-fixed-sized mbufs...

Perhaps, but not that many.

Ken
-- 
Kenneth Merry
ken@kdm.org

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message
