Date:      Wed, 10 Oct 2001 23:20:21 -0600
From:      "Kenneth D. Merry" <ken@kdm.org>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        current@FreeBSD.ORG
Subject:   Re: Why do soft interrupt coelescing?
Message-ID:  <20011010232020.A27019@panzer.kdm.org>
In-Reply-To: <3BC40E04.D89ECB05@mindspring.com>; from tlambert2@mindspring.com on Wed, Oct 10, 2001 at 01:59:48AM -0700
References:  <3BBF5E49.65AF9D8E@mindspring.com> <20011006144418.A6779@panzer.kdm.org> <3BC00ABC.20ECAAD8@mindspring.com> <20011008231046.A10472@panzer.kdm.org> <3BC34FC2.6AF8C872@mindspring.com> <20011010000604.A19388@panzer.kdm.org> <3BC40E04.D89ECB05@mindspring.com>

On Wed, Oct 10, 2001 at 01:59:48 -0700, Terry Lambert wrote:
> "Kenneth D. Merry" wrote:
> > eh?  The card won't write past the point that has been acked by the kernel.
> > If the kernel hasn't acked the packets and one of the receive rings fills
> > up, the card will hold off on sending packets up to the kernel.
> 
> Uh, eh?
> 
> You mean the card will hold off on DMA and interrupts?  This
> has not been my experience.  Is this with firmware other than
> the default and/or the 4.3-RELEASE FreeBSD driver?

If the receive ring for that packet size is full, it will hold off on
DMAs.  If all receive rings are full, there's no reason to send more
interrupts.

Keep in mind that there are three receive rings by default on the Tigon
boards -- mini, standard and jumbo.  The size of the buffers in each ring
is configurable, but basically all packets smaller than a certain size will
get routed into the mini ring.  All packets larger than a certain size will
get routed into the jumbo ring.  All packets in between will get routed
into the standard ring.

If there isn't enough space in the mini or jumbo rings for a packet, it'll
get routed into the standard ring if there is space there.  (In the case of
a jumbo packet, it may take up multiple buffers on the ring.)

Anyway, if all three rings fill up, then yes, there won't be a reason to
send receive interrupts.
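
Roughly, the selection amounts to something like this (just a sketch; the
thresholds and names are made up for illustration, not the actual firmware
symbols):

#define MINI_RING_MAX_LEN	 256	/* illustrative cutoff only */
#define STD_RING_MAX_LEN	1536	/* illustrative cutoff only */

enum rx_ring_id { RING_MINI, RING_STD, RING_JUMBO };

static enum rx_ring_id
select_rx_ring(int pkt_len)
{
	if (pkt_len <= MINI_RING_MAX_LEN)
		return (RING_MINI);
	if (pkt_len > STD_RING_MAX_LEN)
		return (RING_JUMBO);
	return (RING_STD);
}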

> > I agree that you can end up spending large portions of your time doing
> > interrupt processing, but I haven't seen instances of "buffer overrun", at
> > least not in the kernel.  The case where you'll see a "buffer overrun", at
> > least with the ti(4) driver, is when you have a sender that's faster than
> > the receiver.
> > 
> > So the receiver can't process the data in time and the card just drops
> > packets.
> 
> OK, assuming you meant that the copies would stall, and the
> data not be copied (which is technically the right thing to
> do, assuming a source quench style livelock avoidance, which
> doesn't currently exist)...

The data isn't copied, it's DMAed from the card to host memory.  The card
will save incoming packets to a point, but once it runs out of memory to
store them it starts dropping packets altogether.

> The problem is still that you end up doing interrupt processing
> until you run out of mbufs, and then you have the problem of
> not being able to transmit responses, for lack of mbufs.

In theory you would have configured your system with enough mbufs to handle
the situation, and the slowness of the system would cause the windows on
the sender to fill up, so they'll stop sending data until the receiver
starts responding again.  That's the whole purpose of backoff and slow
start -- to find a happy medium for the transmitter and receiver so that
data flows at a constant rate.

> In the ti driver case, the inability to get another mbuf to
> replace the one that will be taken out of the ring means that
> the mbuf gets reused for more data -- NOT that the data flow
> in the form of DMA from the card ends up being halted until
> mbufs become available.

True.

> The real problem here is that most received packets want a
> response; for things like web servers, where the average
> request is ~.5k and the average response is ~11k, this means
> that you would need to establish use-based watermarking, to
> separate the mbuf pool into transmit and receive resources;
> in practice, this doesn't really work, if you are getting
> your content from a separate server (e.g. an NFS server that
> provides content for a web farm, etc.).
> 
> 
> > That's a different situation from the card spamming the receive
> > ring over and over again, which is what you're describing.  I've
> > never seen that happen, and if it does actually happen, I'd be
> > interested in seeing evidence.
> 
> Please look at what happens in the case of an allocation
> failure, for any driver that does not allow shrinking the
> ring of receive mbufs (the ti is one example).

It doesn't spam things, which is what you were suggesting before, but
as you pointed out, it will effectively drop packets if it can't get new
mbufs.

Yes, it could shrink the pool by simply not replacing those mbufs in the
ring (and therefore not notifying the card that the slot is available
again).  But then it would likely need some mechanism for being notified
when another buffer becomes available, so it could start allocating
receive buffers again.
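
One way to handle the "tell me when buffers are available" part without any
new support from the card would be to just retry from a timeout.  A rough
sketch (the softc and refill function here are hypothetical, and timeout(9)
is only one possible mechanism):

#include <sys/param.h>
#include <sys/systm.h>		/* timeout(9) */
#include <sys/kernel.h>		/* hz */

struct my_softc;					/* stand-in for the driver softc */
extern int my_refill_rx_ring(struct my_softc *);	/* hypothetical refill routine */

static void
rx_refill_retry(void *arg)
{
	struct my_softc *sc = arg;

	/* Keep retrying until mbufs can be allocated again. */
	if (my_refill_rx_ring(sc) != 0)
		timeout(rx_refill_retry, sc, hz / 10);
}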

In practice I haven't found the number of mbufs in the system to be a
problem that would really make this a big issue.  I generally configure the
number of mbufs to be high enough that it isn't a problem in the first
place.

> > > Without hacking firmware, the best you can do is to ensure
> > > that you process as much of all the traffic as you possibly
> > > can, and that means avoiding livelock.
> > 
> > Uhh, the Tigon firmware *does* drop packets when there is no
> > more room in the proper receive ring on the host side.  It
> > doesn't spam things.
> > 
> > What gives you that idea?  You've really got some strange ideas
> > about what goes on with that board.  Why would someone design
> > firmware so obviously broken?
> 
> The driver does it on purpose, by not giving away the mbuf
> in the receive ring, until it has an mbuf to replace it.

The driver does effectively drop packets, but it doesn't spam over another
packet that has already gone up the stack.  

> Maybe this should be rewritten to not mark the packet as
> received, and thus allow the card to overwrite it.

That wouldn't really work, since the card knows it has DMAed into that
slot, and won't DMA another packet in its place.  The current approach is
the equivalent, really.  The driver tells the card the packet is received,
and if it can't allocate another mbuf to replace that mbuf, it just puts
that mbuf back in place.  So the card will end up overwriting that packet.
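
In code, the refill path looks roughly like this (a sketch only -- the ring
structure and function name are made up, but the mbuf calls are the
standard ones):

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/mbuf.h>

struct rx_ring {			/* hypothetical ring bookkeeping */
	struct mbuf	*slot[512];
};

static int
rx_refill_slot(struct rx_ring *ring, int idx, struct mbuf *m_old)
{
	struct mbuf *m_new;

	/* Try to allocate a replacement cluster mbuf. */
	MGETHDR(m_new, M_DONTWAIT, MT_DATA);
	if (m_new != NULL) {
		MCLGET(m_new, M_DONTWAIT);
		if ((m_new->m_flags & M_EXT) == 0) {
			m_freem(m_new);
			m_new = NULL;
		}
	}
	if (m_new == NULL) {
		/*
		 * Allocation failed: recycle the old mbuf so the slot is
		 * never left empty.  The packet it held is effectively
		 * dropped, since the card will DMA over it again.
		 */
		m_new = m_old;
		m_new->m_data = m_new->m_ext.ext_buf;
	}
	m_new->m_len = m_new->m_pkthdr.len = MCLBYTES;
	ring->slot[idx] = m_new;
	return (m_new == m_old ? ENOBUFS : 0);
}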

> There
> are two problems with that approach however.  The first is
> what happens when you reach mbuf exhaustion, and the only
> way you can clear out received mbufs is to process the data
> in a user space application which never gets to run, and when
> it does get to run, can't write a response for a request and
> response protocol, such that it can't free up any mbufs?  The
> second is that, in the face of a denial of service attack,
> the correct approach (according to Van Jacobson) is to do a
> random drop, and rely on the fact that the attack packets,
> being proportionally more of the queue contents, get dropped
> with a higher probability... so while you _can_ do this, it
> is really a bad idea, if you are trying to make your stack
> robust against attacks.
> 
> The other thing that you appear to be missing is that the
> most common case is that you have plenty of mbufs, and you
> keep getting interrupts, replacing the mbufs in the receive
> ring, and pushing the data into the ether input by giving
> away the full mbufs.
> 
> The problem occurs when you are receiving at such a high rate
> that you don't have any free cycles to run NETISR, and thus
> you can not empty the queue from which ipintr is called with
> data.
> 
> In other words, it's not really the card's fault that the OS
> didn't run the stack at hardware interrupt.

I haven't seen this in my tests, but I'm sure it's possible.

> > > What this means is that you get more benefit in the soft
> > > interrupt coalescing than you otherwise would get, when
> > > you are doing LRP.
> > >
> > > But, you do get *some* benefit from doing it anyway, even
> > > if your ether input processing is light: so long as it is
> > > non-zero, you get benefit.
> > >
> > > Note that LRP itself is not a panacea for livelock, since
> > > it just moves the scheduling problem from the IRQ<->NETISR
> > > scheduling into the NETISR<->process scheduling.  You end
> > > up needing to implement weighted fair share or other code
> > > to ensure that the user space process is permitted to run,
> > > so you end up monitoring queue depth or something else,
> > > and deciding not to reenable interrupts on the card until
> > > you hit a low water mark, indicating processing has taken
> > > place (see the papers by Druschel et al. and Floyd et al.).
> > 
> > It sounds like it might be better handled with better scheduling in
> > combination with interrupt coalescing in the hardware.
> 
> Hardware interrupt coalescing stalls your processing until
> the coalescing window becomes full, or the coalescing idle
> timer fires.
> 
> Really, the best way to handle it is polling, but when you
> do that, there is nothing left over for useful work, other
> than the work of moving packets (e.g. it's OK for L2 or L4
> switching, but can't handle L7 switching, and can't handle
> user processes not involved with actual networking).  I've
> seen this same hardware, with rewritten firmware, do about
> 100,000 connections a second, with polling, but that number
> at that point is no longer a figure of merit, since in
> order to actually have a product, you have to be able to
> get real work done (Alteon quotes ~128,000/second for their
> most recent product offering).

Quite impressive. :)

> > Is burdening the hardware interrupt handler even more the
> > best solution?
> 
> I don't really understand the question: you have to do the
> work sometime, in response to an external, real-world event.
> Polling reduces latency in handling the event closer to when
> it actually occurs, but means that you can only do a limited
> set of work.  Handling the event in response to a signal
> that it has taken place (i.e. an interrupt) is the best
> approach to making it so that you can do other work until
> your attention is required.  Polling multiple times until
> there is no more work (e.g. coalescing the work into a
> single interrupt) is as close as you'll get to pure polling
> without actually doing pure polling.
> 
> I don't really see why doing as much work as possible by
> directly coupling to the "work to do" event -- the interrupt --
> is being considered a bad thing.  It goes against all the
> high speed networking research of the last decade, so with
> no contrary evidence, it's really hard to support that view.

The main thing I would see is that when an interrupt handler takes a long
time to complete, it's going to hold off other devices sharing that
interrupt.  (Or interrupt mask, perhaps.)

This may have changed in -current with interrupt threads, though.

> > > I was chewed on before for context diffs.  As I said before,
> > > I can provide them, if that's the current coin of the realm; it
> > > doesn't matter to me.
> > 
> > That should be the default mode -- as Peter pointed out, non-context diffs
> > will work far less often and are far more difficult to decipher than
> > context diffs.
> 
> Is this a request for me to resend the diffs?

Yes.

> > > The problem I have is that you simply can't use jumbograms
> > > in a commercial product, if they can't be autonegotiated,
> > > or you will burn all your profit in technical support calls
> > > very quickly.
> > 
> > The MTU isn't something that gets negotiated -- it's something that is set
> > per-network-segment.
> > 
> > Properly functioning routers will either fragment packets (that don't have
> > the DF bit set) or send back ICMP responses telling the sender their packet
> > should be smaller.
> 
> Sure.  So you set the DF bit, and then start with honking big
> packets, sending progressively smaller ones, until you get
> a response.

Generally the ICMP response tells you how big the maximum MTU is, so you
don't have to guess.
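
(That's RFC 1191 path MTU discovery: the "fragmentation needed" ICMP
message carries the next-hop MTU.  Just to show where the number lives, a
userland-style sketch, assuming the usual <netinet/ip_icmp.h> definitions:)

#include <sys/types.h>
#include <netinet/in_systm.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/ip_icmp.h>
#include <arpa/inet.h>		/* ntohs() */

/* Return the next-hop MTU from a "fragmentation needed" ICMP message,
 * or 0 if the message is something else. */
static u_short
needfrag_mtu(const struct icmp *icp)
{
	if (icp->icmp_type == ICMP_UNREACH &&
	    icp->icmp_code == ICMP_UNREACH_NEEDFRAG)
		return (ntohs(icp->icmp_nextmtu));
	return (0);
}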

> > > Intel can autonegotiate with several manufacturers, and the
> > > Tigon III.  It can interoperate with the Tigon II, if you
> > > statically configure it for jumbograms.
> > 
> > This sounds screwed up; you don't "autonegotiate" jumbo frames, at least on
> > the NIC/ethernet level.  You set the MTU to whatever the maximum is for
> > that network, and then you let your routing equipment handle the rest.
> > 
> > If you're complaining that the Intel board won't accept jumbo frames from
> > the Alteon board by default (i.e. without setting the MTU to 9000) that's
> > no surprise at all.  Why should a board accept a large packet when its MTU
> > is set to something smaller?
> 
> It's the Alteon board that craps out.  Intel to Intel works
> fine, when you "just start using them" on one end.

IMO, the Alteon board is behaving correctly.  If the MTU is set to, say,
1500, it is an error to send larger packets.

> > I don't think there's any requirement that a card accept packets larger
> > than its configured MTU.
> 
> You're right that it's not a requirement: it's just a customer
> expectation that your hardware will try to operate in the most
> efficient mode possible, and all other modes will be fallback
> positions (which you will then complain about in your logs).

It's pretty much impossible, though, to autonegotiate MTU.  MSS is easy --
that's a function of the protocol.  For instance:

Joe User has two of Product X from Company Y.

Joe User has a gigabit switch that only does 1500 byte packets.

Company Y configured Product X to talk jumbo frames by default.

Joe User doesn't know about jumbo frames, since that isn't one of the
checkbox items on the install for Product X.  Joe User configures the two
Product X boxes to talk to each other.

The two Product X boxes use TCP connections between each other, and happily
negotiate an MSS of 8960 or so.  They start sending data packets, but
nothing gets through.
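
(That 8960 is just the 9000-byte MTU minus 20 bytes of IP header and 20
bytes of TCP header.)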

How would Product X detect this situation?  Most switches I've seen don't
send back ICMP packets to tell the sender to change his route MTU; they
just drop the packets.  In that situation, though, you can't tell the
difference between the other end being down, the cable getting pulled,
the switch getting powered off, or the MTU on the switch being too small.

It's a lot easier to just have the user configure the MTU.

So, what if you did try to back off and retransmit at progressively smaller
sizes?  That won't work in all cases.  If you're the receiver, and the
sender isn't one of your boxes, you have no way of knowing whether the
sender is down or what, and you have no way of guessing that his problem is
that the switch doesn't support the large MSS you've negotiated.  There's
also no way for you to back off, since you're not the one transmitting the
data, and your acks get through just fine.

> > > Another interesting thing is that it is often a much better
> > > idea to negotiate an 8k MTU for jumbograms.  The reason for
> > > this is that it fits evenly into 4 mbuf clusters.
> > 
> > Yeah, that's nice.  FreeBSD's TCP stack will send 8KB payloads
> > automatically if your MTU is configured for something more than that.
> > 
> > That comes in handy for zero copy implementations.
> 
> Ugh.  I was unaware that FreeBSD's stack would not honor a 9k MTU.

It will receive packets that size, but won't send them because it's less
efficient that way.  Why bother, anyway?  Windows boxes will send packets
that are 9K minus the header length, but that's generally less efficient
from a buffer management standpoint.
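
The rounding is roughly this (a sketch of the idea, modeled loosely on what
tcp_mss() does when the MSS is bigger than an mbuf cluster; don't take it
as the literal stack code):

#include <sys/param.h>		/* MCLBYTES, 2048 in the usual configuration */

static int
round_mss_to_clusters(int mss)
{
	/*
	 * A 9000-byte MTU gives an MSS of 8960; rounding down to a
	 * multiple of the cluster size gives 8192, i.e. 8KB payloads.
	 */
	if (mss > MCLBYTES)
		mss = (mss / MCLBYTES) * MCLBYTES;
	return (mss);
}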

> > > There are actually some good arguments in there for having
> > > non-fixed-sized mbufs...
> > 
> > Perhaps, but not that many.
> 
> 1)	mbuf cluster headers waste most of their space.
> 
> 2)	max MTU sized mbufs have no wasted space.
> 
> 3)	(obsolete, sadly) the tcptempl structure is 60
> 	bytes, not 256, but uses a full mbuf, for a full
> 	> 2/3rds space wastage.
> 
> 4)	9k packets

They're harder to deal with, though, if they're done on a global scale.

The way large mbufs are done with the zero copy version of the ti(4) driver
is with a page allocator and chains of mbufs.

On i386 (4K pages), three pages plus a header mbuf are allocated and put
into an extended jumbo receive BD.
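
(Three 4K pages come to 12288 bytes of buffer space, comfortably more than
a full jumbo frame.)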

I'd really be more worried about performance than memory waste nowadays,
though.  If you're wasting a little bit of space in an mbuf on a 4GB
machine, who cares?  I'd suggest going with larger mbufs if anything; that
way you can minimize list traversal.  mbuf size, at least, is tweakable, so
individual users can tune it for their typical packet size.

Ken
-- 
Kenneth Merry
ken@kdm.org
