Date:      Mon, 15 Oct 2001 11:35:51 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        "Kenneth D. Merry" <ken@kdm.org>
Cc:        current@FreeBSD.ORG
Subject:   Re: Why do soft interrupt coelescing?
Message-ID:  <3BCB2C87.BEF543A0@mindspring.com>
References:  <3BBF5E49.65AF9D8E@mindspring.com> <20011006144418.A6779@panzer.kdm.org> <3BC00ABC.20ECAAD8@mindspring.com> <20011008231046.A10472@panzer.kdm.org> <3BC34FC2.6AF8C872@mindspring.com> <20011010000604.A19388@panzer.kdm.org> <3BC40E04.D89ECB05@mindspring.com> <20011010232020.A27019@panzer.kdm.org> <3BC55201.EC273414@mindspring.com> <20011015002407.A59917@panzer.kdm.org>

"Kenneth D. Merry" wrote:
> Dropping packets before they get into card memory would only be possible
> with some sort of traffic shaper/dropping mechanism on the wire to drop
> things before they get to the card at all.

Actually, DEC had a congestion control mechanism that worked by
marking all packets above a certain level of congestion (this
was sometimes called the "DECbit" approach).  You can do the
same thing with any intermediate hop router, so long as it is
better at moving packets than your destination host.

It turns out that even if the intermediate hop and the
destination host are running the same hardware and OS, the cost
of the terminal (endpoint) processing is higher than the cost of
the forwarding, so you can use the tagging to tell the terminal
hop which flows to drop packets from before processing.

Cisco routers can do this (using the CMU firmware) on a per
flow basis, leaving policy up to the end node.  Very neat.
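
In rough code, the idea looks something like this (my sketch of
the general scheme, not DEC's or Cisco's actual implementation;
the names and the 0.9/0.1 averaging weights are made up).  The
bottleneck only *marks*; the back-off or drop policy stays with
the end node:

/*
 * Sketch of DECbit-style marking at a bottleneck router.
 */
struct pkt {
        int     congestion_bit;         /* set by routers, read by endpoints */
        /* ... headers, payload ... */
};

static double avg_qlen;                 /* running average of queue length */

static void
router_mark(struct pkt *p, int cur_qlen)
{
        /* cheap running average; the weights here are arbitrary */
        avg_qlen = 0.9 * avg_qlen + 0.1 * (double)cur_qlen;

        /* DECbit marked once the average queue exceeded about one packet */
        if (avg_qlen >= 1.0)
                p->congestion_bit = 1;

        /* ... enqueue p for transmission as usual ... */
}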


[ ... per connection overhead, and overcommit ... ]

> You could always just put 4G of RAM in the machine, since memory is so
> cheap now. :)
> 
> At some point you'll hit a limit in the number of connections the processor
> can actually handle.

In practice, particularly for HTTP or FTP flows, you can
halve the amount of memory expected to be used.  This is
because the vast majority of the data is generally pushed
in only one direction.

For HTTP 1.1 persistent connections, you can, for the most
part, also assume that the connections are bursty -- that
is, that there is a human attached to the other end, who
will spend some time examining the content before making
another request (you can assume the same thing for 1.0, but
that doesn't count against the persistent connection count,
unless you also include time spent in TIME_WAIT).

So overcommit turns out to be O.K. -- which is what I was
trying to say in a back-handed way, in the last post.

If you include window control (i.e. you care about overall
throughput, and not about individual connections), then
you can safely service 1,000,000 connections with 4G on a
FreeBSD box.
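
To put rough numbers on that (my arithmetic, not a measurement):
4G divided by 1,000,000 connections is a bit over 4K per
connection.  The tcpcb, inpcb, and socket eat some fraction of
that, which leaves maybe 2K-3K per connection for buffering --
workable only if you clamp the windows down to around that size
and count on most of the connections being idle at any given
instant, which is exactly the overcommit argument above.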


> > This is actually very bad: you want to drop packets before you
> > insert them into the queue, rather than after they are in the
> > queue.  This is because you want the probability of the drop
> > (assuming the queue is not maxed out: otherwise, the probability
> > should be 100%) to be proportional to the exponential moving
> > average of the queue depth, after that depth exceeds a drop
> > threshold.  In other words, you want to use RED.
> 
> Which queue?  The packets are dropped before they get to ether_input().

The easy answer is "any queue", since what you are really
concerned with is pool retention time: you want to throw away
packets before a queue overflow condition means you are taking
in more than you can actually process.


> Dropping random packets would be difficult.

The "R" in "RED" is "Random" for "Random Early Detection" (or
"Random Early Drop", for a minority of the literature), true.

But the randomness involved is whether you drop vs. insert a
given packet, not whether you drop a random packet from the
queue contents.  Really dropping random queue elements would
be incredibly bad.

The problem is that, during an attack, the number of packets
you get is proportionally huge, compared to the non-attack
packets (the ones you want to let through).  A RED approach
prevents new packets from being enqueued with a probability
that grows with congestion: it protects the host system's
ability to continue degraded processing by making the link
appear "lossy" -- the closer the queue is to full, the more
lossy the link.

If you were to drop random packets already in the queue,
then the proportional probability of dumping good packets is
equal to the queue depth times the number of bad packets
divided by the number of total packets.  In other words, a
continuous attack will almost certainly push all good packets
out of the queue before they reach the head.

Dropping packets prior to insertion maintains the ratio of
bad and good packets, so it doesn't inflate the danger to
the good packets by the relative queue depth: thus dropping
before insertion is a significantly better strategy than
dropping after insertion, for any queue depth over 1.
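
To make the distinction concrete, here's the shape of the
decision (a sketch only; the thresholds, weight, and max
probability are illustrative rather than tuned, and it leaves
out Floyd/Jacobson refinements like counting packets since the
last drop):

/*
 * Minimal RED early-drop sketch.  Returns 1 if the arriving
 * packet should be dropped *before* it is enqueued, 0 if it
 * should be inserted.
 */
#include <stdlib.h>

#define RED_MIN_TH      32              /* start dropping above this avg depth */
#define RED_MAX_TH      128             /* drop everything above this avg depth */
#define RED_MAX_P       0.10            /* drop probability at RED_MAX_TH */
#define RED_WEIGHT      0.002           /* EMA weight for the queue depth */

static double red_avg;                  /* exponential moving average of depth */

static int
red_drop(int cur_depth)
{
        double p;

        /* update the moving average on every arrival */
        red_avg = (1.0 - RED_WEIGHT) * red_avg + RED_WEIGHT * (double)cur_depth;

        if (red_avg < RED_MIN_TH)
                return (0);             /* short queue: always insert */
        if (red_avg >= RED_MAX_TH)
                return (1);             /* long queue: always drop */

        /* drop probability ramps linearly between the two thresholds */
        p = RED_MAX_P * (red_avg - RED_MIN_TH) / (RED_MAX_TH - RED_MIN_TH);
        return (drand48() < p ? 1 : 0);
}

Note that the random draw only ever decides the fate of the
arriving packet; nothing already on the queue is touched.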

> > Maybe I'm being harsh in calling it "spam'ming".  It does the
> > wrong thing, by dropping the oldest unprocessed packets first.
> > A FIFO drop is absolutely the wrong thing to do in an attack
> > or overload case, when you want to shed load.  I consider the
> > packet that is being dropped to have been "spam'med" by the
> > card replacing it with another packet, rather than dropping
> > the replacement packet instead.
> >
> > The real place for this drop is "before it gets to card memory",
> > not "after it is in host memory"; Floyd, Jacobsen, Mogul, etc.,
> > all agree on that.
> 
> As I mentioned above, how would you do that without some sort of traffic
> shaper on the wire?

The easiest answer is to RED queue in the card firmware.


> My focus with gigabit ethernet was to get maximal throughput out of a small
> number of connections.  Dealing with a large number of connections is a
> different problem, and I'm sure it uncovers lots of nifty bugs.

8-).  I guess that you are more interested in intermediate hops
and site to site VPN, while I'm more interested in connection
termination (big servers, SSL termination, and single client VPN).


> > I'd actually prefer to avoid the other DMA; I'd also like
> > to avoid the packet receipt order change that results from
> > DMA'ing over the previous contents, in the case that an mbuf
> > can't be allocated.  I'd rather just let good packets in with
> > a low (but non-zero) statistical probability, relative to a
> > slew of bad packets, rather than letting a lot of bad packets
> > from a persistent attacker push my good data out with the bad.
> 
> Easier said than done -- dumping random packets would be difficult with a
> ring-based structure.  Probably what you'd have to do is have an extra pool
> of mbufs lying around that would get thrown in at random times when mbufs
> run out to allow some packets to get through.
> 
> The problem is, once you exhaust that pool, you're back to the same old
> problem if you're completely out of mbufs.
> 
> You could probably also start shrinking the number of buffers in the ring,
> but as I said before, you'd need a method for the kernel to notify the
> driver that more mbufs are available.

You'd be better off shrinking the window size across all
the connections, I think.

As to it being difficult to do: I actually have RED queue code,
which I adapted from the formula in a paper.  I have no problem
giving that code out.

The real issue is that the BSD queue macros involved in the
queues really need to be modified to include an "elements on
queue" count for the calculation of the moving average.


> > If you are sharing interrupts at this network load, then
> > you are doing the wrong thing in your system setup.  If
> > you don't have such a high load, then it's not a problem;
> > either way, you can avoid the problem, for the most part.
> 
> It's easier with a SMP system (i.e. you've got an APIC.)

Virtual wire mode is actually bad.  It's better to do asymmetric
interrupt handling, if you have a number of identical or even
similar interrupt sources (e.g. two gigabit cards in a box).


> > OK, I will rediff and generate context diffs; expect them to
> > be sent in 24 hours or so from now.
> 
> It's been longer than that...

Sorry; I've been doing a lot this weekend.  I will redo them
at work today, and resend them tonight... definitely.


> > > Generally the ICMP response tells you how big the maximum MTU is, so you
> > > don't have to guess.
> >
> > Maybe it's the ICMP response; I still haven't had a chance to
> > hold Michael down and drag the information out of him.  8-).
> 
> Maybe what's the ICMP response?

The difference between working and not working.


> > Cisco boxes detect "black hole" routes; I'd have to read the
> > white paper, rather than just its abstract, to tell you how,
> > though...
> 
> It depends on what they're trying to do with the information.  If they're
> just trying to route around a problem, that's one thing.  If they're trying
> to diagnose MTU problems, that's quite another.
> 
> In general, it's going to be pretty easy for routers to detect when a
> packet exceeds the MTU for one of their interfaces and send back a ICMP
> packet.

A "black hole" route doesn't ICMP back, either because some
idiot has blocked ICMP, or because it's just too dumb...
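
For reference, the check the routers are doing is roughly this
(illustrative names only, not the actual forwarding path).  The
"black hole" case is when the drop happens but the ICMP either
isn't generated or gets filtered on the way back:

#define ICMP_UNREACH            3       /* destination unreachable */
#define ICMP_UNREACH_NEEDFRAG   4       /* fragmentation needed and DF set */

/*
 * Returns 0 to forward as-is, 1 to fragment and forward, -1 to drop.
 * Per RFC 1191, the drop case is supposed to generate an ICMP
 * "fragmentation needed" error carrying out_mtu, so the sender can
 * shrink its path MTU estimate.
 */
static int
forward_check_mtu(int pkt_len, int df_set, int out_mtu)
{
        if (pkt_len <= out_mtu)
                return (0);             /* fits the outgoing link */
        if (!df_set)
                return (1);             /* fragment it and move on */

        /*
         * Too big and DF set: drop, and send ICMP type 3 code 4
         * (ICMP_UNREACH / ICMP_UNREACH_NEEDFRAG) back to the source
         * with out_mtu.  If that ICMP never makes it back -- filtered,
         * or never sent -- the sender sees a black hole.
         */
        return (-1);
}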

> > Not for the user.
> 
> Configuring the MTU is a standard part of configuring IP networks.
> If your users aren't smart enough to do it, you'll pretty much have
> to default to 1500 bytes for ethernet.

Or figure out how to negotiate higher...

> You can let the more clueful users increase the MTU.

That doesn't improve performance in the default case, so
"default configuration" benchmarks like "Polygraph" really
suffer as a result.


> If you're supplying enough of the equipment, you can make some assumptions
> about the equipment setup.  This was the case with my former employer -- in
> many cases we supplied the switch as well as the machines to go onto the
> network, so we knew ahead of time that jumbo frames would work.  Otherwise,
> we'd work with the customer to set things up with standard or jumbo frames
> depending on their network architecture.

This approach only works if you're Cisco or another big iron
vendor, in an established niche.

[ ... more on MTU negotiation for jumbograms ... ]

> > In any case, Intel cards appear to do it, and so do Tigon III's.
> 
> That's nice, but there's no reason a card has to accept packets with a
> higher MTU than it is configured for.

True, but then there's no reason I have to buy the cards
that choose to not do this.  8-).

-- Terry
