Date:      Tue, 01 Jan 2002 05:22:59 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        "Louis A. Mamakos" <louie@TransSys.COM>
Cc:        Matthew Dillon <dillon@apollo.backplane.com>, Julian Elischer <julian@elischer.org>, Mike Silbersack <silby@silby.com>, Josef Karthauser <joe@tao.org.uk>, Tomas Svensson <tsn@gbdev.net>, freebsd-hackers@FreeBSD.ORG
Subject:   Re: FreeBSD performing worse than Linux?
Message-ID:  <3C31B833.FF98C07@mindspring.com>
References:  <Pine.BSF.4.21.0112311225150.94344-100000@InterJet.elischer.org> <200112312327.fBVNRt719835@whizzo.transsys.com> <200201010043.g010h0i36281@apollo.backplane.com> <3C311AC9.99B5FC9C@mindspring.com> <200201010246.g012ko721041@whizzo.transsys.com>

"Louis A. Mamakos" wrote:
> 
> Disabling Nagle's algorithm for no good reason has very poor
> scaling behavior.   This is what happens when TCP_NODELAY is
> enabled on a socket.

Disabling Nagle's algorithm for a good reason would still
result in the observed failure, however.


> If you look at the work function for most network elements, the part
> that runs out of gas first is per-packet forwarding performance. Sure,
> you need to have adequate bus bandwidth to move stuff through a box,
> but it's performing per-packet forwarding operations and policy which
> is the resource that's most difficult to make more of. I think this is
> true for toy routers based on PC platform as well as high-end boxes like
> the Cisco 12000 series. Juniper managed adequate forwarding performance
> using specialized ASIC implementations in the forwarding path.  Of this
> statement, I'm sure; in my day job at UUNET, I talk to all the major
> backbone router vendors, and forwarding performance (and also
> reasonable routing protocol implementations) is a show-stopper
> requirement they labor mightily over.

PCI is sufficient to keep a Gigabit interface saturated, even
without going to jumbograms.  I have personally saturated such
an interface.

PCI-X will scale to 8 Gigabits.
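
Back-of-the-envelope numbers, assuming a 64-bit/66MHz PCI slot
and 64-bit/133MHz PCI-X (those particular widths/clocks are my
assumption; plain 32-bit/33MHz PCI at ~133MB/s would be marginal):

    64-bit/66MHz  PCI:    8 bytes * 66.67M/s  ~=  533MB/s ~= 4.3Gbit/s
    Gigabit Ethernet:     1Gbit/s             ~=  125MB/s
    64-bit/133MHz PCI-X:  8 bytes * 133.3M/s  ~= 1066MB/s ~= 8.5Gbit/s

Those are theoretical bus maxima, and arbitration plus
per-transaction overhead eat into them, but there is comfortable
headroom over the 125MB/s a Gigabit interface needs.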


> So here we have a mechanism with wonderful properties - it's a
> trivial yet clever implementation of a self tuning mechanism to
> prevent tinygrams from being generated by a TCP without all manner
> of complicated timers.  It gives great performance on LANs and other
> high-speed interconnects where remote echo type applications are
> demanding, yet over long delay paths where remote echo is gonna suck
> no matter what you do, it automatically aggregates packets.

As a bandwidth provider, UUNET is more concerned with aggregate
throughput; this means it cares more about the moving average
of packets getting through than it does about *my* packets
getting through.  When it comes to a conflict of interest, you
will understand my preferences are for my own interests... 8-).


> Nagle's algorithm and Van Jacobson's slow-start algorithm allowed the
> Internet to survive over congested paths.  And they did so with
> a bunch of self-tuning behavior independent of the bandwidth*delay
> product of the path the connection was running over.  It was and is
> amazing stuff.

Yes, I'm well aware of bandwidth delay product calculations;
it's the primary mechanism behind the rate halving algorithm
I keep pointing to (Hoe, Jacobson).  8-).  It's also the
primary limitation on connection speed (remember that it was
my FreeBSD machine that was able to get to 1.6M connections,
with standard sockets).
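
For reference, the window needed to keep a path full is just the
bandwidth*delay product; a quick illustrative calculation (the
link speed and RTT here are made up for the example):

    window = bandwidth * RTT
           = 125MB/s (1Gbit/s) * 0.070s RTT
          ~= 8.75MB

which is also why a default 64KB window, without RFC 1323 window
scaling, cannot fill a fast long-haul path.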


> Likewise, the original problem in this thread is likely caused by some
> part of the USB Ethernet implementation having inadequate per-packet
> resources. It's probably not about the number of bytes, but the number of
> transactions.  You see here a modern reimplementation of essentially the same
> problem that the 3COM 3C501 ISA ethernet card had 15 years ago - back to
> back packets were consistently dropped because of the poor per-packet
> buffering implementation.  It was absolutely repeatable.

Clearly.  I think that's well established, since no one has
squawked about the FreeBSD USB driver or the PC USB hardware
being slower than the dongle USB hardware...

> Sure, it's "legal" to generate streams of tinygrams and not use Nagle's
> algorithm to aggregate the sender's traffic, but it's just plain rude
> and on low bandwidth links, it sucks because of all the extra 40 byte
> headers you're carrying around.

I understand this.  But Nagle is not the only mechanism which
would fix the problem.

Given that it's *intentionally* possible *and permitted* to turn
Nagle off (via TCP_NODELAY), it makes sense to look at another
mechanism that is not susceptible to being turned off.
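
For reference, disabling it is a one-liner available to any
application; a minimal sketch, assuming "s" is an already
connected TCP socket:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Disable Nagle on an already connected TCP socket, so
     * segments go out as soon as data is written, however small. */
    static int
    disable_nagle(int s)
    {
            int on = 1;

            if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on,
                sizeof(on)) < 0) {
                    perror("setsockopt(TCP_NODELAY)");
                    return (-1);
            }
            return (0);
    }

Any mechanism that relies on applications politely leaving Nagle
alone is therefore only advisory.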


> I'm sure TCP_NODELAY got added because it sounds REALLY C00L to make
> the interactive thing go better.  But clearly people don't understand
> the impact of turning on the cleverly named option and how it probably
> doesn't really improve things.

I'm pretty sure it got added to address interactive response
on intrinsically small packets; telnet was probably the number
one reason, but other small-request protocols over TCP, such
as non-pipelined HTTP requests, SMTP server responses, and FTP
control channel traffic, also benefit significantly from
turning Nagle off.  Nagle almost suggests it in the original
paper.

Since OpenSSH bloats the payload considerably, turning Nagle
off is much less necessary for performance than it is with
Telnet, though you will likely see occasional burstiness when
the pending data hits least common multiples of the MTU.

In this case, it would probably make sense to increase the
decay rate of the timer based on the amount of data that is
pending, relative to the MTU: defeat Nagle entirely when the
pipe is full, and partially defeat it down to, say, the pipe
being half full.  That way, large streams not ending on MTU
boundaries would not suffer under Nagle.
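
To make that concrete, a hand-waving sketch only; this is not how
tcp_output() is structured, and the function and parameter names
are invented for illustration:

    #include <stddef.h>

    /*
     * Hypothetical: scale the Nagle hold-off down as the amount of
     * pending sub-MTU data grows.  With a full MTU pending, Nagle
     * is defeated entirely; between MTU/2 and MTU the hold-off
     * decays linearly; below MTU/2, standard behaviour.
     */
    long
    nagle_holdoff(size_t pending, size_t mtu, long base_timeout)
    {
            if (pending >= mtu)
                    return (0);             /* pipe full: send now */
            if (pending >= mtu / 2)         /* partially defeat */
                    return (base_timeout * (long)(mtu - pending) /
                        (long)(mtu / 2));
            return (base_timeout);          /* small write: hold */
    }

Note that classical Nagle has no timer at all (it holds the
tinygram until the outstanding data is ACK'd), so the timer being
decayed here is part of the proposal, not of the existing
implementation.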

Remember also that Matt recently shot Reno for performance
reasons in comparisons against Linux, when he should probably
have simply cranked the initial window size to 3 (Jacobson)
and added piggy-back ACKs (Mogul).  I'm sure the shooting is
temporary, and Reno will end up back on once the performance
issue is addressed correctly; but realize that there is heavy
pressure, in the form of benchmarks, where speed matters more
than strict correctness.  In fact, until Windows 2000 (which
has a stack ported from BSDI that cost them ~$3M in fees),
the Microsoft stacks routinely violated the RFCs on connection
closing, using RSTs to avoid the FIN_WAIT_2 issue while
potentially leaving the peer stuck forever (since RSTs are
not retransmitted); you still have to set a registry entry to
get correct behaviour, even today.

I think with much faster links becoming common (802.11e is
5GHz over the air [Apple trademark "GigaWire"]), we will see
the MSL drop significantly; we already have problems with
sequence number recycling, given the random jump forward made
for "security" purposes to avoid session take-over -- even
though we all know the end-to-end doctrine means that security
is not correctly handled if it is implemented at that layer.
2MSL is already incredibly huge compared to the cycle time of
32-bit sequence numbers on a 1Gbit/s link.
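
The arithmetic is stark (taking the RFC 793 MSL of two minutes):

    sequence space:  2^32 bytes    =  4,294,967,296 bytes
    1Gbit/s line:    125,000,000 bytes/s
    wrap time:       2^32 / 125M  ~=  34 seconds

    2MSL             = 2 * 120s    =  240 seconds

so the sequence space can wrap roughly seven times within a
single 2MSL period; this is exactly what the PAWS timestamp
extension in RFC 1323 was invented to protect against.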

I really like the self-clocking in the rate halving algorithm,
but I guess that's pretty obvious by now.  8-).
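
For anyone who hasn't looked at it: the core idea is to keep the
ACK clock running during loss recovery by sending one new segment
for every two ACKs received, halving the window smoothly over one
RTT instead of stalling.  A rough sketch, illustrative only (the
structure and helper names are invented):

    /* Illustrative only; names are invented for the sketch. */
    struct conn {
            int     acks_since_xmit;  /* ACKs since last new segment */
    };

    void    tcp_xmit_next(struct conn *);  /* hypothetical helper */

    /*
     * During recovery, transmit one new segment per two incoming
     * ACKs: the effective window halves over one RTT while the
     * transmissions stay ACK-clocked.
     */
    void
    ack_during_recovery(struct conn *cp)
    {
            if (++cp->acks_since_xmit >= 2) {
                    cp->acks_since_xmit = 0;
                    tcp_xmit_next(cp);
            }
    }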

-- Terry
