Date:      Fri, 01 Feb 2002 03:52:29 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Luigi Rizzo <rizzo@icir.org>
Cc:        Mike Silbersack <silby@silby.com>, Storms of Perfection <gary@outloud.org>, thierry@herbelot.com, replicator@ngs.ru, hackers@FreeBSD.org
Subject:   Re: Clock Granularity (kernel option HZ)
Message-ID:  <3C5A817D.11A5117A@mindspring.com>
References:  <20020131172729.X38382-100000@patrocles.silby.com> <3C59E873.4E8A82B5@mindspring.com> <20020201002339.C48439@iguana.icir.org>

Luigi Rizzo wrote:
> On Thu, Jan 31, 2002 at 04:59:31PM -0800, Terry Lambert wrote:
> > You will get a factor of 6 (approximately) improvement in
> > throughput vs. overhead if you process packets to completion
> > at interrupt, and process writes to completion at write time
> > from the process.
> 
> this does not match my numbers. e.g. using "fastforwarding"
> (which bypasses netisrs's) improves peak throughput
> by a factor between 1.2 and 2 on our test boxes.

This isn't the same thing; you are measuring a mix of something
that is affected and something that isn't.  I'm measuring pool
retention time in the HW intr to NETISR queue transfer.

I'm talking about the latency in generating the SYN and the
ACK, on one side, and the SYN-ACK on the other, when going
all the way to a user space application.

Basically, most of the latency in a TCP connection is in the
latency of waiting for the NETISR to process the packets from
the receive queue through the stack, and then the context
switch to the user space process.

The improvement is in the throughput vs. the overhead -- the
amount of time you wait for the NETISR to run is on average
half the time between runs, which is HZ dependent.
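
To put rough numbers on that: assuming, as the paragraph above
implies, that the interval between NETISR runs is one clock tick
(1/HZ), the average wait is half of that, i.e. 1/(2*HZ) seconds.
A minimal userland sketch of the arithmetic (illustrative only,
not kernel code):

    #include <stdio.h>

    /*
     * Average queue-to-NETISR wait, assuming the softint runs once
     * per clock tick, so a packet sits in the queue for half a tick
     * on average.  Illustrative arithmetic only.
     */
    int
    main(void)
    {
            int hz[] = { 100, 1000, 10000 };
            int i;

            for (i = 0; i < 3; i++)
                    printf("HZ=%5d  avg wait = %8.1f us\n",
                        hz[i], 1e6 / (2.0 * hz[i]));
            return (0);
    }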

The "between 1.2 and 2" is what you'd expect for the packet
processing alone.  But for an application like a web server
with 1K of static content, where there is a connection, an
accept, the request (client write), the server read, the
server write, and then the client read, and then the FIN/FIN
ACK/ACK, then you'd expect 1.5 x 2 for both ends = 3; if you
could do the write path as well, you could expect 6 (you can't
really do the write path, because it's process driven).
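
Spelled out as a calculation (the 1.5x per-end figure is just a
representative value from the 1.2-2 range quoted above; all the
numbers here are illustrative):

    #include <stdio.h>

    /*
     * The factor-of-3 / factor-of-6 arithmetic from the paragraph
     * above; the 1.5x per-end receive-path gain is a representative
     * value from the quoted 1.2-2 range, not a measurement.
     */
    int
    main(void)
    {
            double per_end_rx = 1.5;              /* receive path at interrupt */
            double both_ends = per_end_rx * 2.0;  /* client and server */
            double with_write = both_ends * 2.0;  /* if the write path could too */

            printf("receive path, both ends: %.1fx\n", both_ends);
            printf("plus the write path:     %.1fx\n", with_write);
            return (0);
    }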

I was thinking about this with an FTP or SMTP server, where
you could piggyback the request data on the ACK for the SYN-ACK
from the client to the server, but it's not incredibly practical.

Like I said, this isn't a useful improvement in any case, unless
you are running yourself out of memory, and you are much more
likely to be doing that in the socket buffers.  It's not going to
increase your overall throughput in anything but the single client
case, or the connect-and-drop connections-per-second microbenchmark.

I haven't set up equipment to test the connections per second
rate on gigabit using the SYN cache.  I know that by processing
the incoming SYN to completion (all the way through the stack,
without a cache) at interrupt, it goes from ~7,000 per second
on a Tigon II to ~22,000 per second (and 28,000 on a Tigon III).

I rather expect the SYN cache to eat up any measurable gains
that you could have gotten by upping the HZ -- again, unless
you are running out of memory.

If you wanted to get around 400,000 connections per second, I
think I could get you there with some additional hack-foolery,
but of course it's not really a useful metric, IMO.  Total
number of simultaneous connections is much more useful, in the
long run, since that's what arbitrates your real load limits.

If you wanted to hack that number higher, that's pretty easy,
too.

One way would be the approach suggested on the -arch list a
while back, taken even further: turn the SYN cache into a
connection cache, and don't fully instantiate it even after the
ACK, until you get the first data.
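
As a sketch of the idea (the structures and names below are
invented for illustration; they are not the real FreeBSD syncache
or socket structures):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <netinet/in.h>

    /*
     * Hypothetical connection-cache entry: just enough state to
     * finish the handshake.  The full socket/inpcb/tcpcb footprint
     * is only paid when the first data segment arrives.  Sizes and
     * fields are invented for illustration.
     */
    struct conn_cache_entry {
            struct in_addr  cc_faddr;       /* peer address */
            uint16_t        cc_fport;       /* peer port */
            uint16_t        cc_lport;       /* local port */
            uint32_t        cc_iss;         /* our initial send sequence */
            uint32_t        cc_irs;         /* peer's initial sequence */
            uint16_t        cc_wnd;         /* advertised window */
            uint8_t         cc_flags;       /* handshake state bits */
    };

    struct full_socket {                    /* stand-in for socket+inpcb+tcpcb */
            char            opaque[700];    /* arbitrary size */
    };

    /* Called only when the first data segment shows up. */
    static struct full_socket *
    promote_to_socket(const struct conn_cache_entry *cc)
    {
            struct full_socket *so = calloc(1, sizeof(*so));

            (void)cc;       /* would copy handshake state into so here */
            return (so);
    }

    int
    main(void)
    {
            struct conn_cache_entry cc = { 0 };
            struct full_socket *so = promote_to_socket(&cc);

            printf("cache entry %zu bytes vs. full socket %zu bytes\n",
                sizeof(cc), sizeof(*so));
            free(so);
            return (0);
    }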

Another way is that there are a lot of elements in the socket
structure that are never used simultaneously, and could be
reduced via union.
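
A toy illustration of the union point (the field names are
invented, not the real struct socket members): fields that are
only meaningful in mutually exclusive states can share storage.

    #include <stdio.h>

    /* Invented example fields -- not the real struct socket. */
    struct sock_separate {
            long    listen_backlog;     /* used only while listening */
            long    accept_queue_len;   /* used only while listening */
            long    snd_pressure;       /* used only once connected */
            long    rcv_pressure;       /* used only once connected */
    };

    struct sock_unioned {
            union {
                    struct {
                            long    backlog;
                            long    queue_len;
                    } listening;
                    struct {
                            long    snd_pressure;
                            long    rcv_pressure;
                    } connected;
            } u;
    };

    int
    main(void)
    {
            printf("separate: %zu bytes, unioned: %zu bytes\n",
                sizeof(struct sock_separate), sizeof(struct sock_unioned));
            return (0);
    }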

Yet another way would be to reduce the kqueue overhead by putting
the per-object queues into the same bucket, instead of having so
many TAILQ structures floating around.
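
For a sense of the scale (again a rough sketch, not the real
kqueue code): a TAILQ_HEAD is two pointers, and one per object
adds up quickly.

    #include <stdio.h>
    #include <sys/queue.h>

    /* Stand-in for struct knote; only the list linkage matters here. */
    struct knote_stub {
            TAILQ_ENTRY(knote_stub) kn_link;
            int                     kn_id;
    };

    TAILQ_HEAD(knote_list, knote_stub);

    int
    main(void)
    {
            long nobjects = 1000000;        /* hypothetical object count */

            printf("one TAILQ head: %zu bytes\n",
                sizeof(struct knote_list));
            printf("one head per object, %ld objects: %ld bytes\n",
                nobjects, nobjects * (long)sizeof(struct knote_list));
            return (0);
    }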

A final way would be to change the zone allocator to allocate
on a sizeof(long) boundary, which for 1,000,000 connections
saves a good 128M of memory at one shot.
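
The arithmetic behind that, as a hedged sketch (the per-object
size and the current rounding boundary below are made-up numbers,
not measured from the zone allocator): rounding each per-connection
object to sizeof(long) instead of a coarser boundary saves on the
order of a hundred-odd bytes per object, which is roughly the 128M
across a million connections.

    #include <stdio.h>

    /*
     * Illustrative only: the object size (640) and the coarse
     * rounding boundary (256) are invented, not taken from the
     * real allocator.
     */
    static size_t
    roundup_to(size_t sz, size_t boundary)
    {
            return ((sz + boundary - 1) / boundary * boundary);
    }

    int
    main(void)
    {
            size_t obj = 640;               /* hypothetical per-connection object */
            size_t coarse = 256;            /* hypothetical current rounding */
            size_t fine = sizeof(long);     /* proposed rounding boundary */
            size_t nconn = 1000000;
            size_t waste = roundup_to(obj, coarse) - roundup_to(obj, fine);

            printf("waste per object: %zu bytes\n", waste);
            printf("for %zu connections: %zu bytes (~%zu MB)\n",
                nconn, waste * nconn, waste * nconn / (1024 * 1024));
            return (0);
    }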

There's a lot of low hanging fruit.

Frankly, all the interesting applications have CPU overhead
involved, so the trade-off in CPU overhead from upping the HZ
value is probably a bad trade anyway (I hinted at that earlier).

-- Terry
