From: Terry Lambert
Date: Mon, 04 Feb 2002 01:14:22 -0800
To: Mike Silbersack
Cc: Alfred Perlstein, Luigi Rizzo, Storms of Perfection,
    thierry@herbelot.com, replicator@ngs.ru, hackers@FreeBSD.org
Subject: Re: Clock Granularity (kernel option HZ)

Mike Silbersack wrote:
> I was looking at the timer implementation a few weeks ago and
> pondering if it could be improved as well.  I think it's quite clever
> if you're dealing with a quantity of timers < 1000 or so, but it
> looks like it could be more optimal when used with the quantity of
> timers we have these days.

You can up the number of buckets; the theory is that most network
timeouts have "untimeout" called on them before they have a chance to
fire.

This means that any time you hit a slot in the wheel, you have to
traverse the entire list of timeouts chained off that slot, since the
insertion of new timeouts is not ordered.

Ordered insertion isn't really easy to do: you would need a data
structure that can be accessed as a btree (e.g. one that is
bsearchable) to even get O(log2(N) + .5) time on the insertion.

Whether there can be more than one entry per bucket, and whether the
bucket lists span more than 2MSL across all buckets, is determined by
the number of slots vs. the HZ of the soft timer vs. the number of
outstanding requests (i.e. high HZ vs. number of slots).

Even so, a large number of connection requests arriving in a given
soft timer quantum all end up in the same bucket, which is worst case
performance under an attack.

So yeah: it works for a small number of connections (e.g. desktop
use), but falls over under a large number of connections relative to
log2(N) for N buckets (server use), so it's pretty suboptimal.  And
the longer the list, the less likely you are to get through all the
entries within a quantum, so at some point you will starvation
deadlock, doing nothing but timer list traversal.
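To make the traversal cost concrete, here's a rough sketch of what
servicing one wheel slot looks like when the bucket chains are
unordered.  This is illustrative only, not the actual kern_timeout.c
code; all of the names are made up:

    #include <stddef.h>

    struct callout {
        struct callout  *c_next;
        unsigned long    c_time;            /* absolute tick to fire at */
        void           (*c_func)(void *);   /* handler to call */
        void            *c_arg;             /* handler argument */
    };

    #define WHEEL_SIZE  256                 /* buckets; power of two */

    static struct callout *wheel[WHEEL_SIZE];

    void
    softclock_run_slot(unsigned long tick)
    {
        struct callout **cpp = &wheel[tick & (WHEEL_SIZE - 1)];
        struct callout *c;

        /*
         * Insertion is unordered, so every entry that hashed to
         * this slot must be examined, even though most of them
         * will be untimeout()ed long before they ever fire.
         */
        while ((c = *cpp) != NULL) {
            if (c->c_time == tick) {
                *cpp = c->c_next;           /* unlink... */
                c->c_func(c->c_arg);        /* ...and fire */
            } else {
                cpp = &c->c_next;           /* later revolution; keep walking */
            }
        }
    }

The while loop is the problem: its cost is the full chain length,
not the number of timers actually due this tick.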
> Did you attempt to combine the various tcp timers so that there would
> be one callout per socket?  I think it would be a pretty easy task;
> just store the various timers inside each socket, and only set a
> callout for the soonest to wake one.  Upon wakeup, quickly check them
> all.  That would reduce the total number of events in the callout
> wheel for one, and might help with the hash function aliasing
> long-term events such as keepalives into the current time.  (I'm
> assuming that the retransmission and other short-term timers do a
> good job of keeping themselves just ahead of the clock hand so that
> they aren't falsely checked.)

This actually doesn't work, unless you round all events to the same
interval.  Basically, it would mean rescheduling the same set of
timers for a small set of durations.

IMO, it's better to avoid the callout wheel entirely for this case.
You don't really want to examine all 2MSL timers for 1,000,000
connections every 2MSL.

My preferred approach is to create one list per fixed-duration timer.
You end up with a small number of these lists, leaving other timeout
events to fall on the callout wheel.  Then, when the softclock fires,
you go down each of this small number of lists until the time to
expire exceeds the current tick, at which point you bail.

This works with fixed-duration lists, where it doesn't work with the
callout wheel, because every element in a given list has the same
duration: if you link each new timer onto the end of its list when
you schedule it, you are guaranteed that the list is ordered by
increasing tick-at-which-to-fire.  In the callout wheel case, a
bucket is not necessarily ordered by increasing tick, and since the
durations are not fixed -- there is no fixed-duration guarantee --
you have to traverse the entire list.

Since one of your main assumptions is that most timeouts will be
untimed-out before they fire, a list of N elements only costs you a
traversal of M + 1 elements, where M is the number of elements that
actually fire in the given interval.  The effect is to reduce the
traversal of an N-element bucket to a single-element traversal in
most cases (M ~= 0, M << N).

This scales much better -- almost infinitely -- whereas the wheel's
cost grows with the total number of elements K (with M[0] ... M[n]
timers per interval, and n ~= 5 for a small number of fixed
intervals, K can get large), spread at roughly
K/number_of_buckets_in_wheel per bucket traversal.

Personally, I've looked at the "Opportunistic Timers" paper, and I'm
not convinced by its argument about overload conditions; I think all
it really does is free up CPU time in the non-overload case, which no
one cares about, and degrade to hard timers in the overload case,
where the "opportunity" to do timer processing never presents itself,
since you're already screwed by livelock at that point.

I think some small number of fixed-interval lists is probably the
correct way to go for large scaling.  You could probably safely eat a
variance of +/- < 1MSL, if you wanted to, by processing the soft
interrupt list for the 2MSL timer only every 1MSL (or more
frequently, if you needed higher resolution); in any case, it might
make sense to make _that_ list traversal a scheduled event on the
timer wheel.  However, I think you probably want to keep it separate,
and do the traversal of the fixed intervals up front, for other timer
load reasons (e.g. TCP timers aren't the only ones that tend to go up
hugely when you have 1,000,000 TCP connections lying around).

I would do the fixed-interval traversal first, before the wheel slot
list, since you *know* that all but the last compare will result in a
timer firing, so it's more like "real work" than the callout wheel.

-- Terry
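PS: Here's a minimal, untested sketch of the fixed-interval list
idea, using the sys/queue.h TAILQ macros.  The structure and function
names are made up for illustration; this is not proposed as the
actual implementation:

    #include <stddef.h>
    #include <sys/queue.h>

    struct fixed_timer {
        TAILQ_ENTRY(fixed_timer) ft_link;
        unsigned long ft_expire;          /* absolute tick to fire at */
        void        (*ft_func)(void *);   /* handler to call */
        void         *ft_arg;             /* handler argument */
    };

    TAILQ_HEAD(ft_list, fixed_timer);

    struct ft_class {                     /* one per fixed interval */
        struct ft_list fc_head;           /* oldest entry at the head */
        unsigned long  fc_duration;       /* the fixed interval, in ticks */
    };

    void
    ft_init(struct ft_class *fc, unsigned long duration)
    {
        TAILQ_INIT(&fc->fc_head);
        fc->fc_duration = duration;
    }

    /* O(1): tail insert preserves expiry order by construction. */
    void
    ft_schedule(struct ft_class *fc, struct fixed_timer *ft,
        unsigned long now)
    {
        ft->ft_expire = now + fc->fc_duration;
        TAILQ_INSERT_TAIL(&fc->fc_head, ft, ft_link);
    }

    /* O(1): most timers are cancelled before they ever fire. */
    void
    ft_cancel(struct ft_class *fc, struct fixed_timer *ft)
    {
        TAILQ_REMOVE(&fc->fc_head, ft, ft_link);
    }

    /*
     * Softclock path: fire the expired entries, then bail at the
     * first unexpired one -- M + 1 compares for M expirations, no
     * matter how many entries are queued behind it.
     */
    void
    ft_run(struct ft_class *fc, unsigned long now)
    {
        struct fixed_timer *ft;

        while ((ft = TAILQ_FIRST(&fc->fc_head)) != NULL) {
            if (ft->ft_expire > now)
                break;                    /* the rest are all later */
            TAILQ_REMOVE(&fc->fc_head, ft, ft_link);
            ft->ft_func(ft->ft_arg);
        }
    }

The point is the ft_run() loop: because the tail insert keeps each
list in expiry order, the first unexpired entry terminates the scan,
which is the M + 1 compare behavior described above.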