From: Terry Lambert
Date: Mon, 04 Feb 2002 01:14:22 -0800
To: Mike Silbersack
Cc: Alfred Perlstein, Luigi Rizzo, Storms of Perfection,
    thierry@herbelot.com, replicator@ngs.ru, hackers@FreeBSD.org
Subject: Re: Clock Granularity (kernel option HZ)

Mike Silbersack wrote:
> I was looking at the timer implementation a few weeks ago and
> pondering if it could be improved as well.  I think it's quite clever
> if you're dealing with a quantity of timers < 1000 or so, but it
> looks like it could be more optimal when used with the quantity of
> timers we have these days.

You can up the number of buckets; the theory is that most network
timeouts have "untimeout" called on them before they have a chance to
fire.

This means that any time you hit a slot in the wheel, you have to
traverse the entire list of timeouts chained off that slot, since the
insertion of new timeouts is not ordered.

Ordered insertion isn't really easy to do: you would need a data
structure that can be accessed as a btree (e.g. one that is
bsearchable) to even get O(log2(N) + .5) time on the insertion.

Whether there can be more than one entry per bucket, and whether the
bucket lists span more than 2MSL across all buckets, is determined by
the number of slots vs. the HZ of the soft timer vs. the number of
outstanding requests (i.e. high HZ vs. number of slots).

Even so, a large number of connection requests arriving in a given
soft timer quantum all end up in the same bucket, which is worst case
performance under an attack.

So yeah: it works for a small number of connections (e.g. desktop
use), but falls over under a large number of connections relative to
log2(N) for N buckets (server use), so it's pretty suboptimal.  And
the longer the list, the less likely you are to get through all the
entries within a quantum, so at some point you will starvation
deadlock, doing nothing but timer list traversal.
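To make the traversal cost concrete, here's a rough sketch of what
servicing one wheel slot looks like when the bucket chains are
unordered.  This is illustrative only, not the actual kern_timeout.c
code; all of the names are made up:

    #include <stddef.h>

    struct callout {
        struct callout  *c_next;
        unsigned long    c_time;            /* absolute tick to fire at */
        void           (*c_func)(void *);   /* handler to call */
        void            *c_arg;             /* handler argument */
    };

    #define WHEEL_SIZE  256                 /* buckets; power of two */

    static struct callout *wheel[WHEEL_SIZE];

    void
    softclock_run_slot(unsigned long tick)
    {
        struct callout **cpp = &wheel[tick & (WHEEL_SIZE - 1)];
        struct callout *c;

        /*
         * Insertion is unordered, so every entry that hashed to
         * this slot must be examined, even though most of them
         * will be untimeout()ed long before they ever fire.
         */
        while ((c = *cpp) != NULL) {
            if (c->c_time == tick) {
                *cpp = c->c_next;           /* unlink... */
                c->c_func(c->c_arg);        /* ...and fire */
            } else {
                cpp = &c->c_next;           /* later revolution; keep walking */
            }
        }
    }

The while loop is the problem: its cost is the full chain length,
not the number of timers actually due this tick.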
> Did you attempt to combine the various tcp timers so that there would
> be one callout per socket?  I think it would be a pretty easy task;
> just store the various timers inside each socket, and only set a
> callout for the soonest to wake one.  Upon wakeup, quickly check them
> all.  That would reduce the total number of events in the callout
> wheel for one, and might help with the hash function aliasing
> long-term events such as keepalives into the current time.  (I'm
> assuming that the retransmission and other short-term timers do a
> good job of keeping themselves just ahead of the clock hand so that
> they aren't falsely checked.)

This actually doesn't work, unless you round all events to the same
interval.  Basically, it would mean rescheduling the same set of
timers for a small set of durations.

IMO, it's better to avoid the callout wheel entirely for this case.
You don't really want to examine all 2MSL timers for 1,000,000
connections every 2MSL.

My preferred approach is to create one list per fixed-duration timer.
You end up with a small number of these lists, leaving other timeout
events to fall on the callout wheel.  Then, when the softclock fires,
you go down each of this small number of lists until the time to
expire exceeds the current tick, at which point you bail.

This works with fixed-duration lists, where it doesn't work with the
callout wheel, because every element in a given list has the same
duration: if you link each new timer onto the end of its list when
you schedule it, you are guaranteed that the list is ordered by
increasing tick-at-which-to-fire.  In the callout wheel case, a
bucket is not necessarily ordered by increasing tick, and since the
durations are not fixed -- there is no fixed-duration guarantee --
you have to traverse the entire list.

Since one of your main assumptions is that most timeouts will be
untimed-out before they fire, a list of N elements only costs you a
traversal of M + 1 elements, where M is the number of elements that
actually fire in the given interval.  The effect is to reduce the
traversal of an N-element bucket to a single-element traversal in
most cases (M ~= 0, M << N).

This scales much better -- almost infinitely -- whereas the wheel's
cost grows with the total number of elements K (with M[0] ... M[n]
timers per interval, and n ~= 5 for a small number of fixed
intervals, K can get large), spread at roughly
K/number_of_buckets_in_wheel per bucket traversal.

Personally, I've looked at the "Opportunistic Timers" paper, and I'm
not convinced by its argument about overload conditions; I think all
it really does is free up CPU time in the non-overload case, which no
one cares about, and degrade to hard timers in the overload case,
where the "opportunity" to do timer processing never presents itself,
since you're already screwed by livelock at that point.

I think some small number of fixed-interval lists is probably the
correct way to go for large scaling.  You could probably safely eat a
variance of +/- < 1MSL, if you wanted to, by processing the soft
interrupt list for the 2MSL timer only every 1MSL (or more
frequently, if you needed higher resolution); in any case, it might
make sense to make _that_ list traversal a scheduled event on the
timer wheel.  However, I think you probably want to keep it separate,
and do the traversal of the fixed intervals up front, for other timer
load reasons (e.g. TCP timers aren't the only ones that tend to go up
hugely when you have 1,000,000 TCP connections lying around).

I would do the fixed-interval traversal first, before the wheel slot
list, since you *know* that all but the last compare will result in a
timer firing, so it's more like "real work" than the callout wheel.

-- Terry
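PS: Here's a minimal, untested sketch of the fixed-interval list
idea, using the sys/queue.h TAILQ macros.  The structure and function
names are made up for illustration; this is not proposed as the
actual implementation:

    #include <stddef.h>
    #include <sys/queue.h>

    struct fixed_timer {
        TAILQ_ENTRY(fixed_timer) ft_link;
        unsigned long ft_expire;          /* absolute tick to fire at */
        void        (*ft_func)(void *);   /* handler to call */
        void         *ft_arg;             /* handler argument */
    };

    TAILQ_HEAD(ft_list, fixed_timer);

    struct ft_class {                     /* one per fixed interval */
        struct ft_list fc_head;           /* oldest entry at the head */
        unsigned long  fc_duration;       /* the fixed interval, in ticks */
    };

    void
    ft_init(struct ft_class *fc, unsigned long duration)
    {
        TAILQ_INIT(&fc->fc_head);
        fc->fc_duration = duration;
    }

    /* O(1): tail insert preserves expiry order by construction. */
    void
    ft_schedule(struct ft_class *fc, struct fixed_timer *ft,
        unsigned long now)
    {
        ft->ft_expire = now + fc->fc_duration;
        TAILQ_INSERT_TAIL(&fc->fc_head, ft, ft_link);
    }

    /* O(1): most timers are cancelled before they ever fire. */
    void
    ft_cancel(struct ft_class *fc, struct fixed_timer *ft)
    {
        TAILQ_REMOVE(&fc->fc_head, ft, ft_link);
    }

    /*
     * Softclock path: fire the expired entries, then bail at the
     * first unexpired one -- M + 1 compares for M expirations, no
     * matter how many entries are queued behind it.
     */
    void
    ft_run(struct ft_class *fc, unsigned long now)
    {
        struct fixed_timer *ft;

        while ((ft = TAILQ_FIRST(&fc->fc_head)) != NULL) {
            if (ft->ft_expire > now)
                break;                    /* the rest are all later */
            TAILQ_REMOVE(&fc->fc_head, ft, ft_link);
            ft->ft_func(ft->ft_arg);
        }
    }

The point is the ft_run() loop: because the tail insert keeps each
list in expiry order, the first unexpired entry terminates the scan,
which is the M + 1 compare behavior described above.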