Date: Sun, 27 Nov 2005 13:17:39 +0000 (GMT)
From: Robert Watson
To: Bruce Evans
Cc: cvs-src@FreeBSD.org, src-committers@FreeBSD.org, cvs-all@FreeBSD.org
Subject: Re: cvs commit: src/sys/sys time.h src/sys/kern kern_time.c
Message-ID: <20051127125844.V81764@fledge.watson.org>
In-Reply-To: <20051127230412.H28222@delplex.bde.org>

On Sun, 27 Nov 2005, Bruce Evans wrote:

>> Add experimental low-precision clockid_t names corresponding to these
>> clocks, but implemented using cached timestamps in kernel rather than
>> a full time counter query.
>
> The existence of these interfaces is a mistake even in the kernel.  On
> all machines that I've looked at, the calls to the high-precision
> binuptime() outnumber calls to all the other high-level timecounter
> routines combined by a large factor.
> E.g., on pluto1.freebsd.org (which
> seems typical) now, after an uptime of ~8 days, there have been ~1200
> million calls to binuptime(), ~124 million calls to getmicrouptime(),
> ~72 million calls to getmicrotime(), and relatively few other calls.
>
> Thus we get a small speedup at a cost of some complexity and large
> interface bloat.
>
> This is partly because there are too many context switches and context
> switches necessarily use a precise timestamp, and file timestamps are
> under-represented since they normally use a direct access to
> time_second.

Interestingly, I've now observed several application workloads where
the rate of user-space high-precision time queries far exceeds the
kernel's rate of timestamp queries.  This is the case for event-driven
applications that need to generate timeouts to pass to poll() and
select().  Applications like BIND9 generate two gettimeofday() system
calls for every select() call in order to manage their own internal
event engine.  As select() itself has a precision keyed to 1/HZ, using
timestamps at a similarly low precision to drive an internal scheduler
based on select() or poll() makes some amount of sense.  Using the
libwrapper.so I attached to my previous e-mail and setting 'FAST' mode,
I see a 4% improvement in throughput for BIND9.  David Xu has reported
a similar improvement in MySQL performance using libwrapper.so.

For BIND9 under high load, the rate of context switches is much lower
than the rate of select() calls, as multiple queries are delivered to
the UDP socket per interrupt due to interrupt coalescing (etc).

Given the way applications are being written to manage their own event
loops using select() or similar interfaces, the ability to quickly
request low-precision timestamps for use with those interfaces makes a
fairly significant difference in macro-level performance.
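The select()-driven pattern described above can be sketched roughly as
follows.  This is only an illustration, not the libwrapper.so code: the
helper names cheap_now_us() and timeout_until() and the CHEAP_CLOCK
macro are hypothetical, and the sketch assumes the platform offers a
low-precision monotonic clock ID (CLOCK_MONOTONIC_FAST where it is
defined, CLOCK_MONOTONIC_COARSE as a Linux analogue), falling back to
plain CLOCK_MONOTONIC otherwise.

```c
#define _DEFAULT_SOURCE 1   /* glibc: expose the POSIX clock API; ignored elsewhere */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/time.h>
#include <time.h>

/* Assumption: choose the cheapest monotonic clock ID the headers define. */
#if defined(CLOCK_MONOTONIC_FAST)
#define CHEAP_CLOCK CLOCK_MONOTONIC_FAST      /* low-precision name, if present */
#elif defined(CLOCK_MONOTONIC_COARSE)
#define CHEAP_CLOCK CLOCK_MONOTONIC_COARSE    /* Linux analogue */
#else
#define CHEAP_CLOCK CLOCK_MONOTONIC           /* fallback: full-precision clock */
#endif

/* Hypothetical helper: current time on the cheap clock in microseconds,
 * or -1 if the clock cannot be read. */
int64_t cheap_now_us(void)
{
    struct timespec ts;

    if (clock_gettime(CHEAP_CLOCK, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Hypothetical helper: fill *out with the time left until deadline_us,
 * clamped to zero when the deadline has already passed.  The result is
 * suitable as the timeout argument of select(). */
void timeout_until(int64_t now_us, int64_t deadline_us, struct timeval *out)
{
    int64_t left;

    assert(out != NULL);
    left = deadline_us > now_us ? deadline_us - now_us : 0;
    out->tv_sec = (time_t)(left / 1000000);
    out->tv_usec = (suseconds_t)(left % 1000000);
}
```

An event loop would call cheap_now_us() once per iteration, pass the
result and its next deadline to timeout_until(), and hand the resulting
struct timeval to select().  Since select() only wakes with roughly
1/HZ granularity anyway, the coarser timestamp should not make the
loop's timing noticeably worse.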
How we expose "cheaper, suckier time" is something I'm quite willing to
discuss, but the evidence seems to suggest that if we want to improve
the performance of this class of applications, we need to provide
timekeeping services that match their requirements (run frequently,
with fairly weak requirements on precision).  I'm entirely open to
exposing this service in different ways, or offering a different notion
of "cheaper, suckier".  For example, I could imagine exposing an
interface intended to return timing information specifically for
HZ-driven sleep mechanisms, such as poll() and select().

The advantage, for experimental purposes, of the approach I committed
is that it allows us to easily test the impact of such changes on
applications without modifying the applications.  The disadvantage is
that we'll want to change it, but given that I am not yet sure we fully
understand the requirements, that is probably inevitable.

FWIW, once we have an interface that says "here's how you get bad
time", we can implement it in other ways than I've done -- for example,
exporting a kernel memory page with the necessary information to
somewhat reliably convert rdtsc() into an estimated timestamp without
ever doing a system call (this is what Darwin does, btw).

Your proposals on how this should be done are most welcome, but the
trick will be balancing the needs of several parties -- people
interested in highly precise time measurement due to a preoccupation
with NTP and atomic clocks, people who just want their applications to
run faster, and people who want the system to be clean.  I think we can
meet most of the needs of most of these people if we do it right, but
I'm not sure what right is since (to be honest) I don't have a detailed
understanding of what each of these communities really needs (let alone
wants).

Robert N M Watson