From: Bruce Evans
Date: Fri, 16 Aug 2002 22:21:59 +1000 (EST)
To: David Greenman-Lawrence
Cc: David Greenman
Subject: Re: cvs commit: src/sys/kern uipc_socket2.c
In-Reply-To: <20020815221609.E42978@nexus.root.com>
Message-ID: <20020816212205.N6756-100000@gamplex.bde.org>

On Thu, 15 Aug 2002, David Greenman-Lawrence wrote:

> >dg          2002/08/15 22:08:46 PDT
> >
> >   Modified files:
> >     sys/kern             uipc_socket2.c
> >   Log:
> >   Rewrote the space check algorithm in sbreserve() so that the extremely
> >   expensive (!) 64bit multiply, divide, and comparison aren't necessary
> ...
> >   The 64bit math in this function was measured in some kernel profiles as
> >   being as much as 5-8% of the total overhead of the TCP/IP stack and

Wow.  Doing as much 64-bit arithmetic on 32-bit systems as we do is
inelegant at best, but recent changes to do more of it in ffs and vm
didn't seem to have much effect on efficiency.  I guess this is because
the calculations are not at such an inner loop level as they are in
networking.

> I'm looking at calcru() as well, since it does four 64bit divides and
> was as expensive as sbreserve() in my profiles.
> The __qdivrem function that does the 64bit divide appears to consume
> several thousand instructions in the common case.  Something to avoid
> if at all possible.

calcru() didn't seem to be so bad in my profiles.  I think its
efficiency only matters if you have a lot of very short-lived
processes, or if you have processes that call getrusage() too much.

calcru() actually needs 128/64 -> 64 bit divides to work correctly.  It
needs to scale 64-bit tick counts according to ratios of other 64-bit
counts, and the only reasonable way to do this seems to be to calculate
expressions like ((tu * ut) / uu) where all the variables are 64-bit.
The multiplication should be 64*64 -> 128 bit, but we don't have this
(even on 64-bit machines?), so the multiplication just overflows.
There is an open PR about this.  Things start to overflow when any tick
counter exceeds 2^32, which happens after about 497 days at HZ=100 and
after only about 49 days at HZ=1000.  The submitter of the PR actually
had a single process which had run for long enough (IIRC, it was for
more than 100 days but less than 497).  This shows that u_int64_t is a
bogus type for the tick counters.  It just pessimizes incrementing the
counters on 32-bit machines and causes overflow if more than 32 bits
are actually needed.

Further observations on 64-bit multiplication and division:

- gcc (on i386's) does 32*64 -> 64-bit multiplications by first
  extending the 32-bit value to a 64-bit one, so it has to do 3
  multiplications instead of 2.
- our library's 64/64 -> 64-bit division routine is similarly
  unclever.  Many cases can be optimized to 64/32 -> 64-bit division,
  which takes 1 instruction (on i386's).

Further observations on calcru():

- short-lived processes only need 32-bit calculations.  These could be
  done explicitly, or perhaps without too much loss of efficiency by
  doing 64- or 128-bit calculations and relying on the low-level
  routines not being pessimized as above.
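To make the overflow concrete: once any tick counter passes 2^32, the
64-bit product in ((tu * ut) / uu) wraps silently.  This is a minimal
sketch, not kernel code; the names tu/ut/uu follow the expression
above, and the 128-bit type is a gcc/clang extension used here only to
show what the correct result would be (the kernel cannot assume it):

```c
#include <stdint.h>

/* Naive 64-bit scaling, the shape of ((tu * ut) / uu): the multiply
 * silently wraps once tu * ut no longer fits in 64 bits. */
static uint64_t
scale64(uint64_t tu, uint64_t ut, uint64_t uu)
{
	return (tu * ut / uu);
}

/* The same scaling with a 128-bit intermediate (a gcc extension, shown
 * only to demonstrate the correct answer for comparison). */
static uint64_t
scale128(uint64_t tu, uint64_t ut, uint64_t uu)
{
	return ((uint64_t)((unsigned __int128)tu * ut / uu));
}
```

For a process with ~500 days of ticks at HZ=100 (tu = 43200000000000
microseconds, ut = uu = 4320000000 ticks, so ut > 2^32), scale128()
returns tu exactly, while scale64() returns garbage because tu * ut
has wrapped.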
  Here "short-lived" means having a lifetime of less than about
  sqrt(2^32) ticks = 655 seconds at HZ=100.
- the times for long-lived processes could be scaled using
  mostly-incrementally calculated approximations.  The scaled times
  are already only approximate, since the scale factors are only
  approximate.  We have to force each component of the scaled time
  (user/sys/interrupt) to advance monotonically, and this may cause
  the infinite-precision ratios of the scaled times to be somewhat
  different from the ratios of the tick counts.  However, rescaling at
  every step as accurately as possible, as is done now, hopefully
  causes the ratios to not differ by a large amount, and any good
  approximate method should have the same property.  Perhaps doing a
  full 128-bit scaling to set a new goal for the ratios every 2^16
  ticks would work well (otherwise use 32-bit scaling).

Bruce
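The 32-bit fast path for short-lived processes suggested above might
look like the following sketch.  All names are invented for
illustration (the real calcru() takes a struct proc and fills in
timevals), and the wide fallback uses a gcc 128-bit extension purely
as a stand-in for the 128/64 -> 64 divide that would really be needed:

```c
#include <stdint.h>

/* Hypothetical helper: scale total runtime 'tu' (microseconds) by the
 * ratio of 'ut' ticks in one category to 'tt' total ticks. */
static uint64_t
scale_ticks(uint64_t tu, uint64_t ut, uint64_t tt)
{
	if (tt == 0)
		return (0);
	/* Fast path for short-lived processes: all counters fit in 32
	 * bits, so the product fits in 64 bits and the divisor fits in
	 * 32 bits; this needs no wide intermediates. */
	if (tu <= UINT32_MAX && ut <= UINT32_MAX && tt <= UINT32_MAX)
		return (tu * ut / (uint32_t)tt);
	/* Slow path for long-lived processes: a 128-bit intermediate
	 * avoids the wrap (compiler extension as a stand-in here). */
	return ((uint64_t)((unsigned __int128)tu * ut / tt));
}
```

A short-lived process (655 s of runtime, counters well under 2^32)
takes the cheap branch; the 500-day process from the PR takes the wide
branch and still gets an exact result.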