From: Bruce Evans
Date: Fri, 16 Aug 2002 22:21:59 +1000 (EST)
To: David Greenman-Lawrence
Cc: David Greenman
Subject: Re: cvs commit: src/sys/kern uipc_socket2.c
In-Reply-To: <20020815221609.E42978@nexus.root.com>
Message-ID: <20020816212205.N6756-100000@gamplex.bde.org>

On Thu, 15 Aug 2002, David Greenman-Lawrence wrote:

> >dg          2002/08/15 22:08:46 PDT
> >
> >   Modified files:
> >     sys/kern             uipc_socket2.c
> >   Log:
> >   Rewrote the space check algorithm in sbreserve() so that the extremely
> >   expensive (!) 64bit multiply, divide, and comparison aren't necessary
> ...
> >   The 64bit math in this function was measured in some kernel profiles as
> >   being as much as 5-8% of the total overhead of the TCP/IP stack and

Wow.  Doing as much 64-bit arithmetic on 32-bit systems as we do is
inelegant at best, but recent changes to do more of it in ffs and vm
didn't seem to have much effect on efficiency.  I guess this is because
the calculations are not at such an inner loop level as they are in
networking.

> I'm looking at calcru() as well, since it does four 64bit divides and
> was as expensive as sbreserve() in my profiles.
> The __qdivrem function that does the 64bit divide appears to consume
> several thousand instructions in the common case.  Something to avoid
> if at all possible.

calcru() didn't seem to be so bad in my profiles.  I think its
efficiency only matters if you have a lot of very short-lived
processes, or if you have processes that call getrusage() too much.

calcru() actually needs 128/64 -> 64 bit divides to work correctly.  It
needs to scale 64-bit tick counts according to ratios of other 64-bit
counts, and the only reasonable way to do this seems to be to calculate
expressions like ((tu * ut) / uu) where all the variables are 64-bit.
The multiplication should be 64*64 -> 128 bit, but we don't have this
(even on 64-bit machines?), so the multiplication just overflows.
There is an open PR about this.  Things start to overflow when any tick
counter exceeds 2^32, which happens after about 497 days at HZ=100 and
after only about 49 days at HZ=1000.  The submitter of the PR actually
had a single process which had run for long enough (IIRC, it was for
more than 100 days but less than 497).  This shows that u_int64_t is a
bogus type for the tick counters.  It just pessimizes incrementing the
counters on 32-bit machines and causes overflow if more than 32 bits
are actually needed.

Further observations on 64-bit multiplication and division:

- gcc (on i386's) does 32*64 -> 64-bit multiplications by first
  extending the 32-bit value to a 64-bit one, so it has to do 3
  multiplications instead of 2.
- our library's 64/64 -> 64-bit division routine is similarly
  unclever.  Many cases can be optimized to 64/32 -> 64-bit division,
  which takes 1 instruction (on i386's).

Further observations on calcru():

- short-lived processes only need 32-bit calculations.  These could be
  done explicitly, or perhaps without too much loss of efficiency by
  doing 64- or 128-bit calculations and relying on the low-level
  routines not being pessimized as above.
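To make the overflow concrete: once any tick counter passes 2^32, the
64-bit product in ((tu * ut) / uu) wraps silently.  This is a minimal
sketch, not kernel code; the names tu/ut/uu follow the expression
above, and the 128-bit type is a gcc/clang extension used here only to
show what the correct result would be (the kernel cannot assume it):

```c
#include <stdint.h>

/* Naive 64-bit scaling, the shape of ((tu * ut) / uu): the multiply
 * silently wraps once tu * ut no longer fits in 64 bits. */
static uint64_t
scale64(uint64_t tu, uint64_t ut, uint64_t uu)
{
	return (tu * ut / uu);
}

/* The same scaling with a 128-bit intermediate (a gcc extension, shown
 * only to demonstrate the correct answer for comparison). */
static uint64_t
scale128(uint64_t tu, uint64_t ut, uint64_t uu)
{
	return ((uint64_t)((unsigned __int128)tu * ut / uu));
}
```

For a process with ~500 days of ticks at HZ=100 (tu = 43200000000000
microseconds, ut = uu = 4320000000 ticks, so ut > 2^32), scale128()
returns tu exactly, while scale64() returns garbage because tu * ut
has wrapped.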
  Here "short-lived" means having a lifetime of less than about
  sqrt(2^32) ticks = 655 seconds at HZ=100.
- the times for long-lived processes could be scaled using
  mostly-incrementally calculated approximations.  The scaled times
  are already only approximate, since the scale factors are only
  approximate.  We have to force each component of the scaled time
  (user/sys/interrupt) to advance monotonically, and this may cause
  the infinite-precision ratios of the scaled times to be somewhat
  different from the ratios of the tick counts.  However, rescaling at
  every step as accurately as possible, as is done now, hopefully
  causes the ratios to not differ by a large amount, and any good
  approximate method should have the same property.  Perhaps doing a
  full 128-bit scaling to set a new goal for the ratios every 2^16
  ticks would work well (otherwise use 32-bit scaling).

Bruce
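The 32-bit fast path for short-lived processes suggested above might
look like the following sketch.  All names are invented for
illustration (the real calcru() takes a struct proc and fills in
timevals), and the wide fallback uses a gcc 128-bit extension purely
as a stand-in for the 128/64 -> 64 divide that would really be needed:

```c
#include <stdint.h>

/* Hypothetical helper: scale total runtime 'tu' (microseconds) by the
 * ratio of 'ut' ticks in one category to 'tt' total ticks. */
static uint64_t
scale_ticks(uint64_t tu, uint64_t ut, uint64_t tt)
{
	if (tt == 0)
		return (0);
	/* Fast path for short-lived processes: all counters fit in 32
	 * bits, so the product fits in 64 bits and the divisor fits in
	 * 32 bits; this needs no wide intermediates. */
	if (tu <= UINT32_MAX && ut <= UINT32_MAX && tt <= UINT32_MAX)
		return (tu * ut / (uint32_t)tt);
	/* Slow path for long-lived processes: a 128-bit intermediate
	 * avoids the wrap (compiler extension as a stand-in here). */
	return ((uint64_t)((unsigned __int128)tu * ut / tt));
}
```

A short-lived process (655 s of runtime, counters well under 2^32)
takes the cheap branch; the 500-day process from the PR takes the wide
branch and still gets an exact result.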