From owner-freebsd-arch@FreeBSD.ORG Thu May 31 00:17:57 2007
Date: Thu, 31 May 2007 10:17:17 +1000 (EST)
From: Bruce Evans
To: Jeff Roberson
Cc: freebsd-arch@freebsd.org
Subject: Re: Updated rusage patch
Message-ID: <20070531091419.S826@besplex.bde.org>
In-Reply-To: <20070530115752.F661@10.0.0.1>
References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org>
 <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org>
 <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org>
 <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1>
 <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1>

On Wed, 30 May 2007, Jeff Roberson wrote:

> Ok, hopefully we're in the home stretch here.  Patch is at the same location.
> I did the following:
> ...

% Index: kern/kern_acct.c
% ===================================================================
% RCS file: /usr/home/ncvs/src/sys/kern/kern_acct.c,v
% retrieving revision 1.89
% diff -u -p -r1.89 kern_acct.c
% --- kern/kern_acct.c        22 May 2007 06:51:37 -0000        1.89
% +++ kern/kern_acct.c        30 May 2007 00:14:12 -0000
% @@ -337,7 +337,7 @@ acct_process(struct thread *td)
%         struct timeval ut, st, tmp;
%         struct plimit *newlim, *oldlim;
%         struct proc *p;
% -       struct rusage *r;
% +       struct rusage ru;
%         int t, ret, vfslocked;
%
%         /*
% @@ -370,6 +370,7 @@ acct_process(struct thread *td)
%         bcopy(p->p_comm, acct.ac_comm, sizeof acct.ac_comm);
%
%         /* (2) The amount of user and system time that was used */
% +       rufetch(p, &ru);
%         calcru(p, &ut, &st);
%         acct.ac_utime = encode_timeval(ut);
%         acct.ac_stime = encode_timeval(st);
% @@ -383,19 +384,18 @@ acct_process(struct thread *td)
%         acct.ac_etime = encode_timeval(tmp);
%
%         /* (4) The average amount of memory used */
% -       r = &p->p_stats->p_ru;
%         tmp = ut;
%         timevaladd(&tmp, &st);
%         /* Convert tmp (i.e. u + s) into hz units to match ru_i*. */
%         t = tmp.tv_sec * hz + tmp.tv_usec / tick;
%         if (t)
% -               acct.ac_mem = encode_long((r->ru_ixrss + r->ru_idrss +
% -                   + r->ru_isrss) / t);
% +               acct.ac_mem = encode_long((ru.ru_ixrss + ru.ru_idrss +
% +                   + ru.ru_isrss) / t);
%         else
%                 acct.ac_mem = 0;
%
%         /* (5) The number of disk I/O operations done */
% -       acct.ac_io = encode_long(r->ru_inblock + r->ru_oublock);
% +       acct.ac_io = encode_long(ru.ru_inblock + ru.ru_oublock);
%
%         /* (6) The UID and GID of the process */
%         acct.ac_uid = p->p_ucred->cr_ruid;

I see the problem here.  acct_process() should use the finalized p_ru.
However, acct_process() is called from exit1() even before the premature
finalization there.  Thus the rufetch() here is necessary to aggregate
the stats, but it may give slightly anachronistic stats.  No one cares
(especially for the acct times, which had a granularity of 1/64 second
until recently, so they were usually 0 and rarely showed the differences
of a few usec due to this race), but it would be cleaner not to do extra
work here.  (A sketch of the kind of per-thread aggregation involved is
below, after the kern_exit.c discussion.)

% Index: kern/kern_exit.c
% ===================================================================
% RCS file: /usr/home/ncvs/src/sys/kern/kern_exit.c,v
% retrieving revision 1.298
% diff -u -p -r1.298 kern_exit.c
% --- kern/kern_exit.c        14 May 2007 22:21:58 -0000        1.298
% +++ kern/kern_exit.c        30 May 2007 00:10:21 -0000
% @@ -169,7 +170,8 @@ retry:
%          * Threading support has been turned off.
%          */
%         }
% -
% +       KASSERT(p->p_numthreads == 1,
% +           ("exit1: proc %p exiting with %d threads", p, p->p_numthreads));
%         /*
%          * Wakeup anyone in procfs' PIOCWAIT. They should have a hold
%          * on our vmspace, so we should block below until they have

Lost some formatting.

% @@ -438,15 +442,17 @@ retry:
%                 PROC_UNLOCK(q);
%         }
%
% -       /*
% -        * Save exit status and finalize rusage info except for times,
% -        * adding in child rusage info later when our time is locked.
% -        */
% +       /* Save exit status. */
%         PROC_LOCK(p);
%         p->p_xstat = rv;
%         p->p_xthread = td;
% -       p->p_stats->p_ru.ru_nvcsw++;
% -       *p->p_ru = p->p_stats->p_ru;
% +       /*
% +        * All statistics have been aggregated into the final td_ru by
% +        * thread_exit(), just copy it into p_ru here.
% +        */
% +       *ru = td->td_ru;
% +       ru->ru_nvcsw++;
% +       p->p_ru = ru;
%
%         /*
%          * Notify interested parties of our demise.

Still have various bugs here, starting with the grammar error (comma
splice) in the comment.  I now understand what you mean by thread_exit()
having aggregated the stats -- thread_exit() hasn't been called yet for
this thread, but it has been called for other threads in the process, if
any.  So td_ru has aggregated stats, and accessing it directly just gives
the old races for the unthreaded case:
- td_ru is locked by sched_lock, but there is no sched_locking here.
- any further updates of td_ru except for the times are not copied into
  p_ru, unless they occur during the above copy, in which case they
  corrupt the copy.
These bugs can probably be fixed by moving the copying to the final
thread_exit().  Maybe acct_process() can be moved there too, but I
suspect too much of the process has gone away by then for acct_process()
to work.
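To make that concrete, here is a minimal, untested sketch of a helper
that the last thread_exit() could call with sched_lock held to do the
final copy.  The helper name, and the assumption that exit1() has
already pointed p_ru at the preallocated buffer before the last thread
exits, are mine, not the patch's:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: finalize p_ru from the last exiting thread.
 * Assumes sched_lock is held (as it is in thread_exit()) and that
 * exit1() has already stored the preallocated buffer in p->p_ru.
 */
static void
rufinalize_sketch(struct thread *td)
{
        struct proc *p;

        p = td->td_proc;
        mtx_assert(&sched_lock, MA_OWNED);
        KASSERT(p->p_numthreads == 1,
            ("rufinalize_sketch: %d threads", p->p_numthreads));
        /* td_ru has already accumulated the stats of the exited threads. */
        *p->p_ru = td->td_ru;
        p->p_ru->ru_nvcsw++;
}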
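Going back to the acct_process() point above: a minimal sketch of the
kind of per-thread aggregation that rufetch() has to do (not the patch's
actual implementation; same headers as above, only a few of the ru_*
fields are shown, td_ru is assumed to be locked by sched_lock, and the
caller is assumed to hold the proc lock so that the thread list is
stable):

static void
rufetch_sketch(struct proc *p, struct rusage *ru)
{
        struct thread *td;

        bzero(ru, sizeof(*ru));
        mtx_lock_spin(&sched_lock);
        FOREACH_THREAD_IN_PROC(p, td) {
                /* Only a few fields shown; the others sum the same way. */
                ru->ru_nvcsw += td->td_ru.ru_nvcsw;
                ru->ru_nivcsw += td->td_ru.ru_nivcsw;
                ru->ru_inblock += td->td_ru.ru_inblock;
                ru->ru_oublock += td->td_ru.ru_oublock;
                ru->ru_ixrss += td->td_ru.ru_ixrss;
                ru->ru_idrss += td->td_ru.ru_idrss;
                ru->ru_isrss += td->td_ru.ru_isrss;
        }
        mtx_unlock_spin(&sched_lock);
}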
% Index: kern/kern_resource.c
% ===================================================================
% RCS file: /usr/home/ncvs/src/sys/kern/kern_resource.c,v
% retrieving revision 1.171
% diff -u -p -r1.171 kern_resource.c
% --- kern/kern_resource.c        27 May 2007 20:50:23 -0000        1.171
% +++ kern/kern_resource.c        30 May 2007 11:52:28 -0000
% @@ -619,6 +619,38 @@ setrlimit(td, uap)
%         return (error);
% }
%
% +static void
% +lim_cb(void *arg)
% +{
% +       struct rlimit rlim;
% +       struct thread *td;
% +       struct proc *p;
% +
% +       p = arg;
% ...
% +       callout_reset(&p->p_limco, hz, lim_cb, p);
% +}
% +

A better cleanup than implicitly casting p to "void *" here: use the
original "void *" argument `arg' (see the sketch at the end of this
message).

% Index: sys/proc.h
% ===================================================================
% RCS file: /usr/home/ncvs/src/sys/sys/proc.h,v
% retrieving revision 1.477
% diff -u -p -r1.477 proc.h
% --- sys/proc.h        24 Apr 2007 10:59:21 -0000        1.477
% +++ sys/proc.h        30 May 2007 11:41:53 -0000
% ...
% @@ -572,7 +576,7 @@ struct proc {
%         struct mdproc p_md;             /* Any machine-dependent fields. */
%         struct callout p_itcallout;     /* (h + c) Interval timer callout. */
%         u_short p_acflag;               /* (c) Accounting flags. */
% -       struct rusage *p_ru;            /* (a) Exit information. XXX */
% +       struct rusage *p_ru;            /* (a) Exit information. */

The locking annotation doesn't work well for pointers.  Locking for the
contents is more interesting than locking for the pointer.  The locking
annotation for struct rusage says mostly (j) (sched_lock), but that only
applies to td_ru -- it doesn't apply here and applies even less to
rusages in userland.  resource.h is supposed to be a userland header.

% Index: sys/resource.h
% ===================================================================
% RCS file: /usr/home/ncvs/src/sys/sys/resource.h,v
% retrieving revision 1.30
% diff -u -p -r1.30 resource.h
% --- sys/resource.h        16 Nov 2005 18:18:52 -0000        1.30
% +++ sys/resource.h        29 May 2007 20:29:22 -0000
% @@ -50,33 +50,31 @@
% /*
%  * Resource utilization information.
%  *
% - * Locking key:
% - *      c - locked by proc mtx
% - *      j - locked by sched_lock mtx
% - *      n - not locked, lazy
% + * All fields are only modified by curthread and
% + * no locks are required to read.
%  */

Oops, I now see that you removed all the (j)s and sched locking here.
But this isn't quite right -- many of the fields are accessed by
statclock(), so something like interrupt atomicity or splhigh() is
required to access them.  The above (mainly in exit1()) gives much the
same races as in RELENG_4:
- RELENG_4: statclock() uses splhigh() but not interrupt atomicity.
  exit1() uses no locking and thus races with statclock().
- above: statclock() still uses sched_lock but not interrupt atomicity.
  exit1() uses no locking and thus races with statclock().  Time fields
  are mostly in rux and still fully locked by sched_lock.  exit1()
  copies some of them to p_ru, but that copy is not used.

I think proc locking is still used for p_ru -- it is used in kern_wait(),
where it seems to be necessary to prevent multiple threads in the parent
process racing to reap the child.

Bruce
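A sketch of the lim_cb() cleanup mentioned above (the body of the
function is elided, as in the quoted diff; only the reschedule changes):

static void
lim_cb(void *arg)
{
        struct proc *p;

        p = arg;
        /* ... CPU limit checking elided, as in the quoted diff ... */
        /* Reschedule using the original "void *" argument. */
        callout_reset(&p->p_limco, hz, lim_cb, arg);
}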