From owner-freebsd-arch@FreeBSD.ORG Thu May 31 00:17:57 2007
Date: Thu, 31 May 2007 10:17:17 +1000 (EST)
From: Bruce Evans
To: Jeff Roberson
Cc: freebsd-arch@freebsd.org
Subject: Re: Updated rusage patch
Message-ID: <20070531091419.S826@besplex.bde.org>
In-Reply-To: <20070530115752.F661@10.0.0.1>
References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org>
 <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org>
 <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org>
 <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1>
 <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1>

On Wed, 30 May 2007, Jeff Roberson wrote:

> Ok, hopefully we're in the home stretch here.  Patch is at the same location.
> I did the following:
> ...

% Index: kern/kern_acct.c
% ===================================================================
% RCS file: /usr/home/ncvs/src/sys/kern/kern_acct.c,v
% retrieving revision 1.89
% diff -u -p -r1.89 kern_acct.c
% --- kern/kern_acct.c        22 May 2007 06:51:37 -0000        1.89
% +++ kern/kern_acct.c        30 May 2007 00:14:12 -0000
% @@ -337,7 +337,7 @@ acct_process(struct thread *td)
%         struct timeval ut, st, tmp;
%         struct plimit *newlim, *oldlim;
%         struct proc *p;
% -       struct rusage *r;
% +       struct rusage ru;
%         int t, ret, vfslocked;
%
%         /*
% @@ -370,6 +370,7 @@ acct_process(struct thread *td)
%         bcopy(p->p_comm, acct.ac_comm, sizeof acct.ac_comm);
%
%         /* (2) The amount of user and system time that was used */
% +       rufetch(p, &ru);
%         calcru(p, &ut, &st);
%         acct.ac_utime = encode_timeval(ut);
%         acct.ac_stime = encode_timeval(st);
% @@ -383,19 +384,18 @@ acct_process(struct thread *td)
%         acct.ac_etime = encode_timeval(tmp);
%
%         /* (4) The average amount of memory used */
% -       r = &p->p_stats->p_ru;
%         tmp = ut;
%         timevaladd(&tmp, &st);
%         /* Convert tmp (i.e. u + s) into hz units to match ru_i*. */
%         t = tmp.tv_sec * hz + tmp.tv_usec / tick;
%         if (t)
% -               acct.ac_mem = encode_long((r->ru_ixrss + r->ru_idrss +
% -                   + r->ru_isrss) / t);
% +               acct.ac_mem = encode_long((ru.ru_ixrss + ru.ru_idrss +
% +                   + ru.ru_isrss) / t);
%         else
%                 acct.ac_mem = 0;
%
%         /* (5) The number of disk I/O operations done */
% -       acct.ac_io = encode_long(r->ru_inblock + r->ru_oublock);
% +       acct.ac_io = encode_long(ru.ru_inblock + ru.ru_oublock);
%
%         /* (6) The UID and GID of the process */
%         acct.ac_uid = p->p_ucred->cr_ruid;

I see the problem here.  acct_process() should use the finalized p_ru.
However, acct_process() is called from exit1() even before the premature
finalization there.  Thus the rufetch() here is necessary to aggregate
the stats, but it may give slightly anachronistic stats.  No one cares
(especially for the acct times, which had a granularity of 1/64 second
until recently, so they were usually 0 and rarely showed the differences
of a few usec due to this race), but it would be cleaner not to do extra
work here.  (A sketch of the kind of per-thread aggregation involved is
below, after the kern_exit.c discussion.)

% Index: kern/kern_exit.c
% ===================================================================
% RCS file: /usr/home/ncvs/src/sys/kern/kern_exit.c,v
% retrieving revision 1.298
% diff -u -p -r1.298 kern_exit.c
% --- kern/kern_exit.c        14 May 2007 22:21:58 -0000        1.298
% +++ kern/kern_exit.c        30 May 2007 00:10:21 -0000
% @@ -169,7 +170,8 @@ retry:
%          * Threading support has been turned off.
%          */
%         }
% -
% +       KASSERT(p->p_numthreads == 1,
% +           ("exit1: proc %p exiting with %d threads", p, p->p_numthreads));
%         /*
%          * Wakeup anyone in procfs' PIOCWAIT. They should have a hold
%          * on our vmspace, so we should block below until they have

Lost some formatting.

% @@ -438,15 +442,17 @@ retry:
%                 PROC_UNLOCK(q);
%         }
%
% -       /*
% -        * Save exit status and finalize rusage info except for times,
% -        * adding in child rusage info later when our time is locked.
% -        */
% +       /* Save exit status. */
%         PROC_LOCK(p);
%         p->p_xstat = rv;
%         p->p_xthread = td;
% -       p->p_stats->p_ru.ru_nvcsw++;
% -       *p->p_ru = p->p_stats->p_ru;
% +       /*
% +        * All statistics have been aggregated into the final td_ru by
% +        * thread_exit(), just copy it into p_ru here.
% +        */
% +       *ru = td->td_ru;
% +       ru->ru_nvcsw++;
% +       p->p_ru = ru;
%
%         /*
%          * Notify interested parties of our demise.

Still have various bugs here, starting with the grammar error (comma
splice) in the comment.  I now understand what you mean by thread_exit()
having aggregated the stats -- thread_exit() hasn't been called yet for
this thread, but it has been called for other threads in the process, if
any.  So td_ru has aggregated stats, and accessing it directly just gives
the old races for the unthreaded case:
- td_ru is locked by sched_lock, but there is no sched_locking here.
- any further updates of td_ru except for the times are not copied into
  p_ru, unless they occur during the above copy, in which case they
  corrupt the copy.
These bugs can probably be fixed by moving the copying to the final
thread_exit().  Maybe acct_process() can be moved there too, but I
suspect too much of the process has gone away by then for acct_process()
to work.
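To make that concrete, here is a minimal, untested sketch of a helper
that the last thread_exit() could call with sched_lock held to do the
final copy.  The helper name, and the assumption that exit1() has
already pointed p_ru at the preallocated buffer before the last thread
exits, are mine, not the patch's:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: finalize p_ru from the last exiting thread.
 * Assumes sched_lock is held (as it is in thread_exit()) and that
 * exit1() has already stored the preallocated buffer in p->p_ru.
 */
static void
rufinalize_sketch(struct thread *td)
{
        struct proc *p;

        p = td->td_proc;
        mtx_assert(&sched_lock, MA_OWNED);
        KASSERT(p->p_numthreads == 1,
            ("rufinalize_sketch: %d threads", p->p_numthreads));
        /* td_ru has already accumulated the stats of the exited threads. */
        *p->p_ru = td->td_ru;
        p->p_ru->ru_nvcsw++;
}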
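Going back to the acct_process() point above: a minimal sketch of the
kind of per-thread aggregation that rufetch() has to do (not the patch's
actual implementation; same headers as above, only a few of the ru_*
fields are shown, td_ru is assumed to be locked by sched_lock, and the
caller is assumed to hold the proc lock so that the thread list is
stable):

static void
rufetch_sketch(struct proc *p, struct rusage *ru)
{
        struct thread *td;

        bzero(ru, sizeof(*ru));
        mtx_lock_spin(&sched_lock);
        FOREACH_THREAD_IN_PROC(p, td) {
                /* Only a few fields shown; the others sum the same way. */
                ru->ru_nvcsw += td->td_ru.ru_nvcsw;
                ru->ru_nivcsw += td->td_ru.ru_nivcsw;
                ru->ru_inblock += td->td_ru.ru_inblock;
                ru->ru_oublock += td->td_ru.ru_oublock;
                ru->ru_ixrss += td->td_ru.ru_ixrss;
                ru->ru_idrss += td->td_ru.ru_idrss;
                ru->ru_isrss += td->td_ru.ru_isrss;
        }
        mtx_unlock_spin(&sched_lock);
}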
% Index: kern/kern_resource.c
% ===================================================================
% RCS file: /usr/home/ncvs/src/sys/kern/kern_resource.c,v
% retrieving revision 1.171
% diff -u -p -r1.171 kern_resource.c
% --- kern/kern_resource.c        27 May 2007 20:50:23 -0000        1.171
% +++ kern/kern_resource.c        30 May 2007 11:52:28 -0000
% @@ -619,6 +619,38 @@ setrlimit(td, uap)
%         return (error);
% }
%
% +static void
% +lim_cb(void *arg)
% +{
% +       struct rlimit rlim;
% +       struct thread *td;
% +       struct proc *p;
% +
% +       p = arg;
% ...
% +       callout_reset(&p->p_limco, hz, lim_cb, p);
% +}
% +

A better cleanup than implicitly casting p to "void *" here: use the
original "void *" argument `arg' (see the sketch at the end of this
message).

% Index: sys/proc.h
% ===================================================================
% RCS file: /usr/home/ncvs/src/sys/sys/proc.h,v
% retrieving revision 1.477
% diff -u -p -r1.477 proc.h
% --- sys/proc.h        24 Apr 2007 10:59:21 -0000        1.477
% +++ sys/proc.h        30 May 2007 11:41:53 -0000
% ...
% @@ -572,7 +576,7 @@ struct proc {
%         struct mdproc p_md;             /* Any machine-dependent fields. */
%         struct callout p_itcallout;     /* (h + c) Interval timer callout. */
%         u_short p_acflag;               /* (c) Accounting flags. */
% -       struct rusage *p_ru;            /* (a) Exit information. XXX */
% +       struct rusage *p_ru;            /* (a) Exit information. */

The locking annotation doesn't work well for pointers.  Locking for the
contents is more interesting than locking for the pointer.  The locking
annotation for struct rusage says mostly (j) (sched_lock), but that only
applies to td_ru -- it doesn't apply here and applies even less to
rusages in userland.  resource.h is supposed to be a userland header.

% Index: sys/resource.h
% ===================================================================
% RCS file: /usr/home/ncvs/src/sys/sys/resource.h,v
% retrieving revision 1.30
% diff -u -p -r1.30 resource.h
% --- sys/resource.h        16 Nov 2005 18:18:52 -0000        1.30
% +++ sys/resource.h        29 May 2007 20:29:22 -0000
% @@ -50,33 +50,31 @@
% /*
%  * Resource utilization information.
%  *
% - * Locking key:
% - *      c - locked by proc mtx
% - *      j - locked by sched_lock mtx
% - *      n - not locked, lazy
% + * All fields are only modified by curthread and
% + * no locks are required to read.
%  */

Oops, I now see that you removed all the (j)s and sched locking here.
But this isn't quite right -- many of the fields are accessed by
statclock(), so something like interrupt atomicity or splhigh() is
required to access them.  The above (mainly in exit1()) gives much the
same races as in RELENG_4:
- RELENG_4: statclock() uses splhigh() but not interrupt atomicity.
  exit1() uses no locking and thus races with statclock().
- above: statclock() still uses sched_lock but not interrupt atomicity.
  exit1() uses no locking and thus races with statclock().  Time fields
  are mostly in rux and still fully locked by sched_lock.  exit1()
  copies some of them to p_ru, but that copy is not used.

I think proc locking is still used for p_ru -- it is used in kern_wait(),
where it seems to be necessary to prevent multiple threads in the parent
process racing to reap the child.

Bruce
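A sketch of the lim_cb() cleanup mentioned above (the body of the
function is elided, as in the quoted diff; only the reschedule changes):

static void
lim_cb(void *arg)
{
        struct proc *p;

        p = arg;
        /* ... CPU limit checking elided, as in the quoted diff ... */
        /* Reschedule using the original "void *" argument. */
        callout_reset(&p->p_limco, hz, lim_cb, arg);
}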