From owner-freebsd-arch  Sun Nov 11  7:32: 8 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP id C19DD37B41F
	for <freebsd-arch@FreeBSD.org>; Sun, 11 Nov 2001 07:32:03 -0800 (PST)
Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.6/8.11.5) with SMTP id fABFVsB11812
	for <freebsd-arch@FreeBSD.org>; Sun, 11 Nov 2001 10:31:55 -0500 (EST)
	(envelope-from robert@fledge.watson.org)
Date: Sun, 11 Nov 2001 10:31:54 -0500 (EST)
From: Robert Watson <rwatson@FreeBSD.org>
X-Sender: robert@fledge.watson.org
To: freebsd-arch@FreeBSD.org
Subject: cur{thread/proc}, or not.
Message-ID: <Pine.NEB.3.96L.1011111101234.11566A-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


Every now and then, we get to discuss curproc, and its merits.  Let's do
it again.

There are a number of uses of curproc in the netinet code, used to
retrieve credentials for authorization somewhere down the stack, when no
proc or thread pointer has been passed down.  With the eventual addition
of td->td_ucred, it will be desirable to use the credential for the
current thread, rather than the proc, which will require locking to use. 
(This is, incidentally, true of many places in the system).  As I
understand it, use of curproc was branded 'undesirable' at some point in
the semi-distant past, and since that time, a reference to 'proc' has been
passed down the stack.  With a change to KSE, this has been translated to
references to the thread, but the issue remains the same.  This comes up in
particular because I have a tree where I have propagated the thread
pointer down if_ioctl in the network stack: the normal ioctl call carries
a thread pointer now, but when it is translated into if_ioctl by the
network stack, that pointer is lost.  This raises the question: should we
(in practice) be adding process or thread pointers to many more of the
function arguments, or should we switch to using curproc/curthread
instead?

The argument I've seen a couple of times for using the proc/thread pointer
is that of delegation: a kernel thread might be acting on behalf of
another process, and need a reference to the process so that it can use
its (file descriptors, credential, address space, ...).  I suspect that,
in practice, this is a Bad Idea, given the increased complexity of
fine-grained threading/locking and SMPng.  "borrowing" references in such
an environment seems like a recipe for bugginess, and instead such
references should be "given" by the thread that obeys the
locking/reference counting, and should not be done at the level of the
proc.  For example, for a credential, you would simply grab another
reference to the credential and pass off the reference, rather than
sharing a reference.  In fact, it seems that in a lot of places where a
struct proc is passed in, the implicit assumption of the code is that this
is the "current process", and as we add more process-related locking, that
assumption will probably only grow stronger, so as to not raise lock order
issues. 
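
A minimal sketch of the "give a reference, don't borrow one" hand-off
described above, written as a user-space model rather than kernel code.
All names here (cred_model, cred_hold, cred_free, worker_run) are
hypothetical, loosely patterned on the kernel's crhold()/crfree(); a
real implementation would need atomic or locked reference counts.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space model of a reference-counted credential. */
struct cred_model {
	int	cr_ref;		/* number of outstanding references */
	int	cr_uid;		/* effective user id */
};

/* Allocate a credential; the caller owns the single initial reference. */
static struct cred_model *
cred_alloc(int uid)
{
	struct cred_model *cr = malloc(sizeof(*cr));

	if (cr == NULL)
		abort();
	cr->cr_ref = 1;
	cr->cr_uid = uid;
	return (cr);
}

/* Take one additional reference, to be handed to another party. */
static struct cred_model *
cred_hold(struct cred_model *cr)
{
	assert(cr->cr_ref > 0);
	cr->cr_ref++;
	return (cr);
}

/* Drop one reference; free on the last.  Returns the remaining count. */
static int
cred_free(struct cred_model *cr)
{
	int left = --cr->cr_ref;

	if (left == 0)
		free(cr);
	return (left);
}

/*
 * The worker is *given* a reference: it owns it and drops it when done,
 * so it never depends on the lifetime of the originating process.
 */
static int
worker_run(struct cred_model *given)
{
	int uid = given->cr_uid;

	(void)cred_free(given);
	return (uid);
}
```

The originator would call worker_run(cred_hold(mine)) and keep using
"mine" afterwards; neither side needs a proc pointer to find the
credential.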

I don't pretend to have a grasp of all the issues here, so the purpose of
this message is to raise the issues so that I can understand them.  I have
a tree where I've eliminated many references to curproc; however, I'm now
wondering if it wouldn't simply be more useful to eliminate many of the
references to struct proc in the function arguments, and use curproc
instead, and add references to ucred (and related ref-counted structures)
as needed for delegation types of situations.  In particular, that would
suggest the following changes: 

(1) 'suser' would always use 'curthread', and lose its proc/thread
    argument (proc in the main tree, thread in my tree).  'suser_cred'
    would be used for delegation situations (as is the case in my tree). 

    (Note that this remains incompatible with other platforms, which
    generally accept a cred argument for 'suser', including other *BSD and
    Solaris.) 

(2) proc/thread arguments would (in general) be removed (gradually) from
    the arguments of many existing kernel functions, and
    'curproc'/'curthread' would be used instead.  For example, in the
    'VOP_*' interface, use of the 'p' or 'td' entries would be abandoned,
    and 'cred' would be more widely passed down (such as into open). 

    (Note that this is the path taken by a number of other fine-grained
    UNIX kernels, including Solaris, IRIX, et al). 

(3) Use of 'curproc' would be removed in a number of places, where
    abstracted functions such as 'suser' would invoke curthread instead.
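
Change (1) could be sketched roughly as follows.  This is a user-space
model under stated assumptions, not the kernel's actual suser()
implementation: curthread_model stands in for the per-CPU curthread,
and the *_model names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct cred_model {
	int	cr_uid;			/* effective user id */
};

struct thread_model {
	struct cred_model *td_ucred;	/* per-thread credential */
};

/* Hypothetical stand-in for the kernel's per-CPU curthread. */
static struct thread_model *curthread_model;

/* Delegation form: the caller supplies the credential explicitly. */
static int
suser_cred_model(const struct cred_model *cr)
{
	return (cr->cr_uid == 0 ? 0 : EPERM);
}

/* Common form: no proc/thread argument, always the current thread. */
static int
suser_model(void)
{
	assert(curthread_model != NULL);
	return (suser_cred_model(curthread_model->td_ucred));
}
```

Callers deep in a stack such as netinet would then call suser_model()
with no thread pointer plumbed through, while code acting on another
party's behalf would pass the credential it was given to
suser_cred_model().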

It seems to me that unless a very strong argument exists against using
curproc/curthread (and I don't preclude one existing), using them would
actually be an improvement, as it would assert that this class of
'borrowing' couldn't exist, simplifying the kernel, not to mention
squeezing a bit more stuff out of the stack (which, at ten levels deep,
actually begins to add up on 64-bit machines).  I believe that there are
many places where the 'p' passed in is implicitly assumed to be the
current process, and that making that reliance explicit would be an
improvement, rather than a problem. 

Flames appreciated. 

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Sun Nov 11 10:40:16 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP
	id 0D24D37B405; Sun, 11 Nov 2001 10:40:09 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id KAA89779;
	Sun, 11 Nov 2001 10:23:07 -0800 (PST)
Date: Sun, 11 Nov 2001 10:23:06 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: Robert Watson <rwatson@FreeBSD.org>
Cc: freebsd-arch@FreeBSD.org
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <Pine.NEB.3.96L.1011111101234.11566A-100000@fledge.watson.org>
Message-ID: <Pine.BSF.4.21.0111110957150.89663-100000@InterJet.elischer.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On Sun, 11 Nov 2001, Robert Watson wrote:

> 
> Every now and then, we get to discuss curproc, and its merits.  Let's do
> it again.
> 
> There are a number of uses of curproc in the netinet code, used to
> retrieve credentials for authorization somewhere down the stack, when no
> proc or thread pointer has been passed down.  With the eventual addition
> of td->td_ucred, it will be desirable to use the credential for the
> current thread, rather than the proc, which will require locking to use. 
> (This is, incidentally, true of many places in the system).  As I
> understand it, use of curproc was branded 'undesirable' at some point in
> the semi-distant past, and since that time, a reference to 'proc' has been
> passed down the stack.  With a change to KSE, this has been translated to
> references to the thread, but the issue remains the same.  This comes up in
> particular because I have a tree where I have propagated the thread
> pointer down if_ioctl in the network stack: the normal ioctl call carries
> a thread pointer now, but when it is translated into if_ioctl by the
> network stack, that pointer is lost.  This raises the question: should we
> (in practice) be adding process or thread pointers to many more of the
> function arguments, or should we switch to using curproc/curthread
> instead?

I think we should, though there are some cases where it is not clear that
there is always a thread to add other than the base thread of the idle
process. Also, since thread structures in the kernel are only assigned to
a process to do work for the duration of the particular system call
that they are performing, no thread pointer should be stored somewhere
where it may be referenced after the syscall has returned to userland.
In that case the best you can do is a proc pointer.

Also, in SMPng, cur{thread,proc} takes some time to fetch, as I'm told that
dereferencing %fs is very slow.  (Not sure how true that is.)

> 
> The argument I've seen a couple of times for using the proc/thread pointer
> is that of delegation: a kernel thread might be acting on behalf of
> another process, and need a reference to the process so that it can use
> its (file descriptors, credential, address space, ...). 

I was worried about this case while doing the KSE switchover,
but I never actually saw a case where it was obviously doing this...
(though I have my suspicions that it may still happen in some
non-obvious places).

> I suspect that,
> in practice, this is a Bad Idea, given the increased complexity of
> fine-grained threading/locking and SMPng.  "borrowing" references in such
> an environment seems like a recipe for bugginess, and instead such
> references should be "given" by the thread that obeys the
> locking/reference counting, and should not be done at the level of the
> proc.  For example, for a credential, you would simply grab another
> reference to the credential and pass off the reference, rather than
> sharing a reference.  In fact, it seems that in a lot of places where a
> struct proc is passed in, the implicit assumption of the code is that this
> is the "current process", and as we add more process-related locking, that
> assumption will probably only grow stronger, so as to not raise lock order
> issues. 

There are other reasons for needing the pointer than for a credential.
For example in AIO, the process pointer is stored so that
address space can be loaned to the aio threads to do the IO.

> 
> I don't pretend to have a grasp of all the issues here, so the purpose of
> this message is to raise the issues so that I can understand them.  I have
> a tree where I've eliminated many references to curproc; however, I'm now
> wondering if it wouldn't simply be more useful to eliminate many of the
> references to struct proc in the function arguments, and use curproc
> instead, and add references to ucred (and related ref-counted structures)
> as needed for delegation types of situations.  In particular, that would
> suggest the following changes: 

I have thought about this both ways; both have advantages.  In some
architectures, getting curthread might be very expensive.
Removing the proc pointers would take us back to where we were before
4.4BSD (anyone know if Kirk is on this list?).

> 
> (1) 'suser' would always use 'curthread', and lose its proc/thread
>     argument (proc in the main tree, thread in my tree).  'suser_cred'
>     would be used for delegation situations (as is the case in my tree). 
> 
>     (Note that this remains incompatible with other platforms, which
>     generally accept a cred argument for 'suser', including other *BSD and
>     Solaris.) 
> 
> (2) proc/thread arguments would (in general) be removed (gradually) from
>     the arguments of many existing kernel functions, and
>     'curproc'/'curthread' would be used instead.  For example, in the
>     'VOP_*' interface, use of the 'p' or 'td' entries would be abandoned,
>     and 'cred' would be more widely passed down (such as into open). 
> 
>     (Note that this is the path taken by a number of other fine-grained
>     UNIX kernels, including Solaris, IRIX, et al). 
> 
> (3) Use of 'curproc' would be removed in a number of places, where
>     abstracted functions such as 'suser' would invoke curthread instead.

I believe it was an early move to start to prepare for some sort of SMP
work, where they couldn't think of an architecture-neutral way of getting
'curthread' that was guaranteed to be efficient everywhere.

> 
> It seems to me that unless a very strong argument exists against using
> curproc/curthread (and I don't preclude one existing), using them would
> actually be an improvement, as it would assert that this class of
> 'borrowing' couldn't exist, simplifying the kernel, not to mention
> squeezing a bit more stuff out of the stack (which, at ten levels deep,
> actually begins to add up on 64-bit machines).  I believe that there are
> many places where the 'p' passed in is implicitly assumed to be the
> current process, and that making that reliance explicit would be an
> improvement, rather than a problem. 
> 
> Flames appreciated. 

I think you'll get few flames,
but probably a lot of silence from many people.






From owner-freebsd-arch  Sun Nov 11 11:17:47 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from peter3.wemm.org (c1315225-a.plstn1.sfba.home.com [24.14.150.180])
	by hub.freebsd.org (Postfix) with ESMTP
	id 7DD8537B418; Sun, 11 Nov 2001 11:17:35 -0800 (PST)
Received: from overcee.netplex.com.au (overcee.wemm.org [10.0.0.3])
	by peter3.wemm.org (8.11.0/8.11.0) with ESMTP id fABJHZM10690;
	Sun, 11 Nov 2001 11:17:35 -0800 (PST)
	(envelope-from peter@wemm.org)
Received: from wemm.org (localhost [127.0.0.1])
	by overcee.netplex.com.au (Postfix) with ESMTP
	id 00D053807; Sun, 11 Nov 2001 11:17:34 -0800 (PST)
	(envelope-from peter@wemm.org)
X-Mailer: exmh version 2.5 07/13/2001 with nmh-1.0.4
To: Robert Watson <rwatson@FreeBSD.ORG>
Cc: freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not. 
In-Reply-To: <Pine.NEB.3.96L.1011111101234.11566A-100000@fledge.watson.org> 
Date: Sun, 11 Nov 2001 11:17:34 -0800
From: Peter Wemm <peter@wemm.org>
Message-Id: <20011111191735.00D053807@overcee.netplex.com.au>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Robert Watson wrote:

> It seems to me that unless a very strong argument exists against using
> curproc/curthread (and I don't preclude one existing), using them would
> actually be an improvement, as it would assert that this class of
> 'borrowing' couldn't exist, simplifying the kernel, not to mention
> squeezing a bit more stuff out of the stack (which, at ten levels deep,
> actually begins to add up on 64-bit machines).  I believe that there are
> many places where the 'p' passed in is implicitly assumed to be the
> current process, and that making that reliance explicit would be an
> improvement, rather than a problem. 

My gripe is that on i386, it creates a LOT of work for the compiler.

Consider this small function in kern_kthread.c:

void
kthread_exit(int ecode)
{

        sx_xlock(&proctree_lock);
        PROC_LOCK(curproc);
        proc_reparent(curproc, initproc);
        PROC_UNLOCK(curproc);
        sx_xunlock(&proctree_lock);
        exit1(curthread, W_EXITCODE(ecode, 0));
}

Have a look at http://people.freebsd.org/~peter/macros.c, where I've cpp'ed
it and indented it for readability.  Anyway, kthread_exit() turns into
this for the compiler to choke on:

void
kthread_exit(int ecode)
{

	_sx_xlock((&proctree_lock), 0, 0);
	do {
		do {
			if (!atomic_cmpset_ptr(&(((((&((({
				__typeof(((struct globaldata *) 0)->gd_curthread) __result;
				if (sizeof(__result) == 1) {
					u_char          __b;
					__asm           volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b;
				} else if (sizeof(__result) == 2) {
					u_short         __w;
					__asm           volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w;
				} else if (sizeof(__result) == 4) {
					u_int           __i;
					__asm           volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i;
				} else {
					__result = *({
						__typeof(((struct globaldata *) 0)->gd_curthread) * __p;
						__asm           volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread))));
						__p;
					});
				} __result;
			})->td_proc))->p_mtx)))))->mtx_lock, (void *)0x00000004, ((({
				__typeof(((struct globaldata *) 0)->gd_curthread) __result;
				if (sizeof(__result) == 1) {
					u_char          __b;
					__asm           volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b;
				} else if (sizeof(__result) == 2) {
					u_short         __w;
					__asm           volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w;
				} else if (sizeof(__result) == 4) {
					u_int           __i;
					__asm           volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i;
				} else {
					__result = *({
						__typeof(((struct globaldata *) 0)->gd_curthread) * __p;
						__asm           volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread))));
						__p;
					});
				} __result;
			}))))) _mtx_lock_sleep(((((&((({
				__typeof(((struct globaldata *) 0)->gd_curthread) __result;
				if (sizeof(__result) == 1) {
					u_char          __b;
					__asm           volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b;
				} else if (sizeof(__result) == 2) {
					u_short         __w;
					__asm           volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w;
				} else if (sizeof(__result) == 4) {
					u_int           __i;
					__asm           volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i;
				} else {
					__result = *({
						__typeof(((struct globaldata *) 0)->gd_curthread) * __p;
						__asm           volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread))));
						__p;
					});
				} __result;
			})->td_proc))->p_mtx)))), (((0))), ((0)), ((0)));
		} while (0);
		do {
			if ((((((0))) & 0x00000002) == 0 && (((&(((&((({
				__typeof(((struct globaldata *) 0)->gd_curthread) __result;
				if (sizeof(__result) == 1) {
					u_char          __b;
					__asm           volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b;
				} else if (sizeof(__result) == 2) {
					u_short         __w;
					__asm           volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w;
				} else if (sizeof(__result) == 4) {
					u_int           __i;
					__asm           volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i;
				} else {
					__result = *({
						__typeof(((struct globaldata *) 0)->gd_curthread) * __p;
						__asm           volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread))));
						__p;
					});
				} __result;
			})->td_proc))->p_mtx)))->mtx_object))->lo_flags & 0x00040000) == 0));
		} while (0);
	} while (0);
	proc_reparent((({
		__typeof(((struct globaldata *) 0)->gd_curthread) __result;
		if (sizeof(__result) == 1) {
			u_char          __b;
			__asm           volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
			__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b;
		} else if (sizeof(__result) == 2) {
			u_short         __w;
			__asm           volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
			__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w;
		} else if (sizeof(__result) == 4) {
			u_int           __i;
			__asm           volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
			__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i;
		} else {
			__result = *({
				__typeof(((struct globaldata *) 0)->gd_curthread) * __p;
				__asm           volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread))));
				__p;
			});
		} __result;
	})->td_proc), initproc);
	do {
		do {
			if (((((((0)))) & 0x00000002) == 0 && (((&(((&((({
				__typeof(((struct globaldata *) 0)->gd_curthread) __result;
				if (sizeof(__result) == 1) {
					u_char          __b;
					__asm           volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b;
				} else if (sizeof(__result) == 2) {
					u_short         __w;
					__asm           volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w;
				} else if (sizeof(__result) == 4) {
					u_int           __i;
					__asm           volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i;
				} else {
					__result = *({
						__typeof(((struct globaldata *) 0)->gd_curthread) * __p;
						__asm           volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread))));
						__p;
					});
				} __result;
			})->td_proc))->p_mtx)))->mtx_object))->lo_flags & 0x00040000) == 0));
		} while (0);
		do {
			if (!atomic_cmpset_ptr(&(((((&((({
				__typeof(((struct globaldata *) 0)->gd_curthread) __result;
				if (sizeof(__result) == 1) {
					u_char          __b;
					__asm           volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b;
				} else if (sizeof(__result) == 2) {
					u_short         __w;
					__asm           volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w;
				} else if (sizeof(__result) == 4) {
					u_int           __i;
					__asm           volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i;
				} else {
					__result = *({
						__typeof(((struct globaldata *) 0)->gd_curthread) * __p;
						__asm           volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread))));
						__p;
					});
				} __result;
			})->td_proc))->p_mtx)))))->mtx_lock, ((({
				__typeof(((struct globaldata *) 0)->gd_curthread) __result;
				if (sizeof(__result) == 1) {
					u_char          __b;
					__asm           volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b;
				} else if (sizeof(__result) == 2) {
					u_short         __w;
					__asm           volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w;
				} else if (sizeof(__result) == 4) {
					u_int           __i;
					__asm           volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i;
				} else {
					__result = *({
						__typeof(((struct globaldata *) 0)->gd_curthread) * __p;
						__asm           volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread))));
						__p;
					});
				} __result;
			}))), (void *)0x00000004)) _mtx_unlock_sleep(((((&((({
				__typeof(((struct globaldata *) 0)->gd_curthread) __result;
				if (sizeof(__result) == 1) {
					u_char          __b;
					__asm           volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b;
				} else if (sizeof(__result) == 2) {
					u_short         __w;
					__asm           volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w;
				} else if (sizeof(__result) == 4) {
					u_int           __i;
					__asm           volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
					__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i;
				} else {
					__result = *({
						__typeof(((struct globaldata *) 0)->gd_curthread) * __p;
						__asm           volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread))));
						__p;
					});
				} __result;
			})->td_proc))->p_mtx)))), (((0))), ((0)), ((0)));
		} while (0);
	} while (0);
	_sx_xunlock((&proctree_lock), 0, 0);
	exit1(({
		__typeof(((struct globaldata *) 0)->gd_curthread) __result;
		if (sizeof(__result) == 1) {
			u_char          __b;
			__asm           volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
			__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b;
		} else if (sizeof(__result) == 2) {
			u_short         __w;
			__asm           volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
			__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w;
		} else if (sizeof(__result) == 4) {
			u_int           __i;
			__asm           volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread)))));
			__result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i;
		} else {
			__result = *({
				__typeof(((struct globaldata *) 0)->gd_curthread) * __p;
				__asm           volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread))));
				__p;
			});
		} __result;
	}), ((ecode) << 8 | (0)));
}

Ever wonder why the kernel gets slower and slower to compile?  Ever
compiled a 2.1 or 2.2 kernel on a modern machine and been blown away by
the speed?

Count me in the 'curproc considered harmful' camp.  (or curthread).

Yes, this doesn't end up as a lot of code in the end, but the compiler
still has to digest it and the optimizer has got to do a sh!tload of work
to eliminate massive quantities of unused code.   Just imagine what
happens without -O.
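
A minimal user-space sketch, not kernel code, of why the expansion
above grows so quickly: every textual mention of the accessor inside a
macro argument is expanded again.  The names (CURTHREAD, CURPROC,
PROC_LOCK_MODEL, curthread_loads) are hypothetical, and the counter
only models how often the accessor appears after preprocessing.
Caching the value in a local is shown purely as an illustration,
assuming the value cannot change for the duration of the function.

```c
#include <assert.h>

struct proc_model {
	int	p_lock;			/* 0 = unlocked, 1 = locked */
};

struct thread_model {
	struct proc_model *td_proc;
};

static struct thread_model *pcpu_curthread;	/* models %fs:gd_curthread */
static int curthread_loads;			/* accessor evaluations */

#define	CURTHREAD()	(curthread_loads++, pcpu_curthread)
#define	CURPROC()	(CURTHREAD()->td_proc)

/* Like a real lock macro, each of these mentions its argument twice. */
#define	PROC_LOCK_MODEL(p) do {						\
	if ((p)->p_lock == 0)						\
		(p)->p_lock = 1;					\
} while (0)
#define	PROC_UNLOCK_MODEL(p) do {					\
	if ((p)->p_lock == 1)						\
		(p)->p_lock = 0;					\
} while (0)

/* Accessor passed straight into the macros: expanded at every mention. */
static int
exit_direct(void)
{
	curthread_loads = 0;
	PROC_LOCK_MODEL(CURPROC());
	PROC_UNLOCK_MODEL(CURPROC());
	return (curthread_loads);
}

/* Accessor evaluated once into a local, as a passed-in 'p' would be. */
static int
exit_cached(void)
{
	struct proc_model *p;

	curthread_loads = 0;
	p = CURPROC();
	PROC_LOCK_MODEL(p);
	PROC_UNLOCK_MODEL(p);
	return (curthread_loads);
}
```

With an explicit pointer (or a local copy) each macro mention costs a
variable name; with the accessor inline, each mention drags in the
whole accessor body again.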

Regarding 64 bit machines, all of our 64 bit platforms use register
passing, some with fixed size register frames.  On those, the difference
of saving one argument isn't going to add up to much, if anything.  And
it would still require an intermediate frame to hold the calculated value
of curproc/curthread where it's used.

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5




From owner-freebsd-arch  Sun Nov 11 14:28:12 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163])
	by hub.freebsd.org (Postfix) with ESMTP
	id 4A16737B419; Sun, 11 Nov 2001 14:28:10 -0800 (PST)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.11.6/8.11.6) with ESMTP id fABMRCL01972;
	Sun, 11 Nov 2001 23:27:17 +0100 (CET)
	(envelope-from phk@critter.freebsd.dk)
To: Peter Wemm <peter@wemm.org>
Cc: Robert Watson <rwatson@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not. 
In-Reply-To: Your message of "Sun, 11 Nov 2001 11:17:34 PST."
             <20011111191735.00D053807@overcee.netplex.com.au> 
Date: Sun, 11 Nov 2001 23:27:12 +0100
Message-ID: <1970.1005517632@critter.freebsd.dk>
From: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

In message <20011111191735.00D053807@overcee.netplex.com.au>, Peter Wemm writes:

> [ass'y output of gcc]
>
>Ever wonder why the kernel gets slower and slower to compile?  Ever
>compiled a 2.1 or 2.2 kernel on a modern machine and been shocked away by
>the speed?
>
>Count me in the 'curproc considered harmful' camp.  (or curthread).

Peter's example more than clinches the argument for me, but I also
wonder if we would not paint ourselves into a corner with the
cur{proc|thread} stuff if the future ends up being more parallel
and cluster-oriented.

<VOICE MODE=GODFATHER">
Roberto!  come over here!

Do you zink zese Curproc and Curthread they will get losst on zeir way home ?

Good boy.

</VOICE>

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.



From owner-freebsd-arch  Sun Nov 11 14:49:26 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from peter3.wemm.org (c1315225-a.plstn1.sfba.home.com [24.14.150.180])
	by hub.freebsd.org (Postfix) with ESMTP
	id EB3C237B41A; Sun, 11 Nov 2001 14:49:20 -0800 (PST)
Received: from overcee.netplex.com.au (overcee.wemm.org [10.0.0.3])
	by peter3.wemm.org (8.11.0/8.11.0) with ESMTP id fABMnKM11180;
	Sun, 11 Nov 2001 14:49:20 -0800 (PST)
	(envelope-from peter@wemm.org)
Received: from wemm.org (localhost [127.0.0.1])
	by overcee.netplex.com.au (Postfix) with ESMTP
	id B24623807; Sun, 11 Nov 2001 14:49:19 -0800 (PST)
	(envelope-from peter@wemm.org)
X-Mailer: exmh version 2.5 07/13/2001 with nmh-1.0.4
To: Poul-Henning Kamp <phk@critter.freebsd.dk>
Cc: Robert Watson <rwatson@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not. 
In-Reply-To: <1970.1005517632@critter.freebsd.dk> 
Date: Sun, 11 Nov 2001 14:49:19 -0800
From: Peter Wemm <peter@wemm.org>
Message-Id: <20011111224919.B24623807@overcee.netplex.com.au>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Poul-Henning Kamp wrote:
> In message <20011111191735.00D053807@overcee.netplex.com.au>, Peter Wemm writes:
> 
> > [ass'y output of gcc]
> >
> >Ever wonder why the kernel gets slower and slower to compile?  Ever
> >compiled a 2.1 or 2.2 kernel on a modern machine and been shocked away by
> >the speed?
> >
> >Count me in the 'curproc considered harmful' camp.  (or curthread).
> 
> Peter's example more than clinches the argument for me, but I also
> wonder if we would not paint ourselves into a corner with the
> cur{proc|thread} stuff if the future ends up being more parallel
> and cluster-oriented.

I believe it would be a lot easier to remove the p/td arguments later
once we know that we don't need them, than to remove them now and discover
later that we do need them and have to go back and figure it all out again.

To answer Robert.. By all means be explicit about creds etc, but let's not
get two different bikesheds^H^H^H^H^H^Hchanges mixed up together.

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5




From owner-freebsd-arch  Sun Nov 11 14:53: 5 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP id D8C3B37B41A
	for <freebsd-arch@FreeBSD.ORG>; Sun, 11 Nov 2001 14:52:58 -0800 (PST)
Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.6/8.11.5) with SMTP id fABMqfB16719;
	Sun, 11 Nov 2001 17:52:41 -0500 (EST)
	(envelope-from robert@fledge.watson.org)
Date: Sun, 11 Nov 2001 17:52:40 -0500 (EST)
From: Robert Watson <rwatson@FreeBSD.ORG>
X-Sender: robert@fledge.watson.org
To: Peter Wemm <peter@wemm.org>
Cc: Poul-Henning Kamp <phk@critter.freebsd.dk>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not. 
In-Reply-To: <20011111224919.B24623807@overcee.netplex.com.au>
Message-ID: <Pine.NEB.3.96L.1011111175017.16646B-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On Sun, 11 Nov 2001, Peter Wemm wrote:

> I believe it would be a lot easier to remove the p/td arguments later
> once we know that we don't need them, than to remove them now and
> discover later that we do need them and have to go back and figure it
> all out again. 
> 
> To answer Robert.. By all means be explicit about creds etc, but let's
> not get two different bikesheds^H^H^H^H^H^Hchanges mixed up together. 

Well, my concern was really whether or not I should go ahead and commit
the if_ioctl changes to add a td argument, which scatter new thread
references all over the place, when adopting a 'curthread' philosophy
would make that a waste of time.  I'll post the patches, once I've merged
in some recent changes, on Monday.

To be honest, I don't really mind either way, I was just interested in
getting a sense of the arguments {for, against} moving to
curthread/curproc.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services




From owner-freebsd-arch  Sun Nov 11 17:50:58 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from beastie.mckusick.com (beastie.mckusick.com [209.31.233.184])
	by hub.freebsd.org (Postfix) with ESMTP
	id B122F37B416; Sun, 11 Nov 2001 17:50:53 -0800 (PST)
Received: from beastie.mckusick.com (localhost [127.0.0.1])
	by beastie.mckusick.com (8.11.4/8.9.3) with ESMTP id fABIFG336949;
	Sun, 11 Nov 2001 10:15:21 -0800 (PST)
	(envelope-from mckusick@beastie.mckusick.com)
Message-Id: <200111111815.fABIFG336949@beastie.mckusick.com>
To: Robert Watson <rwatson@FreeBSD.ORG>
Subject: Re: cur{thread/proc}, or not. 
Cc: freebsd-arch@FreeBSD.ORG
In-Reply-To: Your message of "Sun, 11 Nov 2001 10:31:54 EST."
             <Pine.NEB.3.96L.1011111101234.11566A-100000@fledge.watson.org> 
Date: Sun, 11 Nov 2001 10:15:16 -0800
From: Kirk McKusick <mckusick@mckusick.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Many years ago, I tried to get rid of all the references to
curproc in the filesystem code, and quickly came to the realization
that it would require adding a proc pointer to virtually every
subroutine in the filesystem code. For the reasons that you have
noted, this is ugly and adds bloat to the stack space. On the other
hand, there are places where the filesystem code does not want to
use the current process credential. One of the more evident ones
is in the NFS server code which wants to pass down the credential
of the requesting client rather than its own. Solaris uses a very
ugly hack where the server thread replaces its credential with that
of its client, does the VOP call, then puts its own credential back
when it returns. This sort of problem could exist in almost any
instance where the kernel is acting as a server. So, completely
removing process/credential references from the kernel interfaces
is not the right solution either.
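
To make the contrast concrete, here is a minimal sketch in plain C, not
kernel code.  Every name in it (vop_access, vop_access_implicit,
nfs_access_swap, and the cut-down struct ucred, struct thread and
struct vnode) is a hypothetical stand-in chosen for illustration, and
the ownership check is an assumed toy policy.  It only shows the shape
of the two styles: passing the credential explicitly versus borrowing
the client's credential on the running thread for the duration of one
call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced credential and thread structures. */
struct ucred { int cr_uid; };
struct thread { struct ucred *td_ucred; };

/* Hypothetical file object owned by uid v_owner. */
struct vnode { int v_owner; };

/* Explicit style: the caller states whose authority applies. */
static int
vop_access(const struct vnode *vp, const struct ucred *cred)
{
	assert(vp != NULL && cred != NULL);
	return (cred->cr_uid == 0 || cred->cr_uid == vp->v_owner);
}

/* Implicit style: the authority is whatever the running thread carries. */
static int
vop_access_implicit(const struct vnode *vp, const struct thread *td)
{
	return (vop_access(vp, td->td_ucred));
}

/* The swap approach: borrow the client's credential for one call. */
static int
nfs_access_swap(struct thread *td, struct ucred *client,
    const struct vnode *vp)
{
	struct ucred *saved = td->td_ucred;
	int ok;

	td->td_ucred = client;
	ok = vop_access_implicit(vp, td);
	td->td_ucred = saved;	/* must be restored on every return path */
	return (ok);
}
```

With the explicit form a server thread running as root can ask the
question on behalf of an unprivileged client without touching its own
state.  The swap form gives the same answer but depends on the
credential being put back correctly afterwards.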

	Kirk



From owner-freebsd-arch  Sun Nov 11 22:33:42 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16])
	by hub.freebsd.org (Postfix) with ESMTP
	id C236137B416; Sun, 11 Nov 2001 22:33:32 -0800 (PST)
Received: from bde.zeta.org.au (bde.zeta.org.au [203.2.228.102])
	by mailman.zeta.org.au (8.9.3/8.8.7) with ESMTP id RAA16949;
	Mon, 12 Nov 2001 17:33:22 +1100
Date: Mon, 12 Nov 2001 17:32:12 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender:  <bde@delplex.bde.org>
To: Peter Wemm <peter@wemm.org>
Cc: Robert Watson <rwatson@FreeBSD.ORG>, <freebsd-arch@FreeBSD.ORG>
Subject: Re: cur{thread/proc}, or not. 
In-Reply-To: <20011111191735.00D053807@overcee.netplex.com.au>
Message-ID: <20011112165530.B34657-100000@delplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

On Sun, 11 Nov 2001, Peter Wemm wrote:

> Robert Watson wrote:
>
> > It seems to me that unless a very strong argument exists against using
> > curproc/curthread (and I don't preclude one existing), using them would
> > actually be an improvement, as it would assert that this class of

> My gripe is that on i386, it creates a LOT of work for the compiler.

That's just an implementation detail for one arch.  I did strongly object
to the implementation, but...

> Consider this small function in kern_kthread.c:
> void
> kthread_exit(int ecode)
> {
>
>         sx_xlock(&proctree_lock);
>         PROC_LOCK(curproc);
>         proc_reparent(curproc, initproc);
>         PROC_UNLOCK(curproc);
>         sx_xunlock(&proctree_lock);
>         exit1(curthread, W_EXITCODE(ecode, 0));
> }
>
> Have a look at http://people.freebsd.org/~peter/macros.c  where I've cpp'ed
> it and indented it for readability.  Anyway, kthread_exit() turns into
> this for the compiler to choke on:

> [235 lines of bletcherous code deleted]

The corresponding code for RELENG_4 is:

Source:
---
void
kthread_exit(int ecode)
{
	proc_reparent(curproc, initproc);
	exit1(curproc, W_EXITCODE(ecode, 0));
}
---

Preprocessor output (!SMP case):
---
void
kthread_exit(int ecode)
{
	proc_reparent(curproc, initproc);
	exit1(curproc, (( ecode ) << 8 | (  0 )) );
}
---

Preprocessor output (SMP case):
---
void
kthread_exit(int ecode)
{
	proc_reparent(((  struct proc * )_global_curproc_nv())  , initproc);
	exit1(((  struct proc * )_global_curproc_nv())  , (( ecode ) << 8 | (  0 )) );
}
---

The preprocessor output didn't even need editing to look this nice.
_global_curproc_nv() is an inline function, so the compiler has more work
to do in the SMP case than might appear.  This function is:

	static __inline int _global_curproc_nv(void) {
		int val;
		__asm("movl %%fs:gd_curproc,%0" : "=r" (val));
		return (val);
	}

which is only about 10 times smaller than the corresponding code in
-current (it has one case instead of 4, and has a much simpler reference
to gd_curproc).

The size of the output in -current can be reduced by a factor of about
2 by copying curproc to a local variable.
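
A minimal sketch of the local-copy idea, in portable C rather than
kernel code.  The names curproc_lookup(), reparent_naive(),
reparent_cached() and the lookups counter are hypothetical, and the
function call merely stands in for whatever per-CPU access curproc
expands to.  The counter shows how many times that access is expanded
and executed in each version.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct proc. */
struct proc {
	int p_pid;
	int p_locks;	/* bumped and dropped below, illustration only */
};

static struct proc proc0 = { 1, 0 };
static int lookups;	/* how many times the curproc access ran */

/* Stand-in for the per-CPU access that curproc expands to. */
static struct proc *
curproc_lookup(void)
{
	lookups++;
	return (&proc0);
}
#define	curproc	(curproc_lookup())

/* Every use of curproc repeats the whole access. */
static int
reparent_naive(void)
{
	curproc->p_locks++;
	curproc->p_locks--;
	return (curproc->p_pid);
}

/* One access, then plain pointer uses. */
static int
reparent_cached(void)
{
	struct proc *p = curproc;

	assert(p != NULL);
	p->p_locks++;
	p->p_locks--;
	return (p->p_pid);
}
```

In the naive version the access appears three times in the preprocessed
source; in the cached version it appears once.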

> Ever wonder why the kernel gets slower and slower to compile?  Ever
> compiled a 2.1 or 2.2 kernel on a modern machine and been shocked away by
> the speed?

Better yet, compile a 2.1 or 2.2 kernel under 2.1 or 2.2 and get about 25%
more speed (mostly from not having pessimizations in gcc).

> Count me in the 'curproc considered harmful' camp.  (or curthread).

Count me outside of it.

> Regarding 64 bit machines, all of our 64 bit platforms use register
> passing, some with fixed size register frames.  On those, the difference
> of saving one argument isn't going to add up to much, if anything.  And
> it would still require an intermediate frame to hold the calculated value
> of curproc/curthread where its used.

Passing the pointer down through 20 subroutines (some of which don't
even use it except to pass it along) may add up to much.

Bruce




From owner-freebsd-arch  Mon Nov 12  2: 9:36 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 3839E37B417; Mon, 12 Nov 2001 02:09:33 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fACA9SI75024;
	Mon, 12 Nov 2001 02:09:28 -0800 (PST)
	(envelope-from dillon)
Date: Mon, 12 Nov 2001 02:09:28 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111121009.fACA9SI75024@apollo.backplane.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: Peter Wemm <peter@wemm.org>, Robert Watson <rwatson@FreeBSD.ORG>,
	<freebsd-arch@FreeBSD.ORG>
Subject: Re: cur{thread/proc}, or not. 
References:  <20011112165530.B34657-100000@delplex.bde.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

:> Have a look at http://people.freebsd.org/~peter/macros.c  where I've cpp'ed
:> it and indented it for readability.  Anyway, kthread_exit() turns into
:> this for the compiler to choke on:
:
:> [235 lines of bletcherous code deleted]

    It's a mess, but the code produced isn't too bad.  It's much better
    now that the mutexes are calling real procedures.
    
:
:Passing the pointer down through 20 subroutines (some of which don't
:even use it except to pass it along) may add up to much.
:
:Bruce

    I agree that it is kind of silly to pass a global down through N levels
    of procedures.  Just on principle.  On the other hand I don't expect
    the performance to be better or worse, or even for there to be any
    real difference in code size.  Fewer instructions per routine in
    more routines, with more memory writes (pass as argument on stack),
    versus more instructions in fewer routines, with only memory reads
    (access as global).  Without there being a clear winner there isn't
    much of a reason to change the existing code.

    If we stopped trying to be fancy with interrupt scheduling and went
    back to the BSDI methodology the kernel code could assume that
    %fs doesn't change out from under it and we could *GREATLY*
    simplify the __PCPU_GET() code to something like this:

static __inline
struct globaldata *
__globaldata(void)
{
        struct globaldata *gd;
                
        __asm("movl %%fs,%0" : "=r" (gd));
        return(gd);
}
                
#define __PCPU_GET(name)        (__globaldata()->name)

    Which would allow GCC to generate somewhat better code output
    (about 1K less code in the text segment) as well as
    allow the per-cpu variables to be accessed more normally without
    having to use macros to GET and SET them.

    Else we are stuck with what we have.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>




From owner-freebsd-arch  Mon Nov 12  2:47:41 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP id C8EB137B417
	for <freebsd-arch@FreeBSD.ORG>; Mon, 12 Nov 2001 02:47:37 -0800 (PST)
Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.6/8.11.5) with SMTP id fACAlOB24388;
	Mon, 12 Nov 2001 05:47:25 -0500 (EST)
	(envelope-from robert@fledge.watson.org)
Date: Mon, 12 Nov 2001 05:47:24 -0500 (EST)
From: Robert Watson <rwatson@FreeBSD.ORG>
X-Sender: robert@fledge.watson.org
To: Kirk McKusick <mckusick@mckusick.com>
Cc: freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not. 
In-Reply-To: <200111111815.fABIFG336949@beastie.mckusick.com>
Message-ID: <Pine.NEB.3.96L.1011112054158.16646E-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On Sun, 11 Nov 2001, Kirk McKusick wrote:

> Some many years ago, I tried to get rid of all the references to curproc
> in the filesystem code, and quickly came to the realization that it
> would require adding a proc pointer to virtually every subroutine in the
> filesystem code. For the reasons that you have noted, this is ugly and
> adds bloat to the stack space. On the other hand, there are places where
> the filesystem code does not want to use the current process credential.
> One of the more evident ones is in the NFS server code which wants to
> pass down the credential of the requesting client rather than its own.
> Solaris uses a very ugly hack where the server thread replaces its
> credential with that of its client, does the VOP call, then puts its own
> credential back when it returns. This sort of problem could exist in
> almost any instance where the kernel is acting as a server. So,
> completely removing process/credential references from the kernel
> interfaces is not the right solution either. 

Right now, many of the VFS calls pass a credential in, which is used in
lieu of the process credential in most cases.  The prominent exceptions to
this rule seem to be in the device code (where process credentials are
used), and in the smattering of VOP calls where in UFS/FFS, an
authorization decision is not required.  By putting the credential into
these calls, I think most NFS cases could be normalized.  This would be
consistent with the approach adopted by several other systems I looked at,
and seems like it may intuitively be the right approach given the 'file'
cached credential model.

As Peter has pointed out, this change could be independent of any choice
about curproc/curthread, and is probably worth doing regardless of the
choice there.  Probably the right 'approach' here is to assume that
operations on 'vnode' require a 'ucred', whereas operations on 'file'
generally do not. 

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services




From owner-freebsd-arch  Mon Nov 12  4:10:31 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16])
	by hub.freebsd.org (Postfix) with ESMTP
	id 5DED237B405; Mon, 12 Nov 2001 04:10:26 -0800 (PST)
Received: from bde.zeta.org.au (bde.zeta.org.au [203.2.228.102])
	by mailman.zeta.org.au (8.9.3/8.8.7) with ESMTP id XAA19129;
	Mon, 12 Nov 2001 23:10:16 +1100
Date: Mon, 12 Nov 2001 23:09:06 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender:  <bde@delplex.bde.org>
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: Peter Wemm <peter@wemm.org>, Robert Watson <rwatson@FreeBSD.ORG>,
	<freebsd-arch@FreeBSD.ORG>
Subject: Re: cur{thread/proc}, or not. 
In-Reply-To: <200111121009.fACA9SI75024@apollo.backplane.com>
Message-ID: <20011112221522.E36389-100000@delplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

On Mon, 12 Nov 2001, Matthew Dillon wrote:

>     If we stopped trying to be fancy with interrupt scheduling and went
>     back to the BSDI methodology the kernel code could assume that
>     %fs doesn't change out from under it and we could *GREATLY*

Strictly, that the GDT entry for %fs doesn't change.  We could safely
assume this already for the !SMP case.

>     simplify the __PCPU_GET() code to something like this:
>
> static __inline
> struct globaldata *
> __globaldata(void)
> {
>         struct globaldata *gd;
>
>         __asm("movl %%fs,%0" : "=r" (gd));
>         return(gd);
> }
>
> #define __PCPU_GET(name)        (__globaldata()->name)
>
>     Which would allow GCC to generate somewhat better code output
>     (about 1K less code in the text segment) as well as
>     allow the per-cpu variables to be accessed more normally without
>     having to use macros to GET and SET them.

This is essentially a slightly pessimized version of the RELENG_4
code for the SMP case (RELENG_4 avoids going through the pointer
for most per-cpu global accesses).

It also helps to declare __globaldata() as __pure2 so that gcc can
tell that it always returns the same value.  It doesn't quite always
return the same value, but I can't think of any cases where a cached
value would remain valid long enough to cause problems.
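
A minimal userland sketch of that declaration.  FreeBSD's __pure2
expands to GCC's "const" function attribute, which is spelled out
directly here so the example builds outside the kernel tree.  The names
globaldata_ptr(), PCPU_GET_SKETCH() and sum_fields(), and the two-field
struct globaldata, are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU block; the real struct globaldata is much larger. */
struct globaldata {
	int gd_cpuid;
	int gd_switchticks;
};

static struct globaldata cpu0 = { 0, 42 };

/*
 * The "const" attribute promises that the result depends only on the
 * arguments (none here), so the compiler may merge repeated calls.
 */
static struct globaldata *globaldata_ptr(void) __attribute__((__const__));

static struct globaldata *
globaldata_ptr(void)
{
	return (&cpu0);
}

#define	PCPU_GET_SKETCH(name)	(globaldata_ptr()->gd_##name)

static int
sum_fields(void)
{
	/* Two expansions; the compiler is free to call the function once. */
	return (PCPU_GET_SKETCH(cpuid) + PCPU_GET_SKETCH(switchticks));
}
```

The attribute is only a promise to the compiler.  It is honest here
because the function always returns the same address, which is the
situation described above.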

Bruce




From owner-freebsd-arch  Mon Nov 12  4:39:47 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from peter3.wemm.org (c1315225-a.plstn1.sfba.home.com [24.14.150.180])
	by hub.freebsd.org (Postfix) with ESMTP
	id B07DF37B41A; Mon, 12 Nov 2001 04:39:25 -0800 (PST)
Received: from overcee.netplex.com.au (overcee.wemm.org [10.0.0.3])
	by peter3.wemm.org (8.11.0/8.11.0) with ESMTP id fACCdPM13118;
	Mon, 12 Nov 2001 04:39:25 -0800 (PST)
	(envelope-from peter@wemm.org)
Received: from wemm.org (localhost [127.0.0.1])
	by overcee.netplex.com.au (Postfix) with ESMTP
	id 6A89E380A; Mon, 12 Nov 2001 04:39:25 -0800 (PST)
	(envelope-from peter@wemm.org)
X-Mailer: exmh version 2.5 07/13/2001 with nmh-1.0.4
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: Bruce Evans <bde@zeta.org.au>,
	Robert Watson <rwatson@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not. 
In-Reply-To: <200111121009.fACA9SI75024@apollo.backplane.com> 
Date: Mon, 12 Nov 2001 04:39:25 -0800
From: Peter Wemm <peter@wemm.org>
Message-Id: <20011112123925.6A89E380A@overcee.netplex.com.au>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Matthew Dillon wrote:
> :> Have a look at http://people.freebsd.org/~peter/macros.c  where I've cpp'ed
> :> it and indented it for readability.  Anyway, kthread_exit() turns into
> :> this for the compiler to choke on:
> :
> :> [235 lines of bletcherous code deleted]
> 
>     It's a mess, but the code produced isn't too bad.  It's much better
>     now that the mutexes are calling real procedures.

Mutexes only call procedures if debugging options are on.  If you compile
without INVARIANTS, KTR, or WITNESS, then you get the maximum inline
versions.

Regarding __globaldata() .. That's almost how an intermediate version
of globals.h did it on the i386, about rev 1.16.  We always have the option
to go back to something like that later on if preemption turns out to be a wash.

Your inline function doesn't work though.. %fs isn't a general purpose
register.. You can't store a pointer in the register itself.  You have
to use an indirect memory reference to fetch the pointer.

ie:
        struct globaldata *gd;
                
        __asm("movl %%fs,%0" : "=r" (gd));
        return(gd);

must be more like this:
	__asm("movl %%fs:0,%0" : "=r" (gd));

ie: read memory location 0 from the %fs segment.
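
A portable model of that indirection, for readers without an i386 at
hand.  It is only a sketch: seg_base stands in for the base address that
the %fs descriptor maps to, and gd_self, segment_setup() and
globaldata_via_segment() are hypothetical names.  It also assumes that
the per-CPU block keeps a pointer to itself at offset 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-CPU block with a pointer to itself at offset 0. */
struct globaldata {
	struct globaldata *gd_self;
	int gd_cpuid;
};

/* Stands in for the base address that the %fs descriptor maps to. */
static unsigned char *seg_base;

/* Point the simulated segment at one CPU's block. */
static void
segment_setup(struct globaldata *gd)
{
	gd->gd_self = gd;
	seg_base = (unsigned char *)gd;
}

/* Models "movl %%fs:0,%0": load the word stored at offset 0 of the segment. */
static struct globaldata *
globaldata_via_segment(void)
{
	struct globaldata *gd;

	assert(seg_base != NULL);
	memcpy(&gd, seg_base + offsetof(struct globaldata, gd_self),
	    sizeof(gd));
	return (gd);
}
```

The segment register itself only selects where the block lives; the
pointer comes from memory inside that block.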

Note that the RELENG_4 macros call inlines:
#define GLOBAL_FUNC(name) \
        static __inline void *_global_ptr_##name(void) { \
                void *val; \
                __asm __volatile("movl $gd_" #name ",%0;" \
                        "addl %%fs:globaldata,%0" : "=r" (val)); \
                return (val); \
        } \
        static __inline void *_global_ptr_##name##_nv(void) { \
                void *val; \
                __asm("movl $gd_" #name ",%0;" \
                        "addl %%fs:globaldata,%0" : "=r" (val)); \
                return (val); \
        } \
        static __inline int _global_##name(void) { \
                int val; \
                __asm __volatile("movl %%fs:gd_" #name ",%0" : "=r" (val)); \
                return (val); \
        } \
        static __inline int _global_##name##_nv(void) { \
                int val; \
                __asm("movl %%fs:gd_" #name ",%0" : "=r" (val)); \
                return (val); \
        } \
        static __inline void _global_##name##_set(int val) { \
                __asm __volatile("movl %0,%%fs:gd_" #name : : "r" (val)); \
        } \
        static __inline void _global_##name##_set_nv(int val) { \
                __asm("movl %0,%%fs:gd_" #name : : "r" (val)); \
        }

...
GLOBAL_FUNC(curproc)
GLOBAL_FUNC(astpending)
GLOBAL_FUNC(curpcb)
GLOBAL_FUNC(npxproc)
GLOBAL_FUNC(common_tss)
GLOBAL_FUNC(switchtime)
GLOBAL_FUNC(switchticks)
...

Bruce neglected to show the spammage from this in his cut/paste.
Here's what it really looks like in RELENG_4, and remember that this
is *without* mutexes and atomic support, etc, and after I have
cleaned it up so that hopefully the mail system won't shred it:

static __inline void *_global_ptr_curproc (void) { void *val; __asm volatile ("movl $gd_" "curproc" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_curproc_nv(void) { void *val; __asm("movl $gd_" "curproc" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_curproc (void) { int val; __asm volatile ("movl %%fs:gd_" "curproc" ",%0" : "=r" (val)); return (val); }
static __inline int _global_curproc_nv(void) { int val; __asm("movl %%fs:gd_" "curproc" ",%0" : "=r" (val)); return (val); }
static __inline void _global_curproc_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "curproc" : : "r" (val)); }
static __inline void _global_curproc_set_nv(int val) { __asm("movl %0,%%fs:gd_" "curproc" : : "r" (val)); } 
static __inline void *_global_ptr_astpending (void) { void *val; __asm volatile ("movl $gd_" "astpending" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_astpending_nv(void) { void *val; __asm("movl $gd_" "astpending" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_astpending (void) { int val; __asm volatile ("movl %%fs:gd_" "astpending" ",%0" : "=r" (val)); return (val); }
static __inline int _global_astpending_nv(void) { int val; __asm("movl %%fs:gd_" "astpending" ",%0" : "=r" (val)); return (val); }
static __inline void _global_astpending_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "astpending" : : "r" (val)); }
static __inline void _global_astpending_set_nv(int val) { __asm("movl %0,%%fs:gd_" "astpending" : : "r" (val)); } 
static __inline void *_global_ptr_curpcb (void) { void *val; __asm volatile ("movl $gd_" "curpcb" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_curpcb_nv(void) { void *val; __asm("movl $gd_" "curpcb" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_curpcb (void) { int val; __asm volatile ("movl %%fs:gd_" "curpcb" ",%0" : "=r" (val)); return (val); }
static __inline int _global_curpcb_nv(void) { int val; __asm("movl %%fs:gd_" "curpcb" ",%0" : "=r" (val)); return (val); }
static __inline void _global_curpcb_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "curpcb" : : "r" (val)); }
static __inline void _global_curpcb_set_nv(int val) { __asm("movl %0,%%fs:gd_" "curpcb" : : "r" (val)); } 
static __inline void *_global_ptr_npxproc (void) { void *val; __asm volatile ("movl $gd_" "npxproc" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_npxproc_nv(void) { void *val; __asm("movl $gd_" "npxproc" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_npxproc (void) { int val; __asm volatile ("movl %%fs:gd_" "npxproc" ",%0" : "=r" (val)); return (val); }
static __inline int _global_npxproc_nv(void) { int val; __asm("movl %%fs:gd_" "npxproc" ",%0" : "=r" (val)); return (val); }
static __inline void _global_npxproc_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "npxproc" : : "r" (val)); }
static __inline void _global_npxproc_set_nv(int val) { __asm("movl %0,%%fs:gd_" "npxproc" : : "r" (val)); } 
static __inline void *_global_ptr_common_tss (void) { void *val; __asm volatile ("movl $gd_" "common_tss" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_common_tss_nv(void) { void *val; __asm("movl $gd_" "common_tss" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_common_tss (void) { int val; __asm volatile ("movl %%fs:gd_" "common_tss" ",%0" : "=r" (val)); return (val); }
static __inline int _global_common_tss_nv(void) { int val; __asm("movl %%fs:gd_" "common_tss" ",%0" : "=r" (val)); return (val); }
static __inline void _global_common_tss_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "common_tss" : : "r" (val)); }
static __inline void _global_common_tss_set_nv(int val) { __asm("movl %0,%%fs:gd_" "common_tss" : : "r" (val)); } 
static __inline void *_global_ptr_switchtime (void) { void *val; __asm volatile ("movl $gd_" "switchtime" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_switchtime_nv(void) { void *val; __asm("movl $gd_" "switchtime" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_switchtime (void) { int val; __asm volatile ("movl %%fs:gd_" "switchtime" ",%0" : "=r" (val)); return (val); }
static __inline int _global_switchtime_nv(void) { int val; __asm("movl %%fs:gd_" "switchtime" ",%0" : "=r" (val)); return (val); }
static __inline void _global_switchtime_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "switchtime" : : "r" (val)); }
static __inline void _global_switchtime_set_nv(int val) { __asm("movl %0,%%fs:gd_" "switchtime" : : "r" (val)); } 
static __inline void *_global_ptr_switchticks (void) { void *val; __asm volatile ("movl $gd_" "switchticks" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_switchticks_nv(void) { void *val; __asm("movl $gd_" "switchticks" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_switchticks (void) { int val; __asm volatile ("movl %%fs:gd_" "switchticks" ",%0" : "=r" (val)); return (val); }
static __inline int _global_switchticks_nv(void) { int val; __asm("movl %%fs:gd_" "switchticks" ",%0" : "=r" (val)); return (val); }
static __inline void _global_switchticks_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "switchticks" : : "r" (val)); }
static __inline void _global_switchticks_set_nv(int val) { __asm("movl %0,%%fs:gd_" "switchticks" : : "r" (val)); } 
static __inline void *_global_ptr_common_tssd (void) { void *val; __asm volatile ("movl $gd_" "common_tssd" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_common_tssd_nv(void) { void *val; __asm("movl $gd_" "common_tssd" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_common_tssd (void) { int val; __asm volatile ("movl %%fs:gd_" "common_tssd" ",%0" : "=r" (val)); return (val); }
static __inline int _global_common_tssd_nv(void) { int val; __asm("movl %%fs:gd_" "common_tssd" ",%0" : "=r" (val)); return (val); }
static __inline void _global_common_tssd_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "common_tssd" : : "r" (val)); }
static __inline void _global_common_tssd_set_nv(int val) { __asm("movl %0,%%fs:gd_" "common_tssd" : : "r" (val)); } 
static __inline void *_global_ptr_tss_gdt (void) { void *val; __asm volatile ("movl $gd_" "tss_gdt" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_tss_gdt_nv(void) { void *val; __asm("movl $gd_" "tss_gdt" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_tss_gdt (void) { int val; __asm volatile ("movl %%fs:gd_" "tss_gdt" ",%0" : "=r" (val)); return (val); }
static __inline int _global_tss_gdt_nv(void) { int val; __asm("movl %%fs:gd_" "tss_gdt" ",%0" : "=r" (val)); return (val); }
static __inline void _global_tss_gdt_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "tss_gdt" : : "r" (val)); }
static __inline void _global_tss_gdt_set_nv(int val) { __asm("movl %0,%%fs:gd_" "tss_gdt" : : "r" (val)); } 
static __inline void *_global_ptr_cpuid (void) { void *val; __asm volatile ("movl $gd_" "cpuid" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_cpuid_nv(void) { void *val; __asm("movl $gd_" "cpuid" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_cpuid (void) { int val; __asm volatile ("movl %%fs:gd_" "cpuid" ",%0" : "=r" (val)); return (val); }
static __inline int _global_cpuid_nv(void) { int val; __asm("movl %%fs:gd_" "cpuid" ",%0" : "=r" (val)); return (val); }
static __inline void _global_cpuid_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "cpuid" : : "r" (val)); }
static __inline void _global_cpuid_set_nv(int val) { __asm("movl %0,%%fs:gd_" "cpuid" : : "r" (val)); } 
static __inline void *_global_ptr_other_cpus (void) { void *val; __asm volatile ("movl $gd_" "other_cpus" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_other_cpus_nv(void) { void *val; __asm("movl $gd_" "other_cpus" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_other_cpus (void) { int val; __asm volatile ("movl %%fs:gd_" "other_cpus" ",%0" : "=r" (val)); return (val); }
static __inline int _global_other_cpus_nv(void) { int val; __asm("movl %%fs:gd_" "other_cpus" ",%0" : "=r" (val)); return (val); }
static __inline void _global_other_cpus_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "other_cpus" : : "r" (val)); }
static __inline void _global_other_cpus_set_nv(int val) { __asm("movl %0,%%fs:gd_" "other_cpus" : : "r" (val)); } 
static __inline void *_global_ptr_inside_intr (void) { void *val; __asm volatile ("movl $gd_" "inside_intr" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_inside_intr_nv(void) { void *val; __asm("movl $gd_" "inside_intr" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_inside_intr (void) { int val; __asm volatile ("movl %%fs:gd_" "inside_intr" ",%0" : "=r" (val)); return (val); }
static __inline int _global_inside_intr_nv(void) { int val; __asm("movl %%fs:gd_" "inside_intr" ",%0" : "=r" (val)); return (val); }
static __inline void _global_inside_intr_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "inside_intr" : : "r" (val)); }
static __inline void _global_inside_intr_set_nv(int val) { __asm("movl %0,%%fs:gd_" "inside_intr" : : "r" (val)); } 
static __inline void *_global_ptr_prv_CMAP1 (void) { void *val; __asm volatile ("movl $gd_" "prv_CMAP1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_prv_CMAP1_nv(void) { void *val; __asm("movl $gd_" "prv_CMAP1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_prv_CMAP1 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_CMAP1" ",%0" : "=r" (val)); return (val); }
static __inline int _global_prv_CMAP1_nv(void) { int val; __asm("movl %%fs:gd_" "prv_CMAP1" ",%0" : "=r" (val)); return (val); }
static __inline void _global_prv_CMAP1_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "prv_CMAP1" : : "r" (val)); }
static __inline void _global_prv_CMAP1_set_nv(int val) { __asm("movl %0,%%fs:gd_" "prv_CMAP1" : : "r" (val)); } 
static __inline void *_global_ptr_prv_CMAP2 (void) { void *val; __asm volatile ("movl $gd_" "prv_CMAP2" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_prv_CMAP2_nv(void) { void *val; __asm("movl $gd_" "prv_CMAP2" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_prv_CMAP2 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_CMAP2" ",%0" : "=r" (val)); return (val); }
static __inline int _global_prv_CMAP2_nv(void) { int val; __asm("movl %%fs:gd_" "prv_CMAP2" ",%0" : "=r" (val)); return (val); }
static __inline void _global_prv_CMAP2_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "prv_CMAP2" : : "r" (val)); }
static __inline void _global_prv_CMAP2_set_nv(int val) { __asm("movl %0,%%fs:gd_" "prv_CMAP2" : : "r" (val)); } 
static __inline void *_global_ptr_prv_CMAP3 (void) { void *val; __asm volatile ("movl $gd_" "prv_CMAP3" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_prv_CMAP3_nv(void) { void *val; __asm("movl $gd_" "prv_CMAP3" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_prv_CMAP3 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_CMAP3" ",%0" : "=r" (val)); return (val); }
static __inline int _global_prv_CMAP3_nv(void) { int val; __asm("movl %%fs:gd_" "prv_CMAP3" ",%0" : "=r" (val)); return (val); }
static __inline void _global_prv_CMAP3_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "prv_CMAP3" : : "r" (val)); }
static __inline void _global_prv_CMAP3_set_nv(int val) { __asm("movl %0,%%fs:gd_" "prv_CMAP3" : : "r" (val)); } 
static __inline void *_global_ptr_prv_PMAP1 (void) { void *val; __asm volatile ("movl $gd_" "prv_PMAP1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_prv_PMAP1_nv(void) { void *val; __asm("movl $gd_" "prv_PMAP1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_prv_PMAP1 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_PMAP1" ",%0" : "=r" (val)); return (val); }
static __inline int _global_prv_PMAP1_nv(void) { int val; __asm("movl %%fs:gd_" "prv_PMAP1" ",%0" : "=r" (val)); return (val); }
static __inline void _global_prv_PMAP1_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "prv_PMAP1" : : "r" (val)); }
static __inline void _global_prv_PMAP1_set_nv(int val) { __asm("movl %0,%%fs:gd_" "prv_PMAP1" : : "r" (val)); } 
static __inline void *_global_ptr_prv_CADDR1 (void) { void *val; __asm volatile ("movl $gd_" "prv_CADDR1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_prv_CADDR1_nv(void) { void *val; __asm("movl $gd_" "prv_CADDR1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_prv_CADDR1 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_CADDR1" ",%0" : "=r" (val)); return (val); }
static __inline int _global_prv_CADDR1_nv(void) { int val; __asm("movl %%fs:gd_" "prv_CADDR1" ",%0" : "=r" (val)); return (val); }
static __inline void _global_prv_CADDR1_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "prv_CADDR1" : : "r" (val)); }
static __inline void _global_prv_CADDR1_set_nv(int val) { __asm("movl %0,%%fs:gd_" "prv_CADDR1" : : "r" (val)); } 
static __inline void *_global_ptr_prv_CADDR2 (void) { void *val; __asm volatile ("movl $gd_" "prv_CADDR2" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_prv_CADDR2_nv(void) { void *val; __asm("movl $gd_" "prv_CADDR2" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_prv_CADDR2 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_CADDR2" ",%0" : "=r" (val)); return (val); }
static __inline int _global_prv_CADDR2_nv(void) { int val; __asm("movl %%fs:gd_" "prv_CADDR2" ",%0" : "=r" (val)); return (val); }
static __inline void _global_prv_CADDR2_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "prv_CADDR2" : : "r" (val)); }
static __inline void _global_prv_CADDR2_set_nv(int val) { __asm("movl %0,%%fs:gd_" "prv_CADDR2" : : "r" (val)); } 
static __inline void *_global_ptr_prv_CADDR3 (void) { void *val; __asm volatile ("movl $gd_" "prv_CADDR3" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_prv_CADDR3_nv(void) { void *val; __asm("movl $gd_" "prv_CADDR3" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_prv_CADDR3 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_CADDR3" ",%0" : "=r" (val)); return (val); }
static __inline int _global_prv_CADDR3_nv(void) { int val; __asm("movl %%fs:gd_" "prv_CADDR3" ",%0" : "=r" (val)); return (val); }
static __inline void _global_prv_CADDR3_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "prv_CADDR3" : : "r" (val)); }
static __inline void _global_prv_CADDR3_set_nv(int val) { __asm("movl %0,%%fs:gd_" "prv_CADDR3" : : "r" (val)); } 
static __inline void *_global_ptr_prv_PADDR1 (void) { void *val; __asm volatile ("movl $gd_" "prv_PADDR1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_prv_PADDR1_nv(void) { void *val; __asm("movl $gd_" "prv_PADDR1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_prv_PADDR1 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_PADDR1" ",%0" : "=r" (val)); return (val); }
static __inline int _global_prv_PADDR1_nv(void) { int val; __asm("movl %%fs:gd_" "prv_PADDR1" ",%0" : "=r" (val)); return (val); }
static __inline void _global_prv_PADDR1_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "prv_PADDR1" : : "r" (val)); }
static __inline void _global_prv_PADDR1_set_nv(int val) { __asm("movl %0,%%fs:gd_" "prv_PADDR1" : : "r" (val)); } 

void
kproc_start(udata)
	const void *udata;
{
	const struct kproc_desc	*kp = udata;
	int error;

	error = kthread_create((void (*)(void *))kp->func, 0 ,
		    kp->global_procpp, kp->arg0);
	if (error)
		panic("kproc_start: %s: error %d", kp->arg0, error);
}

int
kthread_create(void (*func)(void *), void *arg,
    struct proc **newpp, const char *fmt, ...)
{
	int error;
	va_list ap;
	struct proc *p2;

	if (!proc0.p_stats  ) {
		panic("kthread_create called too soon");
	}

	error = fork1(&proc0, (1<<5)  | (1<<2)  | (1<<4) , &p2);
	if (error)
		return error;

	 
	if (newpp != 0 )
		*newpp = p2;

	 
	p2->p_flag |= 0x00004  | 0x00200 ;
	p2->p_procsig->ps_flag |= 0x0001 ;
	{	if (( p2 )->p_lock++ == 0 && (( p2 )->p_flag & 0x00004 ) == 0)	faultin( p2 );	} ;

	 
	(( ap ) = (va_list)__builtin_next_arg(  fmt )) ;
	vsnprintf(p2->p_comm, sizeof(p2->p_comm), fmt, ap);
	 ;

	 
	cpu_set_fork_handler(p2, func, arg);

	return 0;
}

void
kthread_exit(int ecode)
{
	proc_reparent(((  struct proc * )_global_curproc_nv())  , initproc);
	exit1(((  struct proc * )_global_curproc_nv())  , (( ecode ) << 8 | (  0 )) );
}

int
suspend_kproc(struct proc *p, int timo)
{
	if ((p->p_flag & 0x00200 ) == 0)
		return (22 );
	( p->p_siglist ).__bits[(((    17    ) - 1)  >> 5) ] |= (1 << (((    17    ) - 1)  & 31))  ;
	return tsleep((caddr_t)&p->p_siglist, 40 , "suspkp", timo);
}

int
resume_kproc(struct proc *p)
{
	if ((p->p_flag & 0x00200 ) == 0)
		return (22 );
	( p->p_siglist ).__bits[(((    17    ) - 1)  >> 5) ] &= ~(1 << (((    17    ) - 1)  & 31))  ;
	wakeup((caddr_t)&p->p_siglist);
	return (0);
}

void
kproc_suspend_loop(struct proc *p)
{
	while ((( p->p_siglist ).__bits[(((    17    ) - 1)  >> 5) ] & (1 << (((    17    ) - 1)  & 31)) ) ) {
		wakeup((caddr_t)&p->p_siglist);
		tsleep((caddr_t)&p->p_siglist, 40 , "kpsusp", 0);
	}
}

And don't forget the extra support code required for this:

#include "assym.s"

#ifdef SMP
        /*
         * Define layout of per-cpu address space.
         * This is "constructed" in locore.s on the BSP and in mp_machdep.c
         * for each AP.  DO NOT REORDER THESE WITHOUT UPDATING THE REST!
         */
        .globl  _SMP_prvspace, _lapic
        .set    _SMP_prvspace,(MPPTDI << PDRSHIFT)
        .set    _lapic,_SMP_prvspace + (NPTEPG-1) * PAGE_SIZE

        .globl  gd_idlestack,gd_idlestack_top
        .set    gd_idlestack,PS_IDLESTACK
        .set    gd_idlestack_top,PS_IDLESTACK_TOP
#endif

        /*
         * Define layout of the global data.  On SMP this lives in
         * the per-cpu address space, otherwise it's in the data segment.
         */
        .globl  globaldata
#ifndef SMP
        .data
        ALIGN_DATA
globaldata:
        .space  GD_SIZEOF               /* in data segment */
#else
        .set    globaldata,0
#endif
        .globl  gd_curproc, gd_curpcb, gd_npxproc, gd_astpending
        .globl  gd_common_tss, gd_switchtime, gd_switchticks
        .set    gd_curproc,globaldata + GD_CURPROC
        .set    gd_astpending,globaldata + GD_ASTPENDING
        .set    gd_curpcb,globaldata + GD_CURPCB
        .set    gd_npxproc,globaldata + GD_NPXPROC
        .set    gd_common_tss,globaldata + GD_COMMON_TSS
        .set    gd_switchtime,globaldata + GD_SWITCHTIME
        .set    gd_switchticks,globaldata + GD_SWITCHTICKS

        .globl  gd_common_tssd, gd_tss_gdt
        .set    gd_common_tssd,globaldata + GD_COMMON_TSSD
        .set    gd_tss_gdt,globaldata + GD_TSS_GDT

#ifdef USER_LDT
        .globl  gd_currentldt
        .set    gd_currentldt,globaldata + GD_CURRENTLDT
#endif

#ifndef SMP
        .globl  _curproc, _curpcb, _npxproc, _astpending
        .globl  _common_tss, _switchtime, _switchticks
        .set    _curproc,globaldata + GD_CURPROC
        .set    _astpending,globaldata + GD_ASTPENDING
        .set    _curpcb,globaldata + GD_CURPCB
        .set    _npxproc,globaldata + GD_NPXPROC
        .set    _common_tss,globaldata + GD_COMMON_TSS
        .set    _switchtime,globaldata + GD_SWITCHTIME
        .set    _switchticks,globaldata + GD_SWITCHTICKS

        .globl  _common_tssd, _tss_gdt
        .set    _common_tssd,globaldata + GD_COMMON_TSSD
        .set    _tss_gdt,globaldata + GD_TSS_GDT

#ifdef USER_LDT
        .globl  _currentldt
        .set    _currentldt,globaldata + GD_CURRENTLDT
#endif
#endif

#ifdef SMP
        /*
         * The BSP version of these get setup in locore.s and pmap.c, while
         * the AP versions are setup in mp_machdep.c.
         */
        .globl  gd_cpuid, gd_cpu_lockid, gd_other_cpus
        .globl  gd_ss_eflags, gd_inside_intr
        .globl  gd_prv_CMAP1, gd_prv_CMAP2, gd_prv_CMAP3, gd_prv_PMAP1
        .globl  gd_prv_CADDR1, gd_prv_CADDR2, gd_prv_CADDR3, gd_prv_PADDR1

        .set    gd_cpuid,globaldata + GD_CPUID
        .set    gd_cpu_lockid,globaldata + GD_CPU_LOCKID
        .set    gd_other_cpus,globaldata + GD_OTHER_CPUS
        .set    gd_ss_eflags,globaldata + GD_SS_EFLAGS
        .set    gd_inside_intr,globaldata + GD_INSIDE_INTR
        .set    gd_prv_CMAP1,globaldata + GD_PRV_CMAP1
        .set    gd_prv_CMAP2,globaldata + GD_PRV_CMAP2
        .set    gd_prv_CMAP3,globaldata + GD_PRV_CMAP3
        .set    gd_prv_PMAP1,globaldata + GD_PRV_PMAP1
        .set    gd_prv_CADDR1,globaldata + GD_PRV_CADDR1
        .set    gd_prv_CADDR2,globaldata + GD_PRV_CADDR2
        .set    gd_prv_CADDR3,globaldata + GD_PRV_CADDR3
        .set    gd_prv_PADDR1,globaldata + GD_PRV_PADDR1
#endif

The globals.s code has to be in exact sync with the C headers.
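
A minimal sketch of how that kind of drift could be caught at compile time.
Everything here is an assumption for illustration: struct globaldata_sketch,
its gd_self field, and the GD_* values (picked to match the first few entries
of the nm output) are hypothetical stand-ins, not the real headers.  32-bit
fields model i386 pointers so the layout is the same on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the per-cpu structure. */
struct globaldata_sketch {
	uint32_t gd_self;	/* assumed first member, offset 0x00 */
	uint32_t gd_curproc;	/* offset 0x04 */
	uint32_t gd_npxproc;	/* offset 0x08 */
	uint32_t gd_curpcb;	/* offset 0x0c */
};

/* Offsets as the assembler side would see them (assumed values). */
#define GD_CURPROC	0x04
#define GD_NPXPROC	0x08
#define GD_CURPCB	0x0c

/* Break the build as soon as the C layout and the constants disagree. */
static_assert(offsetof(struct globaldata_sketch, gd_curproc) == GD_CURPROC,
    "gd_curproc: C layout and assembler offset disagree");
static_assert(offsetof(struct globaldata_sketch, gd_npxproc) == GD_NPXPROC,
    "gd_npxproc: C layout and assembler offset disagree");
static_assert(offsetof(struct globaldata_sketch, gd_curpcb) == GD_CURPCB,
    "gd_curpcb: C layout and assembler offset disagree");

/* Run-time view of the same comparison; returns 1 when all offsets match. */
int
gd_offsets_in_sync(void)
{
	return (offsetof(struct globaldata_sketch, gd_curproc) == GD_CURPROC &&
	    offsetof(struct globaldata_sketch, gd_npxproc) == GD_NPXPROC &&
	    offsetof(struct globaldata_sketch, gd_curpcb) == GD_CURPCB);
}
```

Reordering a field in the struct without touching the GD_* constants then
fails at compile time instead of at boot.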

And we push a whole bunch of stuff into the kernel namelist as well:

# nm /kernel | sort | more
00000000 A globaldata
00000004 A gd_curproc
00000008 A gd_npxproc
0000000c A gd_curpcb
00000010 A gd_switchtime
00000018 A gd_common_tss
00000080 A gd_switchticks
00000084 A gd_common_tssd
0000008c A gd_tss_gdt
00000090 A gd_cpuid
00000094 A gd_cpu_lockid
00000098 A gd_other_cpus
0000009c A gd_inside_intr
000000a0 A gd_ss_eflags
000000a4 A gd_prv_CMAP1
000000a8 A gd_prv_CMAP2
000000ac A gd_prv_CMAP3
000000b0 A gd_prv_PMAP1
000000b4 A gd_prv_CADDR1
000000b8 A gd_prv_CADDR2
000000bc A gd_prv_CADDR3
000000c0 A gd_prv_PADDR1
000000c4 A gd_astpending
00005000 A gd_idlestack
00008000 A gd_idlestack_top
9fc00000 A PTmap
9fe7f000 A PTD
9fe7f9fc A PTDpde
9fe7fffc A APTDpde
a0000000 A kernbase
a011ffb0 T btext
a0120019 t begin
a0120064 T sigcode
a0120084 t _osigcode
a01200ac t _esigcode
a01200ac t recover_bootinfo
a01200b9 t newboot
a01200fb t got_bi_size
a012010c t got_common_bi_size
a012010f t olddiskboot
a0120120 t identify_cpu
[....]

Anyway, we have plenty of time to come back to this if it turns out that
we don't need the complexity.  We have *lots* of optimization choices.
But we should not start restricting our options yet.

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Mon Nov 12  8:34:57 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail12.speakeasy.net (mail12.speakeasy.net [216.254.0.212])
	by hub.freebsd.org (Postfix) with ESMTP id 243A337B418
	for <freebsd-arch@FreeBSD.org>; Mon, 12 Nov 2001 08:34:52 -0800 (PST)
Received: (qmail 80361 invoked from network); 12 Nov 2001 16:34:51 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail12.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <julian@elischer.org>; 12 Nov 2001 16:34:51 -0000
Message-ID: <XFMail.011112083451.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <Pine.BSF.4.21.0111110957150.89663-100000@InterJet.elischer.org>
Date: Mon, 12 Nov 2001 08:34:51 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Julian Elischer <julian@elischer.org>
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.org, Robert Watson <rwatson@FreeBSD.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 11-Nov-01 Julian Elischer wrote:
> Also, in SMPng cur{thread,proc} takes some time to get as I'm told that
> dereferencing %fs is very slow.. (Not sure how true that is).

I'm not sure it is any slower than pushing the variable onto the stack and then
reading it from the stack.  Reading the variable off the stack is still a
memory read, as is reading curproc, so it's not really that slow.  %fs is no
slower than %ds, at least not by anything that compares to the amount of time
it takes to go out to cache or memory and read the thing.

> There are other reasons for needing the pointer than for a credential.
> For example in AIO, the process pointer is stored so that
> address space can be loaned to the aio threads to do the IO.

Yeah, but it isn't used.  All that is used for is to find the vmspace to dink
with the aio thread's vmspace AFAICT.
 
> I have thought about this both ways...
> both have advantages. In some architectures, getting curthread might
> be very expensive.

I don't think it is as expensive as people think it is. :)

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Mon Nov 12  8:43:23 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail11.speakeasy.net (mail11.speakeasy.net [216.254.0.211])
	by hub.freebsd.org (Postfix) with ESMTP id 0509337B405
	for <freebsd-arch@FreeBSD.ORG>; Mon, 12 Nov 2001 08:43:19 -0800 (PST)
Received: (qmail 94565 invoked from network); 12 Nov 2001 16:43:17 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail11.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <peter@wemm.org>; 12 Nov 2001 16:43:17 -0000
Message-ID: <XFMail.011112084317.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <20011111191735.00D053807@overcee.netplex.com.au>
Date: Mon, 12 Nov 2001 08:43:17 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Peter Wemm <peter@wemm.org>
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.ORG, Robert Watson <rwatson@FreeBSD.ORG>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 11-Nov-01 Peter Wemm wrote:
> Robert Watson wrote:
> 
>> It seems to me that unless a very strong argument exists against using
>> curproc/curthread (and I don't preclude one existing), using them would
>> actually be an improvement, as it would assert that this class of
>> 'borrowing' couldn't exist, simplifying the kernel, not to mention
>> squeezing a bit more stuff out of the stack (which, at ten levels deep,
>> actually begins to add up on 64-bit machines).  I believe that there are
>> many places where the 'p' passed in is implicitly assumed to be the
>> current process, and that making that reliance explicit would be an
>> improvement, rather than a problem. 
> 
> My gripe is that on i386, it creates a LOT of work for the compiler.
> 
> Consider this small function in kern_kthread.c:
> void
> kthread_exit(int ecode)
> {
> 
>         sx_xlock(&proctree_lock);
>         PROC_LOCK(curproc);
>         proc_reparent(curproc, initproc);
>         PROC_UNLOCK(curproc);
>         sx_xunlock(&proctree_lock);
>         exit1(curthread, W_EXITCODE(ecode, 0));
> }
> 
> Have a look at http://people.freebsd.org/~peter/macros.c  where I've cpp'ed
> it and indented it for readability.  Anyway, kthread_exit() turns into
> this for the compiler to choke on:

This is why one does 'struct proc *p;  p = curproc;' and then s/curproc/p/.  As
it is, our current macros collapse that PCPU_GET() down into one instruction.
We actually used to have it be multiple instructions, but then people got all
upset and whined and complained about it being 2 instructions or whatever it
was when SMPng first went in.
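
A minimal sketch of that idiom, under stated assumptions: pcpu_get_curproc()
and the counter are hypothetical stand-ins for the real PCPU_GET() expansion,
and struct proc here is a toy.  It only shows that each textual use of curproc
is a separate fetch, while the local copy is fetched once.

```c
#include <assert.h>
#include <stddef.h>

/* Toy process structure; the fields are placeholders. */
struct proc {
	int p_locked;
	struct proc *p_pptr;
};

static struct proc cur_slot;		/* stands in for the per-cpu slot */
static struct proc init_slot;		/* stands in for initproc */
static unsigned long pcpu_reads;	/* counts per-cpu fetches */

/* Hypothetical stand-in for what the curproc macro expands to. */
static struct proc *
pcpu_get_curproc(void)
{
	pcpu_reads++;
	return (&cur_slot);
}
#define curproc	(pcpu_get_curproc())

/* Every use of curproc expands the macro again: three fetches. */
static void
reparent_naive(struct proc *newparent)
{
	curproc->p_locked = 1;
	curproc->p_pptr = newparent;
	curproc->p_locked = 0;
}

/* Fetch once into a local and use the local afterwards: one fetch. */
static void
reparent_cached(struct proc *newparent)
{
	struct proc *p;

	p = curproc;
	assert(p != NULL);
	p->p_locked = 1;
	p->p_pptr = newparent;
	p->p_locked = 0;
}

/* Helpers reporting how many fetches each variant performed. */
unsigned long
reads_naive(void)
{
	pcpu_reads = 0;
	reparent_naive(&init_slot);
	return (pcpu_reads);
}

unsigned long
reads_cached(void)
{
	pcpu_reads = 0;
	reparent_cached(&init_slot);
	return (pcpu_reads);
}
```

With a volatile asm behind the macro the compiler cannot merge the repeated
fetches on its own, which is why the local variable helps.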

Also, regarding the preemption stuff on the side:

- BSD/OS happily preempts arbitrarily for interrupts just in case that wasn't
  clear, and
- curthread doesn't change when we get preempted, just things like cpuid or
  PCPU_GET(spinlocks) need to be worried about.  Since the only PCPU macro
  commonly used is curthread, then you don't have to worry about this in 
  most cases.

> Ever wonder why the kernel gets slower and slower to compile?  Ever
> compiled a 2.1 or 2.2 kernel on a modern machine and been shocked away by
> the speed?

And 2.1 and 2.2 don't support SMP.  If we didn't have SMP then PCPU_FOO() could
certainly be simpler.  They could just be global variables like they used to be
in fact.  Now, maybe as a hack for now, you could try something like having a
simple case for PCPU_GET() on the x86 that is PCPU_GET_CURTHREAD() or something
and define curthread to be that.  Sheesh.

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Mon Nov 12  9:31:44 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 309CA37B419; Mon, 12 Nov 2001 09:31:39 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fACHVck84386;
	Mon, 12 Nov 2001 09:31:38 -0800 (PST)
	(envelope-from dillon)
Date: Mon, 12 Nov 2001 09:31:38 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111121731.fACHVck84386@apollo.backplane.com>
To: Peter Wemm <peter@wemm.org>
Cc: Bruce Evans <bde@zeta.org.au>,
	Robert Watson <rwatson@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not. 
References:  <20011112123925.6A89E380A@overcee.netplex.com.au>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


:>     It's a mess, but the code produced isn't too bad.  It's much better
:>     now that the mutexes are calling real procedures.
:
:Mutexes only call procedures if debugging options are on.  If you compile
:without INVARIANTS, KTR, or WITNESS, then you get the maximum inline
:versions.

    Sigh.  Well, better than nothing I guess.

:Regarding __globaldata() .. That's almost how an intermediate version
:of globals.h did it on the i386, about rev 1.16.  We always have the option
:to go back to something like that later on if preemption turns out to be a wash.
:
:Your inline function doesn't work though.. %fs isn't a general purpose
:register.. You can't store a pointer in the register itself.  You have
:to use an indirect memory reference to fetch the pointer.

    Ach.  Right, of course.

:Anyway, we have plenty of time to come back to this if it turns out that
:we don't need the complexity.  We have *lots* of optimization choices.
:But we should not start restricting our options yet.
:
:Cheers,
:-Peter
:--
:Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au

    Well, that's part of the problem.  We *don't* have lots of optimization
    choices.  The way things are currently set up it is not possible to
    depend on *anything* being stable without obtaining a mutex first.

    I'm not going to worry about it for the moment, I have bigger fish
    to fry.

				-Matt
				Matthew Dillon 
				<dillon@backplane.com>




From owner-freebsd-arch  Mon Nov 12 14:49:58 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from raven.mail.pas.earthlink.net (raven.mail.pas.earthlink.net [207.217.120.39])
	by hub.freebsd.org (Postfix) with ESMTP
	id A4B6137B417; Mon, 12 Nov 2001 14:49:53 -0800 (PST)
Received: from dialup-209.245.136.188.dial1.sanjose1.level3.net ([209.245.136.188] helo=mindspring.com)
	by raven.mail.pas.earthlink.net with esmtp (Exim 3.33 #1)
	id 163Ptn-0002Aj-00; Mon, 12 Nov 2001 14:49:52 -0800
Message-ID: <3BF05241.74F895EF@mindspring.com>
Date: Mon, 12 Nov 2001 14:50:41 -0800
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Robert Watson <rwatson@FreeBSD.org>
Cc: freebsd-arch@FreeBSD.org
Subject: Re: cur{thread/proc}, or not.
References: <Pine.NEB.3.96L.1011111101234.11566A-100000@fledge.watson.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Robert Watson wrote:
> There are a number of uses of curproc in the netinet code, used to
> retrieve credentials for authorization somewhere down the stack, when no
> proc or thread pointer has been passed down.

I think that the majority of the netinet code can be handled
by using the socket credential, instead of the process
credential.


> With the eventual addition
> of td->td_ucred, it will be desirable to use the credential for the
> current thread, rather than the proc, which will require locking to use.

I think locking credential instances is bad.

The real question you want to answer is whether or not the
credential instance that was used to acquire a socket should
be used continuously from there on out (i.e. it is a grant),
or whether it should change when the process credential
changes (i.e. it is a lease).  You seem to be arguing for a
lease.  I would argue for a grant.

One issue is that there are cases where write permission is
tested before each write.  There are also cases where you
obtain a privileged socket and then relinquish privileges
after obtaining it; such cases are explicitly modelled on a
grant model rather than a lease model.

The point is that if the credentials are granted, then a
change in credential is not a change of the credential itself,
but is instead a copy-on-write proposition.  In other words,
credentials, once granted, are privilege-stable.

If this is the case, then they are written when they are
instanced, cloned before they are modified (indeed, it seems
that the clone/modify operation must be made atomic), and
thus are never written once instanced -- only destroyed on
the 1->0 reference transition.

If so, then no locking is required, since LOCK CMPXCHG can
be utilized to do atomic increment and decrement on the
reference count, without needing locks.
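
A minimal sketch of that copy-on-write scheme, under stated assumptions:
struct cred_sketch and the cred_*() helpers are hypothetical names, not the
kernel's ucred API, and C11 atomics stand in for the hand-written locked
instruction (compilers typically lower them to LOCK-prefixed instructions
on x86).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical credential: never modified after creation, except cr_ref. */
struct cred_sketch {
	atomic_uint cr_ref;	/* reference count, updated atomically */
	unsigned int cr_uid;	/* written once, when instanced */
};

/* Instance a credential with a single reference. */
struct cred_sketch *
cred_new(unsigned int uid)
{
	struct cred_sketch *cr;

	cr = malloc(sizeof(*cr));
	if (cr == NULL)
		return (NULL);
	atomic_init(&cr->cr_ref, 1);
	cr->cr_uid = uid;
	return (cr);
}

/* Take another reference; no lock is needed. */
struct cred_sketch *
cred_hold(struct cred_sketch *cr)
{
	atomic_fetch_add(&cr->cr_ref, 1);
	return (cr);
}

/* Drop a reference; frees and returns 1 on the 1->0 transition. */
int
cred_rele(struct cred_sketch *cr)
{
	unsigned int old;

	old = atomic_fetch_sub(&cr->cr_ref, 1);
	assert(old > 0);
	if (old == 1) {
		free(cr);
		return (1);
	}
	return (0);
}

/*
 * "Change" a credential by cloning it with the new uid and dropping the
 * caller's reference on the old one.  Other holders keep the old object.
 */
struct cred_sketch *
cred_setuid_cow(struct cred_sketch *old, unsigned int uid)
{
	struct cred_sketch *cr;

	cr = cred_new(uid);
	if (cr != NULL)
		cred_rele(old);
	return (cr);
}
```

A socket that took cred_hold() before the process changed uid keeps seeing
the credential it was granted, which is the grant model described above.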


> As I
> understand it, use of curproc was branded 'undesirable' at some point in
> the semi-distant past, and since that time, a reference to 'proc' has been
> passed down the stack.  With a change to KSE, this has been translated to
> references the thread, but the issue remains the same.  This comes up in
> particular because I have a tree where I have propagated the thread
> pointer down if_ioctl in the network stack: the normal ioctl call carries
> a thread pointer now, but when it is translated into if_ioctl by the
> network stack, that pointer is lost.  This raises the question: should we
> (in practice) be adding process or thread pointers to many more of the
> function arguments, or should we switch to using curproc/curthread
> instead.

The "curproc" undesirability stems primarily from credentials
enforcement during interrupt processing.  I think that this is
not an insurmountable issue, but I would argue that these are
more appropriate for object credentials, where the objects in
question are not threads or processes.

For example, if we were to process incoming TCP connections up
through the "accept" code at interrupt time, one might naively
assume that, since the current socket code down through the
accept processing code off the queue filled in at NETISR seems
to require a proc credential, that it is therefore necessary to
have a proc credential at interrupt time in order to do this
processing.

The answer is that this is a false assumption, and is predicated
on historical code, and nothing more.

Specifically, if I need a credential for a newly accepted socket
that I am now creating, I can add a reference to the listen socket
credential -- I //do not need// a process credential in order to
do an accept.
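
A rough sketch of that idea, assuming a socket carries a pointer to a
reference-counted credential.  The structures and the xso_newconn()
name are made up for illustration and are not the real socket code;
the point is only that nothing here needs a proc, a thread, or cur*.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical reference-counted, write-once credential. */
struct xcred {
	atomic_uint	xc_ref;
	unsigned int	xc_uid;
};

/* Hypothetical socket: the credential is an attribute of the object. */
struct xsocket {
	struct xcred	*xso_cred;	/* granted when the socket was created */
};

/*
 * Create the socket for a newly accepted connection.  The credential
 * comes from the listen socket, so this can run from interrupt or
 * NETISR context where no process credential is available.
 */
struct xsocket *
xso_newconn(struct xsocket *head)
{
	struct xsocket *so = malloc(sizeof(*so));

	if (so == NULL)
		return (NULL);
	atomic_fetch_add(&head->xso_cred->xc_ref, 1);	/* new reference */
	so->xso_cred = head->xso_cred;
	return (so);
}
```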

There is a lot of this type of fuzzy thinking, asking "how can I
propagate the process credential that I used to use for this
operation down to the underlying code?", when the real question
should be "what is the appropriate credential to use for this
operation, and is the process credential really what I want to
use in this case?".


I think it's possible to get rid of most of the process credential
references -- and therefore, most of the proc references -- at all
points below the /sys/kern/uipc_socket*.c level.


> I don't pretend to have a grasp of all the issues here, so the purpose of
> this message is to raise the issues so that I can understand them.  I have
> a tree where I've eliminated many references to curproc; however, I'm now
> wondering if it wouldn't simply be more useful to eliminate many of the
> references to struct proc in the function arguments, and use curproc
> instead, and add references to ucred (and related ref-counted structures)
> as needed for delegation types of situations.  In particular, that would
> suggest the following changes:

I think this is the wrong direction, but if you wanted to do this,
I think that you would need to put the cur* symbols into the per
CPU private pages.  This is problematic in the extreme, because it
means that you must set these values each time going down, in order
to be able to substitute a per CPU global for the stack reference.

I think this is a bad thing, in general, and will lead only to
trouble later.

I would much rather that the credentials be object referenced off
of non-process, non-thread objects, based on whatever the correct
scoping really is, for the security model you want to enforce.  My
"accept" example is only one of a class of changes that could
facilitate this.


-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Mon Nov 12 14:54:34 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id EF6BA37B416; Mon, 12 Nov 2001 14:54:31 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fACMsNd06845;
	Mon, 12 Nov 2001 14:54:23 -0800 (PST)
	(envelope-from dillon)
Date: Mon, 12 Nov 2001 14:54:23 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111122254.fACMsNd06845@apollo.backplane.com>
To: Terry Lambert <tlambert2@mindspring.com>
Cc: Robert Watson <rwatson@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References: <Pine.NEB.3.96L.1011111101234.11566A-100000@fledge.watson.org> <3BF05241.74F895EF@mindspring.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

:The point is that if the credentials are granted, then a
:change in credential is not a change of the credential itself,
:but is instead a copy-on-write proposition.  In other words,
:credentials, once granted, are privilege stable.
:
:If this is the case, then they are written when they are
:instanced, cloned before they are modified (indeed, it seems
:that the clone/modify operation must be made atomic), and
:thus are never written once instanced -- only destroyed on
:the 1->0 reference transition.
:
:If so, then no locking is required, since the LCK CMPXCHG can
:be utilized to do atomic increment and decrement on the
:reference counting, without needing locks.
:...
:
:-- Terry

    Yes, I believe this is how credentials work.  I looked at
    the code about 6 months ago.  We should not have to do any
    locking of the credential stuff, only simple mutexing
    around the ref counter.  That is how it should work and
    is how I believe it currently works.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>



From owner-freebsd-arch  Mon Nov 12 15: 8:46 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205])
	by hub.freebsd.org (Postfix) with ESMTP id 0211437B416
	for <freebsd-arch@FreeBSD.org>; Mon, 12 Nov 2001 15:08:43 -0800 (PST)
Received: (qmail 30403 invoked from network); 12 Nov 2001 23:08:41 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <tlambert2@mindspring.com>; 12 Nov 2001 23:08:41 -0000
Message-ID: <XFMail.011112150836.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <3BF05241.74F895EF@mindspring.com>
Date: Mon, 12 Nov 2001 15:08:36 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Terry Lambert <tlambert2@mindspring.com>
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.org, Robert Watson <rwatson@FreeBSD.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 12-Nov-01 Terry Lambert wrote:
> Robert Watson wrote:
>> With the eventual addition
>> of td->td_ucred, it will be desirable to use the credential for the
>> current thread, rather than the proc, which will require locking to use.
> 
> I think locking credential instances is bad.

No, he's not locking credentials, he would be locking the process to avoid
having the credential change out from under him.  However, this won't be needed
in most cases since each thread has a read-only reference to the process
credential.  (When the process changes credentials, the references of other
threads force it to duplicate its current cred into a new one before making the
change.)
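
Roughly, as a user-space sketch with pthreads standing in for the proc
lock.  All names here (struct xproc, xthread_cred_sync(), and so on)
are hypothetical, and for simplicity xproc_setuid() always allocates a
new cred instead of first checking whether the old one is shared.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct xcred {
	atomic_uint	xc_ref;
	unsigned int	xc_uid;		/* never written once shared */
};

struct xproc {
	pthread_mutex_t	 p_mtx;		/* protects the p_ucred pointer */
	struct xcred	*p_ucred;
};

struct xthread {
	struct xproc	*td_proc;
	struct xcred	*td_ucred;	/* private read-only reference */
};

struct xcred *
xcred_alloc(unsigned int uid)
{
	struct xcred *cr = malloc(sizeof(*cr));

	if (cr == NULL)
		abort();
	atomic_init(&cr->xc_ref, 1);
	cr->xc_uid = uid;
	return (cr);
}

void
xcred_rele(struct xcred *cr)
{
	if (atomic_fetch_sub(&cr->xc_ref, 1) == 1)
		free(cr);
}

/* The process changes uid: swap in a new cred, never modify in place. */
void
xproc_setuid(struct xproc *p, unsigned int uid)
{
	struct xcred *newcr = xcred_alloc(uid), *old;

	pthread_mutex_lock(&p->p_mtx);
	old = p->p_ucred;
	p->p_ucred = newcr;
	pthread_mutex_unlock(&p->p_mtx);
	xcred_rele(old);		/* threads may still hold it */
}

/* Refresh the thread's reference, e.g. on entry to the kernel. */
void
xthread_cred_sync(struct xthread *td)
{
	struct xproc *p = td->td_proc;
	struct xcred *old = NULL;

	pthread_mutex_lock(&p->p_mtx);
	if (td->td_ucred != p->p_ucred) {
		old = td->td_ucred;
		atomic_fetch_add(&p->p_ucred->xc_ref, 1);
		td->td_ucred = p->p_ucred;
	}
	pthread_mutex_unlock(&p->p_mtx);
	if (old != NULL)
		xcred_rele(old);
}
```

Between syncs the thread can read td_ucred without taking the proc
lock, since nobody writes to a cred that has been handed out.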

> If so, then no locking is required, since the LCK CMPXCHG can
> be utilized to do atomic increment and decrement on the
> reference counting, without needing locks.

Except that people keep complaining about using atomic ops for ref counts.
However, that can be done later as an optimization.

Regarding object credentials, I agree, and I thought that this was how things
were already performed.

>> I don't pretend to have a grasp of all the issues here, so the purpose of
>> this message is to raise the issues so that I can understand them.  I have
>> a tree where I've eliminated many references to curproc; however, I'm now
>> wondering if it wouldn't simply be more useful to eliminate many of the
>> references to struct proc in the function arguments, and use curproc
>> instead, and add references to ucred (and related ref-counted structures)
>> as needed for delegation types of situations.  In particular, that would
>> suggest the following changes:
> 
> I think this is the wrong direction, but if you wanted to do this,
> I think that you would need to put the cur* symbols into the per
> CPU private pages.  This is problematic in the extreme, because it
> means that you must set these values each time going down, in order
> to be able to substitute a per CPU global for the stack reference.

Errr, Terry.  Where do you think curthread/curproc lives now?  It's _already_
in a per-CPU page.  We set curthread/curproc on each context switch.

> I would much rather that the credentials be object referenced off
> of non-process, non-thread objects, based on whatever the correct
> scoping really is, for the security model you want to enforce.  My
> "accept" example is only one of a class of changes that could
> facilitate this.

I agree with this.  I think Robert's question wasn't just about socket
credentials, however.  His question was why we pass a proc pointer (or
thread pointer) all the way down the stack, when it is implicitly assumed
to be curproc/curthread in several places, instead of just using
curproc/curthread.  Your only response to that seems to be to suggest that
we "change" to doing something that we already do.

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Mon Nov 12 15: 8:57 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail11.speakeasy.net (mail11.speakeasy.net [216.254.0.211])
	by hub.freebsd.org (Postfix) with ESMTP id 1B11A37B416
	for <freebsd-arch@FreeBSD.ORG>; Mon, 12 Nov 2001 15:08:53 -0800 (PST)
Received: (qmail 39315 invoked from network); 12 Nov 2001 23:08:43 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail11.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <dillon@apollo.backplane.com>; 12 Nov 2001 23:08:43 -0000
Message-ID: <XFMail.011112150837.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <200111122254.fACMsNd06845@apollo.backplane.com>
Date: Mon, 12 Nov 2001 15:08:37 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.ORG, Robert Watson <rwatson@FreeBSD.ORG>,
	Terry Lambert <tlambert2@mindspring.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 12-Nov-01 Matthew Dillon wrote:
>:The point is that if the credentials are granted, then a
>:change in credential is not a change of the credential itself,
>:but is instead a copy-on-write proposition.  In other words,
>:credentials, once granted, are privilege stable.
>:
>:If this is the case, then they are written when they are
>:instanced, cloned before they are modified (indeed, it seems
>:that the clone/modify operation must be made atomic), and
>:thus are never written once instanced -- only destroyed on
>:the 1->0 reference transition.
>:
>:If so, then no locking is required, since the LCK CMPXCHG can
>:be utilized to do atomic increment and decrement on the
>:reference counting, without needing locks.
>:...
>:
>:-- Terry
> 
>     Yes, I believe this is how credentials work.  I looked at
>     the code about 6 months ago.  We should not have to do any
>     locking of the credential stuff, only simple mutexing
>     around the ref counter.  That is how it should work and
>     is how I believe it currently works.

Yep.  They use a mutex for the refcount for now, but I still have patches that
some people don't like for implementing a simple refcount API just using atomic
operations.

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Mon Nov 12 15:16:24 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from raven.mail.pas.earthlink.net (raven.mail.pas.earthlink.net [207.217.120.39])
	by hub.freebsd.org (Postfix) with ESMTP
	id 803E837B417; Mon, 12 Nov 2001 15:16:22 -0800 (PST)
Received: from dialup-209.245.136.188.dial1.sanjose1.level3.net ([209.245.136.188] helo=mindspring.com)
	by raven.mail.pas.earthlink.net with esmtp (Exim 3.33 #1)
	id 163QJR-0007Wq-00; Mon, 12 Nov 2001 15:16:22 -0800
Message-ID: <3BF05877.B9E886D8@mindspring.com>
Date: Mon, 12 Nov 2001 15:17:11 -0800
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: Robert Watson <rwatson@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References: <Pine.NEB.3.96L.1011111101234.11566A-100000@fledge.watson.org> <3BF05241.74F895EF@mindspring.com> <200111122254.fACMsNd06845@apollo.backplane.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Matthew Dillon wrote:
>     Yes, I believe this is how credentials work.  I looked at
>     the code about 6 months ago.  We should not have to do any
>     locking of the credential stuff, only simple mutexing
>     around the ref counter.  That is how it should work and
>     is how I believe it currently works.

FWIW:

Robert had implied that more heavyweight locking of the process
(or thread) structure was necessary to access the credential,
which is correct, if you are referencing it that way.

The part of me you quoted here was a conclusion based on using
direct references to value-stable credentials rather than
value-volatile proc or thread structs.  It only works to refute
Robert's argument if you include that; it's not correct to
conclude that the way it currently works is sufficient in the
face of the proc/thread dereference issues that Robert was
trying to address (and which I tried to address by avoiding
entirely).

-- Terry



From owner-freebsd-arch  Mon Nov 12 15:20:14 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP
	id 7FEE937B416; Mon, 12 Nov 2001 15:20:11 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id PAA95318;
	Mon, 12 Nov 2001 15:02:12 -0800 (PST)
Date: Mon, 12 Nov 2001 15:02:11 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: Terry Lambert <tlambert2@mindspring.com>
Cc: Robert Watson <rwatson@FreeBSD.org>, freebsd-arch@FreeBSD.org
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <3BF05241.74F895EF@mindspring.com>
Message-ID: <Pine.BSF.4.21.0111121457180.94926-100000@InterJet.elischer.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On Mon, 12 Nov 2001, Terry Lambert wrote:

> 
> I think this is the wrong direction, but if you wanted to do this,
> I think that you would need to put the cur* symbols into the per
> CPU private pages.  This is problematic in the extreme, because it
> means that you must set these values each time going down, in order
> to be able to substitute a per CPU global for the stack reference.

curproc and curthread ARE in the per-cpu private pages.
on x86, the %fs segment register points to a small segment that includes
the appropriate pages for that cpu. Each cpu is initialised with a
different %fs register value.
Your private info is accessed as an offset into the 'f' segment
which is not used by anything else.

'curthread' is a macro that generates %fs(gd_curthread)
(I forget the exact syntax)

Similar for other CPUs
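
As a rough user-space illustration of the idea (not the real macros):
an array of per-CPU structures, with an ordinary variable standing in
for the %fs segment base.  Every name below is made up for the
example.

```c
#include <assert.h>
#include <stddef.h>

#define	XMAXCPU	4

struct xthread {
	int	td_tid;
};

/* One private area per CPU. */
struct xpcpu {
	struct xthread	*pc_curthread;
};

static struct xpcpu xpcpu_area[XMAXCPU];

/*
 * Stand-in for the segment register: on real hardware each CPU has
 * its own value, so the same instruction reaches a different area
 * depending on which CPU executes it.
 */
static int xcpu_self;

#define	XPCPU_GET(member)	(xpcpu_area[xcpu_self].pc_ ## member)
#define	xcurthread		XPCPU_GET(curthread)

/* The context switch is where the per-CPU value gets stored. */
void
xcpu_switch(struct xthread *newtd)
{
	xpcpu_area[xcpu_self].pc_curthread = newtd;
}
```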

> I think this is a bad thing, in general, and will lead only to
> trouble later.
> 
> I would much rather that the credentials be object referenced off
> of non-process, non-thread objects, based on whatever the correct
> scoping really is, for the security model you want to enforce.  My
> "accept" example is only one of a class of changes that could
> facilitate this.
> 




From owner-freebsd-arch  Mon Nov 12 15:20:22 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP
	id 09D5937B416; Mon, 12 Nov 2001 15:20:15 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id PAA95324;
	Mon, 12 Nov 2001 15:04:27 -0800 (PST)
Date: Mon, 12 Nov 2001 15:04:27 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: Terry Lambert <tlambert2@mindspring.com>,
	Robert Watson <rwatson@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <200111122254.fACMsNd06845@apollo.backplane.com>
Message-ID: <Pine.BSF.4.21.0111121503200.94926-100000@InterJet.elischer.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On Mon, 12 Nov 2001, Matthew Dillon wrote:

> :The point is that if the credentials are granted, then a
> :change in credential is not a change of the credential itself,
> :but is instead a copy-on-write proposition.  In other words,
> :credentials, once granted, are privilege stable.
> :
> :If this is the case, then they are written when they are
> :instanced, cloned before they are modified (indeed, it seems
> :that the clone/modify operation must be made atomic), and
> :thus are never written once instanced -- only destroyed on
> :the 1->0 reference transition.
> :
> :If so, then no locking is required, since the LCK CMPXCHG can
> :be utilized to do atomic increment and decrement on the
> :reference counting, without needing locks.
> :...
> :
> :-- Terry
> 
>     Yes, I believe this is how credentials work.  I looked at
>     the code about 6 months ago.  We should not have to do any
>     locking of the credential stuff, only simple mutexing
>     around the ref counter.  That is how it should work and
>     is how I believe it currently works.

This is not how they work, but rather how they WILL work
given that the commit happens soon (maybe it was already done
last week and I missed it...)





From owner-freebsd-arch  Mon Nov 12 15:20:36 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 9503637B419; Mon, 12 Nov 2001 15:20:24 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fACNKLC07027;
	Mon, 12 Nov 2001 15:20:21 -0800 (PST)
	(envelope-from dillon)
Date: Mon, 12 Nov 2001 15:20:21 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111122320.fACNKLC07027@apollo.backplane.com>
To: John Baldwin <jhb@FreeBSD.ORG>
Cc: freebsd-arch@FreeBSD.ORG, Robert Watson <rwatson@FreeBSD.ORG>,
	Terry Lambert <tlambert2@mindspring.com>
Subject: Re: cur{thread/proc}, or not.
References:  <XFMail.011112150837.jhb@FreeBSD.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

:
:Yep.  They use a mutex for the refcount for now, but I still have patches that
:some people don't like for implementing a simple refcount API just using atomic
:operations.
:
:-- 
:
:John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/

    I haven't seen your patches but I like the idea of a simple API for
    incrementing and decrementing a refcnt_t type of variable that
    hides the underlying 'how'.  For example, on some architectures
    you could use atomic ops, on others you could use a small pool
    of mutexes.

    Specifically, I really dislike the mutex embedded in the ucred
    structure.  It is entirely unnecessary - a simple global pool
    of shared mutexes is sufficient, hashed by pointer address,
    or using atomic ops on architectures that support them.

    Something like this:

    /*
     * machine independent sys/refcnt.h
     */

    #ifndef ARCH_OVERRIDE_REFCNT

    typedef int refcnt_t;

    #endif

    ...

    /*
     * machine independent kern/refcnt.c
     */

    #ifndef ARCH_OVERRIDE_REFCNT

    #define MTX_POOL	32

    static struct mtx mtx_pool[MTX_POOL];

    /*
     * called in early startup to initialize
     * mutexes (if necessary)
     */
    void
    refcnt_init(void)
    {
	...
    }

    /*
     * Increment the ref counter.  panic if we
     * overflow.
     */
    void
    refcnt_bump(refcnt_t *rp)
    {
	/*
	 * architecture dependent. e.g. atomic op
	 * in I386, maybe a pool mutex for alpha, etc
	 * etc etc.
	 */
    }

    /*
     * Decrement the ref counter.  panic if we
     * underflow.  Returns the ref counter after
     * it has been decremented (typically used to
     * determine that the associated structure
     * is no longer in use).
     */
    int
    refcnt_drop(refcnt_t *rp)
    {
	/*
	 * architecture dependent. e.g. atomic op
	 * in I386, maybe a pool mutex for alpha, etc
	 * etc etc.
	 */
    }

    #endif

    You could have a default set of ref counter routines that
    use a global pool of mutexes to avoid having to implement
    them for each architecture, and you could have architecture
    overrides of those routines to implement architecture-specific
    optimizations.

    Similar pool-type functions (using the same pool) can be used
    to sequence structure deallocations / cloning / etc.  In
    fact, the one huge advantage of a pool mutex is that it
    is independent of the structure, so you don't race a
    deallocation routine when obtaining the mutex prior to
    checking that the structure is even valid.
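
    A sketch of what the default, pool-based routines might look
    like, using pthread mutexes as a user-space stand-in for
    struct mtx.  The x-prefixed names and the hash are assumptions
    made for this example; the overflow/underflow panics are left
    out to keep it short.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define	XMTX_POOL	32		/* must be a power of two */

typedef int xrefcnt_t;

static pthread_mutex_t xmtx_pool[XMTX_POOL];

/* Called once in early startup to initialize the pool. */
void
xrefcnt_init(void)
{
	for (int i = 0; i < XMTX_POOL; i++)
		pthread_mutex_init(&xmtx_pool[i], NULL);
}

/*
 * Pick a pool mutex from the address of the counter.  The mutex
 * lives outside the structure, so it stays valid even if the
 * structure holding the counter is being torn down.
 */
static pthread_mutex_t *
xmtx_pool_find(const void *ptr)
{
	uintptr_t h = (uintptr_t)ptr;

	h = (h >> 4) ^ (h >> 12);	/* skip alignment bits, mix a little */
	return (&xmtx_pool[h & (XMTX_POOL - 1)]);
}

/* Increment the ref counter under its pool mutex. */
void
xrefcnt_bump(xrefcnt_t *rp)
{
	pthread_mutex_t *m = xmtx_pool_find(rp);

	pthread_mutex_lock(m);
	++*rp;
	pthread_mutex_unlock(m);
}

/* Decrement the ref counter; returns the value after the decrement. */
int
xrefcnt_drop(xrefcnt_t *rp)
{
	pthread_mutex_t *m = xmtx_pool_find(rp);
	int n;

	pthread_mutex_lock(m);
	n = --*rp;
	pthread_mutex_unlock(m);
	return (n);
}
```

    Unrelated counters that hash to the same slot share a mutex,
    which costs some contention but no correctness.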

						-Matt




From owner-freebsd-arch  Mon Nov 12 15:32:31 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail11.speakeasy.net (mail11.speakeasy.net [216.254.0.211])
	by hub.freebsd.org (Postfix) with ESMTP id BD31737B417
	for <freebsd-arch@FreeBSD.ORG>; Mon, 12 Nov 2001 15:32:28 -0800 (PST)
Received: (qmail 53828 invoked from network); 12 Nov 2001 23:32:27 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail11.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <dillon@apollo.backplane.com>; 12 Nov 2001 23:32:27 -0000
Message-ID: <XFMail.011112153221.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <200111122320.fACNKLC07027@apollo.backplane.com>
Date: Mon, 12 Nov 2001 15:32:21 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: cur{thread/proc}, or not.
Cc: Terry Lambert <tlambert2@mindspring.com>,
	Robert Watson <rwatson@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 12-Nov-01 Matthew Dillon wrote:
>:
>:Yep.  They use a mutex for the refcount for now, but I still have patches
>:that
>:some people don't like for implementing a simple refcount API just using
>:atomic
>:operations.
>:
>:-- 
>:
>:John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
> 
>     I haven't seen your patches but I like the idea of a simple API for
>     incrementing and decrementing a refcnt_t type of variable that
>     hides the underlying 'how'.  For example, on some architectures
>     you could use atomic ops, on others you could use a small pool
>     of mutexes.

http://www.freebsd.org/~jhb/patches/refcount.patch

It's slightly different from this in that refcount_drop() returns a boolean
that is true if the count just dropped to zero.  It only uses mutexes when
debugging is enabled and doesn't use a pool; it is currently implemented
completely with atomic ops on all currently supported archs.

Hmm, it needs a change in that no memory barriers are really needed except that
maybe the atomic_add should use a release barrier.  This refcount has some
problems, however.  The only reliable way to do a refcount_shared() primitive
would be to do

int
refcount_shared(refcount_t *count)
{
        int rval;

        rval = !refcount_drop(count);
        refcount_hold(count);
        return (rval);
}

But that is evil and has a race condition.  Changing refcount_drop() to return
the current value would be more workable, I suppose.  It would allow you to do
this by doing a hold and then a drop, and checking whether the returned value
is > 1 to see if it's shared.
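
Something along these lines, sketched with C11 atomics rather than the
kernel's atomic(9) ops.  The xrefcount_* names are invented for the
example and this is not what the patch above contains; it also uses
the default (sequentially consistent) ordering instead of trying to
pick the minimal barriers.

```c
#include <assert.h>
#include <stdatomic.h>

typedef atomic_uint xrefcount_t;

void
xrefcount_init(xrefcount_t *count, unsigned int value)
{
	atomic_init(count, value);
}

/* Take a reference.  The caller must already hold one. */
void
xrefcount_hold(xrefcount_t *count)
{
	atomic_fetch_add(count, 1);
}

/* Drop a reference and return the value left after the decrement. */
unsigned int
xrefcount_drop(xrefcount_t *count)
{
	return (atomic_fetch_sub(count, 1) - 1);
}

/*
 * Hold, then drop, and look at what the drop returned.  The count
 * never passes through zero, because the caller's own reference is
 * held the whole time.
 */
int
xrefcount_shared(xrefcount_t *count)
{
	xrefcount_hold(count);
	return (xrefcount_drop(count) > 1);
}
```

A "not shared" answer is stable as long as new references can only be
made from an existing one; a "shared" answer can go stale as soon as
another holder drops its reference.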

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Mon Nov 12 15:35:51 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from pintail.mail.pas.earthlink.net (pintail.mail.pas.earthlink.net [207.217.120.122])
	by hub.freebsd.org (Postfix) with ESMTP
	id 4ED1E37B418; Mon, 12 Nov 2001 15:35:42 -0800 (PST)
Received: from dialup-209.245.136.188.dial1.sanjose1.level3.net ([209.245.136.188] helo=mindspring.com)
	by pintail.mail.pas.earthlink.net with esmtp (Exim 3.33 #1)
	id 163Qc8-0006CX-00; Mon, 12 Nov 2001 15:35:41 -0800
Message-ID: <3BF05CFE.EAE5EEE4@mindspring.com>
Date: Mon, 12 Nov 2001 15:36:30 -0800
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: John Baldwin <jhb@FreeBSD.org>
Cc: freebsd-arch@FreeBSD.org, Robert Watson <rwatson@FreeBSD.org>
Subject: Re: cur{thread/proc}, or not.
References: <XFMail.011112150836.jhb@FreeBSD.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

John Baldwin wrote:
> > If so, then no locking is required, since the LCK CMPXCHG can
> > be utilized to do atomic increment and decrement on the
> > reference counting, without needing locks.
> 
> Except that people keep complaining about using atomic ops for
> ref counts, however that can be done later as an optimization.


Is this the MIPS argument?  There is a way around this problem
on brain damaged processors, which has been known to CS for a
long time.  A heavy-weight idempotent-but-not-atomic portable
approach would make these people happy, since then their pet
processors would not look so much like pigs compared to other
processors that were handicapped by having to run the same
code.

I don't think of it as a premature optimization so much as it
is a premature generalization.  If we want to be general, then
we should provide C code for all but the very platform specific
things, since this would be incredibly more useful for any port
attempt than doing P/V idempotent counting.


> Regarding object credentials, I agree, and I thought that this
> was how things were already performed.

Not where the proc or thread is used to reference the cred,
though there is much code that uses the read-only reference.


> > I think this is the wrong direction, but if you wanted to do this,
> > I think that you would need to put the cur* symbols into the per
> > CPU private pages.  This is problematic in the extreme, because it
> > means that you must set these values each time going down, in order
> > to be able to substitute a per CPU global for the stack reference.
> 
> Errr, Terry.  Where do you think curthread/curproc lives now?  It's
> _already_ in a per-CPU page.  We set curthread/curproc on each context
> switch.

Yes.  That is Evil Overhead That Must Go Away.  My use of "need"
was probably not emphatic enough -- I should have said "MUST forever
after".

This isn't really very clear without my example, where I do the
processing as the result of an interrupt, rather than in the
context of a process.  :-(.


> > I would much rather that the credentials be object referenced off
> > of non-process, non-thread objects, based on whatever the correct
> > scoping really is, for the security model you want to enforce.  My
> > "accept" example is only one of a class of changes that could
> > facilitate this.
> 
> I agree with this.  I think Robert's question wasn't just about socket
> credentials however, his question was why pass a proc pointer (or thread
> pointer) all the way down the stack that is implicitly assumed to be
> curproc/curthread in several places instead of just using curproc/curthread
> which your only response seems to be to suggest that we "change" to doing
> something that we already do.

No; I think that most of the passed references to proc/curproc
can be eliminated.

Now, of course, we will have to deal with the cruft idea of
"curcred"...

I dislike the idea of "cur" anything.  It means that we have
to assume top-down procedural processing, with queueing breaks
at both interrupt and NETISR (to cite specific examples).
Doing this is demonstrably the wrong thing to do, even if we
ignore the global non-cacheable per CPU page overhead.

If anyone has any reservations on this, I suggest they do some
network performance testing with the Duke University port of
the LRP + RESCON code to FreeBSD 4.3 from the original Rice
University code (before anyone gets too happy, there is a
non-commercial use license on this, and I personally think a
queued fair share scheduler has significantly lower overhead
than resource containers, for what that's worth).

Your connection per second rate alone will triple if you use
this approach.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Mon Nov 12 15:37: 1 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from pintail.mail.pas.earthlink.net (pintail.mail.pas.earthlink.net [207.217.120.122])
	by hub.freebsd.org (Postfix) with ESMTP
	id CC64C37B417; Mon, 12 Nov 2001 15:36:59 -0800 (PST)
Received: from dialup-209.245.136.188.dial1.sanjose1.level3.net ([209.245.136.188] helo=mindspring.com)
	by pintail.mail.pas.earthlink.net with esmtp (Exim 3.33 #1)
	id 163QdP-0007WL-00; Mon, 12 Nov 2001 15:36:59 -0800
Message-ID: <3BF05D4C.55A9A459@mindspring.com>
Date: Mon, 12 Nov 2001 15:37:48 -0800
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: John Baldwin <jhb@FreeBSD.org>
Cc: Matthew Dillon <dillon@apollo.backplane.com>,
	freebsd-arch@FreeBSD.ORG, Robert Watson <rwatson@FreeBSD.ORG>
Subject: Re: cur{thread/proc}, or not.
References: <XFMail.011112150837.jhb@FreeBSD.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

John Baldwin wrote:
> the refcount for now, but I still have patches that
> some people don't like for implementing a simple refcount API just using
> atomic operations.

Please commit these.  Using mutexes in this instance is just
a happy way to put the performance in the toilet.

-- Terry



From owner-freebsd-arch  Mon Nov 12 15:50:49 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 0960137B416; Mon, 12 Nov 2001 15:50:46 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fACNojg07127;
	Mon, 12 Nov 2001 15:50:45 -0800 (PST)
	(envelope-from dillon)
Date: Mon, 12 Nov 2001 15:50:45 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111122350.fACNojg07127@apollo.backplane.com>
To: John Baldwin <jhb@FreeBSD.org>
Cc: Terry Lambert <tlambert2@mindspring.com>,
	Robert Watson <rwatson@FreeBSD.org>, freebsd-arch@FreeBSD.org
Subject: Re: cur{thread/proc}, or not.
References:  <XFMail.011112153221.jhb@FreeBSD.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

    You want to be very careful not to bloat the concept.  We
    already have severe bloatage in the mutex code and that has
    led to a lot of unnecessary complexity.  A huge amount,
    in fact.  We have so many types of mutexes it makes my
    head spin and I'm not very happy about it.  Forget about
    'shared' versus 'exclusive'.  A reference count is a
    reference count, that's all.  If you keep the concept
    simple you can implement more functionality horizontally
    rather than implementing more complexity vertically.

    For example, consider this API for pool mutexes.

    /*
     * obtain related pool mutex
     */
    void
    pool_mtx_lock(void *ptr)
    {
    }

    /*
     * release related pool mutex.
     */
    void
    pool_mtx_unlock(void *ptr)
    {
    }
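
A minimal userland sketch of how the two empty routines above might be
filled in, assuming the pool is a fixed array of mutexes selected by
hashing the pointer.  The pool size, the hash, the explicit
pool_mtx_init() call and the use of pthread mutexes in place of kernel
mutexes are all assumptions made for illustration, not part of the
proposal.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical pool size; a power of two so the hash can be masked. */
#define POOL_MTX_COUNT 64

static pthread_mutex_t pool_mtx[POOL_MTX_COUNT];

/* Must be called once before any lock/unlock (assumed startup hook). */
void
pool_mtx_init(void)
{
	for (int i = 0; i < POOL_MTX_COUNT; i++)
		pthread_mutex_init(&pool_mtx[i], NULL);
}

/* Map an address to a slot: fold some higher bits in, then mask. */
static unsigned
pool_mtx_index(const void *ptr)
{
	uintptr_t v = (uintptr_t)ptr;

	return (unsigned)((v >> 4) ^ (v >> 12)) & (POOL_MTX_COUNT - 1);
}

/* Obtain the pool mutex related to ptr. */
void
pool_mtx_lock(void *ptr)
{
	int error = pthread_mutex_lock(&pool_mtx[pool_mtx_index(ptr)]);

	assert(error == 0);
	(void)error;
}

/* Release the pool mutex related to ptr. */
void
pool_mtx_unlock(void *ptr)
{
	pthread_mutex_unlock(&pool_mtx[pool_mtx_index(ptr)]);
}
```

Because two unrelated addresses can hash to the same slot, a caller in
this sketch should not hold two pool mutexes at once; with a
non-recursive mutex that could deadlock against itself.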

    Now consider how this could be combined with, say,
    the zalloc() and zfree() code.  Consider how it could
    be combined with the refcount code.  It might even be
    possible to remove the stable-storage requirement.
    Consider a vnode versus its underlying VM object.
    Consider this:

    vp = vnode ... we already have a ref count on the vp.
    while ((object = vp->v_object) != NULL) {
	pool_mtx_lock(object);
	if (vp->v_object == object)
	    break;
	pool_mtx_unlock(object);
    }
    if (object != NULL) {
	/* object guaranteed to be associated with vnode */
	++object->ref_cnt;
	pool_mtx_unlock(object);
    }

    ... continue working on object

    Structural overhead: 0 bytes
    Parallelism: high

    Now consider how this might be combined with the
    refcnt pool code:


    CODE PIECE 1:

    vp = vnode ... we already have a ref count on the vp.
    while ((object = vp->v_object) != NULL) {
	pool_mtx_lock(&object->ref_cnt);
	if (vp->v_object == object)
	    break;
	pool_mtx_unlock(&object->ref_cnt);
    }
    if (object != NULL) {
	/* object guaranteed to be associated with vnode */
	++object->ref_cnt;
	pool_mtx_unlock(&object->ref_cnt);
    }

    CODE PIECE 2 (compatible with CODE PIECE 1):

    /* object is a known good object that will not be going away soon */
    refcnt_bump(&object->ref_cnt);
    ... use object ...
    refcnt_drop(&object->ref_cnt);


    And there you have it.  An utterly simple API of four
    routines (refcnt routines and pool routines), with a huge
    amount of capability.
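
As a rough illustration of why the two code pieces are compatible, the
refcnt routines could take the same pool mutex, keyed on the address of
the counter, that CODE PIECE 1 takes by hand.  This is only a sketch:
the one-slot stand-in for the pool routines, the pthread mutex, and the
choice to have refcnt_drop() report whether the last reference went
away are assumptions, since the message gives no signatures.

```c
#include <assert.h>
#include <pthread.h>

/*
 * One-slot stand-in for the pool routines so the example is
 * self-contained; a real pool would pick a mutex by hashing ptr.
 */
static pthread_mutex_t pool_slot = PTHREAD_MUTEX_INITIALIZER;

void
pool_mtx_lock(void *ptr)
{
	(void)ptr;
	pthread_mutex_lock(&pool_slot);
}

void
pool_mtx_unlock(void *ptr)
{
	(void)ptr;
	pthread_mutex_unlock(&pool_slot);
}

/* Add a reference; the caller must already hold one. */
void
refcnt_bump(int *cnt)
{
	pool_mtx_lock(cnt);
	assert(*cnt > 0);
	++*cnt;
	pool_mtx_unlock(cnt);
}

/* Drop a reference; returns nonzero if it was the last one (assumed). */
int
refcnt_drop(int *cnt)
{
	int last;

	pool_mtx_lock(cnt);
	assert(*cnt > 0);
	last = (--*cnt == 0);
	pool_mtx_unlock(cnt);
	return last;
}
```

The counter stays a plain int, so the structure that embeds it carries
no extra mutex.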

					-Matt




From owner-freebsd-arch  Mon Nov 12 15:53: 4 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP id 2C06837B417
	for <freebsd-arch@FreeBSD.org>; Mon, 12 Nov 2001 15:52:56 -0800 (PST)
Received: from fledge.watson.org (ak82hjs7hex92j@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.6/8.11.5) with SMTP id fACNqjB37001;
	Mon, 12 Nov 2001 18:52:45 -0500 (EST)
	(envelope-from robert@fledge.watson.org)
Date: Mon, 12 Nov 2001 18:52:45 -0500 (EST)
From: Robert Watson <rwatson@FreeBSD.org>
X-Sender: robert@fledge.watson.org
To: Terry Lambert <tlambert2@mindspring.com>
Cc: freebsd-arch@FreeBSD.org
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <3BF05241.74F895EF@mindspring.com>
Message-ID: <Pine.NEB.3.96L.1011112183454.36592A-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

On Mon, 12 Nov 2001, Terry Lambert wrote:

> Robert Watson wrote:
> > There are a number of uses of curproc in the netinet code, used to
> > retrieve credentials for authorization somewhere down the stack, when no
> > proc or thread pointer has been passed down.
> 
> I think that the majority of the netinet code can be handled by using
> the socket credential, instead of the process credential. 

The majority, yes, but not all.  There are a number of desirable
behaviors where you *do* want to use the process credential, notably
those relating to binding, where current semantics permit a 'privileged
process' to create and bind sockets such that they have access to
otherwise restricted ports, and to transfer them to unprivileged
processes without granting the full scope of privilege to those
processes.  A primary example in practice might be a situation where an
I/O socket is handed off from a network daemon to an unprivileged process,
such as inetd handing off to fingerd: fingerd should not retain inetd's
privileges regarding many aspects of the socket's behavior.  This argument
might be seen more convincingly from the perspective of UDP sockets.  Yes,
it is true that in most cases use of the socket credential is desirable,
but in a number of important cases, it is not.

There are some related cases in VFS, where we consider a per-jail
securelevel based on the acting process, not the file-opening process. 
Similarly, there are some ioctl's on tty devices that are subject
(process)  credential authorized: these are in general present to handle
the case where descriptors to these objects are (and must be) inherited.
There are some related cases, such as fd passing via unix domain sockets,
where the same properties can prove very useful: the ability to transfer
access to sockets/files via LPC as 'rights' rather than delegating all
rights. 

> > With the eventual addition
> > of td->td_ucred, it will be desirable to use the credential for the
> > current thread, rather than the proc, which will require locking to use.
> 
> I think locking credential instances is bad.

That is not what we're talking about.  We're talking about locking the
process structure.  No one is suggesting locking credential instances.

> The real question you want to answer is whether or not the credential
> instance that was used to acquire a socket should be used continuously
> from there on out (i.e. it is a grant), or whether it should change when
> the process credential changes (i.e. it is a lease).  You seem to be
> arguing for a lease.  I would argue for a grant. 
> 
> One issue is that there are cases where write permission is tested
> before each write.  There are also cases, where you obtain a privileged
> socket, and then relinquish privileges after obtaining it; such cases
> are explicitly modelled on a grant model rather than a lease model. 
> 
> The point is that if the credentials are granted, then a change in
> credential is not a change of the credential itself, but is instead a
> copy-on-write proposition.  In other words, credentials, once granted,
> are privilege stable. 
> 
> If this is the case, then they are written when they are instanced,
> cloned before they are modified (indeed, it seems that the clone/modify
> operation must be made atomic), and thus are never written once
> instanced -- only destroyed on the 1->0 reference transition. 

Everyone agrees that the ucred semantics are copy-on-write.  This is
well-documented, and not something we're currently interested in changing
(although some platforms have opted to sacrifice memory in order to reduce
locking/atomic operations, and that's something we might eventually want
to consider if we move to very fine-grained and highly parallel
operation). 
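
A minimal sketch of the copy-on-write discipline, assuming a
one-field credential and C11 atomics for the reference count.  The
cred_* names, the struct layout and the use of malloc() are
hypothetical stand-ins for illustration, not the kernel's actual
interfaces, and the use of atomics here is itself one of the points
under debate.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct ucred {
	atomic_int cr_ref;	/* reference count */
	unsigned cr_uid;	/* effective uid (only field modelled) */
};

/* Create a credential holding one reference. */
struct ucred *
cred_alloc(unsigned uid)
{
	struct ucred *cr = malloc(sizeof(*cr));

	assert(cr != NULL);
	atomic_init(&cr->cr_ref, 1);
	cr->cr_uid = uid;
	return cr;
}

/* Take an additional reference, e.g. to cache in a socket. */
struct ucred *
cred_hold(struct ucred *cr)
{
	atomic_fetch_add(&cr->cr_ref, 1);
	return cr;
}

/* Release a reference; destroy on the 1->0 transition. */
void
cred_free(struct ucred *cr)
{
	if (atomic_fetch_sub(&cr->cr_ref, 1) == 1)
		free(cr);
}

/*
 * Change the uid copy-on-write: the old credential is never written.
 * The caller's reference to it is consumed and a new credential is
 * returned in its place.
 */
struct ucred *
cred_setuid(struct ucred *old, unsigned uid)
{
	struct ucred *new = cred_alloc(uid);

	cred_free(old);
	return new;
}
```

Anything that cached the old credential keeps seeing the old values
after the process changes its own.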

> If so, then no locking is required, since the LOCK CMPXCHG can be
> utilized to do atomic increment and decrement on the reference counting,
> without needing locks.

There is some disagreement on the topic of atomic operations due to
portability issues (among other things), but that's not what we're talking
about. 

> > As I
> > understand it, use of curproc was branded 'undesirable' at some point in
> > the semi-distant past, and since that time, a reference to 'proc' has been
> > passed down the stack.  With a change to KSE, this has been translated to
> > references the thread, but the issue remains the same.  This comes up in
> > particular because I have a tree where I have propagated the thread
> > pointer down if_ioctl in the network stack: the normal ioctl call carries
> > a thread pointer now, but when it is translated into if_ioctl by the
> > network stack, that pointer is lost.  This raises the question: should we
> > (in practice) be adding process or thread pointers to many more of the
> > function arguments, or should we switch to using curproc/curthread
> > instead.
> 
> The "curproc" undesirability stems primarily from credentials
> enforcement during interrupt processing.  I think that this is not an
> insurmountable issue, but I would argue that these are more appropriate
> for object credentials, where the objects in question are not threads or
> processes. 
> 
> For example, if we were to process incoming TCP connections up through
> the "accept" code at interrupt time, one might naively assume that,
> since the current socket code down through the accept processing code
> off the queue filled in at NETISR seems to require a proc credential,
> that it is therefore necessary to have a proc credential at interrupt
> time in order to do this processing.
> 
> The answer is that this is a false assumption, and is predicated on
> historical code, and nothing more.
> 
> Specifically, if I need a credential for a newly accepted socket that I
> am now creating, I can add a reference to the listen socket credential
> -- I //do not need// a process credential in order to do an accept. 
> 
> There is a lot of this type of fuzzy thinking, asking "how can I
> propagate the process credential that I used to use for this operation
> down to the underlying code?", when the real question should be "what is
> the appropriate credential to use for this operation, and is the process
> credential really what I want to use in this case?".

I agree there has been a lot of fuzzy thinking.  I also agree that, in
every case, we need to carefully consider the credential used.  In
particular, this is true in the 'new world order' of td_ucred, where we'll
now often have three credentials to decide from:

(1) Mutable p_ucred (requires proc lock)
(2) Cached td_ucred (requires no lock)
(3) Cached so->so_cred, file->f_cred, et al.

In most cases, (2) or (3) will be appropriate.  In some situations,
particularly when it comes to credential update, (1) will be appropriate. 

> I think it's possible to get rid of most of the process credential
> references -- and therefore, most of the proc references -- at all
> points below the /sys/kern/uipc_socket*.c level. 

No, it's not, in a number of very important cases, of which I've
identified at least three above.

Structuring code to have a notion of "but the kernel asked" vs. "but a
user asked" is difficult, and something I'm not sure we have a grasp on
how to approach.  Sometimes, for example, FSCRED or NOCRED is used as a
"special-case" credential to say "do it anyway".  This is often broken
when it comes to distributed file systems where a client system may not
simply be able to assert "because I said so", and probably reflects
unclear thinking on the topic.

> > I don't pretend to have a grasp of all the issues here, so the purpose of
> > this message is to raise the issues so that I can understand them.  I have
> > a tree where I've eliminated many references to curproc; however, I'm now
> > wondering if it wouldn't simply be more useful to eliminate many of the
> > references to struct proc in the function arguments, and use curproc
> > instead, and add references to ucred (and related ref-counted structures)
> > as needed for delegation types of situations.  In particular, that would
> > suggest the following changes:
> 
> I think this is the wrong direction, but if you wanted to do this, I
> think that you would need to put the cur* symbols into the per CPU
> private pages.  This is problematic in the extreme, because it means
> that you must set these values each time going down, in order to be able
> to substitute a per CPU global for the stack reference. 
> 
> I think this is a bad thing, in general, and will lead only to trouble
> later. 
> 
> I would much rather that the credentials be object referenced off of
> non-process, non-thread objects, based on whatever the correct scoping
> really is, for the security model you want to enforce.  My "accept"
> example is only one of a class of changes that could facilitate this. 

I think everyone agrees that the 'cached credential' model is the right
approach for many of these cases, but I think it's over-reaching to claim
it's appropriate in all cases.  The question then becomes, how do we
access the relevant 'subject' credential to authorize the operation: is it
something that is passed down via the call stack (possibly via 'struct
thread *td'), or is it something implicit to the run-time environment
('curproc'/'curthread'), which is precisely the question I was trying to
resolve through my post.  If 'curproc'/'curthread' is truly undesirable,
then we can simply eliminate its use, and replace that with almost
universal passing of 'struct thread' (for the purposes of authorization,
but also for other purposes: target of copyin/copyout/aio, scheduling,
ktrace, ...).  If it is acceptable to maintain the use of curproc, we may
want to change some of our primitives to represent it being available. 

Right now, we're in a state of limbo: the official policy (if you will) is
'XXX'.  We should either eliminate it from general use, or we should use
it where it's appropriate :-). 

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services




From owner-freebsd-arch  Mon Nov 12 15:56:58 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP id 05D5337B416
	for <freebsd-arch@FreeBSD.ORG>; Mon, 12 Nov 2001 15:56:56 -0800 (PST)
Received: from fledge.watson.org (ak82hjs7hex92j@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.6/8.11.5) with SMTP id fACNudB37043;
	Mon, 12 Nov 2001 18:56:39 -0500 (EST)
	(envelope-from robert@fledge.watson.org)
Date: Mon, 12 Nov 2001 18:56:38 -0500 (EST)
From: Robert Watson <rwatson@FreeBSD.ORG>
X-Sender: robert@fledge.watson.org
To: Terry Lambert <tlambert2@mindspring.com>
Cc: Matthew Dillon <dillon@apollo.backplane.com>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <3BF05877.B9E886D8@mindspring.com>
Message-ID: <Pine.NEB.3.96L.1011112185320.36592B-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On Mon, 12 Nov 2001, Terry Lambert wrote:

> Matthew Dillon wrote:
> >     Yes, I believe this is how credentials work.  I looked at
> >     the code about 6 months ago.  We should not have to do any
> >     locking of the credential stuff, only simple mutexing
> >     around the ref counter.  That is how it should work, and
> >     is how I believe it currently works.
> 
> FWIW:
> 
> Robert had implied that more heavyweight locking of the process (or
> thread) structure was necessary to access the credential, which is
> correct, if you are referencing it that way. 

In the proposed model, there are two relevant subject credentials: the
thread credential, and the process credential.  The thread credential is
static for the lifetime of the system call, and while the call is
on-going, it can be used without any locking/atomic primitives (with the
exception of when additional references are added to be cached in
objects).  The process credential is shared, and, if you will, the 'real'
copy.  This reference is changed as the process's notion of credential is
updated, and requires locks, as it might be changed by multiple threads
(potentially in parallel), as well as inspected by other processes for the
purposes of reporting (to ps, for example), or for access control (signal
delivery, debugging, ...)
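
The two-credential model might be sketched roughly as follows.  The
syscall_enter() hook, the may_bind_low_port() check, the pthread mutex
standing in for the proc lock, and the plain int reference count
guarded by that lock are all simplifying assumptions; the sketch also
never frees a credential and always takes the lock on entry, which the
real design may well avoid.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct ucred {
	int cr_ref;		/* guarded by the proc lock in this sketch */
	unsigned cr_uid;
};

struct proc {
	pthread_mutex_t p_mtx;	/* stands in for the proc lock */
	struct ucred *p_ucred;	/* the 'real', mutable reference */
};

struct thread {
	struct proc *td_proc;
	struct ucred *td_ucred;	/* snapshot, stable during a syscall */
};

/* Hypothetical hook run on entry to each system call. */
void
syscall_enter(struct thread *td)
{
	struct proc *p = td->td_proc;

	pthread_mutex_lock(&p->p_mtx);
	assert(p->p_ucred != NULL);
	if (td->td_ucred != p->p_ucred) {
		/* Swap the thread's reference over to the new credential. */
		if (td->td_ucred != NULL)
			td->td_ucred->cr_ref--;
		td->td_ucred = p->p_ucred;
		td->td_ucred->cr_ref++;
	}
	pthread_mutex_unlock(&p->p_mtx);
}

/* Inside the call no lock is taken: td_ucred cannot change under us. */
int
may_bind_low_port(const struct thread *td)
{
	return td->td_ucred->cr_uid == 0;
}
```

A credential change made by another thread only becomes visible to
this thread at its next system call entry.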

One of the many nice things about the model, which should be credited to
John, is that it doesn't require locking operations in most usage
situations.

> The part of my message you quoted here was a conclusion based on using direct
> references to value-stable credentials rather than value-volatile proc
> or thread structs.  It only works to refute Robert's argument if you
> include that; it's not correct to conclude that the way it currently
> works is sufficient in the face of the proc/thread dereference issues
> that Robert was trying to address (and which I tried to address by
> avoiding entirely). 

...


Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services




From owner-freebsd-arch  Mon Nov 12 15:57:42 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 02BEE37B417; Mon, 12 Nov 2001 15:57:39 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fACNvc507188;
	Mon, 12 Nov 2001 15:57:38 -0800 (PST)
	(envelope-from dillon)
Date: Mon, 12 Nov 2001 15:57:38 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111122357.fACNvc507188@apollo.backplane.com>
To: John Baldwin <jhb@FreeBSD.ORG>
Cc: Terry Lambert <tlambert2@mindspring.com>,
	Robert Watson <rwatson@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References:  <XFMail.011112153221.jhb@FreeBSD.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

:http://www.freebsd.org/~jhb/patches/refcount.patch
:
:It's slightly different than this in that refcount_drop() returns a boolean

    Ok, I've read it.  Ick.  Could you reorganize it a bit to do something
    slightly different?

    Make sys/refcount.h provide a machine portable set of routines.  Allow
    the machine/refcount.h headers to override the portable set.  This way
    an architecture does *NOT* need to implement routines for yet another
    header file (or duplicate a lot of code over and over again).
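
One way to picture the suggested layering, collapsed into a single
file for illustration: the machine-dependent part defines a marker
macro when it supplies its own routines, and the portable part only
fills in what is missing.  The REFCOUNT_MD_OPS macro, the function
names, the use of the compiler's __atomic builtins as the "machine"
version and the single pthread mutex in the fallback are all
assumptions; nothing here is taken from the patch under discussion.

```c
#include <assert.h>
#include <pthread.h>

/* ---- would live in a hypothetical <machine/refcount.h> ---- */
#if defined(__GNUC__)		/* stand-in for "this port has atomics" */
#define	REFCOUNT_MD_OPS
static inline void
refcount_bump(int *cnt)
{
	__atomic_fetch_add(cnt, 1, __ATOMIC_SEQ_CST);
}

/* Returns nonzero when the last reference was dropped. */
static inline int
refcount_drop(int *cnt)
{
	return __atomic_sub_fetch(cnt, 1, __ATOMIC_SEQ_CST) == 0;
}
#endif

/* ---- would live in <sys/refcount.h>: portable fallback ---- */
#ifndef REFCOUNT_MD_OPS
static pthread_mutex_t refcount_mtx = PTHREAD_MUTEX_INITIALIZER;

static inline void
refcount_bump(int *cnt)
{
	pthread_mutex_lock(&refcount_mtx);
	++*cnt;
	pthread_mutex_unlock(&refcount_mtx);
}

static inline int
refcount_drop(int *cnt)
{
	int last;

	pthread_mutex_lock(&refcount_mtx);
	last = (--*cnt == 0);
	pthread_mutex_unlock(&refcount_mtx);
	return last;
}
#endif
```

Either way the counter is a plain int, so structure layouts do not
change with the implementation chosen.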

    This business about INVARIANTS makes no sense to me.  INVARIANTS should
    not totally change the way the refcount API works.  It certainly should
    not result in different structures!  If we are embedding ref counts
    in every structure in the system, simply setting or clearing INVARIANTS
    blows up our compatibility, which is bad.
     
    Also, I don't see any reason to embed yet another mutex in a structure.
    The ref count should be a simple int.  Use a pool of mutexes.  If you like
    I'll commit a set of generic pool mutexes that you can simply call.  How 
    about that?

							-Matt




From owner-freebsd-arch  Mon Nov 12 15:58: 3 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205])
	by hub.freebsd.org (Postfix) with ESMTP id E40DC37B416
	for <freebsd-arch@FreeBSD.org>; Mon, 12 Nov 2001 15:57:59 -0800 (PST)
Received: (qmail 10400 invoked from network); 12 Nov 2001 23:57:58 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <dillon@apollo.backplane.com>; 12 Nov 2001 23:57:58 -0000
Message-ID: <XFMail.011112155752.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <200111122350.fACNojg07127@apollo.backplane.com>
Date: Mon, 12 Nov 2001 15:57:52 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.org, Robert Watson <rwatson@FreeBSD.org>,
	Terry Lambert <tlambert2@mindspring.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 12-Nov-01 Matthew Dillon wrote:
>     You want to be very careful not to bloat the concept.  We
>     already have severe bloatage in the mutex code and that has
>     led to a lot of unnecessary complexity.  A huge amount,
>     in fact.  We have so many types of mutexes it makes my
>     head spin and I'm not very happy about it.  Forget about
>     'shared' versus 'exclusive'.  A reference count is a
>     reference count, that's all.  If you keep the concept
>     simple you can implement more functionality horizontally
>     rather than implementing more complexity vertically.

Err, hang on.  I wasn't doing shared counts.   refcount_shared() would be a
simple primitive to return true if the refcount was > 1.  I was trying to see
how the current API would fit with ucred mutexes, for example.  If you had
looked at the patch, you would find that the API is very simple.  What I really
should do is add atomic_fetchadd() (fetchadd on ia64, xadd on 486+, locked
load/conditional store loop on alpha, simulated with atomic_cmpset() on other
archs if needed) and refcount_drop() can just be atomic_fetchadd().  This will
change refcount_drop() to return the current value rather than whether the
value is zero.  Please reread my mail and the patch itself.
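
For architectures without a native fetch-and-add, the simulation
mentioned above could look roughly like this.  It is a sketch using
C11 atomics: the atomic_cmpset() here is a local stand-in written for
the example, and returning the post-decrement value from
refcount_drop() is one reading of "the current value", which the
message does not pin down.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Local stand-in for a compare-and-set primitive: store 'new' only if
 * the target still holds 'old'; return nonzero on success.
 */
static int
atomic_cmpset(atomic_int *p, int old, int new)
{
	return atomic_compare_exchange_strong(p, &old, new);
}

/* Fetch-and-add simulated with a compare-and-set retry loop. */
static int
atomic_fetchadd(atomic_int *p, int delta)
{
	int old;

	do {
		old = atomic_load(p);
	} while (!atomic_cmpset(p, old, old + delta));
	return old;		/* value before the addition */
}

void
refcount_bump(atomic_int *cnt)
{
	(void)atomic_fetchadd(cnt, 1);
}

/* Returns the count after the drop; 0 means the last reference went. */
int
refcount_drop(atomic_int *cnt)
{
	return atomic_fetchadd(cnt, -1) - 1;
}
```

A caller would free the object when refcount_drop() returns 0.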

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Mon Nov 12 15:59: 1 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 6CBD137B41B; Mon, 12 Nov 2001 15:58:59 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fACNwxq07227;
	Mon, 12 Nov 2001 15:58:59 -0800 (PST)
	(envelope-from dillon)
Date: Mon, 12 Nov 2001 15:58:59 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111122358.fACNwxq07227@apollo.backplane.com>
To: John Baldwin <jhb@FreeBSD.ORG>
Cc: freebsd-arch@FreeBSD.ORG, Robert Watson <rwatson@FreeBSD.ORG>,
	Terry Lambert <tlambert2@mindspring.com>
Subject: Re: cur{thread/proc}, or not.
References:  <XFMail.011112155752.jhb@FreeBSD.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


:
:
:On 12-Nov-01 Matthew Dillon wrote:
:>     You want to be very careful not to bloat the concept.  We
:>     already have severe bloatage in the mutex code and that has
:>     led to a lot of unnecessary complexity.  A huge amount,
:>     in fact.  We have so many types of mutexes it makes my
:
:Err, hang on.  I wasn't doing shared counts.   refcount_shared() would be a
:simple primitive to return true if the refcount was > 1.  I was trying to see

   Sorry.  Posted that before I read the patch.

					-Matt



From owner-freebsd-arch  Mon Nov 12 16: 0:26 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP id A175C37B417
	for <arch@freebsd.org>; Mon, 12 Nov 2001 16:00:20 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id PAA95484
	for <arch@freebsd.org>; Mon, 12 Nov 2001 15:48:49 -0800 (PST)
Date: Mon, 12 Nov 2001 15:48:47 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: arch@freebsd.org
Subject: Thread scheduling in the kernel
Message-ID: <Pine.BSF.4.21.0111121507240.94926-100000@InterJet.elischer.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


In an attempt to get the next part of the KSE work designed (design before
code you know.. a strange new concept) I've been trying to work out
the "correct" scheduling methods for such a system.

There are a few 'tricks' that need to be taken into account..

a few notes..


1/ Since threads running a syscall hit 'sleep' events
the entities on the sleep queues must be the threads.

2/ the entity that is scheduled onto the run queues is the KSE.
(as the name suggests).

3/ If we have only one run queue, then KSEs for several processors
from the same process, may be on the same queue.

4/ If threads 'wake up' they are hung off a list of runnable threads
somewhere. This list could be hanging off the process, or the KSE.
(actually more likely the KSEgroup than the process but...)

5/ If a KSE reaches the front of the queue, but the process
that is running is not that for which that KSE has some affinity,
does it get out of the way to allow another KSE in the queue
to get run? or does it just run and 'switch' everything over to the new
available processor? Maybe the scheduler looks for the KSE from the same
group, that was assigned to that processor, and runs that, leaving
the original KSE at the head of the queue? 
Maybe that happens until all the KSEs in the queue
that were from that group have been run? In this case it becomes possible
to always have a KSE from that group ready...

Maybe the KSE-GROUP is what is put unto the run queue, and KSEs from that
group are put on all processors that look for work, until all of them 
have been run? (this would ensure that threads from the same process
would all be run at the same time which is sometimes good, and sometimes
bad, depending on the application.

6/ When a thread is made runnable it gets (in the present system) a
priority. What priority does a KSE in the run queues have when it has
threads of several different priorities? Do we sort them in priority order
and drop the priority of the KSE(group) as we go through them
until we have less priority than some other KSE?

7/ When a KSE runs out of work, how does it decide whether there is work
that should be stolen from a fellow KSE? How does processor affinity
affect this?

8/ If we had per-processor scheduling queues, how would that affect it?
Which element gets put on the queues? Does a KSE
stay on the run queue if it has un-run threads, even when it's running?
How do we handle the arrival of new runnable threads with a KSE
when it's running but a fellow KSE is not runnable? Do we
bump the priority of the other KSE and hand it the new threads?


remember: here are the 4 structures:

proc  -   owner of all resources (FDs, memory, user creds) except cpu

Ksegroup -  owner of all scheduler controlling characteristics
	(e.g. nice, realtime, number of processors),  N per process.
	Owner of stats used for scheduling calculations.	

kse -	kind of a placeholder.  It gets scheduled onto
	a processor (by an as yet unnamed mechanism) and provides
	cpu-cycles for the execution of 'threads' (see next).
	Max. of one per processor per KSE-group.

thread -  The in-kernel incarnation of a user thread that is presently
	in the kernel for some reason (e.g. syscall, pagefault, etc.)
	Holds ALL the state needed to resume after sleeping, and is the
	entity that is suspended when the thread hits a 'sleep'.
	"Unlimited" per KSEgroup.  Probably have a short-term
	"favourite" KSE/processor.


When a thread blocks, the KSE looks for another thread to run, and if it
doesn't find one, it will create one, and upcall back to the 
userland to see if there are more userland threads to run.
(if not, it returns to yield the processor)
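
The four structures listed above can be sketched roughly as follows.  This
is only an illustration of the ownership relationships described in this
message: every field name and the helper ksegrp_attach_kse() are
hypothetical, and hanging the thread list off the KSE group is just one of
the options under discussion, not a settled design.

```c
#include <assert.h>
#include <stddef.h>

struct ksegrp;

/* proc: owner of all resources (FDs, memory, user creds) except CPU. */
struct proc {
	struct ksegrp *p_ksegrps;	/* N KSE groups per process */
};

/* kse: placeholder that gets scheduled onto a processor. */
struct kse {
	struct ksegrp *ke_ksegrp;	/* group this KSE provides cycles to */
	int	       ke_cpu;		/* processor this KSE stands for */
	struct kse    *ke_next;
};

/* thread: in-kernel state of a user thread; the entity that sleeps. */
struct thread {
	struct ksegrp *td_ksegrp;
	int	       td_priority;
	int	       td_lastcpu;	/* short-term "favourite" processor */
	struct thread *td_next;
};

/* ksegrp: owner of scheduling characteristics and statistics. */
struct ksegrp {
	struct proc   *kg_proc;
	int	       kg_nice;
	struct kse    *kg_kses;		/* at most one per processor */
	struct thread *kg_threads;	/* "unlimited" threads per group */
	struct ksegrp *kg_next;
};

/*
 * Attach a KSE to a group, enforcing "max. of one per processor per
 * KSE-group".  Returns 0 on success, or -1 if the group already has a
 * KSE for that processor.
 */
int
ksegrp_attach_kse(struct ksegrp *kg, struct kse *ke)
{
	struct kse *k;

	for (k = kg->kg_kses; k != NULL; k = k->ke_next)
		if (k->ke_cpu == ke->ke_cpu)
			return (-1);
	ke->ke_ksegrp = kg;
	ke->ke_next = kg->kg_kses;
	kg->kg_kses = ke;
	return (0);
}
```

A second KSE for a processor that the group already covers is rejected,
which is the only invariant the list above actually states.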

The question that has been giving me headaches is the 
relationship between these elements, and
the definitions of how these structures are linked up and moved
around to provide fair efficient scheduling.

If a KSE has a high priority thread and a low priority thread
runnable in the kernel, but in reverse order, should it take
the high priority from the higher prio. thread and process both,
or should it order the threads and run the high prio. one first?
In this case what happens when a higher prio. thread becomes runnable
while one is already running? And if the highest prio. thread returns to
userland, should the processor move to userland to follow it, or
switch to the next priority thread in the kernel?
Do all threads in the kernel have priority over all threads in userland?
(This might be a reasonable decision.)

These and other questions are in need of real discussion here on -arch.
We need to develop a document somewhere describing how we want this to work.

If we can have a good discussion here on these topics over a couple of
days I'll attempt to produce such a document
and submit it for comment as the basis of a second round of discussions.

Julian


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Mon Nov 12 16: 0:39 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP
	id 75DE337B41C; Mon, 12 Nov 2001 16:00:32 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id PAA95502;
	Mon, 12 Nov 2001 15:53:29 -0800 (PST)
Date: Mon, 12 Nov 2001 15:53:28 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: John Baldwin <jhb@FreeBSD.org>
Cc: Matthew Dillon <dillon@apollo.backplane.com>,
	freebsd-arch@FreeBSD.ORG, Robert Watson <rwatson@FreeBSD.ORG>,
	Terry Lambert <tlambert2@mindspring.com>
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <XFMail.011112150837.jhb@FreeBSD.org>
Message-ID: <Pine.BSF.4.21.0111121552290.94926-100000@InterJet.elischer.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

We should re-examine the 'refcount' API.

It's a very basic type and getting more so all the time..
We can afford to have a 'standard' 'safe' way of doing reference counts.


On Mon, 12 Nov 2001, John Baldwin wrote:

> 
> On 12-Nov-01 Matthew Dillon wrote:
> >:The point is that if the credentials are granted, then a
> >:change in credential is not a change of the credential itself,
> >:but is instead a copy-on-write proposition.  In other words,
> >:credentials, once granted, are priviledge stable.
> >:
> >:If this is the case, then they are written when they are
> >:instanced, cloned before they are modified (indeed, it seems
> >:that the clone/modify operation must be made atomic), and
> >:thus are never written once instanced -- only destroyed on
> >:the 1->0 reference transition.
> >:
> >:If so, then no locking is required, since the LCK CMPXCHG can
> >:be utilized to do atomic increment and decrement on the
> >:reference counting, without needing locks.
> >:...
> >:
> >:-- Terry
> > 
> >     Yes, I believe this is how credentials work.  I looked at
> >     the code about 6 months ago.  We should not have to do any
> >     locking of the credential stuff, only simple mutexing
> >     around the ref counter.  That is how it should work
> >     is how I believe it currently works.
> 
> Yep.  They use a mutex for the refcount for now, but I still have patches that
> some people don't like for implementing a simple refcount API just using atomic
> operations.
> 
> -- 
> 
> John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
> PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
> "Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/
> 




From owner-freebsd-arch  Mon Nov 12 16:20:18 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP
	id 8B62137B41B; Mon, 12 Nov 2001 16:20:10 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id QAA95611;
	Mon, 12 Nov 2001 16:10:26 -0800 (PST)
Date: Mon, 12 Nov 2001 16:10:25 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: John Baldwin <jhb@FreeBSD.org>,
	Terry Lambert <tlambert2@mindspring.com>,
	Robert Watson <rwatson@FreeBSD.org>, freebsd-arch@FreeBSD.org
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <200111122350.fACNojg07127@apollo.backplane.com>
Message-ID: <Pine.BSF.4.21.0111121608240.94926-100000@InterJet.elischer.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On Mon, 12 Nov 2001, Matthew Dillon wrote:

>     You want to be very careful not to bloat the concept.  We
>     already have severe bloatage in the mutex code and that has
>     led to a lot of unnecessary complexity.  A huge amount,
>     in fact.  We have so many types of mutexes it makes my
>     head spin and I'm not very happy about it.  Forget about 
>     'shared' verses 'exclusive'.  A reference count is a 
>     reference count, that's all.  If you keep the concept
>     simple you can implement more functionality horizontally
>     rather then implementing more complexity vertically.
> 
>     For example, consider this API for pool mutexes.

[...]

Weren't you just complaining that there were too many kinds of mutex?
I'm not sure how this fits under "reference counting API".

Anyhow, can you explain the idea of a pool mutex more clearly?


> 
> 
>     And there you have it.  An utterly simple API of four
>     routines (refcnt routines and pool routines), with a huge
>     amount of capability.
> 
> 					-Matt
> 
> 




From owner-freebsd-arch  Mon Nov 12 16:23:29 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 55EC337B416; Mon, 12 Nov 2001 16:23:27 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fAD0Msb07370;
	Mon, 12 Nov 2001 16:22:54 -0800 (PST)
	(envelope-from dillon)
Date: Mon, 12 Nov 2001 16:22:54 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111130022.fAD0Msb07370@apollo.backplane.com>
To: Julian Elischer <julian@elischer.org>
Cc: John Baldwin <jhb@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG,
	Robert Watson <rwatson@FreeBSD.ORG>,
	Terry Lambert <tlambert2@mindspring.com>
Subject: Re: cur{thread/proc}, or not.
References:  <Pine.BSF.4.21.0111121552290.94926-100000@InterJet.elischer.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


:
:we should re-examine teh 'refcount' API
:
:it's a very basic type and gettin gmore-so all the time..
:we can affort to have a 'standard' 'safe' way of doing reference counts.
:

    Well, the question we face here is:  should a refcount API be self
    contained - apply only to ref counts, or should it be interlockable
    with other functionality?

    The best example of what I'm asking here can be found by observing 
    the existing vnode interlock.  A single interlock mutex in each vnode
    currently handles a bunch of chores:  (1) It locks v_usecount
    flags, (2) it interlocks the higher-level lockmgr lock, and (3) it
    interlocks certain combined operations.

    The current refcount API that John proposes would not be sufficient
    to be useful for the vnode v_usecount, but it probably would be
    sufficient for something like the ucred cr_ref count.

    What about other structures in the system?  Do we need self-contained
    ref counts ala ucred, or do we need interlocking ref counts ala vnode?

						-Matt




From owner-freebsd-arch  Mon Nov 12 16:24:46 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail12.speakeasy.net (mail12.speakeasy.net [216.254.0.212])
	by hub.freebsd.org (Postfix) with ESMTP id 27D6F37B41B
	for <freebsd-arch@FreeBSD.org>; Mon, 12 Nov 2001 16:24:38 -0800 (PST)
Received: (qmail 70062 invoked from network); 13 Nov 2001 00:24:37 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail12.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <julian@elischer.org>; 13 Nov 2001 00:24:37 -0000
Message-ID: <XFMail.011112162432.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <Pine.BSF.4.21.0111121608240.94926-100000@InterJet.elischer.org>
Date: Mon, 12 Nov 2001 16:24:32 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Julian Elischer <julian@elischer.org>
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.org, Robert Watson <rwatson@FreeBSD.org>,
	Terry Lambert <tlambert2@mindspring.com>,
	Matthew Dillon <dillon@apollo.backplane.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 13-Nov-01 Julian Elischer wrote:
> 
> 
> On Mon, 12 Nov 2001, Matthew Dillon wrote:
> 
>>     You want to be very careful not to bloat the concept.  We
>>     already have severe bloatage in the mutex code and that has
>>     led to a lot of unnecessary complexity.  A huge amount,
>>     in fact.  We have so many types of mutexes it makes my
>>     head spin and I'm not very happy about it.  Forget about 
>>     'shared' verses 'exclusive'.  A reference count is a 
>>     reference count, that's all.  If you keep the concept
>>     simple you can implement more functionality horizontally
>>     rather then implementing more complexity vertically.
>> 
>>     For example, consider this API for pool mutexes.
> 
> [...]
> 
> weren't you just complaining that there were too many kinds of mutex?
> I'm not sure how this fits under "reference counting API"
> 
> ANyhow can you explain the idea of a pool mutex more clearly?

Heh, think of it as a pool of mutexes, not a different type of mutex.  Instead
of having 1 mutex for each object, you use a hash table of mutexes for a set of
objects.  If you embed 1 mutex in each object, you bloat each object, and going
from 50 objects to 500 objects means 500 locks instead of 50 locks.
Using pool mutexes, you only have N mutexes regardless of the number
of objects.  Note that if pool mutexes are non-recursive, they can't be safely
used when you might have more than one object of a given set locked at a time,
since two objects can hash to the same mutex.
For example, process locks are the only object we do this with currently.

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Mon Nov 12 16:24:45 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205])
	by hub.freebsd.org (Postfix) with ESMTP id BA7CD37B405
	for <freebsd-arch@FreeBSD.ORG>; Mon, 12 Nov 2001 16:24:34 -0800 (PST)
Received: (qmail 2691 invoked from network); 13 Nov 2001 00:24:34 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <dillon@apollo.backplane.com>; 13 Nov 2001 00:24:34 -0000
Message-ID: <XFMail.011112162428.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <200111122357.fACNvc507188@apollo.backplane.com>
Date: Mon, 12 Nov 2001 16:24:28 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.ORG, Robert Watson <rwatson@FreeBSD.ORG>,
	Terry Lambert <tlambert2@mindspring.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 12-Nov-01 Matthew Dillon wrote:
>:http://www.freebsd.org/~jhb/patches/refcount.patch
>:
>:It's slightly different than this in that refcount_drop() returns a boolean
> 
>     Ok, I've read it.  Ick.  Could you reorgranize it a bit to do something
>     slightly different?
> 
>     Make sys/refcount.h provide a machine portable set of routines.  Allow
>     the machine/refcount.h headers to override the portable set.  This way
>     an architecture does *NOT* need to implement routines for yet another
>     header file (or duplicate a lot of code over and over again).

Actually, if I add atomic_fetchadd(), the whole thing becomes MI and can just
live in sys/refcount.h.
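
As a rough illustration of why a fetch-and-add primitive is enough, here is
a minimal sketch in portable C11.  It is not the contents of the patch:
C11's atomic_fetch_add()/atomic_fetch_sub() stand in for the proposed
atomic_fetchadd(), and the refcount_init()/refcount_acquire() names and
signatures are assumptions made for the example.  Only the boolean return
of refcount_drop() is taken from this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* The count is a plain atomic integer; no mutex is embedded. */
void
refcount_init(atomic_uint *rc, unsigned value)
{
	atomic_init(rc, value);
}

/* Take an additional reference on an object we already hold. */
void
refcount_acquire(atomic_uint *rc)
{
	unsigned old = atomic_fetch_add(rc, 1);

	/* Acquiring from zero would resurrect a dead object. */
	assert(old > 0);
	(void)old;
}

/*
 * Drop a reference.  Returns true only for the caller that performed
 * the 1->0 transition, i.e. the one that must destroy the object.
 */
bool
refcount_drop(atomic_uint *rc)
{
	unsigned old = atomic_fetch_sub(rc, 1);

	assert(old > 0);
	return (old == 1);
}
```

Because fetch-and-subtract returns the previous value, exactly one caller
observes the 1->0 transition, so no lock is needed to decide who frees the
object.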

>     This business about INVARIANTS makes no sense to me.  INVARIANTS should
>     not totally change the way the refcount API works.  It certainly should
>     not result in different structures!  If we are embedding ref counts
>     in every structure in the system simply setting or clearing INVARIANTS
>     blows up our compatibility, which is bad.

It could use a static system-wide mutex for all I care.  The invariants need
the mutex so they can safely read the value for the purposes of the KASSERT's,
that is all.  A pool would be better than a single mutex possibly.  My question
is how does your pool work?  Do you pick a mutex out of the pool at init time
like the lockmgr locks work?  Or do you use a hash on the object address?

>     Also, I don't see any reason to embed yet another mutex in a structure.
>     The ref count should be a simple int.  Use a pool of mutexes.  If you
>     like I'll commit a set of generic pool mutexes that you can simply call.
>     How about that?

Well, there are different ways of doing lock pools. :)  How about something
like this:

/*
 * Returns lock for address 'ptr'.
 */
struct mtx *
mtx_pool_find(void *ptr)
{
	/* hash 'ptr' into the pool and return the selected mutex */
}

#define mtx_pool_lock(p)        mtx_lock(mtx_pool_find((p)))
#define mtx_pool_unlock(p)      mtx_unlock(mtx_pool_find((p)))

Then if a structure (like lockmgr locks or sx locks) wants to cache the lock
pointer instead of doing the hash all the time, it can just do

        foo->f_lock = mtx_pool_find(foo);

This actually isn't all that difficult, it just adds the ability to lookup and
cache the mutex associated with an address.  I would also like it under mtx_*
so it's clear what type of locks are in the pool, but that's just me. :)

>                                                       -Matt

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Mon Nov 12 16:24:51 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail12.speakeasy.net (mail12.speakeasy.net [216.254.0.212])
	by hub.freebsd.org (Postfix) with ESMTP id 859D737B41A
	for <arch@freebsd.org>; Mon, 12 Nov 2001 16:24:36 -0800 (PST)
Received: (qmail 70027 invoked from network); 13 Nov 2001 00:24:35 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail12.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <julian@elischer.org>; 13 Nov 2001 00:24:35 -0000
Message-ID: <XFMail.011112162429.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <Pine.BSF.4.21.0111121507240.94926-100000@InterJet.elischer.org>
Date: Mon, 12 Nov 2001 16:24:29 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Julian Elischer <julian@elischer.org>
Subject: RE: Thread scheduling in the kernel
Cc: arch@freebsd.org
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 12-Nov-01 Julian Elischer wrote:
> 
> In an attempt to get the next part of the KSE work designed (design before
> code you know.. a strange new concept) I've been trying to work out
> the "correct" scheduling methods for such a system.
> 
> There are a few 'tricks' that need to be taken into account..
> 
> a few notes..
> 
> 
> 1/ Since threads running a syscall hit 'sleep' events
> the entities on teh sleep queues must be the  threads.
> 
> 2/ the entity that is scheduled onto the run queues is the KSE.
> (as the name suggests).
> 
> 3/ If we have only one run queue, then KSEs for several processors
> from the same process, may be on the same queue.
> 
> 4/  If threads 'wake up' they are hung of a list of runnable threads
> somewhere. This list could be hanging off the process, or the KSE.
> (actually more likely the KSEgroup than the process but...)

It should hang off the group.

> 5/ If a KSE reaches teh front of the queue, but the process
> that is running is not that for which that KSE has some affinity,
> does it get out of the way to allow another KSE in the queue
> to get run? or does it just run and 'switch' everything over to the new
> available processor? Maybe the scheduler looks for the KSE from the same
> group, that was assigned to that processor, and runs that, leaving
> the original KSE at the head of the queue? 
> Maybe that happens until all the KSEs in the queue
> that were from that group have been run? In this case it becomes possible
> to always have a KSE from that group ready...

Actually, I would remove the concept of affinities from the KSE itself.  Rather
I would let each thread have lastcpu like it does now, and when a KSE goes to
choose a thread, it chooses one that has the lastcpu == current cpuid.

> Maybe the KSE-GROUP is what is put unto the run queue, and KSEs from that
> group are put on all processors that look for work, until all of them 
> have been run? (this would ensure that threads from the same process
> would all be run at the same time which is sometimes good, and sometimes
> bad, depending on the application.

I wouldn't do this.  I would just put KSEs on the queues.  However, I think
that KSEs actually can be even smaller than they are now.  AFAICT they are
basically placeholders to sit on the runqueues and not good for much else. :)

> 6/ When a Thread is made runnable it gets (in the present system) a
> priority. What priority does a KSE in the run queues have when it has
> threads of several differnt priorities? Do we sort them in priority order
> and drop the priority of the KSE(group) as we go through them
> until we have less priority than some other kse?

Actually, in theory the priorities are supposed to be per-KSE group, right?  In
that case, changing the priority of an individual thread for the purposes of
priority propagation/inheritance or other shenanigans results in creating a new
group for that thread.

> 7/ when a KSE runs out of work, how does it decide whether there is work
> that should be stolen from a fellow KSE? How does processor affinity
> effect this?

If the list is per-ksegroup, then you just make a first pass preferring threads
that last ran on the current CPU.  If you don't find anything, you just grab
the first thing on the list.
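
A minimal sketch of that two-pass selection, assuming the runnable threads
hang off the KSE group as a singly linked list.  The struct layout and the
name ksegrp_choose_thread() are hypothetical; only the lastcpu preference
and the fall-back to the head of the list come from the description above.

```c
#include <assert.h>
#include <stddef.h>

struct thread {
	int	       td_lastcpu;	/* CPU this thread last ran on */
	struct thread *td_next;
};

/*
 * Remove and return a thread from the group's runnable list, preferring
 * one that last ran on 'cpuid'.  Returns NULL if the list is empty.
 */
struct thread *
ksegrp_choose_thread(struct thread **runq, int cpuid)
{
	struct thread **tdp, *td;

	/* First pass: look for a thread with affinity for this CPU. */
	for (tdp = runq; (td = *tdp) != NULL; tdp = &td->td_next)
		if (td->td_lastcpu == cpuid)
			break;

	/* Nothing matched: just grab the first thing on the list. */
	if (td == NULL) {
		tdp = runq;
		td = *tdp;
	}

	/* Unlink the chosen thread, if any. */
	if (td != NULL) {
		*tdp = td->td_next;
		td->td_next = NULL;
	}
	return (td);
}
```

Note that the KSE itself carries no affinity here; the preference lives
entirely in each thread's td_lastcpu.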

> 8/ If we had per-processor scheduling queues, How would that effect it?
> Which element get's put on the queues? Does a KSE
> stay on the run queue if it has un=run threads, even when it's running?
> How do we handle the arrival of new runnable threads with a KSE
> when it's running but a fellow KSE is not runnable. Do we 
> bump the priority of the other KSE and hand it the new threads?

I'm not sure how this fits in that model unless you bind KSEs to CPUs or
something similar.  Only threads really have affinity; KSEs don't really care
if they migrate as they have no execution context that gets affected.

If the priorities are per-KSEgroup, then you get to assume that all threads in
a group are equal in priority, which is true unless a particular thread
temporarily gets a bump from priority propagation or the process assigns a
thread to a realtime priority or some such.

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Mon Nov 12 16:26:45 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205])
	by hub.freebsd.org (Postfix) with ESMTP id 6667D37B419
	for <freebsd-arch@FreeBSD.ORG>; Mon, 12 Nov 2001 16:26:33 -0800 (PST)
Received: (qmail 4112 invoked from network); 13 Nov 2001 00:26:32 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <jhb@FreeBSD.org>; 13 Nov 2001 00:26:32 -0000
Message-ID: <XFMail.011112162627.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <XFMail.011112162428.jhb@FreeBSD.org>
Date: Mon, 12 Nov 2001 16:26:27 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: John Baldwin <jhb@FreeBSD.org>
Subject: Re: cur{thread/proc}, or not.
Cc: Terry Lambert <tlambert2@mindspring.com>,
	Robert Watson <rwatson@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG,
	Matthew Dillon <dillon@apollo.backplane.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 13-Nov-01 John Baldwin wrote:
> Then if a structure (like lockmgr locks or sx locks) wants to cache the lock
> pointer instead of doing the hash all the time, it can just do
> 
>         foo->f_lock = mtx_pool_find(foo);
> 
> This actually isn't all that difficult, it just adds the ability to lookup
> and cache the mutex associated with an address.  I would also like it under
> mtx_* so it's clear what type of locks are in the pool, but that's just me. :)

s/difficult/different/

It's not difficult either, but that wasn't my point. :)

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Mon Nov 12 16:31:23 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id EA1DA37B405; Mon, 12 Nov 2001 16:31:20 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fAD0Unn07434;
	Mon, 12 Nov 2001 16:30:49 -0800 (PST)
	(envelope-from dillon)
Date: Mon, 12 Nov 2001 16:30:49 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111130030.fAD0Unn07434@apollo.backplane.com>
To: Julian Elischer <julian@elischer.org>
Cc: John Baldwin <jhb@FreeBSD.ORG>,
	Terry Lambert <tlambert2@mindspring.com>,
	Robert Watson <rwatson@FreeBSD.ORG>, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References:  <Pine.BSF.4.21.0111121608240.94926-100000@InterJet.elischer.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

:weren't you just complaining that there were too many kinds of mutex?
:I'm not sure how this fits under "reference counting API"
:
:ANyhow can you explain the idea of a pool mutex more clearly?

    A pool mutex is the BSDI concept, similar to the wait address when
    you tsleep().  You get the mutex via a rendezvous point which is an
    arbitrary pointer, and release it the same way.

    Just as with the wait address the pointer you pass is arbitrary.  It
    need not represent any sort of structure and the structures you use
    need not embed any actual mutex.  Instead the pool code would obtain
    a mutex out of a pool of mutexes based on a hash of the supplied pointer.

    pool_mtx_lock(void *ptr);
    pool_mtx_unlock(void *ptr);

    Pool mutexes could be used just about *everywhere* where a mutex is used
    in a non-reentrant fashion now.  i.e. where you obtain a mutex, do a
    bunch of stuff that does not require obtaining any additional mutexes,
    and then release the mutex (which is how most mutexes are supposed to
    work anyway).

    There are two huge advantages to using pool mutexes:

	* No structural overhead.  Zip.  Zero.  Zilch.  Nada.

	* The mutex itself is stable storage, even if the address
	  is not, so you can use it to verify the second pointer when you
	  have a pointer to a (stable) structure containing a field which
	  is a pointer to an (unstable) structure.  

	  while ((ptr = stable->pointer) != NULL) {
		pool_mtx_lock(ptr);
		if (ptr == stable->pointer)
		    break;
		pool_mtx_unlock(ptr);
	  }
	  /*
	   * stable->pointer, if not NULL, is now locked and itself stable
	   * until you release the mutex
	   */

    There are two disadvantages:

	* Possible non-optimal cache mastership behavior.  However, this
	  is not a major disadvantage since it can be addressed by 
	  increasing the pool size.

	* Slightly greater overhead to calculate the hash index and obtain
	  the address of the pool mutex before obtaining or releasing it.

    The pool mutex hash function would be something simple based on
    (int)ptr.
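
    To make the idea concrete, here is a userland sketch using POSIX
    threads in place of kernel mutexes.  Everything beyond the
    pool_mtx_lock()/pool_mtx_unlock() pair is an assumption made for the
    example: the pool size, the placeholder hash, pool_mtx_init(),
    pool_mtx_find() and stable_lock_pointer() are hypothetical names, and
    the unstable pointer is declared _Atomic so the unlocked read in the
    loop is well defined in C11.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE	64		/* assumed; must be a power of two */
#define POOL_MASK	(POOL_SIZE - 1)

/* The mutexes live in static storage, independent of any object. */
static pthread_mutex_t pool[POOL_SIZE];

void
pool_mtx_init(void)
{
	for (int i = 0; i < POOL_SIZE; i++)
		pthread_mutex_init(&pool[i], NULL);
}

/* Map an arbitrary address to one mutex (placeholder hash). */
pthread_mutex_t *
pool_mtx_find(const void *ptr)
{
	return (&pool[((uintptr_t)ptr >> 5) & POOL_MASK]);
}

void
pool_mtx_lock(const void *ptr)
{
	pthread_mutex_lock(pool_mtx_find(ptr));
}

void
pool_mtx_unlock(const void *ptr)
{
	pthread_mutex_unlock(pool_mtx_find(ptr));
}

/* A stable structure holding a pointer to an unstable one. */
struct stable {
	_Atomic(void *) pointer;
};

/*
 * Lock whatever stable->pointer refers to, re-checking after the lock
 * is held that the pointer did not change underneath us.  Returns the
 * locked pointer, or NULL (with nothing locked) if the field is NULL.
 */
void *
stable_lock_pointer(struct stable *stable)
{
	void *ptr;

	while ((ptr = atomic_load(&stable->pointer)) != NULL) {
		pool_mtx_lock(ptr);
		if (ptr == atomic_load(&stable->pointer))
			break;
		pool_mtx_unlock(ptr);
	}
	return (ptr);
}
```

    The caller releases the lock with pool_mtx_unlock() on the returned
    pointer.  Two different addresses may share a pool mutex, which is why
    this only suits the non-reentrant usage described above.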

						-Matt



From owner-freebsd-arch  Mon Nov 12 16:34:54 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 2E4B937B416; Mon, 12 Nov 2001 16:34:52 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fAD0YqV07450;
	Mon, 12 Nov 2001 16:34:52 -0800 (PST)
	(envelope-from dillon)
Date: Mon, 12 Nov 2001 16:34:52 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111130034.fAD0YqV07450@apollo.backplane.com>
To: John Baldwin <jhb@FreeBSD.org>
Cc: freebsd-arch@FreeBSD.org, Robert Watson <rwatson@FreeBSD.org>,
	Terry Lambert <tlambert2@mindspring.com>
Subject: Re: cur{thread/proc}, or not.
References:  <XFMail.011112162428.jhb@FreeBSD.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

:is how does your pool work?  Do you pick a mutex out of the pool at init time
:like the lockmgr locks work?  Or do you use a hash on the object address?

    I was thinking non-chained hash on the object address.  Real simple.
    (((int)ptr >> 5) ^ (int)ptr) & MASK or something like that.  Or something
    even simpler... basically something we can play around with and optimize
    later without breaking the API we've constructed.
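
    Spelled out, that hash might look like the sketch below.  It uses
    uintptr_t rather than int (an assumption on my part, so the pointer
    is not truncated on 64-bit machines), and MTX_POOL_SIZE and
    mtx_pool_hash() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MTX_POOL_SIZE 128u		/* hypothetical pool size */
#define MTX_POOL_MASK (MTX_POOL_SIZE - 1u)

/* The mask trick only works when the pool size is a power of two. */
static_assert((MTX_POOL_SIZE & MTX_POOL_MASK) == 0,
    "pool size must be a power of two");

/* Non-chained hash on the object address: fold, then mask. */
static unsigned
mtx_pool_hash(const void *ptr)
{
	uintptr_t v = (uintptr_t)ptr;

	return ((unsigned)(((v >> 5) ^ v) & MTX_POOL_MASK));
}
```

    Because callers only ever see the lock and unlock entry points, the
    body of this function can be swapped out later without touching them.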

:Well, there are different ways of doing lock pools. :)  How about something
:like this:
:
:/*
: * Returns lock for address 'ptr'.
: */
:mtx_pool_find(void *ptr)
:{
:}
:
:#define mtx_pool_lock(p)        mtx_lock(mtx_pool_find((p)))
:#define mtx_pool_unlock(p)      mtx_unlock(mtx_pool_find((p)))
:
:Then if a structure (like lockmgr locks or sx locks) wants to cache the lock
:pointer instead of doing the hash all the time, it can just do
:
:        foo->f_lock = mtx_pool_find(foo);
:
:This actually isn't all that difficult, it just adds the ability to lookup and
:cache the mutex associated with an address.  I would also like it under mtx_*
:so it's clear what type of locks are in the pool, but that's just me. :)

    Yes I think the addition of a mtx_pool_find() call is excellent!  A
    wonderful example of horizontal expansion (rather than vertical
    complexity, or vertical complication if I'm being cute).
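
    To make the find-and-cache variant concrete, here is a userland
    sketch.  It assumes pthread mutexes in place of struct mtx;
    mtx_pool_setup(), struct foo's layout and foo_bump() are
    hypothetical, and only mtx_pool_find() and the two macros come from
    the proposal quoted above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MTX_POOL_SIZE 64u		/* hypothetical; a power of two */

static pthread_mutex_t mtx_pool[MTX_POOL_SIZE];

/* One-time initialization of every mutex in the pool. */
void
mtx_pool_setup(void)
{
	for (unsigned i = 0; i < MTX_POOL_SIZE; i++)
		pthread_mutex_init(&mtx_pool[i], NULL);
}

/* Returns the pool mutex for address 'ptr'. */
pthread_mutex_t *
mtx_pool_find(const void *ptr)
{
	uintptr_t v = (uintptr_t)ptr;

	return (&mtx_pool[((v >> 5) ^ v) & (MTX_POOL_SIZE - 1u)]);
}

/* Uncached path: hash on every lock and unlock. */
#define mtx_pool_lock(p)	pthread_mutex_lock(mtx_pool_find((p)))
#define mtx_pool_unlock(p)	pthread_mutex_unlock(mtx_pool_find((p)))

struct foo {
	pthread_mutex_t	*f_lock;	/* cached pool mutex */
	int		 f_count;
};

void
foo_init(struct foo *foo)
{
	foo->f_lock = mtx_pool_find(foo);	/* hash once, at init time */
	foo->f_count = 0;
}

void
foo_bump(struct foo *foo)
{
	pthread_mutex_lock(foo->f_lock);	/* no hash on the hot path */
	foo->f_count++;
	pthread_mutex_unlock(foo->f_lock);
}
```

    The cached pointer costs one pointer per object and is only valid
    while the object stays at the address it was hashed from.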

						-Matt




From owner-freebsd-arch  Mon Nov 12 16:35:31 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP
	id 3128637B405; Mon, 12 Nov 2001 16:35:26 -0800 (PST)
Received: from fledge.watson.org (ak82hjs7hex92j@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.6/8.11.5) with SMTP id fAD0ZGB37659;
	Mon, 12 Nov 2001 19:35:16 -0500 (EST)
	(envelope-from robert@fledge.watson.org)
Date: Mon, 12 Nov 2001 19:35:15 -0500 (EST)
From: Robert Watson <rwatson@FreeBSD.org>
X-Sender: robert@fledge.watson.org
To: Terry Lambert <tlambert2@mindspring.com>
Cc: John Baldwin <jhb@FreeBSD.org>,
	Matthew Dillon <dillon@apollo.backplane.com>,
	freebsd-arch@FreeBSD.org
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <3BF05D4C.55A9A459@mindspring.com>
Message-ID: <Pine.NEB.3.96L.1011112185746.36592C-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On Mon, 12 Nov 2001, Terry Lambert wrote:

> John Baldwin wrote:
> > the refcount for now, but I still have patches that
> > some people don't like for implementing a simple refcount API just using
> > atomic operations.
> 
> Please commit these.  Using mutexes in this instance is just a happy way
> to put the performance in the toilet. 

My recollection is that there was some concern about the size of the unit
of atomic operation across platforms.  I may not recall correctly, but my
understanding was that some platforms substantially limited the potential
size of the target of the atomic operation to less than the normal
arithmetic unit size.  Again, subject to the fallibility of my
recollection, the maximum unit for atomic operations on Sparc64 was
24-bit, despite the native register size being 64-bit. 

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services




From owner-freebsd-arch  Mon Nov 12 16:38:23 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205])
	by hub.freebsd.org (Postfix) with ESMTP id ACBBC37B416
	for <freebsd-arch@FreeBSD.org>; Mon, 12 Nov 2001 16:38:20 -0800 (PST)
Received: (qmail 11614 invoked from network); 13 Nov 2001 00:38:19 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <rwatson@FreeBSD.org>; 13 Nov 2001 00:38:19 -0000
Message-ID: <XFMail.011112163814.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <Pine.NEB.3.96L.1011112185746.36592C-100000@fledge.watson.org>
Date: Mon, 12 Nov 2001 16:38:14 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Robert Watson <rwatson@FreeBSD.org>
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.org,
	Matthew Dillon <dillon@apollo.backplane.com>,
	Terry Lambert <tlambert2@mindspring.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 13-Nov-01 Robert Watson wrote:
> 
> On Mon, 12 Nov 2001, Terry Lambert wrote:
> 
>> John Baldwin wrote:
>> > the refcount for now, but I still have patches that
>> > some people don't like for implementing a simple refcount API just using
>> > atomic operations.
>> 
>> Please commit these.  Using mutexes in this instance is just a happy way
>> to put the performance in the toilet. 
> 
> My recollection is that there was some concern about the size of the unit
> of atomic operation across platforms.  I may not recall correctly, but my
> understanding was that some platforms substantially limited the potential
> size of the target of the atomic operation to less than the normal
> arithmetic unit size.  Again, subject to the fallibility of my
> recollection, the maximum unit for atomic operations on Sparc64 was
> 24-bit, despite the native register size being 64-bit. 

No, that was on sparc32, not sparc64.  All of our current architectures would
be fine with it.
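
For readers who have not seen one, a refcount built only from atomic
operations can be sketched as follows.  This is not the patch set
mentioned above; it is an illustration using C11 <stdatomic.h>, and
struct refcnt and the refcnt_*() names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct refcnt {
	atomic_uint count;
};

static inline void
refcnt_init(struct refcnt *r, unsigned value)
{
	atomic_init(&r->count, value);
}

/* Take another reference; the caller must already hold one. */
static inline void
refcnt_acquire(struct refcnt *r)
{
	unsigned old;

	old = atomic_fetch_add_explicit(&r->count, 1u, memory_order_relaxed);
	assert(old != 0);
	(void)old;
}

/* Drop a reference; returns true when the last one went away. */
static inline bool
refcnt_release(struct refcnt *r)
{
	unsigned old;

	old = atomic_fetch_sub_explicit(&r->count, 1u, memory_order_acq_rel);
	assert(old != 0);
	return (old == 1u);
}
```

The caller frees the object when refcnt_release() returns true; no mutex
is taken on either path.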

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Mon Nov 12 16:42:50 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP
	id 72EDC37B417; Mon, 12 Nov 2001 16:42:44 -0800 (PST)
Received: from fledge.watson.org (ak82hjs7hex92j@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.6/8.11.5) with SMTP id fAD0gWB37739;
	Mon, 12 Nov 2001 19:42:32 -0500 (EST)
	(envelope-from robert@fledge.watson.org)
Date: Mon, 12 Nov 2001 19:42:31 -0500 (EST)
From: Robert Watson <rwatson@FreeBSD.org>
X-Sender: robert@fledge.watson.org
To: John Baldwin <jhb@FreeBSD.org>
Cc: freebsd-arch@FreeBSD.org,
	Matthew Dillon <dillon@apollo.backplane.com>,
	Terry Lambert <tlambert2@mindspring.com>
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <XFMail.011112163814.jhb@FreeBSD.org>
Message-ID: <Pine.NEB.3.96L.1011112194158.36592D-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On Mon, 12 Nov 2001, John Baldwin wrote:

> > My recollection is that there was some concern about the size of the unit
> > of atomic operation across platforms.  I may not recall correctly, but my
> > understanding was that some platforms substantially limited the potential
> > size of the target of the atomic operation to less than the normal
> > arithmetic unit size.  Again, subject to the fallibility of my
> > recollection, the maximum unit for atomic operations on Sparc64 was
> > 24-bit, despite the native register size being 64-bit. 
> 
> No, that was on sparc32, not sparc64.  All of our current architectures
> would be fine with it. 

Oh, good.  I couldn't remember (hence some waffling) -- I have no problem
with this.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services




From owner-freebsd-arch  Mon Nov 12 18:17: 3 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from gull.prod.itd.earthlink.net (gull.mail.pas.earthlink.net [207.217.120.84])
	by hub.freebsd.org (Postfix) with ESMTP
	id 542E637B416; Mon, 12 Nov 2001 18:16:54 -0800 (PST)
Received: from dialup-209.247.141.234.dial1.sanjose1.level3.net ([209.247.141.234] helo=mindspring.com)
	by gull.prod.itd.earthlink.net with esmtp (Exim 3.33 #1)
	id 163T87-0001fz-00; Mon, 12 Nov 2001 18:16:52 -0800
Message-ID: <3BF082C6.BA7CA05D@mindspring.com>
Date: Mon, 12 Nov 2001 18:17:42 -0800
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Robert Watson <rwatson@FreeBSD.org>
Cc: freebsd-arch@FreeBSD.org
Subject: Re: cur{thread/proc}, or not.
References: <Pine.NEB.3.96L.1011112183454.36592A-100000@fledge.watson.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Robert Watson wrote:
> > I think that the majority of the netinet code can be handled by using
> > the socket credential, instead of the process credential.
> 
> The majority, yes, but not all.  In particular, there are a number of
> desirable behaviors where you *do* want to use the process credential.  In
> particular, relating to binding activities, where current semantics permit
> a 'privileged process' to create and bind sockets such that they have
> access to otherwise restricted ports, transfer them to unprivileged
> processes, but not grant the full scope of privilege to those processes. A
> primary example of this in use in practice might be a situation where an
> I/O socket is handed off from a network daemon to an unprivileged process,
> such as inetd handing off to fingerd: fingerd should not retain inetd's
> privileges regarding many aspects of the socket's behavior.  This argument
> might be seen more convincingly from the perspective of UDP sockets.  Yes,
> it is true that in most cases use of the socket credential is desirable,
> but in a number of important cases, it is not.

I think that this case implies that the socket creation and
binding are separated, or that it's possible to re-bind a
socket, once bound.

I think the model needs to be "relinquish privileges"; in other
words, there is an explicit handoff, at which point this is an
allowable thing.

Putting bits like "may bind to privileged port" on unbound
sockets is, I think, a bad thing.

The easiest way to deal with this is to replace the socket
credential when the handoff takes place.

However, I think that in most cases, the privilege handoff
associated with the handoff of a privileged object is _intentional_,
in order to have a process with full privileges (e.g. "root") hand
off only partial privileges to another, otherwise unprivileged
process.  Specifically, it's a workaround for not having high
granularity control over privileges and/or a capabilities model
(capabilities models are, by definition, impossible to initialize
without invoking some implicit privilege, so we can ignore them as
academic curiosities for now).

If I hand off access to something by handing off a descriptor,
rather than handing off a reference and forcing you to create
your own descriptor, then my handoff of rights is intentional,
and not something which needs to be blocked.


> There are some related cases in VFS, where we consider a per-jail
> securelevel based on the acting process, not the file-opening process.

I don't like these, but I accept that they must exist for jail
code to function.

> Similarly, there are some ioctl's on tty devices that are subject
> (process)  credential authorized: these are in general present to handle
> the case where descriptors to these objects are (and must be) inherited.

I think this and the previous case can be folded together as
"user option", similar to not being able to have simultaneous
use of your X server, or the ability to load kernel modules, and
secure level 2 at the same time: it's a trade off, and it is a
conscious one made at user discretion.


> There are some related cases, such as fd passing via unix domain sockets,
> where the same properties can prove very useful: the ability to transfer
> access to sockets/files via LPC as 'rights' rather than delegating all
> rights.

The read/write rights for objects opened by another process, or
opened in an SUID case, with a subsequent relinquishing of the
credentials that permitted the operation in the first place are
the interesting cases, I think.  The others fall into exception
and administrative fiat.


> > > With the eventual addition
> > > of td->td_ucred, it will be desirable to use the credential for the
> > > current thread, rather than the proc, which will require locking to use.
> >
> > I think locking credential instances is bad.
> 
> That is not what we're talking about.  We're talking about locking the
> process structure.  No one is suggesting this.

I think locking the process structure/thread structure is bad,
particularly when you are only doing it to get at the credential,
and it's probably the wrong credential anyway.


> > There is a lot of this type of fuzzy thinking, asking "how can I
> > propagate the process credential that I used to use for this operation
> > down to the underlying code?", when the real question should be "what is
> > the appropriate credential to use for this operation, and is the process
> > credential really what I want to use in this case?".
> 
> I agree there has been a lot of fuzzy thinking.  I also agree that, in
> every case, we need to carefully consider the credential used.  In
> particular, this is true in the 'new world order' of td_ucred, where we'll
> now often have three credentials to decide from:
> 
> (1) Mutable p_ucred (requires proc lock)
> (2) Cached td_ucred (requires no lock)
> (3) Cached so->so_cred, file->f_cred, et al.
> 
> In most cases, (2) or (3) will be appropriate.  In some situations,
> particularly when it comes to credential update, (1) will be appropriate.

I consider caching of mutable data harmful.  Here, you imply that
there will be cached mutable data in scope at the time that the
decision to use the mutable data must be made.

I think this is incredibly messy, and will only lead to mistakes
about what's being used.

I think that if a right is granted, it's granted, and only if you
define a specific revocation protocol that can be procedurally
linked so as to notify those people who need to make the assumption
of non-mutability for performance reasons, is it OK to change it.

I would be very tempted to:

1)	const the credentials that are non-mutable; this is hard,
	but manageable through a cast after the reference count
	adjustment.

2)	leave all unnecessary credentials out of scope, so that
	the decision as to which to use is obvious.

3)	Discourage the implementation of a revocation protocol.
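
Item 1 can be sketched as follows.  This is only an illustration of
the "cast after the reference count adjustment" idea: struct
ucred_sketch, cred_hold() and cred_drop() are hypothetical and are not
the kernel's struct ucred or its crhold()/crfree().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a credential; not the kernel's struct ucred. */
struct ucred_sketch {
	atomic_uint	cr_ref;		/* reference count */
	unsigned	cr_uid;		/* effective user id */
};

/* Take a reference and hand back a read-only view of the credential. */
const struct ucred_sketch *
cred_hold(struct ucred_sketch *cr)
{
	atomic_fetch_add(&cr->cr_ref, 1u);
	return (cr);			/* converts to const on return */
}

/* Drop a reference; returns nonzero when the last one went away. */
int
cred_drop(const struct ucred_sketch *cr)
{
	/* The only place const is cast away: the refcount adjustment. */
	struct ucred_sketch *m = (struct ucred_sketch *)cr;

	return (atomic_fetch_sub(&m->cr_ref, 1u) == 1u);
}
```

A consumer holding only the const pointer can read cr_uid, but an
assignment through it fails to compile.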

I realize that this is a tradeoff between explicit and implicit,
and that it results in irrevocable grant of privileges, in so
far as the credential reference granted grants such privilege,
but the cases where this is bad are incredible exceptions, such
as revocation of a clearance to someone formerly having clearance
on a machine where you are going to trust their processes to
continue to run, at the lowered clearance level.

Continuing to let the code run in this situation will probably
happen when hell freezes over.


> > I think it's possible to get rid of most of the process credential
> > references -- and therefore, most of the proc references -- at all
> > points below the /sys/kern/uipc_socket*.c level.
> 
> No, it's not, in a number of very important cases, of which I've
> identified at least three above.

I disagree with two of them (see above), and think the third is
an incredible exception.  If you don't think so, then perhaps it's
time we rethink the underlying problem being solved, and change
the solution to be more rational so as to not require that.

The problem here is that you are trying to do something as an
afterthought (add security features not previously present), and
avoid some of the redesign that should happen, at the cost of a
performance penalty.


> Structuring code to have a notion of "but the kernel asked" vs. "but a
> user asked" is difficult, and something I'm not sure we have a grasp on
> how to approach.  Sometimes, for example, FSCRED or NOCRED is used as a
> "special-case" credential to say "do it anyway".  This is often broken
> when it comes to distributed file systems where a client system may not
> simply be able to assert "because I said so", and probably reflects
> unclear thinking on the topic.

Most distributed FS's have this issue.  You're not going to resolve
it by fiat, since it's impossible to do that without an enforceable
distributed cache coherency protocol, such that when the cached
data gets to the client, it can be forcefully updated by the server,
should it become necessary.

I think you are concentrating too much on the revocation of granted
rights issue, rather than on the grant of nonrevocable right issue,
which is what I think should be the tack taken.


> > I would much rather that the credentials be object referenced off of
> > non-process, non-thread objects, based on whatever the correct scoping
> > really is, for the security model you want to enforce.  My "accept"
> > example is only one of a class of changes that could facilitate this.
> 
> I think everyone agrees that the 'cached credential' model is the right
> approach for many of these cases, but I think it's over-reaching to claim
> it's appropriate in all cases.  The question then becomes, how do we
> access the relevant 'subject' credential to authorize the operation: is it
> something that is passed down via the call stack (possibly via 'struct
> thread *td'), or is it something implicit to the run-time environment
> ('curproc'/'curthread'), which is precisely the question I was trying to
> resolve through my post.  If 'curproc'/'curthread' is truly undesirable,
> then we can simply eliminate its use, and replace that with almost
> universal passing of 'struct thread' (for the purposes of authorization,
> but also for other purposes: target of copyin/copyout/aio, scheduling,
> ktrace, ...).  If it is acceptable to maintain the use of curproc, we may
> want to change some of our primitives to represent it being available.

I think it's truly undesirable, since it limits the scalability
of number of CPUs, and the ability to create clusters reasonably,
by putting a lot of bus contention into operations which should
not involve inter-CPU cache coherency issues in the first place.

I don't believe you will be able to grant privilege on one node
of a NUMA cluster, translate the process to another node, and then
revoke the privilege on a third node, and have that revocation
take effect without leaving a race window in which the putatively
de-credentialed process is still able to act with the granted
credentials before the node on which it is running receives the
revocation.

This is exactly the X.509 certificate revocation problem, and it'd
be nice if everyone could afford to check with the certificate
authority to see the revocation list each and every time that they
wanted to invoke the privilege granted by holding the certificate,
but that's just not scalable to real world application.

If you want to do this, then you need to change the way you handle
it entirely; for X.509, this is generally done by providing for a
time based expiration, and a recertification requirement.  No one
really looks at the CRLs, in practice.

In the limit, this scales by granting the rights for longer and
longer windows, as utilization increases.  It's not very satisfying.


> Right now, we're in a state of limbo: the official policy (if you will) is
> 'XXX'.  We should either eliminate it from general use, or we should use
> it where it's appropriate :-).

I definitely agree that there should be an unambiguous policy in
place... I just think I disagree with you about what it should be.
:-).

-- Terry



From owner-freebsd-arch  Mon Nov 12 18:30:23 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from gull.prod.itd.earthlink.net (gull.mail.pas.earthlink.net [207.217.120.84])
	by hub.freebsd.org (Postfix) with ESMTP
	id 7720B37B416; Mon, 12 Nov 2001 18:30:20 -0800 (PST)
Received: from dialup-209.247.141.234.dial1.sanjose1.level3.net ([209.247.141.234] helo=mindspring.com)
	by gull.prod.itd.earthlink.net with esmtp (Exim 3.33 #1)
	id 163TL9-0007BQ-00; Mon, 12 Nov 2001 18:30:19 -0800
Message-ID: <3BF085EC.AEE7DE9C@mindspring.com>
Date: Mon, 12 Nov 2001 18:31:08 -0800
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: John Baldwin <jhb@FreeBSD.org>
Cc: Julian Elischer <julian@elischer.org>, freebsd-arch@FreeBSD.org,
	Robert Watson <rwatson@FreeBSD.org>,
	Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: cur{thread/proc}, or not.
References: <XFMail.011112162432.jhb@FreeBSD.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

John Baldwin wrote:
> > Anyhow can you explain the idea of a pool mutex more clearly?
> 
> Heh, think of it as a pool of mutexes, not a different type of
> mutex.  Instead of having 1 mutex for each object, you use a hash
> table of mutexes for a set of objects.  Thus, if you have 50 objects
> vs. 500 objects, if you embed 1 mutex for each object, you bloat
> each object and have 500 locks instead of 50 locks.  Using pool
> mutexes, you only have N number of mutexes regardless of the number
> of objects.  Note that if pool mutexes are non-recursive, they can't
> be safely used when you might have more than one object of a given
> set locked at a time. For example, process locks are the only object
> we do this with currently.

Pool mutexes are evil, if not implemented exactly right,
and "exactly right" will vary over time.

We need only look at the allocation unit optimization for
things like struct socket allocations, which weren't updated
when kevent came in and changed the size of the structure
and therefore made the previous optimal cluster allocation
block pessimal instead.

Pool mutexes have the same problem that the fixed hash size
for TCP connections has, in that you end up with relatively
large collision domains when you get to a relatively large
number of objects being hashed.

Increasing the hash is not an answer, since it means that
the default tuned case tries to handle the max for everything
and ends up taking up so much memory you get the max for
nothing.

You might be able to keep a "pool ratio"; e.g. "for every N
objects, there will be 1 mutex bucket", but then you get into
the problem of refactoring the existing buckets.

There is also the issue of collision domain; we tend to see
this with an incredible number of client connections to HTTP
servers with the in_pcbhash code (to keep the same example),
because the hash values for port 80 on a particular IP tend
to be pretty limited.

In other words, I think that you will run into locality
issues which will give you a hash that results in a particular
bucket being inordinately busy, while another one is idle.

Unless you address the locality balancing issue up front, it
is a bad idea to use this for mutexes for objects, even if
each object type gets its own mutex pool to avoid collision
multiplication when multiple object types are referenced from
the same pool.
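
To put a number on the locality concern, here is a small sketch that
counts how many pool slots a run of evenly spaced objects occupies.  It
assumes a 128-entry pool and a fold-and-mask hash of the form
(((int)ptr >> 5) ^ (int)ptr) & MASK, written with uintptr_t;
buckets_used() and the sizes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 128u			/* hypothetical pool size */
#define POOL_MASK (POOL_SIZE - 1u)

/* Fold-and-mask hash on the object address. */
static unsigned
pool_hash(uintptr_t v)
{
	return ((unsigned)(((v >> 5) ^ v) & POOL_MASK));
}

/*
 * Count the distinct pool slots hit by n objects laid out
 * 'stride' bytes apart, starting at address 'base'.
 */
unsigned
buckets_used(uintptr_t base, size_t stride, unsigned n)
{
	unsigned char seen[POOL_SIZE] = { 0 };
	unsigned used = 0;

	for (unsigned i = 0; i < n; i++) {
		unsigned h = pool_hash(base + (uintptr_t)i * stride);

		if (!seen[h]) {
			seen[h] = 1;
			used++;
		}
	}
	return (used);
}
```

With these assumptions, a thousand page-aligned objects (stride 4096)
all land in a single slot, because the low twelve address bits never
change, while the same thousand objects at a 32-byte stride spread over
all 128 slots.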

-- Terry



From owner-freebsd-arch  Mon Nov 12 18:35: 5 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from gull.prod.itd.earthlink.net (gull.mail.pas.earthlink.net [207.217.120.84])
	by hub.freebsd.org (Postfix) with ESMTP
	id A7C8E37B417; Mon, 12 Nov 2001 18:34:58 -0800 (PST)
Received: from dialup-209.247.141.234.dial1.sanjose1.level3.net ([209.247.141.234] helo=mindspring.com)
	by gull.prod.itd.earthlink.net with esmtp (Exim 3.33 #1)
	id 163TPc-0003xn-00; Mon, 12 Nov 2001 18:34:56 -0800
Message-ID: <3BF08702.84DDFFE0@mindspring.com>
Date: Mon, 12 Nov 2001 18:35:46 -0800
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: Julian Elischer <julian@elischer.org>,
	John Baldwin <jhb@FreeBSD.ORG>, Robert Watson <rwatson@FreeBSD.ORG>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References: <Pine.BSF.4.21.0111121608240.94926-100000@InterJet.elischer.org> <200111130030.fAD0Unn07434@apollo.backplane.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Matthew Dillon wrote:
>     There are two huge advantages to using pool mutexes:
> 
>         * No structural overhead.  Zip.  Zero.  Zilch.  Nada.
> 
>         * The mutex itself is stable storage, even if the address
>           is not, so you can use it to verify the second pointer when you
>           have a pointer to a (stable) structure containing a field which
>           is a pointer to an (unstable) structure.

They are a solution to the retrofit problem.  I.e. you use them
when you would rather kludge around the problem instead of having
to refactor the code.


>     There are two disadvantages:
> 
>         * Possible non-optimal cache mastership behavior.  However, this
>           is not a major disadvantage since it can be addressed by
>           increasing the pool size.

See my other post... this looks like a fix, but it doesn't
scale, and it limits the system by default, and grossly
complicates tuning for optimal performance for a particular
task.


>         * Slightly greater overhead to calculate the hash index and obtain
>           the address of the pool mutex before obtaining or releasing it.
> 
>     The pool mutex hash function would be something simple based on
>     (int)ptr.

You could pick a computationally trivial hash to avoid this;
it's fairly irrelevant to the argument, either way, I think.

-- Terry



From owner-freebsd-arch  Mon Nov 12 19:17:21 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id BD55D37B405; Mon, 12 Nov 2001 19:17:19 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fAD3HIE07916;
	Mon, 12 Nov 2001 19:17:18 -0800 (PST)
	(envelope-from dillon)
Date: Mon, 12 Nov 2001 19:17:18 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111130317.fAD3HIE07916@apollo.backplane.com>
To: Terry Lambert <tlambert2@mindspring.com>
Cc: John Baldwin <jhb@FreeBSD.ORG>,
	Julian Elischer <julian@elischer.org>, freebsd-arch@FreeBSD.ORG,
	Robert Watson <rwatson@FreeBSD.ORG>
Subject: Re: cur{thread/proc}, or not.
References: <XFMail.011112162432.jhb@FreeBSD.org> <3BF085EC.AEE7DE9C@mindspring.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

:things like struct socket allocations, which weren't updated
:when kevent came in and changed the size of the structure
:and therefore made the previous optimal cluster allocation
:block pessimal instead.
:
:Pool mutexes have the same problem that the fixed hash size
:for TCP connections has, in that you end up with relatively
:large collision domains when you get to a relatively large
:number of objects being hashed.

   Well, I have to disagree.  The primary scaling issue for pool mutexes
   is against the number of CPUs, not the number of structures, and
   the number of CPUs is relatively static.  I agree that the hash
   function needs to be chosen carefully to maximize performance, but
   the advantage is that this (and other tricks) can be done inside
   the API, without having to mess around with anything outside the API.
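
As a rough illustration of the point above (not the kernel's
implementation), a pool of mutexes can be sized against the expected CPU
count and indexed by hashing an object's address, so the hash can be
tuned later behind the interface.  The sketch uses POSIX threads in
userland; POOL_SIZE, pool_setup, pool_index and pool_mutex_find are
hypothetical names chosen for this example.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool size: scaled to the CPU count, not the object count.
 * Must be a power of two so the hash can be masked. */
#define POOL_SIZE 128

static pthread_mutex_t pool[POOL_SIZE];

/* Initialize every mutex in the pool; call once before first use. */
void
pool_setup(void)
{
    for (size_t i = 0; i < POOL_SIZE; i++)
        pthread_mutex_init(&pool[i], NULL);
}

/* Map an object's address to a pool slot.  The low bits are dropped
 * because allocations are usually aligned; this is the part that can be
 * retuned without touching any caller. */
static size_t
pool_index(const void *obj)
{
    uintptr_t p = (uintptr_t)obj;

    return (size_t)(((p >> 4) ^ (p >> 12)) & (POOL_SIZE - 1));
}

/* Return the mutex that guards obj.  The same address always maps to
 * the same mutex; different addresses may share one. */
pthread_mutex_t *
pool_mutex_find(const void *obj)
{
    return &pool[pool_index(obj)];
}
```

Callers would lock pool_mutex_find(obj) around accesses to obj, so no
mutex has to be embedded in the structure itself.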

   I think we have a far worse problem with structural bloat right now.
   Far, far worse.

						-Matt




From owner-freebsd-arch  Tue Nov 13 10:20:32 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP
	id 2644337B417; Tue, 13 Nov 2001 10:20:17 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id KAA99031;
	Tue, 13 Nov 2001 10:04:09 -0800 (PST)
Date: Tue, 13 Nov 2001 10:04:07 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: John Baldwin <jhb@FreeBSD.org>
Cc: arch@freebsd.org
Subject: RE: Thread scheduling in the kernel
In-Reply-To: <XFMail.011112162429.jhb@FreeBSD.org>
Message-ID: <Pine.BSF.4.21.0111130929230.98845-100000@InterJet.elischer.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

(I notice you only commented on the first half, but that's a lot better
than the complete lack of interest from everyone else.....)

On Mon, 12 Nov 2001, John Baldwin wrote:

> 
> On 12-Nov-01 Julian Elischer wrote:
> > 
> > In an attempt to get the next part of the KSE work designed (design before
> > code you know.. a strange new concept) I've been trying to work out
> > the "correct" scheduling methods for such a system.
> > 
> > There are a few 'tricks' that need to be taken into account..
> > 
> > a few notes..
> > 
> > 
> > 1/ Since threads running a syscall hit 'sleep' events
> > the entities on the sleep queues must be the threads.
> > 
> > 2/ the entity that is scheduled onto the run queues is the KSE.
> > (as the name suggests).
> > 
> > 3/ If we have only one run queue, then KSEs for several processors
> > from the same process, may be on the same queue.
> > 
> > 4/  If threads 'wake up' they are hung off a list of runnable threads
> > somewhere. This list could be hanging off the process, or the KSE.
> > (actually more likely the KSEgroup than the process but...)
> 
> It should hang off the group.

This was my original idea.  However I ended up splitting that queue up so
that it was on each KSE and allowed a KSE with no work to steal work from
another. i.e. a virtual single queue, with KSE affinity. If I bind KSEs to
processors lightly, then I bind threads at the same time. (lightly)

The idea is that threads are put on the queue for the KSE on which they
last ran. Only when a KSE runs out of runnable threads on its own list and
still has the CPU will it try to steal work from another in the same group.
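
A minimal sketch of that per-KSE queue with stealing, assuming a fixed
array of KSEs per group and plain FIFO lists. All names here (MAX_KSE,
kse_enqueue, kse_dequeue, kse_choose) are hypothetical, and locking is
left out entirely:

```c
#include <stddef.h>

#define MAX_KSE 4               /* hypothetical cap on KSEs per group */

struct thread {
    int id;
    struct thread *next;
};

struct kse {
    struct thread *head;        /* runnable threads that last ran here */
    struct thread *tail;
};

struct ksegrp {
    struct kse kse[MAX_KSE];
    int nkse;                   /* number of KSEs in use */
};

/* Append a runnable thread to the queue of the KSE it last ran on. */
void
kse_enqueue(struct kse *k, struct thread *td)
{
    td->next = NULL;
    if (k->tail != NULL)
        k->tail->next = td;
    else
        k->head = td;
    k->tail = td;
}

/* Remove and return the first thread on a KSE's queue, or NULL. */
struct thread *
kse_dequeue(struct kse *k)
{
    struct thread *td = k->head;

    if (td != NULL) {
        k->head = td->next;
        if (k->head == NULL)
            k->tail = NULL;
        td->next = NULL;
    }
    return td;
}

/* Pick the next thread for KSE number self: its own queue first, and
 * only if that is empty, steal from the sibling KSEs in the group. */
struct thread *
kse_choose(struct ksegrp *kg, int self)
{
    struct thread *td = kse_dequeue(&kg->kse[self]);

    for (int i = 1; td == NULL && i < kg->nkse; i++)
        td = kse_dequeue(&kg->kse[(self + i) % kg->nkse]);
    return td;
}
```

Because each KSE only consults its own list in the common case, there
is no ordering between threads queued on different KSEs.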

The downside is that there is no overall priority between threads in a
group.. This is one thing I want to discuss... the queueing model.


> 
> > 5/ If a KSE reaches the front of the queue, but the process
> > that is running is not that for which that KSE has some affinity,
> > does it get out of the way to allow another KSE in the queue
> > to get run? or does it just run and 'switch' everything over to the new
> > available processor? Maybe the scheduler looks for the KSE from the same
> > group, that was assigned to that processor, and runs that, leaving
> > the original KSE at the head of the queue? 
> > Maybe that happens until all the KSEs in the queue
> > that were from that group have been run? In this case it becomes possible
> > to always have a KSE from that group ready...
> 
> Actually, I would remove the concept of affinities from the KSE
> itself.  Rather I would let each thread have lastcpu like it does now,
> and when a KSE goes to choose a thread, it chooses one that has the
> lastcpu == current cpuid.

That is another possible way of tackling the problem.
How deep in the group's queue does the KSE look before it decides to just
take 'any' thread?

> 
> > Maybe the KSE-GROUP is what is put onto the run queue, and KSEs from that
> > group are put on all processors that look for work, until all of them
> > have been run? (this would ensure that threads from the same process
> > would all be run at the same time which is sometimes good, and sometimes
> > bad, depending on the application.)
> 
> I wouldn't do this.  I would just put KSEs on the queues.  However, I think
> that KSEs actually can be even smaller than they are now.  AFAICT they are
> basically placeholders to sit on the runqueues and not good for much else. :)

You may notice that that's approximately what I have done now... It's a
"Kernel Schedulable Entity". It does leave some unfairness to the
advantage of processes that have multiple KSEs. The KSE's job is to be
scheduled on a run queue, and to provide a linkage point for other
elements. It doesn't have very much else in it. (maybe a state variable).


> 
> > 6/ When a Thread is made runnable it gets (in the present system) a
> > priority. What priority does a KSE in the run queues have when it has
> > threads of several different priorities? Do we sort them in priority order
> > and drop the priority of the KSE(group) as we go through them
> > until we have less priority than some other kse?
> 
> Actually, in theory the priorities are supposed to be per-KSE group
> right?  In that case, changing the priority of an individual thread
> for the purposes of priority propagation/inheritance or other
> shenanigans results in creating a new group for that thread.

Static priority inputs (e.g. nice), yes.
It is quite possible that the KSEs in the group might have private
priorities that diverge from this according to inputs from the
threads they are running at that time....

My guess is that a KSE from the group is elevated in priority
when a thread with elevated priority becomes runnable.
This brings up questions of pre-emption.

> 
> > 7/ when a KSE runs out of work, how does it decide whether there is work
> > that should be stolen from a fellow KSE? How does processor affinity
> > affect this?
> 
> If the list is per-ksegroup, then you just make a first pass
> preferring threads that last ran on the current CPU.  If you don't
> find anything, you just grab the first thing on the list.

The queue might be quite long.. maybe only scan the first N entries...

> 
> > 8/ If we had per-processor scheduling queues, how would that affect it?
> > Which element gets put on the queues? Does a KSE
> > stay on the run queue if it has un-run threads, even when it's running?
> > How do we handle the arrival of new runnable threads with a KSE
> > when it's running but a fellow KSE is not runnable. Do we 
> > bump the priority of the other KSE and hand it the new threads?
> 
> I'm not sure how this fits in that model unless you bind KSE's to
> CPU's or something similar.  Only threads really have affinity, KSE's
> don't really care if they migrate as they have no execution context
> that gets affected.

If you can bind KSEs to processors a bit and
have an affinity to a particular KSE, then you reduce the amount
of work you have to do to select the next thread to run.
It's a tradeoff. The selection might get very heavyweight if there are a
LOT of threads to select from. This could make a scheme that does a
selection between each thread to be run rather unscalable. If we had the
affinity 'built in' to the structures/lists then it would be an O(1)
operation, and more scalable.

(just an idea)

> 
> If the priorities are per-KSEgroup, then you get to assume that all threads in
> a group are equal in priority, which is true unless a particular thread
> temporarily gets a bump from priority propagation or the process assigns a
> thread to a realtime priority or some such.

I don't think that the priorities of all the threads are the same, but
rather that the priority for them all is based upon the same BASE priority
and statistics, i.e. the KSEG collects recent CPU usage and
its base priority degrades, taking all its KSEs and threads with it.
However I think that when a thread wakes up with an elevated priority
(as they do now), then a KSE needs to be boosted in priority to run it.
After that thread has returned to user mode, the next highest priority
in the list is run, etc. This sort-of suggests per-KSEG priority queues...
As the KSE runs lower and lower priority threads, its own priority
could be lowered. When its priority is lower than that of another KSE
on the run queues it loses the CPU. The question is whether the returning
syscall follows the execution path back into userland. If it doesn't
immediately, then it may never get to userland if the KSE lowers
its priority on moving to another thread, and loses the CPU. This could lead
to a case where a process has completed all its syscalls but
is not able to proceed in userland. Maybe this is what should happen,
but I doubt it..

[lots of my other comments missing.....]


> 
> -- 
> 
> John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
> PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
> "Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/
> 




From owner-freebsd-arch  Tue Nov 13 10:59:15 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail6.speakeasy.net (mail6.speakeasy.net [216.254.0.206])
	by hub.freebsd.org (Postfix) with ESMTP id 5C52E37B416
	for <arch@freebsd.org>; Tue, 13 Nov 2001 10:59:05 -0800 (PST)
Received: (qmail 8730 invoked from network); 13 Nov 2001 18:58:31 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail6.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <julian@elischer.org>; 13 Nov 2001 18:58:31 -0000
Message-ID: <XFMail.011113105904.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <Pine.BSF.4.21.0111130929230.98845-100000@InterJet.elischer.org>
Date: Tue, 13 Nov 2001 10:59:04 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Julian Elischer <julian@elischer.org>
Subject: RE: Thread scheduling in the kernel
Cc: arch@freebsd.org
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 13-Nov-01 Julian Elischer wrote:
> (I notice you only commented on the first half, but that's a lot better
> than the complete lack of interest from everyone else.....)

Well, that's because I think that there are some basic things that need to be
decided before we can make the decisions at the bottom of your e-mail.  I
think the first thing is that priorities need to be decided.  The real question
there is do we want per-thread priorities or per-ksegroup priorities?  If you
go totally with per-thread priorities, which you seem to be favoring now, and
just use ksegroup for nice and fixed priorities, then that makes kse groups
simpler at the expense of complicating KSE scheduling. :)

If we let each thread have a priority and maintain its own scheduling
parameters then I would be tempted to put threads on the runqueues rather than
KSEs, primarily because otherwise you have the problem of having to go update
the priorities of KSEs all the time when thread priorities change.  And since
you want a thread to run as soon as its priority allows, this means changing
the priority of all KSEs in its group so it gets to run on the first one that
becomes available.  This would point to a single priority in the KSE group that
all KSEs share, namely the highest priority of all runnable threads.  If the
list of runnable threads in the KSE group is priority sorted (as it should be)
this isn't all that difficult, as you look at the priority of the thread at the
head of the list.  However, every time that priority changes, you potentially
have to go shuffle KSEs around on the queues, rather than just moving that
one thread around on the queues (or putting it on the queue as the case may
be).

One comment about preemption: probably what we will go with is only preempting
for real time threads (including interrupt threads) and not preempting time
sharing threads until their quantum is up or they block.  The entire concept of
KSEs, as I understand it, is to serve as a holder for the quantum so that we
can give a multithreaded process its full quantum each go-around even if
threads block, in which case we split it across multiple threads.  In that case,
I think this might be a reasonable model:

- Put threads on the runqueues.
- During choosethread, we use the following algorithm:
  - If the highest priority thread is a time sharing or idle thread, our
    current process is a KSE process, and we still have quantum left (I am
    foreseeing a KEF_FORCESWITCH for forcing a KSE switch when quantum
    expires) then we will look for another thread in this kse group in
    priority order with a bias for threads that last ran on the current
    cpu for affinity purposes.  This may mean that we don't run the strictly
    highest priority thread in the system for the purposes of preserving
    quanta for time-shared processes.
  - Otherwise, we simply run the highest priority thread.
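
The decision above can be sketched in C as follows.  This is only an
illustration under assumptions: the structure layouts, the sched_class
names and the convention that a lower pri value is more urgent are made
up for the example, and the affinity bias is left out here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Scheduling classes; the names are hypothetical. */
enum sched_class { SC_REALTIME, SC_TIMESHARE, SC_IDLE };

struct thread {
    int pri;                    /* assumed: lower value = more urgent */
    enum sched_class cls;
    struct thread *kg_next;     /* next runnable thread in the same group */
};

struct ksegrp {
    struct thread *runnable;    /* priority-sorted, most urgent first */
    bool is_kse;                /* belongs to a KSE (multithreaded) process */
    int quantum_left;           /* remaining part of the group's quantum */
};

/*
 * top is the most urgent thread on the system run queue and cur is the
 * KSE group that currently owns the CPU (NULL if none).
 */
struct thread *
choosethread(struct thread *top, struct ksegrp *cur)
{
    if (top == NULL)
        return NULL;

    /* Time sharing or idle thread on top, and the current KSE process
     * still has quantum: keep the CPU inside the group. */
    if (top->cls != SC_REALTIME && cur != NULL && cur->is_kse &&
        cur->quantum_left > 0 && cur->runnable != NULL)
        return cur->runnable;

    /* Otherwise, simply run the highest priority thread. */
    return top;
}
```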

I think this will achieve the desired goal of a KSE (preserve quanta for
multithreading time-sharing processes across threads) while still allowing
things like priority propagation and preemption to work smoothly.  It's also
fairly simple.

If you use a priority bias for affinity, then you basically have a
constant, say 4 (that is random, prolly not the real value), and you
artificially bump the priority of threads with lastcpu == cpuid by 4
during your comparison.  This means you can stop walking the ksegroup list of
threads when you hit a thread whose priority is more than 4 levels less than
that of the highest priority thread.  Also, the first thread you hit that meets
the affinity requirement is the one you run; this should keep a (hopefully)
decent bound on the amount of list walking done.
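
A small sketch of that bounded walk, assuming a priority-sorted singly
linked list and the convention that a lower pri value means higher
priority.  AFFINITY_BIAS uses the placeholder value 4 from the text;
the structure and function names are hypothetical.

```c
#include <stddef.h>

#define AFFINITY_BIAS 4         /* placeholder value, not a tuned one */

struct thread {
    int pri;                    /* assumed: lower value = higher priority */
    int lastcpu;                /* CPU this thread last ran on */
    struct thread *next;        /* priority-sorted, highest priority first */
};

/*
 * Walk the group's runnable list from the head.  The first thread that
 * last ran on cpuid wins, but only while it is within AFFINITY_BIAS
 * levels of the head; past that point the bias can no longer make up
 * the difference, so the walk stops and the head is chosen.
 */
struct thread *
choose_with_affinity(struct thread *head, int cpuid)
{
    if (head == NULL)
        return NULL;

    for (struct thread *td = head; td != NULL; td = td->next) {
        if (td->pri - head->pri > AFFINITY_BIAS)
            break;              /* too far below the head, stop here */
        if (td->lastcpu == cpuid)
            return td;          /* first affine thread in range */
    }
    return head;                /* nothing affine in range */
}
```

The early break is what bounds the walk: only threads within the bias
window of the head are ever examined.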

-- 

John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Tue Nov 13 12:47:24 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from barry.mail.mindspring.net (barry.mail.mindspring.net [207.69.200.25])
	by hub.freebsd.org (Postfix) with ESMTP id E022237B41A
	for <freebsd-arch@FreeBSD.org>; Tue, 13 Nov 2001 12:47:15 -0800 (PST)
Received: from src-fvzagy98ow5 (pool-63.49.205.54.troy.grid.net [63.49.205.54])
	by barry.mail.mindspring.net (8.9.3/8.8.5) with SMTP id PAA10581
	for <freebsd-arch@FreeBSD.org>; Tue, 13 Nov 2001 15:47:13 -0500 (EST)
Message-Id: <3.0.6.32.20011113154711.009793e0@imatowns.com>
X-Sender: ggombert@imatowns.com
X-Mailer: QUALCOMM Windows Eudora Light Version 3.0.6 (32)
Date: Tue, 13 Nov 2001 15:47:11 -0500
To: freebsd-arch@FreeBSD.org
From: Glenn Gombert <ggombert@imatowns.com>
Subject: freebsd-arch@FreeBSD.org
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

A couple of questions --

>1/ Since threads running a syscall hit 'sleep' events
>the entities on the sleep queues must be the threads.

  Will the sleep queues (which mix threads from multiple CPUs) impact
performance as the number of threads dramatically increases?

> 2/ the entity that is scheduled onto the run queues is the KSE.
> (as the name suggests).

  Is there a number of threads per KSE that is optimum for performance?
Will this be affected by the upcalls that are made between the kernel and
userland? What determines the optimum number of threads to be created per
KSE (before another one is created for a particular application)?

> 3/ If we have only one run queue, then KSEs for several processors
> from the same process, may be on the same queue.

> 4/  If threads 'wake up' they are hung off a list of runnable threads
> somewhere. This list could be hanging off the process, or the KSE.
> (actually more likely the KSEgroup than the process but...)

  Does not one process serve as a 'container' for one KSEG and multiple
KSEs and threads? Does this process share the time quanta between all its
members, or is it the job of the UTS to make these types of decisions?


> 5/ If a KSE reaches the front of the queue, but the process
> that is running is not that for which that KSE has some affinity,
> does it get out of the way to allow another KSE in the queue
> to get run? or does it just run and 'switch' everything over to the new
> available processor? Maybe the scheduler looks for the KSE from the same
> group, that was assigned to that processor, and runs that, leaving
> the original KSE at the head of the queue?
> Maybe that happens until all the KSEs in the queue
> that were from that group have been run? In this case it becomes possible
> to always have a KSE from that group ready...

  Does the kernel scheduler make the decisions about scheduling (once a
thread has been created)? What is the relationship between the UTS and
the kernel scheduler (from the standpoint of time allocation when it comes
to KSEs and individual threads)?

> Maybe the KSE-GROUP is what is put onto the run queue, and KSEs from that
> group are put on all processors that look for work, until all of them
> have been run? (this would ensure that threads from the same process
> would all be run at the same time which is sometimes good, and sometimes
> bad, depending on the application.)

  How are the time quanta divided up between KSEs and threads? Who makes
the decision when each should be placed on the runqueue and run at a
particular time, when the responsibility is divided up between the UTS and
the kernel scheduler?

> 6/ When a Thread is made runnable it gets (in the present system) a
> priority. What priority does a KSE in the run queues have when it has
> threads of several different priorities? Do we sort them in priority order
> and drop the priority of the KSE(group) as we go through them
> until we have less priority than some other kse?

> 7/ when a KSE runs out of work, how does it decide whether there is work
> that should be stolen from a fellow KSE? How does processor affinity
> affect this?

 Is a KSE not bound to a particular processor with the KSEG able to
allocate resources between multiple processors?

> 8/ If we had per-processor scheduling queues, how would that affect it?
> Which element gets put on the queues? Does a KSE
> stay on the run queue if it has un-run threads, even when it's running?
> How do we handle the arrival of new runnable threads with a KSE
> when it's running but a fellow KSE is not runnable. Do we
> bump the priority of the other KSE and hand it the new threads?


> remember: here are the 4 structures:

> proc  -   owner of all resources (FDs, memory, user creds) except cpu

> Ksegroup -  owner of all scheduler controlling characteristics
> 	(e.g. nice, realtime, number of processors),  N per process.
> 	Owner of stats used for scheduling calculations.

> kse -	kind of a placeholder.  It gets scheduled onto
> 	a processor (by a yet un-named mechanism) and provides
> 	cpu-cycles for the execution of 'threads' (see next).
> 	Max. Of one per processor per KSE-group.

> thread -  The in-kernel incarnation of a user thread that is presently
> 	in the kernel for some reason (e.g. syscall, pagefault, etc)
> 	Holds ALL the state needed to resume after sleeping, and is the
> 	entity that is suspended when the thread hits a 'sleep'.
> 	"unlimited" per KSEgroup. probably have a short-term
> 	"favourite" KSE/processor.


  What is the relationship between processors and processes? Does not one
KSEG distribute multiple KSEs between multiple CPUs?

> When a thread blocks, the KSE looks for another thread to run, and if it
> doesn't find one, it will create one, and upcall back to the
> userland to see if there are more userland threads to run.
> (if not, it returns to yield the processor)

> The question that has been giving me headaches is the
> relationship between these elements, and
> the definitions of how these structures are linked up and moved
> around to provide fair efficient scheduling.

> If a KSE has a high priority thread and a low priority thread
> runnable in the kernel, but in reverse order, should it take
> the high priority from the higher prio. thread and process both,
> or should it order the threads and run the high prio one first.
> In this case what happens when a higher prio. thread becomes runnable
> while one is already running, and if the highest prio thread returns to
> userland, should the processor move to userland to follow it, or
> switch to the next priority thread in the kernel?
> Do all threads in the kernel have priority over all threads in userland?
> (this might be a reasonable decision).

  Does the UTS have any input into how time is apportioned
to each individual KSE / thread in the kernel runqueue, or is that
entirely up to the kernel scheduler?

  In general, does the memory allocation/reclamation scheme seem adequate
for all the KSEs/threads that will be created and destroyed with the new
implementation?




From owner-freebsd-arch  Tue Nov 13 13:30:15 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from blount.mail.mindspring.net (blount.mail.mindspring.net [207.69.200.226])
	by hub.freebsd.org (Postfix) with ESMTP id 2005D37B418
	for <arch@freebsd.org>; Tue, 13 Nov 2001 13:30:09 -0800 (PST)
Received: from src-fvzagy98ow5 (pool-63.49.207.166.troy.grid.net [63.49.207.166])
	by blount.mail.mindspring.net (8.9.3/8.8.5) with SMTP id QAA01085
	for <arch@freebsd.org>; Tue, 13 Nov 2001 16:30:06 -0500 (EST)
Message-Id: <3.0.6.32.20011113163004.009803c0@imatowns.com>
X-Sender: ggombert@imatowns.com (Unverified)
X-Mailer: QUALCOMM Windows Eudora Light Version 3.0.6 (32)
Date: Tue, 13 Nov 2001 16:30:04 -0500
To: arch@freebsd.org
From: Glenn Gombert <ggombert@imatowns.com>
Subject: RE: Thread scheduling in the kernel
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

> Well, that's because I think that there are some basic things that need to be
> decided before we can make the decisions at the bottom of your e-mail.  I
> think the first thing is that priorities need to be decided.  The real question
> there is do we want per-thread priorities or per-ksegroup priorities?  If you
> go totally with per-thread priorities, which you seem to be favoring now, and
> just use ksegroup for nice and fixed priorities, then that makes kse groups
> simpler at the expense of complicating KSE scheduling. :)

  Is not a KSE 'bound' to a particular CPU, with each thread in the KSE
given a specific amount of time by the kernel scheduler? How does the
UTS play into this (other than to sleep and wake up threads)?

> If we let each thread have a priority and maintain its own scheduling
> parameters then I would be tempted to put threads on the runqueues rather than
> KSEs, primarily because otherwise you have the problem of having to go update
> the priorities of KSEs all the time when thread priorities change.  And since
> you want a thread to run as soon as its priority allows, this means changing
> the priority of all KSEs in its group so it gets to run on the first one that

  What is the mechanism for this (kernel scheduling), or does the UTS become
involved as well? What is the impact on performance (if re-scheduling is
done on a per-thread basis)?


> becomes available.  This would point to a single priority in the KSE group that
> all KSEs share, namely the highest priority of all runnable threads.  If the
> list of runnable threads in the KSE group is priority sorted (as it should be)
> this isn't all that difficult, as you look at the priority of the thread at the
> head of the list.  However, every time that priority changes, you potentially
> have to go shuffle KSEs around on the queues, rather than just moving that
> one thread around on the queues (or putting it on the queue as the case may
> be).

  Is not time allocated between threads in a KSE based upon the total amount
of time available to the KSE? If it is not this way, do not threads
associated with a particular application gain an 'unfair' advantage when it
comes to running?

> One comment about preemption: probably what we will go with is only preempting
> for real time threads (including interrupt threads) and not preempting time
> sharing threads until their quantum is up or they block.  The entire concept of
> KSEs, as I understand it, is to serve as a holder for the quantum so that we
> can give a multithreaded process its full quantum each go-around even if
> threads block, in which case we split it across multiple threads.  In that case,
> I think this might be a reasonable model:


> I think this will achieve the desired goal of a KSE (preserve quanta for
> multithreading time-sharing processes across threads) while still allowing
> things like priority propagation and preemption to work smoothly.  It's also
> fairly simple.


> If you use a priority bias for affinity, then you basically have a
> constant, say 4 (that is random, prolly not the real value), and you
> artificially bump the priority of threads with lastcpu == cpuid by 4
> during your comparison.  This means you can stop walking the ksegroup list of
> threads when you hit a thread whose priority is more than 4 levels less than
> that of the highest priority thread.  Also, the first thread you hit that meets
> the affinity requirement is the one you run; this should keep a (hopefully)
> decent bound on the amount of list walking done.

  If KSEs are bound to a particular CPU, how does this affect KSEs and
threads on different CPUs?




From owner-freebsd-arch  Tue Nov 13 14: 4:47 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from net2.gendyn.com (nat2.gendyn.com [204.60.171.12])
	by hub.freebsd.org (Postfix) with ESMTP
	id D071D37B405; Tue, 13 Nov 2001 14:04:42 -0800 (PST)
Received: from [153.11.11.3] (helo=plunger.gdeb.com)
	by net2.gendyn.com with esmtp (Exim 2.12 #1)
	id 163lfW-000KuP-00; Tue, 13 Nov 2001 17:04:34 -0500
Received: from clcrtr.gdeb.com ([153.11.109.11])
	by plunger.gdeb.com  with SMTP id QAA01515;
	Tue, 13 Nov 2001 16:54:11 -0500 (EST)
Received: from gdeb.com (gpz.clc.gdeb.com [192.168.3.12])
	by clcrtr.gdeb.com (8.11.4/8.11.4) with ESMTP id fADMAHK47646;
	Tue, 13 Nov 2001 17:10:17 -0500 (EST)
	(envelope-from deischen@gdeb.com)
Message-ID: <3BF198E2.24EE658F@gdeb.com>
Date: Tue, 13 Nov 2001 17:04:18 -0500
From: Daniel Eischen <deischen@gdeb.com>
X-Mailer: Mozilla 4.78 [en] (X11; U; SunOS 5.8 sun4u)
X-Accept-Language: en
MIME-Version: 1.0
To: Julian Elischer <julian@elischer.org>
Cc: John Baldwin <jhb@FreeBSD.ORG>, arch@FreeBSD.ORG
Subject: Re: Thread scheduling in the kernel
References: <Pine.BSF.4.21.0111130929230.98845-100000@InterJet.elischer.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Julian Elischer wrote:
> 
> (I notice you only commented on the first half, but that's a lot better
> than the complete lack of interest from everyone else.....)
> 
> On Mon, 12 Nov 2001, John Baldwin wrote:
> 
> >
> > On 12-Nov-01 Julian Elischer wrote:
> > >
> > > In an attempt to get the next part of the KSE work designed (design before
> > > code you know.. a strange new concept) I've been trying to work out
> > > the "correct" scheduling methods for such a system.
> > >
> > > There are a few 'tricks' that need to be taken into account..
> > >
> > > a few notes..
> > >
> > >
> > > 1/ Since threads running a syscall hit 'sleep' events
> > > the entities on the sleep queues must be the threads.
> > >
> > > 2/ the entity that is scheduled onto the run queues is the KSE.
> > > (as the name suggests).
> > >
> > > 3/ If we have only one run queue, then KSEs for several processors
> > > from the same process, may be on the same queue.
> > >
> > > 4/  If threads 'wake up' they are hung off a list of runnable threads
> > > somewhere. This list could be hanging off the process, or the KSE.
> > > (actually more likely the KSEgroup than the process but...)
> >
> > It should hang off the group.
> 
> This was my original idea.  However I ended up splitting that queue up so
> that it was on each KSE and allowed a KSE with no work to steal work from
> another. i.e. a virtual single queue, with KSE affinity. If I bind KSEs to
> processors lightly, then I bind threads at the same time. (lightly)
> 
> The idea is that threads are put on the queue for the KSE on which they
> last ran. Only when a KSE runs out of runnable threads on its own list and
> still has the CPU will it try to steal work from another in the same group.
> 
> The downside is that there is no overall priority between threads in a
> group.. This is one thing I want to discuss... the queueing model.

I just want to make a couple comments without getting too involved
in how the kernel deals with threads, KSEs, and KSE groups.

I think that at first there will probably be only 1 UTS run
queue per KSE group.  This probably means that the UTS will
also hang blocked threads off its version of the KSE group.  I
guess in this case, unblock events from the kernel can be sent
to any KSE within the group.  But if the UTS wants to have a
run queue for each KSE, then the kernel should only be handling
the blocking and unblocking of threads within the same KSE
in which the thread originally entered the kernel.

I think the UTS will only set priorities for the KSE group.  It
doesn't make sense to me for the (application visible) priority
to be anywhere other than the KSE group.  If the kernel needs
to temporarily play with priorities for its own purposes (inheriting
priority when holding a mutex), then each thread probably needs
an active priority which is MAX(kse->inherited, kseg->prio).

-- 
Dan Eischen

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Tue Nov 13 14:23: 6 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205])
	by hub.freebsd.org (Postfix) with ESMTP id 5718A37B405
	for <arch@FreeBSD.ORG>; Tue, 13 Nov 2001 14:22:53 -0800 (PST)
Received: (qmail 4143 invoked from network); 13 Nov 2001 22:22:51 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <deischen@gdeb.com>; 13 Nov 2001 22:22:51 -0000
Message-ID: <XFMail.011113142251.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <3BF198E2.24EE658F@gdeb.com>
Date: Tue, 13 Nov 2001 14:22:51 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Daniel Eischen <deischen@gdeb.com>
Subject: Re: Thread scheduling in the kernel
Cc: arch@FreeBSD.ORG, Julian Elischer <julian@elischer.org>


On 13-Nov-01 Daniel Eischen wrote:
> Julian Elischer wrote:
>> 
>> (I notice you only commented on the first half, but that's a lot better
>> than the complete lack of interest from everyone else.....)
>> 
>> On Mon, 12 Nov 2001, John Baldwin wrote:
>> 
>> >
>> > On 12-Nov-01 Julian Elischer wrote:
>> > >
>> > > In an attempt to get the next part of the KSE work designed (design
>> > > before
>> > > code you know.. a strange new concept) I've been trying to work out
>> > > the "correct" scheduling methods for such a system.
>> > >
>> > > There are a few 'tricks' that need to be taken into account..
>> > >
>> > > a few notes..
>> > >
>> > >
>> > > 1/ Since threads running a syscall hit 'sleep' events
>> > > the entities on the sleep queues must be the threads.
>> > >
>> > > 2/ the entity that is scheduled onto the run queues is the KSE.
>> > > (as the name suggests).
>> > >
>> > > 3/ If we have only one run queue, then KSEs for several processors
>> > > from the same process, may be on the same queue.
>> > >
>> > > 4/  If threads 'wake up' they are hung off a list of runnable threads
>> > > somewhere. This list could be hanging off the process, or the KSE.
>> > > (actually more likely the KSEgroup than the process but...)
>> >
>> > It should hang off the group.
>> 
>> This was my original idea.  However I ended up splitting that queue up so
>> that it was on each KSE and allowed a KSE with no work to steal work from
>> another. i.e. a virtual single queue, with KSE affinity. If I bind KSEs to
>> processors lightly, then I bind threads at the same time. (lightly)
>> 
>> The idea is that threads are put on the queue for the KSE on which they
>> last ran. Only when a KSE runs out of runnable threads on its own list and
>> still has the CPU, will it try to steal work from another in the same group.
>> 
>> The downside is that there is no overall priority between threads in a
>> group.. This is one thing I want to discuss... the queueing model.
> 
> I just want to make a couple comments without getting too involved
> in how the kernel deals with threads, KSEs, and KSE groups.
> 
> I think that at first there will probably be only 1 UTS run
> queue per KSE group.  This probably means that the UTS will
> also hang blocked threads off its version of the KSE group.  I
> guess in this case, unblock events from the kernel can be sent
> to any KSE within the group.  But if the UTS wants to have a
> run queue for each KSE, then the kernel should only be handling
> the blocking and unblocking of threads within the same KSE
> in which the thread originally entered the kernel.
> 
> I think the UTS will only set priorities for the KSE group.  It
> doesn't make sense to me for the (application visible) priority
> to be anywhere other than the KSE group.  If the kernel needs
> to temporarily play with priorities for its own purposes (inheriting
> priority when holding a mutex), then each thread probably needs
> an active priority which is MAX(kse->inherited, kseg->prio).

What about the priorities passed in to condition variables and msleep/tsleep?

That is why I think Julian wanted per-thread priorities.  Also, the propagated
priority is _definitely_ a thread and not a KSE property, since the thread owns
the lock that has the associated priority, not the KSE.

-- 

John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Tue Nov 13 14:40:17 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP
	id AFF2537B418; Tue, 13 Nov 2001 14:40:10 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id OAA00143;
	Tue, 13 Nov 2001 14:25:13 -0800 (PST)
Date: Tue, 13 Nov 2001 14:25:11 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: Daniel Eischen <deischen@gdeb.com>
Cc: John Baldwin <jhb@FreeBSD.ORG>, arch@FreeBSD.ORG
Subject: Re: Thread scheduling in the kernel
In-Reply-To: <3BF198E2.24EE658F@gdeb.com>
Message-ID: <Pine.BSF.4.21.0111131424180.99511-100000@InterJet.elischer.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII


On Tue, 13 Nov 2001, Daniel Eischen wrote:
> I think the UTS will only set priorities for the KSE group.  It
> doesn't make sense to me for the (application visible) priority
> to be anywhere other than the KSE group.  If the kernel needs
> to temporarily play with priorities for its own purposes (inheriting
> priority when holding a mutex), then each thread probably needs
> an active priority which is MAX(kse->inherited, kseg->prio).

MAX(thread->inherited, kseg->prio) ?


> 
> -- 
> Dan Eischen
> 




From owner-freebsd-arch  Tue Nov 13 14:46:35 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from net2.gendyn.com (nat2.gendyn.com [204.60.171.12])
	by hub.freebsd.org (Postfix) with ESMTP
	id 62EEA37B405; Tue, 13 Nov 2001 14:46:30 -0800 (PST)
Received: from [153.11.11.3] (helo=plunger.gdeb.com)
	by net2.gendyn.com with esmtp (Exim 2.12 #1)
	id 163mJv-000M2p-00; Tue, 13 Nov 2001 17:46:19 -0500
Received: from clcrtr.gdeb.com ([153.11.109.11])
	by plunger.gdeb.com  with SMTP id RAA02907;
	Tue, 13 Nov 2001 17:35:55 -0500 (EST)
Received: from gdeb.com (gpz.clc.gdeb.com [192.168.3.12])
	by clcrtr.gdeb.com (8.11.4/8.11.4) with ESMTP id fADMq6K47675;
	Tue, 13 Nov 2001 17:52:06 -0500 (EST)
	(envelope-from deischen@gdeb.com)
Message-ID: <3BF1A2B0.A0BC7469@gdeb.com>
Date: Tue, 13 Nov 2001 17:46:08 -0500
From: Daniel Eischen <deischen@gdeb.com>
X-Mailer: Mozilla 4.78 [en] (X11; U; SunOS 5.8 sun4u)
X-Accept-Language: en
MIME-Version: 1.0
To: John Baldwin <jhb@FreeBSD.org>
Cc: arch@FreeBSD.org, Julian Elischer <julian@elischer.org>
Subject: Re: Thread scheduling in the kernel
References: <XFMail.011113142251.jhb@FreeBSD.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

John Baldwin wrote:
> 
> On 13-Nov-01 Daniel Eischen wrote:
> > Julian Elischer wrote:
> >>
> >> (I notice you only commented on the first half, but that's a lot better
> >> than the complete lack of interest from everyone else.....)
> >>
> >> On Mon, 12 Nov 2001, John Baldwin wrote:
> >>
> >> >
> >> > On 12-Nov-01 Julian Elischer wrote:
> >> > >
> >> > > In an attempt to get the next part of the KSE work designed (design
> >> > > before
> >> > > code you know.. a strange new concept) I've been trying to work out
> >> > > the "correct" scheduling methods for such a system.
> >> > >
> >> > > There are a few 'tricks' that need to be taken into account..
> >> > >
> >> > > a few notes..
> >> > >
> >> > >
> >> > > 1/ Since threads running a syscall hit 'sleep' events
> >> > > the entities on the sleep queues must be the threads.
> >> > >
> >> > > 2/ the entity that is scheduled onto the run queues is the KSE.
> >> > > (as the name suggests).
> >> > >
> >> > > 3/ If we have only one run queue, then KSEs for several processors
> >> > > from the same process, may be on the same queue.
> >> > >
> >> > > 4/  If threads 'wake up' they are hung off a list of runnable threads
> >> > > somewhere. This list could be hanging off the process, or the KSE.
> >> > > (actually more likely the KSEgroup than the process but...)
> >> >
> >> > It should hang off the group.
> >>
> >> This was my original idea.  However I ended up splitting that queue up so
> >> that it was on each KSE and allowed a KSE with no work to steal work from
> >> another. i.e. a virtual single queue, with KSE affinity. If I bind KSEs to
> >> processors lightly, then I bind threads at the same time. (lightly)
> >>
> >> The idea is that threads are put on the queue for the KSE on which they
> >> last ran. Only when a KSE runs out of runnable threads on its own list and
> >> still has the CPU, will it try to steal work from another in the same group.
> >>
> >> The downside is that there is no overall priority between threads in a
> >> group.. This is one thing I want to discuss... the queueing model.
> >
> > I just want to make a couple comments without getting too involved
> > in how the kernel deals with threads, KSEs, and KSE groups.
> >
> > I think that at first there will probably be only 1 UTS run
> > queue per KSE group.  This probably means that the UTS will
> > also hang blocked threads off its version of the KSE group.  I
> > guess in this case, unblock events from the kernel can be sent
> > to any KSE within the group.  But if the UTS wants to have a
> > run queue for each KSE, then the kernel should only be handling
> > the blocking and unblocking of threads within the same KSE
> > in which the thread originally entered the kernel.
> >
> > I think the UTS will only set priorities for the KSE group.  It
> > doesn't make sense to me for the (application visible) priority
> > to be anywhere other than the KSE group.  If the kernel needs
> > to temporarily play with priorities for its own purposes (inheriting
> > priority when holding a mutex), then each thread probably needs
                                              ^^^^^^
> > an active priority which is MAX(kse->inherited, kseg->prio).
                                    ^^^ s/kse/thread
Sorry, I meant thread above, not kse.

> What about the priorities passed in to condition variables and msleep/tsleep?

The KSE group has the base priority from which all member threads
inherit.  The active priority is stored in each thread and is the
maximum of the KSE group's (base) priority and any priority that the
thread inherits from synchronization objects.
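
The rule in the paragraph above can be written down as a small sketch. The
struct and field names here are hypothetical, not taken from the KSE source,
and the sketch assumes that a larger number means a more urgent priority,
which is how the MAX() in this thread reads. FreeBSD's in-kernel priority
numbers run the other way (lower is more urgent), so real code would use the
opposite comparison.

```c
/*
 * Sketch only: hypothetical names, not the kernel's real structures.
 * Assumption: a larger number means a more urgent priority.
 */
#define TD_NO_INHERIT	(-1)	/* below every valid priority in this sketch */

struct ksegrp_sketch {
	int kg_base_pri;	/* application-visible priority of the group */
};

struct thread_sketch {
	const struct ksegrp_sketch *td_ksegrp;	/* group the thread belongs to */
	int td_inherited_pri;	/* lent by a lock, or TD_NO_INHERIT */
};

/* Active priority: the more urgent of the group base and any inherited one. */
static int
thread_active_pri(const struct thread_sketch *td)
{
	int base = td->td_ksegrp->kg_base_pri;

	return (td->td_inherited_pri > base ? td->td_inherited_pri : base);
}
```

With a group base of 10, a thread with nothing inherited runs at 10, and a
thread that has been lent 25 by a lock runs at 25 until the loan ends.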

> That is why I think Julian wanted per-thread priorities.  Also, the priority
> propagation priority is _definitely_ a thread and not a KSE property, since the
> thread owns the lock that has the associated priority, not the KSE.

Yep, sorry I did mean thread above, not KSE.

-- 
Dan Eischen



From owner-freebsd-arch  Tue Nov 13 16:20:28 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP id 87C7E37B405
	for <freebsd-arch@FreeBSD.org>; Tue, 13 Nov 2001 16:20:16 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id QAA00597;
	Tue, 13 Nov 2001 16:13:53 -0800 (PST)
Date: Tue, 13 Nov 2001 16:13:51 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: Glenn Gombert <ggombert@imatowns.com>
Cc: freebsd-arch@FreeBSD.org
Subject: Re: freebsd-arch@FreeBSD.org
In-Reply-To: <3.0.6.32.20011113154711.009793e0@imatowns.com>
Message-ID: <Pine.BSF.4.21.0111131533210.298-100000@InterJet.elischer.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN
Content-Transfer-Encoding: QUOTED-PRINTABLE


On Tue, 13 Nov 2001, Glenn Gombert wrote:

> A couple of questions --

OK, but remember this stuff is still under discussion, so any answer given
here may be wrong :-)

>
> >1/ Since threads running a syscall hit 'sleep' events
> >the entities on the sleep queues must be the threads.
>
>   Will the sleep queues (which mix threads from multiple CPUs) impact
> performance as the number of threads dramatically increase ..

Not really.. For several reasons...

There are an awful lot of sleep queues and they are manipulated using
O(1) operations. "wakeup_one()" is also independent of the number of
entries, and we don't expect the average number of threads-per-process to
be much more than 1.

> > 2/ the entity that is scheduled onto the run queues is the KSE.
> > (as the name suggests).
>
>   Is there a number of threads per KSE that is optimum for
> performance? will this be impacted by the UpCalls that are made
> between the Kernel and User land..what determines the optimum number
> of threads to be created per KSE (before another one is created for a
> particular application)..

Up to a sane limit, the number of threads per KSE/KSEGROUP is unlimited
and controlled by the UTS. The kernel will always ask the UTS if it has
another thread to run whenever the KSE discovers it has no work to do. The
UTS has the option of either running a new thread, or returning a 'yield()'
to the kernel.

>
> > 3/ If we have only one run queue, then KSEs for several processors
> > from the same process, may be on the same queue.
>
> > 4/  If threads 'wake up' they are hung off a list of runnable threads
> > somewhere. This list could be hanging off the process, or the KSE.
> > (actually more likely the KSEgroup than the process but...)
>
>  .. does not one process serve as a 'container' for one KSEG and
> multiple KSE and Threads ?? does this process share the time quanta
> between all its member(s) or is it the job of the UTS to make these
> type of decisions??

There is a one-to-many relationship at each step..

1 process to N KSE groups (a process may start several KSE groups and
assign different scheduling characteristics to each).
1 KSEGRP to N KSEs (each KSE can be used to reserve some cycles on a
processor). It makes no sense to have more KSEs per KSEGRP than there are
processors.
1 KSEGRP to N threads.  In fact it makes no sense to have more active
KSEs than threads, and there tends to be a thread assigned to each KSE
at minimum. (A yielded KSE may not have a thread assigned to it, but then
I said ACTIVE.. :-)
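
As a rough picture of those relationships, here is a sketch in C. All of the
struct and field names are hypothetical (the real layout is still being
designed); only the one-to-many shape and the "no more KSEs than processors"
rule come from the description above.

```c
#include <stdbool.h>

/* Sketch only: hypothetical names illustrating the one-to-many links. */
struct proc_sketch;			/* owns N KSE groups */

struct ksegrp_sketch {
	struct proc_sketch *kg_proc;	/* the one owning process */
	int kg_nkses;			/* KSEs currently in this group */
	int kg_nthreads;		/* threads; the count is up to the UTS */
};

struct kse_sketch {
	struct ksegrp_sketch *ke_ksegrp;	/* each KSE has exactly one group */
};

struct thread_sketch {
	struct ksegrp_sketch *td_ksegrp;	/* each thread has exactly one group */
	struct kse_sketch *td_kse;		/* KSE running it, or a null pointer */
};

/* A group gains nothing from having more KSEs than there are processors. */
static bool
ksegrp_may_add_kse(const struct ksegrp_sketch *kg, int ncpus)
{
	return (kg->kg_nkses < ncpus);
}
```

On a two-processor machine a group holding one KSE may take a second, and a
group that already holds two may not.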

>
>
> > 5/ If a KSE reaches the front of the queue, but the process
> > that is running is not that for which that KSE has some affinity,
> > does it get out of the way to allow another KSE in the queue
> > to get run? or does it just run and 'switch' everything over to the new
> > available processor? Maybe the scheduler looks for the KSE from the same
> > group, that was assigned to that processor, and runs that, leaving
> > the original KSE at the head of the queue?
> > Maybe that happens until all the KSEs in the queue
> > that were from that group have been run? In this case it becomes possible
> > to always have a KSE from that group ready...
>
>  Does the kernel scheduler make the decisions about scheduling (once a
> Thread has been created) ?..what is the relationship between the UTS
> and the Kernel Scheduler (from the standpoint of time allocation when
> it comes to Use's and individual threads)

The kernel scheduler decides which threads (that are currently IN THE
KERNEL) should be run, but as soon as control passes up to userland, the
UTS can decide which thread is run. The UTS can probably influence the
kernel scheduler's decision. Probably the rule "All threads in the kernel
have priority over all threads in the userland" will be the default
behaviour, though we might be able to adjust this on a per KSEGRP basis,
if we can think of an alternative..
The result of this would be that at the start of a scheduler quantum
all runnable completing syscalls would be completed before the upcall is
made to the UTS. The UTS can then select which of the returned threads
it wants to run...

>
> > Maybe the KSE-GROUP is what is put onto the run queue, and KSEs from that
> > group are put on all processors that look for work, until all of them
> > have been run? (this would ensure that threads from the same process
> > would all be run at the same time which is sometimes good, and sometimes
> > bad, depending on the application.
>
>  How is the time quanta divided up between KSE's and Threads ??...who makes
> the decision when each should be placed on the runqueue and run at a
> particular time when the responsibility is divided up between the UTS and
> kernel scheduler...
                  ^^^
what's with the ^E's??

A KSE gets a quantum when its active priority is the highest among
runnable KSEs. It will run each thread it has until completion, in turn.
In this case "completion" is one of:
1/ returns to userland
2/ blocks
3/ self destructs
4/ quantum ends.

When it has no runnable threads left to run in the kernel, the next action
for the KSE is to upcall to the UTS.
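
A minimal sketch of that decision, with hypothetical names throughout. The
four "completion" reasons are the ones listed above. What happens when the
quantum ends is an assumption of this sketch: the KSE is taken to give up the
CPU rather than start another thread.

```c
/* Sketch only: hypothetical names; the reasons mirror the list above. */
enum td_completion {
	TD_RETURNED_TO_USER,	/* 1/ returns to userland */
	TD_BLOCKED,		/* 2/ blocks */
	TD_SELF_DESTRUCTED,	/* 3/ self destructs */
	TD_QUANTUM_ENDED	/* 4/ quantum ends */
};

enum kse_action {
	KSE_RUN_NEXT_THREAD,	/* more kernel work is runnable in the group */
	KSE_UPCALL_TO_UTS,	/* nothing left in the kernel: ask the UTS */
	KSE_GIVE_UP_CPU		/* quantum used up (assumed behaviour) */
};

/* Decide what a KSE does once its current thread has "completed". */
static enum kse_action
kse_next_action(enum td_completion why, int runnable_in_kernel)
{
	if (why == TD_QUANTUM_ENDED)
		return (KSE_GIVE_UP_CPU);
	if (runnable_in_kernel > 0)
		return (KSE_RUN_NEXT_THREAD);
	return (KSE_UPCALL_TO_UTS);
}
```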

>
> > 6/ When a Thread is made runnable it gets (in the present system) a
> > priority. What priority does a KSE in the run queues have when it has
> > threads of several different priorities? Do we sort them in priority order
> > and drop the priority of the KSE(group) as we go through them
> > until we have less priority than some other kse?
>
> > 7/ when a KSE runs out of work, how does it decide whether there is work
> > that should be stolen from a fellow KSE? How does processor affinity
> > affect this?
>
>  Is a KSE not bound to a particular processor with the KSEG able to
> allocate resources between multiple processors?

This is open to debate. It makes no sense to have more KSEs than
processors. It may also be useful to bind a KSE to a particular processor.
Since threads can migrate between KSEs in the same KSEGRP, it may mean
that you'd have to make a special KSEGRP with a single KSE to confine the
threads to a single CPU.

>
> > 8/ If we had per-processor scheduling queues, How would that affect it?
> > Which element gets put on the queues? Does a KSE
> > stay on the run queue if it has un-run threads, even when it's running?
> > How do we handle the arrival of new runnable threads with a KSE
> > when it's running but a fellow KSE is not runnable. Do we
> > bump the priority of the other KSE and hand it the new threads?
>
>
> > remember: here are the 4 structures:
>
> > proc  -   owner of all resources (FDs, memory, user creds) except cpu
>
> > Ksegroup -  owner of all scheduler controlling characteristics
> > 	(e.g. nice, realtime, number of processors),  N per process.
> > 	Owner of stats used for scheduling calculations.
>
> > kse -	kind of a placeholder.  It gets scheduled onto
> > 	a processor (by a yet un-named mechanism) and provides
> > 	cpu-cycles for the execution of 'threads' (see next).
> > 	Max. Of one per processor per KSE-group.
>
> > thread -  The in-kernel incarnation of a user thread that is presently
> > 	in the kernel for some reason (e.g. syscall, pagefault, etc)
> > 	Holds ALL the state needed to resume after sleeping, and is the
> > 	entity that is suspended when the thread hits a 'sleep'.
> > 	"unlimited" per KSEgroup. probably have a short-term
> > 	"favourite" KSE/processor.
>
>
>  What is the relationship between processors and processes?? Does not one
> KSEG distribute multiple KSE's between multiple CPU's?

Yes..

KSEs are the vehicle of concurrency, as they can be running at the same time
on different processors. Theoretically KSEs in the same KSEGRP should not
directly compete with each other as there can never be more of them than
there are processors. KSEs from a different KSEGRP compete in the same
way that processes now compete.

>
> > When a thread blocks, the KSE looks for another thread to run, and if it
> > doesn't find one, it will create one, and upcall back to the
> > userland to see if there are more userland threads to run.
> > (if not, it returns to yield the processor)
>
> > The question that has been giving me headaches is the
> > relationship between these elements, and
> > the definitions of how these structures are linked up and moved
> > around to provide fair efficient scheduling.
>
> > If a KSE has a high priority thread and a low priority thread
> > runnable in the kernel, but in reverse order, should it take
> > the high priority from the higher prio. thread and process both,
> > or should it order the threads and run the high prio one first.
> > In this case what happens when a higher prio. thread becomes runnable
> > while one is already running, and if the highest prio thread returns to
> > userland, should the processor move to userland to follow it, or
> > switch to the next priority thread in the kernel.?
> > Do all threads in the kernel have priority over all threads in userland?
> > (this might be a reasonable decision).
>
>   Does the UTS have any input into the priority of how time is apportioned
> to each individual KSE / Thread in the kernel runqueue??..or is that
> entirely up to the kernel scheduler ...

The KSE is a placeholder to which quanta are assigned for the purpose of
running any available and runnable threads. If there are no runnable
threads (including the one in userland) the KSE will not request any
cycles. The KSE applies for these cycles at the priority of the highest
priority thread waiting, where the priority
of the thread is a combination of inherited priority and KSEGROUP-wide
general priority characteristics (e.g. nice, etc.).
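
To make that concrete, here is a sketch of how a KSE might work out the
priority it applies at. The names are hypothetical, and the sketch assumes
non-negative priorities where a larger number is more urgent (the real kernel
numbers priorities the other way round).

```c
#include <stddef.h>

/* Sketch only: hypothetical names. Larger number = more urgent (assumed). */
#define KSE_NO_REQUEST	(-1)	/* no runnable thread: request no cycles */

struct thread_sketch {
	int td_active_pri;	/* inherited priority combined with group-wide one */
	int td_runnable;	/* non-zero if the thread could use cycles now */
};

/* Priority at which the KSE applies for cycles: its most urgent runnable thread. */
static int
kse_request_pri(const struct thread_sketch *tds, size_t n)
{
	int best = KSE_NO_REQUEST;

	for (size_t i = 0; i < n; i++) {
		if (tds[i].td_runnable && tds[i].td_active_pri > best)
			best = tds[i].td_active_pri;
	}
	return (best);
}
```

Recomputing this after each thread "completes" gives the drop to the next
highest thread's priority.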

Possibly, as the highest priority threads are "completed" the priority of
the KSE may effectively drop to that of the next highest thread.
It is conceivable that it may drop below that of a competing KSE, which
may under some circumstances produce a pre-emption.
(this is still under discussion). It's not completely obvious at which
point the UTS is called to allow processing in userland to continue, but I
suspect that it is when there are no more runnable threads in the kernel,
and there are no more KSEs of higher priority requesting CPU.

There is also the question of whether the raised priority from IO that is
present in current UNIX systems should be carried over to the UTS when it
finally gets control, or whether it should be run at its BASE priority.

Presently if you do IO, your process gets a boost in priority when the IO
completes, so that you are quickly scheduled to process the results
of the IO and then request more IO (at which time you
probably sleep again). This is to help interactive processes vs.
batch processes. It is not obvious what the right thing to do is for
a UTS that has both batch and interactive threads....


>
>  In general does the memory allocation/reclamation scheme seem adequate
> for all the KSE's/Threads that will be created and destroyed with the new
> implementation...




From owner-freebsd-arch  Tue Nov 13 17: 0:26 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP id C539C37B405
	for <arch@freebsd.org>; Tue, 13 Nov 2001 17:00:17 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id QAA00694;
	Tue, 13 Nov 2001 16:44:28 -0800 (PST)
Date: Tue, 13 Nov 2001 16:44:26 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: Glenn Gombert <ggombert@imatowns.com>
Cc: arch@freebsd.org
Subject: RE: Thread scheduling in the kernel
In-Reply-To: <3.0.6.32.20011113163004.009803c0@imatowns.com>
Message-ID: <Pine.BSF.4.21.0111131614160.298-100000@InterJet.elischer.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN
Content-Transfer-Encoding: QUOTED-PRINTABLE


On Tue, 13 Nov 2001, Glenn Gombert wrote:

> > Well, that's cause I think that there are some basic things that need to
> > be decided before we can make the decisions at the bottom of your e-mail.
> > I think the first thing is that priorities need to be decided.  The real
> > question there is do we want per-thread priorities or per-ksegroup
> > priorities?  If you go totally with per-thread priorities which you seem
> > to be favoring now and just use ksegroup for nice and fixed priorities,
> > then that makes kse groups simpler at the expense of complicating KSE
> > scheduling. :)
>
>  Is not a KSE 'bound' to a particular CPU, with each thread in the KSE
> given a specific amount of time by the kernel scheduler ??. how does
> the UTS play in this (other than to sleep and wakeup threads) ...

Threads become runnable when whatever they were blocking on allows
them to run. A KSE (this is still open to discussion) becomes runnable when
there is at least one runnable thread that it could provide cycles to.

The KSE may not be bound to a single processor (though it MIGHT be)
but just able to hop in to take any cycles on any CPU available.
(actually since threads can migrate between KSEs in the same group, they
are actually equivalent, so you might select the KSE that was last on this
processor if you wanted, but it may not gain you much.)

You could put the KSEGRP on the run queue but the difficulty comes with
the accounting. If you take it off to run a KSE on its behalf, then
what happens if another processor becomes available...? It's not on the
run queue.. so even though it may be able to use the extra horsepower
it isn't going to be asked..  If it stays at the run queue head
until it has run out of threads, then it may never leave the head,
as new threads may continually be coming available.
By putting KSEs on the run queues and removing them when they are run, you
can ensure that when their quantum is completed, they are placed back onto
the queue at the tail end..
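
The "take it off when it runs, put it back at the tail" discipline can be
sketched as a plain FIFO. The names are hypothetical, and a single queue with
no priority levels is a simplification made for this sketch; it is not how the
real run queues are organised.

```c
#include <stddef.h>

/* Sketch only: hypothetical names; one flat FIFO, no priority levels. */
struct kse_sketch {
	int ke_id;			/* identifier, only for illustration */
	struct kse_sketch *ke_next;	/* next KSE on the run queue */
};

struct runq_sketch {
	struct kse_sketch *rq_head;
	struct kse_sketch *rq_tail;
};

/* Put a KSE at the tail, e.g. when its quantum has completed. */
static void
runq_add_tail(struct runq_sketch *rq, struct kse_sketch *ke)
{
	ke->ke_next = NULL;
	if (rq->rq_tail != NULL)
		rq->rq_tail->ke_next = ke;
	else
		rq->rq_head = ke;
	rq->rq_tail = ke;
}

/* Remove and return the KSE at the head, or NULL if the queue is empty. */
static struct kse_sketch *
runq_choose(struct runq_sketch *rq)
{
	struct kse_sketch *ke = rq->rq_head;

	if (ke != NULL) {
		rq->rq_head = ke->ke_next;
		if (rq->rq_head == NULL)
			rq->rq_tail = NULL;
		ke->ke_next = NULL;
	}
	return (ke);
}
```

Because a running KSE is off the queue, a second processor that becomes free
picks up the next KSE in line, and a KSE whose quantum ends rejoins behind
everything that was already waiting.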

There are other answers to these problems.. that's what we need to
discuss..
"who has priority?" - it seems clear there is a component from both the
	thread and the KSEGRP.. selected by the KSE..
"how do we do the queueing to maintain fairness and responsiveness"
	- I think by queueing KSEs but hopefully someone else
	has a REALLY SNEAKY and CUTE solution :-)


>
> > If we let each thread have a priority and maintain its own scheduling
> > parameters then I would be tempted to put threads on the runqueue's
> > rather than kse's primarily because you then have the problem of having
> > to go update the priorities of KSE's all the time when thread priorities
> > change.  And since you want a thread to run as soon as its priority
> > allows, this means changing the priority of all KSE's in its group so it
> > gets to run on the first one that
>
>  what is the mechanism for this (kernel scheduling ) or does the UTS
> become involved as well ? What is the impact on performance (if
> re-scheduling is done on a per-thread basis)...
>
>
> > becomes available.  This would point to a single priority in the KSE
> > group that all KSE's share that is the highest priority of all runnable
> > threads.  If the list of runnable threads in the KSE group is priority
> > sorted (as it should be) this isn't but so difficult as you look at the
> > priority of the thread at the head of the list.  However, every time that
> > priority changes, you have to go shuffle KSE's around on the queue's
> > potentially, rather than just moving that one thread around on the
> > queue's (or putting it on the queue as the case may be).
>
>  Is not time allocated between Threads in a KSE based upon the total
> amount of time available to the KSE.. if it is not this way , does not
> Threads associated with a particular application gain an 'unfair'
> advantage when it comes to running ...

Fairness is a very important criterion. Time is allocated between threads
in a KSE in priority order, with no real pre-emption between them. We
are in the kernel. We control the code.. We can be sure that within the
kernel, codepaths are short before a "completion" of some sort occurs.
(Even if that event is actually the thread re-blocking). When all kernel
activity has completed, the UTS can be called.. I cannot imagine that it
would be wise to call the UTS when there are still runnable threads stuck
in a semi-completed state within the kernel.


>
> > One comment about preemption: probably what we will go with is only
> > preempting for real time threads (including interrupt threads) and not
> > preempt time sharing threads until their quantum is up or they block.  The
> > entire concept of KSE's as I understand it, is to serve as a holder for the
> > quantum so that we can give a multithreaded process its full quantum each
> > go-around even if threads block in which case we split it across multiple
> > threads.  In that case, I think this might be a reasonable model:
>
>
> > I think this will achieve the desired goal of a KSE (preserve quanta for
> > multithreading time-sharing processes across threads) while still allowing
> > things like priority propagation and preemption to work smoothly.  It's
> > also fairly simple.
>
>
>
> > If you use a priority bias for affinity, then that means you basically have
> > a constant, say 4 (that is random, prolly not the real value) then you will
> > basically artificially bump the priority of threads with lastcpu == cpuid
> > by 4 during your comparison.  This means you can stop walking the ksegroup
> > list of threads when you hit a thread whose priority is more than 4 levels
> > less than that of the highest priority thread.  Also, the first thread you
> > hit that meets the affinity requirement is the one you run, this should
> > keep a (hopefully) decent bound on the amount of list walking done.
>
>   If KSE's are bound to a particular CPU, how does this affect KSE's &
> Threads on different CPU'

Threads are like water. They flow between any available KSEs for their
KSEGRP (with a slight preference for one on their last processor).
KSEs from the same KSEGRP have the same priority and can therefore never
pre-empt each other on the same processor, thus it makes no sense to have
more of them than there are processors.
Binding them to processors is also dubious.
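
The priority-bias affinity walk described in the quoted text can be
sketched roughly as follows. This is a userland illustration only, not the
scheduler's code: struct aff_thread, choose_thread() and AFFINITY_BIAS are
hypothetical names, the value 4 is the placeholder from the quoted text,
and a lower number is assumed to mean a higher priority.

```c
#include <assert.h>
#include <stddef.h>

#define AFFINITY_BIAS 4		/* placeholder value taken from the quoted text */

struct aff_thread {
	int priority;			/* lower value = higher priority (assumption) */
	int lastcpu;			/* CPU this thread last ran on */
	struct aff_thread *next;	/* next thread in the priority-sorted list */
};

/*
 * Pick a thread for cpuid from a priority-sorted list.  The walk stops
 * at the first thread more than AFFINITY_BIAS levels below the head;
 * the first thread inside that window that last ran on cpuid wins,
 * otherwise the head of the list is chosen.
 */
static struct aff_thread *
choose_thread(struct aff_thread *head, int cpuid)
{
	struct aff_thread *td;

	for (td = head; td != NULL; td = td->next) {
		assert(td->priority >= head->priority);	/* list must be sorted */
		if (td->priority - head->priority > AFFINITY_BIAS)
			break;
		if (td->lastcpu == cpuid)
			return (td);
	}
	return (head);
}
```

The window check is what bounds the walk: once a thread is too far below
the head to win even with the bias applied, nothing after it can win either.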

If KSE A runs on processor A, then KSE B must run on processor B
unless it is already busy. If KSE A finishes, and processor B is still
busy, then KSE B can run on processor A, but this is functionally identical
to the case where the 3rd KSE (that was keeping B busy) finished
and KSE B ran on processor B, since both KSE A and KSE B are drawing from
the same pool of runnable threads. It MAY make some small sense to
say "Hey that KSEGRP has another quantum and since Processor B is busy,
we'll run the KSE for Processor A again, in KSE B's place" but there
is no real gain in doing so.

One case to consider is as follows:

 KSE A is running at raised priority '3' (lower numbers mean higher
priority) because it is running a thread (T) that holds a mutex needed by
a high priority process. Thread (S) becomes runnable at a higher priority
(2). If there is a KSE (B) for that KSEGRP available, it is made runnable
and its priority is set to (2). Is it possible that it might pre-empt the
KSE from the same process group (A) on the same processor if there is a
thread of priority (1) running on the other processor? How is this
different from pre-empting the thread (T) within (A), and running (S)
within (A) instead?

This all needs thrashing out, and that is what I'm trying to achieve here
on -arch.





From owner-freebsd-arch  Wed Nov 14 16: 9:39 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from relay.gnf.org (relay.gnf.org [208.44.31.36])
	by hub.freebsd.org (Postfix) with ESMTP id E962637B417
	for <arch@freebsd.org>; Wed, 14 Nov 2001 16:09:35 -0800 (PST)
Received: from mail.gnf.org (smtp.gnf.org [10.0.0.11])
	by relay.gnf.org (8.11.6/8.11.6) with ESMTP id fAF09YJ15216
	for <arch@freebsd.org>; Wed, 14 Nov 2001 16:09:34 -0800
Received: by mail.gnf.org (Postfix, from userid 888)
	id A436511E504; Wed, 14 Nov 2001 16:06:37 -0800 (PST)
Received: from localhost (localhost [127.0.0.1])
	by mail.gnf.org (Postfix) with ESMTP id A019B11A572
	for <arch@freebsd.org>; Wed, 14 Nov 2001 16:06:37 -0800 (PST)
Date: Wed, 14 Nov 2001 16:06:37 -0800 (PST)
From: Gordon Tetlow <gordont@gnf.org>
To: <arch@freebsd.org>
Subject: rc.d issues
Message-ID: <Pine.LNX.4.33.0111141557250.19247-100000@smtp.gnf.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

There are a couple of issues with porting the rc.d infrastructure that
need to be addressed before going forward. Most notable is NetBSD's use of
(for example) $ipfilter while FreeBSD uses $ipfilter_enable.

Not wanting to break POLA, I was thinking about hacking /etc/rc.subr to 
check $<keyword> and if that is unset, check $<keyword>_enable. Any 
thoughts?

I have yet to see any thoughts, criticisms, critiques or anything of the
like for the initial patch that I posted (plug
http://hobbes.melthusia.org/~gordont/rc_ng.diff). So I'm going to continue
working along my current path. I've just moved so it's slowed up a bit,
but hopefully I'll be able to return to it shortly.

-gordon




From owner-freebsd-arch  Wed Nov 14 16:17:25 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP id 1E35737B405
	for <freebsd-arch@freebsd.org>; Wed, 14 Nov 2001 16:15:47 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fAF0Flb09186;
	Wed, 14 Nov 2001 16:15:47 -0800 (PST)
	(envelope-from dillon)
Date: Wed, 14 Nov 2001 16:15:47 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111150015.fAF0Flb09186@apollo.backplane.com>
To: freebsd-arch@freebsd.org
Subject: Need review - patch for socket locking and ref counting
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

    This patch adds a reference count to the socket structure
    and cleans up & encapsulates the API calls.  I do not yet
    attempt to use sxlocks to lock the socket structure (to allow
    us to multi-thread the network stack), but that is the
    direction I am headed.

    soalloc()/sofree() - no reference counter adjustments
			 (so_count must be 0 or sofree() panics)
			 (soalloc initializes so_count to 0)
 
    socreate()/soclose() - socreate inits ref counter to 1,
			   soclose decrements ref counter.

    soref()		- bump ref counter

    sorele()		- decrement ref counter, calls sofree()
			  when the ref counter hits 0

    holdsock() removed, fgetsock() added in a manner similar to fget() and
    fgetvp().
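
The counting rules listed above can be modelled in a few lines of userland
C. This is only a sketch of the reference-count life cycle, not the patch
itself: every demo_* name and the `freed` hook are hypothetical, and the
model leaves out the so_pcb / SS_NOFDREF checks that the real sofree()
performs before releasing anything.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_socket {
	int so_count;	/* reference count, named after the patch's field */
	int *freed;	/* hypothetical hook: set to 1 when the socket is freed */
};

/* Like soalloc(): returns a socket whose ref count is 0. */
static struct demo_socket *
demo_soalloc(void)
{
	return (calloc(1, sizeof(struct demo_socket)));
}

/* Like sofree(): the count must already be 0 (the kernel would panic). */
static void
demo_sofree(struct demo_socket *so)
{
	assert(so->so_count == 0);
	if (so->freed != NULL)
		*so->freed = 1;
	free(so);
}

/* Like soref(): bump the ref count. */
static void
demo_soref(struct demo_socket *so)
{
	so->so_count++;
}

/* Like sorele(): drop one reference, free on the last one. */
static void
demo_sorele(struct demo_socket *so)
{
	assert(so->so_count > 0);
	if (--so->so_count == 0)
		demo_sofree(so);
}

/* Like socreate(): the caller gets a socket holding one reference. */
static struct demo_socket *
demo_socreate(void)
{
	struct demo_socket *so = demo_soalloc();

	if (so != NULL)
		demo_soref(so);
	return (so);
}

/* Like soclose(): drops the creator's reference (disconnect logic omitted). */
static void
demo_soclose(struct demo_socket *so)
{
	demo_sorele(so);
}
```

In this model an fgetsock()/fputsock() pair corresponds to demo_soref()
followed later by demo_sorele(), so a close that happens while a system
call still holds the socket does not free it until that call drops its
reference.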

    I would like a review. 

    Also, I noticed there are two calls to soisdisconnected()
    *AFTER* the code (originally) calls sofree(), which sounds
    bogus to me.  Could someone review the original code and
    give me an opinion?  (see the last two XXX's in the patch
    set).

					Thanks,

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>


Index: compat/svr4/svr4_stream.c
===================================================================
RCS file: /home/ncvs/src/sys/compat/svr4/svr4_stream.c,v
retrieving revision 1.22
diff -u -r1.22 svr4_stream.c
--- compat/svr4/svr4_stream.c	2001/09/12 08:36:58	1.22
+++ compat/svr4/svr4_stream.c	2001/11/14 22:10:24
@@ -150,7 +150,6 @@
 	register struct msghdr *mp;
 	int flags;
 {
-	struct file *fp;
 	struct uio auio;
 	register struct iovec *iov;
 	register int i;
@@ -163,8 +162,7 @@
 	struct uio ktruio;
 #endif
 
-	error = holdsock(td->td_proc->p_fd, s, &fp);
-	if (error)
+	if ((error = fgetsock(td, s, &so, NULL)) != 0)
 		return (error);
 	auio.uio_iov = mp->msg_iov;
 	auio.uio_iovcnt = mp->msg_iovlen;
@@ -176,16 +174,14 @@
 	iov = mp->msg_iov;
 	for (i = 0; i < mp->msg_iovlen; i++, iov++) {
 		if ((auio.uio_resid += iov->iov_len) < 0) {
-			fdrop(fp, td);
-			return (EINVAL);
+			error = EINVAL;
+			goto done1;
 		}
 	}
 	if (mp->msg_name) {
 		error = getsockaddr(&to, mp->msg_name, mp->msg_namelen);
-		if (error) {
-			fdrop(fp, td);
-			return (error);
-		}
+		if (error)
+			goto done1;
 	} else {
 		to = 0;
 	}
@@ -211,7 +207,6 @@
 	}
 #endif
 	len = auio.uio_resid;
-	so = (struct socket *)fp->f_data;
 	error = so->so_proto->pr_usrreqs->pru_sosend(so, to, &auio, 0, control,
 						     flags, td);
 	if (error) {
@@ -239,7 +234,8 @@
 bad:
 	if (to)
 		FREE(to, M_SONAME);
-	fdrop(fp, td);
+done1:
+	fputsock(so);
 	return (error);
 }
 
@@ -250,7 +246,6 @@
 	register struct msghdr *mp;
 	caddr_t namelenp;
 {
-	struct file *fp;
 	struct uio auio;
 	register struct iovec *iov;
 	register int i;
@@ -264,8 +259,7 @@
 	struct uio ktruio;
 #endif
 
-	error = holdsock(td->td_proc->p_fd, s, &fp);
-	if (error)
+	if ((error = fgetsock(td, s, &so, NULL)) != 0)
 		return (error);
 	auio.uio_iov = mp->msg_iov;
 	auio.uio_iovcnt = mp->msg_iovlen;
@@ -277,8 +271,8 @@
 	iov = mp->msg_iov;
 	for (i = 0; i < mp->msg_iovlen; i++, iov++) {
 		if ((auio.uio_resid += iov->iov_len) < 0) {
-			fdrop(fp, td);
-			return (EINVAL);
+			error = EINVAL;
+			goto done1;
 		}
 	}
 #ifdef KTRACE
@@ -365,7 +359,8 @@
 		FREE(fromsa, M_SONAME);
 	if (control)
 		m_freem(control);
-	fdrop(fp, td);
+done1:
+	fputsock(so);
 	return (error);
 }
 
Index: kern/kern_descrip.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_descrip.c,v
retrieving revision 1.111
diff -u -r1.111 kern_descrip.c
--- kern/kern_descrip.c	2001/11/14 06:30:35	1.111
+++ kern/kern_descrip.c	2001/11/14 23:42:17
@@ -60,6 +60,8 @@
 #include <sys/unistd.h>
 #include <sys/resourcevar.h>
 #include <sys/event.h>
+#include <sys/sx.h>
+#include <sys/socketvar.h>
 
 #include <machine/limits.h>
 
@@ -1423,6 +1425,51 @@
 fgetvp_write(struct thread *td, int fd, struct vnode **vpp)
 {
 	return(_fgetvp(td, fd, vpp, FWRITE));
+}
+
+/*
+ * Like fget() but loads the underlying socket, or returns an error if
+ * the descriptor does not represent a socket.
+ *
+ * We bump the ref count on the returned socket.  XXX Also obtain the SX lock in
+ * the future.
+ */
+int
+fgetsock(struct thread *td, int fd, struct socket **spp, u_int *fflagp)
+{
+	struct filedesc *fdp;
+	struct file *fp;
+	struct socket *so;
+
+	GIANT_REQUIRED;
+	fdp = td->td_proc->p_fd;
+	*spp = NULL;
+	if (fflagp)
+		*fflagp = 0;
+	if ((u_int)fd >= fdp->fd_nfiles)
+		return(EBADF);
+	if ((fp = fdp->fd_ofiles[fd]) == NULL)
+		return(EBADF);
+	if (fp->f_type != DTYPE_SOCKET)
+		return(ENOTSOCK);
+	if (fp->f_data == NULL)
+		return(EINVAL);
+	so = (struct socket *)fp->f_data;
+	if (fflagp)
+		*fflagp = fp->f_flag;
+	soref(so);
+	*spp = so;
+	return(0);
+}
+
+/*
+ * Drop the reference count on the socket and XXX release the SX lock in
+ * the future.  The last reference closes the socket.
+ */
+void
+fputsock(struct socket *so)
+{
+	sorele(so);
 }
 
 int
Index: kern/kern_mtxpool.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_mtxpool.c,v
retrieving revision 1.1
diff -u -r1.1 kern_mtxpool.c
--- kern/kern_mtxpool.c	2001/11/13 21:55:12	1.1
+++ kern/kern_mtxpool.c	2001/11/14 04:06:48
@@ -35,9 +35,10 @@
 #include <sys/systm.h>
 
 #ifndef MTX_POOL_SIZE
-#define MTX_POOL_SIZE	128
+#define MTX_POOL_SIZE	128	/* must be a multiple of 4 */
 #endif
-#define MTX_POOL_MASK	(MTX_POOL_SIZE-1)
+#define MTX_POOL_MASK	(MTX_POOL_SIZE - 1)
+#define MTX_POOL_XMASK	(MTX_POOL_MASK & ~3)
 
 static struct mtx mtx_pool_ary[MTX_POOL_SIZE];
 
@@ -54,6 +55,34 @@
     return(&mtx_pool_ary[((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_MASK]);
 }
 
+static __inline
+struct mtx *
+_mtx_pool1_find(void *ptr)
+{
+    return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 0]);
+}
+
+static __inline
+struct mtx *
+_mtx_pool2_find(void *ptr)
+{
+    return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 1]);
+}
+
+static __inline
+struct mtx *
+_mtx_pool3_find(void *ptr)
+{
+    return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 2]);
+}
+
+static __inline
+struct mtx *
+_mtx_pool4_find(void *ptr)
+{
+    return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 3]);
+}
+
 static void
 mtx_pool_setup(void *dummy __unused)
 {
@@ -88,6 +117,30 @@
     return(_mtx_pool_find(ptr));
 }
 
+struct mtx *
+mtx_pool1_find(void *ptr)
+{
+    return(_mtx_pool1_find(ptr));
+}
+
+struct mtx *
+mtx_pool2_find(void *ptr)
+{
+    return(_mtx_pool2_find(ptr));
+}
+
+struct mtx *
+mtx_pool3_find(void *ptr)
+{
+    return(_mtx_pool3_find(ptr));
+}
+
+struct mtx *
+mtx_pool4_find(void *ptr)
+{
+    return(_mtx_pool4_find(ptr));
+}
+
 /*
  * Combined find/lock operation.  Lock the pool mutex associated with
  * the specified address.
@@ -98,6 +151,30 @@
     mtx_lock(_mtx_pool_find(ptr));
 }
 
+void 
+mtx_pool1_lock(void *ptr)
+{
+    mtx_lock(_mtx_pool1_find(ptr));
+}
+
+void 
+mtx_pool2_lock(void *ptr)
+{
+    mtx_lock(_mtx_pool2_find(ptr));
+}
+
+void 
+mtx_pool3_lock(void *ptr)
+{
+    mtx_lock(_mtx_pool3_find(ptr));
+}
+
+void 
+mtx_pool4_lock(void *ptr)
+{
+    mtx_lock(_mtx_pool4_find(ptr));
+}
+
 /*
  * Combined find/unlock operation.  Unlock the pool mutex associated with
  * the specified address.
@@ -106,6 +183,30 @@
 mtx_pool_unlock(void *ptr)
 {
     mtx_unlock(_mtx_pool_find(ptr));
+}
+
+void
+mtx_pool1_unlock(void *ptr)
+{
+    mtx_unlock(_mtx_pool1_find(ptr));
+}
+
+void
+mtx_pool2_unlock(void *ptr)
+{
+    mtx_unlock(_mtx_pool2_find(ptr));
+}
+
+void
+mtx_pool3_unlock(void *ptr)
+{
+    mtx_unlock(_mtx_pool3_find(ptr));
+}
+
+void
+mtx_pool4_unlock(void *ptr)
+{
+    mtx_unlock(_mtx_pool4_find(ptr));
 }
 
 SYSINIT(mtxpooli, SI_SUB_MUTEX, SI_ORDER_FIRST, mtx_pool_setup, NULL)   
Index: kern/sys_socket.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/sys_socket.c,v
retrieving revision 1.35
diff -u -r1.35 sys_socket.c
--- kern/sys_socket.c	2001/09/12 08:37:46	1.35
+++ kern/sys_socket.c	2001/11/14 23:48:45
@@ -182,6 +182,12 @@
 	return ((*so->so_proto->pr_usrreqs->pru_sense)(so, ub));
 }
 
+/*
+ * API socket close on file pointer.  We call soclose() to close the 
+ * socket (including initiating closing protocols).  soclose() will
+ * sorele() the file reference but the actual socket will not go away
+ * until the socket's ref count hits 0.
+ */
 /* ARGSUSED */
 int
 soo_close(fp, td)
@@ -189,10 +195,12 @@
 	struct thread *td;
 {
 	int error = 0;
+	struct socket *so;
 
 	fp->f_ops = &badfileops;
-	if (fp->f_data)
-		error = soclose((struct socket *)fp->f_data);
-	fp->f_data = 0;
+	if ((so = fp->f_data) != NULL) {
+		fp->f_data = NULL;
+		error = soclose(so);
+	}
 	return (error);
 }
Index: kern/uipc_socket.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.105
diff -u -r1.105 uipc_socket.c
--- kern/uipc_socket.c	2001/11/12 20:51:40	1.105
+++ kern/uipc_socket.c	2001/11/15 00:03:25
@@ -106,6 +106,8 @@
  * Note that it would probably be better to allocate socket
  * and PCB at the same time, but I'm not convinced that all
  * the protocols can be easily modified to do this.
+ *
+ * soalloc() returns a socket with a ref count of 0.
  */
 struct socket *
 soalloc(waitok)
@@ -119,11 +121,16 @@
 		bzero(so, sizeof *so);
 		so->so_gencnt = ++so_gencnt;
 		so->so_zone = socket_zone;
+		/* sx_init(&so->so_sxlock, "socket sxlock"); */
 		TAILQ_INIT(&so->so_aiojobq);
 	}
 	return so;
 }
 
+/*
+ * socreate returns a socket with a ref count of 1.  The socket should be
+ * closed with soclose().
+ */
 int
 socreate(dom, aso, type, proto, td)
 	int dom;
@@ -162,10 +169,11 @@
 	so->so_type = type;
 	so->so_cred = crhold(td->td_proc->p_ucred);
 	so->so_proto = prp;
+	soref(so);
 	error = (*prp->pr_usrreqs->pru_attach)(so, proto, td);
 	if (error) {
 		so->so_state |= SS_NOFDREF;
-		sofree(so);
+		sorele(so);
 		return (error);
 	}
 	*aso = so;
@@ -186,11 +194,12 @@
 	return (error);
 }
 
-void
-sodealloc(so)
-	struct socket *so;
+static void
+sodealloc(struct socket *so)
 {
 
+	KASSERT(so->so_count == 0, ("sodealloc(): so_count %d", so->so_count));
+	so->so_count = 0;
 	so->so_gencnt = ++so_gencnt;
 	if (so->so_rcv.sb_hiwat)
 		(void)chgsbsize(so->so_cred->cr_uidinfo,
@@ -210,6 +219,7 @@
 	}
 #endif
 	crfree(so->so_cred);
+	/* sx_destroy(&so->so_sxlock); */
 	zfree(so->so_zone, so);
 }
 
@@ -242,6 +252,8 @@
 {
 	struct socket *head = so->so_head;
 
+	KASSERT(so->so_count == 0, ("socket %p so_count not 0", so));
+
 	if (so->so_pcb || (so->so_state & SS_NOFDREF) == 0)
 		return;
 	if (head != NULL) {
@@ -272,6 +284,10 @@
  * Close a socket on last file table reference removal.
  * Initiate disconnect if connected.
  * Free socket when disconnect complete.
+ *
+ * This function will sorele() the socket.  Note that soclose() may be
+ * called prior to the ref count reaching zero.  The actual socket
+ * structure will not be freed until the ref count reaches zero.
  */
 int
 soclose(so)
@@ -329,7 +345,7 @@
 	if (so->so_state & SS_NOFDREF)
 		panic("soclose: NOFDREF");
 	so->so_state |= SS_NOFDREF;
-	sofree(so);
+	sorele(so);
 	splx(s);
 	return (error);
 }
@@ -345,7 +361,7 @@
 
 	error = (*so->so_proto->pr_usrreqs->pru_abort)(so);
 	if (error) {
-		sofree(so);
+		sotryfree(so);	/* note: does not decrement the ref count */
 		return error;
 	}
 	return (0);
Index: kern/uipc_socket2.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/uipc_socket2.c,v
retrieving revision 1.76
diff -u -r1.76 uipc_socket2.c
--- kern/uipc_socket2.c	2001/10/11 23:38:15	1.76
+++ kern/uipc_socket2.c	2001/11/14 23:59:33
@@ -210,6 +210,8 @@
  * then we allocate a new structure, propoerly linked into the
  * data structure of the original socket, and return this.
  * Connstatus may be 0, or SO_ISCONFIRMING, or SO_ISCONNECTED.
+ *
+ * note: the ref count on the socket is 0 on return
  */
 struct socket *
 sonewconn(head, connstatus)
@@ -246,7 +248,7 @@
 		so->so_cred = crhold(head->so_cred);
 	if (soreserve(so, head->so_snd.sb_hiwat, head->so_rcv.sb_hiwat) ||
 	    (*so->so_proto->pr_usrreqs->pru_attach)(so, 0, NULL)) {
-		sodealloc(so);
+		sotryfree(so);
 		return ((struct socket *)0);
 	}
 
Index: kern/uipc_syscalls.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/uipc_syscalls.c,v
retrieving revision 1.98
diff -u -r1.98 uipc_syscalls.c
--- kern/uipc_syscalls.c	2001/11/14 06:30:35	1.98
+++ kern/uipc_syscalls.c	2001/11/14 23:09:34
@@ -139,7 +139,7 @@
 			fdrop(fp, td);
 		}
 	} else {
-		fp->f_data = (caddr_t)so;
+		fp->f_data = (caddr_t)so;	/* already has ref count */
 		fp->f_flag = FREAD|FWRITE;
 		fp->f_ops = &socketops;
 		fp->f_type = DTYPE_SOCKET;
@@ -164,22 +164,19 @@
 		int	namelen;
 	} */ *uap;
 {
-	struct file *fp;
 	struct sockaddr *sa;
+	struct socket *sp;
 	int error;
 
 	mtx_lock(&Giant);
-	error = holdsock(td->td_proc->p_fd, uap->s, &fp);
-	if (error)
+	if ((error = fgetsock(td, uap->s, &sp, NULL)) != 0)
 		goto done2;
-	error = getsockaddr(&sa, uap->name, uap->namelen);
-	if (error) {
-		fdrop(fp, td);
-		goto done2;
-	}
-	error = sobind((struct socket *)fp->f_data, sa, td);
+	if ((error = getsockaddr(&sa, uap->name, uap->namelen)) != 0)
+		goto done1;
+	error = sobind(sp, sa, td);
 	FREE(sa, M_SONAME);
-	fdrop(fp, td);
+done1:
+	fputsock(sp);
 done2:
 	mtx_unlock(&Giant);
 	return (error);
@@ -197,14 +194,13 @@
 		int	backlog;
 	} */ *uap;
 {
-	struct file *fp;
+	struct socket *sp;
 	int error;
 
 	mtx_lock(&Giant);
-	error = holdsock(td->td_proc->p_fd, uap->s, &fp);
-	if (error == 0) {
-		error = solisten((struct socket *)fp->f_data, uap->backlog, td);
-		fdrop(fp, td);
+	if ((error = fgetsock(td, uap->s, &sp, NULL)) == 0) {
+		error = solisten(sp, uap->backlog, td);
+		fputsock(sp);
 	}
 	mtx_unlock(&Giant);
 	return(error);
@@ -225,13 +221,12 @@
 	int compat;
 {
 	struct filedesc *fdp;
-	struct file *lfp = NULL;
 	struct file *nfp = NULL;
 	struct sockaddr *sa;
 	int namelen, error, s;
 	struct socket *head, *so;
 	int fd;
-	short fflag;		/* type must match fp->f_flag */
+	u_int fflag;
 
 	mtx_lock(&Giant);
 	fdp = td->td_proc->p_fd;
@@ -241,11 +236,10 @@
 		if(error)
 			goto done2;
 	}
-	error = holdsock(fdp, uap->s, &lfp);
+	error = fgetsock(td, uap->s, &head, &fflag);
 	if (error)
 		goto done2;
 	s = splnet();
-	head = (struct socket *)lfp->f_data;
 	if ((head->so_options & SO_ACCEPTCONN) == 0) {
 		splx(s);
 		error = EINVAL;
@@ -286,7 +280,6 @@
 	TAILQ_REMOVE(&head->so_comp, so, so_list);
 	head->so_qlen--;
 
-	fflag = lfp->f_flag;
 	error = falloc(td, &nfp, &fd);
 	if (error) {
 		/*
@@ -312,7 +305,7 @@
 	if (head->so_sigio != NULL)
 		fsetown(fgetown(head->so_sigio), &so->so_sigio);
 
-	nfp->f_data = (caddr_t)so;
+	nfp->f_data = (caddr_t)so;	/* already has ref count */
 	nfp->f_flag = fflag;
 	nfp->f_ops = &socketops;
 	nfp->f_type = DTYPE_SOCKET;
@@ -375,7 +368,7 @@
 done:
 	if (nfp != NULL)
 		fdrop(nfp, td);
-	fdrop(lfp, td);
+	fputsock(head);
 done2:
 	mtx_unlock(&Giant);
 	return (error);
@@ -420,35 +413,31 @@
 		int	namelen;
 	} */ *uap;
 {
-	struct file *fp;
-	register struct socket *so;
+	struct socket *so;
 	struct sockaddr *sa;
 	int error, s;
 
 	mtx_lock(&Giant);
-	error = holdsock(td->td_proc->p_fd, uap->s, &fp);
-	if (error)
+	if ((error = fgetsock(td, uap->s, &so, NULL)) != 0)
 		goto done2;
-	so = (struct socket *)fp->f_data;
 	if ((so->so_state & SS_NBIO) && (so->so_state & SS_ISCONNECTING)) {
 		error = EALREADY;
-		goto done;
+		goto done1;
 	}
 	error = getsockaddr(&sa, uap->name, uap->namelen);
 	if (error)
-		goto done;
+		goto done1;
 	error = soconnect(so, sa, td);
 	if (error)
 		goto bad;
 	if ((so->so_state & SS_NBIO) && (so->so_state & SS_ISCONNECTING)) {
 		FREE(sa, M_SONAME);
 		error = EINPROGRESS;
-		goto done;
+		goto done1;
 	}
 	s = splnet();
 	while ((so->so_state & SS_ISCONNECTING) && so->so_error == 0) {
-		error = tsleep((caddr_t)&so->so_timeo, PSOCK | PCATCH,
-		    "connec", 0);
+		error = tsleep((caddr_t)&so->so_timeo, PSOCK | PCATCH, "connec", 0);
 		if (error)
 			break;
 	}
@@ -462,8 +451,8 @@
 	FREE(sa, M_SONAME);
 	if (error == ERESTART)
 		error = EINTR;
-done:
-	fdrop(fp, td);
+done1:
+	fputsock(so);
 done2:
 	mtx_unlock(&Giant);
 	return (error);
@@ -499,12 +488,12 @@
 		goto free2;
 	fhold(fp1);
 	sv[0] = fd;
-	fp1->f_data = (caddr_t)so1;
+	fp1->f_data = (caddr_t)so1;	/* so1 already has ref count */
 	error = falloc(td, &fp2, &fd);
 	if (error)
 		goto free3;
 	fhold(fp2);
-	fp2->f_data = (caddr_t)so2;
+	fp2->f_data = (caddr_t)so2;	/* so2 already has ref count */
 	sv[1] = fd;
 	error = soconnect2(so1, so2);
 	if (error)
@@ -552,12 +541,11 @@
 	register struct msghdr *mp;
 	int flags;
 {
-	struct file *fp;
 	struct uio auio;
 	register struct iovec *iov;
 	register int i;
 	struct mbuf *control;
-	struct sockaddr *to;
+	struct sockaddr *to = NULL;
 	int len, error;
 	struct socket *so;
 #ifdef KTRACE
@@ -565,8 +553,7 @@
 	struct uio ktruio;
 #endif
 
-	error = holdsock(td->td_proc->p_fd, s, &fp);
-	if (error)
+	if ((error = fgetsock(td, s, &so, NULL)) != 0)
 		return (error);
 	auio.uio_iov = mp->msg_iov;
 	auio.uio_iovcnt = mp->msg_iovlen;
@@ -578,18 +565,14 @@
 	iov = mp->msg_iov;
 	for (i = 0; i < mp->msg_iovlen; i++, iov++) {
 		if ((auio.uio_resid += iov->iov_len) < 0) {
-			fdrop(fp, td);
-			return (EINVAL);
+			error = EINVAL;
+			goto bad;
 		}
 	}
 	if (mp->msg_name) {
 		error = getsockaddr(&to, mp->msg_name, mp->msg_namelen);
-		if (error) {
-			fdrop(fp, td);
-			return (error);
-		}
-	} else {
-		to = 0;
+		if (error)
+			goto bad;
 	}
 	if (mp->msg_control) {
 		if (mp->msg_controllen < sizeof(struct cmsghdr)
@@ -633,7 +616,6 @@
 	}
 #endif
 	len = auio.uio_resid;
-	so = (struct socket *)fp->f_data;
 	error = so->so_proto->pr_usrreqs->pru_sosend(so, to, &auio, 0, control,
 						     flags, td);
 	if (error) {
@@ -659,7 +641,7 @@
 	}
 #endif
 bad:
-	fdrop(fp, td);
+	fputsock(so);
 	if (to)
 		FREE(to, M_SONAME);
 	return (error);
@@ -834,7 +816,6 @@
 	register struct msghdr *mp;
 	caddr_t namelenp;
 {
-	struct file *fp;
 	struct uio auio;
 	register struct iovec *iov;
 	register int i;
@@ -848,8 +829,7 @@
 	struct uio ktruio;
 #endif
 
-	error = holdsock(td->td_proc->p_fd, s, &fp);
-	if (error)
+	if ((error = fgetsock(td, s, &so, NULL)) != 0)
 		return (error);
 	auio.uio_iov = mp->msg_iov;
 	auio.uio_iovcnt = mp->msg_iovlen;
@@ -861,7 +841,7 @@
 	iov = mp->msg_iov;
 	for (i = 0; i < mp->msg_iovlen; i++, iov++) {
 		if ((auio.uio_resid += iov->iov_len) < 0) {
-			fdrop(fp, td);
+			fputsock(so);
 			return (EINVAL);
 		}
 	}
@@ -875,7 +855,6 @@
 	}
 #endif
 	len = auio.uio_resid;
-	so = (struct socket *)fp->f_data;
 	error = so->so_proto->pr_usrreqs->pru_soreceive(so, &fromsa, &auio,
 	    (struct mbuf **)0, mp->msg_control ? &control : (struct mbuf **)0,
 	    &mp->msg_flags);
@@ -975,7 +954,7 @@
 		mp->msg_controllen = ctlbuf - (caddr_t)mp->msg_control;
 	}
 out:
-	fdrop(fp, td);
+	fputsock(so);
 	if (fromsa)
 		FREE(fromsa, M_SONAME);
 	if (control)
@@ -1196,14 +1175,13 @@
 		int	how;
 	} */ *uap;
 {
-	struct file *fp;
+	struct socket *so;
 	int error;
 
 	mtx_lock(&Giant);
-	error = holdsock(td->td_proc->p_fd, uap->s, &fp);
-	if (error == 0) {
-		error = soshutdown((struct socket *)fp->f_data, uap->how);
-		fdrop(fp, td);
+	if ((error = fgetsock(td, uap->s, &so, NULL)) == 0) {
+		error = soshutdown(so, uap->how);
+		fputsock(so);
 	}
 	mtx_unlock(&Giant);
 	return(error);
@@ -1224,7 +1202,7 @@
 		int	valsize;
 	} */ *uap;
 {
-	struct file *fp;
+	struct socket *so;
 	struct sockopt sopt;
 	int error;
 
@@ -1234,16 +1212,15 @@
 		return (EINVAL);
 
 	mtx_lock(&Giant);
-	error = holdsock(td->td_proc->p_fd, uap->s, &fp);
-	if (error == 0) {
+	if ((error = fgetsock(td, uap->s, &so, NULL)) == 0) {
 		sopt.sopt_dir = SOPT_SET;
 		sopt.sopt_level = uap->level;
 		sopt.sopt_name = uap->name;
 		sopt.sopt_val = uap->val;
 		sopt.sopt_valsize = uap->valsize;
 		sopt.sopt_td = td;
-		error = sosetopt((struct socket *)fp->f_data, &sopt);
-		fdrop(fp, td);
+		error = sosetopt(so, &sopt);
+		fputsock(so);
 	}
 	mtx_unlock(&Giant);
 	return(error);
@@ -1265,24 +1242,20 @@
 	} */ *uap;
 {
 	int	valsize, error;
-	struct	file *fp;
+	struct  socket *so;
 	struct	sockopt sopt;
 
 	mtx_lock(&Giant);
-	error = holdsock(td->td_proc->p_fd, uap->s, &fp);
-	if (error)
+	if ((error = fgetsock(td, uap->s, &so, NULL)) != 0)
 		goto done2;
 	if (uap->val) {
 		error = copyin((caddr_t)uap->avalsize, (caddr_t)&valsize,
 		    sizeof (valsize));
-		if (error) {
-			fdrop(fp, td);
-			goto done2;
-		}
+		if (error)
+			goto done1;
 		if (valsize < 0) {
-			fdrop(fp, td);
 			error = EINVAL;
-			goto done2;
+			goto done1;
 		}
 	} else {
 		valsize = 0;
@@ -1295,13 +1268,14 @@
 	sopt.sopt_valsize = (size_t)valsize; /* checked non-negative above */
 	sopt.sopt_td = td;
 
-	error = sogetopt((struct socket *)fp->f_data, &sopt);
+	error = sogetopt(so, &sopt);
 	if (error == 0) {
 		valsize = sopt.sopt_valsize;
 		error = copyout((caddr_t)&valsize,
 				(caddr_t)uap->avalsize, sizeof (valsize));
 	}
-	fdrop(fp, td);
+done1:
+	fputsock(so);
 done2:
 	mtx_unlock(&Giant);
 	return (error);
@@ -1323,21 +1297,16 @@
 	} */ *uap;
 	int compat;
 {
-	struct file *fp;
-	register struct socket *so;
+	struct socket *so;
 	struct sockaddr *sa;
 	int len, error;
 
 	mtx_lock(&Giant);
-	error = holdsock(td->td_proc->p_fd, uap->fdes, &fp);
-	if (error)
+	if ((error = fgetsock(td, uap->fdes, &so, NULL)) != 0)
 		goto done2;
 	error = copyin((caddr_t)uap->alen, (caddr_t)&len, sizeof (len));
-	if (error) {
-		fdrop(fp, td);
-		goto done2;
-	}
-	so = (struct socket *)fp->f_data;
+	if (error)
+		goto done1;
 	sa = 0;
 	error = (*so->so_proto->pr_usrreqs->pru_sockaddr)(so, &sa);
 	if (error)
@@ -1360,7 +1329,8 @@
 bad:
 	if (sa)
 		FREE(sa, M_SONAME);
-	fdrop(fp, td);
+done1:
+	fputsock(so);
 done2:
 	mtx_unlock(&Giant);
 	return (error);
@@ -1408,26 +1378,20 @@
 	} */ *uap;
 	int compat;
 {
-	struct file *fp;
-	register struct socket *so;
+	struct socket *so;
 	struct sockaddr *sa;
 	int len, error;
 
 	mtx_lock(&Giant);
-	error = holdsock(td->td_proc->p_fd, uap->fdes, &fp);
-	if (error)
+	if ((error = fgetsock(td, uap->fdes, &so, NULL)) != 0)
 		goto done2;
-	so = (struct socket *)fp->f_data;
 	if ((so->so_state & (SS_ISCONNECTED|SS_ISCONFIRMING)) == 0) {
-		fdrop(fp, td);
 		error = ENOTCONN;
-		goto done2;
+		goto done1;
 	}
 	error = copyin((caddr_t)uap->alen, (caddr_t)&len, sizeof (len));
-	if (error) {
-		fdrop(fp, td);
-		goto done2;
-	}
+	if (error)
+		goto done1;
 	sa = 0;
 	error = (*so->so_proto->pr_usrreqs->pru_peeraddr)(so, &sa);
 	if (error)
@@ -1450,7 +1414,8 @@
 bad:
 	if (sa)
 		FREE(sa, M_SONAME);
-	fdrop(fp, td);
+done1:
+	fputsock(so);
 done2:
 	mtx_unlock(&Giant);
 	return (error);
@@ -1550,33 +1515,6 @@
 }
 
 /*
- * holdsock() - load the struct file pointer associated
- * with a socket into *fpp.  If an error occurs, non-zero
- * will be returned and *fpp will be set to NULL.
- */
-int
-holdsock(fdp, fdes, fpp)
-	struct filedesc *fdp;
-	int fdes;
-	struct file **fpp;
-{
-	register struct file *fp = NULL;
-	int error = 0;
-
-	if ((unsigned)fdes >= fdp->fd_nfiles ||
-	    (fp = fdp->fd_ofiles[fdes]) == NULL) {
-		error = EBADF;
-	} else if (fp->f_type != DTYPE_SOCKET) {
-		error = ENOTSOCK;
-		fp = NULL;
-	} else {
-		fhold(fp);
-	}
-	*fpp = fp;
-	return(error);
-}
-
-/*
  * Allocate a pool of sf_bufs (sendfile(2) or "super-fast" if you prefer. :-))
  * XXX - The sf_buf functions are currently private to sendfile(2), so have
  * been made static, but may be useful in the future for doing zero-copy in
@@ -1678,10 +1616,9 @@
 int
 sendfile(struct thread *td, struct sendfile_args *uap)
 {
-	struct file *fp = NULL;
 	struct vnode *vp;
 	struct vm_object *obj;
-	struct socket *so;
+	struct socket *so = NULL;
 	struct mbuf *m;
 	struct sf_buf *sf;
 	struct vm_page *pg;
@@ -1701,10 +1638,8 @@
 		error = EINVAL;
 		goto done;
 	}
-	error = holdsock(td->td_proc->p_fd, uap->s, &fp);
-	if (error)
+	if ((error = fgetsock(td, uap->s, &so, NULL)) != 0)
 		goto done;
-	so = (struct socket *)fp->f_data;
 	if (so->so_type != SOCK_STREAM) {
 		error = EINVAL;
 		goto done;
@@ -1988,8 +1923,9 @@
 	}
 	if (vp)
 		vrele(vp);
-	if (fp)
-		fdrop(fp, td);
+	if (so)
+		fputsock(so);
 	mtx_unlock(&Giant);
 	return (error);
 }
+
Index: kern/uipc_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/uipc_usrreq.c,v
retrieving revision 1.76
diff -u -r1.76 uipc_usrreq.c
--- kern/uipc_usrreq.c	2001/11/08 02:13:16	1.76
+++ kern/uipc_usrreq.c	2001/11/14 23:59:42
@@ -935,7 +935,7 @@
 		if (unp->unp_addr)
 			FREE(unp->unp_addr, M_SONAME);
 		zfree(unp_zone, unp);
-		sofree(so);
+		sotryfree(so);
 	}
 }
 
Index: net/raw_cb.c
===================================================================
RCS file: /home/ncvs/src/sys/net/raw_cb.c,v
retrieving revision 1.16
diff -u -r1.16 raw_cb.c
--- net/raw_cb.c	1999/08/28 00:48:27	1.16
+++ net/raw_cb.c	2001/11/14 23:59:49
@@ -97,7 +97,7 @@
 	struct socket *so = rp->rcb_socket;
 
 	so->so_pcb = 0;
-	sofree(so);
+	sotryfree(so);
 	LIST_REMOVE(rp, list);
 #ifdef notdef
 	if (rp->rcb_laddr)
Index: net/raw_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/net/raw_usrreq.c,v
retrieving revision 1.20
diff -u -r1.20 raw_usrreq.c
--- net/raw_usrreq.c	2001/09/12 08:37:51	1.20
+++ net/raw_usrreq.c	2001/11/14 23:59:56
@@ -142,8 +142,8 @@
 	if (rp == 0)
 		return EINVAL;
 	raw_disconnect(rp);
-	sofree(so);
-	soisdisconnected(so);
+	sotryfree(so);
+	soisdisconnected(so);	/* XXX huh? called after the sofree()? */
 	return 0;
 }
 
Index: netatalk/ddp_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netatalk/ddp_usrreq.c,v
retrieving revision 1.21
diff -u -r1.21 ddp_usrreq.c
--- netatalk/ddp_usrreq.c	2001/09/12 08:37:52	1.21
+++ netatalk/ddp_usrreq.c	2001/11/15 00:00:03
@@ -441,7 +441,7 @@
 {
     soisdisconnected( so );
     so->so_pcb = 0;
-    sofree( so );
+    sotryfree(so);
 
     /* remove ddp from ddp_ports list */
     if ( ddp->ddp_lsat.sat_port != ATADDR_ANYPORT &&
Index: netatm/atm_socket.c
===================================================================
RCS file: /home/ncvs/src/sys/netatm/atm_socket.c,v
retrieving revision 1.8
diff -u -r1.8 atm_socket.c
--- netatm/atm_socket.c	2000/12/07 22:19:04	1.8
+++ netatm/atm_socket.c	2001/11/14 23:58:01
@@ -176,7 +176,7 @@
 	 * Break links and free control blocks
 	 */
 	so->so_pcb = NULL;
-	sofree(so);
+	sotryfree(so);
 
 	atm_free((caddr_t)atp);
 
Index: netinet/in_pcb.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/in_pcb.c,v
retrieving revision 1.92
diff -u -r1.92 in_pcb.c
--- netinet/in_pcb.c	2001/11/06 00:48:01	1.92
+++ netinet/in_pcb.c	2001/11/14 23:58:11
@@ -563,7 +563,7 @@
 	inp->inp_gencnt = ++ipi->ipi_gencnt;
 	in_pcbremlists(inp);
 	so->so_pcb = 0;
-	sofree(so);
+	sotryfree(so);
 	if (inp->inp_options)
 		(void)m_free(inp->inp_options);
 	if (rt) {
Index: netinet6/in6_pcb.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet6/in6_pcb.c,v
retrieving revision 1.21
diff -u -r1.21 in6_pcb.c
--- netinet6/in6_pcb.c	2001/10/17 18:07:05	1.21
+++ netinet6/in6_pcb.c	2001/11/14 23:58:15
@@ -606,7 +606,7 @@
 	inp->inp_gencnt = ++ipi->ipi_gencnt;
 	in_pcbremlists(inp);
 	sotoinpcb(so) = 0;
-	sofree(so);
+	sotryfree(so);
 
 	if (inp->in6p_options)
 		m_freem(inp->in6p_options);
Index: netipx/ipx_pcb.c
===================================================================
RCS file: /home/ncvs/src/sys/netipx/ipx_pcb.c,v
retrieving revision 1.21
diff -u -r1.21 ipx_pcb.c
--- netipx/ipx_pcb.c	2001/09/12 08:37:56	1.21
+++ netipx/ipx_pcb.c	2001/11/14 23:58:22
@@ -268,7 +268,7 @@
 	struct socket *so = ipxp->ipxp_socket;
 
 	so->so_pcb = 0;
-	sofree(so);
+	sotryfree(so);
 	if (ipxp->ipxp_route.ro_rt != NULL)
 		rtfree(ipxp->ipxp_route.ro_rt);
 	remque(ipxp);
Index: netipx/ipx_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netipx/ipx_usrreq.c,v
retrieving revision 1.29
diff -u -r1.29 ipx_usrreq.c
--- netipx/ipx_usrreq.c	2001/09/12 08:37:56	1.29
+++ netipx/ipx_usrreq.c	2001/11/14 23:58:25
@@ -426,7 +426,7 @@
 	s = splnet();
 	ipx_pcbdetach(ipxp);
 	splx(s);
-	sofree(so);
+	sotryfree(so);
 	soisdisconnected(so);
 	return (0);
 }
Index: netnatm/natm.c
===================================================================
RCS file: /home/ncvs/src/sys/netnatm/natm.c,v
retrieving revision 1.13
diff -u -r1.13 natm.c
--- netnatm/natm.c	2001/04/05 04:20:48	1.13
+++ netnatm/natm.c	2001/11/14 23:58:41
@@ -133,7 +133,7 @@
      */
     npcb_free(npcb, NPCB_DESTROY);	/* drain */
     so->so_pcb = NULL;
-    sofree(so);
+    sotryfree(so);
  out:
     splx(s);
     return (error);
@@ -481,7 +481,7 @@
 
       npcb_free(npcb, NPCB_DESTROY);	/* drain */
       so->so_pcb = NULL;
-      sofree(so);
+      sotryfree(so);
 
       break;
 
Index: netns/idp_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netns/idp_usrreq.c,v
retrieving revision 1.9
diff -u -r1.9 idp_usrreq.c
--- netns/idp_usrreq.c	1999/08/28 00:49:47	1.9
+++ netns/idp_usrreq.c	2001/11/14 23:58:57
@@ -491,8 +491,8 @@
 
 	case PRU_ABORT:
 		ns_pcbdetach(nsp);
-		sofree(so);
-		soisdisconnected(so);
+		sotryfree(so);
+		soisdisconnected(so);	/* XXX huh, called after sofree()? */
 		break;
 
 	case PRU_SOCKADDR:
Index: netns/ns_pcb.c
===================================================================
RCS file: /home/ncvs/src/sys/netns/ns_pcb.c,v
retrieving revision 1.9
diff -u -r1.9 ns_pcb.c
--- netns/ns_pcb.c	1999/08/28 00:49:51	1.9
+++ netns/ns_pcb.c	2001/11/14 23:59:03
@@ -232,7 +232,7 @@
 	struct socket *so = nsp->nsp_socket;
 
 	so->so_pcb = 0;
-	sofree(so);
+	sotryfree(so);
 	if (nsp->nsp_route.ro_rt)
 		rtfree(nsp->nsp_route.ro_rt);
 	remque(nsp);
Index: nfsserver/nfs_syscalls.c
===================================================================
RCS file: /home/ncvs/src/sys/nfsserver/nfs_syscalls.c,v
retrieving revision 1.72
diff -u -r1.72 nfs_syscalls.c
--- nfsserver/nfs_syscalls.c	2001/09/28 04:37:08	1.72
+++ nfsserver/nfs_syscalls.c	2001/11/14 22:30:42
@@ -143,9 +143,13 @@
 		error = copyin(uap->argp, (caddr_t)&nfsdarg, sizeof(nfsdarg));
 		if (error)
 			goto done2;
-		error = holdsock(td->td_proc->p_fd, nfsdarg.sock, &fp);
-		if (error)
+		if ((error = fget(td, nfsdarg.sock, &fp)) != 0)
 			goto done2;
+		if (fp->f_type != DTYPE_SOCKET) {
+			fdrop(fp, td);
+			error = ENOTSOCK;	/* else we return 0 */
+			goto done2;
+		}
 		/*
 		 * Get the client address for connected sockets.
 		 */
Index: sys/file.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/file.h,v
retrieving revision 1.32
diff -u -r1.32 file.h
--- sys/file.h	2001/11/14 06:30:36	1.32
+++ sys/file.h	2001/11/14 21:57:21
@@ -50,6 +50,7 @@
 struct uio;
 struct knote;
 struct vnode;
+struct socket;
 
 /*
  * Kernel descriptor table.
@@ -118,6 +119,9 @@
 int fgetvp __P((struct thread *td, int fd, struct vnode **vpp));
 int fgetvp_read __P((struct thread *td, int fd, struct vnode **vpp));
 int fgetvp_write __P((struct thread *td, int fd, struct vnode **vpp));
+
+int fgetsock __P((struct thread *td, int fd, struct socket **spp, u_int *fflagp));
+void fputsock __P((struct socket *sp));
 
 static __inline void
 fhold(fp)
Index: sys/socketvar.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/socketvar.h,v
retrieving revision 1.63
diff -u -r1.63 socketvar.h
--- sys/socketvar.h	2001/10/25 02:03:37	1.63
+++ sys/socketvar.h	2001/11/15 00:07:07
@@ -38,6 +38,7 @@
 #define _SYS_SOCKETVAR_H_
 
 #include <sys/queue.h>			/* for TAILQ macros */
+#include <sys/sx.h>			/* SX locks */
 #include <sys/selinfo.h>		/* for struct selinfo */
 
 /*
@@ -52,6 +53,7 @@
 
 struct socket {
 	struct	vm_zone *so_zone;	/* zone we were allocated from */
+	int	so_count;		/* reference count */
 	short	so_type;		/* generic type, see socket.h */
 	short	so_options;		/* from socket call, see socket.h */
 	short	so_linger;		/* time to linger while closing */
@@ -244,6 +246,24 @@
 	} \
 }
 
+/*
+ * soref()/sorele() ref-count the socket structure.  Note that you must
+ * still explicitly close the socket, but the last ref count will free
+ * the structure.
+ */
+
+#define soref(so)	(++(so)->so_count)
+
+#define sorele(so)	do {				\
+				if (--(so)->so_count == 0)\
+					sofree(so);	\
+			} while (0)
+
+#define sotryfree(so)	do {				\
+				if ((so)->so_count == 0)	\
+					sofree(so);	\
+			} while (0)
+
 #define	sorwakeup(so)	do { \
 			  if (sb_notify(&(so)->so_rcv)) \
 			    sowakeup((so), &(so)->so_rcv); \
@@ -360,7 +380,7 @@
 int	soconnect2 __P((struct socket *so1, struct socket *so2));
 int	socreate __P((int dom, struct socket **aso, int type, int proto,
 	    struct thread *td));
-void	sodealloc __P((struct socket *so));
+/*void	sodealloc __P((struct socket *so));*/
 int	sodisconnect __P((struct socket *so));
 void	sofree __P((struct socket *so));
 int	sogetopt __P((struct socket *so, struct sockopt *sopt));

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Wed Nov 14 17:20:16 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP id 358B937B417
	for <freebsd-arch@freebsd.org>; Wed, 14 Nov 2001 17:20:12 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id RAA05599;
	Wed, 14 Nov 2001 17:17:02 -0800 (PST)
Date: Wed, 14 Nov 2001 17:17:00 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: freebsd-arch@freebsd.org
Subject: Re: Need review - patch for socket locking and ref counting
In-Reply-To: <200111150015.fAF0Flb09186@apollo.backplane.com>
Message-ID: <Pine.BSF.4.21.0111141712380.4779-100000@InterJet.elischer.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

how does it cope with the old
"unix domain socket being passed across itself" case?


(I'm guessing it's references on the pcb that are tricky there and not
references on the sockets)


On Wed, 14 Nov 2001, Matthew Dillon wrote:

>     This patch adds a reference count to the socket structure
>     and cleans up & encapsulates the API calls.  I do not yet
>     attempt to use sxlocks to lock the socket structure (to allow
>     us to multi-thread the network stack), but that is the
>     direction I am headed.
> 




From owner-freebsd-arch  Wed Nov 14 17:22:11 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from elvis.mu.org (elvis.mu.org [216.33.66.196])
	by hub.freebsd.org (Postfix) with ESMTP id 36BFB37B416
	for <freebsd-arch@freebsd.org>; Wed, 14 Nov 2001 17:22:09 -0800 (PST)
Received: by elvis.mu.org (Postfix, from userid 1192)
	id DFA1381D05; Wed, 14 Nov 2001 19:22:03 -0600 (CST)
Date: Wed, 14 Nov 2001 19:22:03 -0600
From: Alfred Perlstein <bright@mu.org>
To: Julian Elischer <julian@elischer.org>
Cc: Matthew Dillon <dillon@apollo.backplane.com>,
	freebsd-arch@freebsd.org
Subject: Re: Need review - patch for socket locking and ref counting
Message-ID: <20011114192203.H13393@elvis.mu.org>
References: <200111150015.fAF0Flb09186@apollo.backplane.com> <Pine.BSF.4.21.0111141712380.4779-100000@InterJet.elischer.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <Pine.BSF.4.21.0111141712380.4779-100000@InterJet.elischer.org>; from julian@elischer.org on Wed, Nov 14, 2001 at 05:17:00PM -0800
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

> On Wed, 14 Nov 2001, Matthew Dillon wrote:
> 
> >     This patch adds a reference count to the socket structure
> >     and cleans up & encapsulates the API calls.  I do not yet
> >     attempt to use sxlocks to lock the socket structure (to allow
> >     us to multi-thread the network stack), but that is the
> >     direction I am headed.

* Julian Elischer <julian@elischer.org> [011114 19:20] wrote:
> how does it cope with the old
> "unix domain socket being passed across itself" case?
> 
> 
> (I'm guessing it's references on the pcb that are tricky there and not
> references on the sockets)

That's handled in the "struct file" handling code.

-- 
-Alfred Perlstein [alfred@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
 start asking why software is ignoring 30 years of accumulated wisdom.'
                           http://www.morons.org/rants/gpl-harmful.php3



From owner-freebsd-arch  Wed Nov 14 18:59:52 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from monorchid.lemis.com (monorchid.lemis.com [192.109.197.75])
	by hub.freebsd.org (Postfix) with ESMTP
	id 0090637B417; Wed, 14 Nov 2001 18:59:48 -0800 (PST)
Received: by monorchid.lemis.com (Postfix, from userid 1004)
	id EC3FF786E1; Thu, 15 Nov 2001 13:29:45 +1030 (CST)
Date: Thu, 15 Nov 2001 13:29:45 +1030
From: Greg Lehey <grog@FreeBSD.org>
To: Bruce Evans <bde@zeta.org.au>,
	Matthew Dillon <dillon@apollo.backplane.com>
Cc: Peter Wemm <peter@wemm.org>, Robert Watson <rwatson@FreeBSD.ORG>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
Message-ID: <20011115132945.C33267@monorchid.lemis.com>
References: <20011112165530.B34657-100000@delplex.bde.org> <200111121009.fACA9SI75024@apollo.backplane.com> <20011111191735.00D053807@overcee.netplex.com.au> <20011112165530.B34657-100000@delplex.bde.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200111121009.fACA9SI75024@apollo.backplane.com> <20011112165530.B34657-100000@delplex.bde.org>
User-Agent: Mutt/1.3.23i
Organization: The FreeBSD Project
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-418-838-708
WWW-Home-Page: http://www.FreeBSD.org/
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF  13 24 52 F8 6D A4 95 EF
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

On Monday, 12 November 2001 at 17:32:12 +1100, Bruce Evans wrote:
> On Sun, 11 Nov 2001, Peter Wemm wrote:
>
>> Robert Watson wrote:
>>
>>> It seems to me that unless a very strong argument exists against using
>>> curproc/curthread (and I don't preclude one existing), using them would
>>> actually be an improvement, as it would assert that this class of
>
>> My gripe is that on i386, it creates a LOT of work for the compiler.
>
> That's just an implementation detail for one arch.  I did strongly object
> to the implementation, but...

I must say that I don't have much sympathy for the compiler.  If it
also creates a lot of work for the processors, that's a different
matter.

>> Count me in the 'curproc considered harmful' camp.  (or curthread).
>
> Count me outside of it.

Agreed (for once).

On Monday, 12 November 2001 at  2:09:28 -0800, Matthew Dillon wrote:
>> Passing the pointer down through 20 subroutines (some of which don't
>> even use it except to pass it along) may add up to much.
>>
>> Bruce
>
>     I agree that it is kind of silly to pass a global down through N levels
>     of procedures.  Just on principle.  On the other hand I don't expect
>     the performance to be better or worse, or even for there to be any
>     real difference in code size.  Fewer instructions per routine in
>     more routines, with more memory writes (pass as argument on stack),
>     versus more instructions in fewer routines, with only memory reads
>     (access as global).  Without there being a clear winner there isn't
>     much of a reason to change the existing code.

OK, I've just got back from a conference to find several thousand
messages, many of them requiring to be read, so I haven't had much
time to look at this, but wouldn't it make more sense to pass the proc
or thread pointer (or whatever substructure is really needed) in a
structure which is being handed from function to function anyway?
struct buf would appear to be the correct one in this case.  I would
also expect this to make it easier for exceptions like NFS code.
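
A minimal user-space sketch of that idea, not kernel code.  The names
struct fake_thread, struct req_ctx, layer_top(), layer_mid() and
layer_check() are hypothetical and exist only for illustration; in the
kernel the carrier would be an existing structure such as struct buf.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct thread; it only carries a uid. */
struct fake_thread {
	int	ft_uid;
};

/* Hypothetical request context that is handed down the stack anyway. */
struct req_ctx {
	struct fake_thread *rc_td;	/* originating thread, may be NULL */
	int	rc_len;			/* some request payload */
};

/* Bottom layer: the only one that actually looks at the thread. */
static int
layer_check(const struct req_ctx *rc, int uid)
{
	return (rc->rc_td != NULL && rc->rc_td->ft_uid == uid);
}

/* Middle layer: forwards the context, no extra thread argument. */
static int
layer_mid(const struct req_ctx *rc, int uid)
{
	return (layer_check(rc, uid));
}

/* Top layer: records the thread in the context exactly once. */
static int
layer_top(struct fake_thread *td, int len, int uid)
{
	struct req_ctx rc = { td, len };

	return (layer_mid(&rc, uid));
}
```

The intermediate layers never name the thread; only the producer and
the final consumer of the context do.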

Greg
--
See complete headers for address and phone numbers



From owner-freebsd-arch  Wed Nov 14 19:44:47 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP id DEDFE37B41A
	for <freebsd-arch@FreeBSD.ORG>; Wed, 14 Nov 2001 19:44:38 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fAF3ibT11896;
	Wed, 14 Nov 2001 19:44:37 -0800 (PST)
	(envelope-from dillon)
Date: Wed, 14 Nov 2001 19:44:37 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111150344.fAF3ibT11896@apollo.backplane.com>
To: Alfred Perlstein <bright@mu.org>
Cc: Julian Elischer <julian@elischer.org>, freebsd-arch@FreeBSD.ORG
Subject: Re: Need review - patch for socket locking and ref counting
References: <200111150015.fAF0Flb09186@apollo.backplane.com> <Pine.BSF.4.21.0111141712380.4779-100000@InterJet.elischer.org> <20011114192203.H13393@elvis.mu.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

:* Julian Elischer <julian@elischer.org> [011114 19:20] wrote:
:> how does it cope with the old
:> "unix domain socket being passed across itself" case?
:> 
:> 
:> (I'm guessing it's references on the pcb that are tricky there and not
:> references on the sockets)
:
:That's handled in the "struct file" handling code.
:
:-- 
:-Alfred Perlstein [alfred@freebsd.org]

    Yah.  Hopefully I will never have to touch the GC code.  Again.

    What this stuff is (and by the way, don't bother trying to test
    it, I haven't tested it myself yet)... what this stuff is is
    basically the infrastructure that we will be building the MP 
    locking system for the network stack on top of.  Among other
    things.
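
A user-space sketch of the reference-count semantics that the patch's
soref()/sorele()/sotryfree() macros appear to implement.  The fake_*
names and the so_freed flag are hypothetical, added only so the
behaviour can be observed outside the kernel.

```c
#include <assert.h>

/* Hypothetical stand-in for struct socket. */
struct fake_socket {
	int	so_count;	/* reference count, as in the patch */
	int	so_freed;	/* set by fake_sofree(); illustration only */
};

/* Stand-in for sofree(): just records that the free happened. */
static void
fake_sofree(struct fake_socket *so)
{
	so->so_freed = 1;
}

/* Like soref(): take one more reference. */
static void
fake_soref(struct fake_socket *so)
{
	++so->so_count;
}

/* Like sorele(): drop a reference, free on the last one. */
static void
fake_sorele(struct fake_socket *so)
{
	if (--so->so_count == 0)
		fake_sofree(so);
}

/* Like sotryfree(): free only if nobody holds a reference. */
static void
fake_sotryfree(struct fake_socket *so)
{
	if (so->so_count == 0)
		fake_sofree(so);
}
```

With one reference held, fake_sotryfree() does nothing and the later
fake_sorele() performs the free, which is why the protocol detach
paths can call sotryfree() while a syscall still holds the socket.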

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>



From owner-freebsd-arch  Wed Nov 14 21:11: 4 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from elvis.mu.org (elvis.mu.org [216.33.66.196])
	by hub.freebsd.org (Postfix) with ESMTP
	id 8F88E37B41B; Wed, 14 Nov 2001 21:11:02 -0800 (PST)
Received: by elvis.mu.org (Postfix, from userid 1192)
	id 30ECC81D01; Wed, 14 Nov 2001 23:10:57 -0600 (CST)
Date: Wed, 14 Nov 2001 23:10:57 -0600
From: Alfred Perlstein <bright@mu.org>
To: Greg Lehey <grog@FreeBSD.org>
Cc: Bruce Evans <bde@zeta.org.au>,
	Matthew Dillon <dillon@apollo.backplane.com>,
	Peter Wemm <peter@wemm.org>, Robert Watson <rwatson@FreeBSD.ORG>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
Message-ID: <20011114231057.K13393@elvis.mu.org>
References: <20011112165530.B34657-100000@delplex.bde.org> <200111121009.fACA9SI75024@apollo.backplane.com> <20011111191735.00D053807@overcee.netplex.com.au> <20011112165530.B34657-100000@delplex.bde.org> <200111121009.fACA9SI75024@apollo.backplane.com> <20011115132945.C33267@monorchid.lemis.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <20011115132945.C33267@monorchid.lemis.com>; from grog@FreeBSD.org on Thu, Nov 15, 2001 at 01:29:45PM +1030
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

If you want to see why curproc sucks then please investigate what
happens when you NDINIT a nameidata with another thread pointer
other than your own, then perform a vn_open.  kablooey!

My recent addition of vn_open_cred and modification of nfs_lock.c
was to get around this badness of the API.

-- 
-Alfred Perlstein [alfred@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
 start asking why software is ignoring 30 years of accumulated wisdom.'
                           http://www.morons.org/rants/gpl-harmful.php3



From owner-freebsd-arch  Wed Nov 14 21:18:25 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id C3D1137B416; Wed, 14 Nov 2001 21:18:23 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fAF5IMW18730;
	Wed, 14 Nov 2001 21:18:22 -0800 (PST)
	(envelope-from dillon)
Date: Wed, 14 Nov 2001 21:18:22 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111150518.fAF5IMW18730@apollo.backplane.com>
To: Alfred Perlstein <bright@mu.org>
Cc: Greg Lehey <grog@FreeBSD.ORG>, Bruce Evans <bde@zeta.org.au>,
	Peter Wemm <peter@wemm.org>, Robert Watson <rwatson@FreeBSD.ORG>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References: <20011112165530.B34657-100000@delplex.bde.org> <200111121009.fACA9SI75024@apollo.backplane.com> <20011111191735.00D053807@overcee.netplex.com.au> <20011112165530.B34657-100000@delplex.bde.org> <200111121009.fACA9SI75024@apollo.backplane.com> <20011115132945.C33267@monorchid.lemis.com> <20011114231057.K13393@elvis.mu.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


:If you want to see why curproc sucks then please investigate what
:happens when you NDINIT a nameidata with another thread pointer
:other than your own, then perform a vn_open.  kablooey!
:
:My recent addition of vn_open_cred and modification of nfs_lock.c
:was to get around this badness of the API.
:
:-- 
:-Alfred Perlstein [alfred@freebsd.org]

    I'm not sure this is a fair argument.  Just about all the code
    in the system taking a struct thread * pointer assumes that the
    thread is the current thread and so avoids much of the locking that
    it would normally have to do on it.  Passing some other thread
    to a good chunk of this code will have very weird broken results.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>



From owner-freebsd-arch  Thu Nov 15  6: 6:40 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP
	id AF1E037B416; Thu, 15 Nov 2001 06:06:36 -0800 (PST)
Received: from fledge.watson.org (ak82hjs7hex92j@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.6/8.11.5) with SMTP id fAFE6Li87788;
	Thu, 15 Nov 2001 09:06:21 -0500 (EST)
	(envelope-from robert@fledge.watson.org)
Date: Thu, 15 Nov 2001 09:06:20 -0500 (EST)
From: Robert Watson <rwatson@FreeBSD.ORG>
X-Sender: robert@fledge.watson.org
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: Alfred Perlstein <bright@mu.org>, Greg Lehey <grog@FreeBSD.ORG>,
	Bruce Evans <bde@zeta.org.au>, Peter Wemm <peter@wemm.org>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <200111150518.fAF5IMW18730@apollo.backplane.com>
Message-ID: <Pine.NEB.3.96L.1011115090027.87678A-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On Wed, 14 Nov 2001, Matthew Dillon wrote:

> 
> :If you want to see why curproc sucks then please investigate what
> :happens when you NDINIT a nameidata with another thread pointer
> :other than your own, then perform a vn_open.  kablooey!
> :
> :My recent addition of vn_open_cred and modification of nfs_lock.c
> :was to get around this badness of the API.
> :
> :-- 
> :-Alfred Perlstein [alfred@freebsd.org]
> 
>     I'm not sure this is a fair argument.  Just about all the code
>     in the system taking a struct thread * pointer assumes that the
>     thread is the current thread and so avoids much of the locking that
>     it would normally have to do on it.  Passing some other thread
>     to a good chunk of this code will have very weird broken results.

In my mind, that is in fact the primary argument *to* use curproc instead
of passing around process and thread pointers.  If the routine implicitly
assumes curproc or curthread for locking/referencing purposes, either
there needs to be a way to assert that:

int
foobar(struct thread *td, int arg)
{

	PROMISE_ME_ITS_CURTHREAD_OR_DIE_HORRIBLY(td);

	arg += td->td_only_safe_to_read_without_lock_if_curthread;
	/*
	 * Contrived example a little less contrived:
	 * return (td->td_ucred->cr_uid == arg);
	 */
	return (arg);
}


or, we simply need to use curthread and curproc, and not allow anything
else to be passed in.

int
foobar(int arg)
{

	arg += curthread->td_only_safe_to_read_without_lock_if_curthread;
	return (arg);
}
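
For comparison, a user-space sketch of both variants that actually
compiles.  Everything here is an assumption made for illustration:
fake_curthread stands in for the kernel's curthread, struct
fake_thread for struct thread, and ASSERT_CURTHREAD() for the
hypothetical PROMISE_ME_ITS_CURTHREAD_OR_DIE_HORRIBLY() above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct thread. */
struct fake_thread {
	int	td_value;	/* only safe to read if we are this thread */
};

/* Hypothetical stand-in for the kernel's curthread. */
static struct fake_thread *fake_curthread;

/* Dies (via assert) if the caller passed some other thread. */
#define	ASSERT_CURTHREAD(td)	assert((td) == fake_curthread)

/* Variant 1: keep the argument, but assert what it must be. */
static int
foobar_checked(struct fake_thread *td, int arg)
{
	ASSERT_CURTHREAD(td);
	return (arg + td->td_value);
}

/* Variant 2: drop the argument and use the current thread directly. */
static int
foobar_implicit(int arg)
{
	return (arg + fake_curthread->td_value);
}
```

Both variants give the same answer for the current thread; the first
documents and enforces the assumption, the second makes it impossible
to violate.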

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services




From owner-freebsd-arch  Thu Nov 15  6:14:50 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31])
	by hub.freebsd.org (Postfix) with ESMTP
	id 8661537B405; Thu, 15 Nov 2001 06:14:45 -0800 (PST)
Received: by flood.ping.uio.no (Postfix, from userid 2602)
	id D3A9614C40; Thu, 15 Nov 2001 15:14:41 +0100 (CET)
X-URL: http://www.ofug.org/~des/
X-Disclaimer: The views expressed in this message do not necessarily
  coincide with those of any organisation or company with
  which I am or have been affiliated.
To: Robert Watson <rwatson@FreeBSD.ORG>
Cc: Matthew Dillon <dillon@apollo.backplane.com>,
	Alfred Perlstein <bright@mu.org>, Greg Lehey <grog@FreeBSD.ORG>,
	Bruce Evans <bde@zeta.org.au>, Peter Wemm <peter@wemm.org>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References: <Pine.NEB.3.96L.1011115090027.87678A-100000@fledge.watson.org>
From: Dag-Erling Smorgrav <des@ofug.org>
Date: 15 Nov 2001 15:14:41 +0100
In-Reply-To: <Pine.NEB.3.96L.1011115090027.87678A-100000@fledge.watson.org>
Message-ID: <xzpn11o6rzi.fsf@flood.ping.uio.no>
Lines: 16
User-Agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.7
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Robert Watson <rwatson@FreeBSD.ORG> writes:
> In my mind, that is in fact the primary argument *to* use curproc instead
> of passing around process and thread pointers.  If the routine implicitly
> assumes curproc or curthread for locking/referencing purposes, either
> there needs to be a way to assert that:
> [example of PROMISE_ME_ITS_CURTHREAD_OR_DIE_HORRIBLY(td)]
> or, we simply need to use curthread and curproc, and not allow anything
> else to be passed in.

I greatly prefer the first approach, as it allows us to gradually fix
parts of the kernel to be curthread-agnostic without the hassle and
breakage that inevitably follow from massive API changes.

DES
-- 
Dag-Erling Smorgrav - des@ofug.org



From owner-freebsd-arch  Thu Nov 15 10:54:58 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP
	id 560BB37B405; Thu, 15 Nov 2001 10:54:50 -0800 (PST)
Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.6/8.11.5) with SMTP id fAFIsPi02832;
	Thu, 15 Nov 2001 13:54:25 -0500 (EST)
	(envelope-from robert@fledge.watson.org)
Date: Thu, 15 Nov 2001 13:54:25 -0500 (EST)
From: Robert Watson <rwatson@FreeBSD.ORG>
X-Sender: robert@fledge.watson.org
To: Dag-Erling Smorgrav <des@ofug.org>
Cc: Matthew Dillon <dillon@apollo.backplane.com>,
	Alfred Perlstein <bright@mu.org>, Greg Lehey <grog@FreeBSD.ORG>,
	Bruce Evans <bde@zeta.org.au>, Peter Wemm <peter@wemm.org>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <xzpn11o6rzi.fsf@flood.ping.uio.no>
Message-ID: <Pine.NEB.3.96L.1011115135118.1877C-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

On 15 Nov 2001, Dag-Erling Smorgrav wrote:

> Robert Watson <rwatson@FreeBSD.ORG> writes:
> > In my mind, that is in fact the primary argument *to* use curproc instead
> > of passing around process and thread pointers.  If the routine implicitly
> > assumes curproc or curthread for locking/referencing purposes, either
> > there needs to be a way to assert that:
> > [example of PROMISE_ME_ITS_CURTHREAD_OR_DIE_HORRIBLY(td)]
> > or, we simply need to use curthread and curproc, and not allow anything
> > else to be passed in.
> 
> I greatly prefer the first approach, as it allows us to gradually fix
> parts of the kernel to be curthread-agnostic without the hassle and
> breakage that inevitably follow from massive API changes. 

The implicit question behind that, though, is: are there places in the
kernel that will always be locked into using curproc/curthread, simply due
to the structure and behavior of the kernel environment?  For example, I
would generally think that 'borrowing' a proc or thread structure is a bad
idea, rather, you want that proc or thread to 'loan' you references to
supporting ref-counted structures (vmspaces, creds, ...).  On a small
scale, routines like 'copyin' and 'copyout' already follow the "must use
curproc/curthread, so don't bother taking one on the command line"
strategy.  If we were to assert that a certain class of functions always
acted on behalf of the calling thread or process, that's not necessarily
bad.  It might allow us to substantially simplify locking and reference
handling, for example.
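
To make the "loan a reference" idea concrete, here is a minimal user-space
sketch of a ref-counted credential.  It is only an illustration under stated
assumptions: struct cred, cred_alloc(), cred_hold() and cred_free() are
hypothetical names invented for this example, not the kernel's credential
API, and C11 atomics stand in for whatever the kernel would really use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel credential; not the real structure. */
struct cred {
	atomic_uint refs;	/* number of outstanding references */
	unsigned    uid;	/* example payload */
};

/* Allocate a credential holding one reference for its creator. */
static struct cred *
cred_alloc(unsigned uid)
{
	struct cred *cr = malloc(sizeof(*cr));

	if (cr == NULL)
		return (NULL);
	atomic_init(&cr->refs, 1);
	cr->uid = uid;
	return (cr);
}

/* "Loan" a reference: the borrower may use cr until it calls cred_free(). */
static struct cred *
cred_hold(struct cred *cr)
{
	atomic_fetch_add(&cr->refs, 1);
	return (cr);
}

/* Drop one reference; returns 1 if it was the last and cr was freed. */
static int
cred_free(struct cred *cr)
{
	if (atomic_fetch_sub(&cr->refs, 1) == 1) {
		free(cr);
		return (1);
	}
	return (0);
}
```

The borrower never touches the owning proc or thread again after taking the
reference, so it needs no lock on that structure while it uses the
credential.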

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Thu Nov 15 11:26:53 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31])
	by hub.freebsd.org (Postfix) with ESMTP
	id DFF5537B405; Thu, 15 Nov 2001 11:26:48 -0800 (PST)
Received: by flood.ping.uio.no (Postfix, from userid 2602)
	id 0558414C2E; Thu, 15 Nov 2001 20:26:46 +0100 (CET)
X-URL: http://www.ofug.org/~des/
X-Disclaimer: The views expressed in this message do not necessarily
  coincide with those of any organisation or company with
  which I am or have been affiliated.
To: Robert Watson <rwatson@FreeBSD.ORG>
Cc: Matthew Dillon <dillon@apollo.backplane.com>,
	Alfred Perlstein <bright@mu.org>, Greg Lehey <grog@FreeBSD.ORG>,
	Bruce Evans <bde@zeta.org.au>, Peter Wemm <peter@wemm.org>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References: <Pine.NEB.3.96L.1011115135118.1877C-100000@fledge.watson.org>
From: Dag-Erling Smorgrav <des@ofug.org>
Date: 15 Nov 2001 20:26:45 +0100
In-Reply-To: <Pine.NEB.3.96L.1011115135118.1877C-100000@fledge.watson.org>
Message-ID: <xzpg07f6dje.fsf@flood.ping.uio.no>
Lines: 34
User-Agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.7
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Robert Watson <rwatson@FreeBSD.ORG> writes:
> The implicit question behind that, though, is: are there places in the
> kernel that will always be locked into using curproc/curthread, simply due
> to the structure and behavior of the kernel environment.

There are a number of cases here:

 1) the thread in question is curthread, and it is locked.
 2) the thread may be any thread, but it is locked.
 3) the thread may be any thread, and is not locked.

(am I correct in assuming that curthread is *always* locked in code
called from syscalls?)

In some cases it doesn't make sense to assume anything but 1), because
it is the case in 99% of the situations where the code is invoked and
assuming 2) or 3) would involve a severe performance penalty for the
common case.  Copyin() is one example of this; for the rare cases
where you need to copy data from a non-current thread's address space
(mainly ptrace() and procfs), there is proc_rwmem().

In some cases it is desirable for an API to handle non-current
threads.  In those cases, it is the responsibility of the API
functions to make sure the thread they're manipulating is properly
locked.

In some cases it is desirable for an API to handle non-current
threads, but assume that the thread is locked, to save the overhead of
mutex operations.  In those cases, the code should be protected by
mutex assertions.
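
The three API styles described above can be sketched as follows.  This is a
toy, single-threaded model, not kernel code: struct toy_lock, curthread_sim
and the td_id_*() functions are all hypothetical names made up for the
example, and the "lock" only records its owner so that the assertions have
something to check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; no real concurrency is involved. */
struct thread {
	int td_id;
};

struct toy_lock {
	const struct thread *owner;	/* NULL when unlocked */
};

/* Stand-in for the kernel's notion of the running thread. */
static const struct thread *curthread_sim;

static void
toy_lock_acquire(struct toy_lock *lk)
{
	assert(lk->owner == NULL);	/* model only: no blocking, no recursion */
	lk->owner = curthread_sim;
}

static void
toy_lock_release(struct toy_lock *lk)
{
	assert(lk->owner == curthread_sim);
	lk->owner = NULL;
}

/* Returns non-zero if the simulated current thread holds lk. */
static int
toy_lock_owned(const struct toy_lock *lk)
{
	return (lk->owner != NULL && lk->owner == curthread_sim);
}

/* Style 1: only valid for the current thread; the assumption is asserted. */
static int
td_id_self(const struct thread *td)
{
	assert(td == curthread_sim);
	return (td->td_id);
}

/* Style 2: any thread; the function takes and drops the lock itself. */
static int
td_id_any(const struct thread *td, struct toy_lock *lk)
{
	int id;

	toy_lock_acquire(lk);
	id = td->td_id;
	toy_lock_release(lk);
	return (id);
}

/* Style 3: any thread; the caller must already hold the lock. */
static int
td_id_locked(const struct thread *td, const struct toy_lock *lk)
{
	assert(toy_lock_owned(lk));
	return (td->td_id);
}
```

In each style the assumption is stated once, in the callee, so a caller that
violates it fails at the call rather than corrupting state later.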

DES
-- 
Dag-Erling Smorgrav - des@ofug.org

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Thu Nov 15 11:29:12 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail11.speakeasy.net (mail11.speakeasy.net [216.254.0.211])
	by hub.freebsd.org (Postfix) with ESMTP id 5FB3D37B442
	for <freebsd-arch@FreeBSD.ORG>; Thu, 15 Nov 2001 11:28:57 -0800 (PST)
Received: (qmail 69254 invoked from network); 15 Nov 2001 19:28:56 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail11.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <des@ofug.org>; 15 Nov 2001 19:28:56 -0000
Message-ID: <XFMail.011115112856.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <xzpg07f6dje.fsf@flood.ping.uio.no>
Date: Thu, 15 Nov 2001 11:28:56 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Dag-Erling Smorgrav <des@ofug.org>
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.ORG, Peter Wemm <peter@wemm.org>,
	Bruce Evans <bde@zeta.org.au>, Greg Lehey <grog@FreeBSD.ORG>,
	Alfred Perlstein <bright@mu.org>,
	Matthew Dillon <dillon@apollo.backplane.com>,
	Robert Watson <rwatson@FreeBSD.ORG>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 15-Nov-01 Dag-Erling Smorgrav wrote:
> Robert Watson <rwatson@FreeBSD.ORG> writes:
>> The implicit question behind that, though, is: are there places in the
>> kernel that will always be locked into using curproc/curthread, simply due
>> to the structure and behavior of the kernel environment.
> 
> There's a number of cases here:
> 
>  1) the thread in question is curthread, and it is locked.
>  2) the thread may be any thread, but it is locked.
>  3) the thread may be any thread, and is not locked.
> 
> (am I correct in assuming that curthread is *always* locked in code
> called from syscalls?)

Err, no.  curthread doesn't even have a lock.  Look at sys/proc.h.  There are
some fields we don't use any locks on, because we assume that only curthread
messes with its own copy, or some such.

-- 

John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Thu Nov 15 11:52:28 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31])
	by hub.freebsd.org (Postfix) with ESMTP
	id C509137B421; Thu, 15 Nov 2001 11:52:21 -0800 (PST)
Received: by flood.ping.uio.no (Postfix, from userid 2602)
	id 2B99114C2E; Thu, 15 Nov 2001 20:52:20 +0100 (CET)
X-URL: http://www.ofug.org/~des/
X-Disclaimer: The views expressed in this message do not necessarily
  coincide with those of any organisation or company with
  which I am or have been affiliated.
To: John Baldwin <jhb@FreeBSD.org>
Cc: freebsd-arch@FreeBSD.ORG, Peter Wemm <peter@wemm.org>,
	Bruce Evans <bde@zeta.org.au>, Greg Lehey <grog@FreeBSD.ORG>,
	Alfred Perlstein <bright@mu.org>,
	Matthew Dillon <dillon@apollo.backplane.com>,
	Robert Watson <rwatson@FreeBSD.ORG>
Subject: Re: cur{thread/proc}, or not.
References: <XFMail.011115112856.jhb@FreeBSD.org>
From: Dag-Erling Smorgrav <des@ofug.org>
Date: 15 Nov 2001 20:52:19 +0100
In-Reply-To: <XFMail.011115112856.jhb@FreeBSD.org>
Message-ID: <xzpbsi36ccs.fsf@flood.ping.uio.no>
Lines: 10
User-Agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.7
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

John Baldwin <jhb@FreeBSD.org> writes:
> Err, no.  curthread doesn't even have a lock.  Look at sys/proc.h.  There are
> some fields we don't use any locks on, because we assume that only curthread
> messes with its own copy, or some such.

Hmm, then you need to lock the entire process, don't you?

DES
-- 
Dag-Erling Smorgrav - des@ofug.org

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Thu Nov 15 12: 2: 5 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail6.speakeasy.net (mail6.speakeasy.net [216.254.0.206])
	by hub.freebsd.org (Postfix) with ESMTP id E337137B428
	for <freebsd-arch@FreeBSD.ORG>; Thu, 15 Nov 2001 12:01:51 -0800 (PST)
Received: (qmail 11881 invoked from network); 15 Nov 2001 20:01:22 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail6.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <des@ofug.org>; 15 Nov 2001 20:01:22 -0000
Message-ID: <XFMail.011115120150.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <xzpbsi36ccs.fsf@flood.ping.uio.no>
Date: Thu, 15 Nov 2001 12:01:50 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Dag-Erling Smorgrav <des@ofug.org>
Subject: Re: cur{thread/proc}, or not.
Cc: Robert Watson <rwatson@FreeBSD.ORG>,
	Matthew Dillon <dillon@apollo.backplane.com>,
	Alfred Perlstein <bright@mu.org>, Greg Lehey <grog@FreeBSD.ORG>,
	Bruce Evans <bde@zeta.org.au>, Peter Wemm <peter@wemm.org>,
	freebsd-arch@FreeBSD.ORG
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 15-Nov-01 Dag-Erling Smorgrav wrote:
> John Baldwin <jhb@FreeBSD.org> writes:
>> Err, no.  curthread doesn't even have a lock.  Look at sys/proc.h.  There
>> are some fields we don't use any locks on, because we assume that only
>> curthread messes with its own copy, or some such.
> 
> Hmm, then you need to lock the entire process, don't you?

Only for certain things.  We don't actually lock the process unless we need to
down inside of a syscall.

> DES
> -- 
> Dag-Erling Smorgrav - des@ofug.org

-- 

John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Thu Nov 15 12:11:40 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from peter3.wemm.org (c1315225-a.plstn1.sfba.home.com [24.14.150.180])
	by hub.freebsd.org (Postfix) with ESMTP id 6748037B419
	for <freebsd-arch@FreeBSD.ORG>; Thu, 15 Nov 2001 12:11:17 -0800 (PST)
Received: from overcee.netplex.com.au (overcee.wemm.org [10.0.0.3])
	by peter3.wemm.org (8.11.0/8.11.0) with ESMTP id fAFKBHM24661
	for <freebsd-arch@FreeBSD.ORG>; Thu, 15 Nov 2001 12:11:17 -0800 (PST)
	(envelope-from peter@wemm.org)
Received: from wemm.org (localhost [127.0.0.1])
	by overcee.netplex.com.au (Postfix) with ESMTP
	id 34216380A; Thu, 15 Nov 2001 12:11:17 -0800 (PST)
	(envelope-from peter@wemm.org)
X-Mailer: exmh version 2.5 07/13/2001 with nmh-1.0.4
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: freebsd-arch@FreeBSD.ORG
Subject: Re: Need review - patch for socket locking and ref counting 
In-Reply-To: <200111150015.fAF0Flb09186@apollo.backplane.com> 
Date: Thu, 15 Nov 2001 12:11:17 -0800
From: Peter Wemm <peter@wemm.org>
Message-Id: <20011115201117.34216380A@overcee.netplex.com.au>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Matthew Dillon wrote:

> +static __inline
> +struct mtx *
> +_mtx_pool1_find(void *ptr)
> +{
> +    return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 0]);
> +}

At the very least, this is not going to compile very well on 64 bit machines.
You cannot cast a pointer to an int.  It needs to be uintptr_t at minimum.
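
For illustration, here is a minimal sketch of the same hash done with
pointer-width arithmetic.  On an LP64 platform int is 32 bits and pointers
are 64, so the (int) cast would discard the upper half of the address.  The
names below (POOL_SIZE, POOL_MASK, struct toy_mtx, pool_ary, pool_find) are
placeholders chosen so the example compiles outside the kernel; they are not
the names or sizes from the patch, and the "| 0" pool selector is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pool size; must be a power of two for the mask to work. */
#define POOL_SIZE 128
#define POOL_MASK (POOL_SIZE - 1)

/* Placeholder for struct mtx so the sketch builds in user space. */
struct toy_mtx {
	int locked;
};

static struct toy_mtx pool_ary[POOL_SIZE];

/* Map an arbitrary address to one pool entry without truncating it. */
static struct toy_mtx *
pool_find(const void *ptr)
{
	uintptr_t p = (uintptr_t)ptr;

	return (&pool_ary[(p ^ (p >> 6)) & POOL_MASK]);
}
```

Because the mask is applied after the XOR, the index always stays within the
array regardless of pointer width.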

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Thu Nov 15 12:24:12 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from pintail.mail.pas.earthlink.net (pintail.mail.pas.earthlink.net [207.217.120.122])
	by hub.freebsd.org (Postfix) with ESMTP
	id 3C58E37B417; Thu, 15 Nov 2001 12:24:06 -0800 (PST)
Received: from dialup-209.245.139.20.dial1.sanjose1.level3.net ([209.245.139.20] helo=mindspring.com)
	by pintail.mail.pas.earthlink.net with esmtp (Exim 3.33 #1)
	id 164T3M-0005mO-00; Thu, 15 Nov 2001 12:24:05 -0800
Message-ID: <3BF4248D.1735C282@mindspring.com>
Date: Thu, 15 Nov 2001 12:24:45 -0800
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Robert Watson <rwatson@FreeBSD.ORG>
Cc: Dag-Erling Smorgrav <des@ofug.org>,
	Matthew Dillon <dillon@apollo.backplane.com>,
	Alfred Perlstein <bright@mu.org>, Greg Lehey <grog@FreeBSD.ORG>,
	Bruce Evans <bde@zeta.org.au>, Peter Wemm <peter@wemm.org>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References: <Pine.NEB.3.96L.1011115135118.1877C-100000@fledge.watson.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Robert Watson wrote:
> The implicit question behind that, though, is: are there places in the
> kernel that will always be locked into using curproc/curthread, simply due
> to the structure and behavior of the kernel environment.  For example, I
> would generally think that 'borrowing' a proc or thread structure is a bad
> idea, rather, you want that proc or thread to 'loan' you references to
> supporting ref-counted structures (vmspaces, creds, ...).  On a small
> scale, routines like 'copyin' and 'copyout' already follow the "must use
> curproc/curthread, so don't bother taking one on the command line"
> strategy.  If we were to assert that a certain class of functions always
> acted on behalf of the calling thread or process, that's not necessarily
> bad.  It might allow us to substantially simplify locking and reference
> handling, for example.

Regardless of how many angels can dance on this pin, it
would be a good idea to document lock assumptions in and out
of all functions, using both comments and assert().

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Thu Nov 15 12:46: 4 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail12.speakeasy.net (mail12.speakeasy.net [216.254.0.212])
	by hub.freebsd.org (Postfix) with ESMTP id D2A3737B419
	for <freebsd-arch@FreeBSD.ORG>; Thu, 15 Nov 2001 12:46:00 -0800 (PST)
Received: (qmail 2677 invoked from network); 15 Nov 2001 20:46:00 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail12.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <peter@wemm.org>; 15 Nov 2001 20:46:00 -0000
Message-ID: <XFMail.011115124557.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <20011115201117.34216380A@overcee.netplex.com.au>
Date: Thu, 15 Nov 2001 12:45:57 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Peter Wemm <peter@wemm.org>
Subject: Re: Need review - patch for socket locking and ref counting
Cc: freebsd-arch@FreeBSD.ORG,
	Matthew Dillon <dillon@apollo.backplane.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 15-Nov-01 Peter Wemm wrote:
> Matthew Dillon wrote:
> 
>> +static __inline
>> +struct mtx *
>> +_mtx_pool1_find(void *ptr)
>> +{
>> +    return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 0]);
>> +}
> 
> At the very least, this is not going to compile very well on 64 bit machines.
> You cannot cast a pointer to an int.  It needs to be uintptr_t at minimum.

I would also prefer a generic mechanism for multiple pools with a struct
mtx_pool containing a count, index for alloc, and pointer to the array of
locks and pass it as the first arg to mtx_pool_foo().  This would also entail a
mtx_pool_init(struct mtx_pool *mp, int size); and a
mtx_pool_destroy(struct mtx_pool *mp);  This is much cleaner and more
extensible than hardcoding 4 pools of equal size.
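
A rough user-space sketch of what such a generic pool could look like is
below.  Only the struct name and the two signatures mtx_pool_init() and
mtx_pool_destroy() come from the proposal above; everything else (the field
names, the int return value, mtx_pool_find(), mtx_pool_alloc(), the
power-of-two restriction, and struct toy_mtx standing in for struct mtx) is
an assumption made for the sake of a compilable example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Placeholder for struct mtx so the sketch builds outside the kernel. */
struct toy_mtx {
	int locked;
};

struct mtx_pool {
	int		mp_count;	/* number of locks (power of two here) */
	int		mp_next;	/* round-robin index for allocation */
	struct toy_mtx *mp_locks;	/* array of mp_count locks */
};

/* Returns 0 on success, -1 on a bad size or allocation failure. */
static int
mtx_pool_init(struct mtx_pool *mp, int size)
{
	if (size <= 0 || (size & (size - 1)) != 0)
		return (-1);
	mp->mp_locks = calloc((size_t)size, sizeof(*mp->mp_locks));
	if (mp->mp_locks == NULL)
		return (-1);
	mp->mp_count = size;
	mp->mp_next = 0;
	return (0);
}

static void
mtx_pool_destroy(struct mtx_pool *mp)
{
	free(mp->mp_locks);
	mp->mp_locks = NULL;
	mp->mp_count = 0;
	mp->mp_next = 0;
}

/* Hash an address to one of this pool's locks. */
static struct toy_mtx *
mtx_pool_find(struct mtx_pool *mp, const void *ptr)
{
	uintptr_t p = (uintptr_t)ptr;

	return (&mp->mp_locks[(p ^ (p >> 6)) & (uintptr_t)(mp->mp_count - 1)]);
}

/* Hand out locks round-robin, independent of any address. */
static struct toy_mtx *
mtx_pool_alloc(struct mtx_pool *mp)
{
	struct toy_mtx *m = &mp->mp_locks[mp->mp_next];

	mp->mp_next = (mp->mp_next + 1) & (mp->mp_count - 1);
	return (m);
}
```

With the pool passed as the first argument, pools of different sizes (or
with different properties) can coexist without duplicating the lookup code.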

-- 

John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Thu Nov 15 14: 0:31 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mta04.onebox.com (mta04.onebox.com [64.68.77.147])
	by hub.freebsd.org (Postfix) with ESMTP id 74E2237B41D
	for <arch@FreeBSD.ORG>; Thu, 15 Nov 2001 14:00:26 -0800 (PST)
Received: from onebox.com ([10.1.111.7]) by mta04.onebox.com
          (InterMail vM.4.01.03.23 201-229-121-123-20010418) with SMTP
          id <20011115220026.SYKD12575.mta04.onebox.com@onebox.com>
          for <arch@FreeBSD.ORG>; Thu, 15 Nov 2001 14:00:26 -0800
Received: from [63.49.208.149] by onebox.com with HTTP; Thu, 15 Nov 2001 14:00:26 -0800
Date: Thu, 15 Nov 2001 14:00:26 -0800
Subject: KSE Mail-List Archive Summary
From: "Glenn Gombert" <glenngombert@onebox.com>
To: arch@FreeBSD.ORG
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
MIME-Version: 1.0
Message-Id: <20011115220026.SYKD12575.mta04.onebox.com@onebox.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


 I put together a summary of some of the important KSE & Mutex discussions
(threads) from the last few months on my FreeBSD web site at
"freebsd.imatowns.com".  I mainly did it for my own reference (I did not
try to include everything, just the major themes and topics covered), but
thought that others might find it useful as well.

-- 
Glenn Gombert
glenngombert@onebox.com - email
(513) 587-2643 x2263 - voicemail/fax


__________________________________________________
FREE voicemail, email, and fax...all in one place.
Sign Up Now! http://www.onebox.com


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Thu Nov 15 14:53:55 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mta08.onebox.com (mta08.onebox.com [64.68.76.143])
	by hub.freebsd.org (Postfix) with ESMTP id 6E4C537B417
	for <arch@FreeBSD.org>; Thu, 15 Nov 2001 14:53:53 -0800 (PST)
Received: from onebox.com ([10.1.101.5]) by mta08.onebox.com
          (InterMail vM.4.01.03.23 201-229-121-123-20010418) with SMTP
          id <20011115225353.UUBN16107.mta08.onebox.com@onebox.com>;
          Thu, 15 Nov 2001 14:53:53 -0800
Received: from [165.121.195.182] by onebox.com with HTTP; Thu, 15 Nov 2001 14:53:53 -0800
Date: Thu, 15 Nov 2001 14:53:53 -0800
Subject: Re: KSE Mail-List Archive Summary
From: "Glenn Gombert" <glenngombert@onebox.com>
To: sandeepj@research.bell-labs.com (Sandeep Joshi)
Cc: glenngombert@onebox.com
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
MIME-Version: 1.0
Message-Id: <20011115225353.UUBN16107.mta08.onebox.com@onebox.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


Sorry...this should now be fixed....I had to do it on my 'Windoz' machine
at work :)

-- 
Glenn Gombert
glenngombert@onebox.com - email
(513) 587-2643 x2263 - voicemail/fax


---- sandeepj@research.bell-labs.com (Sandeep Joshi) wrote:
> hello Glen,
> 
> This link doesn't work from Unix-based navigators
> since there is a space in the link
> 
> http://freebsd.imatowns.com/BSD KSE-Mail Summary.txt
> 
> -A passive observer
> 

__________________________________________________
FREE voicemail, email, and fax...all in one place.
Sign Up Now! http://www.onebox.com


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Thu Nov 15 16: 0:20 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP id 7EED737B41A
	for <arch@FreeBSD.ORG>; Thu, 15 Nov 2001 16:00:17 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id PAA10446;
	Thu, 15 Nov 2001 15:55:59 -0800 (PST)
Date: Thu, 15 Nov 2001 15:55:58 -0800 (PST)
From: Julian Elischer <julian@elischer.org>
To: Glenn Gombert <glenngombert@onebox.com>
Cc: arch@FreeBSD.ORG
Subject: Re: KSE Mail-List Archive Summary
In-Reply-To: <20011115220026.SYKD12575.mta04.onebox.com@onebox.com>
Message-ID: <Pine.BSF.4.21.0111151555510.6632-100000@InterJet.elischer.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

thanks!...


On Thu, 15 Nov 2001, Glenn Gombert wrote:

> 
>  I put together a summary of some of the important KSE & Mutex discussions
> (threads) from the last few months on my FreeBSD web site at
> "freebsd.imatowns.com".  I mainly did it for my own reference (I did not
> try to include everything, just the major themes and topics covered), but
> thought that others might find it useful as well.
> 
> -- 
> Glenn Gombert
> glenngombert@onebox.com - email
> (513) 587-2643 x2263 - voicemail/fax
> 


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Fri Nov 16 19: 0:43 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 9E05F37B419; Fri, 16 Nov 2001 19:00:40 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fAH30dv75857;
	Fri, 16 Nov 2001 19:00:39 -0800 (PST)
	(envelope-from dillon)
Date: Fri, 16 Nov 2001 19:00:39 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111170300.fAH30dv75857@apollo.backplane.com>
To: John Baldwin <jhb@FreeBSD.ORG>
Cc: Peter Wemm <peter@wemm.org>, freebsd-arch@FreeBSD.ORG
Subject: Re: Need review - patch for socket locking and ref counting
References:  <XFMail.011115124557.jhb@FreeBSD.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

    I've thought about it a bit and I've come to the conclusion that
    we should *not* have multiple mutex pools.

    The single pool we have works wonderfully for interlock operations.
    For example, the interlocks used inside the sxlock structure and code,
    and inside the lockmgr structure and code (the lockmgr previously used
    its own hacked up pool for its interlock).  The pool effectively cuts
    the size and overhead of higher level structures - such as sxlocks - down
    considerably.

    But our ability to use pools for higher level constructs, like the sxlocks
    themselves, is severely limited.  My attempts so far have only resulted
    in more obfuscated code.  

    I think the pool implementation should be left as it is and used ONLY
    for interlocks and 'leaf' locks, as I originally designed it.  Adding
    multiple-pools (and the allocation / freeing / management headaches 
    that go along with that) will only create a mess.  I don't think it's
    even possible to use a pool of sx locks safely, for example, even with
    the multiple pool concept.

    The current pool code is nice because it simplifies our code base
    somewhat rather than making it more complex.  I see absolutely no need
    for a multiple-pool mechanism at this time.

    For similar reasons I believe we should also simplify the APIs to 
    other low level constructs.  I would like to simplify the SX lock
    API (get rid of sx_tryupgrade() and sx_downgrade()), and I would 
    like to see a more simplified structure if possible in order to 
    make SX locks more useful as embedded entities in higher level system
    structures such as TCP sockets or PCBs.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>


:On 15-Nov-01 Peter Wemm wrote:
:> Matthew Dillon wrote:
:> 
:>> +static __inline
:>> +struct mtx *
:>> +_mtx_pool1_find(void *ptr)
:>> +{
:>> +    return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 0]);
:>> +}
:> 
:> At the very least, this is not going to compile very well on 64 bit machines.
:> You cannot cast a pointer to an int.  It needs to be uintptr_t at minimum.
:
:I would also prefer a generic mechanism for multiple pools with a struct
:mtx_pool containing a count, index for alloc, and pointer to the array of
:locks and pass it as the first arg to mtx_pool_foo().  This would also entail a
:mtx_pool_init(struct mtx_pool *mp, int size); and a
:mtx_pool_destroy(struct mtx_pool *mp);  This is much cleaner and extensible
:than hardcoding 4 pools of equal size.
:
:-- 
:
:John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
:"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Sat Nov 17  2:24: 0 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205])
	by hub.freebsd.org (Postfix) with ESMTP id 151B337B416
	for <freebsd-arch@FreeBSD.ORG>; Sat, 17 Nov 2001 02:23:58 -0800 (PST)
Received: (qmail 31836 invoked from network); 17 Nov 2001 10:23:57 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <dillon@apollo.backplane.com>; 17 Nov 2001 10:23:57 -0000
Message-ID: <XFMail.011117022356.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <200111170300.fAH30dv75857@apollo.backplane.com>
Date: Sat, 17 Nov 2001 02:23:56 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: Need review - patch for socket locking and ref counting
Cc: freebsd-arch@FreeBSD.ORG, Peter Wemm <peter@wemm.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 17-Nov-01 Matthew Dillon wrote:
>     I've thought about it a bit and I've come to the conclusion that
>     we should *not* have multiple mutex pools.
> 
>     The single pool we have works wonderfully for interlock operations.
>     For example, the interlocks used inside the sxlock structure and code,
>     and inside the lockmgr structure and code (the lockmgr previously used
>     its own hacked up pool for its interlock).  The pool effectively cuts
>     the size and overhead of higher level structures - such as sxlocks - down
>     considerably.

They've added 4 new lock order reversals to my boot messages.  For that we need
a pool of mutexes with MTX_NOWITNESS.  However, MTX_NOWITNESS is not
appropriate for locks outside of sx and lockmgr backing locks.
 
>     But our ability to use pools for higher level constructs, like the
>     sxlocks themselves, is severely limited.  My attempts so far have only
>     resulted in more obfuscated code.

???  If you want a sx lock pool, it would be just as simple as the mtx pool you
have now, just s/mtx/sx/, and thus sx_pool_slock, etc.  Not that complicated. 
Not sure it is all that useful either though.

>     I think the pool implementation should be left as it is and used ONLY
>     for interlocks and 'leaf' locks, as I originally designed it.  Adding
>     multiple-pools (and the allocation / freeing / management headaches 
>     that go along with that) will only create a mess.  I don't think it's
>     even possible to use a pool of sx locks safely, for example, even with
>     the multiple pool concept.

Errr, it's all of two extra functions and one extra parameter to the others. 
This should not be difficult.

>     The current pool code is nice because it simplifies our code base
>     somewhat rather than making it more complex.  I see absolutely no need
>     for a multiple-pool mechanism at this time.

Are you planning to turn on MTX_NOWITNESS, then, and be forced not to use
pool locks for anything besides sx and lockmgr backing locks, since they won't
have WITNESS checks performed for them?  Different types of locks have
different types of requirements.

>     For similar reasons I believe we should also simplify the APIs to 
>     other low level constructs.  I would like to simplify the SX lock
>     API (get rid of sx_tryupgrade() and sx_downgrade()), and I would 
>     like to see a more simplified structure if possible in order to 
>     make SX locks more useful as embedded entities in higher level system
>     structures such as TCP sockets or PCBs.

Err, the try_upgrade and downgrade are trivial and add nothing to the sx lock
structure itself.  They were also specifically requested for use in porting XFS
to FreeBSD and are useful in other areas such as Brian's changes to make
vm_map's use sx locks instead of lockmgr locks.  We can always optimize the
locks later, it is more important right now to actually put locks in places so
that actual multithreading can occur.
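
To show why upgrade and downgrade need no extra state in the lock, here
is a minimal userland sketch of a shared/exclusive lock built on
pthreads.  It is not the kernel sx implementation: struct uxlock and the
uxlock_*() names are hypothetical, and the sketch assumes an upgrade may
only succeed when the caller is the sole shared holder.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userland shared/exclusive lock; not the kernel's struct sx. */
struct uxlock {
	pthread_mutex_t	ux_interlock;	/* protects the fields below */
	pthread_cond_t	ux_cv;		/* signalled on every release */
	int		ux_sharers;	/* number of shared holders */
	bool		ux_excl;	/* true while held exclusively */
};

void
uxlock_init(struct uxlock *ux)
{
	pthread_mutex_init(&ux->ux_interlock, NULL);
	pthread_cond_init(&ux->ux_cv, NULL);
	ux->ux_sharers = 0;
	ux->ux_excl = false;
}

void
uxlock_slock(struct uxlock *ux)
{
	pthread_mutex_lock(&ux->ux_interlock);
	while (ux->ux_excl)
		pthread_cond_wait(&ux->ux_cv, &ux->ux_interlock);
	ux->ux_sharers++;
	pthread_mutex_unlock(&ux->ux_interlock);
}

void
uxlock_xlock(struct uxlock *ux)
{
	pthread_mutex_lock(&ux->ux_interlock);
	while (ux->ux_excl || ux->ux_sharers > 0)
		pthread_cond_wait(&ux->ux_cv, &ux->ux_interlock);
	ux->ux_excl = true;
	pthread_mutex_unlock(&ux->ux_interlock);
}

/* Release whichever kind of hold the caller has. */
void
uxlock_unlock(struct uxlock *ux)
{
	pthread_mutex_lock(&ux->ux_interlock);
	if (ux->ux_excl) {
		ux->ux_excl = false;
	} else {
		assert(ux->ux_sharers > 0);
		ux->ux_sharers--;
	}
	pthread_cond_broadcast(&ux->ux_cv);
	pthread_mutex_unlock(&ux->ux_interlock);
}

/* Non-blocking: succeeds only if the caller is the sole sharer. */
bool
uxlock_try_upgrade(struct uxlock *ux)
{
	bool ok;

	pthread_mutex_lock(&ux->ux_interlock);
	assert(!ux->ux_excl && ux->ux_sharers > 0);
	ok = (ux->ux_sharers == 1);
	if (ok) {
		ux->ux_sharers = 0;
		ux->ux_excl = true;
	}
	pthread_mutex_unlock(&ux->ux_interlock);
	return (ok);
}

/* Turn an exclusive hold into a shared one and wake waiting sharers. */
void
uxlock_downgrade(struct uxlock *ux)
{
	pthread_mutex_lock(&ux->ux_interlock);
	assert(ux->ux_excl);
	ux->ux_excl = false;
	ux->ux_sharers = 1;
	pthread_cond_broadcast(&ux->ux_cv);
	pthread_mutex_unlock(&ux->ux_interlock);
}
```

Both operations only rewrite the sharer count and the exclusive flag
under the interlock, so the structure stays the same size with or
without them.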

-- 

John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Sat Nov 17  5:24:24 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from harrier.prod.itd.earthlink.net (harrier.mail.pas.earthlink.net [207.217.120.12])
	by hub.freebsd.org (Postfix) with ESMTP
	id 97F8F37B416; Sat, 17 Nov 2001 05:24:22 -0800 (PST)
Received: from dialup-209.245.142.3.dial1.sanjose1.level3.net ([209.245.142.3] helo=mindspring.com)
	by harrier.prod.itd.earthlink.net with esmtp (Exim 3.33 #1)
	id 1655SC-0004ee-00; Sat, 17 Nov 2001 05:24:16 -0800
Message-ID: <3BF6652F.FC50C99A@mindspring.com>
Date: Sat, 17 Nov 2001 05:25:03 -0800
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: John Baldwin <jhb@FreeBSD.ORG>, Peter Wemm <peter@wemm.org>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: Need review - patch for socket locking and ref counting
References: <XFMail.011115124557.jhb@FreeBSD.org> <200111170300.fAH30dv75857@apollo.backplane.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

Matthew Dillon wrote:
> 
>     I've thought about it a bit and I've come to the conclusion that
>     we should *not* have multiple mutex pools.

It's pretty obvious even under casual thought that the deadlock
avoidance can't work correctly in this scenario, so you MUST
limit yourself to last acquisition.

By the same token, it does not make sense to permit recursion on
such mutexes.

-- Terry



From owner-freebsd-arch  Sat Nov 17  8:40: 9 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail6.speakeasy.net (mail6.speakeasy.net [216.254.0.206])
	by hub.freebsd.org (Postfix) with ESMTP id 4FD4537B416
	for <freebsd-arch@FreeBSD.ORG>; Sat, 17 Nov 2001 08:40:07 -0800 (PST)
Received: (qmail 26684 invoked from network); 17 Nov 2001 16:39:40 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender <jhb@FreeBSD.org>)
          by mail6.speakeasy.net (qmail-ldap-1.03) with SMTP
          for <tlambert2@mindspring.com>; 17 Nov 2001 16:39:40 -0000
Message-ID: <XFMail.011117084006.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <3BF6652F.FC50C99A@mindspring.com>
Date: Sat, 17 Nov 2001 08:40:06 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.org>
To: Terry Lambert <tlambert2@mindspring.com>
Subject: Re: Need review - patch for socket locking and ref counting
Cc: freebsd-arch@FreeBSD.ORG, Peter Wemm <peter@wemm.org>,
	Matthew Dillon <dillon@apollo.backplane.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On 17-Nov-01 Terry Lambert wrote:
> Matthew Dillon wrote:
>> 
>>     I've thought about it a bit and I've come to the conclusion that
>>     we should *not* have multiple mutex pools.
> 
> It's pretty obvious even under casual thought that the deadlock
> avoidance can't work correctly in this scenario, so you MUST
> limit yourself to last acquisition.

Err, witness doesn't do deadlock avoidance; it just checks lock orders. 
However, the problem is that the order of a larger lock (reader writer lock) is
being compared with those of its components.  Obviously one is going to acquire
the lock used to implement a reader/writer lock both while holding and not
holding the reader/writer lock.  Witness cannot efficiently handle this, so
instead we disable witness checks on the component locks.
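
The kind of check involved can be sketched as follows.  This is a heavily
simplified, single-threaded illustration and not the witness code: struct
order_checker, the oc_*() names and the fixed table sizes are all
hypothetical, and oc_skip[] merely plays the role of a flag like
MTX_NOWITNESS.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define	MAXLOCKS	8	/* hypothetical limit on distinct locks */
#define	MAXHELD		8	/* hypothetical limit on locks held at once */

struct order_checker {
	/* oc_before[a][b] is set once b was acquired while a was held. */
	bool	oc_before[MAXLOCKS][MAXLOCKS];
	bool	oc_skip[MAXLOCKS];	/* lock is exempt from order checks */
	int	oc_held[MAXHELD];	/* locks currently held, in order */
	int	oc_nheld;
};

void
oc_init(struct order_checker *oc)
{
	memset(oc, 0, sizeof(*oc));
}

/* Record an acquire; return true if it reverses an order seen earlier. */
bool
oc_acquire(struct order_checker *oc, int lock)
{
	bool reversal = false;
	int held, i;

	assert(lock >= 0 && lock < MAXLOCKS && oc->oc_nheld < MAXHELD);
	if (!oc->oc_skip[lock]) {
		for (i = 0; i < oc->oc_nheld; i++) {
			held = oc->oc_held[i];
			if (held == lock || oc->oc_skip[held])
				continue;
			if (oc->oc_before[lock][held])
				reversal = true;
			oc->oc_before[held][lock] = true;
		}
	}
	oc->oc_held[oc->oc_nheld++] = lock;
	return (reversal);
}

/* Drop the most recent hold of the given lock, if any. */
void
oc_release(struct order_checker *oc, int lock)
{
	int i;

	for (i = oc->oc_nheld - 1; i >= 0; i--) {
		if (oc->oc_held[i] != lock)
			continue;
		memmove(&oc->oc_held[i], &oc->oc_held[i + 1],
		    (oc->oc_nheld - i - 1) * sizeof(oc->oc_held[0]));
		oc->oc_nheld--;
		return;
	}
}
```

Taking a reader/writer lock and then its backing lock records one order;
taking the backing lock first and the reader/writer lock second is then
reported as a reversal, unless the backing lock is marked in oc_skip[].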

> By the same token, it does not make sense to permit recursion on
> such mutexes.

Err, we don't on most mutexes.

-- 

John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/



From owner-freebsd-arch  Sat Nov 17 10:30:20 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id D27D637B417; Sat, 17 Nov 2001 10:30:12 -0800 (PST)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id fAHIUBu80966;
	Sat, 17 Nov 2001 10:30:11 -0800 (PST)
	(envelope-from dillon)
Date: Sat, 17 Nov 2001 10:30:11 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200111171830.fAHIUBu80966@apollo.backplane.com>
To: John Baldwin <jhb@FreeBSD.ORG>
Cc: freebsd-arch@FreeBSD.ORG, Peter Wemm <peter@wemm.org>
Subject: Re: Need review - patch for socket locking and ref counting
References:  <XFMail.011117022356.jhb@FreeBSD.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

:>     I think the pool implementation should be left as it is and used ONLY
:>     for interlocks and 'leaf' locks, as I originally designed it.  Adding
:>     multiple-pools (and the allocation / freeing / management headaches 
:>     that go along with that) will only create a mess.  I don't think it's
:>     even possible to use a pool of sx locks safely, for example, even with
:>     the multiple pool concept.
:
:Errr, it's all of two extra functions and one extra parameter to the others. 
:This should not be difficult.

    Difficulty isn't the problem.  Confusion and Mess are the problems.

:
:>     The current pool code is nice because it simplifies our code base
:>     somewhat rather then make it more complex.  I see absolutely no need
:>     for a multiple-pool mechanism at this time.
:
:Are you planning to turn on MTX_NOWITNESS then and then be forced not to use
:pool locks for anything besides sx and lockmgr backing locks since they won't
:have WITNESS checks performed for them?  Different types of locks have
:different types of requirements.

    I'll turn on MTX_NOWITNESS.

    Again.  Difficulty is not the problem here.  Confusion and Mess are
    the problems.  It is not necessarily a good idea to take every locking
    API we have and give each one dozens of features and capabilities
    that go mostly unused.

:>     For similar reasons I believe we should also simplify the APIs to 
:>     other low level constructs.  I would like to simplify the SX lock
:>     API (get rid of sx_tryupgrade() and sx_downgrade()), and I would 
:>     like to see a more simplified structure if possible in order to 
:>     make SX locks more useful as embedded entities in higher level system
:>     structures such as TCP sockets or PCBs.
:
:Err, the try_upgrade and downgrade are trivial and add nothing to the sx lock
:structure itself.  They were also specifically requested for use in porting XFS
:to FreeBSD and are useful in other areas such as Brian's changes to make

    I don't think it's worth it just for XFS, 

:vm_map's use sx locks instead of lockmgr locks.  We can always optimize the
:locks later, it is more important right now to actually put locks in places so
:that actual multithreading can occur.

    I don't see it as being necessary for VM maps.  Since interrupts are in
    their own threads VM maps can probably do away with much of the junk they
    needed for -stable.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

:John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
:"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

