From owner-freebsd-arch Sun Nov 11 7:32: 8 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
        by hub.freebsd.org (Postfix) with ESMTP id C19DD37B41F
        for ; Sun, 11 Nov 2001 07:32:03 -0800 (PST)
Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3])
        by fledge.watson.org (8.11.6/8.11.5) with SMTP id fABFVsB11812
        for ; Sun, 11 Nov 2001 10:31:55 -0500 (EST)
        (envelope-from robert@fledge.watson.org)
Date: Sun, 11 Nov 2001 10:31:54 -0500 (EST)
From: Robert Watson
X-Sender: robert@fledge.watson.org
To: freebsd-arch@FreeBSD.org
Subject: cur{thread/proc}, or not.
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

Every now and then, we get to discuss curproc, and its merits. Let's do
it again.

There are a number of uses of curproc in the netinet code, used to
retrieve credentials for authorization somewhere down the stack, when no
proc or thread pointer has been passed down. With the eventual addition
of td->td_ucred, it will be desirable to use the credential for the
current thread, rather than the proc, which will require locking to use.
(This is, incidentally, true of many places in the system). As I
understand it, use of curproc was branded 'undesirable' at some point in
the semi-distant past, and since that time, a reference to 'proc' has been
passed down the stack. With a change to KSE, this has been translated to
references to the thread, but the issue remains the same. This comes up in
particular because I have a tree where I have propagated the thread
pointer down if_ioctl in the network stack: the normal ioctl call carries
a thread pointer now, but when it is translated into if_ioctl by the
network stack, that pointer is lost.
This raises the question: should we
(in practice) be adding process or thread pointers to many more of the
function arguments, or should we switch to using curproc/curthread
instead?

The argument I've seen a couple of times for using the proc/thread pointer
is that of delegation: a kernel thread might be acting on behalf of
another process, and need a reference to the process so that it can use
its (file descriptors, credential, address space, ...). I suspect that,
in practice, this is a Bad Idea, given the increased complexity of
fine-grained threading/locking and SMPng. "Borrowing" references in such
an environment seems like a recipe for bugginess, and instead such
references should be "given" by the thread that obeys the
locking/reference counting, and should not be done at the level of the
proc. For example, for a credential, you would simply grab another
reference to the credential and pass off the reference, rather than
sharing a reference. In fact, it seems that in a lot of places where a
struct proc is passed in, the implicit assumption of the code is that this
is the "current process", and as we add more process-related locking, that
assumption will probably only grow stronger, so as to not raise lock order
issues.

I don't pretend to have a grasp of all the issues here, so the purpose of
this message is to raise the issues so that I can understand them. I have
a tree where I've eliminated many references to curproc; however, I'm now
wondering if it wouldn't simply be more useful to eliminate many of the
references to struct proc in the function arguments, and use curproc
instead, and add references to ucred (and related ref-counted structures)
as needed for delegation types of situations. In particular, that would
suggest the following changes:

(1) 'suser' would always use 'curthread', and lose its proc/thread
    argument (proc in the main tree, thread in my tree). 'suser_cred'
    would be used for delegation situations (as is the case in my tree).
    (Note that this remains incompatible with other platforms, which
    generally accept a cred argument for 'suser', including other *BSD and
    Solaris.)

(2) proc/thread arguments would (in general) be removed (gradually) from
    the arguments of many existing kernel functions, and
    'curproc'/'curthread' would be used instead. For example, in the
    'VOP_*' interface, use of the 'p' or 'td' entries would be abandoned,
    and 'cred' would be more widely passed down (such as into open).

    (Note that this is the path taken by a number of other fine-grained
    UNIX kernels, including Solaris, IRIX, et al).

(3) Use of 'curproc' would be removed in a number of places, where
    abstracted functions such as 'suser' would invoke curthread instead.

It seems to me that unless a very strong argument exists against using
curproc/curthread (and I don't preclude one existing), using them would
actually be an improvement, as it would assert that this class of
'borrowing' couldn't exist, simplifying the kernel, not to mention
squeezing a bit more stuff out of the stack (which, at ten levels deep,
actually begins to add up on 64-bit machines). I believe that there are
many places where the 'p' passed in is implicitly assumed to be the
current process, and that making that reliance explicit would be an
improvement, rather than a problem.

Flames appreciated.
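
The split proposed in (1), an argument-free 'suser' next to a 'suser_cred' for delegation, combined with the idea of giving a reference rather than borrowing one, can be sketched in userland C. This is only an illustration under stated assumptions: every sk_* name is hypothetical, the reference count is a plain int with no locking or atomics, and a global pointer stands in for curthread. It is not the kernel's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userland stand-ins; these are not the kernel's structures. */
struct sk_ucred {
	int cr_ref;	/* reference count (not atomic in this sketch) */
	int cr_uid;	/* effective uid; 0 is the superuser */
};

struct sk_thread {
	struct sk_ucred *td_ucred;	/* the thread's own credential reference */
};

/* Stand-in for curthread: a plain global here. */
static struct sk_thread *sk_curthread;

static struct sk_ucred *
sk_cred_alloc(int uid)
{
	struct sk_ucred *cr = malloc(sizeof(*cr));

	assert(cr != NULL);
	cr->cr_ref = 1;
	cr->cr_uid = uid;
	return (cr);
}

/* Take an additional reference so the new holder owns one outright. */
static struct sk_ucred *
sk_cred_hold(struct sk_ucred *cr)
{
	cr->cr_ref++;
	return (cr);
}

/* Drop one reference; free the credential when the last one goes. */
static void
sk_cred_free(struct sk_ucred *cr)
{
	if (--cr->cr_ref == 0)
		free(cr);
}

/* Delegation form: the caller names the credential to check. */
static int
sk_suser_cred(const struct sk_ucred *cr)
{
	return (cr->cr_uid == 0 ? 0 : EPERM);
}

/* Common form: no argument, always the current thread's credential. */
static int
sk_suser(void)
{
	return (sk_suser_cred(sk_curthread->td_ucred));
}

/* Work queued by one thread and carried out later by another. */
struct sk_request {
	struct sk_ucred *rq_cred;	/* reference given by the requester */
};

/* Runs in the requesting thread: give the request its own reference. */
static void
sk_request_init(struct sk_request *rq)
{
	rq->rq_cred = sk_cred_hold(sk_curthread->td_ucred);
}

/* Runs in the worker: check the requester's credential, then release it. */
static int
sk_request_run(struct sk_request *rq)
{
	int error = sk_suser_cred(rq->rq_cred);

	sk_cred_free(rq->rq_cred);
	rq->rq_cred = NULL;
	return (error);
}
```

If sk_request_init() runs in an unprivileged thread and sk_request_run() later runs in a worker whose own credential is root, the check still fails with EPERM, because the worker consults the reference it was given and never its own credential.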

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Sun Nov 11 10:40:16 2001
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
        by hub.freebsd.org (Postfix) with ESMTP id 0D24D37B405;
        Sun, 11 Nov 2001 10:40:09 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
        by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id KAA89779;
        Sun, 11 Nov 2001 10:23:07 -0800 (PST)
Date: Sun, 11 Nov 2001 10:23:06 -0800 (PST)
From: Julian Elischer
To: Robert Watson
Cc: freebsd-arch@FreeBSD.org
Subject: Re: cur{thread/proc}, or not.

On Sun, 11 Nov 2001, Robert Watson wrote:

>
> Every now and then, we get to discuss curproc, and its merits. Let's do
> it again.
>
> There are a number of uses of curproc in the netinet code, used to
> retrieve credentials for authorization somewhere down the stack, when no
> proc or thread pointer has been passed down. With the eventual addition
> of td->td_ucred, it will be desirable to use the credential for the
> current thread, rather than the proc, which will require locking to use.
> (This is, incidentally, true of many places in the system). As I
> understand it, use of curproc was branded 'undesirable' at some point in
> the semi-distant past, and since that time, a reference to 'proc' has been
> passed down the stack. With a change to KSE, this has been translated to
> references to the thread, but the issue remains the same.
> This comes up in
> particular because I have a tree where I have propagated the thread
> pointer down if_ioctl in the network stack: the normal ioctl call carries
> a thread pointer now, but when it is translated into if_ioctl by the
> network stack, that pointer is lost. This raises the question: should we
> (in practice) be adding process or thread pointers to many more of the
> function arguments, or should we switch to using curproc/curthread
> instead?

I think we should, though there are some cases where it is not clear
that there is always a thread to add other than the base thread of the
idle process. Also, since thread structures in the kernel are only
assigned to a process to do work for the duration of the particular
system call that they are performing, no thread pointer should be stored
somewhere where it may be referenced after the syscall has returned to
userland. In that case the best you can do is a proc pointer.

Also, in SMPng cur{thread,proc} takes some time to get, as I'm told that
dereferencing %fs is very slow. (Not sure how true that is.)

>
> The argument I've seen a couple of times for using the proc/thread pointer
> is that of delegation: a kernel thread might be acting on behalf of
> another process, and need a reference to the process so that it can use
> its (file descriptors, credential, address space, ...).

I was worried about this case while doing the KSE switchover, but I never
actually saw a case where it was obviously doing this... (though I have
my suspicions that it may still happen in some non-obvious places).

> I suspect that,
> in practice, this is a Bad Idea, given the increased complexity of
> fine-grained threading/locking and SMPng. "Borrowing" references in such
> an environment seems like a recipe for bugginess, and instead such
> references should be "given" by the thread that obeys the
> locking/reference counting, and should not be done at the level of the
> proc.
> For example, for a credential, you would simply grab another
> reference to the credential and pass off the reference, rather than
> sharing a reference. In fact, it seems that in a lot of places where a
> struct proc is passed in, the implicit assumption of the code is that this
> is the "current process", and as we add more process-related locking, that
> assumption will probably only grow stronger, so as to not raise lock order
> issues.

There are other reasons for needing the pointer than for a credential.
For example in AIO, the process pointer is stored so that address space
can be loaned to the aio threads to do the IO.

>
> I don't pretend to have a grasp of all the issues here, so the purpose of
> this message is to raise the issues so that I can understand them. I have
> a tree where I've eliminated many references to curproc; however, I'm now
> wondering if it wouldn't simply be more useful to eliminate many of the
> references to struct proc in the function arguments, and use curproc
> instead, and add references to ucred (and related ref-counted structures)
> as needed for delegation types of situations. In particular, that would
> suggest the following changes:

I have thought about this both ways... both have advantages. In some
architectures, getting curthread might be very expensive. Removing the
proc pointers would take us back where we were before BSD4.4 (anyone
know if Kirk is on this list?)

>
> (1) 'suser' would always use 'curthread', and lose its proc/thread
>     argument (proc in the main tree, thread in my tree). 'suser_cred'
>     would be used for delegation situations (as is the case in my tree).
>
>     (Note that this remains incompatible with other platforms, which
>     generally accept a cred argument for 'suser', including other *BSD and
>     Solaris.)
>
> (2) proc/thread arguments would (in general) be removed (gradually) from
>     the arguments of many existing kernel functions, and
>     'curproc'/'curthread' would be used instead.
>     For example, in the
>     'VOP_*' interface, use of the 'p' or 'td' entries would be abandoned,
>     and 'cred' would be more widely passed down (such as into open).
>
>     (Note that this is the path taken by a number of other fine-grained
>     UNIX kernels, including Solaris, IRIX, et al).
>
> (3) Use of 'curproc' would be removed in a number of places, where
>     abstracted functions such as 'suser' would invoke curthread instead.

I believe it was an early move to start to prepare for some sort of SMP
work where they couldn't think of an architecture-neutral way of getting
'curthread' that was guaranteed to be efficient everywhere.

>
> It seems to me that unless a very strong argument exists against using
> curproc/curthread (and I don't preclude one existing), using them would
> actually be an improvement, as it would assert that this class of
> 'borrowing' couldn't exist, simplifying the kernel, not to mention
> squeezing a bit more stuff out of the stack (which, at ten levels deep,
> actually begins to add up on 64-bit machines). I believe that there are
> many places where the 'p' passed in is implicitly assumed to be the
> current process, and that making that reliance explicit would be an
> improvement, rather than a problem.
>
> Flames appreciated.

I think you'll get few flames, but probably a lot of silence from many
people.
>
> Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
> robert@fledge.watson.org      NAI Labs, Safeport Network Services

From owner-freebsd-arch Sun Nov 11 11:17:47 2001
Received: from peter3.wemm.org (c1315225-a.plstn1.sfba.home.com [24.14.150.180])
        by hub.freebsd.org (Postfix) with ESMTP id 7DD8537B418;
        Sun, 11 Nov 2001 11:17:35 -0800 (PST)
Received: from overcee.netplex.com.au (overcee.wemm.org [10.0.0.3])
        by peter3.wemm.org (8.11.0/8.11.0) with ESMTP id fABJHZM10690;
        Sun, 11 Nov 2001 11:17:35 -0800 (PST)
        (envelope-from peter@wemm.org)
Received: from wemm.org (localhost [127.0.0.1])
        by overcee.netplex.com.au (Postfix) with ESMTP id 00D053807;
        Sun, 11 Nov 2001 11:17:34 -0800 (PST)
        (envelope-from peter@wemm.org)
X-Mailer: exmh version 2.5 07/13/2001 with nmh-1.0.4
To: Robert Watson
Cc: freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
Date: Sun, 11 Nov 2001 11:17:34 -0800
From: Peter Wemm
Message-Id: <20011111191735.00D053807@overcee.netplex.com.au>

Robert Watson wrote:
> It seems to me that unless a very strong argument exists against using
> curproc/curthread (and I don't preclude one existing), using them would
> actually be an improvement, as it would assert that this class of
> 'borrowing' couldn't exist, simplifying the kernel, not to mention
> squeezing a bit more stuff out of the stack (which, at ten levels deep,
> actually begins to add up on 64-bit machines).
> I believe that there are
> many places where the 'p' passed in is implicitly assumed to be the
> current process, and that making that reliance explicit would be an
> improvement, rather than a problem.

My gripe is that on i386, it creates a LOT of work for the compiler.
Consider this small function in kern_kthread.c:

void
kthread_exit(int ecode)
{
        sx_xlock(&proctree_lock);
        PROC_LOCK(curproc);
        proc_reparent(curproc, initproc);
        PROC_UNLOCK(curproc);
        sx_xunlock(&proctree_lock);
        exit1(curthread, W_EXITCODE(ecode, 0));
}

Have a look at http://people.freebsd.org/~peter/macros.c where I've
cpp'ed it and indented it for readability. Anyway, kthread_exit() turns
into this for the compiler to choke on:

void kthread_exit(int ecode) { _sx_xlock((&proctree_lock), 0, 0); do { do { if (!atomic_cmpset_ptr(&(((((&((({ __typeof(((struct globaldata *) 0)->gd_curthread) __result; if (sizeof(__result) == 1) { u_char __b; __asm volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b; } else if (sizeof(__result) == 2) { u_short __w; __asm volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w; } else if (sizeof(__result) == 4) { u_int __i; __asm volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i; } else { __result = *({ __typeof(((struct globaldata *) 0)->gd_curthread) * __p; __asm volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread)))); __p; }); } __result; })->td_proc))->p_mtx)))))->mtx_lock, (void *)0x00000004, ((({ __typeof(((struct globaldata *) 
0)->gd_curthread) __result; if (sizeof(__result) == 1) { u_char __b; __asm volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b; } else if (sizeof(__result) == 2) { u_short __w; __asm volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w; } else if (sizeof(__result) == 4) { u_int __i; __asm volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i; } else { __result = *({ __typeof(((struct globaldata *) 0)->gd_curthread) * __p; __asm volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread)))); __p; }); } __result; }))))) _mtx_lock_sleep(((((&((({ __typeof(((struct globaldata *) 0)->gd_curthread) __result; if (sizeof(__result) == 1) { u_char __b; __asm volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b; } else if (sizeof(__result) == 2) { u_short __w; __asm volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w; } else if (sizeof(__result) == 4) { u_int __i; __asm volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i; } else { __result = *({ __typeof(((struct globaldata *) 0)->gd_curthread) * __p; __asm volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" 
(__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread)))); __p; }); } __result; })->td_proc))->p_mtx)))), (((0))), ((0)), ((0))); } while (0); do { if ((((((0))) & 0x00000002) == 0 && (((&(((&((({ __typeof(((struct globaldata *) 0)->gd_curthread) __result; if (sizeof(__result) == 1) { u_char __b; __asm volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b; } else if (sizeof(__result) == 2) { u_short __w; __asm volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w; } else if (sizeof(__result) == 4) { u_int __i; __asm volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i; } else { __result = *({ __typeof(((struct globaldata *) 0)->gd_curthread) * __p; __asm volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread)))); __p; }); } __result; })->td_proc))->p_mtx)))->mtx_object))->lo_flags & 0x00040000) == 0)); } while (0); } while (0); proc_reparent((({ __typeof(((struct globaldata *) 0)->gd_curthread) __result; if (sizeof(__result) == 1) { u_char __b; __asm volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b; } else if (sizeof(__result) == 2) { u_short __w; __asm volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 
0)->gd_curthread) *) & __w; } else if (sizeof(__result) == 4) { u_int __i; __asm volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i; } else { __result = *({ __typeof(((struct globaldata *) 0)->gd_curthread) * __p; __asm volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread)))); __p; }); } __result; })->td_proc), initproc); do { do { if (((((((0)))) & 0x00000002) == 0 && (((&(((&((({ __typeof(((struct globaldata *) 0)->gd_curthread) __result; if (sizeof(__result) == 1) { u_char __b; __asm volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b; } else if (sizeof(__result) == 2) { u_short __w; __asm volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w; } else if (sizeof(__result) == 4) { u_int __i; __asm volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i; } else { __result = *({ __typeof(((struct globaldata *) 0)->gd_curthread) * __p; __asm volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread)))); __p; }); } __result; })->td_proc))->p_mtx)))->mtx_object))->lo_flags & 0x00040000) == 0)); } while (0); do { if (!atomic_cmpset_ptr(&(((((&((({ __typeof(((struct globaldata *) 0)->gd_curthread) __result; if (sizeof(__result) == 1) { u_char __b; __asm volatile ("movb %%fs:%1,%0":"=r" 
(__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b; } else if (sizeof(__result) == 2) { u_short __w; __asm volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w; } else if (sizeof(__result) == 4) { u_int __i; __asm volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i; } else { __result = *({ __typeof(((struct globaldata *) 0)->gd_curthread) * __p; __asm volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread)))); __p; }); } __result; })->td_proc))->p_mtx)))))->mtx_lock, ((({ __typeof(((struct globaldata *) 0)->gd_curthread) __result; if (sizeof(__result) == 1) { u_char __b; __asm volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b; } else if (sizeof(__result) == 2) { u_short __w; __asm volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w; } else if (sizeof(__result) == 4) { u_int __i; __asm volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i; } else { __result = *({ __typeof(((struct globaldata *) 0)->gd_curthread) * __p; __asm volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) 
(&((struct globaldata *) 0)->gd_curthread)))); __p; }); } __result; }))), (void *)0x00000004)) _mtx_unlock_sleep(((((&((({ __typeof(((struct globaldata *) 0)->gd_curthread) __result; if (sizeof(__result) == 1) { u_char __b; __asm volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b; } else if (sizeof(__result) == 2) { u_short __w; __asm volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w; } else if (sizeof(__result) == 4) { u_int __i; __asm volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i; } else { __result = *({ __typeof(((struct globaldata *) 0)->gd_curthread) * __p; __asm volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread)))); __p; }); } __result; })->td_proc))->p_mtx)))), (((0))), ((0)), ((0))); } while (0); } while (0); _sx_xunlock((&proctree_lock), 0, 0); exit1(({ __typeof(((struct globaldata *) 0)->gd_curthread) __result; if (sizeof(__result) == 1) { u_char __b; __asm volatile ("movb %%fs:%1,%0":"=r" (__b):"m"(*(u_char *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __b; } else if (sizeof(__result) == 2) { u_short __w; __asm volatile ("movw %%fs:%1,%0":"=r" (__w):"m"(*(u_short *) (((size_t) (&((struct globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __w; } else if (sizeof(__result) == 4) { u_int __i; __asm volatile ("movl %%fs:%1,%0":"=r" (__i):"m"(*(u_int *) (((size_t) (&((struct 
globaldata *) 0)->gd_curthread))))); __result = *(__typeof(((struct globaldata *) 0)->gd_curthread) *) & __i; } else { __result = *({ __typeof(((struct globaldata *) 0)->gd_curthread) * __p; __asm volatile ("movl %%fs:%1,%0; addl %2,%0":"=r" (__p):"m"(*(struct globaldata *) (((size_t) (&((struct globaldata *) 0)->gd_prvspace)))), "i"(((size_t) (&((struct globaldata *) 0)->gd_curthread)))); __p; }); } __result; }), ((ecode) << 8 | (0))); }

Ever wonder why the kernel gets slower and slower to compile? Ever
compiled a 2.1 or 2.2 kernel on a modern machine and been shocked away by
the speed?

Count me in the 'curproc considered harmful' camp. (or curthread).

Yes, this doesn't end up as a lot of code in the end, but the compiler
still has to digest it and the optimizer has got to do a sh!tload of work
to eliminate massive quantities of unused code. Just imagine what happens
without -O.

Regarding 64-bit machines, all of our 64-bit platforms use register
passing, some with fixed-size register frames. On those, the difference
of saving one argument isn't going to add up to much, if anything. And
it would still require an intermediate frame to hold the calculated value
of curproc/curthread where it's used.
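
To make the repeated-expansion cost concrete at a much smaller scale, here is a hedged userland sketch. A hypothetical SK_CURTHREAD macro stands in for the per-CPU curthread access and counts every expansion that actually gets evaluated. Fetching the pointer once into a local (which is also what a passed-in p/td argument amounts to) takes a kthread_exit()-like function from three fetches down to one. The sk_* names and the counter are assumptions made for illustration; none of this is the kernel's macro machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's proc and thread structures. */
struct sk_proc {
	int p_locked;
	int p_reparented;
};

struct sk_thread {
	struct sk_proc *td_proc;
};

static struct sk_thread *sk_percpu_curthread;	/* pretend per-CPU slot */
static int sk_curthread_loads;			/* counts evaluated fetches */

/*
 * Stand-in for the per-CPU access macro: every textual use expands to a
 * fresh fetch, much as each curproc above expands to its own asm block.
 */
#define	SK_CURTHREAD	(sk_curthread_loads++, sk_percpu_curthread)
#define	SK_CURPROC	(SK_CURTHREAD->td_proc)

/* Three uses of SK_CURPROC, so three expansions and three fetches. */
static void
sk_exit_uncached(void)
{
	SK_CURPROC->p_locked = 1;
	SK_CURPROC->p_reparented = 1;
	SK_CURPROC->p_locked = 0;
}

/* One expansion; the local plays the role of a passed-in 'p' argument. */
static void
sk_exit_cached(void)
{
	struct sk_proc *p = SK_CURPROC;

	assert(p != NULL);
	p->p_locked = 1;
	p->p_reparented = 1;
	p->p_locked = 0;
}
```

Both functions leave the process in the same state; only the number of fetches recorded in sk_curthread_loads differs.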

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5

From owner-freebsd-arch Sun Nov 11 14:28:12 2001
Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163])
        by hub.freebsd.org (Postfix) with ESMTP id 4A16737B419;
        Sun, 11 Nov 2001 14:28:10 -0800 (PST)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
        by critter.freebsd.dk (8.11.6/8.11.6) with ESMTP id fABMRCL01972;
        Sun, 11 Nov 2001 23:27:17 +0100 (CET)
        (envelope-from phk@critter.freebsd.dk)
To: Peter Wemm
Cc: Robert Watson, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: Your message of "Sun, 11 Nov 2001 11:17:34 PST."
        <20011111191735.00D053807@overcee.netplex.com.au>
Date: Sun, 11 Nov 2001 23:27:12 +0100
Message-ID: <1970.1005517632@critter.freebsd.dk>
From: Poul-Henning Kamp

In message <20011111191735.00D053807@overcee.netplex.com.au>, Peter Wemm writes:

> [ass'y output of gcc]
>
>Ever wonder why the kernel gets slower and slower to compile? Ever
>compiled a 2.1 or 2.2 kernel on a modern machine and been shocked away by
>the speed?
>
>Count me in the 'curproc considered harmful' camp. (or curthread).

Peter's example more than clinches the argument for me, but I also
wonder if we would not paint ourselves into a corner with the
cur{proc|thread} stuff if the future ends up being more parallel
and cluster-oriented.

Roberto! come over here! Do you zink zese Curproc and Curthread they
will get losst on zeir way home ?

Good boy.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch Sun Nov 11 14:49:26 2001
Received: from peter3.wemm.org (c1315225-a.plstn1.sfba.home.com [24.14.150.180])
        by hub.freebsd.org (Postfix) with ESMTP id EB3C237B41A;
        Sun, 11 Nov 2001 14:49:20 -0800 (PST)
Received: from overcee.netplex.com.au (overcee.wemm.org [10.0.0.3])
        by peter3.wemm.org (8.11.0/8.11.0) with ESMTP id fABMnKM11180;
        Sun, 11 Nov 2001 14:49:20 -0800 (PST)
        (envelope-from peter@wemm.org)
Received: from wemm.org (localhost [127.0.0.1])
        by overcee.netplex.com.au (Postfix) with ESMTP id B24623807;
        Sun, 11 Nov 2001 14:49:19 -0800 (PST)
        (envelope-from peter@wemm.org)
To: Poul-Henning Kamp
Cc: Robert Watson, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <1970.1005517632@critter.freebsd.dk>
Date: Sun, 11 Nov 2001 14:49:19 -0800
From: Peter Wemm
Message-Id: <20011111224919.B24623807@overcee.netplex.com.au>

Poul-Henning Kamp wrote:
> In message <20011111191735.00D053807@overcee.netplex.com.au>, Peter Wemm writes:
>
> > [ass'y output of gcc]
> >
> >Ever wonder why the kernel gets slower and slower to compile? Ever
> >compiled a 2.1 or 2.2 kernel on a modern machine and been shocked away by
> >the speed?
> >
> >Count me in the 'curproc considered harmful' camp. (or curthread).
>
> Peter's example more than clinches the argument for me, but I also
> wonder if we would not paint ourselves into a corner with the
> cur{proc|thread} stuff if the future ends up being more parallel
> and cluster-oriented.

I believe it would be a lot easier to remove the p/td arguments later
once we know that we don't need them, than to remove them now and
discover later that we do need them and have to go back and figure it
all out again.

To answer Robert: by all means be explicit about creds etc., but let's
not get two different bikesheds^H^H^H^H^H^Hchanges mixed up together.

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5

From owner-freebsd-arch Sun Nov 11 14:53: 5 2001
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
        by hub.freebsd.org (Postfix) with ESMTP id D8C3B37B41A
        for ; Sun, 11 Nov 2001 14:52:58 -0800 (PST)
Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3])
        by fledge.watson.org (8.11.6/8.11.5) with SMTP id fABMqfB16719;
        Sun, 11 Nov 2001 17:52:41 -0500 (EST)
        (envelope-from robert@fledge.watson.org)
Date: Sun, 11 Nov 2001 17:52:40 -0500 (EST)
From: Robert Watson
To: Peter Wemm
Cc: Poul-Henning Kamp, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <20011111224919.B24623807@overcee.netplex.com.au>

On Sun, 11 Nov 2001, Peter Wemm wrote:

> I believe it would be a lot easier to remove the p/td arguments later
> once we know that we don't need them, than to remove them now and
> discover later that we do need them and have to go back and figure it
> all out again.
>
> To answer Robert..  By all means be explicit about creds etc, but let's
> not get two different bikesheds^H^H^H^H^H^Hchanges mixed up together.

Well, my concern was really whether or not I should go ahead and commit
the if_ioctl changes to add a td argument, which scatter new thread
references all over the place, when adopting a 'curthread' philosophy
would make that a waste of time.  I'll post the patches, once I've merged
in some recent changes, on Monday.  To be honest, I don't really mind
either way, I was just interested in getting a sense of the arguments
{for, against} moving to curthread/curproc.
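
For readers who want the two calling conventions side by side, here is a
minimal user-space sketch.  Everything in it is hypothetical: the
trimmed-down struct thread and struct ucred, the file-scope curthread
variable standing in for the per-CPU pointer, and the priv_check_*() and
if_ioctl_*() names are illustrations only, not the kernel's interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct ucred  { int cr_uid; };
struct thread { struct ucred *td_ucred; };

/* Stand-in for the per-CPU "current thread" pointer (assumption). */
static struct thread *curthread;

/* Style A: the thread pointer is passed down through every layer. */
static int
priv_check_explicit(const struct thread *td)
{
	assert(td != NULL && td->td_ucred != NULL);
	/* 0 means "allowed", EPERM means "denied". */
	return (td->td_ucred->cr_uid == 0 ? 0 : EPERM);
}

static int
if_ioctl_explicit(const struct thread *td, unsigned long cmd)
{
	(void)cmd;		/* the command is irrelevant to the sketch */
	return (priv_check_explicit(td));
}

/* Style B: no argument; the bottom layer consults curthread itself. */
static int
priv_check_implicit(void)
{
	assert(curthread != NULL && curthread->td_ucred != NULL);
	return (curthread->td_ucred->cr_uid == 0 ? 0 : EPERM);
}

static int
if_ioctl_implicit(unsigned long cmd)
{
	(void)cmd;
	return (priv_check_implicit());
}
```

With style A a caller can ask about a thread other than the one that is
running (the delegation case); with style B the answer always depends on
whatever curthread points at when the check runs, and the intermediate
layers carry one argument less.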

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services

From owner-freebsd-arch Sun Nov 11 17:50:58 2001
Message-Id: <200111111815.fABIFG336949@beastie.mckusick.com>
To: Robert Watson
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.ORG
In-Reply-To: Your message of "Sun, 11 Nov 2001 10:31:54 EST."
Date: Sun, 11 Nov 2001 10:15:16 -0800
From: Kirk McKusick

Some many years ago, I tried to get rid of all the references
to curproc in the filesystem code, and quickly came to the
realization that it would require adding a proc pointer to
virtually every subroutine in the filesystem code. For the
reasons that you have noted, this is ugly and adds bloat to the
stack space. On the other hand, there are places where the
filesystem code does not want to use the current process credential.
One of the more evident ones is in the NFS server code which wants
to pass down the credential of the requesting client rather than
its own. Solaris uses a very ugly hack where the server thread
replaces its credential with that of its client, does the VOP
call, then puts its own credential back when it returns. This
sort of problem could exist in almost any instance where the
kernel is acting as a server.
So, completely removing process/credential references from the
kernel interfaces is not the right solution either.

	Kirk

From owner-freebsd-arch Sun Nov 11 22:33:42 2001
Date: Mon, 12 Nov 2001 17:32:12 +1100 (EST)
From: Bruce Evans
To: Peter Wemm
Cc: Robert Watson
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <20011111191735.00D053807@overcee.netplex.com.au>
Message-ID: <20011112165530.B34657-100000@delplex.bde.org>

On Sun, 11 Nov 2001, Peter Wemm wrote:

> Robert Watson wrote:
>
> > It seems to me that unless a very strong argument exists against using
> > curproc/curthread (and I don't preclude one existing), using them would
> > actually be an improvement, as it would assert that this class of

> My gripe is that on i386, it creates a LOT of work for the compiler.

That's just an implementation detail for one arch.  I did strongly object
to the implementation, but...

> Consider this small function in kern_kthread.c:
> void
> kthread_exit(int ecode)
> {
>
>         sx_xlock(&proctree_lock);
>         PROC_LOCK(curproc);
>         proc_reparent(curproc, initproc);
>         PROC_UNLOCK(curproc);
>         sx_xunlock(&proctree_lock);
>         exit1(curthread, W_EXITCODE(ecode, 0));
> }
>
> Have a look at http://people.freebsd.org/~peter/macros.c where I've cpp'ed
> it and indented it for readability.  Anyway, kthread_exit() turns into
> this for the compiler to choke on:
> [235 lines of bletcherous code deleted]

The corresponding code for RELENG_4 is:

Source:
---
void
kthread_exit(int ecode)
{
	proc_reparent(curproc, initproc);
	exit1(curproc, W_EXITCODE(ecode, 0));
}
---

Preprocessor output (!SMP case):
---
void
kthread_exit(int ecode)
{
	proc_reparent(curproc, initproc);
	exit1(curproc, (( ecode ) << 8 | ( 0 )) );
}
---

Preprocessor output (SMP case):
---
void
kthread_exit(int ecode)
{
	proc_reparent((( struct proc * )_global_curproc_nv()) , initproc);
	exit1((( struct proc * )_global_curproc_nv()) , (( ecode ) << 8 | ( 0 )) );
}
---

The preprocessor output didn't even need editing to look this nice.
_global_curproc_nv() is an inline function, so the compiler has more
work to do in the SMP case than might appear.  This function is:

static __inline int _global_curproc_nv(void) {
	int val;
	__asm("movl %%fs:gd_curproc,%0" : "=r" (val));
	return (val);
}

which is only about 10 times smaller than the corresponding code in
-current (it has one case instead of 4, and has a much simpler reference
to gd_curproc).  The size of the output in -current can be reduced by a
factor of about 2 by copying curproc to a local variable.

> Ever wonder why the kernel gets slower and slower to compile?  Ever
> compiled a 2.1 or 2.2 kernel on a modern machine and been shocked away by
> the speed?

Better yet, compile a 2.1 or 2.2 kernel under 2.1 or 2.2 and get about 25%
more speed (mostly from not having pessimizations in gcc).

> Count me in the 'curproc considered harmful' camp.  (or curthread).

Count me outside of it.

> Regarding 64 bit machines, all of our 64 bit platforms use register
> passing, some with fixed size register frames.  On those, the difference
> of saving one argument isn't going to add up to much, if anything.  And
> it would still require an intermediate frame to hold the calculated value
> of curproc/curthread where it's used.

Passing the pointer down through 20 subroutines (some of which don't
even use it except to pass it along) may add up to much.

Bruce

From owner-freebsd-arch Mon Nov 12 2: 9:36 2001
Date: Mon, 12 Nov 2001 02:09:28 -0800 (PST)
From: Matthew Dillon
Message-Id: <200111121009.fACA9SI75024@apollo.backplane.com>
To: Bruce Evans
Cc: Peter Wemm, Robert Watson
Subject: Re: cur{thread/proc}, or not.
References: <20011112165530.B34657-100000@delplex.bde.org>

:> Have a look at http://people.freebsd.org/~peter/macros.c where I've cpp'ed
:> it and indented it for readability.  Anyway, kthread_exit() turns into
:> this for the compiler to choke on:
:
:> [235 lines of bletcherous code deleted]

    It's a mess, but the code produced isn't too bad.  It's much better
    now that the mutexes are calling real procedures.

:
:Passing the pointer down through 20 subroutines (some of which don't
:even use it except to pass it along) may add up to much.
:
:Bruce

    I agree that it is kind of silly to pass a global down through N
    levels of procedures.  Just on principle.  On the other hand I don't
    expect the performance to be better or worse, or even for there to be
    any real difference in code size.  Fewer instructions per routine in
    more routines, with more memory writes (pass as argument on stack),
    versus more instructions in fewer routines, with only memory reads
    (access as global).  Without there being a clear winner there isn't
    much of a reason to change the existing code.

    If we stopped trying to be fancy with interrupt scheduling and went
    back to the BSDI methodology the kernel code could assume that
    %fs doesn't change out from under it and we could *GREATLY*
    simplify the __PCPU_GET() code to something like this:

    static __inline
    struct globaldata *
    __globaldata(void)
    {
	struct globaldata *gd;

	__asm("movl %%fs,%0" : "=r" (gd));
	return(gd);
    }

    #define __PCPU_GET(name)	(__globaldata()->name)

    Which would allow GCC to generate somewhat better code output
    (about 1K less code in the text segment as well) as well as
    allow the per-cpu variables to be accessed more normally without
    having to use macros to GET and SET them.

    Else we are stuck with what we have.

						-Matt
						Matthew Dillon

From owner-freebsd-arch Mon Nov 12 2:47:41 2001
Date: Mon, 12 Nov 2001 05:47:24 -0500 (EST)
From: Robert Watson
To: Kirk McKusick
Cc: freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <200111111815.fABIFG336949@beastie.mckusick.com>

On Sun, 11 Nov 2001, Kirk McKusick wrote:

> Some many years ago, I tried to get rid of all the references to curproc
> in the filesystem code, and quickly came to the realization that it
> would require adding a proc pointer to virtually every subroutine in the
> filesystem code. For the reasons that you have noted, this is ugly and
> adds bloat to the stack space. On the other hand, there are places where
> the filesystem code does not want to use the current process credential.
> One of the more evident ones is in the NFS server code which wants to
> pass down the credential of the requesting client rather than its own.
> Solaris uses a very ugly hack where the server thread replaces its
> credential with that of its client, does the VOP call, then puts its own
> credential back when it returns. This sort of problem could exist in
> almost any instance where the kernel is acting as a server.  So,
> completely removing process/credential references from the kernel
> interfaces is not the right solution either.

Right now, many of the VFS calls pass a credential in, which is used in
lieu of the process credential in most cases.  The prominent exceptions to
this rule seem to be in the device code (where process credentials are
used), and in the smattering of VOP calls where in UFS/FFS, an
authorization decision is not required.  By putting the credential into
these calls, I think most NFS cases could be normalized.  This would be
consistent with the approach adopted by several other systems I looked at,
and seems like it may intuitively be the right approach given the 'file'
cached credential model.
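
As an illustration of passing the client's credential into the call
rather than swapping the server thread's own, here is a small user-space
sketch.  The structures and the names cred_hold(), cred_drop(),
vop_access_sketch() and nfs_serve_sketch() are invented for this example;
the access rule (root or owner may proceed) is a deliberately crude
assumption, not the real VFS policy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified credential with a reference count. */
struct ucred { int cr_ref; int cr_uid; };
/* Hypothetical vnode that only remembers its owner. */
struct vnode { int v_owner; };

/* Take an extra reference before handing the credential to someone else. */
static struct ucred *
cred_hold(struct ucred *cr)
{
	cr->cr_ref++;
	return (cr);
}

/* Give a reference back; dropping more than was taken is a bug. */
static void
cred_drop(struct ucred *cr)
{
	assert(cr->cr_ref > 0);
	cr->cr_ref--;
}

/* The operation decides using the credential it was given, nothing else. */
static int
vop_access_sketch(const struct vnode *vp, const struct ucred *cred)
{
	if (cred->cr_uid == 0 || cred->cr_uid == vp->v_owner)
		return (0);
	return (EACCES);
}

/*
 * A server acting for a client: it holds a reference to the client's
 * credential for the duration of the call and never touches its own.
 */
static int
nfs_serve_sketch(const struct vnode *vp, struct ucred *client_cred)
{
	struct ucred *cr = cred_hold(client_cred);
	int error = vop_access_sketch(vp, cr);

	cred_drop(cr);
	return (error);
}
```

The server's own credential never appears in the call chain, so there is
nothing to put back afterwards, and the reference count of the client
credential is the same after the call as before it.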
As Peter has pointed out, this change could be independent of any choice
about curproc/curthread, and is probably worth doing regardless of the
choice there.  Probably the right 'approach' here is to assume that
operations on 'vnode' require a 'ucred', whereas operations on 'file'
generally do not.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services

From owner-freebsd-arch Mon Nov 12 4:10:31 2001
Date: Mon, 12 Nov 2001 23:09:06 +1100 (EST)
From: Bruce Evans
To: Matthew Dillon
Cc: Peter Wemm, Robert Watson
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <200111121009.fACA9SI75024@apollo.backplane.com>
Message-ID: <20011112221522.E36389-100000@delplex.bde.org>

On Mon, 12 Nov 2001, Matthew Dillon wrote:

>     If we stopped trying to be fancy with interrupt scheduling and went
>     back to the BSDI methodology the kernel code could assume that
>     %fs doesn't change out from under it and we could *GREATLY*

Strictly, that the GDT entry for %fs doesn't change.  We could safely
assume this already for the !SMP case.

>     simplify the __PCPU_GET() code to something like this:
>
>     static __inline
>     struct globaldata *
>     __globaldata(void)
>     {
> 	struct globaldata *gd;
>
> 	__asm("movl %%fs,%0" : "=r" (gd));
> 	return(gd);
>     }
>
>     #define __PCPU_GET(name)	(__globaldata()->name)
>
>     Which would allow GCC to generate somewhat better code output
>     (about 1K less code in the text segment as well) as well as
>     allow the per-cpu variables to be accessed more normally without
>     having to use macros to GET and SET them.

This is essentially a slightly pessimized version of the RELENG_4 code
for the SMP case (RELENG_4 avoids going through the pointer for most
per-cpu global accesses).  It also helps to declare __globaldata() as
__pure2 so that gcc can tell that it always returns the same value.
It doesn't quite always return the same value, but I can't think of
any cases where a cached value would remain valid long enough to cause
problems.

Bruce

From owner-freebsd-arch Mon Nov 12 4:39:47 2001
To: Matthew Dillon
Cc: Bruce Evans, Robert Watson, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <200111121009.fACA9SI75024@apollo.backplane.com>
Date: Mon, 12 Nov 2001 04:39:25 -0800
From: Peter Wemm
Message-Id: <20011112123925.6A89E380A@overcee.netplex.com.au>

Matthew Dillon wrote:
> :> Have a look at http://people.freebsd.org/~peter/macros.c where I've cpp'ed
> :> it and indented it for readability.  Anyway, kthread_exit() turns into
> :> this for the compiler to choke on:
> :
> :> [235 lines of bletcherous code deleted]
>
>     It's a mess, but the code produced isn't too bad.  It's much better
>     now that the mutexes are calling real procedures.

Mutexes only call procedures if debugging options are on.  If you compile
without INVARIANTS, KTR, or WITNESS, then you get the maximum inline
versions.

Regarding __globaldata() ..  That's almost how an intermediate version of
globals.h did it on the i386, about rev 1.16.  We always have the option to
go back to something like that later on if preemption turns out to be a
wash.

Your inline function doesn't work though..  %fs isn't a general purpose
register..  You can't store a pointer in the register itself.  You have to
use an indirect memory reference to fetch the pointer.
ie:
	struct globaldata *gd;
	__asm("movl %%fs,%0" : "=r" (gd));
	return(gd);
must be more like this:
	__asm("movl %%fs:0,%0" : "=r" (gd));
ie: read memory location 0 from the %fs segment.
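
The difference between reading %fs and reading %fs:0 can be modelled in
portable user-space C.  This is only a sketch under assumptions: struct
segment (a selector plus a base address) is a made-up model of a segment
register, the selector value used by callers is arbitrary, and placing a
self-pointer as the first member of struct globaldata is assumed here so
that "location 0 of the segment" yields the structure's address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-CPU block; the self-pointer sits at offset 0. */
struct globaldata {
	struct globaldata *gd_self;	/* points back at this structure */
	int gd_cpuid;
	void *gd_curproc;
};

/*
 * Made-up model of a segment register: the register holds only a small
 * selector; the memory it designates starts at 'base'.
 */
struct segment {
	unsigned short sel;	/* what "movl %%fs,%0" would hand back */
	void *base;		/* where "%fs:0" actually refers to */
};

static void
globaldata_init(struct globaldata *gd, int cpuid)
{
	/* The lookup below relies on the self-pointer being first. */
	assert(offsetof(struct globaldata, gd_self) == 0);
	gd->gd_self = gd;
	gd->gd_cpuid = cpuid;
	gd->gd_curproc = NULL;
}

/* Model of "movl %%fs:0,%0": load a pointer from offset 0 of the segment. */
static struct globaldata *
globaldata_from_segment(const struct segment *fs)
{
	struct globaldata *gd;

	memcpy(&gd, fs->base, sizeof(gd));
	return (gd);
}

/* Field access through the fetched pointer, in the spirit of __PCPU_GET(). */
#define PCPU_GET_SKETCH(fs, name) (globaldata_from_segment(fs)->gd_##name)
```

Two CPUs can carry the same selector value while their segments have
different bases, which is why the selector alone cannot serve as the
pointer: only the load through the segment tells the two per-CPU blocks
apart.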

Note that the RELENG_4 macros call inlines:
#define GLOBAL_FUNC(name) \
static __inline void *_global_ptr_##name(void) { \
	void *val; \
	__asm __volatile("movl $gd_" #name ",%0;" \
		"addl %%fs:globaldata,%0" : "=r" (val)); \
	return (val); \
} \
static __inline void *_global_ptr_##name##_nv(void) { \
	void *val; \
	__asm("movl $gd_" #name ",%0;" \
		"addl %%fs:globaldata,%0" : "=r" (val)); \
	return (val); \
} \
static __inline int _global_##name(void) { \
	int val; \
	__asm __volatile("movl %%fs:gd_" #name ",%0" : "=r" (val)); \
	return (val); \
} \
static __inline int _global_##name##_nv(void) { \
	int val; \
	__asm("movl %%fs:gd_" #name ",%0" : "=r" (val)); \
	return (val); \
} \
static __inline void _global_##name##_set(int val) { \
	__asm __volatile("movl %0,%%fs:gd_" #name : : "r" (val)); \
} \
static __inline void _global_##name##_set_nv(int val) { \
	__asm("movl %0,%%fs:gd_" #name : : "r" (val)); \
}
...
GLOBAL_FUNC(curproc)
GLOBAL_FUNC(astpending)
GLOBAL_FUNC(curpcb)
GLOBAL_FUNC(npxproc)
GLOBAL_FUNC(common_tss)
GLOBAL_FUNC(switchtime)
GLOBAL_FUNC(switchticks)
...

Bruce neglected to show the spammage from this in his cut/paste.
Here's what it really looks like in RELENG_4, and remember that this is
*without* mutexes and atomic support, etc, and after I have cleaned it up
so that hopefully the mail system won't shred it:

static __inline void *_global_ptr_curproc (void) { void *val; __asm volatile ("movl $gd_" "curproc" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_curproc_nv(void) { void *val; __asm("movl $gd_" "curproc" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_curproc (void) { int val; __asm volatile ("movl %%fs:gd_" "curproc" ",%0" : "=r" (val)); return (val); }
static __inline int _global_curproc_nv(void) { int val; __asm("movl %%fs:gd_" "curproc" ",%0" : "=r" (val)); return (val); }
static __inline void _global_curproc_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "curproc" : : "r" (val)); }
static __inline void _global_curproc_set_nv(int val) { __asm("movl %0,%%fs:gd_" "curproc" : : "r" (val)); }
static __inline void *_global_ptr_astpending (void) { void *val; __asm volatile ("movl $gd_" "astpending" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_astpending_nv(void) { void *val; __asm("movl $gd_" "astpending" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline int _global_astpending (void) { int val; __asm volatile ("movl %%fs:gd_" "astpending" ",%0" : "=r" (val)); return (val); }
static __inline int _global_astpending_nv(void) { int val; __asm("movl %%fs:gd_" "astpending" ",%0" : "=r" (val)); return (val); }
static __inline void _global_astpending_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "astpending" : : "r" (val)); }
static __inline void _global_astpending_set_nv(int val) { __asm("movl %0,%%fs:gd_" "astpending" : : "r" (val)); }
static __inline void *_global_ptr_curpcb (void) { void *val; __asm volatile ("movl $gd_" "curpcb" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); }
static __inline void *_global_ptr_curpcb_nv(void) { void *val; __asm("movl $gd_" "curpcb" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_curpcb (void) { int val; __asm volatile ("movl %%fs:gd_" "curpcb" ",%0" : "=r" (val)); return (val); } static __inline int _global_curpcb_nv(void) { int val; __asm("movl %%fs:gd_" "curpcb" ",%0" : "=r" (val)); return (val); } static __inline void _global_curpcb_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "curpcb" : : "r" (val)); } static __inline void _global_curpcb_set_nv(int val) { __asm("movl %0,%%fs:gd_" "curpcb" : : "r" (val)); } static __inline void *_global_ptr_npxproc (void) { void *val; __asm volatile ("movl $gd_" "npxproc" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_npxproc_nv(void) { void *val; __asm("movl $gd_" "npxproc" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_npxproc (void) { int val; __asm volatile ("movl %%fs:gd_" "npxproc" ",%0" : "=r" (val)); return (val); } static __inline int _global_npxproc_nv(void) { int val; __asm("movl %%fs:gd_" "npxproc" ",%0" : "=r" (val)); return (val); } static __inline void _global_npxproc_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "npxproc" : : "r" (val)); } static __inline void _global_npxproc_set_nv(int val) { __asm("movl %0,%%fs:gd_" "npxproc" : : "r" (val)); } static __inline void *_global_ptr_common_tss (void) { void *val; __asm volatile ("movl $gd_" "common_tss" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_common_tss_nv(void) { void *val; __asm("movl $gd_" "common_tss" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_common_tss (void) { int val; __asm volatile ("movl %%fs:gd_" "common_tss" ",%0" : "=r" (val)); return (val); } static __inline int _global_common_tss_nv(void) { int val; __asm("movl %%fs:gd_" "common_tss" ",%0" 
: "=r" (val)); return (val); } static __inline void _global_common_tss_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "common_tss" : : "r" (val)); } static __inline void _global_common_tss_set_nv(int val) { __asm("movl %0,%%fs:gd_" "common_tss" : : "r" (val)); } static __inline void *_global_ptr_switchtime (void) { void *val; __asm volatile ("movl $gd_" "switchtime" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_switchtime_nv(void) { void *val; __asm("movl $gd_" "switchtime" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_switchtime (void) { int val; __asm volatile ("movl %%fs:gd_" "switchtime" ",%0" : "=r" (val)); return (val); } static __inline int _global_switchtime_nv(void) { int val; __asm("movl %%fs:gd_" "switchtime" ",%0" : "=r" (val)); return (val); } static __inline void _global_switchtime_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "switchtime" : : "r" (val)); } static __inline void _global_switchtime_set_nv(int val) { __asm("movl %0,%%fs:gd_" "switchtime" : : "r" (val)); } static __inline void *_global_ptr_switchticks (void) { void *val; __asm volatile ("movl $gd_" "switchticks" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_switchticks_nv(void) { void *val; __asm("movl $gd_" "switchticks" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_switchticks (void) { int val; __asm volatile ("movl %%fs:gd_" "switchticks" ",%0" : "=r" (val)); return (val); } static __inline int _global_switchticks_nv(void) { int val; __asm("movl %%fs:gd_" "switchticks" ",%0" : "=r" (val)); return (val); } static __inline void _global_switchticks_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "switchticks" : : "r" (val)); } static __inline void _global_switchticks_set_nv(int val) { __asm("movl %0,%%fs:gd_" "switchticks" : : "r" (val)); } static __inline void *_global_ptr_common_tssd 
(void) { void *val; __asm volatile ("movl $gd_" "common_tssd" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_common_tssd_nv(void) { void *val; __asm("movl $gd_" "common_tssd" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_common_tssd (void) { int val; __asm volatile ("movl %%fs:gd_" "common_tssd" ",%0" : "=r" (val)); return (val); } static __inline int _global_common_tssd_nv(void) { int val; __asm("movl %%fs:gd_" "common_tssd" ",%0" : "=r" (val)); return (val); } static __inline void _global_common_tssd_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "common_tssd" : : "r" (val)); } static __inline void _global_common_tssd_set_nv(int val) { __asm("movl %0,%%fs:gd_" "common_tssd" : : "r" (val)); } static __inline void *_global_ptr_tss_gdt (void) { void *val; __asm volatile ("movl $gd_" "tss_gdt" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_tss_gdt_nv(void) { void *val; __asm("movl $gd_" "tss_gdt" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_tss_gdt (void) { int val; __asm volatile ("movl %%fs:gd_" "tss_gdt" ",%0" : "=r" (val)); return (val); } static __inline int _global_tss_gdt_nv(void) { int val; __asm("movl %%fs:gd_" "tss_gdt" ",%0" : "=r" (val)); return (val); } static __inline void _global_tss_gdt_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "tss_gdt" : : "r" (val)); } static __inline void _global_tss_gdt_set_nv(int val) { __asm("movl %0,%%fs:gd_" "tss_gdt" : : "r" (val)); } static __inline void *_global_ptr_cpuid (void) { void *val; __asm volatile ("movl $gd_" "cpuid" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_cpuid_nv(void) { void *val; __asm("movl $gd_" "cpuid" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_cpuid (void) { int val; __asm volatile ("movl %%fs:gd_" 
"cpuid" ",%0" : "=r" (val)); return (val); } static __inline int _global_cpuid_nv(void) { int val; __asm("movl %%fs:gd_" "cpuid" ",%0" : "=r" (val)); return (val); } static __inline void _global_cpuid_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "cpuid" : : "r" (val)); } static __inline void _global_cpuid_set_nv(int val) { __asm("movl %0,%%fs:gd_" "cpuid" : : "r" (val)); } static __inline void *_global_ptr_other_cpus (void) { void *val; __asm volatile ("movl $gd_" "other_cpus" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_other_cpus_nv(void) { void *val; __asm("movl $gd_" "other_cpus" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_other_cpus (void) { int val; __asm volatile ("movl %%fs:gd_" "other_cpus" ",%0" : "=r" (val)); return (val); } static __inline int _global_other_cpus_nv(void) { int val; __asm("movl %%fs:gd_" "other_cpus" ",%0" : "=r" (val)); return (val); } static __inline void _global_other_cpus_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "other_cpus" : : "r" (val)); } static __inline void _global_other_cpus_set_nv(int val) { __asm("movl %0,%%fs:gd_" "other_cpus" : : "r" (val)); } static __inline void *_global_ptr_inside_intr (void) { void *val; __asm volatile ("movl $gd_" "inside_intr" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_inside_intr_nv(void) { void *val; __asm("movl $gd_" "inside_intr" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_inside_intr (void) { int val; __asm volatile ("movl %%fs:gd_" "inside_intr" ",%0" : "=r" (val)); return (val); } static __inline int _global_inside_intr_nv(void) { int val; __asm("movl %%fs:gd_" "inside_intr" ",%0" : "=r" (val)); return (val); } static __inline void _global_inside_intr_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "inside_intr" : : "r" (val)); } static __inline void 
_global_inside_intr_set_nv(int val) { __asm("movl %0,%%fs:gd_" "inside_intr" : : "r" (val)); } static __inline void *_global_ptr_prv_CMAP1 (void) { void *val; __asm volatile ("movl $gd_" "prv_CMAP1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_prv_CMAP1_nv(void) { void *val; __asm("movl $gd_" "prv_CMAP1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_prv_CMAP1 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_CMAP1" ",%0" : "=r" (val)); return (val); } static __inline int _global_prv_CMAP1_nv(void) { int val; __asm("movl %%fs:gd_" "prv_CMAP1" ",%0" : "=r" (val)); return (val); } static __inline void _global_prv_CMAP1_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "prv_CMAP1" : : "r" (val)); } static __inline void _global_prv_CMAP1_set_nv(int val) { __asm("movl %0,%%fs:gd_" "prv_CMAP1" : : "r" (val)); } static __inline void *_global_ptr_prv_CMAP2 (void) { void *val; __asm volatile ("movl $gd_" "prv_CMAP2" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_prv_CMAP2_nv(void) { void *val; __asm("movl $gd_" "prv_CMAP2" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_prv_CMAP2 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_CMAP2" ",%0" : "=r" (val)); return (val); } static __inline int _global_prv_CMAP2_nv(void) { int val; __asm("movl %%fs:gd_" "prv_CMAP2" ",%0" : "=r" (val)); return (val); } static __inline void _global_prv_CMAP2_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "prv_CMAP2" : : "r" (val)); } static __inline void _global_prv_CMAP2_set_nv(int val) { __asm("movl %0,%%fs:gd_" "prv_CMAP2" : : "r" (val)); } static __inline void *_global_ptr_prv_CMAP3 (void) { void *val; __asm volatile ("movl $gd_" "prv_CMAP3" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_prv_CMAP3_nv(void) { void *val; __asm("movl $gd_" 
"prv_CMAP3" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_prv_CMAP3 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_CMAP3" ",%0" : "=r" (val)); return (val); } static __inline int _global_prv_CMAP3_nv(void) { int val; __asm("movl %%fs:gd_" "prv_CMAP3" ",%0" : "=r" (val)); return (val); } static __inline void _global_prv_CMAP3_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "prv_CMAP3" : : "r" (val)); } static __inline void _global_prv_CMAP3_set_nv(int val) { __asm("movl %0,%%fs:gd_" "prv_CMAP3" : : "r" (val)); } static __inline void *_global_ptr_prv_PMAP1 (void) { void *val; __asm volatile ("movl $gd_" "prv_PMAP1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_prv_PMAP1_nv(void) { void *val; __asm("movl $gd_" "prv_PMAP1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_prv_PMAP1 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_PMAP1" ",%0" : "=r" (val)); return (val); } static __inline int _global_prv_PMAP1_nv(void) { int val; __asm("movl %%fs:gd_" "prv_PMAP1" ",%0" : "=r" (val)); return (val); } static __inline void _global_prv_PMAP1_set(int val) { __asm volatile ("movl %0,%%fs:gd_" "prv_PMAP1" : : "r" (val)); } static __inline void _global_prv_PMAP1_set_nv(int val) { __asm("movl %0,%%fs:gd_" "prv_PMAP1" : : "r" (val)); } static __inline void *_global_ptr_prv_CADDR1 (void) { void *val; __asm volatile ("movl $gd_" "prv_CADDR1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline void *_global_ptr_prv_CADDR1_nv(void) { void *val; __asm("movl $gd_" "prv_CADDR1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val)); return (val); } static __inline int _global_prv_CADDR1 (void) { int val; __asm volatile ("movl %%fs:gd_" "prv_CADDR1" ",%0" : "=r" (val)); return (val); } static __inline int _global_prv_CADDR1_nv(void) { int val; __asm("movl %%fs:gd_" "prv_CADDR1" ",%0" : "=r" (val)); return (val); } 
static __inline void _global_prv_CADDR1_set(int val)
{
    __asm volatile ("movl %0,%%fs:gd_" "prv_CADDR1" : : "r" (val));
}

static __inline void _global_prv_CADDR1_set_nv(int val)
{
    __asm("movl %0,%%fs:gd_" "prv_CADDR1" : : "r" (val));
}

static __inline void *_global_ptr_prv_CADDR2 (void)
{
    void *val;
    __asm volatile ("movl $gd_" "prv_CADDR2" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val));
    return (val);
}

static __inline void *_global_ptr_prv_CADDR2_nv(void)
{
    void *val;
    __asm("movl $gd_" "prv_CADDR2" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val));
    return (val);
}

static __inline int _global_prv_CADDR2 (void)
{
    int val;
    __asm volatile ("movl %%fs:gd_" "prv_CADDR2" ",%0" : "=r" (val));
    return (val);
}

static __inline int _global_prv_CADDR2_nv(void)
{
    int val;
    __asm("movl %%fs:gd_" "prv_CADDR2" ",%0" : "=r" (val));
    return (val);
}

static __inline void _global_prv_CADDR2_set(int val)
{
    __asm volatile ("movl %0,%%fs:gd_" "prv_CADDR2" : : "r" (val));
}

static __inline void _global_prv_CADDR2_set_nv(int val)
{
    __asm("movl %0,%%fs:gd_" "prv_CADDR2" : : "r" (val));
}

static __inline void *_global_ptr_prv_CADDR3 (void)
{
    void *val;
    __asm volatile ("movl $gd_" "prv_CADDR3" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val));
    return (val);
}

static __inline void *_global_ptr_prv_CADDR3_nv(void)
{
    void *val;
    __asm("movl $gd_" "prv_CADDR3" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val));
    return (val);
}

static __inline int _global_prv_CADDR3 (void)
{
    int val;
    __asm volatile ("movl %%fs:gd_" "prv_CADDR3" ",%0" : "=r" (val));
    return (val);
}

static __inline int _global_prv_CADDR3_nv(void)
{
    int val;
    __asm("movl %%fs:gd_" "prv_CADDR3" ",%0" : "=r" (val));
    return (val);
}

static __inline void _global_prv_CADDR3_set(int val)
{
    __asm volatile ("movl %0,%%fs:gd_" "prv_CADDR3" : : "r" (val));
}

static __inline void _global_prv_CADDR3_set_nv(int val)
{
    __asm("movl %0,%%fs:gd_" "prv_CADDR3" : : "r" (val));
}

static __inline void *_global_ptr_prv_PADDR1 (void)
{
    void *val;
    __asm volatile ("movl
$gd_" "prv_PADDR1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val));
    return (val);
}

static __inline void *_global_ptr_prv_PADDR1_nv(void)
{
    void *val;
    __asm("movl $gd_" "prv_PADDR1" ",%0;" "addl %%fs:globaldata,%0" : "=r" (val));
    return (val);
}

static __inline int _global_prv_PADDR1 (void)
{
    int val;
    __asm volatile ("movl %%fs:gd_" "prv_PADDR1" ",%0" : "=r" (val));
    return (val);
}

static __inline int _global_prv_PADDR1_nv(void)
{
    int val;
    __asm("movl %%fs:gd_" "prv_PADDR1" ",%0" : "=r" (val));
    return (val);
}

static __inline void _global_prv_PADDR1_set(int val)
{
    __asm volatile ("movl %0,%%fs:gd_" "prv_PADDR1" : : "r" (val));
}

static __inline void _global_prv_PADDR1_set_nv(int val)
{
    __asm("movl %0,%%fs:gd_" "prv_PADDR1" : : "r" (val));
}

void
kproc_start(udata)
    const void *udata;
{
    const struct kproc_desc *kp = udata;
    int error;

    error = kthread_create((void (*)(void *))kp->func, 0 ,
        kp->global_procpp, kp->arg0);
    if (error)
        panic("kproc_start: %s: error %d", kp->arg0, error);
}

int
kthread_create(void (*func)(void *), void *arg,
    struct proc **newpp, const char *fmt, ...)
{
    int error;
    va_list ap;
    struct proc *p2;

    if (!proc0.p_stats ) {
        panic("kthread_create called too soon");
    }
    error = fork1(&proc0, (1<<5) | (1<<2) | (1<<4) , &p2);
    if (error)
        return error;
    if (newpp != 0 )
        *newpp = p2;
    p2->p_flag |= 0x00004 | 0x00200 ;
    p2->p_procsig->ps_flag |= 0x0001 ;
    { if (( p2 )->p_lock++ == 0 && (( p2 )->p_flag & 0x00004 ) == 0) faultin( p2 ); } ;
    (( ap ) = (va_list)__builtin_next_arg( fmt )) ;
    vsnprintf(p2->p_comm, sizeof(p2->p_comm), fmt, ap);
    ;
    cpu_set_fork_handler(p2, func, arg);
    return 0;
}

void
kthread_exit(int ecode)
{
    proc_reparent((( struct proc * )_global_curproc_nv()) , initproc);
    exit1((( struct proc * )_global_curproc_nv()) , (( ecode ) << 8 | ( 0 )) );
}

int
suspend_kproc(struct proc *p, int timo)
{
    if ((p->p_flag & 0x00200 ) == 0)
        return (22 );
    ( p->p_siglist ).__bits[((( 17 ) - 1) >> 5) ] |= (1 << ((( 17 ) - 1) & 31)) ;
    return tsleep((caddr_t)&p->p_siglist, 40 , "suspkp", timo);
}

int
resume_kproc(struct proc *p)
{
    if ((p->p_flag & 0x00200 ) == 0)
        return (22 );
    ( p->p_siglist ).__bits[((( 17 ) - 1) >> 5) ] &= ~(1 << ((( 17 ) - 1) & 31)) ;
    wakeup((caddr_t)&p->p_siglist);
    return (0);
}

void
kproc_suspend_loop(struct proc *p)
{
    while ((( p->p_siglist ).__bits[((( 17 ) - 1) >> 5) ] & (1 << ((( 17 ) - 1) & 31)) ) ) {
        wakeup((caddr_t)&p->p_siglist);
        tsleep((caddr_t)&p->p_siglist, 40 , "kpsusp", 0);
    }
}

And don't forget the extra support code required for this:

#include "assym.s"

#ifdef SMP
/*
 * Define layout of per-cpu address space.
 * This is "constructed" in locore.s on the BSP and in mp_machdep.c
 * for each AP.  DO NOT REORDER THESE WITHOUT UPDATING THE REST!
 */
        .globl  _SMP_prvspace, _lapic
        .set    _SMP_prvspace,(MPPTDI << PDRSHIFT)
        .set    _lapic,_SMP_prvspace + (NPTEPG-1) * PAGE_SIZE

        .globl  gd_idlestack,gd_idlestack_top
        .set    gd_idlestack,PS_IDLESTACK
        .set    gd_idlestack_top,PS_IDLESTACK_TOP
#endif

/*
 * Define layout of the global data.  On SMP this lives in
 * the per-cpu address space, otherwise it's in the data segment.
 */
        .globl  globaldata
#ifndef SMP
        .data
        ALIGN_DATA
globaldata:
        .space  GD_SIZEOF               /* in data segment */
#else
        .set    globaldata,0
#endif
        .globl  gd_curproc, gd_curpcb, gd_npxproc, gd_astpending
        .globl  gd_common_tss, gd_switchtime, gd_switchticks
        .set    gd_curproc,globaldata + GD_CURPROC
        .set    gd_astpending,globaldata + GD_ASTPENDING
        .set    gd_curpcb,globaldata + GD_CURPCB
        .set    gd_npxproc,globaldata + GD_NPXPROC
        .set    gd_common_tss,globaldata + GD_COMMON_TSS
        .set    gd_switchtime,globaldata + GD_SWITCHTIME
        .set    gd_switchticks,globaldata + GD_SWITCHTICKS

        .globl  gd_common_tssd, gd_tss_gdt
        .set    gd_common_tssd,globaldata + GD_COMMON_TSSD
        .set    gd_tss_gdt,globaldata + GD_TSS_GDT

#ifdef USER_LDT
        .globl  gd_currentldt
        .set    gd_currentldt,globaldata + GD_CURRENTLDT
#endif

#ifndef SMP
        .globl  _curproc, _curpcb, _npxproc, _astpending
        .globl  _common_tss, _switchtime, _switchticks
        .set    _curproc,globaldata + GD_CURPROC
        .set    _astpending,globaldata + GD_ASTPENDING
        .set    _curpcb,globaldata + GD_CURPCB
        .set    _npxproc,globaldata + GD_NPXPROC
        .set    _common_tss,globaldata + GD_COMMON_TSS
        .set    _switchtime,globaldata + GD_SWITCHTIME
        .set    _switchticks,globaldata + GD_SWITCHTICKS

        .globl  _common_tssd, _tss_gdt
        .set    _common_tssd,globaldata + GD_COMMON_TSSD
        .set    _tss_gdt,globaldata + GD_TSS_GDT

#ifdef USER_LDT
        .globl  _currentldt
        .set    _currentldt,globaldata + GD_CURRENTLDT
#endif
#endif

#ifdef SMP
/*
 * The BSP version of these get setup in locore.s and pmap.c, while
 * the AP versions are setup in mp_machdep.c.
 */
        .globl  gd_cpuid, gd_cpu_lockid, gd_other_cpus
        .globl  gd_ss_eflags, gd_inside_intr
        .globl  gd_prv_CMAP1, gd_prv_CMAP2, gd_prv_CMAP3, gd_prv_PMAP1
        .globl  gd_prv_CADDR1, gd_prv_CADDR2, gd_prv_CADDR3, gd_prv_PADDR1

        .set    gd_cpuid,globaldata + GD_CPUID
        .set    gd_cpu_lockid,globaldata + GD_CPU_LOCKID
        .set    gd_other_cpus,globaldata + GD_OTHER_CPUS
        .set    gd_ss_eflags,globaldata + GD_SS_EFLAGS
        .set    gd_inside_intr,globaldata + GD_INSIDE_INTR
        .set    gd_prv_CMAP1,globaldata + GD_PRV_CMAP1
        .set    gd_prv_CMAP2,globaldata + GD_PRV_CMAP2
        .set    gd_prv_CMAP3,globaldata + GD_PRV_CMAP3
        .set    gd_prv_PMAP1,globaldata + GD_PRV_PMAP1
        .set    gd_prv_CADDR1,globaldata + GD_PRV_CADDR1
        .set    gd_prv_CADDR2,globaldata + GD_PRV_CADDR2
        .set    gd_prv_CADDR3,globaldata + GD_PRV_CADDR3
        .set    gd_prv_PADDR1,globaldata + GD_PRV_PADDR1
#endif

The globals.s code has to be in exact sync with the C headers.

And we push a whole bunch of stuff into the kernel namelist as well:

# nm /kernel | sort | more
00000000 A globaldata
00000004 A gd_curproc
00000008 A gd_npxproc
0000000c A gd_curpcb
00000010 A gd_switchtime
00000018 A gd_common_tss
00000080 A gd_switchticks
00000084 A gd_common_tssd
0000008c A gd_tss_gdt
00000090 A gd_cpuid
00000094 A gd_cpu_lockid
00000098 A gd_other_cpus
0000009c A gd_inside_intr
000000a0 A gd_ss_eflags
000000a4 A gd_prv_CMAP1
000000a8 A gd_prv_CMAP2
000000ac A gd_prv_CMAP3
000000b0 A gd_prv_PMAP1
000000b4 A gd_prv_CADDR1
000000b8 A gd_prv_CADDR2
000000bc A gd_prv_CADDR3
000000c0 A gd_prv_PADDR1
000000c4 A gd_astpending
00005000 A gd_idlestack
00008000 A gd_idlestack_top
9fc00000 A PTmap
9fe7f000 A PTD
9fe7f9fc A PTDpde
9fe7fffc A APTDpde
a0000000 A kernbase
a011ffb0 T btext
a0120019 t begin
a0120064 T sigcode
a0120084 t _osigcode
a01200ac t _esigcode
a01200ac t recover_bootinfo
a01200b9 t newboot
a01200fb t got_bi_size
a012010c t got_common_bi_size
a012010f t olddiskboot
a0120120 t identify_cpu
[....]
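
As a rough illustration of the "exact sync" constraint (a sketch only, not the kernel's real layout): the assembler side hard-codes a byte offset for each field, and it is only correct while the C structure keeps that layout. The structure `demo_globaldata`, its `gd_self` placeholder field, and the `DEMO_GD_*` constants are all hypothetical names invented for this example; the field widths were picked only so that the first offsets resemble the start of the nm listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the per-CPU structure.  Fixed-width fields
 * keep the offsets identical on any host; the real struct differs.
 */
struct demo_globaldata {
	uint32_t gd_self;	/* assumed placeholder occupying offset 0 */
	uint32_t gd_curproc;
	uint32_t gd_npxproc;
	uint32_t gd_curpcb;
};

/* Values an assembler file would rely on (hypothetical, typed by hand here). */
#define DEMO_GD_CURPROC	4
#define DEMO_GD_NPXPROC	8
#define DEMO_GD_CURPCB	12

/* Returns 1 when the assembler-side constants still match the C layout. */
int
demo_offsets_in_sync(void)
{
	return (offsetof(struct demo_globaldata, gd_curproc) == DEMO_GD_CURPROC &&
	    offsetof(struct demo_globaldata, gd_npxproc) == DEMO_GD_NPXPROC &&
	    offsetof(struct demo_globaldata, gd_curpcb) == DEMO_GD_CURPCB);
}
```

Reordering or resizing any field in the sketch without updating the constants makes `demo_offsets_in_sync()` return 0, which is the kind of silent mismatch the symbols in globals.s are exposed to.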
Anyway, we have plenty of time to come back to this if it turns out that
we don't need the complexity.  We have *lots* of optimization choices.
But we should not start restricting our options yet.

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Mon Nov 12 8:34:57 2001
Date: Mon, 12 Nov 2001 08:34:51 -0800 (PST)
From: John Baldwin
To: Julian Elischer
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.org, Robert Watson

On 11-Nov-01 Julian Elischer wrote:
> Also, in SMPng cur{thread,proc} takes some time to get as I'm told that
> dereferencing %fs is very slow.. (Not sure how true that is).

I'm not sure it is any slower than pushing the variable onto the stack and
then reading it from the stack.  Reading the variable off the stack is still
a memory read, as is reading curproc, so it's not really that slow.
%fs is no slower than %ds, at least not by anything that compares to the
amount of time it takes to go out to cache or memory and read the thing.

> There are other reasons for needing the pointer than for a credential.
> For example in AIO, the process pointer is stored so that
> address space can be loaned to the aio threads to do the IO.

Yeah, but it isn't used.  All that is used for is to find the vmspace to
dink with the aio thread's vmspace AFAICT.

> I have thought about this both ways...
> both have advantages. In some architectures, getting curthread might
> be very expensive.

I don't think it is as expensive as people think it is. :)

--

John Baldwin -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

From owner-freebsd-arch Mon Nov 12 8:43:23 2001
In-Reply-To: <20011111191735.00D053807@overcee.netplex.com.au>
Date: Mon, 12 Nov 2001 08:43:17 -0800 (PST)
From: John Baldwin
To: Peter Wemm
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.ORG, Robert Watson

On 11-Nov-01 Peter Wemm wrote:
> Robert Watson wrote:
>
>> It seems to me that unless a very strong argument exists against using
>> curproc/curthread (and I don't preclude one existing), using them would
>> actually be an improvement, as it would assert that this class of
>> 'borrowing' couldn't exist, simplifying the kernel, not to mention
>> squeezing a bit more stuff out of the stack (which, at ten levels deep,
>> actually begins to add up on 64-bit machines).  I believe that there are
>> many places where the 'p' passed in is implicitly assumed to be the
>> current process, and that making that reliance explicit would be an
>> improvement, rather than a problem.
>
> My gripe is that on i386, it creates a LOT of work for the compiler.
>
> Consider this small function in kern_kthread.c:
> void
> kthread_exit(int ecode)
> {
>
>         sx_xlock(&proctree_lock);
>         PROC_LOCK(curproc);
>         proc_reparent(curproc, initproc);
>         PROC_UNLOCK(curproc);
>         sx_xunlock(&proctree_lock);
>         exit1(curthread, W_EXITCODE(ecode, 0));
> }
>
> Have a look at http://people.freebsd.org/~peter/macros.c where I've cpp'ed
> it and indented it for readability.  Anyway, kthread_exit() turns into
> this for the compiler to choke on:

This is why one does 'struct proc *p; p = curproc;' and then s/curproc/p/.
As it is, our current macros collapse that PCPU_GET() down into one
instruction.  We actually used to have it be multiple instructions, but then
people got all upset and whined and complained about it being 2 instructions
or whatever it was when SMPng first went in.
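
A rough sketch of the 'p = curproc' idiom, under stated assumptions: `fetch_curproc()` and the `curproc` macro below are hypothetical stand-ins that merely count how often the accessor is expanded, whereas the real kernel macro expands to inline assembly. The point is only that each textual use of `curproc` is a separate fetch, while a local copy costs one.

```c
#include <assert.h>
#include <stddef.h>

struct proc {
	int p_pid;
};

static struct proc demo_proc0;	/* hypothetical "current process" */
static int curproc_fetches;	/* how many times the accessor ran */

/* Stand-in for the per-CPU accessor; the real one is inline asm. */
static struct proc *
fetch_curproc(void)
{
	curproc_fetches++;
	return (&demo_proc0);
}

#define curproc (fetch_curproc())

/* Uses the macro at every reference: one fetch per use. */
int
naive_fetches(void)
{
	curproc_fetches = 0;
	curproc->p_pid += 1;
	curproc->p_pid += 1;
	curproc->p_pid += 1;
	return (curproc_fetches);
}

/* Caches the pointer in a local first: a single fetch in total. */
int
cached_fetches(void)
{
	struct proc *p;

	curproc_fetches = 0;
	p = curproc;
	p->p_pid += 1;
	p->p_pid += 1;
	p->p_pid += 1;
	return (curproc_fetches);
}
```

In this sketch `naive_fetches()` reports three accessor calls and `cached_fetches()` reports one. Whether the difference matters in practice depends on how cheap the real accessor is, which is exactly what is being argued in this thread.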
Also, regarding the preemption stuff on the side:

- BSD/OS happily preempts arbitrarily for interrupts, just in case that
  wasn't clear, and
- curthread doesn't change when we get preempted; only things like cpuid or
  PCPU_GET(spinlocks) need to be worried about.  Since the only PCPU macro
  commonly used is curthread, you don't have to worry about this in most
  cases.

> Ever wonder why the kernel gets slower and slower to compile?  Ever
> compiled a 2.1 or 2.2 kernel on a modern machine and been shocked away by
> the speed?

And 2.1 and 2.2 don't support SMP.  If we didn't have SMP then PCPU_FOO()
could certainly be simpler.  They could just be global variables like they
used to be, in fact.  Now, maybe as a hack for now, you could try something
like having a simple case for PCPU_GET() on the x86 that is
PCPU_GET_CURTHREAD() or something and define curthread to be that.  Sheesh.

--

John Baldwin -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

From owner-freebsd-arch Mon Nov 12 9:31:44 2001
Date: Mon, 12 Nov 2001 09:31:38 -0800 (PST)
From: Matthew Dillon
Message-Id: <200111121731.fACHVck84386@apollo.backplane.com>
To: Peter Wemm
Cc: Bruce Evans, Robert Watson, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References: <20011112123925.6A89E380A@overcee.netplex.com.au>

:> It's a mess, but the code produced isn't too bad.  It's much better
:> now that the mutexes are calling real procedures.
:
:Mutexes only call procedures if debugging options are on.  If you compile
:without INVARIANTS, KTR, or WITNESS, then you get the maximum inline
:versions.

    Sigh.  Well, better than nothing I guess.

:Regarding __globaldata() .. That's almost how an intermediate version
:of globals.h did it on the i386, about rev 1.16.  We always have the option
:to go back to something like later on if preemption turns out to be a wash.
:
:Your inline function doesn't work though..  %fs isn't a general purpose
:register..  You can't store a pointer in the register itself.  You have
:to use an indirect memory reference to fetch the pointer.

    Ach.  Right, of course.

:Anyway, we have plenty of time to come back to this if it turns out that
:we don't need the complexity.  We have *lots* of optimization choices.
:But we should not start restricting our options yet.
:
:Cheers,
:-Peter
:--
:Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au

    Well, that's part of the problem.  We *don't* have lots of optimization
    choices.  The way things are currently set up, it is not possible to
    depend on *anything* being stable without obtaining a mutex first.

    I'm not going to worry about it for the moment; I have bigger fish to fry.
    -Matt
    Matthew Dillon

From owner-freebsd-arch Mon Nov 12 14:49:58 2001
Message-ID: <3BF05241.74F895EF@mindspring.com>
Date: Mon, 12 Nov 2001 14:50:41 -0800
From: Terry Lambert
Reply-To: tlambert2@mindspring.com
To: Robert Watson
Cc: freebsd-arch@FreeBSD.org
Subject: Re: cur{thread/proc}, or not.

Robert Watson wrote:
> There are a number of uses of curproc in the netinet code, used to
> retrieve credentials for authorization somewhere down the stack, when no
> proc or thread pointer has been passed down.

I think that the majority of the netinet code can be handled
by using the socket credential, instead of the process credential.

> With the eventual addition
> of td->td_ucred, it will be desirable to use the credential for the
> current thread, rather than the proc, which will require locking to use.

I think locking credential instances is bad.

The real question you want to answer is whether or not the
credential instance that was used to acquire a socket should
be used continuously from there on out (i.e.
it is a grant), or whether it should change when the process
credential changes (i.e. it is a lease).

You seem to be arguing for a lease.  I would argue for a grant.

One issue is that there are cases where write permission is
tested before each write.  There are also cases where you
obtain a privileged socket, and then relinquish privileges
after obtaining it; such cases are explicitly modelled on a
grant model rather than a lease model.

The point is that if the credentials are granted, then a
change in credential is not a change of the credential itself,
but is instead a copy-on-write proposition.  In other words,
credentials, once granted, are privilege stable.

If this is the case, then they are written when they are
instanced, cloned before they are modified (indeed, it seems
that the clone/modify operation must be made atomic), and
thus are never written once instanced -- only destroyed on
the 1->0 reference transition.

If so, then no locking is required, since the LOCK CMPXCHG can
be utilized to do atomic increment and decrement on the
reference counting, without needing locks.

> As I
> understand it, use of curproc was branded 'undesirable' at some point in
> the semi-distant past, and since that time, a reference to 'proc' has been
> passed down the stack.  With a change to KSE, this has been translated to
> references the thread, but the issue remains the same.  This comes up in
> particular because I have a tree where I have propagated the thread
> pointer down if_ioctl in the network stack: the normal ioctl call carries
> a thread pointer now, but when it is translated into if_ioctl by the
> network stack, that pointer is lost.  This raises the question: should we
> (in practice) be adding process or thread pointers to many more of the
> function arguments, or should we switch to using curproc/curthread
> instead.

The "curproc" undesirability stems primarily from credentials
enforcement during interrupt processing.
I think that this is not an insurmountable issue, but I would argue
that these are more appropriate for object credentials, where the
objects in question are not threads or processes.

For example, if we were to process incoming TCP connections up
through the "accept" code at interrupt time, one might naively
assume that, since the current socket code down through the accept
processing code off the queue filled in at NETISR seems to require
a proc credential, it is therefore necessary to have a proc
credential at interrupt time in order to do this processing.

The answer is that this is a false assumption, and is predicated
on historical code, and nothing more.  Specifically, if I need a
credential for a newly accepted socket that I am now creating, I
can add a reference to the listen socket credential -- I //do not
need// a process credential in order to do an accept.

There is a lot of this type of fuzzy thinking, asking "how can I
propagate the process credential that I used to use for this
operation down to the underlying code?", when the real question
should be "what is the appropriate credential to use for this
operation, and is the process credential really what I want to
use in this case?".

I think it's possible to get rid of most of the process credential
references -- and therefore, most of the proc references -- at all
points below the /sys/kern/uipc_socket*.c level.

> I don't pretend to have a grasp of all the issues here, so the purpose of
> this message is to raise the issues so that I can understand them.  I have
> a tree where I've eliminated many references to curproc; however, I'm now
> wondering if it wouldn't simply be more useful to eliminate many of the
> references to struct proc in the function arguments, and use curproc
> instead, and add references to ucred (and related ref-counted structures)
> as needed for delegation types of situations.
> In particular, that would
> suggest the following changes:

I think this is the wrong direction, but if you wanted to do this,
I think that you would need to put the cur* symbols into the per
CPU private pages.  This is problematic in the extreme, because it
means that you must set these values each time going down, in order
to be able to substitute a per CPU global for the stack reference.

I think this is a bad thing, in general, and will lead only to
trouble later.

I would much rather that the credentials be object referenced off
of non-process, non-thread objects, based on whatever the correct
scoping really is, for the security model you want to enforce.  My
"accept" example is only one of a class of changes that could
facilitate this.

-- Terry

From owner-freebsd-arch Mon Nov 12 14:54:34 2001
Date: Mon, 12 Nov 2001 14:54:23 -0800 (PST)
From: Matthew Dillon
Message-Id: <200111122254.fACMsNd06845@apollo.backplane.com>
To: Terry Lambert
Cc: Robert Watson, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References: <3BF05241.74F895EF@mindspring.com>

:The point is that if the credentials are granted, then a
:change in credential is not a change of the credential itself,
:but is instead a copy-on-write proposition.  In other words,
:credentials, once granted, are privilege stable.
:
:If this is the case, then they are written when they are
:instanced, cloned before they are modified (indeed, it seems
:that the clone/modify operation must be made atomic), and
:thus are never written once instanced -- only destroyed on
:the 1->0 reference transition.
:
:If so, then no locking is required, since the LOCK CMPXCHG can
:be utilized to do atomic increment and decrement on the
:reference counting, without needing locks.
:...
:
:-- Terry

    Yes, I believe this is how credentials work.  I looked at
    the code about 6 months ago.  We should not have to do any
    locking of the credential stuff, only simple mutexing
    around the ref counter.  That is how it should work, and it
    is how I believe it currently works.

    -Matt
    Matthew Dillon

From owner-freebsd-arch Mon Nov 12 15: 8:46 2001
In-Reply-To: <3BF05241.74F895EF@mindspring.com>
Date: Mon, 12 Nov 2001 15:08:36 -0800 (PST)
From: John Baldwin
To: Terry Lambert
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.org, Robert Watson

On 12-Nov-01 Terry Lambert wrote:
> Robert Watson wrote:
>> With the eventual addition
>> of td->td_ucred, it will be desirable to use the credential for the
>> current thread, rather than the proc, which will require locking to use.
>
> I think locking credential instances is bad.

No, he's not locking credentials; he would be locking the process to avoid
having the credential change out from under him.  However, this won't be
needed in most cases since each thread has a read-only reference to the
process credential.  (When the process changes credentials, the references of
other threads force it to duplicate its current cred into a new one before
making the change.)

> If so, then no locking is required, since the LOCK CMPXCHG can
> be utilized to do atomic increment and decrement on the
> reference counting, without needing locks.

Except that people keep complaining about using atomic ops for ref counts;
however, that can be done later as an optimization.

Regarding object credentials, I agree, and I thought that this was how things
were already performed.

>> I don't pretend to have a grasp of all the issues here, so the purpose of
>> this message is to raise the issues so that I can understand them.  I have
>> a tree where I've eliminated many references to curproc; however, I'm now
>> wondering if it wouldn't simply be more useful to eliminate many of the
>> references to struct proc in the function arguments, and use curproc
>> instead, and add references to ucred (and related ref-counted structures)
>> as needed for delegation types of situations.
>> In particular, that would
>> suggest the following changes:
>
> I think this is the wrong direction, but if you wanted to do this,
> I think that you would need to put the cur* symbols into the per
> CPU private pages.  This is problematic in the extreme, because it
> means that you must set these values each time going down, in order
> to be able to substitute a per CPU global for the stack reference.

Errr, Terry.  Where do you think curthread/curproc lives now?  It's _already_
in a per-CPU page.  We set curthread/curproc on each context switch.

> I would much rather that the credentials be object referenced off
> of non-process, non-thread objects, based on whatever the correct
> scoping really is, for the security model you want to enforce.  My
> "accept" example is only one of a class of changes that could
> facilitate this.

I agree with this.  I think Robert's question wasn't just about socket
credentials, however.  His question was why we pass a proc pointer (or thread
pointer) all the way down the stack, when it is implicitly assumed to be
curproc/curthread in several places, instead of just using curproc/curthread.
Your only response to that seems to be to suggest that we "change" to doing
something that we already do.

--

John Baldwin -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"
- http://www.FreeBSD.org/

From owner-freebsd-arch Mon Nov 12 15: 8:57 2001
In-Reply-To: <200111122254.fACMsNd06845@apollo.backplane.com>
Date: Mon, 12 Nov 2001 15:08:37 -0800 (PST)
From: John Baldwin
To: Matthew Dillon
Subject: Re: cur{thread/proc}, or not.
Cc: freebsd-arch@FreeBSD.ORG, Robert Watson, Terry Lambert

On 12-Nov-01 Matthew Dillon wrote:
>:The point is that if the credentials are granted, then a
>:change in credential is not a change of the credential itself,
>:but is instead a copy-on-write proposition.  In other words,
>:credentials, once granted, are privilege stable.
>:
>:If this is the case, then they are written when they are
>:instanced, cloned before they are modified (indeed, it seems
>:that the clone/modify operation must be made atomic), and
>:thus are never written once instanced -- only destroyed on
>:the 1->0 reference transition.
>:
>:If so, then no locking is required, since the LOCK CMPXCHG can
>:be utilized to do atomic increment and decrement on the
>:reference counting, without needing locks.
>:...
>:
>:-- Terry
>
>     Yes, I believe this is how credentials work.  I looked at
>     the code about 6 months ago.  We should not have to do any
>     locking of the credential stuff, only simple mutexing
>     around the ref counter.  That is how it should work, and it
>     is how I believe it currently works.

Yep.  They use a mutex for the refcount for now, but I still have patches
(that some people don't like) for implementing a simple refcount API using
just atomic operations.

--

John Baldwin -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

From owner-freebsd-arch Mon Nov 12 15:16:24 2001
Message-ID: <3BF05877.B9E886D8@mindspring.com>
Date: Mon, 12 Nov 2001 15:17:11 -0800
From: Terry Lambert
Reply-To: tlambert2@mindspring.com
To: Matthew Dillon
Cc: Robert Watson, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References: <3BF05241.74F895EF@mindspring.com> <200111122254.fACMsNd06845@apollo.backplane.com>

Matthew Dillon wrote:
>     Yes, I believe this is how credentials work.
I looked at > the code about 6 months ago. We should not have to do any > locking of the credential stuff, only simple mutexing > around the ref counter. That is how it should work > is how I believe it currently works. FWIW: Robert had implied that more heavyweight locking of the process (or thread) structure was necessary to access the credential, which is correct, if you are referencing it that way. The part of my message you quoted here was a conclusion based on using direct references to value-stable credentials rather than value-volatile proc or thread structs. It only works to refute Robert's argument if you include that; it's not correct to conclude that the way it currently works is sufficient in the face of the proc/thread dereference issues that Robert was trying to address (and which I tried to address by avoiding entirely). -- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 15:20:14 2001 Delivered-To: freebsd-arch@freebsd.org Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9]) by hub.freebsd.org (Postfix) with ESMTP id 7FEE937B416; Mon, 12 Nov 2001 15:20:11 -0800 (PST) Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id PAA95318; Mon, 12 Nov 2001 15:02:12 -0800 (PST) Date: Mon, 12 Nov 2001 15:02:11 -0800 (PST) From: Julian Elischer To: Terry Lambert Cc: Robert Watson , freebsd-arch@FreeBSD.org Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <3BF05241.74F895EF@mindspring.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Mon, 12 Nov 2001, Terry Lambert wrote: > > I think this is the wrong direction, but if you wanted to do this, > I think that you would need to put the cur* symbols into the per > CPU private pages. This is problematic in the extreme, because it > means that you must set these values each time going down, in order > to be able to substitute a per CPU global for the stack reference. curproc and curthread ARE in the per-cpu private pages. on x86, the %fs segment register points to a small segment that includes the appropriate pages for that cpu. Each cpu is initialised with a different %fs register value. Your private info is accessed as an offset into the 'f' segment which is not used by anything else. 'curthread' is a macro that generates %fs(gd_curthread) (I forget the exact syntax) Similar for other CPUs > I think this is a bad thing, in general, and will lead only to > trouble later. > > I would much rather that the credentials be object referenced off > of non-process, non-thread objects, based on whatever the correct > scoping really is, for the security model you want to enforce. My > "accept" example is only one of a class of changes that could > facilitate this. 
> To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 15:20:22 2001 Delivered-To: freebsd-arch@freebsd.org Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9]) by hub.freebsd.org (Postfix) with ESMTP id 09D5937B416; Mon, 12 Nov 2001 15:20:15 -0800 (PST) Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id PAA95324; Mon, 12 Nov 2001 15:04:27 -0800 (PST) Date: Mon, 12 Nov 2001 15:04:27 -0800 (PST) From: Julian Elischer To: Matthew Dillon Cc: Terry Lambert , Robert Watson , freebsd-arch@FreeBSD.ORG Subject: Re: cur{thread/proc}, or not. In-Reply-To: <200111122254.fACMsNd06845@apollo.backplane.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Mon, 12 Nov 2001, Matthew Dillon wrote: > :The point is that if the credentials are granted, then a > :change in credential is not a change of the credential itself, > :but is instead a copy-on-write proposition. In other words, > :credentials, once granted, are priviledge stable. > : > :If this is the case, then they are written when they are > :instanced, cloned before they are modified (indeed, it seems > :that the clone/modify operation must be made atomic), and > :thus are never written once instanced -- only destroyed on > :the 1->0 reference transition. > : > :If so, then no locking is required, since the LCK CMPXCHG can > :be utilized to do atomic increment and decrement on the > :reference counting, without needing locks. > :... > : > :-- Terry > > Yes, I believe this is how credentials work. I looked at > the code about 6 months ago. 
We should not have to do any > locking of the credential stuff, only simple mutexing > around the ref counter. That is how it should work > is how I believe it currently works. This is not how they work, but rather how they WILL work given that the commit happens soon (maybe it was already done last week and I missed it...) > > -Matt > Matthew Dillon > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-arch" in the body of the message > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 15:20:36 2001 Delivered-To: freebsd-arch@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 9503637B419; Mon, 12 Nov 2001 15:20:24 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fACNKLC07027; Mon, 12 Nov 2001 15:20:21 -0800 (PST) (envelope-from dillon) Date: Mon, 12 Nov 2001 15:20:21 -0800 (PST) From: Matthew Dillon Message-Id: <200111122320.fACNKLC07027@apollo.backplane.com> To: John Baldwin Cc: freebsd-arch@FreeBSD.ORG, Robert Watson , Terry Lambert Subject: Re: cur{thread/proc}, or not. References: Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG : :Yep. They use a mutex for the refcount for now, but I still have patches that :some people don't like for implementing a simple refcount API just using atomic :operations. : :-- : :John Baldwin -- http://www.FreeBSD.org/~jhb/ I haven't seen your patches but I like the idea of a simple API for incrementing and decrementing a refcnt_t type of variable that hides the underlying 'how'. For example, on some architectures you could use atomic ops, on others you could use a small pool of mutexes. Specifically, I really dislike the mutex embedded in the ucred structure. 
It is entirely unnecessary - a simple global pool of shared mutexes is sufficient, hashed by pointer address, or using atomic ops on architectures that support them. Something like this:

    /*
     * machine independent sys/refcnt.h
     */
    #ifndef ARCH_OVERRIDE_REFCNT
    typedef int refcnt_t;
    #endif

    ...

    /*
     * machine independent kern/refcnt.c
     */
    #ifndef ARCH_OVERRIDE_REFCNT

    #define MTX_POOL 32

    static struct mtx mtx_pool[MTX_POOL];

    /*
     * called in early startup to initialize
     * mutexes (if necessary)
     */
    void
    refcnt_init(void)
    {
        ...
    }

    /*
     * Increment the ref counter. panic if we
     * overflow.
     */
    void
    refcnt_bump(refcnt_t *rp)
    {
        /*
         * architecture dependent. e.g. atomic op
         * in I386, maybe a pool mutex for alpha, etc
         * etc etc.
         */
    }

    /*
     * Decrement the ref counter. panic if we
     * underflow. Returns the ref counter after
     * it has been decremented (typically used to
     * determine that the associated structure
     * is no longer in use).
     */
    int
    refcnt_drop(refcnt_t *rp)
    {
        /*
         * architecture dependent. e.g. atomic op
         * in I386, maybe a pool mutex for alpha, etc
         * etc etc.
         */
    }

    #endif

You could have a default set of ref counter routines that use a global pool of mutexes to avoid having to implement them for each architecture, and you could have architecture overrides of those routines to implement architecture-specific optimizations.

Similar pool-type functions (using the same pool) can be used to sequence structure deallocations / cloning / etc. In fact, the one huge advantage of a pool mutex is that it is independent of the structure, so you don't race a deallocation routine when obtaining the mutex prior to checking that the structure is even valid.
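
To make the default (pool mutex) variant of the outline above concrete, here is a minimal userland sketch. It assumes POSIX threads as a stand-in for kernel mutexes and assert() as a stand-in for panic(). The pool_mtx_find() helper and its hash are hypothetical; the outline only says the pool is hashed by pointer address.

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stdint.h>

#define MTX_POOL 32			/* pool size from the outline above */

typedef int refcnt_t;

static pthread_mutex_t mtx_pool[MTX_POOL];

/*
 * Hypothetical helper: map a pointer address to one mutex in the
 * shared pool.  The low bits are dropped because they are usually
 * identical for aligned allocations.
 */
static pthread_mutex_t *
pool_mtx_find(const void *ptr)
{
	return (&mtx_pool[((uintptr_t)ptr >> 4) % MTX_POOL]);
}

/* Called once at startup to initialize the pool. */
void
refcnt_init(void)
{
	for (int i = 0; i < MTX_POOL; i++)
		pthread_mutex_init(&mtx_pool[i], NULL);
}

/* Increment the ref counter; assert() stands in for panic on overflow. */
void
refcnt_bump(refcnt_t *rp)
{
	pthread_mutex_t *m = pool_mtx_find(rp);

	pthread_mutex_lock(m);
	assert(*rp < INT_MAX);
	++*rp;
	pthread_mutex_unlock(m);
}

/*
 * Decrement the ref counter and return the value after the decrement;
 * assert() stands in for panic on underflow.
 */
int
refcnt_drop(refcnt_t *rp)
{
	pthread_mutex_t *m = pool_mtx_find(rp);
	int n;

	pthread_mutex_lock(m);
	assert(*rp > 0);
	n = --*rp;
	pthread_mutex_unlock(m);
	return (n);
}
```

Because the mutex is found from the counter's address rather than stored in the structure, the counted structure carries no lock of its own, which is the property the message argues for.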
-Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 15:32:31 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail11.speakeasy.net (mail11.speakeasy.net [216.254.0.211]) by hub.freebsd.org (Postfix) with ESMTP id BD31737B417 for ; Mon, 12 Nov 2001 15:32:28 -0800 (PST) Received: (qmail 53828 invoked from network); 12 Nov 2001 23:32:27 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail11.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 12 Nov 2001 23:32:27 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <200111122320.fACNKLC07027@apollo.backplane.com> Date: Mon, 12 Nov 2001 15:32:21 -0800 (PST) From: John Baldwin To: Matthew Dillon Subject: Re: cur{thread/proc}, or not. Cc: Terry Lambert , Cc: Terry Lambert , Robert Watson , freebsd-arch@FreeBSD.ORG Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 12-Nov-01 Matthew Dillon wrote: >: >:Yep. They use a mutex for the refcount for now, but I still have patches >:that >:some people don't like for implementing a simple refcount API just using >:atomic >:operations. >: >:-- >: >:John Baldwin -- http://www.FreeBSD.org/~jhb/ > > I haven't seen your patches but I like the idea of a simple API for > incrementing and decrementing a refcnt_t type of variable that > hides the underlying 'how'. For example, on some architectures > you could use atomic ops, on others you could use a small pool > of mutexes. http://www.freebsd.org/~jhb/patches/refcount.patch It's slightly different than this in that refcount_drop() returns a boolean that is true if the count just dropped to zero. 
It only uses mutexes when debugging is enabled and doesn't use a pool; it is currently implemented entirely with atomic ops on all supported archs. Hmm, it needs a change in that no memory barriers are really needed except that maybe the atomic_add should use a release barrier. This refcount has some problems, however. The only reliable way to do a refcount_shared() primitive would be to do

    int
    refcount_shared(refcount_t *count)
    {
        int rval;

        rval = !refcount_drop(count);
        refcount_hold(count);
        return (rval);
    }

But that is evil and has a race condition. Changing refcount_drop() to return the current value would be more workable I suppose, and would allow you to do this by doing a hold and then a drop and checking whether the returned value is > 1 to see if it's shared. -- John Baldwin -- http://www.FreeBSD.org/~jhb/ PGP Key: http://www.baldwin.cx/~john/pgpkey.asc "Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 15:35:51 2001 Delivered-To: freebsd-arch@freebsd.org Received: from pintail.mail.pas.earthlink.net (pintail.mail.pas.earthlink.net [207.217.120.122]) by hub.freebsd.org (Postfix) with ESMTP id 4ED1E37B418; Mon, 12 Nov 2001 15:35:42 -0800 (PST) Received: from dialup-209.245.136.188.dial1.sanjose1.level3.net ([209.245.136.188] helo=mindspring.com) by pintail.mail.pas.earthlink.net with esmtp (Exim 3.33 #1) id 163Qc8-0006CX-00; Mon, 12 Nov 2001 15:35:41 -0800 Message-ID: <3BF05CFE.EAE5EEE4@mindspring.com> Date: Mon, 12 Nov 2001 15:36:30 -0800 From: Terry Lambert Reply-To: tlambert2@mindspring.com X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony} (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: John Baldwin Cc: freebsd-arch@FreeBSD.org, Robert Watson Subject: Re: cur{thread/proc}, or not.
References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG John Baldwin wrote: > > If so, then no locking is required, since the LCK CMPXCHG can > > be utilized to do atomic increment and decrement on the > > reference counting, without needing locks. > > Except that people keep complaining about using atomic ops for > ref counts, however that can be done later as an optimization. Is this the MIPS argument? There is a way around this problem on brain damaged processors, which has been known to CS for a long time. A heavy-weight idempotent-but-not-atomic portable approach would make these people happy, since then their pet processors would not look so much like pigs compared to other processors that were handicapped by having to run the same code. I don't think of it as a premature optimization so much as it is a premature generalization. If we want to be general, then we should provide C code for all but the very platform specific things, since this would be incredibly more useful for any port attempt than doing P/V idempotent counting. > Regarding object credentials, I agree, and I thought that this > was how things were already performed. Not where the proc or thread is used to reference the cred, though there is much code that uses the read-only reference. > > I think this is the wrong direction, but if you wanted to do this, > > I think that you would need to put the cur* symbols into the per > > CPU private pages. This is problematic in the extreme, because it > > means that you must set these values each time going down, in order > > to be able to substitute a per CPU global for the stack reference. > > Errr, Terry. Where do you think curthread/curproc lives now? It's > _already_ in a per-CPU page. We set curthread/curproc on each context > switch. Yes. 
That is Evil Overhead That Must Go Away. My use of "need" was probably not emphatic enough -- I should have said "MUST forever after". This isn't really very clear without my example, where I do the processing as the result of an interrupt, rather than in the context of a process. :-(. > > I would much rather that the credentials be object referenced off > > of non-process, non-thread objects, based on whatever the correct > > scoping really is, for the security model you want to enforce. My > > "accept" example is only one of a class of changes that could > > facilitate this. > > I agree with this. I think Robert's question wasn't just about socket > credentials however, his question was why pass a proc pointer (or thread > pointer) all the way down the stack that is implicitly assumed to be > curproc/curthread in several places instead of just using curproc/curthread > which your only response seems to be to suggest that we "change" to doing > something that we already do. No; I think that most of the passed references to proc/curproc can be eliminated. Now, of course, we will have to deal with the cruft idea of "curcred"... I dislike the idea of "cur" anything. It means that we have to assume top-down procedural processing, with queueing breaks at both interrupt and NETISR (to cite specific examples). Doing this is demonstrably the wrong thing to do, even if we ignore the global non-cacheable per CPU page overhead. If anyone has any reservations on this, I suggest they do some network performance testing with the Duke University port of the LRP + RESCON code to FreeBSD 4.3 from the original Rice University code (before anyone gets too happy, there is a non-commercial use license on this, and I personally think a queued fair share scheduler has significantly lower overhead than resource containers, for what that's worth). Your connections-per-second rate alone will triple if you use this approach.
-- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 15:37: 1 2001 Delivered-To: freebsd-arch@freebsd.org Received: from pintail.mail.pas.earthlink.net (pintail.mail.pas.earthlink.net [207.217.120.122]) by hub.freebsd.org (Postfix) with ESMTP id CC64C37B417; Mon, 12 Nov 2001 15:36:59 -0800 (PST) Received: from dialup-209.245.136.188.dial1.sanjose1.level3.net ([209.245.136.188] helo=mindspring.com) by pintail.mail.pas.earthlink.net with esmtp (Exim 3.33 #1) id 163QdP-0007WL-00; Mon, 12 Nov 2001 15:36:59 -0800 Message-ID: <3BF05D4C.55A9A459@mindspring.com> Date: Mon, 12 Nov 2001 15:37:48 -0800 From: Terry Lambert Reply-To: tlambert2@mindspring.com X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony} (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: John Baldwin Cc: Matthew Dillon , freebsd-arch@FreeBSD.ORG, Robert Watson Subject: Re: cur{thread/proc}, or not. References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG John Baldwin wrote: > the refcount for now, but I still have patches that > some people don't like for implementing a simple refcount API just using > atomic operations. Please commit these. Using mutexes in this instance is just a happy way to put the performance in the toilet. 
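
For illustration, a rough sketch of what a refcount API built only on atomic operations might look like, written with C11 <stdatomic.h> rather than the kernel's own atomic primitives. This is not the patch under discussion: the names refcount_t, refcount_hold() and refcount_drop() follow the ones used in this thread, refcount_init() is a hypothetical addition, and the memory ordering chosen here is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_uint refcount_t;

/* Hypothetical helper: set the initial number of references. */
static inline void
refcount_init(refcount_t *count, unsigned int value)
{
	atomic_init(count, value);
}

/*
 * Take an additional reference.  The caller must already own one,
 * so no ordering beyond atomicity is needed here (assumption).
 */
static inline void
refcount_hold(refcount_t *count)
{
	unsigned int old;

	old = atomic_fetch_add_explicit(count, 1, memory_order_relaxed);
	assert(old > 0);
	(void)old;
}

/*
 * Release a reference.  Returns true if this call performed the
 * 1->0 transition, i.e. the caller should destroy the object.
 */
static inline bool
refcount_drop(refcount_t *count)
{
	unsigned int old;

	old = atomic_fetch_sub_explicit(count, 1, memory_order_release);
	assert(old > 0);
	if (old == 1) {
		/* Make prior writes by other droppers visible before teardown. */
		atomic_thread_fence(memory_order_acquire);
		return (true);
	}
	return (false);
}
```

With copy-on-write credentials, a holder would call refcount_hold() when caching a reference in another object and free the credential only when refcount_drop() returns true.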
-- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 15:50:49 2001 Delivered-To: freebsd-arch@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 0960137B416; Mon, 12 Nov 2001 15:50:46 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fACNojg07127; Mon, 12 Nov 2001 15:50:45 -0800 (PST) (envelope-from dillon) Date: Mon, 12 Nov 2001 15:50:45 -0800 (PST) From: Matthew Dillon Message-Id: <200111122350.fACNojg07127@apollo.backplane.com> To: John Baldwin Cc: Terry Lambert , Robert Watson , freebsd-arch@FreeBSD.org Subject: Re: cur{thread/proc}, or not. References: Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG You want to be very careful not to bloat the concept. We already have severe bloatage in the mutex code and that has led to a lot of unnecessary complexity. A huge amount, in fact. We have so many types of mutexes it makes my head spin and I'm not very happy about it. Forget about 'shared' versus 'exclusive'. A reference count is a reference count, that's all. If you keep the concept simple you can implement more functionality horizontally rather than implementing more complexity vertically. For example, consider this API for pool mutexes.

    /*
     * obtain related pool mutex
     */
    void
    pool_mtx_lock(void *ptr)
    {
    }

    /*
     * release related pool mutex.
     */
    void
    pool_mtx_unlock(void *ptr)
    {
    }

Now consider how this could be combined with, say, the zalloc() and zfree() code. Consider how it could be combined with the refcount code. It might even be possible to remove the stable-storage requirement. Consider a vnode versus its underlying VM object. Consider this:

    vp = vnode ... we already have a ref count on the vp.

    while ((object = vp->v_object) != NULL) {
        pool_mtx_lock(object);
        if (vp->v_object == object)
            break;
        pool_mtx_unlock(object);
    }
    if (object != NULL) {
        /* object guaranteed to be associated with vnode */
        ++object->ref_cnt;
        pool_mtx_unlock(object);
        ... continue working on object
    }

    Structural overhead: 0 bytes
    Parallelism: high

Now consider how this might be combined with the refcnt pool code:

CODE PIECE 1:

    vp = vnode ... we already have a ref count on the vp.

    while ((object = vp->v_object) != NULL) {
        pool_mtx_lock(&object->ref_cnt);
        if (vp->v_object == object)
            break;
        pool_mtx_unlock(&object->ref_cnt);
    }
    if (object != NULL) {
        /* object guaranteed to be associated with vnode */
        ++object->ref_cnt;
        pool_mtx_unlock(&object->ref_cnt);
    }

CODE PIECE 2 (compatible with CODE PIECE 1):

    /* object is a known good object that will not be going away soon */
    refcnt_bump(&object->ref_cnt);
    ... use object ...
    refcnt_drop(&object->ref_cnt);

And there you have it. An utterly simple API of four routines (refcnt routines and pool routines), with a huge amount of capability. -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 15:53: 4 2001 Delivered-To: freebsd-arch@freebsd.org Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by hub.freebsd.org (Postfix) with ESMTP id 2C06837B417 for ; Mon, 12 Nov 2001 15:52:56 -0800 (PST) Received: from fledge.watson.org (ak82hjs7hex92j@fledge.pr.watson.org [192.0.2.3]) by fledge.watson.org (8.11.6/8.11.5) with SMTP id fACNqjB37001; Mon, 12 Nov 2001 18:52:45 -0500 (EST) (envelope-from robert@fledge.watson.org) Date: Mon, 12 Nov 2001 18:52:45 -0500 (EST) From: Robert Watson X-Sender: robert@fledge.watson.org To: Terry Lambert Cc: freebsd-arch@FreeBSD.org Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <3BF05241.74F895EF@mindspring.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Mon, 12 Nov 2001, Terry Lambert wrote: > Robert Watson wrote: > > There are a number of uses of curproc in the netinet code, used to > > retrieve credentials for authorization somewhere down the stack, when no > > proc or thread pointer has been passed down. > > I think that the majority of the netinet code can be handled by using > the socket credential, instead of the process credential. The majority, yes, but not all. In particular, there are a number of desirable behaviors where you *do* want to use the process credential. In particular, relating to binding activities, where current semantics permit a 'privileged process' to create and bind sockets such that they have access to otherwise restricted ports, transfer them to unprivileged processes, but not grant the full scope of privilege to those processes. A primary example of this in use in practice might be a situation where an I/O socket is handed off from a network daemon to an unprivileged process, such as inetd handing off to fingerd: fingerd should not retain inetd's privileges regarding many aspects of the socket's behavior. This argument might be seen more convincingly from the perspective of UDP sockets. Yes, it is true that in most cases use of the socket credential is desirable, but in a number of important cases, it is not. There are some related cases in VFS, where we consider a per-jail securelevel based on the acting process, not the file-opening process. Similarly, there are some ioctl's on tty devices that are subject (process) credential authorized: these are in general present to handle the case where descriptors to these objects are (and must be) inherited. 
There are some related cases, such as fd passing via unix domain sockets, where the same properties can prove very useful: the ability to transfer access to sockets/files via LPC as 'rights' rather than delegating all rights. > > With the eventual addition > > of td->td_ucred, it will be desirable to use the credential for the > > current thread, rather than the proc, which will require locking to use. > > I think locking credential instances is bad. That is not what we're talking about. We're talking about locking the process structure. No one is suggesting this. > The real question you want to answer is whether or not the credential > instance that was used to acquire a socket should be used continuously > from there on out (i.e. it is a grant), or whether it should change when > the process credential changes (i.e. it is a lease). You seem to be > arguing for a lease. I would argue for a grant. > > One issue is that there are cases where write permission is tested > before each write. There are also cases, where you obtain a privileged > socket, and then relinquish privileges after obtaining it; such cases > are explicitly modelled on a grant model rather than a lease model. > > The point is that if the credentials are granted, then a change in > credential is not a change of the credential itself, but is instead a > copy-on-write proposition. In other words, credentials, once granted, > are priviledge stable. > > If this is the case, then they are written when they are instanced, > cloned before they are modified (indeed, it seems that the clone/modify > operation must be made atomic), and thus are never written once > instanced -- only destroyed on the 1->0 reference transition. Everyone agrees that the ucred semantics are copy-on-write. 
This is well-documented, and not something we're currently interested in changing (although some platforms have opted to sacrifice memory in order to reduce locking/atomic operations, and that's something we might eventually want to consider if we move to very fine-grained and highly parallel operation). > If so, then no locking is required, since the LCK CMPXCHG can be > utilized to do atomic increment and decrement on the reference counting, > without needing locks. There is some disagreement on the topic of atomic operations due to portability issues (among other things), but that's not what we're talking about. > > As I > > understand it, use of curproc was branded 'undesirable' at some point in > > the semi-distant past, and since that time, a reference to 'proc' has been > > passed down the stack. With a change to KSE, this has been translated to > > references the thread, but the issue remains the same. This comes up in > > particular because I have a tree where I have propagated the thread > > pointer down if_ioctl in the network stack: the normal ioctl call carries > > a thread pointer now, but when it is translated into if_ioctl by the > > network stack, that pointer is lost. This raises the question: should we > > (in practice) be adding process or thread pointers to many more of the > > function arguments, or should we switch to using curproc/curthread > > instead. > > The "curproc" undesirability stems primarily from credentials > enforcement during interrupt processing. I think that this is not an > insurmountable issue, but I would argue that these are more appropriate > for object credentials, where the objects in question are not threads or > processes. 
> > For example, if we were to process incoming TCP connections up through > the "accept" code at interrupt time, one might naievely assume that, > since the current socket code down through the accept processing code > off the queue filled in at NETISR seems to require a proc credential, > that it is therefore necessary to have a proc credential at interrupt > time in order to do this processing. > > The answer is that this is a false assumption, and is predicated on > historical code, and nothing more. > > Specifically, if I need a credential for a newly accepted socket that I > am now creating, I can add a reference to the listen socket credential > -- I //do not need// a process credential in order to do an accept. > > There is a lot of this type of fuzzy thinking, asking "how can I > propagate the process credential that I used to use for this operation > down to the underlying code?", when the real question should be "what is > the appropriate credential to use for this operation, and is the process > credential really what I want to use in this case?". I agree there has been a lot of fuzzy thinking. I also agree that, in every case, we need to carefully consider the credential used. In particular, this is true in the 'new world order' of td_ucred, where we'll now often have three credentials to decide from: (1) Mutable p_ucred (requires proc lock) (2) Cached td_ucred (requires no lock) (3) Cached so->so_cred, file->f_cred, et al. In most cases, (2) or (3) will be appropriate. In some situations, particularly when it comes to credential update, (1) will be appropriate. > I think it's possible to get rid of most of the process credential > references -- and therefore, most of the proc references -- at all > points below the /sys/kern/uipc_socket*.c level. No, it's not, in a number of very important cases, of which I've identified at least three above. Structuring code to have a notion of "but the kernel asked" vs. 
"but a user asked" is difficult, and something I'm not sure we have a grasp on how to approach. Sometimes, for example, FSCRED or NOCRED is used as a "special-case" credential to say "do it anyway". This is often broken when it comes to distributed file systems where a client system may not simply be able to assert "because I said so", and probably reflects unclear thinking on the topic. > > I don't pretend to have a grasp of all the issues here, so the purpose of > > this message is to raise the issues so that I can understand them. I have > > a tree where I've eliminated many references to curproc; however, I'm now > > wondering if it wouldn't simply be more useful to eliminate many of the > > references to struct proc in the function arguments, and use curproc > > instead, and add references to ucred (and related ref-counted structures) > > as needed for delegation types of situations. In particular, that would > > suggest the following changes: > > I think this is the wrong direction, but if you wanted to do this, I > think that you would need to put the cur* symbols into the per CPU > private pages. This is problematic in the extreme, because it means > that you must set these values each time going down, in order to be able > to substitute a per CPU global for the stack reference. > > I think this is a bad thing, in general, and will lead only to trouble > later. > > I would much rather that the credentials be object referenced off of > non-process, non-thread objects, based on whatever the correct scoping > really is, for the security model you want to enforce. My "accept" > example is only one of a class of changes that could facilitate this. I think everyone agrees that the 'cached credential' model is the right approach for many of these cases, but I think it's over-reaching to claim it's appropriate in all cases. 
The question then becomes how we access the relevant 'subject' credential to authorize the operation: is it something that is passed down via the call stack (possibly via 'struct thread *td'), or is it something implicit to the run-time environment ('curproc'/'curthread')? That is precisely the question I was trying to resolve through my post. If 'curproc'/'curthread' is truly undesirable, then we can simply eliminate its use, and replace that with almost universal passing of 'struct thread' (for the purposes of authorization, but also for other purposes: target of copyin/copyout/aio, scheduling, ktrace, ...). If it is acceptable to maintain the use of curproc, we may want to change some of our primitives to represent it being available. Right now, we're in a state of limbo: the official policy (if you will) is 'XXX'. We should either eliminate it from general use, or we should use it where it's appropriate :-). Robert N M Watson FreeBSD Core Team, TrustedBSD Project robert@fledge.watson.org NAI Labs, Safeport Network Services To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 15:56:58 2001 Delivered-To: freebsd-arch@freebsd.org Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by hub.freebsd.org (Postfix) with ESMTP id 05D5337B416 for ; Mon, 12 Nov 2001 15:56:56 -0800 (PST) Received: from fledge.watson.org (ak82hjs7hex92j@fledge.pr.watson.org [192.0.2.3]) by fledge.watson.org (8.11.6/8.11.5) with SMTP id fACNudB37043; Mon, 12 Nov 2001 18:56:39 -0500 (EST) (envelope-from robert@fledge.watson.org) Date: Mon, 12 Nov 2001 18:56:38 -0500 (EST) From: Robert Watson X-Sender: robert@fledge.watson.org To: Terry Lambert Cc: Matthew Dillon , freebsd-arch@FreeBSD.ORG Subject: Re: cur{thread/proc}, or not.
In-Reply-To: <3BF05877.B9E886D8@mindspring.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

On Mon, 12 Nov 2001, Terry Lambert wrote:

> Matthew Dillon wrote:
> > Yes, I believe this is how credentials work. I looked at
> > the code about 6 months ago. We should not have to do any
> > locking of the credential stuff, only simple mutexing
> > around the ref counter. That is how it should work
> > is how I believe it currently works.
>
> FWIW:
>
> Robert had implied that more heavyweight locking of the process (or
> thread) structure was necessary to access the credential, which is
> correct, if you are referencing it that way.

In the proposed model, there are two relevant subject credentials: the
thread credential, and the process credential. The thread credential is
static for the lifetime of the system call, and while the call is
on-going, it can be used without any locking/atomic primitives (with the
exception of when additional references are added to be cached in
objects). The process credential is shared, and, if you will, the 'real'
copy. This reference is changed as the process's notion of credential is
updated, and requires locks, as it might be changed by multiple threads
(potentially in parallel), as well as inspected by other processes for
the purposes of reporting (to ps, for example), or for access control
(signal delivery, debugging, ...).

One of the many nice things about the model, which should be credited to
John, is that it doesn't require locking operations in most usage
situations.

> The part of me you quoted here was a conclusion based on using direct
> references to value-stable credentials rather than value-volatile proc
> or thread structs.
It only works to refute Robert's argument if you
> include that; it's not correct to conclude that the way it currently
> works is sufficient in the face of the proc/thread dereference issues
> that Robert was trying to address (and which I tried to address by
> avoiding entirely).

...

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Mon Nov 12 15:57:42 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 02BEE37B417; Mon, 12 Nov 2001 15:57:39 -0800 (PST)
Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fACNvc507188; Mon, 12 Nov 2001 15:57:38 -0800 (PST) (envelope-from dillon)
Date: Mon, 12 Nov 2001 15:57:38 -0800 (PST)
From: Matthew Dillon
Message-Id: <200111122357.fACNvc507188@apollo.backplane.com>
To: John Baldwin
Cc: Terry Lambert, Robert Watson, freebsd-arch@FreeBSD.ORG
Subject: Re: cur{thread/proc}, or not.
References:
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

:http://www.freebsd.org/~jhb/patches/refcount.patch
:
:It's slightly different than this in that refcount_drop() returns a boolean

    Ok, I've read it. Ick. Could you reorganize it a bit to do something
    slightly different?

    Make sys/refcount.h provide a machine portable set of routines. Allow
    the machine/refcount.h headers to override the portable set. This way
    an architecture does *NOT* need to implement routines for yet another
    header file (or duplicate a lot of code over and over again).

    This business about INVARIANTS makes no sense to me. INVARIANTS should
    not totally change the way the refcount API works.
It certainly should not result in different structures! If we are embedding ref counts in every structure in the system simply setting or clearing INVARIANTS blows up our compatibility, which is bad. Also, I don't see any reason to embed yet another mutex in a structure. The ref count should be a simple int. Use a pool of mutexes. If you like I'll commit a set of generic pool mutexes that you can simply call. How about that? -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 15:58: 3 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205]) by hub.freebsd.org (Postfix) with ESMTP id E40DC37B416 for ; Mon, 12 Nov 2001 15:57:59 -0800 (PST) Received: (qmail 10400 invoked from network); 12 Nov 2001 23:57:58 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 12 Nov 2001 23:57:58 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <200111122350.fACNojg07127@apollo.backplane.com> Date: Mon, 12 Nov 2001 15:57:52 -0800 (PST) From: John Baldwin To: Matthew Dillon Subject: Re: cur{thread/proc}, or not. Cc: freebsd-arch@FreeBSD.org, Robert Watson , Terry Lambert Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 12-Nov-01 Matthew Dillon wrote: > You want to be very careful not to bloat the concept. We > already have severe bloatage in the mutex code and that has > led to a lot of unnecessary complexity. A huge amount, > in fact. We have so many types of mutexes it makes my > head spin and I'm not very happy about it. Forget about > 'shared' verses 'exclusive'. 
A reference count is a
>     reference count, that's all. If you keep the concept
>     simple you can implement more functionality horizontally
>     rather than implementing more complexity vertically.

Err, hang on. I wasn't doing shared counts. refcount_shared() would be a
simple primitive to return true if the refcount was > 1. I was trying to
see how the current API would fit with ucred mutexes, for example. If you
had looked at the patch, you would find that the API is very simple.

What I really should do is add atomic_fetchadd() (fetchadd on ia64, xadd
on 486+, locked load/conditional store loop on alpha, simulated with
atomic_cmpset() on other archs if needed); then refcount_drop() can just
be atomic_fetchadd(). This will change refcount_drop() to return the
current value rather than whether the value is zero. Please reread my
mail and the patch itself.

--

John Baldwin -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!" - http://www.FreeBSD.org/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Mon Nov 12 15:59: 1 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 6CBD137B41B; Mon, 12 Nov 2001 15:58:59 -0800 (PST)
Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fACNwxq07227; Mon, 12 Nov 2001 15:58:59 -0800 (PST) (envelope-from dillon)
Date: Mon, 12 Nov 2001 15:58:59 -0800 (PST)
From: Matthew Dillon
Message-Id: <200111122358.fACNwxq07227@apollo.backplane.com>
To: John Baldwin
Cc: freebsd-arch@FreeBSD.ORG, Robert Watson, Terry Lambert
Subject: Re: cur{thread/proc}, or not.
References:
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

:
:
:On 12-Nov-01 Matthew Dillon wrote:
:>     You want to be very careful not to bloat the concept. We
:>     already have severe bloatage in the mutex code and that has
:>     led to a lot of unnecessary complexity. A huge amount,
:>     in fact. We have so many types of mutexes it makes my
:
:Err, hang on. I wasn't doing shared counts. refcount_shared() would be a
:simple primitive to return true if the refcount was > 1. I was trying to see

    Sorry. Posted that before I read the patch.

                                        -Matt

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Mon Nov 12 16: 0:26 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9]) by hub.freebsd.org (Postfix) with ESMTP id A175C37B417 for ; Mon, 12 Nov 2001 16:00:20 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id PAA95484 for ; Mon, 12 Nov 2001 15:48:49 -0800 (PST)
Date: Mon, 12 Nov 2001 15:48:47 -0800 (PST)
From: Julian Elischer
To: arch@freebsd.org
Subject: Thread scheduling in the kernel
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

In an attempt to get the next part of the KSE work designed (design before
code you know.. a strange new concept) I've been trying to work out
the "correct" scheduling methods for such a system.

There are a few 'tricks' that need to be taken into account..

a few notes..

1/ Since threads running a syscall hit 'sleep' events
   the entities on the sleep queues must be the threads.
2/ The entity that is scheduled onto the run queues is the KSE
   (as the name suggests).

3/ If we have only one run queue, then KSEs for several processors
   from the same process may be on the same queue.

4/ If threads 'wake up' they are hung off a list of runnable threads
   somewhere. This list could be hanging off the process, or the KSE
   (actually more likely the KSEgroup than the process but...).

5/ If a KSE reaches the front of the queue, but the process
   that is running is not that for which that KSE has some affinity,
   does it get out of the way to allow another KSE in the queue
   to get run? Or does it just run and 'switch' everything over to the
   new available processor? Maybe the scheduler looks for the KSE from
   the same group that was assigned to that processor, and runs that,
   leaving the original KSE at the head of the queue?
   Maybe that happens until all the KSEs in the queue
   that were from that group have been run? In this case it becomes
   possible to always have a KSE from that group ready...

   Maybe the KSE-GROUP is what is put onto the run queue, and KSEs from
   that group are put on all processors that look for work, until all of
   them have been run? (This would ensure that threads from the same
   process would all be run at the same time, which is sometimes good and
   sometimes bad, depending on the application.)

6/ When a thread is made runnable it gets (in the present system) a
   priority. What priority does a KSE in the run queues have when it has
   threads of several different priorities? Do we sort them in priority
   order and drop the priority of the KSE(group) as we go through them
   until we have less priority than some other KSE?

7/ When a KSE runs out of work, how does it decide whether there is work
   that should be stolen from a fellow KSE? How does processor affinity
   affect this?

8/ If we had per-processor scheduling queues, how would that affect it?
   Which element gets put on the queues?
   Does a KSE stay on the run queue if it has un-run threads, even when
   it's running? How do we handle the arrival of new runnable threads
   with a KSE when it's running but a fellow KSE is not runnable? Do we
   bump the priority of the other KSE and hand it the new threads?

Remember, here are the 4 structures:

proc     - owner of all resources (FDs, memory, user creds) except cpu.

ksegroup - owner of all scheduler controlling characteristics
           (e.g. nice, realtime, number of processors), N per process.
           Owner of stats used for scheduling calculations.

kse      - kind of a placeholder. It gets scheduled onto a processor
           (by an as-yet unnamed mechanism) and provides cpu-cycles for
           the execution of 'threads' (see next).
           Max. of one per processor per KSE-group.

thread   - the in-kernel incarnation of a user thread that is presently
           in the kernel for some reason (e.g. syscall, pagefault, etc.).
           Holds ALL the state needed to resume after sleeping, and is
           the entity that is suspended when the thread hits a 'sleep'.
           "Unlimited" per KSEgroup. Probably has a short-term
           "favourite" KSE/processor.

When a thread blocks, the KSE looks for another thread to run, and if it
doesn't find one, it will create one, and upcall back to the userland to
see if there are more userland threads to run. (If not, it returns to
yield the processor.)

The question that has been giving me headaches is the relationship between
these elements, and the definitions of how these structures are linked up
and moved around to provide fair, efficient scheduling.

If a KSE has a high priority thread and a low priority thread runnable in
the kernel, but in reverse order, should it take the high priority from
the higher prio. thread and process both, or should it order the threads
and run the high prio one first? In this case what happens when a higher prio.
thread becomes runnable while one is already running? And if the highest
prio thread returns to userland, should the processor move to userland to
follow it, or switch to the next priority thread in the kernel?
Do all threads in the kernel have priority over all threads in userland?
(This might be a reasonable decision.)

These and other questions are in need of real discussion here on -arch.
We need to develop a document somewhere describing how we want this to
work. If we can have a good discussion here on these topics over a couple
of days I'll attempt to produce such a document and submit it for comment
as the basis of a second round of discussions.

Julian

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Mon Nov 12 16: 0:39 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9]) by hub.freebsd.org (Postfix) with ESMTP id 75DE337B41C; Mon, 12 Nov 2001 16:00:32 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id PAA95502; Mon, 12 Nov 2001 15:53:29 -0800 (PST)
Date: Mon, 12 Nov 2001 15:53:28 -0800 (PST)
From: Julian Elischer
To: John Baldwin
Cc: Matthew Dillon, freebsd-arch@FreeBSD.ORG, Robert Watson, Terry Lambert
Subject: Re: cur{thread/proc}, or not.
In-Reply-To:
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

we should re-examine the 'refcount' API

it's a very basic type and getting more-so all the time..
we can afford to have a 'standard' 'safe' way of doing reference counts.
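
For illustration, a reference count built only on atomic operations might look roughly like the userland sketch below. It assumes C11 <stdatomic.h> rather than the kernel's atomic primitives, and the ref_* names are hypothetical; they are not the API of the refcount patch mentioned in this thread. ref_drop() reports whether the caller released the last reference, and ref_shared() is true when the count is greater than 1.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland refcount type: a bare atomic counter, no mutex. */
typedef struct {
	atomic_uint count;
} ref_t;

/* A new object starts with one reference, held by its creator. */
static void
ref_init(ref_t *r)
{
	atomic_init(&r->count, 1u);
}

/* Take an additional reference; the caller must already hold one. */
static void
ref_hold(ref_t *r)
{
	atomic_fetch_add_explicit(&r->count, 1u, memory_order_relaxed);
}

/*
 * Release one reference.  Returns true if this was the last reference,
 * in which case the caller is responsible for destroying the object.
 */
static bool
ref_drop(ref_t *r)
{
	unsigned int old;

	old = atomic_fetch_sub_explicit(&r->count, 1u, memory_order_acq_rel);
	assert(old > 0);	/* dropping a reference nobody holds is a bug */
	return (old == 1);
}

/* True if more than one reference exists (count > 1). */
static bool
ref_shared(ref_t *r)
{
	return (atomic_load_explicit(&r->count, memory_order_acquire) > 1);
}
```

The fetch-and-subtract returns the previous value, so the 1->0 transition is detected without any lock around the counter.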
On Mon, 12 Nov 2001, John Baldwin wrote: > > On 12-Nov-01 Matthew Dillon wrote: > >:The point is that if the credentials are granted, then a > >:change in credential is not a change of the credential itself, > >:but is instead a copy-on-write proposition. In other words, > >:credentials, once granted, are priviledge stable. > >: > >:If this is the case, then they are written when they are > >:instanced, cloned before they are modified (indeed, it seems > >:that the clone/modify operation must be made atomic), and > >:thus are never written once instanced -- only destroyed on > >:the 1->0 reference transition. > >: > >:If so, then no locking is required, since the LCK CMPXCHG can > >:be utilized to do atomic increment and decrement on the > >:reference counting, without needing locks. > >:... > >: > >:-- Terry > > > > Yes, I believe this is how credentials work. I looked at > > the code about 6 months ago. We should not have to do any > > locking of the credential stuff, only simple mutexing > > around the ref counter. That is how it should work > > is how I believe it currently works. > > Yep. They use a mutex for the refcount for now, but I still have patches that > some people don't like for implementing a simple refcount API just using atomic > operations. > > -- > > John Baldwin -- http://www.FreeBSD.org/~jhb/ > PGP Key: http://www.baldwin.cx/~john/pgpkey.asc > "Power Users Use the Power to Serve!" 
- http://www.FreeBSD.org/ > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-arch" in the body of the message > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 16:20:18 2001 Delivered-To: freebsd-arch@freebsd.org Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9]) by hub.freebsd.org (Postfix) with ESMTP id 8B62137B41B; Mon, 12 Nov 2001 16:20:10 -0800 (PST) Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id QAA95611; Mon, 12 Nov 2001 16:10:26 -0800 (PST) Date: Mon, 12 Nov 2001 16:10:25 -0800 (PST) From: Julian Elischer To: Matthew Dillon Cc: John Baldwin , Terry Lambert , Robert Watson , freebsd-arch@FreeBSD.org Subject: Re: cur{thread/proc}, or not. In-Reply-To: <200111122350.fACNojg07127@apollo.backplane.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Mon, 12 Nov 2001, Matthew Dillon wrote: > You want to be very careful not to bloat the concept. We > already have severe bloatage in the mutex code and that has > led to a lot of unnecessary complexity. A huge amount, > in fact. We have so many types of mutexes it makes my > head spin and I'm not very happy about it. Forget about > 'shared' verses 'exclusive'. A reference count is a > reference count, that's all. If you keep the concept > simple you can implement more functionality horizontally > rather then implementing more complexity vertically. > > For example, consider this API for pool mutexes. [...] weren't you just complaining that there were too many kinds of mutex? 
I'm not sure how this fits under "reference counting API".

Anyhow, can you explain the idea of a pool mutex more clearly?

>
>
>     And there you have it. An utterly simple API of four
>     routines (refcnt routines and pool routines), with a huge
>     amount of capability.
>
>                                         -Matt
>
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-arch" in the body of the message
>

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Mon Nov 12 16:23:29 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 55EC337B416; Mon, 12 Nov 2001 16:23:27 -0800 (PST)
Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fAD0Msb07370; Mon, 12 Nov 2001 16:22:54 -0800 (PST) (envelope-from dillon)
Date: Mon, 12 Nov 2001 16:22:54 -0800 (PST)
From: Matthew Dillon
Message-Id: <200111130022.fAD0Msb07370@apollo.backplane.com>
To: Julian Elischer
Cc: John Baldwin, freebsd-arch@FreeBSD.ORG, Robert Watson, Terry Lambert
Subject: Re: cur{thread/proc}, or not.
References:
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

:
:we should re-examine the 'refcount' API
:
:it's a very basic type and getting more-so all the time..
:we can afford to have a 'standard' 'safe' way of doing reference counts.
:

    Well, the question we face here is: should a refcount API be self
    contained - apply only to ref counts, or should it be interlockable
    with other functionality?

    The best example of what I'm asking here can be found by observing
    the existing vnode interlock.
A single interlock mutex in each vnode currently handles a bunch of chores: (1) It locks v_usecount flags, (2) it interlocks the higher-level lockmgr lock, and (3) it interlocks certain combined operations. The current refcount API that John proposes would not be sufficient to be useful for the vnode v_usecount, but it probably would be sufficient for something like the ucred cr_ref count. What about other structures in the system? Do we need self-contained ref counts ala ucred, or do we need interlocking ref counts ala vnode? -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 16:24:46 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail12.speakeasy.net (mail12.speakeasy.net [216.254.0.212]) by hub.freebsd.org (Postfix) with ESMTP id 27D6F37B41B for ; Mon, 12 Nov 2001 16:24:38 -0800 (PST) Received: (qmail 70062 invoked from network); 13 Nov 2001 00:24:37 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail12.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 13 Nov 2001 00:24:37 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Mon, 12 Nov 2001 16:24:32 -0800 (PST) From: John Baldwin To: Julian Elischer Subject: Re: cur{thread/proc}, or not. Cc: freebsd-arch@FreeBSD.org, Robert Watson , Terry Lambert , Matthew Dillon Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 13-Nov-01 Julian Elischer wrote: > > > On Mon, 12 Nov 2001, Matthew Dillon wrote: > >> You want to be very careful not to bloat the concept. We >> already have severe bloatage in the mutex code and that has >> led to a lot of unnecessary complexity. A huge amount, >> in fact. 
We have so many types of mutexes it makes my
>>     head spin and I'm not very happy about it. Forget about
>>     'shared' versus 'exclusive'. A reference count is a
>>     reference count, that's all. If you keep the concept
>>     simple you can implement more functionality horizontally
>>     rather than implementing more complexity vertically.
>>
>>     For example, consider this API for pool mutexes.
>
> [...]
>
> weren't you just complaining that there were too many kinds of mutex?
> I'm not sure how this fits under "reference counting API"
>
> Anyhow can you explain the idea of a pool mutex more clearly?

Heh, think of it as a pool of mutexes, not a different type of mutex.
Instead of having 1 mutex for each object, you use a hash table of mutexes
for a set of objects. Thus, if you go from 50 objects to 500 and embed
1 mutex in each object, you bloat each object and end up with 500 locks
instead of 50. Using pool mutexes, you only have N mutexes regardless of
the number of objects. Note that if pool mutexes are non-recursive, they
can't be safely used when you might have more than one object of a given
set locked at a time, since two objects may hash to the same mutex. For
example, process locks are currently the only objects we do this with.

--

John Baldwin -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"
- http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 16:24:45 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205]) by hub.freebsd.org (Postfix) with ESMTP id BA7CD37B405 for ; Mon, 12 Nov 2001 16:24:34 -0800 (PST) Received: (qmail 2691 invoked from network); 13 Nov 2001 00:24:34 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 13 Nov 2001 00:24:34 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <200111122357.fACNvc507188@apollo.backplane.com> Date: Mon, 12 Nov 2001 16:24:28 -0800 (PST) From: John Baldwin To: Matthew Dillon Subject: Re: cur{thread/proc}, or not. Cc: freebsd-arch@FreeBSD.ORG, Robert Watson , Terry Lambert Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 12-Nov-01 Matthew Dillon wrote: >:http://www.freebsd.org/~jhb/patches/refcount.patch >: >:It's slightly different than this in that refcount_drop() returns a boolean > > Ok, I've read it. Ick. Could you reorgranize it a bit to do something > slightly different? > > Make sys/refcount.h provide a machine portable set of routines. Allow > the machine/refcount.h headers to override the portable set. This way > an architecture does *NOT* need to implement routines for yet another > header file (or duplicate a lot of code over and over again). Actually, if I add atomic_fetchadd(), the whole thing becomes MI and can just live in sys/refcount.h. > This business about INVARIANTS makes no sense to me. INVARIANTS should > not totally change the way the refcount API works. 
It certainly should
>     not result in different structures! If we are embedding ref counts
>     in every structure in the system simply setting or clearing INVARIANTS
>     blows up our compatibility, which is bad.

It could use a static system-wide mutex for all I care. The invariants
need the mutex so they can safely read the value for the purposes of the
KASSERTs, that is all. A pool would possibly be better than a single
mutex. My question is how does your pool work? Do you pick a mutex out of
the pool at init time like the lockmgr locks work? Or do you use a hash on
the object address?

>     Also, I don't see any reason to embed yet another mutex in a structure.
>     The ref count should be a simple int. Use a pool of mutexes. If you
>     like
>     I'll commit a set of generic pool mutexes that you can simply call. How
>     about that?

Well, there are different ways of doing lock pools. :) How about something
like this:

/*
 * Returns lock for address 'ptr'.
 */
struct mtx *
mtx_pool_find(void *ptr)
{
}

#define mtx_pool_lock(p)        mtx_lock(mtx_pool_find((p)))
#define mtx_pool_unlock(p)      mtx_unlock(mtx_pool_find((p)))

Then if a structure (like lockmgr locks or sx locks) wants to cache the
lock pointer instead of doing the hash all the time, it can just do

        foo->f_lock = mtx_pool_find(foo);

This actually isn't all that difficult, it just adds the ability to lookup
and cache the mutex associated with an address. I would also like it under
mtx_* so it's clear what type of locks are in the pool, but that's just
me. :)

>                                         -Matt

--

John Baldwin -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"
- http://www.FreeBSD.org/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Mon Nov 12 16:24:51 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail12.speakeasy.net (mail12.speakeasy.net [216.254.0.212]) by hub.freebsd.org (Postfix) with ESMTP id 859D737B41A for ; Mon, 12 Nov 2001 16:24:36 -0800 (PST)
Received: (qmail 70027 invoked from network); 13 Nov 2001 00:24:35 -0000
Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail12.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 13 Nov 2001 00:24:35 -0000
Message-ID:
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To:
Date: Mon, 12 Nov 2001 16:24:29 -0800 (PST)
From: John Baldwin
To: Julian Elischer
Subject: RE: Thread scheduling in the kernel
Cc: arch@freebsd.org
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

On 12-Nov-01 Julian Elischer wrote:
>
> In an attempt to get the next part of the KSE work designed (design before
> code you know.. a strange new concept) I've been trying to work out
> the "correct" scheduling methods for such a system.
>
> There are a few 'tricks' that need to be taken into account..
>
> a few notes..
>
>
> 1/ Since threads running a syscall hit 'sleep' events
> the entities on the sleep queues must be the threads.
>
> 2/ the entity that is scheduled onto the run queues is the KSE.
> (as the name suggests).
>
> 3/ If we have only one run queue, then KSEs for several processors
> from the same process, may be on the same queue.
>
> 4/ If threads 'wake up' they are hung off a list of runnable threads
> somewhere. This list could be hanging off the process, or the KSE.
> (actually more likely the KSEgroup than the process but...)
It should hang off the group.

> 5/ If a KSE reaches the front of the queue, but the process
> that is running is not that for which that KSE has some affinity,
> does it get out of the way to allow another KSE in the queue
> to get run? or does it just run and 'switch' everything over to the new
> available processor? Maybe the scheduler looks for the KSE from the same
> group, that was assigned to that processor, and runs that, leaving
> the original KSE at the head of the queue?
> Maybe that happens until all the KSEs in the queue
> that were from that group have been run? In this case it becomes possible
> to always have a KSE from that group ready...

Actually, I would remove the concept of affinities from the KSE itself.
Rather, I would let each thread have lastcpu like it does now, and when a
KSE goes to choose a thread, it chooses one that has lastcpu == current
cpuid.

> Maybe the KSE-GROUP is what is put onto the run queue, and KSEs from that
> group are put on all processors that look for work, until all of them
> have been run? (this would ensure that threads from the same process
> would all be run at the same time which is sometimes good, and sometimes
> bad, depending on the application.

I wouldn't do this. I would just put KSEs on the queues. However, I think
that KSEs actually can be even smaller than they are now. AFAICT they are
basically placeholders to sit on the runqueues and not good for much
else. :)

> 6/ When a Thread is made runnable it gets (in the present system) a
> priority. What priority does a KSE in the run queues have when it has
> threads of several different priorities? Do we sort them in priority order
> and drop the priority of the KSE(group) as we go through them
> until we have less priority than some other kse?

Actually, in theory the priorities are supposed to be per-KSE group, right?
In that case, changing the priority of an individual thread for the
purposes of priority propagation/inheritance or other shenanigans results
in creating a new group for that thread.

> 7/ when a KSE runs out of work, how does it decide whether there is work
> that should be stolen from a fellow KSE? How does processor affinity
> affect this?

If the list is per-ksegroup, then you just make a first pass preferring
threads that last ran on the current CPU. If you don't find anything, you
just grab the first thing on the list.

> 8/ If we had per-processor scheduling queues, How would that affect it?
> Which element gets put on the queues? Does a KSE
> stay on the run queue if it has un-run threads, even when it's running?
> How do we handle the arrival of new runnable threads with a KSE
> when it's running but a fellow KSE is not runnable. Do we
> bump the priority of the other KSE and hand it the new threads?

I'm not sure how this fits in that model unless you bind KSEs to CPUs or
something similar. Only threads really have affinity; KSEs don't really
care if they migrate, as they have no execution context that gets affected.
If the priorities are per-KSEgroup, then you get to assume that all
threads in a group are equal in priority, which is true unless a
particular thread temporarily gets a bump from priority propagation or the
process assigns a thread to a realtime priority or some such.

--

John Baldwin -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"
- http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 16:26:45 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205]) by hub.freebsd.org (Postfix) with ESMTP id 6667D37B419 for ; Mon, 12 Nov 2001 16:26:33 -0800 (PST) Received: (qmail 4112 invoked from network); 13 Nov 2001 00:26:32 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 13 Nov 2001 00:26:32 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Mon, 12 Nov 2001 16:26:27 -0800 (PST) From: John Baldwin To: John Baldwin Subject: Re: cur{thread/proc}, or not. Cc: Terry Lambert , Cc: Terry Lambert , Robert Watson , freebsd-arch@FreeBSD.ORG, Matthew Dillon Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 13-Nov-01 John Baldwin wrote: > Then if a structure (like lockmgr locks or sx locks) wants to cache the lock > pointer instead of doing the hash all the time, it can just do > > foo->f_lock = mtx_pool_find(foo); > > This actually isn't all that difficult, it just adds the ability to lookup > and > cache the mutex associated with an address. I would also like it under mtx_* > so it's clear what type of locks are in the pool, but that's just me. :) s/difficult/different/ It's not difficult either, but that wasn't my point. :) -- John Baldwin -- http://www.FreeBSD.org/~jhb/ PGP Key: http://www.baldwin.cx/~john/pgpkey.asc "Power Users Use the Power to Serve!" 
- http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 16:31:23 2001 Delivered-To: freebsd-arch@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id EA1DA37B405; Mon, 12 Nov 2001 16:31:20 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fAD0Unn07434; Mon, 12 Nov 2001 16:30:49 -0800 (PST) (envelope-from dillon) Date: Mon, 12 Nov 2001 16:30:49 -0800 (PST) From: Matthew Dillon Message-Id: <200111130030.fAD0Unn07434@apollo.backplane.com> To: Julian Elischer Cc: John Baldwin , Terry Lambert , Robert Watson , freebsd-arch@FreeBSD.ORG Subject: Re: cur{thread/proc}, or not. References: Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG :weren't you just complaining that there were too many kinds of mutex? :I'm not sure how this fits under "reference counting API" : :ANyhow can you explain the idea of a pool mutex more clearly? A pool mutex is the BSDI concept, similar to the wait address when you tsleep(). You get the mutex via a rendezvous point which is an arbitrary pointer, and release it the same way. Just as with the wait address the pointer you pass is arbitrary. It need not represent any sort of structure and the structures you use need not embed any actual mutex. Instead the pool code would obtain a mutex out of a pool of mutexes based on a hash of the supplied pointer. pool_mtx_lock(void *ptr); pool_mtx_unlock(void *ptr); Pool mutexes could be used just about *everywhere* where a mutex is used in a non-reentrant fashion now. i.e. where you obtain a mutex, do a bunch of stuff that does not require obtaining any additional mutexes, and then release the mutex (which is how most mutexes are supposed to work anyway). 
There are two huge advantages to using pool mutexes:

* No structural overhead. Zip. Zero. Zilch. Nada.

* The mutex itself is stable storage, even if the address is not, so you can use it to verify the second pointer when you have a pointer to a (stable) structure containing a field which is a pointer to an (unstable) structure.

    while ((ptr = stable->pointer) != NULL) {
        pool_mtx_lock(ptr);
        if (ptr == stable->pointer)
            break;
        pool_mtx_unlock(ptr);
    }
    /*
     * stable->pointer, if not NULL, is now locked and itself stable
     * until you release the mutex
     */

There are two disadvantages:

* Possible non-optimal cache mastership behavior. However, this is not a major disadvantage since it can be addressed by increasing the pool size.

* Slightly greater overhead to calculate the hash index and obtain the address of the pool mutex before obtaining or releasing it.

The pool mutex hash function would be something simple based on (int)ptr.

-Matt

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Mon Nov 12 16:34:54 2001 Delivered-To: freebsd-arch@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 2E4B937B416; Mon, 12 Nov 2001 16:34:52 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fAD0YqV07450; Mon, 12 Nov 2001 16:34:52 -0800 (PST) (envelope-from dillon) Date: Mon, 12 Nov 2001 16:34:52 -0800 (PST) From: Matthew Dillon Message-Id: <200111130034.fAD0YqV07450@apollo.backplane.com> To: John Baldwin Cc: freebsd-arch@FreeBSD.org, Robert Watson , Terry Lambert Subject: Re: cur{thread/proc}, or not. References: Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG

:is how does your pool work? Do you pick a mutex out of the pool at init time
:like the lockmgr locks work?
Or do you use a hash on the object address?

I was thinking non-chained hash on the object address. Real simple. (((int)ptr >> 5) ^ (int)ptr) & MASK or something like that. Or something even simpler... basically something we can play around with and optimize later without breaking the API we've constructed.

:Well, there are different ways of doing lock pools. :) How about something
:like this:
:
:/*
: * Returns lock for address 'ptr'.
: */
:struct mtx *
:mtx_pool_find(void *ptr)
:{
:}
:
:#define mtx_pool_lock(p) mtx_lock(mtx_pool_find((p)))
:#define mtx_pool_unlock(p) mtx_unlock(mtx_pool_find((p)))
:
:Then if a structure (like lockmgr locks or sx locks) wants to cache the lock
:pointer instead of doing the hash all the time, it can just do
:
: foo->f_lock = mtx_pool_find(foo);
:
:This actually isn't all that difficult, it just adds the ability to lookup and
:cache the mutex associated with an address. I would also like it under mtx_*
:so it's clear what type of locks are in the pool, but that's just me. :)

Yes I think the addition of a mtx_pool_find() call is excellent! A wonderful example of horizontal expansion (rather than vertical complexity, or vertical complication if I'm being cute).
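
[Editorial sketch, not part of the original message.] The pool idea discussed above can be illustrated in userland C. This is only a sketch under stated assumptions: it uses POSIX pthread mutexes in place of kernel mutexes, the pool size of 128 is an arbitrary choice, and the hash is the "(ptr >> 5) ^ ptr, masked" form floated in the message, with uintptr_t substituted for int so the cast is safe on 64-bit machines. The function names follow the mtx_pool_* naming proposed in the thread, but the bodies are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MTX_POOL_SIZE 128                 /* assumed; must be a power of two */
#define MTX_POOL_MASK (MTX_POOL_SIZE - 1)

static pthread_mutex_t mtx_pool[MTX_POOL_SIZE];

/* Initialize every mutex in the pool; call once before first use. */
static void
mtx_pool_init(void)
{
	for (size_t i = 0; i < MTX_POOL_SIZE; i++)
		pthread_mutex_init(&mtx_pool[i], NULL);
}

/* Return the pool mutex for address 'ptr' (non-chained hash). */
static pthread_mutex_t *
mtx_pool_find(const void *ptr)
{
	uintptr_t p = (uintptr_t)ptr;

	return (&mtx_pool[((p >> 5) ^ p) & MTX_POOL_MASK]);
}

/* Lock and unlock by rendezvous address; 'ptr' is never dereferenced. */
static void
mtx_pool_lock(const void *ptr)
{
	pthread_mutex_lock(mtx_pool_find(ptr));
}

static void
mtx_pool_unlock(const void *ptr)
{
	pthread_mutex_unlock(mtx_pool_find(ptr));
}
```

A structure that wants to skip the hash on every operation can store the result of mtx_pool_find() once, as in the foo->f_lock example quoted above. Because two different addresses may hash to the same mutex, this sketch is only safe when a thread holds at most one pool mutex at a time.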
-Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 16:35:31 2001 Delivered-To: freebsd-arch@freebsd.org Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by hub.freebsd.org (Postfix) with ESMTP id 3128637B405; Mon, 12 Nov 2001 16:35:26 -0800 (PST) Received: from fledge.watson.org (ak82hjs7hex92j@fledge.pr.watson.org [192.0.2.3]) by fledge.watson.org (8.11.6/8.11.5) with SMTP id fAD0ZGB37659; Mon, 12 Nov 2001 19:35:16 -0500 (EST) (envelope-from robert@fledge.watson.org) Date: Mon, 12 Nov 2001 19:35:15 -0500 (EST) From: Robert Watson X-Sender: robert@fledge.watson.org To: Terry Lambert Cc: John Baldwin , Matthew Dillon , freebsd-arch@FreeBSD.org Subject: Re: cur{thread/proc}, or not. In-Reply-To: <3BF05D4C.55A9A459@mindspring.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Mon, 12 Nov 2001, Terry Lambert wrote: > John Baldwin wrote: > > the refcount for now, but I still have patches that > > some people don't like for implementing a simple refcount API just using > > atomic operations. > > Please commit these. Using mutexes in this instance is just a happy way > to put the performance in the toilet. My recollection is that there was some concern about the size of the unit of atomic operation across platforms. I may not recall correctly, but my understanding was that some platforms substantially limited the potential size of the target of the atomic operation to less than the normal arithmetic unit size. Again, subject to the fallibility of my recollection, the maximum unit for atomic operations on Sparc64 was 24-bit, despite the native register size being 64-bit. 
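
[Editorial sketch, not part of the original message.] The "simple refcount API just using atomic operations" mentioned in the quoted text might look roughly like the following in userland C11. This is an assumption-laden illustration, not the patches under discussion: the names refcount_init, refcount_acquire and refcount_release are chosen for this sketch, and it uses <stdatomic.h> rather than the kernel's atomic primitives. The counter is a plain unsigned int, which is the operand size the concern above is about.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcount type for this sketch. */
struct refcount {
	atomic_uint count;
};

/* Set the initial number of references (normally 1 for the creator). */
static void
refcount_init(struct refcount *r, unsigned int value)
{
	atomic_init(&r->count, value);
}

/* Take an additional reference; no mutex is needed. */
static void
refcount_acquire(struct refcount *r)
{
	atomic_fetch_add_explicit(&r->count, 1, memory_order_relaxed);
}

/*
 * Drop a reference.  Returns true when the caller released the last
 * one and is therefore responsible for freeing the object.
 */
static bool
refcount_release(struct refcount *r)
{
	return (atomic_fetch_sub_explicit(&r->count, 1,
	    memory_order_acq_rel) == 1);
}
```

A caller would typically write "if (refcount_release(&obj->ref)) free(obj);" so that only the thread dropping the final reference destroys the object.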
Robert N M Watson FreeBSD Core Team, TrustedBSD Project robert@fledge.watson.org NAI Labs, Safeport Network Services To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 16:38:23 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205]) by hub.freebsd.org (Postfix) with ESMTP id ACBBC37B416 for ; Mon, 12 Nov 2001 16:38:20 -0800 (PST) Received: (qmail 11614 invoked from network); 13 Nov 2001 00:38:19 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 13 Nov 2001 00:38:19 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Mon, 12 Nov 2001 16:38:14 -0800 (PST) From: John Baldwin To: Robert Watson Subject: Re: cur{thread/proc}, or not. Cc: freebsd-arch@FreeBSD.org, Matthew Dillon , Terry Lambert Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 13-Nov-01 Robert Watson wrote: > > On Mon, 12 Nov 2001, Terry Lambert wrote: > >> John Baldwin wrote: >> > the refcount for now, but I still have patches that >> > some people don't like for implementing a simple refcount API just using >> > atomic operations. >> >> Please commit these. Using mutexes in this instance is just a happy way >> to put the performance in the toilet. > > My recollection is that there was some concern about the size of the unit > of atomic operation across platforms. I may not recall correctly, but my > understanding was that some platforms substantially limited the potential > size of the target of the atomic operation to less than the normal > arithmetic unit size. 
Again, subject to the fallibility of my > recollection, the maximum unit for atomic operations on Sparc64 was > 24-bit, despite the native register size being 64-bit. No, that was on sparc32, not sparc64. All of our current architectures would be fine with it. -- John Baldwin -- http://www.FreeBSD.org/~jhb/ PGP Key: http://www.baldwin.cx/~john/pgpkey.asc "Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 16:42:50 2001 Delivered-To: freebsd-arch@freebsd.org Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by hub.freebsd.org (Postfix) with ESMTP id 72EDC37B417; Mon, 12 Nov 2001 16:42:44 -0800 (PST) Received: from fledge.watson.org (ak82hjs7hex92j@fledge.pr.watson.org [192.0.2.3]) by fledge.watson.org (8.11.6/8.11.5) with SMTP id fAD0gWB37739; Mon, 12 Nov 2001 19:42:32 -0500 (EST) (envelope-from robert@fledge.watson.org) Date: Mon, 12 Nov 2001 19:42:31 -0500 (EST) From: Robert Watson X-Sender: robert@fledge.watson.org To: John Baldwin Cc: freebsd-arch@FreeBSD.org, Matthew Dillon , Terry Lambert Subject: Re: cur{thread/proc}, or not. In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Mon, 12 Nov 2001, John Baldwin wrote: > > My recollection is that there was some concern about the size of the unit > > of atomic operation across platforms. I may not recall correctly, but my > > understanding was that some platforms substantially limited the potential > > size of the target of the atomic operation to less than the normal > > arithmetic unit size. 
Again, subject to the fallibility of my > > recollection, the maximum unit for atomic operations on Sparc64 was > > 24-bit, despite the native register size being 64-bit. > > No, that was on sparc32, not sparc64. All of our current architectures > would be fine with it. Oh, good. I couldn't remember (hence some waffling) -- I have no problem with this. Robert N M Watson FreeBSD Core Team, TrustedBSD Project robert@fledge.watson.org NAI Labs, Safeport Network Services To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 18:17: 3 2001 Delivered-To: freebsd-arch@freebsd.org Received: from gull.prod.itd.earthlink.net (gull.mail.pas.earthlink.net [207.217.120.84]) by hub.freebsd.org (Postfix) with ESMTP id 542E637B416; Mon, 12 Nov 2001 18:16:54 -0800 (PST) Received: from dialup-209.247.141.234.dial1.sanjose1.level3.net ([209.247.141.234] helo=mindspring.com) by gull.prod.itd.earthlink.net with esmtp (Exim 3.33 #1) id 163T87-0001fz-00; Mon, 12 Nov 2001 18:16:52 -0800 Message-ID: <3BF082C6.BA7CA05D@mindspring.com> Date: Mon, 12 Nov 2001 18:17:42 -0800 From: Terry Lambert Reply-To: tlambert2@mindspring.com X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony} (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Robert Watson Cc: freebsd-arch@FreeBSD.org Subject: Re: cur{thread/proc}, or not. References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Robert Watson wrote: > > I think that the majority of the netinet code can be handled by using > > the socket credential, instead of the process credential. > > The majority, yes, but not all. In particular, there are a number of > desirable behaviors where you *do* want to use the process credential. 
In > particular, relating to binding activities, where current semantics permit > a 'privileged process' to create and bind sockets such that they have > access to otherwise restricted ports, transfer them to unprivileged > processes, but not grant the full scope of privilege to those processes. A > primary example of this in use in practice might be a situation where an > I/O socket is handed off from a network daemon to an unprivileged process, > such as inetd handing off to fingerd: fingerd should not retain inetd's > privileges regarding many aspects of the socket's behavior. This argument > might be seen more convincingly from the perspective of UDP sockets. Yes, > it is true that in most cases use of the socket credential is desirable, > but in a number of important cases, it is not. I think that this case implies that the socket creation and binding are separated, or that it's possible to re-bind a socket, once bound. I think the model needs to be "relinquish privileges"; in other words, there is an explicit handoff, at which point this is an allowable thing. Putting bits like "may bind to privileged port" on unbound sockets is, I think, a bad thing. The easiest way to deal with this is to replace the socket credential when the handoff takes place. However, I think that in most cases, the privilege handoff associated with the handoff of a privileged object is _intentional_, in order to have a process with full privileges (e.g. "root") hand off only partial privileges to another, otherwise unprivileged process. Specifically, it's a workaround for not having high granularity control over privileges and/or a capabilities model (capabilities models are, by definition, impossible to initialize without invoking some implicit privilege, so we can ignore them as academic curiosities for now).
If I hand off access to something by handing off a descriptor, rather than handing off a reference and forcing you to create your own descriptor, then my handoff of rights is intentional, and not something which needs to be blocked. > There are some related cases in VFS, where we consider a per-jail > securelevel based on the acting process, not the file-opening process. I don't like these, but I accept that they must exist for jail code to function. > Similarly, there are some ioctl's on tty devices that are subject > (process) credential authorized: these are in general present to handle > the case where descriptors to these objects are (and must be) inherited. I think this and the previous case can be folded together as "user option", similar to not being able to have simultaneous use of your X server, or the ability to load kernel modules, and secure level 2 at the same time: it's a trade off, and it is a conscious one made at user discretion. > There are some related cases, such as fd passing via unix domain sockets, > where the same properties can prove very useful: the ability to transfer > access to sockets/files via LPC as 'rights' rather than delegating all > rights. The read/write rights for an object opened by another process, or opened in an SUID case, with a subsequent relinquishing of the credentials that permitted the operation in the first place are the interesting cases, I think. The others fall into exception and administrative fiat. > > > With the eventual addition > > > of td->td_ucred, it will be desirable to use the credential for the > > > current thread, rather than the proc, which will require locking to use. > > > > I think locking credential instances is bad. > > That is not what we're talking about. We're talking about locking the > process structure. No one is suggesting this.
I think locking the process structure/thread structure is bad, particularly when you are only doing it to get at the credential, and it's probably the wrong credential anyway. > > There is a lot of this type of fuzzy thinking, asking "how can I > > propagate the process credential that I used to use for this operation > > down to the underlying code?", when the real question should be "what is > > the appropriate credential to use for this operation, and is the process > > credential really what I want to use in this case?". > > I agree there has been a lot of fuzzy thinking. I also agree that, in > every case, we need to carefully consider the credential used. In > particular, this is true in the 'new world order' of td_ucred, where we'll > now often have three credentials to decide from: > > (1) Mutable p_ucred (requires proc lock) > (2) Cached td_ucred (requires no lock) > (3) Cached so->so_cred, file->f_cred, et al. > > In most cases, (2) or (3) will be appropriate. In some situations, > particularly when it comes to credential update, (1) will be appropriate. I consider caching of mutable data harmful. Here, you imply that there will be cached mutable data in scope at the time that the decision to use the mutable data must be made. I think this is incredibly messy, and will only lead to mistakes about what's being used. I think that if a right is granted, it's granted, and only if you define a specific revocation protocol that can be procedurally linked so as to notify those people who need to make the assumption of non-mutability for performance reasons, is it OK to change it. I would be very tempted to: 1) const the credentials that are non-mutable; this is hard, but manageable through a cast after the reference count adjustment. 2) leave all unnecessary credentials out of scope, so that the decision as to which to use is obvious. 3) Discourage the implementation of a revocation protocol.
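
[Editorial sketch, not part of the original message.] Point 1) above, const credentials with "a cast after the reference count adjustment", could be pictured as follows. Everything here is hypothetical: struct cred_sketch and the cred_* helpers are invented for illustration and are not the kernel's struct ucred or its crhold()/crfree() routines. The sketch assumes credentials are heap-allocated, so casting away const to touch the reference count is well defined in C.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical credential; only the reference count ever changes. */
struct cred_sketch {
	atomic_uint  cr_ref;	/* reference count */
	unsigned int cr_uid;	/* fixed once the credential is published */
};

/* Allocate a credential holding one reference, or NULL on failure. */
static struct cred_sketch *
cred_alloc(unsigned int uid)
{
	struct cred_sketch *cr = malloc(sizeof(*cr));

	if (cr == NULL)
		return (NULL);
	atomic_init(&cr->cr_ref, 1);
	cr->cr_uid = uid;
	return (cr);
}

/* Take a new reference and hand it out read-only. */
static const struct cred_sketch *
cred_hold(const struct cred_sketch *cr)
{
	/* The cast is confined to the reference count adjustment. */
	struct cred_sketch *m = (struct cred_sketch *)cr;

	atomic_fetch_add(&m->cr_ref, 1);
	return (cr);
}

/* Drop a reference; returns 1 if this call freed the credential. */
static int
cred_drop(const struct cred_sketch *cr)
{
	struct cred_sketch *m = (struct cred_sketch *)cr;

	if (atomic_fetch_sub(&m->cr_ref, 1) == 1) {
		free(m);
		return (1);
	}
	return (0);
}
```

With this shape, a consumer holding a const pointer gets a compile error if it tries to change cr_uid, which is the "granted means granted" property argued for above; changing a credential would mean allocating a new one rather than mutating a shared one.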
I realize that this is a tradeoff between explicit and implicit, and that it results in irrevocable grant of privileges, in so far as the credential reference granted grants such privilege, but the cases where this is bad are incredible exceptions, such as revocation of a clearance to someone formerly having clearance on a machine where you are going to trust their processes to continue to run, at the lowered clearance level. Continuing to let the code run in this situation will probably happen when hell freezes over. > > I think it's possible to get rid of most of the process credential > > references -- and therefore, most of the proc references -- at all > > points below the /sys/kern/uipc_socket*.c level. > > No, it's not, in a number of very important cases, of which I've > identified at least three above. I disagree with two of them (see above), and think the third is an incredible exception. If you don't think so, then perhaps it's time we rethink the underlying problem being solved, and change the solution to be more rational so as to not require that. The problem here is that you are trying to do something as an afterthought (add security features not previously present), and avoid some of the redesign that should happen, at the cost of a performance penalty. > Structuring code to have a notion of "but the kernel asked" vs. "but a > user asked" is difficult, and something I'm not sure we have a grasp on > how to approach. Sometimes, for example, FSCRED or NOCRED is used as a > "special-case" credential to say "do it anyway". This is often broken > when it comes to distributed file systems where a client system may not > simply be able to assert "because I said so", and probably reflects > unclear thinking on the topic. Most distributed FS's have this issue. You're not going to resolve it by fiat, since it's impossible to do that without an enforceable distributed cache coherency protocol,
such that when the cached data gets to the client, it can be forcefully updated by the server, should it become necessary. I think you are concentrating too much on the revocation of granted rights issue, rather than on the grant of nonrevocable right issue, which is what I think should be the tack taken. > > I would much rather that the credentials be object referenced off of > > non-process, non-thread objects, based on whatever the correct scoping > > really is, for the security model you want to enforce. My "accept" > > example is only one of a class of changes that could facilitate this. > > I think everyone agrees that the 'cached credential' model is the right > approach for many of these cases, but I think it's over-reaching to claim > it's appropriate in all cases. The question then becomes, how do we > access the relevant 'subject' credential to authorize the operation: is it > something that is passed down via the call stack (possibly via 'struct > thread *td'), or is it something implicit to the run-time environment > ('curproc'/'curthread'), which is precisely the question I was trying to > resolve through my post. If 'curproc'/'curthread' is truly undesirable, > then we can simply eliminate its use, and replace that with almost > universal passing of 'struct thread' (for the purposes of authorization, > but also for other purposes: target of copyin/copyout/aio, scheduling, > ktrace, ...). If it is acceptable to maintain the use of curproc, we may > want to change some of our primitives to represent it being available. I think it's truly undesirable, since it limits the scalability of number of CPUs, and the ability to create clusters reasonably, by putting a lot of bus contention into operations which should not involve inter-CPU cache coherency issues in the first place.
I don't believe you will be able to grant privilege on one node of a NUMA cluster, translate the process to another node, and then revoke the privilege on a third node, and have that revocation take effect without leaving a race window in which the putatively de-credentialed process is still able to act with the granted credentials before the node on which it is running receives the revocation. This is exactly the X.509 certificate revocation problem, and it'd be nice if everyone could afford to check with the certificate authority to see the revocation list each and every time that they wanted to invoke the privilege granted by holding the certificate, but that's just not scalable to real world application. If you want to do this, then you need to change the way you handle it entirely; for X.509, this is generally done by providing for a time based expiration, and a recertification requirement. No one really looks at the CRLs, in practice. In the limit, this scales by granting the rights for longer and longer windows, as utilization increases. It's not very satisfying. > Right now, we're in a state of limbo: the official policy (if you will) is > 'XXX'. We should either eliminate it from general use, or we should use > it where it's appropriate :-). I definitely agree that there should be an unambiguous policy in place... I just think I disagree with you about what it should be. :-). 
-- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 18:30:23 2001 Delivered-To: freebsd-arch@freebsd.org Received: from gull.prod.itd.earthlink.net (gull.mail.pas.earthlink.net [207.217.120.84]) by hub.freebsd.org (Postfix) with ESMTP id 7720B37B416; Mon, 12 Nov 2001 18:30:20 -0800 (PST) Received: from dialup-209.247.141.234.dial1.sanjose1.level3.net ([209.247.141.234] helo=mindspring.com) by gull.prod.itd.earthlink.net with esmtp (Exim 3.33 #1) id 163TL9-0007BQ-00; Mon, 12 Nov 2001 18:30:19 -0800 Message-ID: <3BF085EC.AEE7DE9C@mindspring.com> Date: Mon, 12 Nov 2001 18:31:08 -0800 From: Terry Lambert Reply-To: tlambert2@mindspring.com X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony} (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: John Baldwin Cc: Julian Elischer , freebsd-arch@FreeBSD.org, Robert Watson , Matthew Dillon Subject: Re: cur{thread/proc}, or not. References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG John Baldwin wrote: > > ANyhow can you explain the idea of a pool mutex more clearly? > > Heh, think of it as a pool of mutexes, not a different type of > mutex. Instead of having 1 mutex for each object, you use a hash > table of mutexes for a set of objects. Thus, if you have 50 objects > vs. 500 objects, if you embed 1 mutex for each object, you bloat > each object and have 500 locks instead of 50 locks. Using pool > mutexes, you only have N number of mutexes regardless of the number > of mutexes. Note that if pool mutexes are non-recursive, they can't > be safely used when you might have more than one object of a given > set locked at a time. For example, process locks are the only object > we do this with currently. 
Pool mutexes are evil, if not implemented exactly right, and "exactly right" will vary over time. We need only look at the allocation unit optimization for things like struct socket allocations, which weren't updated when kevent came in and changed the size of the structure and therefore made the previous optimal cluster allocation block pessimal instead. Pool mutexes have the same problem that the fixed hash size for TCP connections has, in that you end up with relatively large collision domains when you get to a relatively large number of objects being hashed. Increasing the hash is not an answer, since it means that the default tuned case tries to handle the max for everything and ends up taking up so much memory you get the max for nothing. You might be able to keep a "pool ratio"; e.g. "for every N objects, there will be 1 mutex bucket", but then you get into the problem of refactoring the existing buckets. There is also the issue of collision domain; we tend to see this with an incredible number of client connections to HTTP servers with the in_pcbhash code (to keep the same example), because the hash values for port 80 on a particular IP tend to be pretty limited. In other words, I think that you will run into locality issues which will give you a hash that results in a particular bucket being inordinately busy, while another one is idle. Unless you address the locality balancing issue up front, it is a bad idea to use this for mutexes for objects, even if each object type gets its own mutex pool to avoid collision multiplication when multiple object types are referenced from the same pool. 
-- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 18:35: 5 2001 Delivered-To: freebsd-arch@freebsd.org Received: from gull.prod.itd.earthlink.net (gull.mail.pas.earthlink.net [207.217.120.84]) by hub.freebsd.org (Postfix) with ESMTP id A7C8E37B417; Mon, 12 Nov 2001 18:34:58 -0800 (PST) Received: from dialup-209.247.141.234.dial1.sanjose1.level3.net ([209.247.141.234] helo=mindspring.com) by gull.prod.itd.earthlink.net with esmtp (Exim 3.33 #1) id 163TPc-0003xn-00; Mon, 12 Nov 2001 18:34:56 -0800 Message-ID: <3BF08702.84DDFFE0@mindspring.com> Date: Mon, 12 Nov 2001 18:35:46 -0800 From: Terry Lambert Reply-To: tlambert2@mindspring.com X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony} (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Matthew Dillon Cc: Julian Elischer , John Baldwin , Robert Watson , freebsd-arch@FreeBSD.ORG Subject: Re: cur{thread/proc}, or not. References: <200111130030.fAD0Unn07434@apollo.backplane.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Matthew Dillon wrote: > There are two huge advantages to using pool mutexes: > > * No structural overhead. Zip. Zero. Zilch. Nada. > > * The mutex itself is stable storage, even if the address > is not, so you can use it to verify the second pointer when you > have a pointer to a (stable) structure containing a field which > is a pointer to an (unstable) structure. They are a solution to the retrofit problem. I.e. you use them when you would rather kludge around the problem instead of having to refactor the code. > There are two disadvantages: > > * Possible non-optimal cache mastership behavior. However, this > is not a major disadvantage since it can be addressed by > increasing the pool size. 
See my other post... this looks like a fix, but it doesn't scale, and it limits the system by default, and grossly complicates tuning for optimal performance for a particular task. > * Slightly greater overhead to calculate the hash index and obtain > the address of the pool mutex before obtaining or releasing it. > > The pool mutex hash function would be something simple based on > (int)ptr. You could pick a computationally trivial hash to avoid this; it's fairly irrelevant to the argument, either way, I think. -- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Nov 12 19:17:21 2001 Delivered-To: freebsd-arch@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id BD55D37B405; Mon, 12 Nov 2001 19:17:19 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fAD3HIE07916; Mon, 12 Nov 2001 19:17:18 -0800 (PST) (envelope-from dillon) Date: Mon, 12 Nov 2001 19:17:18 -0800 (PST) From: Matthew Dillon Message-Id: <200111130317.fAD3HIE07916@apollo.backplane.com> To: Terry Lambert Cc: John Baldwin , Julian Elischer , freebsd-arch@FreeBSD.ORG, Robert Watson Subject: Re: cur{thread/proc}, or not. References: <3BF085EC.AEE7DE9C@mindspring.com> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG :things like struct socket allocations, which weren't updated :when kevent came in and changed the size of the structure :and therefore made the previous optimal cluster allocation :block pessimal instead. : :Pool mutexes have the same problem that the fixed hash size :for TCP connections has, in that you end up with relatively :large collision domains when you get to a relatively large :number of objects being hashed. Well, I have to disagree. 
The primary scaling issue for pool mutexes is against the number of cpu's, not the number of structures, and the number of cpu's is relatively static. I agree that the hash function needs to be chosen carefully to maximize performance, but the advantage is that this (and other tricks) can be done inside the API, without having to mess around with anything outside the API. I think we have a far worse problem with structural bloat right now. Far, far worse. -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Nov 13 10:20:32 2001 Delivered-To: freebsd-arch@freebsd.org Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9]) by hub.freebsd.org (Postfix) with ESMTP id 2644337B417; Tue, 13 Nov 2001 10:20:17 -0800 (PST) Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id KAA99031; Tue, 13 Nov 2001 10:04:09 -0800 (PST) Date: Tue, 13 Nov 2001 10:04:07 -0800 (PST) From: Julian Elischer To: John Baldwin Cc: arch@freebsd.org Subject: RE: Thread scheduling in the kernel In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG (I notice you only comented on the first half, but that's a lot better than the complete lack of interest from everyone else.....) On Mon, 12 Nov 2001, John Baldwin wrote: > > On 12-Nov-01 Julian Elischer wrote: > > > > In an attempt to get the next part of the KSE work designed (design before > > code you know.. a strange new concept) I've been trying to work out > > the "correct" scheduling methods for such a system. > > > > There are a few 'tricks' that need to be taken into account.. > > > > a few notes.. 
> > > > > > 1/ Since threads running a syscall hit 'sleep' events > > the entities on the sleep queues must be the threads. > > > > 2/ the entity that is scheduled onto the run queues is the KSE. > > (as the name suggests). > > > > 3/ If we have only one run queue, then KSEs for several processors > > from the same process, may be on the same queue. > > > > 4/ If threads 'wake up' they are hung off a list of runnable threads > > somewhere. This list could be hanging off the process, or the KSE. > > (actually more likely the KSEgroup than the process but...) > > It should hang off the group. This was my original idea. However, I ended up splitting that queue up so that it was on each KSE and allowed a KSE with no work to steal work from another, i.e. a virtual single queue, with KSE affinity. If I bind KSEs to processors lightly, then I bind threads at the same time (lightly). The idea is that threads are put on the queue for the KSE on which they last ran. Only when a KSE runs out of runnable threads on its own list and still has the CPU will it try to steal work from another in the same group. The downside is that there is no overall priority between threads in a group. This is one thing I want to discuss... the queueing model. > > > 5/ If a KSE reaches the front of the queue, but the process > > that is running is not that for which that KSE has some affinity, > > does it get out of the way to allow another KSE in the queue > > to get run? or does it just run and 'switch' everything over to the new > > available processor? Maybe the scheduler looks for the KSE from the same > > group, that was assigned to that processor, and runs that, leaving > > the original KSE at the head of the queue? > > Maybe that happens until all the KSEs in the queue > > that were from that group have been run? In this case it becomes possible > > to always have a KSE from that group ready... > > Actually, I would remove the concept of affinities from the KSE > itself.
Rather I would let each thread have lastcpu like it does now, > and when a KSE goes to choose a thread, it chooses one that has the > lastcpu == current cpuid. That is another possible way of tackling the problem. How deep in the group's queue does the KSE look before it decides to just take 'any' thread? > > > Maybe the KSE-GROUP is what is put onto the run queue, and KSEs from that > > group are put on all processors that look for work, until all of them > > have been run? (this would ensure that threads from the same process > > would all be run at the same time which is sometimes good, and sometimes > > bad, depending on the application. > > I wouldn't do this. I would just put KSE's on the queues. However, I think > that KSE's actually can be even smaller than they are now. AFAICT they are > basically placeholders to sit on the runqueues and not good for much else. :) You may notice that that's approximately what I have done now... It's a "Kernel Schedulable Entity". It does leave some unfairness to the advantage of processes that have multiple KSEs. The KSE's job is to be scheduled on a run queue, and to provide a linkage point for other elements. It doesn't have very much else in it (maybe a state variable). > > > 6/ When a Thread is made runnable it gets (in the present system) a > > priority. What priority does a KSE in the run queues have when it has > > threads of several different priorities? Do we sort them in priority order > > and drop the priority of the KSE(group) as we go through them > > until we have less priority than some other kse? > > Actually, in theory the priorities are supposed to be per-KSE group > right? In that case, changing the priority of an individual thread > for the purposes of priority propagation/inheritance or other > shenanigans results in creating a new group for that thread. Static priority inputs (e.g. nice), yes.
It is quite possible that the KSEs in the group might have private priorities that diverge from this according to inputs from the threads they are running at that time.... My guess is that a kse from the group is elevated in priority when a thread with elevated priority becomes runnable. This brings up questions of pre-emption. > > > 7/ when a KSE runs out of work, how does it decide whether there is work > > that should be stolen from a fellow KSE? How does processor affinity > > affect this? > > If the list is per-ksegroup, then you just make a first pass > preferring threads that last ran on the current CPU. If you don't > find anything, you just grab the first thing on the list. The queue might be quite long.. maybe only scan the first N entries... > > > 8/ If we had per-processor scheduling queues, How would that affect it? > > Which element gets put on the queues? Does a KSE > > stay on the run queue if it has un-run threads, even when it's running? > > How do we handle the arrival of new runnable threads with a KSE > > when it's running but a fellow KSE is not runnable. Do we > > bump the priority of the other KSE and hand it the new threads? > > I'm not sure how this fits in that model unless you bind KSE's to > CPU's or something similar. Only threads really have affinity, KSE's > don't really care if they migrate as they have no execution context > that gets affected. If you can bind KSEs to processors a bit and have an affinity to a particular KSE, then you reduce the amount of work you have to do to select the next thread to run. It's a tradeoff. The selection might get very heavyweight if there are a LOT of threads to select from. This could make a scheme that does a selection between each thread to be run rather unscalable. If we had the affinity 'built in' to the structures/lists then it would be an O(1) operation, and more scalable.
(just an idea) > > If the priorities are per-KSEgroup, then you get to assume that all threads in > a group are equal in priority, which is true unless a particular thread > temporarily gets a bump from priority propagation or the process assigns a > thread to a realtime priority or some such. I don't think that the priority of all the threads is the same, but rather, the priority for them all is based upon the same BASE priority and statistics, i.e. the KSEG collects recent CPU usage and its base priority degrades, taking all its KSE's and threads with it. However I think that when a thread wakes up with an elevated priority (as they do now), then a KSE needs to be boosted in priority to run it. After that thread has returned to user mode, the next highest priority in the list is run, etc. This sort-of suggests per-KSEG priority queues... As the KSE runs lower and lower priority threads, its own priority could be lowered. When its priority is lower than another KSE on the run queues it loses the CPU. The question is whether the returning syscall follows the execution path back into userland. If it doesn't immediately, then it may never get to userland if it lowers its priority on another thread, and loses the CPU. This could lead to a case where a process has completed all its syscalls but is not able to proceed in userland. Maybe this is what should happen, but I doubt it.. [lots of my other comments missing.....] > > -- > > John Baldwin -- http://www.FreeBSD.org/~jhb/ > PGP Key: http://www.baldwin.cx/~john/pgpkey.asc > "Power Users Use the Power to Serve!"
- http://www.FreeBSD.org/ > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Nov 13 10:59:15 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail6.speakeasy.net (mail6.speakeasy.net [216.254.0.206]) by hub.freebsd.org (Postfix) with ESMTP id 5C52E37B416 for ; Tue, 13 Nov 2001 10:59:05 -0800 (PST) Received: (qmail 8730 invoked from network); 13 Nov 2001 18:58:31 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail6.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 13 Nov 2001 18:58:31 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Tue, 13 Nov 2001 10:59:04 -0800 (PST) From: John Baldwin To: Julian Elischer Subject: RE: Thread scheduling in the kernel Cc: arch@freebsd.org Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 13-Nov-01 Julian Elischer wrote: > (I notice you only commented on the first half, but that's a lot better > than the complete lack of interest from everyone else.....) Well, that's because I think that there are some basic things that need to be decided before we can make the decisions at the bottom of your e-mail. I think the first thing is that priorities need to be decided. The real question there is do we want per-thread priorities or per-ksegroup priorities? If you go totally with per-thread priorities which you seem to be favoring now and just use ksegroup for nice and fixed priorities, then that makes kse groups simpler at the expense of complicating KSE scheduling.
:) If we let each thread have a priority and maintain its own scheduling parameters then I would be tempted to put threads on the runqueues rather than kse's, primarily because otherwise you have the problem of having to go update the priorities of KSE's all the time when thread priorities change. And since you want a thread to run as soon as its priority allows, this means changing the priority of all KSE's in its group so it gets to run on the first one that becomes available. This would point to a single priority in the KSE group that all KSE's share that is the highest priority of all runnable threads. If the list of runnable threads in the KSE group is priority sorted (as it should be) this isn't all that difficult, as you look at the priority of the thread at the head of the list. However, every time that priority changes, you potentially have to go shuffle KSE's around on the queues, rather than just moving that one thread around on the queues (or putting it on the queue as the case may be).

One comment about preemption: probably what we will go with is only preempting for real time threads (including interrupt threads) and not preempting time sharing threads until their quantum is up or they block. The entire concept of KSE's, as I understand it, is to serve as a holder for the quantum so that we can give a multithreaded process its full quantum each go-around even if threads block, in which case we split it across multiple threads. In that case, I think this might be a reasonable model:

- Put threads on the runqueues.
- During choosethread, we use the following algorithm:
    - If the highest priority thread is a time sharing or idle thread, our current process is a KSE process, and we still have quantum left (I am foreseeing a KEF_FORCESWITCH for forcing a KSE switch when quantum expires) then we will look for another thread in this kse group in priority order with a bias for threads that last ran on the current cpu for affinity purposes. This may mean that we don't run the strictly highest priority thread in the system for the purposes of preserving quanta for time-shared processes.
    - Otherwise, we simply run the highest priority thread.

I think this will achieve the desired goal of a KSE (preserve quanta for multithreading time-sharing processes across threads) while still allowing things like priority propagation and preemption to work smoothly. It's also fairly simple.

If you use a priority bias for affinity, then that means you basically have a constant, say 4 (that is random, prolly not the real value), and you will basically artificially bump the priority of threads with lastcpu == cpuid by 4 during your comparison. This means you can stop walking the ksegroup list of threads when you hit a thread whose priority is more than 4 levels less than that of the highest priority thread. Also, the first thread you hit that meets the affinity requirement is the one you run; this should keep a (hopefully) decent bound on the amount of list walking done.

-- John Baldwin <>< http://www.FreeBSD.org/~jhb/ "Power Users Use the Power to Serve!"
- http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Nov 13 12:47:24 2001 Delivered-To: freebsd-arch@freebsd.org Received: from barry.mail.mindspring.net (barry.mail.mindspring.net [207.69.200.25]) by hub.freebsd.org (Postfix) with ESMTP id E022237B41A for ; Tue, 13 Nov 2001 12:47:15 -0800 (PST) Received: from src-fvzagy98ow5 (pool-63.49.205.54.troy.grid.net [63.49.205.54]) by barry.mail.mindspring.net (8.9.3/8.8.5) with SMTP id PAA10581 for ; Tue, 13 Nov 2001 15:47:13 -0500 (EST) Message-Id: <3.0.6.32.20011113154711.009793e0@imatowns.com> X-Sender: ggombert@imatowns.com X-Mailer: QUALCOMM Windows Eudora Light Version 3.0.6 (32) Date: Tue, 13 Nov 2001 15:47:11 -0500 To: freebsd-arch@FreeBSD.org From: Glenn Gombert Subject: freebsd-arch@FreeBSD.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG A couple of questions -- > 1/ Since threads running a syscall hit 'sleep' events > the entities on the sleep queues must be the threads. Will the sleep queues (which mix threads from multiple CPUs) impact performance as the number of threads dramatically increases? > 2/ the entity that is scheduled onto the run queues is the KSE. > (as the name suggests). Is there a number of threads per KSE that is optimum for performance? Will this be impacted by the upcalls that are made between the kernel and userland... what determines the optimum number of threads to be created per KSE (before another one is created for a particular application)? > 3/ If we have only one run queue, then KSEs for several processors > from the same process, may be on the same queue.
> 4/ If threads 'wake up' they are hung off a list of runnable threads > somewhere. This list could be hanging off the process, or the KSE. > (actually more likely the KSEgroup than the process but...) Does not one process serve as a 'container' for one KSEG and multiple KSEs and Threads? Does this process share the time quanta between all its member(s), or is it the job of the UTS to make these types of decisions? > 5/ If a KSE reaches the front of the queue, but the process > that is running is not that for which that KSE has some affinity, > does it get out of the way to allow another KSE in the queue > to get run? or does it just run and 'switch' everything over to the new > available processor? Maybe the scheduler looks for the KSE from the same > group, that was assigned to that processor, and runs that, leaving > the original KSE at the head of the queue? > Maybe that happens until all the KSEs in the queue > that were from that group have been run? In this case it becomes possible > to always have a KSE from that group ready... Does the kernel scheduler make the decisions about scheduling (once a Thread has been created)? What is the relationship between the UTS and the kernel scheduler (from the standpoint of time allocation when it comes to KSE's and individual threads)... > Maybe the KSE-GROUP is what is put onto the run queue, and KSEs from that > group are put on all processors that look for work, until all of them > have been run? (this would ensure that threads from the same process > would all be run at the same time which is sometimes good, and sometimes > bad, depending on the application. How is the time quanta divided up between KSE's and Threads? Who makes the decision when each should be placed on the runqueue and run at a particular time, when the responsibility is divided up between the UTS and kernel scheduler... > 6/ When a Thread is made runnable it gets (in the present system) a > priority.
What priority does a KSE in the run queues have when it has > threads of several different priorities? Do we sort them in priority order > and drop the priority of the KSE(group) as we go through them > until we have less priority than some other kse? > 7/ when a KSE runs out of work, how does it decide whether there is work > that should be stolen from a fellow KSE? How does processor affinity > affect this? Is a KSE not bound to a particular processor with the KSEG able to allocate resources between multiple processors? > 8/ If we had per-processor scheduling queues, How would that affect it? > Which element gets put on the queues? Does a KSE > stay on the run queue if it has un-run threads, even when it's running? > How do we handle the arrival of new runnable threads with a KSE > when it's running but a fellow KSE is not runnable. Do we > bump the priority of the other KSE and hand it the new threads?
> remember: here are the 4 structures:
> proc - owner of all resources (FDs, memory, user creds) except cpu
> Ksegroup - owner of all scheduler controlling characteristics
>   (e.g. nice, realtime, number of processors), N per process.
>   Owner of stats used for scheduling calculations.
> kse - kind of a placeholder. It gets scheduled onto
>   a processor (by a yet un-named mechanism) and provides
>   cpu-cycles for the execution of 'threads' (see next).
>   Max. of one per processor per KSE-group.
> thread - The in-kernel incarnation of a user thread that is presently
>   in the kernel for some reason (e.g. syscall, pagefault, etc)
>   Holds ALL the state needed to resume after sleeping, and is the
>   entity that is suspended when the thread hits a 'sleep'.
>   "unlimited" per KSEgroup. probably have a short-term
>   "favourite" KSE/processor.
What is the relationship between processors and processes? Does not one KSEG distribute multiple KSE's between multiple CPU's?
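As a reading aid, the ownership relationships in the four structures quoted above could be sketched in C roughly as follows. This is only an illustration under stated assumptions: every type, field, and function name here (proc_sketch, ksegrp_sketch, max_kses, ksegrp_add_kse, and so on) is hypothetical and not taken from the KSE tree, and max_kses stands in for "number of processors" so that the "max. of one KSE per processor per KSE-group" rule can be shown.

```c
#include <stddef.h>

struct proc_sketch;
struct ksegrp_sketch;

/* kse: a placeholder that gets scheduled onto a processor. */
struct kse_sketch {
    struct ksegrp_sketch *group;   /* owning KSE group */
    struct kse_sketch *next;       /* next KSE in the same group */
    int cpu;                       /* -1 while not on a processor */
};

/* thread: in-kernel state of a user thread; this is what sleeps. */
struct thread_sketch {
    struct ksegrp_sketch *group;   /* any number of threads per group */
    struct kse_sketch *favourite;  /* short-term preferred KSE, may be NULL */
    struct thread_sketch *next;
};

/* ksegroup: owner of the scheduling characteristics and statistics. */
struct ksegrp_sketch {
    struct proc_sketch *proc;      /* N groups per process */
    int nice;
    int max_kses;                  /* processors this group may use */
    int nkses;
    struct kse_sketch *kses;
    struct thread_sketch *threads;
    struct ksegrp_sketch *next;
};

/* proc: owner of all resources except cpu. */
struct proc_sketch {
    struct ksegrp_sketch *groups;
};

/* Attach a KSE to a group, refusing once the per-group limit is reached. */
static int ksegrp_add_kse(struct ksegrp_sketch *kg, struct kse_sketch *ke)
{
    if (kg->nkses >= kg->max_kses)
        return -1;
    ke->group = kg;
    ke->cpu = -1;
    ke->next = kg->kses;
    kg->kses = ke;
    kg->nkses++;
    return 0;
}
```

In this layout the process never appears on a run queue itself; it only reaches CPU time through the KSEs of its groups, which is one way to read "owner of all resources except cpu".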
> When a thread blocks, the KSE looks for another thread to run, and if it > doesn't find one, it will create one, and upcall back to the > userland to see if there are more userland threads to run. > (if not, it returns to yield the processor) > The question that has been giving me headaches is the > relationship between these elements, and > the definitions of how these structures are linked up and moved > around to provide fair efficient scheduling. > If a KSE has a high priority thread and a low priority thread > runnable in the kernel, but in reverse order, should it take > the high priority from the higher prio. thread and process both, > or should it order the threads and run the high prio one first. > In this case what happens when a higher prio. thread becomes runnable > while one is already running, and if the highest prio thread returns to > userland, should the processor move to userland to follow it, or > switch to the next priority thread in the kernel? > Do all threads in the kernel have priority over all threads in userland? > (this might be a reasonable decision).
Does the UTS have any input into the priority of how time is apportioned to each individual KSE / Thread in the kernel runqueue, or is that entirely up to the kernel scheduler... In general, does the memory allocation/reclamation scheme seem adequate for all the KSE's/Threads that will be created and destroyed with the new implementation... To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Nov 13 13:30:15 2001 Delivered-To: freebsd-arch@freebsd.org Received: from blount.mail.mindspring.net (blount.mail.mindspring.net [207.69.200.226]) by hub.freebsd.org (Postfix) with ESMTP id 2005D37B418 for ; Tue, 13 Nov 2001 13:30:09 -0800 (PST) Received: from src-fvzagy98ow5 (pool-63.49.207.166.troy.grid.net [63.49.207.166]) by blount.mail.mindspring.net (8.9.3/8.8.5) with SMTP id QAA01085 for ; Tue, 13 Nov 2001 16:30:06 -0500 (EST) Message-Id: <3.0.6.32.20011113163004.009803c0@imatowns.com> X-Sender: ggombert@imatowns.com (Unverified) X-Mailer: QUALCOMM Windows Eudora Light Version 3.0.6 (32) Date: Tue, 13 Nov 2001 16:30:04 -0500 To: arch@freebsd.org From: Glenn Gombert Subject: RE: Thread scheduling in the kernel Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG > Well, that's cause I think that there are some basic things that need to be > decided before we can make the decisions at the bottom of your e-mail. I > think the first thing is that priorities need to be decided. The real question > there is do we want per-thread priorities or per-ksegroup priorities?
If you > go totally with per-thread priorities which you seem to be favoring now and > just use ksegroup for nice and fixed priorities, then that makes kse groups > simpler at the expense of complicating KSE scheduling. :) Is not a KSE 'bound' to a particular CPU, with each thread in the KSE given a specific amount of time by the kernel scheduler? How does the UTS play in this (other than to sleep and wake up threads)... > If we let each thread have a priority and maintain its own scheduling > parameters then I would be tempted to put threads on the runqueue's rather than > kse's primarily because you then have the problem of having to go update the > priorities of KSE's all the time when thread priorities change. And since you > want a thread to run as soon as its priority allows, this means changing the > priority of all KSE's in its group so it gets to run on the first one that What is the mechanism for this (kernel scheduling), or does the UTS become involved as well? What is the impact on performance (if re-scheduling is done on a per-thread basis)... > becomes available. This would point to a single priority in the KSE group that > all KSE's share that is the highest priority of all runnable threads. If the > list of runnable threads in the KSE group is priority sorted (as it should be) > this isn't but so difficult as you look at the priority of the thread at the > head of the list. However, every time that priority changes, you have to go > shuffle KSE's around on the queue's potentially, rather than just moving that > one thread around on the queue's (or putting it on the queue as the case may > be). Is not time allocated between Threads in a KSE based upon the total amount of time available to the KSE?
If it is not this way, do not Threads associated with a particular application gain an 'unfair' advantage when it comes to running... > One comment about preemption: probably what we will go with is only preempting > for real time threads (including interrupt threads) and not preempt time > sharing threads until their quantum is up or they block. The entire concept of > KSE's as I understand it, is to serve as a holder for the quantum so that we > can give a multithreaded process its full quantum each go-around even if > threads block in which case we split it across multiple threads. In that case, > I think this might be a reasonable model: > I think this will achieve the desired goal of a KSE (preserve quanta for > multithreading time-sharing processes across threads) while still allowing > things like priority propagation and preemption to work smoothly. It's also > fairly simple. > If you use a priority bias for affinity, then that means you basically have a > constant, say 4 (that is random, prolly not the real value) then you will > basically artificially bump the priority of threads with lastcpu == cpuid by 4 > during your comparison. This means you can stop walking the ksegroup list of > threads when you hit a thread whose priority is more than 4 levels less than > that of the highest priority thread. Also, the first thread you hit that meets > the affinity requirement is the one you run, this should keep a (hopefully) > decent bound on the amount of list walking done.
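The bounded list walk in the quoted paragraph can be sketched as below. This is a minimal illustration, not code from any tree: it assumes the group's runnable threads sit in an array already sorted most urgent first, and that a larger prio value means more urgent (the kernel's real numeric priorities run the other way, lower is more urgent, so the comparison would flip there). The names thread_sketch, AFFINITY_BIAS, and choose_with_affinity are hypothetical, and the value 4 is the arbitrary constant from the message.

```c
#include <stddef.h>

#define AFFINITY_BIAS 4   /* arbitrary, as in the quoted message */

struct thread_sketch {
    int prio;      /* assumption: larger value = more urgent */
    int lastcpu;   /* cpu this thread last ran on */
};

/*
 * run[] holds the group's runnable threads, most urgent first.
 * Returns the index of the thread to run, or -1 if the list is empty.
 */
static int choose_with_affinity(const struct thread_sketch *run, size_t n,
                                int cpuid)
{
    if (n == 0)
        return -1;

    int best = run[0].prio;
    for (size_t i = 0; i < n; i++) {
        /* Even a biased thread this far down cannot beat the head: stop. */
        if (best - run[i].prio > AFFINITY_BIAS)
            break;
        /* First thread with affinity for this cpu is the one to run. */
        if (run[i].lastcpu == cpuid)
            return (int)i;
    }
    /* Nobody close enough has affinity: run the most urgent thread. */
    return 0;
}
```

Because the list is sorted, the walk touches only threads within AFFINITY_BIAS levels of the head, which is the bound on list walking that the message hopes for.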
If KSE's are bound to a particular CPU, how does this affect KSE's & Threads on different CPU' To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Nov 13 14: 4:47 2001 Delivered-To: freebsd-arch@freebsd.org Received: from net2.gendyn.com (nat2.gendyn.com [204.60.171.12]) by hub.freebsd.org (Postfix) with ESMTP id D071D37B405; Tue, 13 Nov 2001 14:04:42 -0800 (PST) Received: from [153.11.11.3] (helo=plunger.gdeb.com) by net2.gendyn.com with esmtp (Exim 2.12 #1) id 163lfW-000KuP-00; Tue, 13 Nov 2001 17:04:34 -0500 Received: from clcrtr.gdeb.com ([153.11.109.11]) by plunger.gdeb.com with SMTP id QAA01515; Tue, 13 Nov 2001 16:54:11 -0500 (EST) Received: from gdeb.com (gpz.clc.gdeb.com [192.168.3.12]) by clcrtr.gdeb.com (8.11.4/8.11.4) with ESMTP id fADMAHK47646; Tue, 13 Nov 2001 17:10:17 -0500 (EST) (envelope-from deischen@gdeb.com) Message-ID: <3BF198E2.24EE658F@gdeb.com> Date: Tue, 13 Nov 2001 17:04:18 -0500 From: Daniel Eischen X-Mailer: Mozilla 4.78 [en] (X11; U; SunOS 5.8 sun4u) X-Accept-Language: en MIME-Version: 1.0 To: Julian Elischer Cc: John Baldwin , arch@FreeBSD.ORG Subject: Re: Thread scheduling in the kernel References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Julian Elischer wrote: > > (I notice you only comented on the first half, but that's a lot better > than the complete lack of interest from everyone else.....) > > On Mon, 12 Nov 2001, John Baldwin wrote: > > > > > On 12-Nov-01 Julian Elischer wrote: > > > > > > In an attempt to get the next part of the KSE work designed (design before > > > code you know.. a strange new concept) I've been trying to work out > > > the "correct" scheduling methods for such a system. 
> > > > > > There are a few 'tricks' that need to be taken into account.. > > > > > > a few notes.. > > > > > > > > > 1/ Since threads running a syscall hit 'sleep' events > > > the entities on teh sleep queues must be the threads. > > > > > > 2/ the entity that is scheduled onto the run queues is the KSE. > > > (as the name suggests). > > > > > > 3/ If we have only one run queue, then KSEs for several processors > > > from the same process, may be on the same queue. > > > > > > 4/ If threads 'wake up' they are hung of a list of runnable threads > > > somewhere. This list could be hanging off the process, or the KSE. > > > (actually more likely the KSEgroup than the process but...) > > > > It should hang off the group. > > This was my original idea. However I ended up splitting that queue up so > that it was on each KSE and allowed a KSE with no work to steal work from > another. i.e. a virtual single queue, with KSE affinity. If I bind KSEs to > processors lightly, then I bind threads at the same time. (lightly) > > The idea is that threads are put on the queue for the KSE on which they > last ran. Only when a KSE runs out of runnable threads on its own list and > still has teh CPU, will it try steal work from another in the same group. > > The downside is that there is no overall priority between threads in a > group.. This is one thing I want o discuss... the queueing model. I just want to make a couple comments without getting too involved in how the kernel deals with threads, KSEs, and KSE groups. I think that at first there will probably be only 1 UTS run queue per KSE group. This probably means that the UTS will also hang blocked threads off its version of the KSE group. I guess in this case, unblock events from the kernel can be sent to any KSE within the group. 
But if the UTS wants to have a run queue for each KSE, then the kernel should only be handling the blocking and unblocking of threads within the same KSE in which the thread originally entered the kernel. I think the UTS will only set priorities for the KSE group. It doesn't make sense to me for the (application visible) priority to be anywhere other than the KSE group. If the kernel needs to temporarily play with priorities for its own purposes (inheriting priority when holding a mutex), then each thread probably needs an active priority which is MAX(kse->inherited, kseg->prio). -- Dan Eischen To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Nov 13 14:23: 6 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205]) by hub.freebsd.org (Postfix) with ESMTP id 5718A37B405 for ; Tue, 13 Nov 2001 14:22:53 -0800 (PST) Received: (qmail 4143 invoked from network); 13 Nov 2001 22:22:51 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 13 Nov 2001 22:22:51 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <3BF198E2.24EE658F@gdeb.com> Date: Tue, 13 Nov 2001 14:22:51 -0800 (PST) From: John Baldwin To: Daniel Eischen Subject: Re: Thread scheduling in the kernel Cc: arch@FreeBSD.ORG, Julian Elischer Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 13-Nov-01 Daniel Eischen wrote: > Julian Elischer wrote: >> >> (I notice you only comented on the first half, but that's a lot better >> than the complete lack of interest from everyone else.....) 
>> >> On Mon, 12 Nov 2001, John Baldwin wrote: >> >> > >> > On 12-Nov-01 Julian Elischer wrote: >> > > >> > > In an attempt to get the next part of the KSE work designed (design >> > > before >> > > code you know.. a strange new concept) I've been trying to work out >> > > the "correct" scheduling methods for such a system. >> > > >> > > There are a few 'tricks' that need to be taken into account.. >> > > >> > > a few notes.. >> > > >> > > >> > > 1/ Since threads running a syscall hit 'sleep' events >> > > the entities on teh sleep queues must be the threads. >> > > >> > > 2/ the entity that is scheduled onto the run queues is the KSE. >> > > (as the name suggests). >> > > >> > > 3/ If we have only one run queue, then KSEs for several processors >> > > from the same process, may be on the same queue. >> > > >> > > 4/ If threads 'wake up' they are hung of a list of runnable threads >> > > somewhere. This list could be hanging off the process, or the KSE. >> > > (actually more likely the KSEgroup than the process but...) >> > >> > It should hang off the group. >> >> This was my original idea. However I ended up splitting that queue up so >> that it was on each KSE and allowed a KSE with no work to steal work from >> another. i.e. a virtual single queue, with KSE affinity. If I bind KSEs to >> processors lightly, then I bind threads at the same time. (lightly) >> >> The idea is that threads are put on the queue for the KSE on which they >> last ran. Only when a KSE runs out of runnable threads on its own list and >> still has teh CPU, will it try steal work from another in the same group. >> >> The downside is that there is no overall priority between threads in a >> group.. This is one thing I want o discuss... the queueing model. > > I just want to make a couple comments without getting too involved > in how the kernel deals with threads, KSEs, and KSE groups. > > I think that at first there will probably be only 1 UTS run > queue per KSE group. 
This probably means that the UTS will
> also hang blocked threads off its version of the KSE group. I
> guess in this case, unblock events from the kernel can be sent
> to any KSE within the group. But if the UTS wants to have a
> run queue for each KSE, then the kernel should only be handling
> the blocking and unblocking of threads within the same KSE
> in which the thread originally entered the kernel.
>
> I think the UTS will only set priorities for the KSE group. It
> doesn't make sense to me for the (application visible) priority
> to be anywhere other than the KSE group. If the kernel needs
> to temporarily play with priorities for its own purposes (inheriting
> priority when holding a mutex), then each thread probably needs
> an active priority which is MAX(kse->inherited, kseg->prio).

What about the priorities passed in to condition variables and msleep/tsleep? That is why I think Julian wanted per-thread priorities. Also, the priority gained through priority propagation is _definitely_ a thread property and not a KSE property, since the thread, not the KSE, owns the lock that has the associated priority.

--

John Baldwin <>< http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve!"
- http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Nov 13 14:40:17 2001 Delivered-To: freebsd-arch@freebsd.org Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9]) by hub.freebsd.org (Postfix) with ESMTP id AFF2537B418; Tue, 13 Nov 2001 14:40:10 -0800 (PST) Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id OAA00143; Tue, 13 Nov 2001 14:25:13 -0800 (PST) Date: Tue, 13 Nov 2001 14:25:11 -0800 (PST) From: Julian Elischer To: Daniel Eischen Cc: John Baldwin , arch@FreeBSD.ORG Subject: Re: Thread scheduling in the kernel In-Reply-To: <3BF198E2.24EE658F@gdeb.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Tue, 13 Nov 2001, Daniel Eischen wrote: > I think the UTS will only set priorities for the KSE group. It > doesn't make sense to me for the (application visible) priority > to be anywhere other than the KSE group. If the kernel needs > to temporarily play with priorities for its own purposes (inheriting > priority when holding a mutex), then each thread probably needs > an active priority which is MAX(kse->inherited, kseg->prio). MAX(thread->inherited, kseg->prio) ? 
> > -- > Dan Eischen > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Nov 13 14:46:35 2001 Delivered-To: freebsd-arch@freebsd.org Received: from net2.gendyn.com (nat2.gendyn.com [204.60.171.12]) by hub.freebsd.org (Postfix) with ESMTP id 62EEA37B405; Tue, 13 Nov 2001 14:46:30 -0800 (PST) Received: from [153.11.11.3] (helo=plunger.gdeb.com) by net2.gendyn.com with esmtp (Exim 2.12 #1) id 163mJv-000M2p-00; Tue, 13 Nov 2001 17:46:19 -0500 Received: from clcrtr.gdeb.com ([153.11.109.11]) by plunger.gdeb.com with SMTP id RAA02907; Tue, 13 Nov 2001 17:35:55 -0500 (EST) Received: from gdeb.com (gpz.clc.gdeb.com [192.168.3.12]) by clcrtr.gdeb.com (8.11.4/8.11.4) with ESMTP id fADMq6K47675; Tue, 13 Nov 2001 17:52:06 -0500 (EST) (envelope-from deischen@gdeb.com) Message-ID: <3BF1A2B0.A0BC7469@gdeb.com> Date: Tue, 13 Nov 2001 17:46:08 -0500 From: Daniel Eischen X-Mailer: Mozilla 4.78 [en] (X11; U; SunOS 5.8 sun4u) X-Accept-Language: en MIME-Version: 1.0 To: John Baldwin Cc: arch@FreeBSD.org, Julian Elischer Subject: Re: Thread scheduling in the kernel References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG John Baldwin wrote: > > On 13-Nov-01 Daniel Eischen wrote: > > Julian Elischer wrote: > >> > >> (I notice you only comented on the first half, but that's a lot better > >> than the complete lack of interest from everyone else.....) > >> > >> On Mon, 12 Nov 2001, John Baldwin wrote: > >> > >> > > >> > On 12-Nov-01 Julian Elischer wrote: > >> > > > >> > > In an attempt to get the next part of the KSE work designed (design > >> > > before > >> > > code you know.. a strange new concept) I've been trying to work out > >> > > the "correct" scheduling methods for such a system. 
> >> > > > >> > > There are a few 'tricks' that need to be taken into account.. > >> > > > >> > > a few notes.. > >> > > > >> > > > >> > > 1/ Since threads running a syscall hit 'sleep' events > >> > > the entities on teh sleep queues must be the threads. > >> > > > >> > > 2/ the entity that is scheduled onto the run queues is the KSE. > >> > > (as the name suggests). > >> > > > >> > > 3/ If we have only one run queue, then KSEs for several processors > >> > > from the same process, may be on the same queue. > >> > > > >> > > 4/ If threads 'wake up' they are hung of a list of runnable threads > >> > > somewhere. This list could be hanging off the process, or the KSE. > >> > > (actually more likely the KSEgroup than the process but...) > >> > > >> > It should hang off the group. > >> > >> This was my original idea. However I ended up splitting that queue up so > >> that it was on each KSE and allowed a KSE with no work to steal work from > >> another. i.e. a virtual single queue, with KSE affinity. If I bind KSEs to > >> processors lightly, then I bind threads at the same time. (lightly) > >> > >> The idea is that threads are put on the queue for the KSE on which they > >> last ran. Only when a KSE runs out of runnable threads on its own list and > >> still has teh CPU, will it try steal work from another in the same group. > >> > >> The downside is that there is no overall priority between threads in a > >> group.. This is one thing I want o discuss... the queueing model. > > > > I just want to make a couple comments without getting too involved > > in how the kernel deals with threads, KSEs, and KSE groups. > > > > I think that at first there will probably be only 1 UTS run > > queue per KSE group. This probably means that the UTS will > > also hang blocked threads off its version of the KSE group. I > > guess in this case, unblock events from the kernel can be sent > > to any KSE within the group. 
But if the UTS wants to have a
> > run queue for each KSE, then the kernel should only be handling
> > the blocking and unblocking of threads within the same KSE
> > in which the thread originally entered the kernel.
> >
> > I think the UTS will only set priorities for the KSE group. It
> > doesn't make sense to me for the (application visible) priority
> > to be anywhere other than the KSE group. If the kernel needs
> > to temporarily play with priorities for its own purposes (inheriting
> > priority when holding a mutex), then each thread probably needs
                                              ^^^^^^
> > an active priority which is MAX(kse->inherited, kseg->prio).
                                    ^^^ s/kse/thread

Sorry, I meant thread above, not kse.

> What about the priorities passed in to condition variables and msleep/tsleep?

The KSE group has the base priority from which all member threads inherit. The active priority is stored in each thread and is the maximum of the KSE group's (base) priority and any priority that the thread inherits from synchronization objects.

> That is why I think Julian wanted per-thread priorities. Also, the priority
> propagation priority is _definitely_ a thread and not a KSE property, since the
> thread owns the lock that has the associated priority, not the KSE.

Yep, sorry, I did mean thread above, not KSE.
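A minimal C sketch of the active-priority rule described above, for illustration only. The type and field names (`ksegrp_sketch`, `kg_base_pri`, `td_inherited_pri`) are invented for this example and are not taken from the KSE tree. It also assumes, as the use of MAX() suggests, that a larger number means a more urgent priority, which is the reverse of the kernel's usual numbering.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types; the names are illustrative only. */
struct ksegrp_sketch {
    int kg_base_pri;                 /* base priority the UTS sets for the group */
};

struct thread_sketch {
    struct ksegrp_sketch *td_ksegrp; /* group the thread belongs to */
    int td_inherited_pri;            /* highest priority lent by held locks, 0 if none */
};

/*
 * Active priority = MAX(thread->inherited, kseg->prio).
 * Assumes larger values are more urgent.
 */
int
active_priority(const struct thread_sketch *td)
{
    int base, inherited;

    assert(td != NULL && td->td_ksegrp != NULL);
    base = td->td_ksegrp->kg_base_pri;
    inherited = td->td_inherited_pri;
    return (inherited > base ? inherited : base);
}
```

With a group base priority of 10, a thread that has inherited 25 from a contended mutex would run at 25 and fall back to 10 once the inherited value is cleared.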
--
Dan Eischen

From owner-freebsd-arch Tue Nov 13 16:20:28 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9]) by hub.freebsd.org (Postfix) with ESMTP id 87C7E37B405 for ; Tue, 13 Nov 2001 16:20:16 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id QAA00597; Tue, 13 Nov 2001 16:13:53 -0800 (PST)
Date: Tue, 13 Nov 2001 16:13:51 -0800 (PST)
From: Julian Elischer
To: Glenn Gombert
Cc: freebsd-arch@FreeBSD.org
Subject: Re: freebsd-arch@FreeBSD.org
In-Reply-To: <3.0.6.32.20011113154711.009793e0@imatowns.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN
Content-Transfer-Encoding: QUOTED-PRINTABLE
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

On Tue, 13 Nov 2001, Glenn Gombert wrote:

> A couple of questions --

OK, but remember this stuff is still under discussion, so any answer given here may be wrong :-)

>
> >1/ Since threads running a syscall hit 'sleep' events
> >the entities on the sleep queues must be the threads.
>
> Will the sleep queues (which mix threads from multiple CPUs) impact
> performance as the number of threads dramatically increase ..

Not really, for several reasons. There are an awful lot of sleep queues, and they are manipulated using O(1) operations. "Wakeup_one()" is also independent of the number of entries, and we don't expect the average number of threads per process to be much more than 1.

>
> > 2/ the entity that is scheduled onto the run queues is the KSE.
> > (as the name suggests).
>
> Is there a number of threads per KSE that is optimum for
> performance?
will this be impacted by the UpCalls that are made
> between the Kernel and User land..what determines the optimum number
> of threads to be created per KSE (before another one is created for a
> particular application)..

Up to a sane limit, the number of threads per KSE/KSEGROUP is unlimited and controlled by the UTS. The kernel will always ask the UTS if it has another thread to run whenever the KSE discovers it has no work to do. The UTS has the option of either running a new thread or returning a 'yield()' to the kernel.

>
> > 3/ If we have only one run queue, then KSEs for several processors
> > from the same process, may be on the same queue.
>
> > 4/ If threads 'wake up' they are hung off a list of runnable threads
> > somewhere. This list could be hanging off the process, or the KSE.
> > actually more likely the KSEgroup than the process but...)
>
> .. does not one process serve as a 'container' for one KSEG and
> multiple KSE and Threads ?? does this process share the time quanta
> between all its member(s) or is it the job of the UTS to make these
> type of decisions??

There is a one-to-many relationship at each step:

1 process to N KSE groups (a process may start several process groups and assign different scheduling characteristics to each).

1 KSEGRP to N KSEs (each KSE can be used to reserve some cycles on a processor). It makes no sense to have more KSEs per KSEGRP than there are processors.

1 KSEGRP to N threads. In fact it makes no sense to have more active KSEs than threads, and there tends to be a thread assigned to each KSE at minimum. (A yielded KSE may not have a thread assigned to it, but then I said ACTIVE.. :-)

>
>
> > 5/ If a KSE reaches the front of the queue, but the process
> > that is running is not that for which that KSE has some affinity,
> > does it get out of the way to allow another KSE in the queue
> > to get run? or does it just run and 'switch' everything over to the new
> > available processor?
Maybe the scheduler looks for the KSE from the same
> > group, that was assigned to that processor, and runs that, leaving
> > the original KSE at the head of the queue?
> > Maybe that happens until all the KSEs in the queue
> > that were from that group have been run? In this case it becomes possible
> > to always have a KSE from that group ready...
>
> Does the kernel scheduler make the decisions about scheduling (once a
> Thread has been created) ?..what is the relationship between the UTS
> and the Kernel Scheduler (from the standpoint of time allocation when
> it comes to Use's and individual threads)

The kernel scheduler decides which threads (that are currently IN THE KERNEL) should be run, but as soon as control passes up to userland, the UTS can decide which thread is run. The UTS can probably influence the kernel scheduler's decision. Probably the rule "All threads in the kernel have priority over all threads in userland" will be the default behaviour, though we might be able to adjust this on a per-KSEGRP basis, if we can think of an alternative. The result of this would be that at the start of a scheduler quantum, all runnable completing syscalls would be completed before the upcall is made to the UTS. The UTS can then select which of the returned threads it wants to run...

>
> > Maybe the KSE-GROUP is what is put onto the run queue, and KSEs from that
> > group are put on all processors that look for work, until all of them
> > have been run? (this would ensure that threads from the same process
> > would all be run at the same time which is sometimes good, and sometimes
> > bad, depending on the application.
>
> How is the time quanta divided up between KSE's and Threads ??...who makes
> the decision when each should be placed on the runqueue and run at a
> particular time when the responsibility is divided up between the UTS and
> kernel scheduler...
                  ^^^ what's with the ^E's??
A KSE gets a quantum when its active priority is the highest among runnable KSEs. It will run each thread it has until completion, in turn. In this case "completion" is one of:

1/ returns to userland
2/ blocks
3/ self destructs
4/ quantum ends.

When it has no runnable threads left to run in the kernel, the next action for the KSE is to upcall to the UTS.

>
> > 6/ When a Thread is made runnable it gets (in the present system) a
> > priority. What priority does a KSE in the run queues have when it has
> > threads of several different priorities? Do we sort them in priority order
> > and drop the priority of the KSE(group) as we go through them
> > until we have less priority than some other kse?
>
> > 7/ when a KSE runs out of work, how does it decide whether there is work
> > that should be stolen from a fellow KSE? How does processor affinity
> > affect this?
>
> Is a KSE not bound to a particular processor with the KSEG able to
> allocate resources between multiple processors?

This is open to debate. It makes no sense to have more KSEs than processors. It may also be useful to bind a KSE to a particular processor. Since threads can migrate between KSEs in the same KSEGRP, it may mean that you'd have to make a special KSEGRP with a single KSE to confine the threads to a single CPU.

>
> > 8/ If we had per-processor scheduling queues, How would that affect it?
> > Which element gets put on the queues? Does a KSE
> > stay on the run queue if it has un-run threads, even when it's running?
> > How do we handle the arrival of new runnable threads with a KSE
> > when it's running but a fellow KSE is not runnable. Do we
> > bump the priority of the other KSE and hand it the new threads?
>
>
> > remember: here are the 4 structures:
>
> > proc - owner of all resources (FDs, memory, user creds) except cpu
>
> > Ksegroup - owner of all scheduler controlling characteristics
> >     (e.g. nice, realtime, number of processors), N per process.
> >     Owner of stats used for scheduling calculations.
>
> > kse - kind of a placeholder. It gets scheduled onto
> >     a processor (by a yet un-named mechanism) and provides
> >     cpu-cycles for the execution of 'threads' (see next).
> >     Max. Of one per processor per KSE-group.
>
> > thread - The in-kernel incarnation of a user thread that is presently
> >     in the kernel for some reason (e.g. syscall, pagefault, etc)
> >     Holds ALL the state needed to resume after sleeping, and is the
> >     entity that is suspended when the thread hits a 'sleep'.
> >     "unlimited" per KSEgroup. probably have a short-term
> >     "favourite" KSE/processor.
>
>
> What is the relationship between processors and processes?? Does not one
> KSEG distribute multiple KSE's between multiple CPU's?

Yes. KSEs are the vehicle of concurrency, as they can be running at the same time on different processors. Theoretically, KSEs in the same KSEGRP should not directly compete with each other, as there can never be more of them than there are processors. KSEs from a different KSEGRP compete in the same way that processes now compete.

>
> > When a thread blocks, the KSE looks for another thread to run, and if it
> > doesn't find one, it will create one, and upcall back to the
> > userland to see if there are more userland threads to run.
> > (if not, it returns to yield the processor)
>
> > The question that has been giving me headaches is the
> > relationship between these elements, and
> > the definitions of how these structures are linked up and moved
> > around to provide fair efficient scheduling.
>
> > If a KSE has a high priority thread and a low priority thread
> > runnable in the kernel, but in reverse order, should it take
> > the high priority from the higher prio. thread and process both,
> > or should it order the threads and run the high prio one first.
> > In this case what happens when a higher prio.
thread becomes runnable
> > while one is already running, and if the highest prio thread returns to
> > userland, should the processor move to userland to follow it, or
> > switch to the next priority thread in the kernel.?
> > Do all threads in the kernel have priority over all threads in userland?
> > (this might be a reasonable decision).
>
> Does the UTS have any input into the priority of how time is apportioned
> to each individual KSE / Thread in the kernel runqueue??..or is that
> entirely up to the kernel scheduler ...

The KSE is a placeholder to which quanta are assigned for the purpose of running any available and runnable threads. If there are no runnable threads (including the one in userland), the KSE will not request any cycles. The KSE applies for these cycles at the priority of the highest-priority thread waiting, where the priority of the thread is a combination of inherited priority and KSEGROUP-wide general priority characteristics (e.g. nice, etc.).

Possibly, as the highest-priority threads are "completed", the priority of the KSE may effectively drop to that of the next highest thread. It is conceivable that it may drop below that of a competing KSE, which may under some circumstances produce a pre-emption (this is still under discussion).

It's not completely obvious at which point the UTS is called to allow processing in userland to continue, but I suspect that it is when there are no more runnable threads in the kernel and there are no more KSEs of higher priority requesting CPU.

There is also the question of whether the raised priority from IO that is present in current UNIX systems should be carried over to the UTS when it finally gets control, or whether it should be run at its BASE priority. Presently, if you do IO, your process gets a boost in priority when the IO completes, to allow you to be scheduled quickly to process the results of the IO and then request more IO (at which time you probably sleep again).
This is to help interactive processes vs. batch processes. It is not obvious what the right thing to do is for a UTS that has both batch and interactive threads....

>
> In general does the memory allocation/recrimination scheme seem adequate
> for all the KSE's/Threads that will be created and destroyed with the new
> implementation...
>

From owner-freebsd-arch Tue Nov 13 17: 0:26 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9]) by hub.freebsd.org (Postfix) with ESMTP id C539C37B405 for ; Tue, 13 Nov 2001 17:00:17 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id QAA00694; Tue, 13 Nov 2001 16:44:28 -0800 (PST)
Date: Tue, 13 Nov 2001 16:44:26 -0800 (PST)
From: Julian Elischer
To: Glenn Gombert
Cc: arch@freebsd.org
Subject: RE: Thread scheduling in the kernel
In-Reply-To: <3.0.6.32.20011113163004.009803c0@imatowns.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN
Content-Transfer-Encoding: QUOTED-PRINTABLE
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

On Tue, 13 Nov 2001, Glenn Gombert wrote:

> > Well, that's cause I think that there are some basic things that need to be
> > decided before we can make the decisions at the bottom of your e-mail. I
> > think the first thing is that priorities need to be decided. The real question
> > there is do we want per-thread priorities or per-ksegroup priorities?
If you
> > go totally with per-thread priorities which you seem to be favoring now and
> > just use ksegroup for nice and fixed priorities, then that makes kse groups
> > simpler at the expense of complicating KSE scheduling. :)
>
> Is not a KSE 'bound' to a particular CPU, with each thread in the KSE
> given a specific amount of time by the kernel scheduler ??. how does
> the UTS play in this (other than to sleep and wakeup threads) ...

Threads become runnable when whatever they were blocking on allows them to run. A KSE (this is still open to discussion) becomes runnable when there is at least one runnable thread that it could provide cycles to. The KSE may not be bound to a single processor (though it MIGHT be) but just able to hop in to take any cycles on any CPU available. (Actually, since threads can migrate between KSEs in the same group, the KSEs are equivalent, so you might select the KSE that was last on this processor if you wanted, but it may not gain you much.)

You could put the KSEGRP on the run queue, but the difficulty comes with the accounting. If you take it off to run a KSE on its behalf, then what happens if another processor becomes available? It's not on the run queue, so even though it may be able to use the extra horsepower, it isn't going to be asked. If it stays at the run queue head until it has run out of threads, then it may never leave the head, as new threads may continually be becoming available. By putting KSEs on the run queues and removing them when they are run, you can ensure that when their quantum is completed, they are placed back onto the queue at the tail end.

There are other answers to these problems; that's what we need to discuss.

"who has priority?" - it seems clear there is a component from both the
    thread and the KSEGRP.. selected by the KSE..

"how do we do the queueing to maintain fairness and responsiveness"
    - I think by queueing KSEs but hopefully someone else
    has a REALLY SNEAKY and CUTE solution :-)

>
> > If we let each thread have a priority and maintain its own scheduling
> > parameters then I would be tempted to put threads on the runqueue's rather than
> > kse's primarily because you then have the problem of having to go update the
> > priorities of KSE's all the time when thread priorities change. And since you
> > want a thread to run as soon as its priority allows, this means changing the
> > priority of all KSE's in its group so it gets to run on the first one that
>
> what is the mechanism for this (kernel scheduling ) or does the UTS
> become involve as well ? What is the impact on performance (if
> re-scheduling is done on a per-thread basis)...
>
>
> > becomes available. This would point to a single priority in the KSE group that
> > all KSE's share that is the highest priority of all runnable threads. If the
> > list of runnable threads in the KSE group is priority sorted (as it should be)
> > this isn't but so difficult as you look at the priority of the thread at the
> > head of the list. However, every time that priority changes, you have to go
> > shuffle KSE's around on the queue's potentially, rather than just moving that
> > one thread around on the queue's (or putting it on the queue as the case may
> > be).
>
> Is not time allocated between Threads in a KSE based upon the total
> amount of time available to the KSE.. if it is not this way , does not
> Threads associated with a particular application gain an 'unfair'
> advantage when it come to running ...

Fairness is a very important criterion. Time is allocated between threads in a KSE in priority order, with no real pre-emption between them. We are in the kernel. We control the code..
We can be sure that within the kernel, codepaths are short before a "completion" of some sort occurs (even if that event is actually the thread re-blocking). When all kernel activity has completed, the UTS can be called. I cannot imagine that it would be wise to call the UTS when there are still runnable threads stuck in a semi-completed state within the kernel.

>
> > One comment about preemption: probably what we will go with is only preempting
> > for real time threads (including interrupt threads) and not preempt time
> > sharing threads until their quantum is up or they block. The entire concept of
> > KSE's as I understand it, is to serve as a holder for the quantum so that we
> > can give a multithreaded process it's full quantum each go-around even if
> > threads block in which case we split it across multiple threads. In that case,
> > I think this might be a reasonable model:
>
>
> > I think this will achieve the desired goal of a KSE (preserve quanta for
> > multithreading time-sharing processes across threads) while still allowing
> > things like priority propagation and preemption to work smoothly. It's also
> > fairly simple.
>
>
>
> > If you use a priority bias for affinity, then that means you basically have a
> > constant, say 4 (that is random, prolly not the real value) then you will
> > basically artificially bump the priority of threads with lastcpu == cpuid by 4
> > during your comparison. This means you can stop walking the ksegroup list of
> > threads when you hit a thread whose priority is more than 4 levels less than
> > that of the highest priority thread. Also, the first thread you hit that meets
> > the affinity requirement is the one you run, this should keep a (hopefully)
> > decent bound on the amount of list walking done.
>
> If KSE's are bound to a particular CPU, how does this affect KSE's &
> Threads on different CPU'

Threads are like water.
They flow between any available KSEs for their KSEGRP (with a slight preference for one on their last processor). KSEs from the same KSEGRP have the same priority and can therefore never pre-empt each other on the same processor; thus it makes no sense to have more of them than there are processors.

Binding them to processors is also dubious. If KSE A runs on processor A, then KSE B must run on processor B unless it is already busy. If KSE A finishes and processor B is still busy, then KSE B can run on processor A, but this is functionally identical to the case where the 3rd KSE (that was keeping B busy) finished and KSE B ran on processor B, since both KSE A and KSE B are drawing from the same pool of runnable threads. It MAY make some small sense to say "Hey, that KSEGRP has another quantum, and since processor B is busy, we'll run the KSE for processor A again, in KSE B's place", but there is no real gain in doing so.

One case to consider is as follows: KSE A is running at raised priority '3' (1 being higher priority) because it is running a thread (T) that holds a mutex needed by a high-priority process. Thread (S) becomes runnable at a higher priority (2). If there is a KSE (B) for that KSEGRP available, it is made runnable and its priority is set to (2). Is it possible that it might pre-empt the KSE from the same process group (A) on the same processor if there is a thread of priority (1) running on the other processor? How is this different from pre-empting the thread (T) within (A) and running (S) within (A) instead?

This all needs thrashing out, and that is what I'm trying to achieve here on -arch.
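The affinity bias quoted earlier in this message (favour threads whose lastcpu matches the current CPU by a constant, and stop walking the list once priorities fall outside that window) might be sketched in C roughly as follows. This is only an illustration under stated assumptions: the `rthread` type, its fields and `choose_thread()` are hypothetical names, the bias of 4 is the arbitrary value from the quoted text, and the list is assumed to be sorted with lower numbers meaning higher priority.

```c
#include <assert.h>
#include <stddef.h>

#define AFFINITY_BIAS 4   /* arbitrary constant, as in the quoted text */

/* Hypothetical runnable-thread record; lower td_pri is more urgent. */
struct rthread {
    int td_pri;               /* scheduling priority */
    int td_lastcpu;           /* CPU the thread last ran on */
    struct rthread *td_next;  /* next entry; list is sorted by td_pri */
};

/*
 * Pick the thread a KSE running on cpuid should take from its group's
 * priority-sorted runnable list. The first thread inside the bias window
 * that last ran on cpuid wins; otherwise the head of the list is used.
 */
struct rthread *
choose_thread(struct rthread *head, int cpuid)
{
    struct rthread *td;

    for (td = head; td != NULL; td = td->td_next) {
        /* The sketch relies on the list being sorted by priority. */
        assert(td->td_next == NULL || td->td_next->td_pri >= td->td_pri);
        /* Past the bias window: no later thread can beat the head. */
        if (td->td_pri > head->td_pri + AFFINITY_BIAS)
            break;
        if (td->td_lastcpu == cpuid)
            return (td);
    }
    return (head);
}
```

Because the walk stops at the first thread more than AFFINITY_BIAS levels below the head, the cost is bounded by the number of threads near the top of the list rather than by the length of the whole list.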
>=20 >=20 >=20 > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-arch" in the body of the message >=20 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 14 16: 9:39 2001 Delivered-To: freebsd-arch@freebsd.org Received: from relay.gnf.org (relay.gnf.org [208.44.31.36]) by hub.freebsd.org (Postfix) with ESMTP id E962637B417 for ; Wed, 14 Nov 2001 16:09:35 -0800 (PST) Received: from mail.gnf.org (smtp.gnf.org [10.0.0.11]) by relay.gnf.org (8.11.6/8.11.6) with ESMTP id fAF09YJ15216 for ; Wed, 14 Nov 2001 16:09:34 -0800 Received: by mail.gnf.org (Postfix, from userid 888) id A436511E504; Wed, 14 Nov 2001 16:06:37 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by mail.gnf.org (Postfix) with ESMTP id A019B11A572 for ; Wed, 14 Nov 2001 16:06:37 -0800 (PST) Date: Wed, 14 Nov 2001 16:06:37 -0800 (PST) From: Gordon Tetlow To: Subject: rc.d issues Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG There are a couple of issues with porting the rc.d infrastructure that need to be addressed before going forward. Most notably is NetBSD's use of (for example) $ipfilter while FreeBSD uses $ipfilter_enable. Not wanting to break POLA, I was thinking about hacking /etc/rc.subr to check $ and if that is unset, check $_enable. Any thoughts? I have yet to see any thoughts, criticisms, critiques or anything of the like for the initial patch that I posted (plug http://hobbes.melthusia.org/~gordont/rc_ng.diff). So I'm going to continue working along my current path. I've just moved so it's slowed up a bit, but hopefully I'll be able to return it shortly. 
-gordon

From owner-freebsd-arch Wed Nov 14 16:17:25 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 1E35737B405 for ; Wed, 14 Nov 2001 16:15:47 -0800 (PST)
Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fAF0Flb09186; Wed, 14 Nov 2001 16:15:47 -0800 (PST) (envelope-from dillon)
Date: Wed, 14 Nov 2001 16:15:47 -0800 (PST)
From: Matthew Dillon
Message-Id: <200111150015.fAF0Flb09186@apollo.backplane.com>
To: freebsd-arch@freebsd.org
Subject: Need review - patch for socket locking and ref counting
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

This patch adds a reference count to the socket structure and cleans up & encapsulates the API calls. I do not yet attempt to use sxlocks to lock the socket structure (to allow us to multi-thread the network stack), but that is the direction I am headed.

soalloc()/sofree() - no reference counter adjustments (so_count must be 0 or sofree() panics) (soalloc initializes so_count to 0)

socreate()/soclose() - socreate inits ref counter to 1, soclose decrements ref counter.

soref() - bump ref counter

sorele() - decrement ref counter, calls sofree() when the ref counter hits 0

holdsock() removed, fgetsock() added in a manner similar to fget() and fgetvp().

I would like a review. Also, I noticed there are two calls to soisdisconnected() *AFTER* the code (originally) calls sofree(), which sounds bogus to me. Could someone review the original code and give me an opinion? (see the last two XXX's in the patch set).
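The reference-counting rules listed above can be modelled in a few lines of userland C. This is a sketch of the described semantics only, not the patch itself: `sock_sketch` and the `sk_*` functions are hypothetical stand-ins for struct socket and soalloc()/sofree()/soref()/sorele()/socreate()/soclose(), assert() stands in for the kernel panic, and no locking is modelled.

```c
#include <assert.h>
#include <stdlib.h>

/* Userland stand-in for struct socket; only the counter is modelled. */
struct sock_sketch {
    int so_count;   /* number of outstanding references */
};

int sketch_freed;   /* counts frees so the behaviour is observable */

/* Like soalloc(): no reference is taken, the counter starts at 0. */
struct sock_sketch *
sk_alloc(void)
{
    return (calloc(1, sizeof(struct sock_sketch)));
}

/* Like sofree(): freeing a socket that is still referenced is a bug. */
void
sk_free(struct sock_sketch *so)
{
    assert(so->so_count == 0);
    free(so);
    sketch_freed++;
}

/* Like soref(): bump the counter. */
void
sk_ref(struct sock_sketch *so)
{
    so->so_count++;
}

/* Like sorele(): drop one reference, free on the last one. */
void
sk_rele(struct sock_sketch *so)
{
    assert(so->so_count > 0);
    if (--so->so_count == 0)
        sk_free(so);
}

/* Like socreate(): the creator holds the first reference. */
struct sock_sketch *
sk_create(void)
{
    struct sock_sketch *so = sk_alloc();

    if (so != NULL)
        sk_ref(so);
    return (so);
}

/* Like soclose(): gives up the creator's reference. */
void
sk_close(struct sock_sketch *so)
{
    sk_rele(so);
}
```

In this model a caller that took an extra reference (as fgetsock() does) keeps the structure alive across a concurrent close, and the memory is released only when that last reference is dropped.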
Thanks, -Matt Matthew Dillon Index: compat/svr4/svr4_stream.c =================================================================== RCS file: /home/ncvs/src/sys/compat/svr4/svr4_stream.c,v retrieving revision 1.22 diff -u -r1.22 svr4_stream.c --- compat/svr4/svr4_stream.c 2001/09/12 08:36:58 1.22 +++ compat/svr4/svr4_stream.c 2001/11/14 22:10:24 @@ -150,7 +150,6 @@ register struct msghdr *mp; int flags; { - struct file *fp; struct uio auio; register struct iovec *iov; register int i; @@ -163,8 +162,7 @@ struct uio ktruio; #endif - error = holdsock(td->td_proc->p_fd, s, &fp); - if (error) + if ((error = fgetsock(td, s, &so, NULL)) != 0) return (error); auio.uio_iov = mp->msg_iov; auio.uio_iovcnt = mp->msg_iovlen; @@ -176,16 +174,14 @@ iov = mp->msg_iov; for (i = 0; i < mp->msg_iovlen; i++, iov++) { if ((auio.uio_resid += iov->iov_len) < 0) { - fdrop(fp, td); - return (EINVAL); + error = EINVAL; + goto done1; } } if (mp->msg_name) { error = getsockaddr(&to, mp->msg_name, mp->msg_namelen); - if (error) { - fdrop(fp, td); - return (error); - } + if (error) + goto done1; } else { to = 0; } @@ -211,7 +207,6 @@ } #endif len = auio.uio_resid; - so = (struct socket *)fp->f_data; error = so->so_proto->pr_usrreqs->pru_sosend(so, to, &auio, 0, control, flags, td); if (error) { @@ -239,7 +234,8 @@ bad: if (to) FREE(to, M_SONAME); - fdrop(fp, td); +done1: + fputsock(so); return (error); } @@ -250,7 +246,6 @@ register struct msghdr *mp; caddr_t namelenp; { - struct file *fp; struct uio auio; register struct iovec *iov; register int i; @@ -264,8 +259,7 @@ struct uio ktruio; #endif - error = holdsock(td->td_proc->p_fd, s, &fp); - if (error) + if ((error = fgetsock(td, s, &so, NULL)) != 0) return (error); auio.uio_iov = mp->msg_iov; auio.uio_iovcnt = mp->msg_iovlen; @@ -277,8 +271,8 @@ iov = mp->msg_iov; for (i = 0; i < mp->msg_iovlen; i++, iov++) { if ((auio.uio_resid += iov->iov_len) < 0) { - fdrop(fp, td); - return (EINVAL); + error = EINVAL; + goto done1; } } #ifdef KTRACE @@ 
-365,7 +359,8 @@ FREE(fromsa, M_SONAME); if (control) m_freem(control); - fdrop(fp, td); +done1: + fputsock(so); return (error); } Index: kern/kern_descrip.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_descrip.c,v retrieving revision 1.111 diff -u -r1.111 kern_descrip.c --- kern/kern_descrip.c 2001/11/14 06:30:35 1.111 +++ kern/kern_descrip.c 2001/11/14 23:42:17 @@ -60,6 +60,8 @@ #include #include #include +#include +#include #include @@ -1423,6 +1425,51 @@ fgetvp_write(struct thread *td, int fd, struct vnode **vpp) { return(_fgetvp(td, fd, vpp, FWRITE)); +} + +/* + * Like fget() but loads the underlying socket, or returns an error if + * the descriptor does not represent a socket. + * + * We bump the ref count on the returned socket. XXX Also obtain the SX lock in + * the future. + */ +int +fgetsock(struct thread *td, int fd, struct socket **spp, u_int *fflagp) +{ + struct filedesc *fdp; + struct file *fp; + struct socket *so; + + GIANT_REQUIRED; + fdp = td->td_proc->p_fd; + *spp = NULL; + if (fflagp) + *fflagp = 0; + if ((u_int)fd >= fdp->fd_nfiles) + return(EBADF); + if ((fp = fdp->fd_ofiles[fd]) == NULL) + return(EBADF); + if (fp->f_type != DTYPE_SOCKET) + return(ENOTSOCK); + if (fp->f_data == NULL) + return(EINVAL); + so = (struct socket *)fp->f_data; + if (fflagp) + *fflagp = fp->f_flag; + soref(so); + *spp = so; + return(0); +} + +/* + * Drop the reference count on the the socket and XXX release the SX lock in + * the future. The last reference closes the socket. 
+ */ +void +fputsock(struct socket *so) +{ + sorele(so); } int Index: kern/kern_mtxpool.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_mtxpool.c,v retrieving revision 1.1 diff -u -r1.1 kern_mtxpool.c --- kern/kern_mtxpool.c 2001/11/13 21:55:12 1.1 +++ kern/kern_mtxpool.c 2001/11/14 04:06:48 @@ -35,9 +35,10 @@ #include #ifndef MTX_POOL_SIZE -#define MTX_POOL_SIZE 128 +#define MTX_POOL_SIZE 128 /* must be a multiple of 4 */ #endif -#define MTX_POOL_MASK (MTX_POOL_SIZE-1) +#define MTX_POOL_MASK (MTX_POOL_SIZE - 1) +#define MTX_POOL_XMASK (MTX_POOL_MASK & ~3) static struct mtx mtx_pool_ary[MTX_POOL_SIZE]; @@ -54,6 +55,34 @@ return(&mtx_pool_ary[((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_MASK]); } +static __inline +struct mtx * +_mtx_pool1_find(void *ptr) +{ + return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 0]); +} + +static __inline +struct mtx * +_mtx_pool2_find(void *ptr) +{ + return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 1]); +} + +static __inline +struct mtx * +_mtx_pool3_find(void *ptr) +{ + return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 2]); +} + +static __inline +struct mtx * +_mtx_pool4_find(void *ptr) +{ + return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 3]); +} + static void mtx_pool_setup(void *dummy __unused) { @@ -88,6 +117,30 @@ return(_mtx_pool_find(ptr)); } +struct mtx * +mtx_pool1_find(void *ptr) +{ + return(_mtx_pool1_find(ptr)); +} + +struct mtx * +mtx_pool2_find(void *ptr) +{ + return(_mtx_pool2_find(ptr)); +} + +struct mtx * +mtx_pool3_find(void *ptr) +{ + return(_mtx_pool3_find(ptr)); +} + +struct mtx * +mtx_pool4_find(void *ptr) +{ + return(_mtx_pool4_find(ptr)); +} + /* * Combined find/lock operation. Lock the pool mutex associated with * the specified address. 
@@ -98,6 +151,30 @@ mtx_lock(_mtx_pool_find(ptr)); } +void +mtx_pool1_lock(void *ptr) +{ + mtx_lock(_mtx_pool1_find(ptr)); +} + +void +mtx_pool2_lock(void *ptr) +{ + mtx_lock(_mtx_pool2_find(ptr)); +} + +void +mtx_pool3_lock(void *ptr) +{ + mtx_lock(_mtx_pool3_find(ptr)); +} + +void +mtx_pool4_lock(void *ptr) +{ + mtx_lock(_mtx_pool4_find(ptr)); +} + /* * Combined find/unlock operation. Unlock the pool mutex associated with * the specified address. @@ -106,6 +183,30 @@ mtx_pool_unlock(void *ptr) { mtx_unlock(_mtx_pool_find(ptr)); +} + +void +mtx_pool1_unlock(void *ptr) +{ + mtx_unlock(_mtx_pool1_find(ptr)); +} + +void +mtx_pool2_unlock(void *ptr) +{ + mtx_unlock(_mtx_pool2_find(ptr)); +} + +void +mtx_pool3_unlock(void *ptr) +{ + mtx_unlock(_mtx_pool3_find(ptr)); +} + +void +mtx_pool4_unlock(void *ptr) +{ + mtx_unlock(_mtx_pool4_find(ptr)); } SYSINIT(mtxpooli, SI_SUB_MUTEX, SI_ORDER_FIRST, mtx_pool_setup, NULL) Index: kern/sys_socket.c =================================================================== RCS file: /home/ncvs/src/sys/kern/sys_socket.c,v retrieving revision 1.35 diff -u -r1.35 sys_socket.c --- kern/sys_socket.c 2001/09/12 08:37:46 1.35 +++ kern/sys_socket.c 2001/11/14 23:48:45 @@ -182,6 +182,12 @@ return ((*so->so_proto->pr_usrreqs->pru_sense)(so, ub)); } +/* + * API socket close on file pointer. We call soclose() to close the + * socket (including initiating closing protocols). soclose() will + * sorele() the file reference but the actual socket will not go away + * until the socket's ref count hits 0. 
+ */ /* ARGSUSED */ int soo_close(fp, td) @@ -189,10 +195,12 @@ struct thread *td; { int error = 0; + struct socket *so; fp->f_ops = &badfileops; - if (fp->f_data) - error = soclose((struct socket *)fp->f_data); - fp->f_data = 0; + if ((so = fp->f_data) != NULL) { + fp->f_data = NULL; + error = soclose(so); + } return (error); } Index: kern/uipc_socket.c =================================================================== RCS file: /home/ncvs/src/sys/kern/uipc_socket.c,v retrieving revision 1.105 diff -u -r1.105 uipc_socket.c --- kern/uipc_socket.c 2001/11/12 20:51:40 1.105 +++ kern/uipc_socket.c 2001/11/15 00:03:25 @@ -106,6 +106,8 @@ * Note that it would probably be better to allocate socket * and PCB at the same time, but I'm not convinced that all * the protocols can be easily modified to do this. + * + * soalloc() returns a socket with a ref count of 0. */ struct socket * soalloc(waitok) @@ -119,11 +121,16 @@ bzero(so, sizeof *so); so->so_gencnt = ++so_gencnt; so->so_zone = socket_zone; + /* sx_init(&so->so_sxlock, "socket sxlock"); */ TAILQ_INIT(&so->so_aiojobq); } return so; } +/* + * socreate returns a socket with a ref count of 1. The socket should be + * closed with soclose(). 
+ */ int socreate(dom, aso, type, proto, td) int dom; @@ -162,10 +169,11 @@ so->so_type = type; so->so_cred = crhold(td->td_proc->p_ucred); so->so_proto = prp; + soref(so); error = (*prp->pr_usrreqs->pru_attach)(so, proto, td); if (error) { so->so_state |= SS_NOFDREF; - sofree(so); + sorele(so); return (error); } *aso = so; @@ -186,11 +194,12 @@ return (error); } -void -sodealloc(so) - struct socket *so; +static void +sodealloc(struct socket *so) { + KASSERT(so->so_count == 0, ("sodealloc(): so_count %d", so->so_count)); + so->so_count = 0; so->so_gencnt = ++so_gencnt; if (so->so_rcv.sb_hiwat) (void)chgsbsize(so->so_cred->cr_uidinfo, @@ -210,6 +219,7 @@ } #endif crfree(so->so_cred); + /* sx_destroy(&so->so_sxlock); */ zfree(so->so_zone, so); } @@ -242,6 +252,8 @@ { struct socket *head = so->so_head; + KASSERT(so->so_count == 0, ("socket %p so_count not 0", so)); + if (so->so_pcb || (so->so_state & SS_NOFDREF) == 0) return; if (head != NULL) { @@ -272,6 +284,10 @@ * Close a socket on last file table reference removal. * Initiate disconnect if connected. * Free socket when disconnect complete. + * + * This function will sorele() the socket. Note that soclose() may be + * called prior to the ref count reaching zero. The actual socket + * structure will not be freed until the ref count reaches zero. 
*/ int soclose(so) @@ -329,7 +345,7 @@ if (so->so_state & SS_NOFDREF) panic("soclose: NOFDREF"); so->so_state |= SS_NOFDREF; - sofree(so); + sorele(so); splx(s); return (error); } @@ -345,7 +361,7 @@ error = (*so->so_proto->pr_usrreqs->pru_abort)(so); if (error) { - sofree(so); + sotryfree(so); /* note: does not decrement the ref count */ return error; } return (0); Index: kern/uipc_socket2.c =================================================================== RCS file: /home/ncvs/src/sys/kern/uipc_socket2.c,v retrieving revision 1.76 diff -u -r1.76 uipc_socket2.c --- kern/uipc_socket2.c 2001/10/11 23:38:15 1.76 +++ kern/uipc_socket2.c 2001/11/14 23:59:33 @@ -210,6 +210,8 @@ * then we allocate a new structure, propoerly linked into the * data structure of the original socket, and return this. * Connstatus may be 0, or SO_ISCONFIRMING, or SO_ISCONNECTED. + * + * note: the ref count on the socket is 0 on return */ struct socket * sonewconn(head, connstatus) @@ -246,7 +248,7 @@ so->so_cred = crhold(head->so_cred); if (soreserve(so, head->so_snd.sb_hiwat, head->so_rcv.sb_hiwat) || (*so->so_proto->pr_usrreqs->pru_attach)(so, 0, NULL)) { - sodealloc(so); + sotryfree(so); return ((struct socket *)0); } Index: kern/uipc_syscalls.c =================================================================== RCS file: /home/ncvs/src/sys/kern/uipc_syscalls.c,v retrieving revision 1.98 diff -u -r1.98 uipc_syscalls.c --- kern/uipc_syscalls.c 2001/11/14 06:30:35 1.98 +++ kern/uipc_syscalls.c 2001/11/14 23:09:34 @@ -139,7 +139,7 @@ fdrop(fp, td); } } else { - fp->f_data = (caddr_t)so; + fp->f_data = (caddr_t)so; /* already has ref count */ fp->f_flag = FREAD|FWRITE; fp->f_ops = &socketops; fp->f_type = DTYPE_SOCKET; @@ -164,22 +164,19 @@ int namelen; } */ *uap; { - struct file *fp; struct sockaddr *sa; + struct socket *sp; int error; mtx_lock(&Giant); - error = holdsock(td->td_proc->p_fd, uap->s, &fp); - if (error) + if ((error = fgetsock(td, uap->s, &sp, NULL)) != 0) goto done2; - error = 
getsockaddr(&sa, uap->name, uap->namelen); - if (error) { - fdrop(fp, td); - goto done2; - } - error = sobind((struct socket *)fp->f_data, sa, td); + if ((error = getsockaddr(&sa, uap->name, uap->namelen)) != 0) + goto done1; + error = sobind(sp, sa, td); FREE(sa, M_SONAME); - fdrop(fp, td); +done1: + fputsock(sp); done2: mtx_unlock(&Giant); return (error); @@ -197,14 +194,13 @@ int backlog; } */ *uap; { - struct file *fp; + struct socket *sp; int error; mtx_lock(&Giant); - error = holdsock(td->td_proc->p_fd, uap->s, &fp); - if (error == 0) { - error = solisten((struct socket *)fp->f_data, uap->backlog, td); - fdrop(fp, td); + if ((error = fgetsock(td, uap->s, &sp, NULL)) == 0) { + error = solisten(sp, uap->backlog, td); + fputsock(sp); } mtx_unlock(&Giant); return(error); @@ -225,13 +221,12 @@ int compat; { struct filedesc *fdp; - struct file *lfp = NULL; struct file *nfp = NULL; struct sockaddr *sa; int namelen, error, s; struct socket *head, *so; int fd; - short fflag; /* type must match fp->f_flag */ + u_int fflag; mtx_lock(&Giant); fdp = td->td_proc->p_fd; @@ -241,11 +236,10 @@ if(error) goto done2; } - error = holdsock(fdp, uap->s, &lfp); + error = fgetsock(td, uap->s, &head, &fflag); if (error) goto done2; s = splnet(); - head = (struct socket *)lfp->f_data; if ((head->so_options & SO_ACCEPTCONN) == 0) { splx(s); error = EINVAL; @@ -286,7 +280,6 @@ TAILQ_REMOVE(&head->so_comp, so, so_list); head->so_qlen--; - fflag = lfp->f_flag; error = falloc(td, &nfp, &fd); if (error) { /* @@ -312,7 +305,7 @@ if (head->so_sigio != NULL) fsetown(fgetown(head->so_sigio), &so->so_sigio); - nfp->f_data = (caddr_t)so; + nfp->f_data = (caddr_t)so; /* already has ref count */ nfp->f_flag = fflag; nfp->f_ops = &socketops; nfp->f_type = DTYPE_SOCKET; @@ -375,7 +368,7 @@ done: if (nfp != NULL) fdrop(nfp, td); - fdrop(lfp, td); + fputsock(head); done2: mtx_unlock(&Giant); return (error); @@ -420,35 +413,31 @@ int namelen; } */ *uap; { - struct file *fp; - register struct socket *so; 
+ struct socket *so; struct sockaddr *sa; int error, s; mtx_lock(&Giant); - error = holdsock(td->td_proc->p_fd, uap->s, &fp); - if (error) + if ((error = fgetsock(td, uap->s, &so, NULL)) != 0) goto done2; - so = (struct socket *)fp->f_data; if ((so->so_state & SS_NBIO) && (so->so_state & SS_ISCONNECTING)) { error = EALREADY; - goto done; + goto done1; } error = getsockaddr(&sa, uap->name, uap->namelen); if (error) - goto done; + goto done1; error = soconnect(so, sa, td); if (error) goto bad; if ((so->so_state & SS_NBIO) && (so->so_state & SS_ISCONNECTING)) { FREE(sa, M_SONAME); error = EINPROGRESS; - goto done; + goto done1; } s = splnet(); while ((so->so_state & SS_ISCONNECTING) && so->so_error == 0) { - error = tsleep((caddr_t)&so->so_timeo, PSOCK | PCATCH, - "connec", 0); + error = tsleep((caddr_t)&so->so_timeo, PSOCK | PCATCH, "connec", 0); if (error) break; } @@ -462,8 +451,8 @@ FREE(sa, M_SONAME); if (error == ERESTART) error = EINTR; -done: - fdrop(fp, td); +done1: + fputsock(so); done2: mtx_unlock(&Giant); return (error); @@ -499,12 +488,12 @@ goto free2; fhold(fp1); sv[0] = fd; - fp1->f_data = (caddr_t)so1; + fp1->f_data = (caddr_t)so1; /* so1 already has ref count */ error = falloc(td, &fp2, &fd); if (error) goto free3; fhold(fp2); - fp2->f_data = (caddr_t)so2; + fp2->f_data = (caddr_t)so2; /* so2 already has ref count */ sv[1] = fd; error = soconnect2(so1, so2); if (error) @@ -552,12 +541,11 @@ register struct msghdr *mp; int flags; { - struct file *fp; struct uio auio; register struct iovec *iov; register int i; struct mbuf *control; - struct sockaddr *to; + struct sockaddr *to = NULL; int len, error; struct socket *so; #ifdef KTRACE @@ -565,8 +553,7 @@ struct uio ktruio; #endif - error = holdsock(td->td_proc->p_fd, s, &fp); - if (error) + if ((error = fgetsock(td, s, &so, NULL)) != 0) return (error); auio.uio_iov = mp->msg_iov; auio.uio_iovcnt = mp->msg_iovlen; @@ -578,18 +565,14 @@ iov = mp->msg_iov; for (i = 0; i < mp->msg_iovlen; i++, iov++) { if 
((auio.uio_resid += iov->iov_len) < 0) { - fdrop(fp, td); - return (EINVAL); + error = EINVAL; + goto bad; } } if (mp->msg_name) { error = getsockaddr(&to, mp->msg_name, mp->msg_namelen); - if (error) { - fdrop(fp, td); - return (error); - } - } else { - to = 0; + if (error) + goto bad; } if (mp->msg_control) { if (mp->msg_controllen < sizeof(struct cmsghdr) @@ -633,7 +616,6 @@ } #endif len = auio.uio_resid; - so = (struct socket *)fp->f_data; error = so->so_proto->pr_usrreqs->pru_sosend(so, to, &auio, 0, control, flags, td); if (error) { @@ -659,7 +641,7 @@ } #endif bad: - fdrop(fp, td); + fputsock(so); if (to) FREE(to, M_SONAME); return (error); @@ -834,7 +816,6 @@ register struct msghdr *mp; caddr_t namelenp; { - struct file *fp; struct uio auio; register struct iovec *iov; register int i; @@ -848,8 +829,7 @@ struct uio ktruio; #endif - error = holdsock(td->td_proc->p_fd, s, &fp); - if (error) + if ((error = fgetsock(td, s, &so, NULL)) != 0) return (error); auio.uio_iov = mp->msg_iov; auio.uio_iovcnt = mp->msg_iovlen; @@ -861,7 +841,7 @@ iov = mp->msg_iov; for (i = 0; i < mp->msg_iovlen; i++, iov++) { if ((auio.uio_resid += iov->iov_len) < 0) { - fdrop(fp, td); + fputsock(so); return (EINVAL); } } @@ -875,7 +855,6 @@ } #endif len = auio.uio_resid; - so = (struct socket *)fp->f_data; error = so->so_proto->pr_usrreqs->pru_soreceive(so, &fromsa, &auio, (struct mbuf **)0, mp->msg_control ? 
&control : (struct mbuf **)0, &mp->msg_flags); @@ -975,7 +954,7 @@ mp->msg_controllen = ctlbuf - (caddr_t)mp->msg_control; } out: - fdrop(fp, td); + fputsock(so); if (fromsa) FREE(fromsa, M_SONAME); if (control) @@ -1196,14 +1175,13 @@ int how; } */ *uap; { - struct file *fp; + struct socket *so; int error; mtx_lock(&Giant); - error = holdsock(td->td_proc->p_fd, uap->s, &fp); - if (error == 0) { - error = soshutdown((struct socket *)fp->f_data, uap->how); - fdrop(fp, td); + if ((error = fgetsock(td, uap->s, &so, NULL)) == 0) { + error = soshutdown(so, uap->how); + fputsock(so); } mtx_unlock(&Giant); return(error); @@ -1224,7 +1202,7 @@ int valsize; } */ *uap; { - struct file *fp; + struct socket *so; struct sockopt sopt; int error; @@ -1234,16 +1212,15 @@ return (EINVAL); mtx_lock(&Giant); - error = holdsock(td->td_proc->p_fd, uap->s, &fp); - if (error == 0) { + if ((error = fgetsock(td, uap->s, &so, NULL)) == 0) { sopt.sopt_dir = SOPT_SET; sopt.sopt_level = uap->level; sopt.sopt_name = uap->name; sopt.sopt_val = uap->val; sopt.sopt_valsize = uap->valsize; sopt.sopt_td = td; - error = sosetopt((struct socket *)fp->f_data, &sopt); - fdrop(fp, td); + error = sosetopt(so, &sopt); + fputsock(so); } mtx_unlock(&Giant); return(error); @@ -1265,24 +1242,20 @@ } */ *uap; { int valsize, error; - struct file *fp; + struct socket *so; struct sockopt sopt; mtx_lock(&Giant); - error = holdsock(td->td_proc->p_fd, uap->s, &fp); - if (error) + if ((error = fgetsock(td, uap->s, &so, NULL)) != 0) goto done2; if (uap->val) { error = copyin((caddr_t)uap->avalsize, (caddr_t)&valsize, sizeof (valsize)); - if (error) { - fdrop(fp, td); - goto done2; - } + if (error) + goto done1; if (valsize < 0) { - fdrop(fp, td); error = EINVAL; - goto done2; + goto done1; } } else { valsize = 0; @@ -1295,13 +1268,14 @@ sopt.sopt_valsize = (size_t)valsize; /* checked non-negative above */ sopt.sopt_td = td; - error = sogetopt((struct socket *)fp->f_data, &sopt); + error = sogetopt(so, &sopt); if (error 
== 0) { valsize = sopt.sopt_valsize; error = copyout((caddr_t)&valsize, (caddr_t)uap->avalsize, sizeof (valsize)); } - fdrop(fp, td); +done1: + fputsock(so); done2: mtx_unlock(&Giant); return (error); @@ -1323,21 +1297,16 @@ } */ *uap; int compat; { - struct file *fp; - register struct socket *so; + struct socket *so; struct sockaddr *sa; int len, error; mtx_lock(&Giant); - error = holdsock(td->td_proc->p_fd, uap->fdes, &fp); - if (error) + if ((error = fgetsock(td, uap->fdes, &so, NULL)) != 0) goto done2; error = copyin((caddr_t)uap->alen, (caddr_t)&len, sizeof (len)); - if (error) { - fdrop(fp, td); - goto done2; - } - so = (struct socket *)fp->f_data; + if (error) + goto done1; sa = 0; error = (*so->so_proto->pr_usrreqs->pru_sockaddr)(so, &sa); if (error) @@ -1360,7 +1329,8 @@ bad: if (sa) FREE(sa, M_SONAME); - fdrop(fp, td); +done1: + fputsock(so); done2: mtx_unlock(&Giant); return (error); @@ -1408,26 +1378,20 @@ } */ *uap; int compat; { - struct file *fp; - register struct socket *so; + struct socket *so; struct sockaddr *sa; int len, error; mtx_lock(&Giant); - error = holdsock(td->td_proc->p_fd, uap->fdes, &fp); - if (error) + if ((error = fgetsock(td, uap->fdes, &so, NULL)) != 0) goto done2; - so = (struct socket *)fp->f_data; if ((so->so_state & (SS_ISCONNECTED|SS_ISCONFIRMING)) == 0) { - fdrop(fp, td); error = ENOTCONN; - goto done2; + goto done1; } error = copyin((caddr_t)uap->alen, (caddr_t)&len, sizeof (len)); - if (error) { - fdrop(fp, td); - goto done2; - } + if (error) + goto done1; sa = 0; error = (*so->so_proto->pr_usrreqs->pru_peeraddr)(so, &sa); if (error) @@ -1450,7 +1414,8 @@ bad: if (sa) FREE(sa, M_SONAME); - fdrop(fp, td); +done1: + fputsock(so); done2: mtx_unlock(&Giant); return (error); @@ -1550,33 +1515,6 @@ } /* - * holdsock() - load the struct file pointer associated - * with a socket into *fpp. If an error occurs, non-zero - * will be returned and *fpp will be set to NULL. 
- */ -int -holdsock(fdp, fdes, fpp) - struct filedesc *fdp; - int fdes; - struct file **fpp; -{ - register struct file *fp = NULL; - int error = 0; - - if ((unsigned)fdes >= fdp->fd_nfiles || - (fp = fdp->fd_ofiles[fdes]) == NULL) { - error = EBADF; - } else if (fp->f_type != DTYPE_SOCKET) { - error = ENOTSOCK; - fp = NULL; - } else { - fhold(fp); - } - *fpp = fp; - return(error); -} - -/* * Allocate a pool of sf_bufs (sendfile(2) or "super-fast" if you prefer. :-)) * XXX - The sf_buf functions are currently private to sendfile(2), so have * been made static, but may be useful in the future for doing zero-copy in @@ -1678,10 +1616,9 @@ int sendfile(struct thread *td, struct sendfile_args *uap) { - struct file *fp = NULL; struct vnode *vp; struct vm_object *obj; - struct socket *so; + struct socket *so = NULL; struct mbuf *m; struct sf_buf *sf; struct vm_page *pg; @@ -1701,10 +1638,8 @@ error = EINVAL; goto done; } - error = holdsock(td->td_proc->p_fd, uap->s, &fp); - if (error) + if ((error = fgetsock(td, uap->s, &so, NULL)) != 0) goto done; - so = (struct socket *)fp->f_data; if (so->so_type != SOCK_STREAM) { error = EINVAL; goto done; @@ -1988,8 +1923,9 @@ } if (vp) vrele(vp); - if (fp) - fdrop(fp, td); + if (so) + fputsock(so); mtx_unlock(&Giant); return (error); } + Index: kern/uipc_usrreq.c =================================================================== RCS file: /home/ncvs/src/sys/kern/uipc_usrreq.c,v retrieving revision 1.76 diff -u -r1.76 uipc_usrreq.c --- kern/uipc_usrreq.c 2001/11/08 02:13:16 1.76 +++ kern/uipc_usrreq.c 2001/11/14 23:59:42 @@ -935,7 +935,7 @@ if (unp->unp_addr) FREE(unp->unp_addr, M_SONAME); zfree(unp_zone, unp); - sofree(so); + sotryfree(so); } } Index: net/raw_cb.c =================================================================== RCS file: /home/ncvs/src/sys/net/raw_cb.c,v retrieving revision 1.16 diff -u -r1.16 raw_cb.c --- net/raw_cb.c 1999/08/28 00:48:27 1.16 +++ net/raw_cb.c 2001/11/14 23:59:49 @@ -97,7 +97,7 @@ struct socket 
*so = rp->rcb_socket; so->so_pcb = 0; - sofree(so); + sotryfree(so); LIST_REMOVE(rp, list); #ifdef notdef if (rp->rcb_laddr) Index: net/raw_usrreq.c =================================================================== RCS file: /home/ncvs/src/sys/net/raw_usrreq.c,v retrieving revision 1.20 diff -u -r1.20 raw_usrreq.c --- net/raw_usrreq.c 2001/09/12 08:37:51 1.20 +++ net/raw_usrreq.c 2001/11/14 23:59:56 @@ -142,8 +142,8 @@ if (rp == 0) return EINVAL; raw_disconnect(rp); - sofree(so); - soisdisconnected(so); + sotryfree(so); + soisdisconnected(so); /* XXX huh? called after the sofree()? */ return 0; } Index: netatalk/ddp_usrreq.c =================================================================== RCS file: /home/ncvs/src/sys/netatalk/ddp_usrreq.c,v retrieving revision 1.21 diff -u -r1.21 ddp_usrreq.c --- netatalk/ddp_usrreq.c 2001/09/12 08:37:52 1.21 +++ netatalk/ddp_usrreq.c 2001/11/15 00:00:03 @@ -441,7 +441,7 @@ { soisdisconnected( so ); so->so_pcb = 0; - sofree( so ); + sotryfree(so); /* remove ddp from ddp_ports list */ if ( ddp->ddp_lsat.sat_port != ATADDR_ANYPORT && Index: netatm/atm_socket.c =================================================================== RCS file: /home/ncvs/src/sys/netatm/atm_socket.c,v retrieving revision 1.8 diff -u -r1.8 atm_socket.c --- netatm/atm_socket.c 2000/12/07 22:19:04 1.8 +++ netatm/atm_socket.c 2001/11/14 23:58:01 @@ -176,7 +176,7 @@ * Break links and free control blocks */ so->so_pcb = NULL; - sofree(so); + sotryfree(so); atm_free((caddr_t)atp); Index: netinet/in_pcb.c =================================================================== RCS file: /home/ncvs/src/sys/netinet/in_pcb.c,v retrieving revision 1.92 diff -u -r1.92 in_pcb.c --- netinet/in_pcb.c 2001/11/06 00:48:01 1.92 +++ netinet/in_pcb.c 2001/11/14 23:58:11 @@ -563,7 +563,7 @@ inp->inp_gencnt = ++ipi->ipi_gencnt; in_pcbremlists(inp); so->so_pcb = 0; - sofree(so); + sotryfree(so); if (inp->inp_options) (void)m_free(inp->inp_options); if (rt) { Index: 
netinet6/in6_pcb.c =================================================================== RCS file: /home/ncvs/src/sys/netinet6/in6_pcb.c,v retrieving revision 1.21 diff -u -r1.21 in6_pcb.c --- netinet6/in6_pcb.c 2001/10/17 18:07:05 1.21 +++ netinet6/in6_pcb.c 2001/11/14 23:58:15 @@ -606,7 +606,7 @@ inp->inp_gencnt = ++ipi->ipi_gencnt; in_pcbremlists(inp); sotoinpcb(so) = 0; - sofree(so); + sotryfree(so); if (inp->in6p_options) m_freem(inp->in6p_options); Index: netipx/ipx_pcb.c =================================================================== RCS file: /home/ncvs/src/sys/netipx/ipx_pcb.c,v retrieving revision 1.21 diff -u -r1.21 ipx_pcb.c --- netipx/ipx_pcb.c 2001/09/12 08:37:56 1.21 +++ netipx/ipx_pcb.c 2001/11/14 23:58:22 @@ -268,7 +268,7 @@ struct socket *so = ipxp->ipxp_socket; so->so_pcb = 0; - sofree(so); + sotryfree(so); if (ipxp->ipxp_route.ro_rt != NULL) rtfree(ipxp->ipxp_route.ro_rt); remque(ipxp); Index: netipx/ipx_usrreq.c =================================================================== RCS file: /home/ncvs/src/sys/netipx/ipx_usrreq.c,v retrieving revision 1.29 diff -u -r1.29 ipx_usrreq.c --- netipx/ipx_usrreq.c 2001/09/12 08:37:56 1.29 +++ netipx/ipx_usrreq.c 2001/11/14 23:58:25 @@ -426,7 +426,7 @@ s = splnet(); ipx_pcbdetach(ipxp); splx(s); - sofree(so); + sotryfree(so); soisdisconnected(so); return (0); } Index: netnatm/natm.c =================================================================== RCS file: /home/ncvs/src/sys/netnatm/natm.c,v retrieving revision 1.13 diff -u -r1.13 natm.c --- netnatm/natm.c 2001/04/05 04:20:48 1.13 +++ netnatm/natm.c 2001/11/14 23:58:41 @@ -133,7 +133,7 @@ */ npcb_free(npcb, NPCB_DESTROY); /* drain */ so->so_pcb = NULL; - sofree(so); + sotryfree(so); out: splx(s); return (error); @@ -481,7 +481,7 @@ npcb_free(npcb, NPCB_DESTROY); /* drain */ so->so_pcb = NULL; - sofree(so); + sotryfree(so); break; Index: netns/idp_usrreq.c =================================================================== RCS file: 
/home/ncvs/src/sys/netns/idp_usrreq.c,v retrieving revision 1.9 diff -u -r1.9 idp_usrreq.c --- netns/idp_usrreq.c 1999/08/28 00:49:47 1.9 +++ netns/idp_usrreq.c 2001/11/14 23:58:57 @@ -491,8 +491,8 @@ case PRU_ABORT: ns_pcbdetach(nsp); - sofree(so); - soisdisconnected(so); + sotryfree(so); + soisdisconnected(so); /* XXX huh, called after sofree()? */ break; case PRU_SOCKADDR: Index: netns/ns_pcb.c =================================================================== RCS file: /home/ncvs/src/sys/netns/ns_pcb.c,v retrieving revision 1.9 diff -u -r1.9 ns_pcb.c --- netns/ns_pcb.c 1999/08/28 00:49:51 1.9 +++ netns/ns_pcb.c 2001/11/14 23:59:03 @@ -232,7 +232,7 @@ struct socket *so = nsp->nsp_socket; so->so_pcb = 0; - sofree(so); + sotryfree(so); if (nsp->nsp_route.ro_rt) rtfree(nsp->nsp_route.ro_rt); remque(nsp); Index: nfsserver/nfs_syscalls.c =================================================================== RCS file: /home/ncvs/src/sys/nfsserver/nfs_syscalls.c,v retrieving revision 1.72 diff -u -r1.72 nfs_syscalls.c --- nfsserver/nfs_syscalls.c 2001/09/28 04:37:08 1.72 +++ nfsserver/nfs_syscalls.c 2001/11/14 22:30:42 @@ -143,9 +143,12 @@ error = copyin(uap->argp, (caddr_t)&nfsdarg, sizeof(nfsdarg)); if (error) goto done2; - error = holdsock(td->td_proc->p_fd, nfsdarg.sock, &fp); - if (error) + if ((error = fget(td, nfsdarg.sock, &fp)) != 0) goto done2; + if (fp->f_type != DTYPE_SOCKET) { + fdrop(fp, td); + goto done2; + } /* * Get the client address for connected sockets. */ Index: sys/file.h =================================================================== RCS file: /home/ncvs/src/sys/sys/file.h,v retrieving revision 1.32 diff -u -r1.32 file.h --- sys/file.h 2001/11/14 06:30:36 1.32 +++ sys/file.h 2001/11/14 21:57:21 @@ -50,6 +50,7 @@ struct uio; struct knote; struct vnode; +struct socket; /* * Kernel descriptor table. 
@@ -118,6 +119,9 @@ int fgetvp __P((struct thread *td, int fd, struct vnode **vpp)); int fgetvp_read __P((struct thread *td, int fd, struct vnode **vpp)); int fgetvp_write __P((struct thread *td, int fd, struct vnode **vpp)); + +int fgetsock __P((struct thread *td, int fd, struct socket **spp, u_int *fflagp)); +void fputsock __P((struct socket *sp)); static __inline void fhold(fp) Index: sys/socketvar.h =================================================================== RCS file: /home/ncvs/src/sys/sys/socketvar.h,v retrieving revision 1.63 diff -u -r1.63 socketvar.h --- sys/socketvar.h 2001/10/25 02:03:37 1.63 +++ sys/socketvar.h 2001/11/15 00:07:07 @@ -38,6 +38,7 @@ #define _SYS_SOCKETVAR_H_ #include /* for TAILQ macros */ +#include /* SX locks */ #include /* for struct selinfo */ /* @@ -52,6 +53,7 @@ struct socket { struct vm_zone *so_zone; /* zone we were allocated from */ + int so_count; /* reference count */ short so_type; /* generic type, see socket.h */ short so_options; /* from socket call, see socket.h */ short so_linger; /* time to linger while closing */ @@ -244,6 +246,24 @@ } \ } +/* + * soref()/sorele() ref-count the socket structure. Note that you must + * still explicitly close the socket, but the last ref count will free + * the structure. 
+ */
+
+#define soref(so)	++so->so_count
+
+#define sorele(so) do {				\
+	if (--so->so_count == 0)\
+		sofree(so);			\
+	} while (0)
+
+#define sotryfree(so) do {			\
+	if (so->so_count == 0)	\
+		sofree(so);			\
+	} while(0)
+
 #define sorwakeup(so) do {					\
 	if (sb_notify(&(so)->so_rcv))				\
 		sowakeup((so), &(so)->so_rcv);			\
@@ -360,7 +380,7 @@
 int soconnect2 __P((struct socket *so1, struct socket *so2));
 int socreate __P((int dom, struct socket **aso, int type, int proto,
 	struct thread *td));
-void sodealloc __P((struct socket *so));
+/*void sodealloc __P((struct socket *so));*/
 int sodisconnect __P((struct socket *so));
 void sofree __P((struct socket *so));
 int sogetopt __P((struct socket *so, struct sockopt *sopt));

From owner-freebsd-arch Wed Nov 14 17:20:16 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9])
	by hub.freebsd.org (Postfix) with ESMTP id 358B937B417
	for ; Wed, 14 Nov 2001 17:20:12 -0800 (PST)
Received: from localhost (localhost.elischer.org [127.0.0.1])
	by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id RAA05599;
	Wed, 14 Nov 2001 17:17:02 -0800 (PST)
Date: Wed, 14 Nov 2001 17:17:00 -0800 (PST)
From: Julian Elischer
To: Matthew Dillon
Cc: freebsd-arch@freebsd.org
Subject: Re: Need review - patch for socket locking and ref counting
In-Reply-To: <200111150015.fAF0Flb09186@apollo.backplane.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

how does it cope with the old
"unix domain socket being passed across itself" case?

(I'm guessing it's references on the pcb that are tricky there and not
references on the sockets)

On Wed, 14 Nov 2001, Matthew Dillon wrote:

>     This patch adds a reference count to the socket structure
>     and cleans up & encapsulates the API calls.  I do not yet
>     attempt to use sxlocks to lock the socket structure (to allow
>     us to multi-thread the network stack), but that is the
>     direction I am headed.
>

From owner-freebsd-arch Wed Nov 14 17:22:11 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from elvis.mu.org (elvis.mu.org [216.33.66.196])
	by hub.freebsd.org (Postfix) with ESMTP id 36BFB37B416
	for ; Wed, 14 Nov 2001 17:22:09 -0800 (PST)
Received: by elvis.mu.org (Postfix, from userid 1192)
	id DFA1381D05; Wed, 14 Nov 2001 19:22:03 -0600 (CST)
Date: Wed, 14 Nov 2001 19:22:03 -0600
From: Alfred Perlstein
To: Julian Elischer
Cc: Matthew Dillon , freebsd-arch@freebsd.org
Subject: Re: Need review - patch for socket locking and ref counting
Message-ID: <20011114192203.H13393@elvis.mu.org>
References: <200111150015.fAF0Flb09186@apollo.backplane.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: ; from julian@elischer.org on Wed, Nov 14, 2001 at 05:17:00PM -0800
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

> On Wed, 14 Nov 2001, Matthew Dillon wrote:
>
> >     This patch adds a reference count to the socket structure
> >     and cleans up & encapsulates the API calls.  I do not yet
> >     attempt to use sxlocks to lock the socket structure (to allow
> >     us to multi-thread the network stack), but that is the
> >     direction I am headed.

* Julian Elischer [011114 19:20] wrote:
> how does it cope with the old
> "unix domain socket being passed across itself" case?
> > > (I'm guessing it's references on the pcb that are tricky there and not > references on the sockets) That's handled in the "struct file" handling code. -- -Alfred Perlstein [alfred@freebsd.org] 'Instead of asking why a piece of software is using "1970s technology," start asking why software is ignoring 30 years of accumulated wisdom.' http://www.morons.org/rants/gpl-harmful.php3 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 14 18:59:52 2001 Delivered-To: freebsd-arch@freebsd.org Received: from monorchid.lemis.com (monorchid.lemis.com [192.109.197.75]) by hub.freebsd.org (Postfix) with ESMTP id 0090637B417; Wed, 14 Nov 2001 18:59:48 -0800 (PST) Received: by monorchid.lemis.com (Postfix, from userid 1004) id EC3FF786E1; Thu, 15 Nov 2001 13:29:45 +1030 (CST) Date: Thu, 15 Nov 2001 13:29:45 +1030 From: Greg Lehey To: Bruce Evans , Matthew Dillon Cc: Peter Wemm , Robert Watson , freebsd-arch@FreeBSD.ORG Subject: Re: cur{thread/proc}, or not. 
Message-ID: <20011115132945.C33267@monorchid.lemis.com> References: <20011112165530.B34657-100000@delplex.bde.org> <200111121009.fACA9SI75024@apollo.backplane.com> <20011111191735.00D053807@overcee.netplex.com.au> <20011112165530.B34657-100000@delplex.bde.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200111121009.fACA9SI75024@apollo.backplane.com> <20011112165530.B34657-100000@delplex.bde.org> User-Agent: Mutt/1.3.23i Organization: The FreeBSD Project Phone: +61-8-8388-8286 Fax: +61-8-8388-8725 Mobile: +61-418-838-708 WWW-Home-Page: http://www.FreeBSD.org/ X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF 13 24 52 F8 6D A4 95 EF Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Monday, 12 November 2001 at 17:32:12 +1100, Bruce Evans wrote: > On Sun, 11 Nov 2001, Peter Wemm wrote: > >> Robert Watson wrote: >> >>> It seems to me that unless a very strong argument exists against using >>> curproc/curthread (and I don't preclude one existing), using them would >>> actually be an improvement, as it would assert that this class of > >> My gripe is that on i386, it creates a LOT of work for the compiler. > > That's just an implementation detail for one arch. I did strongly object > to the implementation, but... I must say that I don't have much sympathy for the compiler. If it also creates a lot of work for the processors, that's a different matter. >> Count me in the 'curproc considered harmful' camp. (or curthread). > > Count me ouside of it. Agreed (for once). On Monday, 12 November 2001 at 2:09:28 -0800, Matthew Dillon wrote: >> Passing the pointer down through 20 subroutines (some of which don't >> even use it except to pass it along) may add up to much. >> >> Bruce > > I agree that it is kind of silly to pass a global down through N levels > of procedures. Just on principle. 
On the otherhand I don't expect > the performance to be better or worse, or even for there to be any > real difference in code size. Fewer instructions per routine in > more routines, with more memory writes (pass as argument on stack), > verses more instructions in fewer routines, with only memory reads > (access as global). Without there being a clear winner there isn't > much of a reason to change the existing code. OK, I've just got back from a conference to find several thousand messages, many of them requiring to be read, so I haven't had much time to look at this, but wouldn't it make more sense to pass the proc or thread pointer (or whatever substructure is really needed) in a structure which is being handed from function to function anyway? struct buf would appear to be the correct one in this case. I would also expect this to make it easier for exceptions like NFS code. Greg -- See complete headers for address and phone numbers To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 14 19:44:47 2001 Delivered-To: freebsd-arch@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id DEDFE37B41A for ; Wed, 14 Nov 2001 19:44:38 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fAF3ibT11896; Wed, 14 Nov 2001 19:44:37 -0800 (PST) (envelope-from dillon) Date: Wed, 14 Nov 2001 19:44:37 -0800 (PST) From: Matthew Dillon Message-Id: <200111150344.fAF3ibT11896@apollo.backplane.com> To: Alfred Perlstein Cc: Julian Elischer , freebsd-arch@FreeBSD.ORG Subject: Re: Need review - patch for socket locking and ref counting References: <200111150015.fAF0Flb09186@apollo.backplane.com> <20011114192203.H13393@elvis.mu.org> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: 
X-Loop: FreeBSD.ORG :* Julian Elischer [011114 19:20] wrote: :> how does it cope with the old :> "unix domain socket being passed across itself" case? :> :> :> (I'm guessing it's references on the pcb that are tricky there and not :> references on the sockets) : :That's handled in the "struct file" handling code. : :-- :-Alfred Perlstein [alfred@freebsd.org] Yah. Hopefully I will never have to touch the GC code. Again. What this stuff is (and by the way, don't bother trying to test it, I haven't tested it myself yet)... what this stuff is, basically, is the infrastructure that we will be building the MP locking system for the network stack on top of. Among other things. -Matt Matthew Dillon To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 14 21:11: 4 2001 Delivered-To: freebsd-arch@freebsd.org Received: from elvis.mu.org (elvis.mu.org [216.33.66.196]) by hub.freebsd.org (Postfix) with ESMTP id 8F88E37B41B; Wed, 14 Nov 2001 21:11:02 -0800 (PST) Received: by elvis.mu.org (Postfix, from userid 1192) id 30ECC81D01; Wed, 14 Nov 2001 23:10:57 -0600 (CST) Date: Wed, 14 Nov 2001 23:10:57 -0600 From: Alfred Perlstein To: Greg Lehey Cc: Bruce Evans , Matthew Dillon , Peter Wemm , Robert Watson , freebsd-arch@FreeBSD.ORG Subject: Re: cur{thread/proc}, or not.
Message-ID: <20011114231057.K13393@elvis.mu.org> References: <20011112165530.B34657-100000@delplex.bde.org> <200111121009.fACA9SI75024@apollo.backplane.com> <20011111191735.00D053807@overcee.netplex.com.au> <20011112165530.B34657-100000@delplex.bde.org> <200111121009.fACA9SI75024@apollo.backplane.com> <20011115132945.C33267@monorchid.lemis.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20011115132945.C33267@monorchid.lemis.com>; from grog@FreeBSD.org on Thu, Nov 15, 2001 at 01:29:45PM +1030 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG If you want to see why curproc sucks then please investigate what happens when you NDINIT a nameidata with another thread pointer other than your own, then perform a vn_open. kablooey! My recent addition of vn_open_cred and modification of nfs_lock.c was to get around this badness of the API. -- -Alfred Perlstein [alfred@freebsd.org] 'Instead of asking why a piece of software is using "1970s technology," start asking why software is ignoring 30 years of accumulated wisdom.' 
http://www.morons.org/rants/gpl-harmful.php3 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 14 21:18:25 2001 Delivered-To: freebsd-arch@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id C3D1137B416; Wed, 14 Nov 2001 21:18:23 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fAF5IMW18730; Wed, 14 Nov 2001 21:18:22 -0800 (PST) (envelope-from dillon) Date: Wed, 14 Nov 2001 21:18:22 -0800 (PST) From: Matthew Dillon Message-Id: <200111150518.fAF5IMW18730@apollo.backplane.com> To: Alfred Perlstein Cc: Greg Lehey , Bruce Evans , Peter Wemm , Robert Watson , freebsd-arch@FreeBSD.ORG Subject: Re: cur{thread/proc}, or not. References: <20011112165530.B34657-100000@delplex.bde.org> <200111121009.fACA9SI75024@apollo.backplane.com> <20011111191735.00D053807@overcee.netplex.com.au> <20011112165530.B34657-100000@delplex.bde.org> <200111121009.fACA9SI75024@apollo.backplane.com> <20011115132945.C33267@monorchid.lemis.com> <20011114231057.K13393@elvis.mu.org> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG :If you want to see why curproc sucks then please investigate what :happens when you NDINIT a nameidata with another thread pointer :other than your own, then perform a vn_open. kablooey! : :My recent addition of vn_open_cred and modification of nfs_lock.c :was to get around this badness of the API. : :-- :-Alfred Perlstein [alfred@freebsd.org] I'm not sure this is a fair argument. Just about all the code in the system taking a struct thread * pointer assumes that the thread is the current thread and so avoid much of the locking that they would normally have to do on it. 
Passing some other thread to a good chunk of this code will have very weird broken results. -Matt Matthew Dillon To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 6: 6:40 2001 Delivered-To: freebsd-arch@freebsd.org Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by hub.freebsd.org (Postfix) with ESMTP id AF1E037B416; Thu, 15 Nov 2001 06:06:36 -0800 (PST) Received: from fledge.watson.org (ak82hjs7hex92j@fledge.pr.watson.org [192.0.2.3]) by fledge.watson.org (8.11.6/8.11.5) with SMTP id fAFE6Li87788; Thu, 15 Nov 2001 09:06:21 -0500 (EST) (envelope-from robert@fledge.watson.org) Date: Thu, 15 Nov 2001 09:06:20 -0500 (EST) From: Robert Watson X-Sender: robert@fledge.watson.org To: Matthew Dillon Cc: Alfred Perlstein , Greg Lehey , Bruce Evans , Peter Wemm , freebsd-arch@FreeBSD.ORG Subject: Re: cur{thread/proc}, or not. In-Reply-To: <200111150518.fAF5IMW18730@apollo.backplane.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Wed, 14 Nov 2001, Matthew Dillon wrote: > > :If you want to see why curproc sucks then please investigate what > :happens when you NDINIT a nameidata with another thread pointer > :other than your own, then perform a vn_open. kablooey! > : > :My recent addition of vn_open_cred and modification of nfs_lock.c > :was to get around this badness of the API. > : > :-- > :-Alfred Perlstein [alfred@freebsd.org] > > I'm not sure this is a fair argument. Just about all the code > in the system taking a struct thread * pointer assumes that the > thread is the current thread and so avoid much of the locking that > they would normally have to do on it. 
Passing some other thread > to a good chunk of this code will have very weird broken results. In my mind, that is in fact the primary argument *to* use curproc instead of passing around process and thread pointers. If the routine implicitly assumes curproc or curthread for locking/referencing purposes, either there needs to be a way to assert that: int foobar(struct thread *td, int arg) { PROMISE_ME_ITS_CURTHREAD_OR_DIE_HORRIBLY(td); arg += td->td_only_safe_to_read_without_lock_if_curthread; /* * Contrived example a little less contrived: * return (td->td_ucred->cr_uid == arg); */ return (arg); } or, we simply need to use curthread and curproc, and not allow anything else to be passed in. int foobar(int arg) { arg += curthread->td_only_safe_to_read_without_lock_if_curthread; return (arg); } Robert N M Watson FreeBSD Core Team, TrustedBSD Project robert@fledge.watson.org NAI Labs, Safeport Network Services To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 6:14:50 2001 Delivered-To: freebsd-arch@freebsd.org Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31]) by hub.freebsd.org (Postfix) with ESMTP id 8661537B405; Thu, 15 Nov 2001 06:14:45 -0800 (PST) Received: by flood.ping.uio.no (Postfix, from userid 2602) id D3A9614C40; Thu, 15 Nov 2001 15:14:41 +0100 (CET) X-URL: http://www.ofug.org/~des/ X-Disclaimer: The views expressed in this message do not necessarily coincide with those of any organisation or company with which I am or have been affiliated. To: Robert Watson Cc: Matthew Dillon , Alfred Perlstein , Greg Lehey , Bruce Evans , Peter Wemm , freebsd-arch@FreeBSD.ORG Subject: Re: cur{thread/proc}, or not. 
References: From: Dag-Erling Smorgrav Date: 15 Nov 2001 15:14:41 +0100 In-Reply-To: Message-ID: Lines: 16 User-Agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.7 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Robert Watson writes: > In my mind, that is in fact the primary argument *to* use curproc instead > of passing around process and thread pointers. If the routine implicitly > assumes curproc or curthread for locking/referencing purposes, either > there needs to be a way to assert that: > [example of PROMISE_ME_ITS_CURTHREAD_OR_DIE_HORRIBLY(td)] > or, we simply need to use curthread and curproc, and not allow anything > else to be passed in. I greatly prefer the first approach, as it allows us to gradually fix parts of the kernel to be curthread-agnostic without the hassle and breakage that inevitably follow from massive API changes. DES -- Dag-Erling Smorgrav - des@ofug.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 10:54:58 2001 Delivered-To: freebsd-arch@freebsd.org Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by hub.freebsd.org (Postfix) with ESMTP id 560BB37B405; Thu, 15 Nov 2001 10:54:50 -0800 (PST) Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3]) by fledge.watson.org (8.11.6/8.11.5) with SMTP id fAFIsPi02832; Thu, 15 Nov 2001 13:54:25 -0500 (EST) (envelope-from robert@fledge.watson.org) Date: Thu, 15 Nov 2001 13:54:25 -0500 (EST) From: Robert Watson X-Sender: robert@fledge.watson.org To: Dag-Erling Smorgrav Cc: Matthew Dillon , Alfred Perlstein , Greg Lehey , Bruce Evans , Peter Wemm , freebsd-arch@FreeBSD.ORG Subject: Re: cur{thread/proc}, or not. 
In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 15 Nov 2001, Dag-Erling Smorgrav wrote: > Robert Watson writes: > > In my mind, that is in fact the primary argument *to* use curproc instead > > of passing around process and thread pointers. If the routine implicitly > > assumes curproc or curthread for locking/referencing purposes, either > > there needs to be a way to assert that: > > [example of PROMISE_ME_ITS_CURTHREAD_OR_DIE_HORRIBLY(td)] > > or, we simply need to use curthread and curproc, and not allow anything > > else to be passed in. > > I greatly prefer the first approach, as it allows us to gradually fix > parts of the kernel to be curthread-agnostic without the hassle and > breakage that inevitably follow from massive API changes. The implicit question behind that, though, is: are there places in the kernel that will always be locked into using curproc/curthread, simply due to the structure and behavior of the kernel environment. For example, I would generally think that 'borrowing' a proc or thread structure is a bad idea, rather, you want that proc or thread to 'loan' you references to supporting ref-counted structures (vmspaces, creds, ...). On a small scale, routines like 'copyin' and 'copyout' already follow the "must use curproc/curthread, so don't bother taking one on the command line" strategy. If we were to assert that a certain class of functions always acted on behalf of the calling thread or process, that's not necessarily bad. It might allow us to substantially simplify locking and reference handling, for example. 
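A minimal sketch of the "loan a reference" idea above, written as plain userspace C rather than kernel code. Everything here is hypothetical: struct cred, cred_hold(), cred_free() and struct work are illustrative stand-ins, not the kernel's struct ucred or its API, and the counter is a plain int where real kernel code would need atomic operations or a lock. The point is only that the worker is handed its own reference to the credential instead of a pointer to the owning proc or thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical ref-counted credential (NOT the kernel's struct ucred). */
struct cred {
	int refcount;	/* plain int for illustration; not SMP-safe */
	int uid;
};

/* Allocate a credential holding one reference, owned by the caller. */
struct cred *
cred_alloc(int uid)
{
	struct cred *cr = malloc(sizeof(*cr));

	if (cr == NULL)
		return (NULL);
	cr->refcount = 1;
	cr->uid = uid;
	return (cr);
}

/* Take an additional reference; this is the "loan". */
struct cred *
cred_hold(struct cred *cr)
{
	cr->refcount++;
	return (cr);
}

/* Drop one reference; returns 1 if the credential was freed. */
int
cred_free(struct cred *cr)
{
	assert(cr->refcount > 0);
	if (--cr->refcount == 0) {
		free(cr);
		return (1);
	}
	return (0);
}

/* Hypothetical delegated work item: it carries a cred, not a thread. */
struct work {
	struct cred *cr;
};

/* The owner gives the worker its own reference at hand-off time. */
void
work_init(struct work *w, struct cred *owner_cred)
{
	w->cr = cred_hold(owner_cred);
}

/* Authorization check that never touches the owner's proc/thread. */
int
work_is_uid(const struct work *w, int uid)
{
	return (w->cr->uid == uid);
}

/* Release the worker's reference; returns 1 if that freed the cred. */
int
work_done(struct work *w)
{
	int freed = cred_free(w->cr);

	w->cr = NULL;
	return (freed);
}
```

With this shape the owner can drop its own reference (or exit) while the work item is still pending, and the credential stays valid until work_done() releases the last reference.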
Robert N M Watson FreeBSD Core Team, TrustedBSD Project robert@fledge.watson.org NAI Labs, Safeport Network Services To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 11:26:53 2001 Delivered-To: freebsd-arch@freebsd.org Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31]) by hub.freebsd.org (Postfix) with ESMTP id DFF5537B405; Thu, 15 Nov 2001 11:26:48 -0800 (PST) Received: by flood.ping.uio.no (Postfix, from userid 2602) id 0558414C2E; Thu, 15 Nov 2001 20:26:46 +0100 (CET) X-URL: http://www.ofug.org/~des/ X-Disclaimer: The views expressed in this message do not necessarily coincide with those of any organisation or company with which I am or have been affiliated. To: Robert Watson Cc: Matthew Dillon , Alfred Perlstein , Greg Lehey , Bruce Evans , Peter Wemm , freebsd-arch@FreeBSD.ORG Subject: Re: cur{thread/proc}, or not. References: From: Dag-Erling Smorgrav Date: 15 Nov 2001 20:26:45 +0100 In-Reply-To: Message-ID: Lines: 34 User-Agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.7 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Robert Watson writes: > The implicit question behind that, though, is: are there places in the > kernel that will always be locked into using curproc/curthread, simply due > to the structure and behavior of the kernel environment. There's a number of cases here: 1) the thread in question is curthread, and it is locked. 2) the thread may be any thread, but it is locked. 3) the thread may be any thread, and is not locked. (am I correct in assuming that curthread is *always* locked in code called from syscalls?) 
In some cases it doesn't make sense to assume anything but 1), because it is the case in 99% of the situations where the code is invoked and assuming 2) or 3) would involve a severe performance penalty for the common case. Copyin() is one example of this; for the rare cases where you need to copy data from a non-current thread's address space (mainly ptrace() and procfs), there is proc_rwmem(). In some cases it is desirable for an API to handle non-current threads. In those cases, it is the responsibility of the API functions to make sure the thread they're manipulating is properly locked. In some cases it is desirable for an API to handle non-current threads, but assume that the thread is locked, to save the overhead of mutex operations. In those cases, the code should be protected by mutex assertions. DES -- Dag-Erling Smorgrav - des@ofug.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 11:29:12 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail11.speakeasy.net (mail11.speakeasy.net [216.254.0.211]) by hub.freebsd.org (Postfix) with ESMTP id 5FB3D37B442 for ; Thu, 15 Nov 2001 11:28:57 -0800 (PST) Received: (qmail 69254 invoked from network); 15 Nov 2001 19:28:56 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail11.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 15 Nov 2001 19:28:56 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Thu, 15 Nov 2001 11:28:56 -0800 (PST) From: John Baldwin To: Dag-Erling Smorgrav Subject: Re: cur{thread/proc}, or not. 
Cc: freebsd-arch@FreeBSD.ORG, Peter Wemm , Bruce Evans , Greg Lehey , Alfred Perlstein , Matthew Dillon , Robert Watson Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 15-Nov-01 Dag-Erling Smorgrav wrote: > Robert Watson writes: >> The implicit question behind that, though, is: are there places in the >> kernel that will always be locked into using curproc/curthread, simply due >> to the structure and behavior of the kernel environment. > > There's a number of cases here: > > 1) the thread in question is curthread, and it is locked. > 2) the thread may be any thread, but it is locked. > 3) the thread may be any thread, and is not locked. > > (am I correct in assuming that curthread is *always* locked in code > called from syscalls?) Err, no. curthread doesn't even have a lock. Look at sys/proc.h. There are some fields we don't use any locks on, because we assume that only curthread messes with its own copy, or some such. -- John Baldwin <>< http://www.FreeBSD.org/~jhb/ "Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 11:52:28 2001 Delivered-To: freebsd-arch@freebsd.org Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31]) by hub.freebsd.org (Postfix) with ESMTP id C509137B421; Thu, 15 Nov 2001 11:52:21 -0800 (PST) Received: by flood.ping.uio.no (Postfix, from userid 2602) id 2B99114C2E; Thu, 15 Nov 2001 20:52:20 +0100 (CET) X-URL: http://www.ofug.org/~des/ X-Disclaimer: The views expressed in this message do not necessarily coincide with those of any organisation or company with which I am or have been affiliated. 
To: John Baldwin Cc: freebsd-arch@FreeBSD.ORG, Peter Wemm , Bruce Evans , Greg Lehey , Alfred Perlstein , Matthew Dillon , Robert Watson Subject: Re: cur{thread/proc}, or not. References: From: Dag-Erling Smorgrav Date: 15 Nov 2001 20:52:19 +0100 In-Reply-To: Message-ID: Lines: 10 User-Agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.7 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG John Baldwin writes: > Err, no. curthread doesn't even have a lock. Look at sys/proc.h. There are > some fields we don't use any locks on, because we assume that only curthread > messes with its own copy, or some such. Hmm, then you need to lock the entire process, don't you? DES -- Dag-Erling Smorgrav - des@ofug.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 12: 2: 5 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail6.speakeasy.net (mail6.speakeasy.net [216.254.0.206]) by hub.freebsd.org (Postfix) with ESMTP id E337137B428 for ; Thu, 15 Nov 2001 12:01:51 -0800 (PST) Received: (qmail 11881 invoked from network); 15 Nov 2001 20:01:22 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail6.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 15 Nov 2001 20:01:22 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Thu, 15 Nov 2001 12:01:50 -0800 (PST) From: John Baldwin To: Dag-Erling Smorgrav Subject: Re: cur{thread/proc}, or not. 
Cc: Robert Watson , Cc: Robert Watson , Matthew Dillon , Alfred Perlstein , Greg Lehey , Bruce Evans , Peter Wemm , freebsd-arch@FreeBSD.ORG Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 15-Nov-01 Dag-Erling Smorgrav wrote: > John Baldwin writes: >> Err, no. curthread doesn't even have a lock. Look at sys/proc.h. There >> are >> some fields we don't use any locks on, because we assume that only curthread >> messes with its own copy, or some such. > > Hmm, then you need to lock the entire process, don't you? Only for certain things. We don't actually lock the process unless we need to down inside of a syscall. > DES > -- > Dag-Erling Smorgrav - des@ofug.org -- John Baldwin <>< http://www.FreeBSD.org/~jhb/ "Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 12:11:40 2001 Delivered-To: freebsd-arch@freebsd.org Received: from peter3.wemm.org (c1315225-a.plstn1.sfba.home.com [24.14.150.180]) by hub.freebsd.org (Postfix) with ESMTP id 6748037B419 for ; Thu, 15 Nov 2001 12:11:17 -0800 (PST) Received: from overcee.netplex.com.au (overcee.wemm.org [10.0.0.3]) by peter3.wemm.org (8.11.0/8.11.0) with ESMTP id fAFKBHM24661 for ; Thu, 15 Nov 2001 12:11:17 -0800 (PST) (envelope-from peter@wemm.org) Received: from wemm.org (localhost [127.0.0.1]) by overcee.netplex.com.au (Postfix) with ESMTP id 34216380A; Thu, 15 Nov 2001 12:11:17 -0800 (PST) (envelope-from peter@wemm.org) X-Mailer: exmh version 2.5 07/13/2001 with nmh-1.0.4 To: Matthew Dillon Cc: freebsd-arch@FreeBSD.ORG Subject: Re: Need review - patch for socket locking and ref counting In-Reply-To: <200111150015.fAF0Flb09186@apollo.backplane.com> Date: Thu, 15 Nov 2001 12:11:17 -0800 From: Peter Wemm Message-Id: 
<20011115201117.34216380A@overcee.netplex.com.au> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Matthew Dillon wrote: > +static __inline > +struct mtx * > +_mtx_pool1_find(void *ptr) > +{ > + return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 0 ]); > +} At the very least, this is not going to compile very well on 64 bit machines. You cannot cast a pointer to an int. It needs to be uintptr_t at minimum. Cheers, -Peter -- Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au "All of this is for nothing if we don't go to the stars" - JMS/B5 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 12:24:12 2001 Delivered-To: freebsd-arch@freebsd.org Received: from pintail.mail.pas.earthlink.net (pintail.mail.pas.earthlink.net [207.217.120.122]) by hub.freebsd.org (Postfix) with ESMTP id 3C58E37B417; Thu, 15 Nov 2001 12:24:06 -0800 (PST) Received: from dialup-209.245.139.20.dial1.sanjose1.level3.net ([209.245.139.20] helo=mindspring.com) by pintail.mail.pas.earthlink.net with esmtp (Exim 3.33 #1) id 164T3M-0005mO-00; Thu, 15 Nov 2001 12:24:05 -0800 Message-ID: <3BF4248D.1735C282@mindspring.com> Date: Thu, 15 Nov 2001 12:24:45 -0800 From: Terry Lambert Reply-To: tlambert2@mindspring.com X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony} (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Robert Watson Cc: Dag-Erling Smorgrav , Matthew Dillon , Alfred Perlstein , Greg Lehey , Bruce Evans , Peter Wemm , freebsd-arch@FreeBSD.ORG Subject: Re: cur{thread/proc}, or not.
References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Robert Watson wrote: > The implicit question behind that, though, is: are there places in the > kernel that will always be locked into using curproc/curthread, simply due > to the structure and behavior of the kernel environment. For example, I > would generally think that 'borrowing' a proc or thread structure is a bad > idea, rather, you want that proc or thread to 'loan' you references to > supporting ref-counted structures (vmspaces, creds, ...). On a small > scale, routines like 'copyin' and 'copyout' already follow the "must use > curproc/curthread, so don't bother taking one on the command line" > strategy. If we were to assert that a certain class of functions always > acted on behalf of the calling thread or process, that's not necessarily > bad. It might allow us to substantially simplify locking and reference > handling, for example. Regardless of how many angels can dance on this pin, it would be a good idea to document lock assumptions in and out of all functions, using both comments and assert().
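One way to read the suggestion above, as a small userspace sketch rather than kernel code. The toy_lock type and the TOY_LOCK_ASSERT_HELD macro are hypothetical stand-ins invented for illustration (a kernel version would presumably use the real mutex assertion facilities instead), and the lock is a bare flag with no actual mutual exclusion. What it shows is the pairing: each function states its lock assumption in a comment and enforces it with an assert.

```c
#include <assert.h>

/* Toy lock: a flag only, enough to make the assertions checkable. */
struct toy_lock {
	int held;
};

void
toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);	/* no recursion in this sketch */
	l->held = 1;
}

void
toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Hypothetical assertion macro documenting "caller holds the lock". */
#define	TOY_LOCK_ASSERT_HELD(l)	assert((l)->held)

struct counter {
	struct toy_lock lock;
	int value;
};

/*
 * Locking: c->lock must be held on entry and is still held on return.
 */
int
counter_bump_locked(struct counter *c)
{
	TOY_LOCK_ASSERT_HELD(&c->lock);
	return (++c->value);
}

/*
 * Locking: c->lock must NOT be held on entry; it is acquired and
 * released internally.
 */
int
counter_bump(struct counter *c)
{
	int v;

	toy_lock_acquire(&c->lock);
	v = counter_bump_locked(c);
	toy_lock_release(&c->lock);
	return (v);
}
```

Calling counter_bump_locked() without the lock trips the assertion immediately, so a violated assumption fails at the call site instead of surfacing later as corrupted state.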
-- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 12:46: 4 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail12.speakeasy.net (mail12.speakeasy.net [216.254.0.212]) by hub.freebsd.org (Postfix) with ESMTP id D2A3737B419 for ; Thu, 15 Nov 2001 12:46:00 -0800 (PST) Received: (qmail 2677 invoked from network); 15 Nov 2001 20:46:00 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail12.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 15 Nov 2001 20:46:00 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <20011115201117.34216380A@overcee.netplex.com.au> Date: Thu, 15 Nov 2001 12:45:57 -0800 (PST) From: John Baldwin To: Peter Wemm Subject: Re: Need review - patch for socket locking and ref counting Cc: freebsd-arch@FreeBSD.ORG, Matthew Dillon Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 15-Nov-01 Peter Wemm wrote: > Matthew Dillon wrote: > >> +static __inline >> +struct mtx * >> +_mtx_pool1_find(void *ptr) >> +{ >> + return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 0 ]); >> +} > > At the very least, this is not going to compile very well on 64 bit machines. > You cannot cast a pointer to an int. It needs to be uintptr_t at minimum. I would also prefer a generic mechanism for multiple pools, with a struct mtx_pool containing a count, an index for alloc, and a pointer to the array of locks, passed as the first arg to mtx_pool_foo(). This would also entail a mtx_pool_init(struct mtx_pool *mp, int size); and a mtx_pool_destroy(struct mtx_pool *mp); This is much cleaner and more extensible than hardcoding 4 pools of equal size.
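A rough userspace sketch of what the two review comments above might add up to: a pool object with a count, an allocation index, and a pointer to the lock array, plus a lookup that hashes the address through uintptr_t instead of int. All names here (struct mtx_pool_sketch, struct toy_mtx, and the mtx_pool_sketch_* functions) are hypothetical and are not the interface that was eventually committed; the power-of-two size restriction is an assumption made so the hash can use a mask, and the shift by 6 simply mirrors the quoted patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Placeholder lock type; a real pool would hold struct mtx. */
struct toy_mtx {
	int locked;
};

/* Hypothetical pool descriptor: count, alloc index, lock array. */
struct mtx_pool_sketch {
	size_t count;		/* number of locks; power of two (assumed) */
	size_t next;		/* round-robin index used by alloc */
	struct toy_mtx *locks;	/* array of 'count' locks */
};

/* Returns 0 on success, -1 on a bad size or allocation failure. */
int
mtx_pool_sketch_init(struct mtx_pool_sketch *mp, size_t size)
{
	if (size == 0 || (size & (size - 1)) != 0)
		return (-1);
	mp->locks = calloc(size, sizeof(*mp->locks));
	if (mp->locks == NULL)
		return (-1);
	mp->count = size;
	mp->next = 0;
	return (0);
}

void
mtx_pool_sketch_destroy(struct mtx_pool_sketch *mp)
{
	free(mp->locks);
	mp->locks = NULL;
	mp->count = 0;
	mp->next = 0;
}

/*
 * Map an arbitrary address to one lock in the pool.  The cast goes
 * through uintptr_t, so no pointer bits are lost on 64-bit targets.
 */
struct toy_mtx *
mtx_pool_sketch_find(struct mtx_pool_sketch *mp, const void *ptr)
{
	uintptr_t p = (uintptr_t)ptr;

	return (&mp->locks[(p ^ (p >> 6)) & (mp->count - 1)]);
}

/* Hand out locks round-robin for callers that keep the pointer. */
struct toy_mtx *
mtx_pool_sketch_alloc(struct mtx_pool_sketch *mp)
{
	return (&mp->locks[mp->next++ & (mp->count - 1)]);
}
```

Because the pool is passed explicitly, a subsystem can create pools of whatever size suits it instead of sharing a fixed set of equally sized global arrays.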
-- John Baldwin <>< http://www.FreeBSD.org/~jhb/ "Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 14: 0:31 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mta04.onebox.com (mta04.onebox.com [64.68.77.147]) by hub.freebsd.org (Postfix) with ESMTP id 74E2237B41D for ; Thu, 15 Nov 2001 14:00:26 -0800 (PST) Received: from onebox.com ([10.1.111.7]) by mta04.onebox.com (InterMail vM.4.01.03.23 201-229-121-123-20010418) with SMTP id <20011115220026.SYKD12575.mta04.onebox.com@onebox.com> for ; Thu, 15 Nov 2001 14:00:26 -0800 Received: from [63.49.208.149] by onebox.com with HTTP; Thu, 15 Nov 2001 14:00:26 -0800 Date: Thu, 15 Nov 2001 14:00:26 -0800 Subject: KSE Mail-List Archive Summary From: "Glenn Gombert" To: arch@FreeBSD.ORG Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit MIME-Version: 1.0 Message-Id: <20011115220026.SYKD12575.mta04.onebox.com@onebox.com> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG I put together a summary of some of the important KSE & Mutex discussions (threads) from the last few months on my freebsd web site at "freebsd.imatowns.com". I mainly did it for my own reference (I did not try to include everything, only the major themes and topics covered), but thought that others might find it useful as well. -- Glenn Gombert glenngombert@onebox.com - email (513) 587-2643 x2263 - voicemail/fax __________________________________________________ FREE voicemail, email, and fax...all in one place. Sign Up Now!
http://www.onebox.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 14:53:55 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mta08.onebox.com (mta08.onebox.com [64.68.76.143]) by hub.freebsd.org (Postfix) with ESMTP id 6E4C537B417 for ; Thu, 15 Nov 2001 14:53:53 -0800 (PST) Received: from onebox.com ([10.1.101.5]) by mta08.onebox.com (InterMail vM.4.01.03.23 201-229-121-123-20010418) with SMTP id <20011115225353.UUBN16107.mta08.onebox.com@onebox.com>; Thu, 15 Nov 2001 14:53:53 -0800 Received: from [165.121.195.182] by onebox.com with HTTP; Thu, 15 Nov 2001 14:53:53 -0800 Date: Thu, 15 Nov 2001 14:53:53 -0800 Subject: Re: KSE Mail-List Archive Summary From: "Glenn Gombert" To: sandeepj@research.bell-labs.com (Sandeep Joshi) Cc: glenngombert@onebox.com Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit MIME-Version: 1.0 Message-Id: <20011115225353.UUBN16107.mta08.onebox.com@onebox.com> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Sorry...this should now be fixed....I had to do it on my 'Windoz' machine at work :) -- Glenn Gombert glenngombert@onebox.com - email (513) 587-2643 x2263 - voicemail/fax ---- sandeepj@research.bell-labs.com (Sandeep Joshi) wrote: > hello Glen, > > This link doesnt work from Unix-based navigators > since there is a space in the link > > http://freebsd.imatowns.com/BSD KSE-Mail Summary.txt > > -A passive observer > __________________________________________________ FREE voicemail, email, and fax...all in one place. Sign Up Now! 
http://www.onebox.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 15 16: 0:20 2001 Delivered-To: freebsd-arch@freebsd.org Received: from InterJet.elischer.org (c421509-a.pinol1.sfba.home.com [24.7.86.9]) by hub.freebsd.org (Postfix) with ESMTP id 7EED737B41A for ; Thu, 15 Nov 2001 16:00:17 -0800 (PST) Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id PAA10446; Thu, 15 Nov 2001 15:55:59 -0800 (PST) Date: Thu, 15 Nov 2001 15:55:58 -0800 (PST) From: Julian Elischer To: Glenn Gombert Cc: arch@FreeBSD.ORG Subject: Re: KSE Mail-List Archive Summary In-Reply-To: <20011115220026.SYKD12575.mta04.onebox.com@onebox.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG thanks!... On Thu, 15 Nov 2001, Glenn Gombert wrote: > > I put together a summary of some of the important KSE & Mutex discussions > (threads) from the last few months on my freebsd web site at > "freebsd.imatowns.com" .. I mainly did it for my own reference..(I did > not try and include everything but the major themes and topics covered) > but thought that others might find them useful as well.. > > -- > Glenn Gombert > glenngombert@onebox.com - email > (513) 587-2643 x2263 - voicemail/fax > > > > __________________________________________________ > FREE voicemail, email, and fax...all in one place. > Sign Up Now! 
http://www.onebox.com > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-arch" in the body of the message > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Fri Nov 16 19: 0:43 2001 Delivered-To: freebsd-arch@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 9E05F37B419; Fri, 16 Nov 2001 19:00:40 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fAH30dv75857; Fri, 16 Nov 2001 19:00:39 -0800 (PST) (envelope-from dillon) Date: Fri, 16 Nov 2001 19:00:39 -0800 (PST) From: Matthew Dillon Message-Id: <200111170300.fAH30dv75857@apollo.backplane.com> To: John Baldwin Cc: Peter Wemm , freebsd-arch@FreeBSD.ORG Subject: Re: Need review - patch for socket locking and ref counting References: Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG I've thought about it a bit and I've come to the conclusion that we should *not* have multiple mutex pools. The single pool we have works wonderfully for interlock operations. For example, the interlocks used inside the sxlock structure and code, and inside the lockmgr structure and code (the lockmgr previously used its own hacked up pool for its interlock). The pool effectively cuts the size and overhead of higher level structures - such as sxlocks - down considerably. But our ability to use pools for higher level constructs, like the sxlocks themselves, is severely limited. My attempts so far have only resulted in more obfuscated code. I think the pool implementation should be left as it is and used ONLY for interlocks and 'leaf' locks, as I originally designed it. 
Adding multiple pools (and the allocation / freeing / management headaches that go along with that) will only create a mess. I don't think it's even possible to use a pool of sx locks safely, for example, even with the multiple pool concept. The current pool code is nice because it simplifies our code base somewhat rather than making it more complex. I see absolutely no need for a multiple-pool mechanism at this time. For similar reasons I believe we should also simplify the APIs to other low level constructs. I would like to simplify the SX lock API (get rid of sx_tryupgrade() and sx_downgrade()), and I would like to see a more simplified structure if possible in order to make SX locks more useful as embedded entities in higher level system structures such as TCP sockets or PCBs. -Matt Matthew Dillon :On 15-Nov-01 Peter Wemm wrote: :> Matthew Dillon wrote: :> :>> +static __inline :>> +struct mtx * :>> +_mtx_pool1_find(void *ptr) :>> +{ :>> + return(&mtx_pool_ary[(((int)ptr ^ ((int)ptr >> 6)) & MTX_POOL_XMASK) | 0]); :>> +} :> :> At the very least, this is not going to compile very well on 64 bit machines. :> You cannot cast a pointer to an int. It needs to be uintptr_t at minimum. : :I would also prefer a generic mechanism for multiple pools with a struct :mtx_pool containing a count, index for alloc, and pointer to the array of :locks and pass it as the first arg to mtx_pool_foo(). This would also entail a :mtx_pool_init(struct mtx_pool *mp, int size); and a :mtx_pool_destroy(struct mtx_pool *mp); This is much cleaner and more extensible :than hardcoding 4 pools of equal size. : :-- : :John Baldwin <>< http://www.FreeBSD.org/~jhb/ :"Power Users Use the Power to Serve!" 
- http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Sat Nov 17 2:24: 0 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail5.speakeasy.net (mail5.speakeasy.net [216.254.0.205]) by hub.freebsd.org (Postfix) with ESMTP id 151B337B416 for ; Sat, 17 Nov 2001 02:23:58 -0800 (PST) Received: (qmail 31836 invoked from network); 17 Nov 2001 10:23:57 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail5.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 17 Nov 2001 10:23:57 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <200111170300.fAH30dv75857@apollo.backplane.com> Date: Sat, 17 Nov 2001 02:23:56 -0800 (PST) From: John Baldwin To: Matthew Dillon Subject: Re: Need review - patch for socket locking and ref counting Cc: freebsd-arch@FreeBSD.ORG, Peter Wemm Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 17-Nov-01 Matthew Dillon wrote: > I've thought about it a bit and I've come to the conclusion that > we should *not* have multiple mutex pools. > > The single pool we have works wonderfully for interlock operations. > For example, the interlocks used inside the sxlock structure and code, > and inside the lockmgr structure and code (the lockmgr previously used > its own hacked up pool for its interlock). The pool effectively cuts > the size and overhead of higher level structures - such as sxlocks - down > considerably. They've added 4 new lock order reversals to my boot messages. For that we need a pool of mutexes with MTX_NOWITNESS. However, MTX_NOWITNESS is not appropriate for locks outside of sx and lockmgr backing locks. 
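For readers following along, the arrangement being argued about (an sx-style lock that borrows its interlock from a shared pool instead of embedding a mutex) might look roughly like the userland sketch below. Everything here is an assumption made for illustration: struct sx_lite, the pool size, the try-only operations and pthread mutexes are stand-ins, not the kernel's sx(9) code. It also shows why the interlock ends up being taken both with and without the enclosing lock held, which is the pattern that confuses lock-order checking.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define POOL_SIZE 16			/* hypothetical; power of two */

static pthread_mutex_t pool[POOL_SIZE];	/* shared interlocks */

/* No embedded mutex: the interlock is looked up by address. */
struct sx_lite {
	int sx_cnt;	/* >0 shared holders, -1 exclusive, 0 free */
};

static void
pool_init(void)
{
	for (int i = 0; i < POOL_SIZE; i++)
		pthread_mutex_init(&pool[i], NULL);
}

static pthread_mutex_t *
pool_find(const void *ptr)
{
	uintptr_t p = (uintptr_t)ptr;

	return (&pool[(p ^ (p >> 6)) & (POOL_SIZE - 1)]);
}

/*
 * Apply one state transition under the pool interlock: if the count
 * currently satisfies (lo <= cnt <= hi), replace it with next(cnt).
 */
static bool
sx_step(struct sx_lite *sx, int lo, int hi, int delta, int set, bool use_set)
{
	pthread_mutex_t *il = pool_find(sx);
	bool ok;

	pthread_mutex_lock(il);
	ok = (sx->sx_cnt >= lo && sx->sx_cnt <= hi);
	if (ok)
		sx->sx_cnt = use_set ? set : sx->sx_cnt + delta;
	pthread_mutex_unlock(il);
	return (ok);
}

/* Shared lock: allowed unless held exclusively. */
static bool sx_try_slock(struct sx_lite *sx)
{ return (sx_step(sx, 0, INT32_MAX - 1, 1, 0, false)); }

/* Exclusive lock: allowed only when free. */
static bool sx_try_xlock(struct sx_lite *sx)
{ return (sx_step(sx, 0, 0, 0, -1, true)); }

/* Upgrade: allowed only for the sole shared holder (taken with sx held). */
static bool sx_try_upgrade(struct sx_lite *sx)
{ return (sx_step(sx, 1, 1, 0, -1, true)); }

/* Downgrade: exclusive holder becomes a single shared holder. */
static void sx_downgrade(struct sx_lite *sx)
{ (void)sx_step(sx, -1, -1, 0, 1, true); }

/* Release either kind of hold. */
static void
sx_unlock(struct sx_lite *sx)
{
	if (!sx_step(sx, -1, -1, 0, 0, true))	/* exclusive -> free */
		(void)sx_step(sx, 1, INT32_MAX, -1, 0, false);
}
```

In this sketch the upgrade and downgrade operations are each a single transition on the existing counter and add no fields to the structure; whether that holds for the real implementation is a question for the code itself.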
> But our ability to use pools for higher level constructs, like the > sxlocks > themselves, is severely limited. My attempts so far have only resulted > in more obfuscated code. ??? If you want an sx lock pool, it would be just as simple as the mtx pool you have now, just s/mtx/sx/, and thus sx_pool_slock, etc. Not that complicated. Not sure it is all that useful either though. > I think the pool implementation should be left as it is and used ONLY > for interlocks and 'leaf' locks, as I originally designed it. Adding > multiple pools (and the allocation / freeing / management headaches > that go along with that) will only create a mess. I don't think it's > even possible to use a pool of sx locks safely, for example, even with > the multiple pool concept. Errr, it's all of two extra functions and one extra parameter to the others. This should not be difficult. > The current pool code is nice because it simplifies our code base > somewhat rather than making it more complex. I see absolutely no need > for a multiple-pool mechanism at this time. Are you planning to turn on MTX_NOWITNESS then and then be forced not to use pool locks for anything besides sx and lockmgr backing locks since they won't have WITNESS checks performed for them? Different types of locks have different types of requirements. > For similar reasons I believe we should also simplify the APIs to > other low level constructs. I would like to simplify the SX lock > API (get rid of sx_tryupgrade() and sx_downgrade()), and I would > like to see a more simplified structure if possible in order to > make SX locks more useful as embedded entities in higher level system > structures such as TCP sockets or PCBs. Err, the try_upgrade and downgrade are trivial and add nothing to the sx lock structure itself. They were also specifically requested for use in porting XFS to FreeBSD and are useful in other areas such as Brian's changes to make vm_maps use sx locks instead of lockmgr locks. 
We can always optimize the locks later; it is more important right now to actually put locks in places so that actual multithreading can occur. -- John Baldwin <>< http://www.FreeBSD.org/~jhb/ "Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Sat Nov 17 5:24:24 2001 Delivered-To: freebsd-arch@freebsd.org Received: from harrier.prod.itd.earthlink.net (harrier.mail.pas.earthlink.net [207.217.120.12]) by hub.freebsd.org (Postfix) with ESMTP id 97F8F37B416; Sat, 17 Nov 2001 05:24:22 -0800 (PST) Received: from dialup-209.245.142.3.dial1.sanjose1.level3.net ([209.245.142.3] helo=mindspring.com) by harrier.prod.itd.earthlink.net with esmtp (Exim 3.33 #1) id 1655SC-0004ee-00; Sat, 17 Nov 2001 05:24:16 -0800 Message-ID: <3BF6652F.FC50C99A@mindspring.com> Date: Sat, 17 Nov 2001 05:25:03 -0800 From: Terry Lambert Reply-To: tlambert2@mindspring.com X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony} (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Matthew Dillon Cc: John Baldwin , Peter Wemm , freebsd-arch@FreeBSD.ORG Subject: Re: Need review - patch for socket locking and ref counting References: <200111170300.fAH30dv75857@apollo.backplane.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Matthew Dillon wrote: > > I've thought about it a bit and I've come to the conclusion that > we should *not* have multiple mutex pools. It's pretty obvious even under casual thought that the deadlock avoidance can't work correctly in this scenario, so you MUST limit yourself to last acquisition. By the same token, it does not make sense to permit recursion on such mutexes. 
-- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Sat Nov 17 8:40: 9 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail6.speakeasy.net (mail6.speakeasy.net [216.254.0.206]) by hub.freebsd.org (Postfix) with ESMTP id 4FD4537B416 for ; Sat, 17 Nov 2001 08:40:07 -0800 (PST) Received: (qmail 26684 invoked from network); 17 Nov 2001 16:39:40 -0000 Received: from unknown (HELO laptop.baldwin.cx) ([64.81.54.73]) (envelope-sender ) by mail6.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 17 Nov 2001 16:39:40 -0000 Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <3BF6652F.FC50C99A@mindspring.com> Date: Sat, 17 Nov 2001 08:40:06 -0800 (PST) From: John Baldwin To: Terry Lambert Subject: Re: Need review - patch for socket locking and ref counting Cc: freebsd-arch@FreeBSD.ORG, Peter Wemm , Matthew Dillon Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 17-Nov-01 Terry Lambert wrote: > Matthew Dillon wrote: >> >> I've thought about it a bit and I've come to the conclusion that >> we should *not* have multiple mutex pools. > > It's pretty obvious even under casual thought that the deadlock > avoidance can't work correctly in this scenario, so you MUST > limit yourself to last acquisition. Err, witness doesn't do deadlock avoidance; it just checks lock orders. However, the problem is that the order of a larger lock (reader writer lock) is being compared with those of its components. Obviously one is going to acquire the lock used to implement a reader/writer lock both while holding and not holding the reader/writer lock. 
Witness cannot efficiently handle this, so instead we disable witness checks on the component locks. > By the same token, it does not make sense to permit recursion on > such mutexes. Err, we don't on most mutexes. -- John Baldwin <>< http://www.FreeBSD.org/~jhb/ "Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Sat Nov 17 10:30:20 2001 Delivered-To: freebsd-arch@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id D27D637B417; Sat, 17 Nov 2001 10:30:12 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.11.6/8.9.1) id fAHIUBu80966; Sat, 17 Nov 2001 10:30:11 -0800 (PST) (envelope-from dillon) Date: Sat, 17 Nov 2001 10:30:11 -0800 (PST) From: Matthew Dillon Message-Id: <200111171830.fAHIUBu80966@apollo.backplane.com> To: John Baldwin Cc: freebsd-arch@FreeBSD.ORG, Peter Wemm Subject: Re: Need review - patch for socket locking and ref counting References: Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG :> I think the pool implementation should be left as it is and used ONLY :> for interlocks and 'leaf' locks, as I originally designed it. Adding :> multiple pools (and the allocation / freeing / management headaches :> that go along with that) will only create a mess. I don't think it's :> even possible to use a pool of sx locks safely, for example, even with :> the multiple pool concept. : :Errr, it's all of two extra functions and one extra parameter to the others. :This should not be difficult. Difficulty isn't the problem. Confusion and Mess are the problems. : :> The current pool code is nice because it simplifies our code base :> somewhat rather than making it more complex. 
I see absolutely no need :> for a multiple-pool mechanism at this time. : :Are you planning to turn on MTX_NOWITNESS then and then be forced not to use :pool locks for anything besides sx and lockmgr backing locks since they won't :have WITNESS checks performed for them? Different types of locks have :different types of requirements. I'll turn on MTX_NOWITNESS. Again. Difficulty is not the problem here. Confusion and Mess are the problems. It is not necessarily a good idea to take every locking API we have and give each one dozens of features and capabilities that go mostly unused. :> For similar reasons I believe we should also simplify the APIs to :> other low level constructs. I would like to simplify the SX lock :> API (get rid of sx_tryupgrade() and sx_downgrade()), and I would :> like to see a more simplified structure if possible in order to :> make SX locks more useful as embedded entities in higher level system :> structures such as TCP sockets or PCBs. : :Err, the try_upgrade and downgrade are trivial and add nothing to the sx lock :structure itself. They were also specifically requested for use in porting XFS :to FreeBSD and are useful in other areas such as Brian's changes to make I don't think it's worth it just for XFS, :vm_maps use sx locks instead of lockmgr locks. We can always optimize the :locks later; it is more important right now to actually put locks in places so :that actual multithreading can occur. I don't see it as being necessary for VM maps. Since interrupts are in their own threads, VM maps can probably do away with much of the junk they needed for -stable. -Matt Matthew Dillon :John Baldwin <>< http://www.FreeBSD.org/~jhb/ :"Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message