From: John Baldwin
To: David Xu
Cc: arch@freebsd.org, Sergey Babkin
Date: Fri, 17 Dec 2010 07:52:06 -0500
Subject: Re: Realtime thread priorities
Message-Id: <201012170752.06540.jhb@freebsd.org>
In-Reply-To: <4D0AC3EC.1040701@freebsd.org>
List-Id: Discussion related to FreeBSD architecture

On Thursday,
December 16, 2010 8:59:08 pm David Xu wrote:
> John Baldwin wrote:
> > On Wednesday, December 15, 2010 11:16:53 pm David Xu wrote:
> >> John Baldwin wrote:
> >>> On Tuesday, December 14, 2010 8:40:12 pm David Xu wrote:
> >>>> John Baldwin wrote:
> >>>>> On Monday, December 13, 2010 8:30:24 pm David Xu wrote:
> >>>>>> John Baldwin wrote:
> >>>>>>> On Sunday, December 12, 2010 3:06:20 pm Sergey Babkin wrote:
> >>>>>>>> John Baldwin wrote:
> >>>>>>>>> The current layout breaks up the global thread priority space (0 - 255)
> >>>>>>>>> into a couple of bands:
> >>>>>>>>>
> >>>>>>>>>   0 -  63 : interrupt threads
> >>>>>>>>>  64 - 127 : kernel sleep priorities (PSOCK, etc.)
> >>>>>>>>> 128 - 159 : real-time user threads (rtprio)
> >>>>>>>>> 160 - 223 : time-sharing user threads
> >>>>>>>>> 224 - 255 : idle threads (idprio and kernel idle procs)
> >>>>>>>>>
> >>>>>>>>> If we decide to change the behavior I see two possible fixes:
> >>>>>>>>>
> >>>>>>>>> 1) (easy) just move the real-time priority range above the kernel sleep
> >>>>>>>>> priority range
> >>>>>>>> Would not this cause a priority inversion when an RT process
> >>>>>>>> enters the kernel mode?
> >>>>>>> How so? Note that timesharing threads are not "bumped" to a kernel sleep
> >>>>>>> priority when they enter the kernel either. The kernel sleep priorities are
> >>>>>>> purely a way for certain sleep channels to cause a thread to be treated as
> >>>>>>> interactive and give it a priority boost to favor interactive threads.
> >>>>>>> Threads in the kernel do not automatically have higher priority than threads
> >>>>>>> not in the kernel. Keep in mind that all stopped threads (threads not
> >>>>>>> executing) are always in the kernel when they stop.
> >>>>>> I have a requirement that a thread running in the kernel has higher
> >>>>>> priority than a thread running userland code. Because our kernel
> >>>>>> mutex is not sleepable, unlike the one in Solaris, I have to use
> >>>>>> semaphore-like code in kern_umtx.c to lock a chain, which allows me
> >>>>>> to read and write user address space; this is what umtxq_busy() does.
> >>>>>> But it does not prevent a userland thread from preempting a thread
> >>>>>> which has locked the chain. If a realtime thread preempts a thread
> >>>>>> that has locked the chain, it may lock up whole processes using pthread.
> >>>>>> I think our realtime scheduling is not very useful; it is too easy
> >>>>>> to lock up the system.
> >>>>> Users are not forced to use rtprio. They choose to do so, and they have to
> >>>>> be root to enable it (either directly or by extending root privileges via
> >>>>> sudo or some such). Just because you don't have a use case for it doesn't
> >>>>> mean that other people do not. Right now there is no way possible to say
> >>>>> that a given userland process is more important than 'sshd' (or any other
> >>>>> daemon) blocked in poll/select/kevent waiting for a packet. However, there
> >>>>> are use cases where other long-running userland processes are in fact far
> >>>>> more important than sshd (or similar processes such as getty, etc.).
> >>>>>
> >>>> You still haven't answered how to avoid a time-sharing thread that holds
> >>>> a critical kernel resource being preempted by a user RT thread. Later
> >>>> the RT thread requires the resource, but the time-sharing thread has no
> >>>> chance to run because another RT thread is dominating the CPU with
> >>>> CPU-bound work; the result is deadlock. Even if you trust your own RT
> >>>> process, there is much code which was [not] written by you, i.e. libc
> >>>> and any other libraries using threading, that is completely not ready
> >>>> for RT use.
> >>>> However, letting a thread in the kernel have higher priority than a
> >>>> thread running userland code will fix such a deadlock in the kernel.
> >>> Put another way, the time-sharing thread that I don't care about (sshd, or
> >>> some other monitoring daemon, etc.) is stealing a resource I care about
> >>> (time, in the form of CPU cycles) from my RT process that is critical to
> >>> getting my work done.
> >>>
> >>> Beyond that a few more points:
> >>>
> >>> - You are ignoring "tools, not policy". You don't know what is in my binary
> >>>   (and I can't really tell you). Assume for a minute that I'm not completely
> >>>   dumb and can write userland code that is safe to run at this high of a
> >>>   priority level. You already trust me to write code in the kernel that runs
> >>>   at even higher priority now. :)
> >>> - You repeatedly keep missing (ignoring?) the fact that this is _optional_.
> >>>   Users have to intentionally decide to enable this, and there are users who
> >>>   do _need_ this functionality.
> >>> - You have also missed that this has always been true for idprio processes
> >>>   (and is in fact why we restrict idprio to root), so this is not "new".
> >>> - Finally, you also are missing that this can already happen _now_ for plain
> >>>   old time sharing processes if the thread holding the resource doesn't ever
> >>>   do a sleep that raises the priority.
> >>>
> >>> For example, if a time-sharing thread with some typical priority >=
> >>> PRI_MIN_TIMESHARE calls write(2) on a file, it can lock the vnode lock for
> >>> that file (if it is unlocked) and hold that lock while its priority is >=
> >>> PRI_MIN_TIMESHARE. If an interrupt arrives for a network packet that wakes
> >>> up sshd for a new SSH connection, the interrupt thread will preempt the
> >>> thread holding the vnode lock, and sshd will be executed instead of the
> >>> thread holding the vnode lock when the ithread finishes.
> >>> If sshd needs the vnode lock that the original thread holds, then sshd will
> >>> block until the original thread is rescheduled due to the random fates of
> >>> time and releases the vnode lock.
> >>>
> >>> In summary, the kernel sleep priorities do _not_ serve to prevent all
> >>> priority inversions; what they do accomplish is giving preferential treatment
> >>> to idle, "interactive" threads.
> >>>
> >>> A bit more information on my use case btw:
> >>>
> >>> My RT processes are each assigned a _dedicated_ CPU via cpuset (we remove the
> >>> CPU from the global cpuset and ensure no interrupts are routed to that CPU).
> >>> The problem I have is that if my RT process blocks on a lock (e.g. a lock on a
> >>> VM object during a page fault), then I want the RT thread to lend its RT
> >>> priority to the thread that holds the lock over on another CPU so that the lock
> >>> can be released as quickly as possible. This use case is perfectly safe (the
> >>> RT thread is not preempting other threads, instead other threads are partitioned
> >>> off into a separate set of available CPUs). What I need is to ensure that the
> >>> syncer or pagedaemon or whoever holds the lock I need gets a chance to run right
> >>> away when it holds a lock that I need.
> >>>
> >> What I meant is that whenever a thread is in kernel mode, it always has
> >> higher priority than a thread running user code, and all threads in kernel
> >> mode may have the same priority except interrupt threads, which have
> >> higher priority. But this should be carefully designed to use mutexes
> >> and spin locks between interrupt threads and other threads: a mutex
> >> uses a turnstile to propagate priority, and a spin lock disables
> >> interrupts. Otherwise there is still priority inversion in the kernel,
> >> e.g. with rwlock and sx lock.
> >
> > Except that this isn't really true. Really, if a thread is asleep in
> > select() or poll() or kevent(), what critical resource is it holding?
> > I had the same view originally when the current set of priorities was set up.
> > However, I've had to change it since I now have a real-world use case for
> > rtprio.
> >
> > First, I think this is the easy part of the argument: Can you agree that if
> > a RT process is in the kernel, it should have priority over a TS process in
> > the kernel? Thus, if a RT process blocks in the kernel, it would need to
> > lend enough of a priority to the lock holder to preempt any TS process in the
> > kernel, yes? If so, that argues for RT processes in the kernel having a
> > higher priority than all the other kernel sleep priorities.
> >
>
> Yes, RT processes should preempt any TS, but how can you lend priority
> for lockmgr and sx locks and all locking based on msleep() and wakeup()?
> That's why I am trying to fix it: they have priority inversion. To fix
> the problem, semantics like a POSIX priority-protect mutex are needed:
> when a lock is locked, the thread needs to raise its priority high enough
> to protect against priority inversion, and when a thread tries to lock a
> lock with a lower priority ceiling, it should abort. This means lock order
> reversal? The kernel may panic for correctness.
> The consequences of priority inversion depend on the application; they may
> be dangerous or trivial, but it is not correct.

Yes, we do not do priority lending for sleep locks, and to date we never have.
This is not a new problem and moving RT priority higher is not introducing any
_new_ problems. However, it does bring _new_ functionality that some people
need. Just because you don't need it doesn't mean it isn't important. Don't
let the perfect be the enemy of the good.

-- 
John Baldwin
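
For anyone reading this in the archive, the band layout quoted at the top of
the thread can be sketched in C as below. This is only an illustration under
stated assumptions: the enum and the function names pri_classify() and
pri_outranks() are hypothetical and are not taken from the FreeBSD sources;
only the band boundaries come from the thread. It relies on the fact that a
numerically lower priority value is scheduled first.

```c
#include <assert.h>

/* Bands of the global priority space (0 - 255) as listed in the thread. */
enum pri_band {
	BAND_INTERRUPT,		/*   0 -  63 : interrupt threads */
	BAND_KERN_SLEEP,	/*  64 - 127 : kernel sleep priorities */
	BAND_REALTIME,		/* 128 - 159 : real-time user threads */
	BAND_TIMESHARE,		/* 160 - 223 : time-sharing user threads */
	BAND_IDLE		/* 224 - 255 : idle threads */
};

/* Hypothetical helper: map a global priority to its band (current layout). */
enum pri_band
pri_classify(int pri)
{
	assert(pri >= 0 && pri <= 255);
	if (pri <= 63)
		return (BAND_INTERRUPT);
	if (pri <= 127)
		return (BAND_KERN_SLEEP);
	if (pri <= 159)
		return (BAND_REALTIME);
	if (pri <= 223)
		return (BAND_TIMESHARE);
	return (BAND_IDLE);
}

/* Hypothetical helper: nonzero if priority a is scheduled ahead of b. */
int
pri_outranks(int a, int b)
{
	/* Lower numeric value means higher priority. */
	return (a < b);
}
```

With this layout, pri_outranks(88, 128) is nonzero, where 88 is just an
illustrative value somewhere in the 64 - 127 band: a time-sharing thread that
went to sleep at a kernel sleep priority is preferred over the most favored
rtprio thread at 128. That ordering is what moving the real-time range above
the kernel sleep range would change.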