From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 17 12:56:43 2010
From: John Baldwin <jhb@freebsd.org>
To: David Xu <davidxu@freebsd.org>
Date: Fri, 17 Dec 2010 07:52:06 -0500
User-Agent: KMail/1.13.5 (FreeBSD/7.3-CBSD-20101102; KDE/4.4.5; amd64; ; )
References: <201012101050.45214.jhb@freebsd.org>
	<201012160940.58116.jhb@freebsd.org> <4D0AC3EC.1040701@freebsd.org>
In-Reply-To: <4D0AC3EC.1040701@freebsd.org>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201012170752.06540.jhb@freebsd.org>
Cc: arch@freebsd.org, Sergey Babkin <babkin@verizon.net>
Subject: Re: Realtime thread priorities

On Thursday, December 16, 2010 8:59:08 pm David Xu wrote:
> John Baldwin wrote:
> > On Wednesday, December 15, 2010 11:16:53 pm David Xu wrote:
> >> John Baldwin wrote:
> >>> On Tuesday, December 14, 2010 8:40:12 pm David Xu wrote:
> >>>> John Baldwin wrote:
> >>>>> On Monday, December 13, 2010 8:30:24 pm David Xu wrote:
> >>>>>> John Baldwin wrote:
> >>>>>>> On Sunday, December 12, 2010 3:06:20 pm Sergey Babkin wrote:
> >>>>>>>> John Baldwin wrote:
> >>>>>>>>> The current layout breaks up the global thread priority space (0 - 255)
> >>>>>>>>> into a couple of bands:
> >>>>>>>>>
> >>>>>>>>>   0 -  63 : interrupt threads
> >>>>>>>>>  64 - 127 : kernel sleep priorities (PSOCK, etc.)
> >>>>>>>>> 128 - 159 : real-time user threads (rtprio)
> >>>>>>>>> 160 - 223 : time-sharing user threads
> >>>>>>>>> 224 - 255 : idle threads (idprio and kernel idle procs)
> >>>>>>>>>
> >>>>>>>>> If we decide to change the behavior I see two possible fixes:
> >>>>>>>>>
> >>>>>>>>> 1) (easy) just move the real-time priority range above the kernel sleep
> >>>>>>>>> priority range
> >>>>>>>> Would not this cause a priority inversion when an RT process
> >>>>>>>> enters the kernel mode?
> >>>>>>> How so?  Note that timesharing threads are not "bumped" to a kernel sleep 
> >>>>>>> priority when they enter the kernel either.  The kernel sleep priorities are 
> >>>>>>> purely a way for certain sleep channels to cause a thread to be treated as 
> >>>>>>> interactive and give it a priority boost to favor interactive threads.  
> >>>>>>> Threads in the kernel do not automatically have higher priority than threads 
> >>>>>>> not in the kernel.  Keep in mind that all stopped threads (threads not 
> >>>>>>> executing) are always in the kernel when they stop.
> >>>>>> I have a requirement that a thread running in the kernel has higher
> >>>>>> priority than a thread running userland code.  Because our kernel
> >>>>>> mutex is not sleepable, unlike the one in Solaris, I have to use
> >>>>>> semaphore-like code in kern_umtx.c to lock a chain, which allows me
> >>>>>> to read and write user address space; this is what umtxq_busy() does.
> >>>>>> But it does not prevent a userland thread from preempting a thread
> >>>>>> that has locked the chain.  If a realtime thread preempts a thread
> >>>>>> that has locked the chain, it may lock up whole processes using pthread.
> >>>>>> I think our realtime scheduling is not very useful; it is too easy
> >>>>>> to lock up the system.
> >>>>> Users are not forced to use rtprio.  They choose to do so, and they have to
> >>>>> be root to enable it (either directly or by extending root privileges via
> >>>>> sudo or some such).  Just because you don't have a use case for it doesn't
> >>>>> mean that other people do not.  Right now there is no way possible to say
> >>>>> that a given userland process is more important than 'sshd' (or any other
> >>>>> daemon) blocked in poll/select/kevent waiting for a packet.  However, there
> >>>>> are use cases where other long-running userland processes are in fact far
> >>>>> more important than sshd (or similar processes such as getty, etc.).
> >>>>>
> >>>> You still haven't answered how to avoid the case where a time-sharing
> >>>> thread holding a critical kernel resource is preempted by a user RT
> >>>> thread, and later the RT thread requires the resource, but the
> >>>> time-sharing thread has no chance to run because another RT thread is
> >>>> dominating the CPU with CPU-bound work.  The result is deadlock.  Even
> >>>> if you trust your RT process, there is a lot of code that was not
> >>>> written by you, i.e. libc and any other libraries using threading,
> >>>> and it is not at all ready for RT use.
> >>>> However, letting a thread in the kernel have higher priority than a
> >>>> thread running userland code will fix such a deadlock in the kernel.
> >>> Put another way, the time-sharing thread that I don't care about (sshd, or
> >>> some other monitoring daemon, etc.) is stealing a resource I care about
> >>> (time, in the form of CPU cycles) from my RT process that is critical to
> >>> getting my work done.
> >>>
> >>> Beyond that a few more points:
> >>>
> >>> - You are ignoring "tools, not policy".  You don't know what is in my binary
> >>>   (and I can't really tell you).  Assume for a minute that I'm not completely
> >>>   dumb and can write userland code that is safe to run at this high of a
> >>>   priority level.  You already trust me to write code in the kernel that runs
> >>>   at even higher priority now. :)
> >>> - You repeatedly keep missing (ignoring?) the fact that this is _optional_.
> >>>   Users have to intentionally decide to enable this, and there are users who
> >>>   do _need_ this functionality.
> >>> - You have also missed that this has always been true for idprio processes
> >>>   (and is in fact why we restrict idprio to root), so this is not "new".
> >>> - Finally, you also are missing that this can already happen _now_ for plain
> >>>   old time sharing processes if the thread holding the resource doesn't ever
> >>>   do a sleep that raises the priority.
> >>>
> >>> For example, if a time-sharing thread with some typical priority >=
> >>> PRI_MIN_TIMESHARE calls write(2) on a file, it can lock the vnode lock for
> >>> that file (if it is unlocked) and hold that lock while its priority is >=
> >>> PRI_MIN_TIMESHARE.  If an interrupt arrives for a network packet that wakes
> >>> up sshd for a new SSH connection, the interrupt thread will preempt the
> >>> thread holding the vnode lock, and sshd will be executed instead of the
> >>> thread holding the vnode lock when the ithread finishes.  If sshd needs the
> >>> vnode lock that the original thread holds, then sshd will block until the
> >>> original thread is rescheduled due to the random fates of time and releases
> >>> the vnode lock.
> >>>
> >>> In summary, the kernel sleep priorities do _not_ serve to prevent all
> >>> priority inversions; what they do accomplish is giving preferential treatment
> >>> to idle, "interactive" threads.
> >>>
> >>> A bit more information on my use case btw:
> >>>
> >>> My RT processes are each assigned a _dedicated_ CPU via cpuset (we remove the
> >>> CPU from the global cpuset and ensure no interrupts are routed to that CPU).
> >>> The problem I have is that if my RT process blocks on a lock (e.g. a lock on a
> >>> VM object during a page fault), then I want the RT thread to lend its RT
> >>> priority to the thread that holds the lock over on another CPU so that the lock
> >>> can be released as quickly as possible.  This use case is perfectly safe (the
> >>> RT thread is not preempting other threads, instead other threads are partitioned
> >>> off into a separate set of available CPUs).  What I need is to ensure that the
> >>> syncer or pagedaemon or whoever holds the lock I need gets a chance to run right
> >>> away when it holds a lock that I need.
> >>>
> >> What I meant is that whenever a thread is in kernel mode, it always has
> >> higher priority than a thread running user code, and all threads in kernel
> >> mode may have the same priority except interrupt threads, which
> >> have higher priority.  But this should be carefully designed to use
> >> mutexes and spin locks between interrupt threads and other threads:
> >> a mutex uses a turnstile to propagate priority, and a spin lock disables
> >> interrupts.  Otherwise there is still priority inversion in the kernel,
> >> e.g. with rwlock and sx lock.
> > 
> > Except that this isn't really true.  Really, if a thread is asleep in
> > select() or poll() or kevent(), what critical resource is it holding?  I had
> > the same view originally when the current set of priorities was set up.
> > However, I've had to change it since I now have a real-world use case for
> > rtprio.
> > 
> > First, I think this is the easy part of the argument:  Can you agree that if
> > a RT process is in the kernel, it should have priority over a TS process in
> > the kernel?  Thus, if a RT process blocks in the kernel, it would need to
> > lend enough of a priority to the lock holder to preempt any TS process in the
> > kernel, yes?  If so, that argues for RT processes in the kernel having a
> > higher priority than all the other kernel sleep priorities.
> > 
> 
> Yes, RT processes should preempt any TS process, but how can you lend
> priority for lockmgr and sx locks and all locking based on msleep() and
> wakeup()?  That's why I am trying to fix it: they have priority inversion.
> To fix the problem, semantics like a POSIX priority-protect mutex are
> needed: when a lock is acquired, the thread needs to raise its priority
> high enough to prevent priority inversion, and when a thread tries to
> acquire a lock with a lower priority ceiling, it should abort.  Does this
> mean a lock order reversal?  The kernel may panic for correctness.
> The consequences of priority inversion depend on the application; they
> may be dangerous or trivial, but it is not correct.

Yes, we do not do priority lending for sleep locks, and to date we never
have.  This is not a new problem and moving RT priority higher is not
introducing any _new_ problems.  However, it does bring _new_ functionality
that some people need.  Just because you don't need it doesn't mean it isn't
important.
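
To make the trade-off concrete, here is a toy userland sketch in C.  It is
not kernel code.  The BAND_* boundaries are the ones quoted above.  The
ALT_* values, struct toy_thread, and the three helper functions are
hypothetical names and numbers invented purely for illustration.  It shows
roughly what "move the RT range above the kernel sleep range" would change,
and what priority lending would mean if it were applied to a lock owner.

```c
#include <assert.h>

/* Global priority bands as quoted in this thread; lower value = more urgent. */
enum {
    BAND_ITHD_MIN = 0,    /* interrupt threads:            0 -  63 */
    BAND_KERN_MIN = 64,   /* kernel sleep priorities:     64 - 127 */
    BAND_RT_MIN   = 128,  /* real-time user threads:     128 - 159 */
    BAND_TS_MIN   = 160,  /* time-sharing user threads:  160 - 223 */
    BAND_IDLE_MIN = 224   /* idle threads:               224 - 255 */
};

/*
 * HYPOTHETICAL layout for "move RT above kernel sleep": RT takes 64 - 95
 * and the kernel sleep band shifts to 96 - 159.  These numbers are made up
 * for the example and are not a proposal for the real values.
 */
enum {
    ALT_RT_MIN   = 64,
    ALT_KERN_MIN = 96
};

/* Toy stand-in for a thread; only the global priority matters here. */
struct toy_thread {
    const char *name;
    int pri;              /* 0 (most urgent) .. 255 (least urgent) */
};

/* Map an rtprio level (0..31) into an RT band that starts at rt_min. */
static int
rt_to_global(int level, int rt_min)
{
    assert(level >= 0 && level <= 31);
    return (rt_min + level);
}

/* Nonzero if thread a would be picked to run ahead of thread b. */
static int
runs_before(const struct toy_thread *a, const struct toy_thread *b)
{
    return (a->pri < b->pri);
}

/*
 * Priority lending: a blocked waiter raises the lock owner to the waiter's
 * priority when the waiter is more urgent.  The owner is never lowered.
 */
static void
lend_priority(struct toy_thread *owner, const struct toy_thread *waiter)
{
    if (waiter->pri < owner->pri)
        owner->pri = waiter->pri;
}
```

With the current bands, a thread woken at a kernel sleep priority (say
BAND_KERN_MIN + 10) runs before even the most urgent rtprio thread
(rt_to_global(0, BAND_RT_MIN)).  With the hypothetical ALT_* layout the
comparison flips, and an owner that has been lent the RT thread's priority
also runs ahead of such sleepers.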

Don't let the perfect be the enemy of the good.

-- 
John Baldwin