Date:      Fri, 17 Dec 2010 14:20:47 +0800
From:      David Xu <davidxu@freebsd.org>
To:        Julian Elischer <julian@freebsd.org>
Cc:        arch@freebsd.org, Sergey Babkin <babkin@verizon.net>
Subject:   Re: Realtime thread priorities
Message-ID:  <4D0B013F.3060203@freebsd.org>
In-Reply-To: <4D0A54A8.90901@freebsd.org>
References:  <201012101050.45214.jhb@freebsd.org>	<201012150938.44217.jhb@freebsd.org>	<4D0992B5.7060005@freebsd.org> <201012160940.58116.jhb@freebsd.org> <4D0A54A8.90901@freebsd.org>

Julian Elischer wrote:
> On 12/16/10 6:40 AM, John Baldwin wrote:
>> On Wednesday, December 15, 2010 11:16:53 pm David Xu wrote:
>>> John Baldwin wrote:
>>>> On Tuesday, December 14, 2010 8:40:12 pm David Xu wrote:
>>>>> John Baldwin wrote:
>>>>>> On Monday, December 13, 2010 8:30:24 pm David Xu wrote:
>>>>>>> John Baldwin wrote:
>>>>>>>> On Sunday, December 12, 2010 3:06:20 pm Sergey Babkin wrote:
>>>>>>>>> John Baldwin wrote:
>>>>>>>>>> The current layout breaks up the global thread priority space
>>>>>>>>>> (0 - 255) into a couple of bands:
>>>>>>>>>>
>>>>>>>>>>   0 -  63 : interrupt threads
>>>>>>>>>>  64 - 127 : kernel sleep priorities (PSOCK, etc.)
>>>>>>>>>> 128 - 159 : real-time user threads (rtprio)
>>>>>>>>>> 160 - 223 : time-sharing user threads
>>>>>>>>>> 224 - 255 : idle threads (idprio and kernel idle procs)
>>>>>>>>>>
>>>>>>>>>> If we decide to change the behavior I see two possible fixes:
>>>>>>>>>>
>>>>>>>>>> 1) (easy) just move the real-time priority range above the 
>>>>>>>>>> kernel sleep
>>>>>>>>>> priority range
>>>>>>>>> Would not this cause a priority inversion when an RT process
>>>>>>>>> enters the kernel mode?
>>>>>>>> How so?  Note that timesharing threads are not "bumped" to a 
>>>>>>>> kernel sleep
>>>>>>>> priority when they enter the kernel either.  The kernel sleep 
>>>>>>>> priorities are
>>>>>>>> purely a way for certain sleep channels to cause a thread to be 
>>>>>>>> treated as
>>>>>>>> interactive and give it a priority boost to favor interactive 
>>>>>>>> threads.
>>>>>>>> Threads in the kernel do not automatically have higher priority 
>>>>>>>> than threads
>>>>>>>> not in the kernel.  Keep in mind that all stopped threads 
>>>>>>>> (threads not
>>>>>>>> executing) are always in the kernel when they stop.
>>>>>>> I have requirement to make a thread running in kernel has more 
>>>>>>> higher
>>>>>>> priority over a thread running userland code, because our kernel
>>>>>>> mutex is not sleepable which does not like Solaris did, I have to 
>>>>>>> use
>>>>>>> semaphore like code in kern_umtx.c to lock a chain, which allows me
>>>>>>> to read and write user address space, this is how umtxq_busy() did,
>>>>>>> but it does not prevent a userland thread from preempting a thread
>>>>>>> which locked the chain, if a realtime thread preempts a thread
>>>>>>> locked the chain, it may lock up whole processes using pthread.
>>>>>>> I think our realtime scheduling is not very useful, it is too easy
>>>>>>> to lock up system.
>>>>>> Users are not forced to use rtprio.  They choose to do so, and 
>>>>>> they have to
>>>>>> be root to enable it (either directly or by extending root 
>>>>>> privileges via
>>>>>> sudo or some such).  Just because you don't have a use case for it 
>>>>>> doesn't
>>>>>> mean that other people do not.  Right now there is no way possible 
>>>>>> to say
>>>>>> that a given userland process is more important than 'sshd' (or 
>>>>>> any other
>>>>>> daemon) blocked in poll/select/kevent waiting for a packet.  
>>>>>> However, there
>>>>>> are use cases where other long-running userland processes are in 
>>>>>> fact far
>>>>>> more important than sshd (or similar processes such as getty, etc.).
>>>>>>
>>>>> You still don't answer me about how to avoid a time-sharing thread
>>>>> holding a critical kernel resource which preempted by a user RT 
>>>>> thread,
>>>>> and later the RT thread requires the resource, but the time-sharing
>>>>> thread has no chance to run because another RT thread is dominating
>>>>> the CPU because it is doing CPU bound work, result is deadlock, 
>>>>> even if
>>>>> you know you trust your RT process, there are many code which were
>>>>> written by you, i.e the libc and any other libraries using threading
>>>>> are completely not ready for RT use.
>>>>> How ever let a thread in kernel have higher priority over a thread
>>>>> running userland code will fix such a deadlock in kernel.
>>>> Put another way, the time-sharing thread that I don't care about 
>>>> (sshd, or
>>>> some other monitoring daemon, etc.) is stealing a resource I care about
>>>> (time, in the form of CPU cycles) from my RT process that is 
>>>> critical to
>>>> getting my work done.
>>>>
>>>> Beyond that a few more points:
>>>>
>>>> - You are ignoring "tools, not policy".  You don't know what is in 
>>>> my binary
>>>>    (and I can't really tell you).  Assume for a minute that I'm not 
>>>> completely
>>>>    dumb and can write userland code that is safe to run at this high 
>>>> of a
>>>>    priority level.  You already trust me to write code in the kernel 
>>>> that runs
>>>>    at even higher priority now. :)
>>>> - You repeatedly keep missing (ignoring?) the fact that this is 
>>>> _optional_.
>>>>    Users have to intentionally decide to enable this, and there are 
>>>> users who
>>>>    do _need_ this functionality.
>>>> - You have also missed that this has always been true for idprio 
>>>> processes
>>>>    (and is in fact why we restrict idprio to root), so this is not 
>>>> "new".
>>>> - Finally, you also are missing that this can already happen _now_ 
>>>> for plain
>>>>    old time sharing processes if the thread holding the resource 
>>>> doesn't ever
>>>>    do a sleep that raises the priority.
>>>>
>>>> For example, if a time-sharing thread with some typical priority >=
>>>> PRI_MIN_TIMESHARE calls write(2) on a file, it can lock the vnode 
>>>> lock for
>>>> that file (if it is unlocked) and hold that lock while its priority 
>>>> is >=
>>>> PRI_MIN_TIMESHARE.  If an interrupt arrives for a network packet 
>>>> that wakes
>>>> up sshd for a new SSH connection, the interrupt thread will preempt the
>>>> thread holding the vnode lock, and sshd will be executed instead of the
>>>> thread holding the vnode lock when the ithread finishes.  If sshd 
>>>> needs the
>>>> vnode lock that the original thread holds, then sshd will block 
>>>> until the
>>>> original thread is rescheduled due to the random fates of time and 
>>>> releases
>>>> the vnode lock.
>>>>
>>>> In summary, the kernel sleep priorities do _not_ serve to prevent all
>>>> priority inversions, what they do accomplish is giving preferential 
>>>> treatment
>>>> to idle, "interactive" threads.
>>>>
>>>> A bit more information on my use case btw:
>>>>
>>>> My RT processes are each assigned a _dedicated_ CPU via cpuset (we 
>>>> remove the
>>>> CPU from the global cpuset and ensure no interrupts are routed to 
>>>> that CPU).
>>>> The problem I have is that if my RT process blocks on a lock (e.g. a 
>>>> lock on a
>>>> VM object during a page fault), then I want the RT thread to lend 
>>>> its RT
>>>> priority to the thread that holds the lock over on another CPU so 
>>>> that the lock
>>>> can be released as quickly as possible.  This use case is perfectly 
>>>> safe (the
>>>> RT thread is not preempting other threads, instead other threads are 
>>>> partitioned
>>>> off into a separate set of available CPUs).  What I need is to 
>>>> ensure that the
>>>> syncer or pagedaemon or whoever holds the lock I need gets a chance 
>>>> to run right
>>>> away when it holds a lock that I need.
>>>>
>>> What I meant is that whenever thread is in kernel mode, it always has
>>> higher priority over thread running user code, and all threads in kernel
>>> mode may have same priority except those interrupt threads which
>>> has higher priority, but this should be carefully designed to use
>>> mutex and spinlock between interrupt threads and other threads,
>>> mutex uses turnstile to propagate priority, spin lock disables
>>> interrupt, otherwise there still is priority inversion in kernel, i.e
>>> rwlock, sx lock.
>> Except that this isn't really true.  Really, if a thread is asleep in
>> select() or poll() or kevent(), what critical resource is it holding?  
>> I had
>> the same view originally when the current set of priorities were set up.
>> However, I've had to change it since I now have a real-world use case for
>> rtprio.
>>
>> First, I think this is the easy part of the argument:  Can you agree 
>> that if
>> a RT process is in the kernel, it should have priority over a TS 
>> process in
>> the kernel?  Thus, if a RT process blocks in the kernel, it would need to
>> lend enough of a priority to the lock holder to preempt any TS process 
>> in the
>> kernel, yes?  If so, that argues for RT processes in the kernel having a
>> higher priority than all the other kernel sleep priorities.
>>
>> The second part is harder, and that is what happens when a RT process 
>> is in
>> userland.  First, some food for thought.  Do you realize that 
>> currently, the
>> syncer and pagedaemon threads run at PVM?  This is intentional so that 
>> these
>> processes run in the "background" even though they are in the kernel.
>> Specifically, when sshd does wakeup from a sleep at PSOCK or the like, 
>> the
>> kernel doesn't just let it run in the kernel, it effectively lets it keep
>> that PSOCK priority in userland until the next context switch due to an
>> interrupt or the quantum expiring.  This means that when you ssh into 
>> a box,
>> your interactive typing ends up preempting syncer and pagedaemon.  
>> And
>> this is a good thing, because syncer and pagedaemon are _background_
>> processes.  Preempting them only for the kernel portion of sshd (as the
>> change to userret in both your proposal and my original #2 would do) 
>> would
>> not really favor interactive processes because the user relies on the
>> userland portion of an interactive process to run, too (userland is 
>> the part
>> that echoes back the characters as they are typed).  So even now, with TS
>> threads, we have TS userland code that is _more important_ than code 
>> in the
>> kernel.  Another example is the idlezero kernel process.  This is kernel
>> code, but is easily far less important than pretty much all userland 
>> code.
>> Kernel code is _not_ always more important than userland code.  It 
>> often is,
>> but it sometimes isn't.  If you can accept that, then it is no longer 
>> strange
>> to consider that even the userland code in a RT process is more important
>> than kernel code in a TS process.
>>
>> In our case we do chew up a lot of CPU in userland for our RT 
>> processes, but
>> we handle this case by using dedicated CPUs.  Our RT processes really 
>> are the
>> most important processes on the box.
>>
> 
> I have to agree with John on this one..
> The real-time property for threads is a dangerous tool which we allow a
> system "Administrator" (i.e. someone with root) to do some things.
> It is perfectly understood that doing the WRONG thing will negatively
> impact the system (maybe even make it unworkable). However the decision to
> set a process to realtime mode means that the Administrator has decided 
> that
> that process/thread is more important than everything else in the system.
> One could argue about whether this applies to interrupts, but in the 
> modern day
> of even cell phones having multiple processors, it gets harder and harder
> to make the case that userland code should not be able to pre-empt
> or block kernel code.
> 
> I think this philosophy has always been true..  As Terry Lambert used to 
> say
> at the beginning of the project: Unix's job is to deliver the bullet to 
> where-ever the
> user wants to put it, including the user's foot.  When you are the 
> administrator
> you get to have  a pretty big foot.
> 
> In addition many of FreeBSD's 'Users' are in fact producers of 'product' 
> boxes.
> They know EXACTLY what is running on the system, and where, and want the 
> ability
> to label a process in the way that John shows.  For them it is the 
> primary purpose
> of the box to do task X and doing task X comes before all other tasks, 
> possibly even
> non related interrupts.
> 
> Julian
> 

The main problem is correctness, not whether root can use it or not.
I know it is his machine and he can do whatever he wants with it. :-)
I have to repeat:
The question is: can the kernel correctly schedule RT threads? No.
The fact is that many lock semantics are not RT safe: lockmgr, sx lock,
rwlock, and other locks based on msleep/wakeup that do not propagate
priority or do not protect priority are all subject to priority inversion.
Also, PPQ = 4 is incorrect for RT scheduling: with four priority levels
sharing one run queue, threads of different priority are not strictly
ordered, which is another kind of priority inversion.
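
To make the PPQ point concrete, here is a minimal sketch of how a
priority maps onto a run queue. The constants mirror RQ_NQS and RQ_PPQ
as I recall them from sys/runq.h at the time (worth checking against
your own tree); runq_index() and shares_queue() are hypothetical helper
names for illustration, not kernel functions.

```c
#include <assert.h>

/* Assumed to match sys/runq.h of that era: 64 run queues,
 * 4 priority levels folded into each queue. */
#define RQ_NQS 64
#define RQ_PPQ 4

/* Hypothetical helper: index of the run queue a priority falls into. */
static int runq_index(int pri)
{
    assert(pri >= 0 && pri < RQ_NQS * RQ_PPQ);
    return pri / RQ_PPQ;
}

/* Hypothetical helper: nonzero when two priorities land in the same
 * queue, i.e. the run queue itself cannot tell them apart. */
static int shares_queue(int pri_a, int pri_b)
{
    return runq_index(pri_a) == runq_index(pri_b);
}
```

With these values, RT priorities 128 and 131 both map to queue 32, so
a thread at 131 queued first would be picked ahead of one at 128.
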
So what can we do here? Where mutexes and spin locks cannot be used,
the kernel should either raise the thread's priority to a high enough
level or give all threads equal priority while they are in the kernel.
If future changes cannot fix the above problems, those changes
are nonsense.
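
The "raise the priority to a high enough level" option could be
sketched roughly as below. This is a userland simulation under stated
assumptions, not kernel code: struct sim_thread, resource_enter(),
resource_exit(), would_preempt() and the ceiling value are all
hypothetical. The ceiling of 127 is picked only because it sits just
above the 128 - 159 RT band quoted earlier (lower number = higher
priority).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ceiling: one level above the RT band (128 - 159). */
#define HYPOTHETICAL_CEILING_PRI 127

struct sim_thread {
    int base_pri;   /* priority given by the scheduling class */
    int active_pri; /* priority currently in effect */
};

/* Entering a non-propagating resource: boost to the ceiling so that
 * no userland RT thread can preempt the holder. */
static void resource_enter(struct sim_thread *td)
{
    assert(td != NULL);
    if (td->active_pri > HYPOTHETICAL_CEILING_PRI)
        td->active_pri = HYPOTHETICAL_CEILING_PRI;
}

/* Leaving the resource: drop back to the base priority. */
static void resource_exit(struct sim_thread *td)
{
    assert(td != NULL);
    td->active_pri = td->base_pri;
}

/* Nonzero when 'runner' has strictly higher priority than 'holder'. */
static int would_preempt(const struct sim_thread *runner,
                         const struct sim_thread *holder)
{
    return runner->active_pri < holder->active_pri;
}
```

A time-sharing thread at 180 is preemptible by an RT thread at 130
before resource_enter(), but not while it holds the resource.
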


