Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 10 Mar 2008 12:22:54 -1000 (HST)
From:      Jeff Roberson <jroberson@chesapeake.net>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: Getting rid of the static msleep priority boost
Message-ID:  <20080310121527.F1091@desktop>
In-Reply-To: <200803101313.03526.jhb@freebsd.org>
References:  <20080307020626.G920@desktop> <20080307124038.I920@desktop> <20080307234452.U1091@desktop> <200803101313.03526.jhb@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 10 Mar 2008, John Baldwin wrote:

> On Saturday 08 March 2008 04:46:32 am Jeff Roberson wrote:
>> On Fri, 7 Mar 2008, Jeff Roberson wrote:
>>
>>> On Fri, 7 Mar 2008, John Baldwin wrote:
>>>
>>>> On Friday 07 March 2008 08:42:37 am John Baldwin wrote:
>>>>> On Friday 07 March 2008 07:16:30 am Jeff Roberson wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I've been studying some problems with recent scheduler improvements
> that
>>>>>> help a lot on some workloads and hurt on others.  I've tracked the
>>>>>> problem down to static priority boosts handed out by
>>>>>> msleep/cv_broadcastpri.  The basic problem is that a user thread will
> be
>>>>>> woken with a kernel priority thus allowing it to preempt a thread
> running
>>>>>> on any processor with a lesser priority.  The lesser priority thread
> may
>>>>>> in fact hold some resource that the higher priority thread requires.
>>>>>> Thus we context switch several times and perhaps go through priority
>>>>>> propagation as well.
>>>>>>
>>>>>> I have verified that disabling these static priority boosts entirely
>>>>>> fixes the performance problem I've run into on at least one workload.
>>>>>> There are probably others that it helps and hopefully we can discover
>>>>>> that.
>>>>>>
>>>>>> I'd like to know if anyone has a strong preference to keep this
> feature.
>>>>>> It is likely that it helps in some interactive situations.  I'm not
> sure
>>>>>> how much however.  I propose that we make a sysctl that disables it and
>>>>>> turn it off by default.  If we see complaints on current@ we can
> suggest
>>>>>> that they toggle the sysctl to see if it alleviates problems.
>>>>>>
>>>>>> Based on feedback from that experiment and some testing we can then
>>>>>> choose a few options:
>>>>>>
>>>>>> 1)  Disable the static boosts entirely.  Leave kernel priorities for
>>>>>> kernel threads and priority propagation.  Most other kernels do this.
>>>>>> Would make my life in ULE much easier as well.
>>>>>>
>>>>>> 2)  Leave the support for static boosts but remove it from all but a
> few
>>>>>> key locations.  Leaving it in the api would give some flexibility but
>>>>>> might confuse developers.
>>>>>>
>>>>>> 3)  Leave things as they are.  undesirable.
>>>>>>
>>>>>> I'm leaning towards #2 based on the information I have presently.  This
>>>>>> is almost a significant change to historic BSD behavior so we might
> want
>>>>>> to tread lightly.
>>>>>
>>>>> One thing to note is that we actually depend on the priority boost
>>>>> (evilly)
>>>>> to pick processes to swap out.  (I think we check for <= PSOCK and don't
>>>>> swap those out).  One thing that I've wanted to happen for a while is
> that
>>>>> the sleep priority for msleep() just be a parameter available to the
>>>>> scheduler that the scheduler can use to calculate the real internal
>>>>> priority rather than just being a set.  That is, I imagine having:
>>>>>
>>>>> void	sched_set_sleep_prio(struct thread *td, u_char pri);
>>>>> u_char	sched_get_sleep_prio(struct thread *td);
>>>>>
>>>>> (The swap check would use the get call).  The 4BSD scheduler's
>>>>> implementation of sched_set_sleep_prio would look like this:
>>>>>
>>>>> void
>>>>> sched_set_sleep_prio(struct thread *td, u_char pri)
>>>>> {
>>>>>
>>>>> 	td->td_sched->sleep_pri = pri;
>>>>> 	sched_prio(td, pri);
>>>>> }
>>>>>
>>>>> void
>>>>> sched_userret(..)
>>>>> {
>>>>>
>>>>> 	...
>>>>> 	td->td_sched->sleep_pri = 0;	/* not in the kernel anymore */
>>>>> }
>>>>>
>>>>> but other schedulers may just save it and recalculate the priority where
>>>>> the priority calculation just considers the sleep priority as one among
>>>>> many factors.  If nothing else, this allows it to be a scheduler
> decision
>>>>> to ignore it (so 4BSD could continue to do what it does now, but ULE may
>>>>> ignore it, or ignore certain levels, etc.)
>>>>
>>>> One thing to clarify: I'm not opposed to replacing the PSOCK check with
>>>> something more suitable in the swap code, (in fact, that would be
>>>> desirable),
>>>> but it might take a good bit of work to do that and is probably easier to
>>>> work on that as a separate change.  I also think there can be some merit
> in
>>>> having code paths hint to the scheduler the relative
> interactivity/priority
>>>> of a sleep.
>>>
>>> Couple of notes..
>>>
>>> The priority argument to sleep is a reasonable way for the code to hint at
>>> the relative priority/interactivity.  So that argues for leaving these
>>> arguments in place and making them more advisory.  I don't think we have
> to
>>> change the api to take advantage of that.
>>>
>>> I'll look more closely for places like the swap that care about the
> absolute
>>> priority of a process and see what I can come up with.  Thanks for raising
>>> that concern.
>>>
>>> I'd like to avoid apis that require the sched lock in seperate steps like
>>> msleep does now to elevate the priority.  So far all sched* apis require
> the
>>> thread lock on enter and I'd hate to deviate from that norm.  But another
>>> option may be just to make a globally visible td_sleep_pri that doesn't
>>> require the lock for write but does for read.  The other option is to
> bubble
>>> the argument down through the sleepq code and into sched_sleep() and
>>> sched_wakeup().  I like that the best but it's the most api churn.
>>
>> http://people.freebsd.org/~jeff/sleeppri.diff
>>
>> What do you think of this?  I added another parameter to sleepq_add() and
>> sched_sleep().  So the scheduler is responsible for adjusting the
>> priority.  We could do the same thing for wakeup time adjustments like
>> sleepq_broadcastpri() but we'd have to pass it through setrunnable() as
>> well.
>
> The cv_broadcastpri() thing is a hack and I wish there was a better way to do
> it.  I.e., I don't like having wakeup setting the priority at all.  I think
> it's a good idea to pass this to sched_sleep(), but I'd rather leave
> sched_sleep() where it is and pass the prio arg to the sleepq_wait() routines
> instead so you don't get a bump unless you actually sleep.  I think it's
> probably a bug that we bump the prio on threads that may not sleep now.

Ok, I preferred not to move sched_sleep() as well but I also didn't want 
to add those arguments to the stack everywhere.  I'll do that however.

>
>> I'd like to normalize the other pri arguments in sleepq to use the same 0
>> is not set vs -1 that msleep did.  I realize that 0 is a valid priority
>> but for practical purposes this makes things consistent and does not
>> really restrict the api.
>
> Sounds fine to me.  I think we should even formally make 0 an invalid priority
> (via a comment or something).

Ok, I'll consider that.

I'm just going to commit this when it's tested and working.  It's simple 
enough I don't think it warrents further review.

Thanks,
Jeff

>
> -- 
> John Baldwin
>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20080310121527.F1091>