Date:      Wed, 23 Aug 2017 22:26:36 +0300
From:      Andriy Gapon <avg@FreeBSD.org>
To:        Don Lewis <truckman@FreeBSD.org>, freebsd-arch@FreeBSD.org
Subject:   Re: ULE steal_idle questions
Message-ID:  <d9dae0c1-e718-13fe-b6b5-87160c71784e@FreeBSD.org>
In-Reply-To: <201708231504.v7NF4nYe035934@gw.catspoiler.org>
References:  <201708231504.v7NF4nYe035934@gw.catspoiler.org>

On 23/08/2017 18:04, Don Lewis wrote:
> I've been looking at the steal_idle code in tdq_idled() and found some
> things that puzzle me.
> 
> Consider a machine with three CPUs:
>   A, which is idle
>   B, which is busy running a thread
>   C, which is busy running a thread and has another thread in queue
> It would seem to make sense that the tdq_load values for these three
> CPUs would be 0, 1, and 2 respectively in order to select the best CPU
> to run a new thread.
> 
> If so, then why do we pass thresh=1 to sched_highest() in the code that
> implements steal_idle?  That value is used to set cs_limit which is used
> in this comparison in cpu_search:
>                         if (match & CPU_SEARCH_HIGHEST)
>                                 if (tdq->tdq_load >= hgroup.cs_limit &&
> That would seem to make CPU B a candidate for stealing a thread from.
> Ignoring CPU C for the moment, that shouldn't happen if the thread is
> running, but even if it was possible, it would just make CPU B go idle,
> which isn't terribly helpful in terms of load balancing and would just
> thrash the caches.  The same comparison is repeated in tdq_idled() after
> a candidate CPU has been chosen:
>                 if (steal->tdq_load < thresh || steal->tdq_transferable == 0) {
>                         tdq_unlock_pair(tdq, steal);
>                         continue;
>                 }
> 
> It looks to me like there is an off-by-one error here, and there is a
> similar problem in the code that implements kern.sched.balance.


I agree with your analysis.  I had the same questions as well.
I think that the tdq_transferable check is what saves the code from
running into any problems.  But it indeed would make sense for the code
to understand that tdq_load includes a currently running, never
transferable thread as well.
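
To make that concrete, here is a minimal userland sketch of the two
checks, applied to the three-CPU example above.  It is not the kernel
code: struct model_tdq and the model_* helpers are hypothetical names
made up for illustration, and it assumes that the one queued thread on
CPU C is transferable (i.e. not bound to that CPU).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two struct tdq fields discussed here. */
struct model_tdq {
	int load;		/* like tdq_load: queued threads plus the running one */
	int transferable;	/* like tdq_transferable: queued threads that may migrate */
};

/* Load-only test, same shape as the cs_limit comparison in cpu_search. */
static bool
model_load_passes(const struct model_tdq *q, int thresh)
{
	return (q->load >= thresh);
}

/*
 * Same shape as the re-check in tdq_idled(): the candidate is skipped
 * when its load is below thresh or it has nothing transferable.
 */
static bool
model_can_steal_from(const struct model_tdq *q, int thresh)
{
	/* Model assumption: transferable threads are counted in load too. */
	assert(q->transferable <= q->load);
	return (!(q->load < thresh || q->transferable == 0));
}

/* A is idle, B runs one thread, C runs one thread and has one queued. */
static const struct model_tdq cpu_a = { .load = 0, .transferable = 0 };
static const struct model_tdq cpu_b = { .load = 1, .transferable = 0 };
static const struct model_tdq cpu_c = { .load = 2, .transferable = 1 };
```

In this model, with thresh=1 CPU B passes the load-only test but is
still rejected by the combined check because it has nothing
transferable; with thresh=2 it is rejected by the load test alone.
CPU C is accepted in both cases.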

> The reason I ask is that I've been debugging random segfaults and other
> strange errors on my Ryzen machine and the problems mostly go away if I
> either disable kern.sched.steal_idle and kern.sched.balance, or if I
> leave kern.sched.steal_idle enabled and hack the code to change the
> value of thresh from 1 to 2.  See
> <https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029> for the gory
> details.  I don't know if my CPU has what AMD calls the "performance
> marginality issue".

I have been following your experiments and it's interesting that
"massaging" the CPU in certain ways makes it a bit happier.  But
certainly the fault is with the CPU, as the code is trouble-free on many
different architectures, including x86, and on various processors from
both Intel and AMD [with earlier CPU families].


-- 
Andriy Gapon


