Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 28 May 2013 13:35:01 -0700
From:      Alfred Perlstein <alfred@ixsystems.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        Alexander Motin <mav@freebsd.org>, hackers@freebsd.org, Xin Li <delphij@delphij.net>
Subject:   Re: please review, patch for lost camisr
Message-ID:  <51A514F5.9050405@ixsystems.com>
In-Reply-To: <201305281613.32414.jhb@freebsd.org>
References:  <51A44B0C.8010908@ixsystems.com> <201305281204.14146.jhb@freebsd.org> <51A505A5.7030105@ixsystems.com> <201305281613.32414.jhb@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
[[  moved to hackers@ from private mail. ]]

On 5/28/13 1:13 PM, John Baldwin wrote:
> On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote:
>> On 5/28/13 9:04 AM, John Baldwin wrote:
>>> On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote:
>>>> Hey folks,
>>>>
>>>> I had a talk with Nathan Whitehorn about the camisr issue.  The issue we
>>>> are seeing we mostly know, but to summarize, we are losing the camisr
>>>> signal and the camisr is not being run.
>>>>
>>>> I gave him a summary of what we have been seeing and pointed him to the
>>>> code I am concerned about here:
>>>> http://pastebin.com/tLKr7mCV  (this is inside of kern_intr.c).
>>>>
>>>> What I think that is happening is that the setting of it_need to 0
>>>> inside of sys/kern/kern_intr.c:ithread_loop() is not being scheduled
>>>> correctly and it is being delayed until AFTER the call to
>>>> ithread_execute_handlers() right below the atomic_store_rel_int().
>>> This seems highly unlikely, to the extent that if this were true all our
>>> locking primitives would be broken.  The store_rel is actually a release
>>> barrier which acts like more a read/write fence.  No memory accesses (read or
>>> write) from before the release can be scheduled after the associated store,
>>> either by the compiler or CPU.  That is what Konstantin is referring to in his
>>> commit when he says "release" semantics".
>> Yes, that makes sense, however does it specify that the writes *must*
>> occur at that *point*?  If it only enforces ordering then we may have
>> some issue, specifically because the setting of it to '1' inside of
>> intr_event_schedule_thread has no barrier other than the acq semantics
>> of the thread lock.  I am wondering what is forcing out the '1' there.
> Nothing ever forces writes.  You would have to invalidate the cache to do that
> and that is horribly expensive.  It is always only about ordering and knowing
> that if you can complete another operation on the same "cookie" variable with
> acquire semantics that earlier writes will be visible.

By cookie, you mean a specific memory address, basically a lock? This is 
starting to reinforce my suspicions as the setting of it_need is done 
with release semantics, however the acq on the other CPU is done on the 
thread lock.  Maybe that is irrelevant.  We will find out shortly.

>
>> See below as I think we have proof that this is somehow happening.
> Having ih_need of 1 and it_need of 0 is certainly busted.  The simplest fix
> is probably to stop using atomics on it_need and just grab the thread lock
> in the main ithread loop and hold it while checking and clearing it_need.
>

OK, we have some code that will either prove this, or perturb the memory 
ordering enough to make the bug go away, or prove this assertion wrong.

We will update on our findings hopefully in the next few days.

Thank you for your advice.

-Alfred





-Alfred



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?51A514F5.9050405>