Date:      Wed, 13 Oct 2010 11:29:08 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Gennady Proskurin <gprspb@mail.ru>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Bruce Evans <brde@optusnet.com.au>, Andriy Gapon <avg@freebsd.org>
Subject:   Re: svn commit: r213648 - head/sys/kern
Message-ID:  <20101013094338.U1075@besplex.bde.org>
In-Reply-To: <20101012210804.GA4286@gpr.nnz-home.ru>
References:  <201010090807.o9987nG0030939@svn.freebsd.org> <20101009191600.N3531@besplex.bde.org> <4CB03A82.6040701@freebsd.org> <20101012210804.GA4286@gpr.nnz-home.ru>

On Wed, 13 Oct 2010, Gennady Proskurin wrote:

> On Sat, Oct 09, 2010 at 12:48:50PM +0300, Andriy Gapon wrote:
>> on 09/10/2010 12:33 Bruce Evans said the following:
>>> On Sat, 9 Oct 2010, Andriy Gapon wrote:
>[>]*...
>>> Now, why don't the partial memory barriers prevent caching the variable?
>>>
>>> %     if (panic_cpu != PCPU_GET(cpuid))
>>> %         while (atomic_cmpset_int(&panic_cpu, NOCPU,
>>> %             PCPU_GET(cpuid)) == 0)
>>> %             while (panic_cpu != NOCPU)
>>> %                 ; /* nothing */
>>>
>>> The very first access can't reasonably use a cached value.  atomic_cmpset()

So the first access doesn't need a compiler-type memory barrier.  Not needing
a CPU-type one is more subtle -- see below.

>>> can change panic_cpu, so even without the memory barrier, panic_cpu must
>>> be reloaded for the third access the first time.  But then when the third
>>> access is repeated in the second while loop, the missing atomic op with
>>> barrier makes things very broken.  panic_cpu isn't changed by the loop,
>>> and the compiler thinks that it isn't changed by anything else either, so
>>> the compiler may reduce the loop to:
>>>
>>> %             if (panic_cpu != NOCPU)
>>> %                 for (;;)
>>> %                     ; /* nothing */
>>
>>
>> Yes, it's exactly the last loop that had the problem.
>> On amd64 with -O2:
>>         .loc 1 544 0
>>         movl    panic_cpu(%rip), %eax
>> .LVL134:
>>         .p2align 4,,7
>> .L210:
>>         cmpl    $255, %eax
>>         jne     .L210
>>         jmp     .L225
>>
>>> except I've seen claims that even an endless for loop can be optimized
>>> to nothing.  Declaring panic_cpu as volatile prevents the compiler doing
>>> this, but I think it is insufficient.  We really do want to see panic_cpu
>>> changed by other CPUs, and what is the point of atomic_load_acq*() if not
>>> to use for this -- if declaring things volatile were sufficient, then we
>>> could just use *(volatile *)&var.  atomic_load/store_acq/rel*() on i386
>>
>> I discussed this with kib and the idea is that atomic operation is not needed in
>> that place and volatile is sufficient, because we don't need barrier semantics
>> there and only want to ensure that variable is reloaded from memory.
>
> If I understand correctly, without acquiring barrier, variable is not
> guaranteed to be loaded from memory, it can end up with some stale value from
> cpu's cache or pipeline.  If incorrect value will be checked in the first "if"
> operator, it can lead to skiping all the atomic synchronization.
> The same about writing to variable, without barrier new value may be written to
> some local cpu cache and not be seen by readers until it is flushed from cache.

That's what I was thinking.  Now I think the only real problem is that
not locking everything makes the correctness very hard to understand,
and it shouldn't be done, since this code is as far as possible from
needing to be efficient.

If the access in the inner while loop sees a stale value, then there
is no problem provided the value eventually becomes unstale or just
changes to a different stale value -- then we exit the inner loop and
go back to the locked access in the outer loop.

The access before the outer loop can at most see a stale value which
doesn't matter.  Only the current CPU can set panic_cpu to PCPU_GET(cpuid).
When panic() is called with panic_cpu having this value, it means that
the panic is recursive (since panic_cpu is initially NOCPU, and the
current CPU doesn't/shouldn't change it to anything but itself).  For
recursive panics, the current CPU should already have locked everything
and stopped other CPUs.  It doesn't need to lock against itself, and it
needs the test using the first access to avoid deadlocking against itself.
When panic() is called with panic_cpu = NOCPU and this value is not stale,
then there is no problem.  When panic() is called with panic_cpu = NOCPU
and this value is stale, it means that one CPU entered and left panic()
and set the value to NOCPU, but another CPU has entered panic() and set
the value to !NOCPU but we haven't seen this yet.  Then we reach the
outer while loop expecting to acquire the lock, but when we try to
acquire it we don't get it, and there is no problem.  Finally, when
panic() is called with panic_cpu = another_cpu, there is similarly
no problem, whether or not this value is stale.  Then we reach the outer
while loop expecting not to acquire the lock, but we may sometimes
acquire it, even on the first try, if the value was stale or if it
just became stale after we read it.

Perhaps my rule for locking read accesses allows/requires the complexity
in the above.  The rule: never lock read accesses, since the value may
be stale immediately after you read it, whether or not you locked the
access.  Locking a read access is useful only in delicate cases where
you need a value that is non-stale at the time of the access but are
happy with a value that goes stale later.

The panic locking could be written using only standard complications using
a recursive mutex (or trylock), except then it might have to fight with
the mutex implementation doing more than the simple spin:

 	mtx_lock_spin(&panic_mtx);
 	/*
 	 * Already have a problem -- the implementation disables interrupts,
 	 * which we should want, but since we do wrong things like syncing
 	 * with normal i/o expected to work, we probably need interrupts.
 	 */

It's interesting that the implementation of mutexes doesn't optimize for
the unusual case of a mutex being recursed on, but we do that here.  We
could be more like the mutex implementation and do:

 	while (atomic_cmpset_int(&panic_cpu, NOCPU, PCPU_GET(cpuid)) == 0) {
 		if (panic_cpu == PCPU_GET(cpuid))
 			break;	/* recursive */
 		while (panic_cpu != NOCPU)
 			continue;
 	}

>
> This also applies to r213736.
>
>> atomic_store_rel at the end seems to be needed because of inter-dependency
>> between panicstr and panic_cpu.  That is we want changes to panicstr to becomes
>> visible at the same time (before actually) changes to panic_cpu are visible.

There are similar but larger complications for debugger entry.  Debugger
entry has the additional problem that a thundering herd of CPUs may
enter at a breakpoint, or at any debugger trap with handling necessary
but normal handling not possible.  Debugger entry has the additional
non-problem that it at least tries to stop other CPUs before proceeding
far.  My i386 ddb entry starts a bit like the above:

% 	ef = read_eflags();
% 	disable_intr();
% 	if (atomic_cmpset_int(&kdb_trap_lock, NOCPU, PCPU_GET(cpuid)) == 0 &&
% 	    kdb_trap_lock != PCPU_GET(cpuid)) {

It has the anti-recursion detection in the same if statement as the cmpset.

% 		while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
% 			;
% 		db_printf(
% 		    "concurrent ddb entry: type %d trap, code=%x cpu=%d\n",
% 		    type, code, PCPU_GET(cpuid));
% 		atomic_store_rel_int(&output_lock, 0);

Another lock to serialize output for debugging this stuff.  Thundering
herds at breakpoints are very easy to arrange, and they will interleave
their output almost perfectly with some console drivers (say, a serial
console at 50 bps so that you can easily watch the problem; this is also
a good test for console drivers).

Elsewhere I posted a similar method for properly serializing plain
printf output.  That bounds the time for the output so it might not
work on 50 bps consoles.  The above has no bound and depends on
db_printf() never blocking endlessly even under fire from places like
here.

% 		if (type == T_BPTFLT)
% 			regs->tf_eip--;

This backs off from the breakpoint.  It is not MI enough.  Some nut
might have put a 2-byte breakpoint instruction in the instruction
stream.  Then the breakpoint won't belong to us, so not handling it
here would be correct.  Backing off 2 would work OK too (it allows the
instruction to trap again).  Backing off 1 backs into it and corrupts
it.  Backing off 1 works OK in the same way as when the breakpoint is
not ours.  Backing off 1 here handles the case where the breakpoint
is ours now but will go away because another ddb entry proceeds and
removes it.

% 		else {
% 			while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
% 				;
% 			db_printf(
% "concurrent ddb entry on non-breakpoint: too hard to handle properly\n");
% 			atomic_store_rel_int(&output_lock, 0);
% 		}

More debugging.

% 		while (atomic_load_acq_int(&kdb_trap_lock) != NOCPU)
% 			;

This corresponds to the inner while loop in panic().  According to the
above discussion, I don't need the atomic op here.  I should use pause()
here.

% 		write_eflags(ef);

panic() doesn't disable interrupts, especially in the inner while loop.
Interrupts must be kept disabled to avoid them doing bad things including
recursive debugger and/or panic entries, although masking of all interrupts
breaks sending of most IPIs.

% 		return (1);
% 	}
% // Proceed to stopping CPUs...  Since some CPUs may be looping with
% // interrupts masked, either in the above loop or accidentally, the
% // old method of sending maskable IPIs may block us endlessly.  I
% // use a hack to work around this.

Bruce


