Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 10 May 2017 15:29:39 -0700
From:      Olivier Cinquin <olivier.cinquin@uci.edu>
To:        freebsd-stable@freebsd.org
Subject:   Kernel panic in smp rendezvous
Message-ID:  <EC4BA9D6-A5B2-45D3-8DE9-A98FEE20694F@uci.edu>

next in thread | raw e-mail | index | archive | help
Hi,
I'm getting the following kernel panics on recent 11-STABLE (r317883):

spin lock 0xffffffff81df43d0 (smp rendezvous) held by 0xfffff8019c7a7000 =
(tid 100845) too long
timeout stopping cpus
panic: spin lock held too long
cpuid =3D 12
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame =
0xfffffe3e64caf2a0
vpanic() at vpanic+0x186/frame 0xfffffe3e64caf320
panic() at panic+0x43/frame 0xfffffe3e64caf380
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0x311/frame =
0xfffffe3e64caf3f0
smp_targeted_tlb_shootdown() at smp_targeted_tlb_shootdown+0x101/frame =
0xfffffe3e64caf470
smp_masked_invlpg_range() at smp_masked_invlpg_range+0x50/frame =
0xfffffe3e64caf4a0
pmap_invalidate_range() at pmap_invalidate_range+0x196/frame =
0xfffffe3e64caf4e0
vm_thread_stack_dispose() at vm_thread_stack_dispose+0x2f/frame =
0xfffffe3e64caf530
thread_free() at thread_free+0x39/frame 0xfffffe3e64caf550
thread_reap() at thread_reap+0x10e/frame 0xfffffe3e64caf570
proc_reap() at proc_reap+0x6bd/frame 0xfffffe3e64caf5b0
proc_to_reap() at proc_to_reap+0x48c/frame 0xfffffe3e64caf600
kern_wait6() at kern_wait6+0x49e/frame 0xfffffe3e64caf6b0
sys_wait4() at sys_wait4+0x78/frame 0xfffffe3e64caf8a0
amd64_syscall() at amd64_syscall+0x6c4/frame 0xfffffe3e64cafa30
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe3e64cafa30


The panics occur sporadically and seemingly randomly. They occur on
multiple machines with the same configuration, and on older revisions of
11-STABLE as well as r317883 (I cannot easily bisect this down to a
specific commit given that it can take days or sometimes weeks for the
panic to occur). Note that my kernel is compiled with "options IPSEC", =
but
that does not seem to be relevant.

My understanding of the overall problem is that an IPI is sent to all =
cores
by smp_rendezvous_cpus, but that some of the cores do not respond to it =
(or
at least do not respond to it correctly and in time). More specifically, =
I
can find 61 threads that seem to be blocked in cpustop_handler on an =
atomic
operation to indicate that they have stopped in response to the IPI. =
This
operation is CPU_SET_ATOMIC(cpu, &stopped_cpus); CPU_SET_ATOMIC boils =
down
to a call to "atomic_set_long" (not sure where that is itself defined =
for
amd64). Either there is a line numbering problem, or this suggests that
perhaps there is a deadlock at this step (unless there's some sort of
livelock and CPU_SET_ATOMIC is the place where the threads spend most of
their time).

Looking at the thread backtraces as a whole, I can see the one that
triggered the panic, the 61 that are in cpustop_handler, and a lot of
sleeping threads. I guess each core should have an active thread, and =
this
is an architecture with 64 cores, so that leaves 64 - (1 + 61) =3D 2 =
cores
that are unaccounted for. Interestingly, for 2 different panics for =
which I
have a vmcore file these numbers are the same. For these two panics, the
IDs of the cores that are neither in cpustop_handler nor calling the smp
rendez vous are not the same (#18 and #19 in one instance, #44 and #45 =
in
the other instance) but in both instances the IDs are consecutive; =
perhaps
that's relevant.

I've uploaded a full set of backtraces at
https://gist.github.com/cinquin/d63cdf9de01b0b8033c47128868f2d38 and a
dmesg at =
https://gist.github.com/cinquin/17c0cf6ac7832fd1b683488d08d3e69b
(in short, the machine is a 4 processor Opteron 6274 system). I'm =
pasting below
an example backtrace of the threads in stopcpu_handler.

Any suggestions as to what the problem might be and how to solve it? =
I've
saved the core and can provide further information.

Thanks
Olivier Cinquin

A total of 61 threads are along the lines of

#0  cpustop_handler () at /usr/src/sys/x86/x86/mp_x86.c:1275
#1  0xffffffff8105a6f5 in ipi_nmi_handler () at =
/usr/src/sys/x86/x86/mp_x86.c:1231
#2  0xffffffff80ef740a in trap (frame=3D0xfffffe3e687f7f30) at =
/usr/src/sys/amd64/amd64/trap.c:185
#3  0xffffffff80edbd03 in nmi_calltrap () at =
/usr/src/sys/amd64/amd64/exception.S:510
...
(kgdb) frame 0
#0  cpustop_handler () at /usr/src/sys/x86/x86/mp_x86.c:1275
1275		CPU_SET_ATOMIC(cpu, &stopped_cpus);
(kgdb) list
1270		cpu =3D PCPU_GET(cpuid);
1271=09
1272		savectx(&stoppcbs[cpu]);
1273=09
1274		/* Indicate that we are stopped */
1275		CPU_SET_ATOMIC(cpu, &stopped_cpus);
1276=09
1277		/* Wait for restart */
1278		while (!CPU_ISSET(cpu, &started_cpus))
1279		    ia32_pause();
(kgdb) p stopped_cpus
$1 =3D {__bits =3D 0xffffffff81df4368}
(kgdb) p started_cpus
$2 =3D {__bits =3D 0xffffffff81df4418}





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?EC4BA9D6-A5B2-45D3-8DE9-A98FEE20694F>