From owner-freebsd-bugs@freebsd.org  Fri Sep 15 07:26:37 2017
Return-Path: <owner-freebsd-bugs@freebsd.org>
Delivered-To: freebsd-bugs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3750CE048B0
 for <freebsd-bugs@mailman.ysv.freebsd.org>;
 Fri, 15 Sep 2017 07:26:37 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org
 [IPv6:2001:1900:2254:206a::16:76])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 2502A80299
 for <freebsd-bugs@FreeBSD.org>; Fri, 15 Sep 2017 07:26:37 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from bugs.freebsd.org ([127.0.1.118])
 by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v8F7QbZv020910
 for <freebsd-bugs@FreeBSD.org>; Fri, 15 Sep 2017 07:26:37 GMT
 (envelope-from bugzilla-noreply@freebsd.org)
From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 222234] head -r323246 aarch64 (Pine64+ 2GB) boot time context,
 sometimes: acquiring blockable sleep lock with spinlock or critical section
 held
Date: Fri, 15 Sep 2017 07:26:37 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: CURRENT
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: Affects Only Me
X-Bugzilla-Who: markmi@dsl-only.net
X-Bugzilla-Status: New
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: freebsd-bugs@FreeBSD.org
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-222234-8-JctlQ38AmG@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-222234-8@https.bugs.freebsd.org/bugzilla/>
References: <bug-222234-8@https.bugs.freebsd.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-bugs@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Bug reports <freebsd-bugs.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-bugs>,
 <mailto:freebsd-bugs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-bugs/>
List-Post: <mailto:freebsd-bugs@freebsd.org>
List-Help: <mailto:freebsd-bugs-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-bugs>,
 <mailto:freebsd-bugs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Sep 2017 07:26:37 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D222234

--- Comment #1 from Mark Millard <markmi@dsl-only.net> ---
(In reply to Mark Millard from comment #0)

Based on verbose boot attempts I've seen that
the failures are during:

taskqgroup_adjust_softirq(0)...


I've traced the failure back to the bad pointer
value that is in place when it was put to use.

Summary of the following investigations:

It traces back to the condition:

       pcpu_find(cpu)->pc_curthread =3D=3D NULL

in the code:

       ctd =3D pcpu_find(cpu)->pc_curthread;

for cpu =3D=3D 1 in tdq_notify (but inlined). It then
attempts to evaluate:

ctd->td_priority

which gets a data_abort but it is during when
blocked_lock is the container lock for the
thread.

In the witness included case this leads to:

panic: acquiring blockable sleep lock with spinlock or critical section held
(sleep mutex) pmap @ /usr/src/sys/arm64/arm64/pmap.c:4710
cpuid =3D 0
time =3D 13

but that is a later consequence of the earlier
problem.

I'm not sure why pcpu_find(cpu)->pc_curthread
is still NULL but since the behavior is intermittent
for debug kernel builds it suggests a race for an
update that was initiated but not always finished
in time.

(I've had occasions of hours of reboots to try
to get a failure but mostly only needing a few.
I've not run into a long sequence of failures to
boot for a debug kernel but have had some
back-to-back ones.)


Supporting detail that lead to the above:

int
pmap_fault(pmap_t pmap, uint64_t esr, uint64_t far)

The far (fault address register) argument to pmap_fault is the=20
rd one (x2 below):

ffff000000606954 <pmap_fault> stp       x22, x21, [sp, #-48]!
ffff000000606958 <pmap_fault+0x4> stp   x20, x19, [sp, #16]
ffff00000060695c <pmap_fault+0x8> stp   x29, x30, [sp, #32]
ffff000000606960 <pmap_fault+0xc> add   x29, sp, #0x20
ffff000000606964 <pmap_fault+0x10> mov  x20, x2
ffff000000606968 <pmap_fault+0x14> mov  x22, x1
ffff00000060696c <pmap_fault+0x18> mov  x21, x0

For the failing call sequence far ends up stored on the stack
via the x20 save-to-stack in:

ffff0000002f8c0c <__mtx_lock_flags> stp x24, x23, [sp, #-64]!
ffff0000002f8c10 <__mtx_lock_flags+0x4> stp     x22, x21, [sp, #16]
ffff0000002f8c14 <__mtx_lock_flags+0x8> stp     x20, x19, [sp, #32]
ffff0000002f8c18 <__mtx_lock_flags+0xc> stp     x29, x30, [sp, #48]
ffff0000002f8c1c <__mtx_lock_flags+0x10> add    x29, sp, #0x30

Stack segment with a little context:

0xffff000069850470:     ffff0000698504b0                ffff0000002f8b80
0xffff000069850480:     ffff000000c6a528                0
0xffff000069850490:     96000004                        ffff000000c6a658
0xffff0000698504a0:     37e                             ffff000000c6a670
0xffff0000698504b0:     ffff0000698504e0                ffff000000606998

So it appears:

pmap_fault`esr =3D=3D 0x96000004
pmap_fault`pmap =3D=3D 0xffff000000c6a658 (vmspace0+0x130)
pmap_fault`far =3D=3D 0x37e

I'll note that 0x37e =3D 894 so it matches up with
x8 =3D=3D 0x0 for the likes of:

elr         0xffff00000033f0dc  sched_switch+0x2bc
. . .
ssched_switch+0x2b8:     ldrb    w9, [x8, #894]

matching:

db> show reg
. . .
x8                           0
. . .

So apparently sched_switch tried to access 0x37e

db> x/gx 0xffff000000606998=20=20=20
pmap_fault+0x44:        f100111f927e0ec8

Part of the back trace is (for the example
debug kernel):

kassert_panic() at witness_checkorder+0x160
        pc =3D 0xffff0000003174e4  lr =3D 0xffff000000374990
        sp =3D 0xffff0000698503f0  fp =3D 0xffff000069850470

witness_checkorder() at __mtx_lock_flags+0xa8
        pc =3D 0xffff000000374990  lr =3D 0xffff0000002f8b7c
        sp =3D 0xffff000069850480  fp =3D 0xffff0000698504b0

__mtx_lock_flags() at pmap_fault+0x40
        pc =3D 0xffff0000002f8b7c  lr =3D 0xffff000000606994
        sp =3D 0xffff0000698504c0  fp =3D 0xffff0000698504e0

pmap_fault() at data_abort+0xb8
        pc =3D 0xffff000000606994  lr =3D 0xffff000000608a9c
        sp =3D 0xffff0000698504f0  fp =3D 0xffff0000698505a0

data_abort() at do_el1h_sync+0xfc
        pc =3D 0xffff000000608a9c  lr =3D 0xffff0000006088f0
        sp =3D 0xffff0000698505b0  fp =3D 0xffff0000698505e0

do_el1h_sync() at handle_el1h_sync+0x74
        pc =3D 0xffff0000006088f0  lr =3D 0xffff0000005f1874
        sp =3D 0xffff0000698505f0  fp =3D 0xffff000069850700

handle_el1h_sync() at sched_switch+0x2a8
        pc =3D 0xffff0000005f1874  lr =3D 0xffff00000033f0c8
        sp =3D 0xffff000069850710  fp =3D 0xffff0000698507f0

sched_switch() at mi_switch+0x1b8
        pc =3D 0xffff00000033f0c8  lr =3D 0xffff00000032161c
        sp =3D 0xffff000069850800  fp =3D 0xffff000069850820

mi_switch() at taskqgroup_binder+0x7c
        pc =3D 0xffff00000032161c  lr =3D 0xffff00000035510c
        sp =3D 0xffff000069850830  fp =3D 0xffff000069850860

taskqgroup_binder() at gtaskqueue_run_locked+0x104
        pc =3D 0xffff00000035510c  lr =3D 0xffff000000354f74
        sp =3D 0xffff000069850870  fp =3D 0xffff0000698508e0

gtaskqueue_run_locked() at gtaskqueue_thread_loop+0x9c
        pc =3D 0xffff000000354f74  lr =3D 0xffff000000354d10
        sp =3D 0xffff0000698508f0  fp =3D 0xffff000069850910

gtaskqueue_thread_loop() at fork_exit+0x7c
        pc =3D 0xffff000000354d10  lr =3D 0xffff0000002dbd3c
        sp =3D 0xffff000069850920  fp =3D 0xffff000069850950

fork_exit() at fork_trampoline+0x10
        pc =3D 0xffff0000002dbd3c  lr =3D 0xffff000000608664
        sp =3D 0xffff000069850960  fp =3D 0x0000000000000000

Note:

ffff00000033f0bc <sched_switch+0x29c> ldr       w0, [x19, #1292]
ffff00000033f0c0 <sched_switch+0x2a0> ldrb      w27, [x19, #894]
ffff00000033f0c4 <sched_switch+0x2a4> str       w0, [sp, #12]
ffff00000033f0c8 <sched_switch+0x2a8> bl        ffff000000359708 <pcpu_find>
ffff00000033f0cc <sched_switch+0x2ac> ldr       x8, [x0]
ffff00000033f0d0 <sched_switch+0x2b0> mov       w11, w27
ffff00000033f0d4 <sched_switch+0x2b4> adrp      x27, ffff000000c85000
<dpcpu+0x158>
ffff00000033f0d8 <sched_switch+0x2b8> ldrb      w9, [x8, #894]
ffff00000033f0dc <sched_switch+0x2bc> cmp       w11, w9
ffff00000033f0e0 <sched_switch+0x2c0> b.cs      ffff00000033f150
<sched_switch+0x330>  // b.hs, b.nlast

This is code for the later part of what is
shown below:

static void
tdq_notify(struct tdq *tdq, struct thread *td)
{
       struct thread *ctd;
       int pri;
       int cpu;

       if (tdq->tdq_ipipending)
               return;
       cpu =3D td_get_sched(td)->ts_cpu;
       pri =3D td->td_priority;
       ctd =3D pcpu_find(cpu)->pc_curthread;
       if (!sched_shouldpreempt(pri, ctd->td_priority, 1))
               return;
. . .
}

(Where: sched_shouldpreempt has been inlined and
some of it is interlaced.)

The failing [x8, #894] is the attempt to access: ctd->td_priority

In other words: ctd =3D=3D NULL resulted from the pcpu_find
(i.e., x8 =3D=3D 0 ). As for how it got to be zero:

db> show reg
spsr        0x9600000440000085
x0          0xffff000000ac1000  __pcpu+0x200
. . .

db> x/gx cpuid_to_pcpu,32
cpuid_to_pcpu:  ffff000000ac0e00                ffff000000ac1000
cpuid_to_pcpu+0x10:     ffff000000ac1200                ffff000000ac1400
cpuid_to_pcpu+0x20:     0                               0
. . .

(So cpu =3D=3D 1 .)

db> x/gx 0xffff000000ac1000,8
__pcpu+0x200:   0                               fffffd00005dda80
__pcpu+0x210:   0                               0
__pcpu+0x220:   0                               0
__pcpu+0x230:   100000000                       ffff000000ac1200

Thus it seems that at the time for cpu=3D=3D1 :

       pcpu_find(cpu)->pc_curthread =3D=3D NULL

(at __pcpu+0x200).

--=20
You are receiving this mail because:
You are the assignee for the bug.=