From owner-freebsd-bugs@freebsd.org Fri Sep 15 07:26:37 2017 Return-Path: Delivered-To: freebsd-bugs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3750CE048B0 for ; Fri, 15 Sep 2017 07:26:37 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 2502A80299 for ; Fri, 15 Sep 2017 07:26:37 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v8F7QbZv020910 for ; Fri, 15 Sep 2017 07:26:37 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-bugs@FreeBSD.org Subject: [Bug 222234] head -r323246 aarch64 (Pine64+ 2GB) boot time context, sometimes: acquiring blockable sleep lock with spinlock or critical section held Date: Fri, 15 Sep 2017 07:26:37 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: markmi@dsl-only.net X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-bugs@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Sep 2017 07:26:37 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D222234 --- Comment #1 from Mark Millard --- (In reply to Mark Millard from comment #0) Based on verbose boot attempts I've seen that the failures are during: taskqgroup_adjust_softirq(0)... I've traced the failure back to the bad pointer value that is in place when it was put to use. Summary of the following investigations: It traces back to the condition: pcpu_find(cpu)->pc_curthread =3D=3D NULL in the code: ctd =3D pcpu_find(cpu)->pc_curthread; for cpu =3D=3D 1 in tdq_notify (but inlined). It then attempts to evaluate: ctd->td_priority which gets a data_abort but it is during when blocked_lock is the container lock for the thread. In the witness included case this leads to: panic: acquiring blockable sleep lock with spinlock or critical section held (sleep mutex) pmap @ /usr/src/sys/arm64/arm64/pmap.c:4710 cpuid =3D 0 time =3D 13 but that is a later consequence of the earlier problem. I'm not sure why pcpu_find(cpu)->pc_curthread is still NULL but since the behavior is intermittent for debug kernel builds it suggests a race for an update that was initiated but not always finished in time. (I've had occasions of hours of reboots to try to get a failure but mostly only needing a few. I've not run into a long sequence of failures to boot for a debug kernel but have had some back-to-back ones.) Supporting detail that lead to the above: int pmap_fault(pmap_t pmap, uint64_t esr, uint64_t far) The far (fault address register) argument to pmap_fault is the=20 rd one (x2 below): ffff000000606954 stp x22, x21, [sp, #-48]! ffff000000606958 stp x20, x19, [sp, #16] ffff00000060695c stp x29, x30, [sp, #32] ffff000000606960 add x29, sp, #0x20 ffff000000606964 mov x20, x2 ffff000000606968 mov x22, x1 ffff00000060696c mov x21, x0 For the failing call sequence far ends up stored on the stack via the x20 save-to-stack in: ffff0000002f8c0c <__mtx_lock_flags> stp x24, x23, [sp, #-64]! ffff0000002f8c10 <__mtx_lock_flags+0x4> stp x22, x21, [sp, #16] ffff0000002f8c14 <__mtx_lock_flags+0x8> stp x20, x19, [sp, #32] ffff0000002f8c18 <__mtx_lock_flags+0xc> stp x29, x30, [sp, #48] ffff0000002f8c1c <__mtx_lock_flags+0x10> add x29, sp, #0x30 Stack segment with a little context: 0xffff000069850470: ffff0000698504b0 ffff0000002f8b80 0xffff000069850480: ffff000000c6a528 0 0xffff000069850490: 96000004 ffff000000c6a658 0xffff0000698504a0: 37e ffff000000c6a670 0xffff0000698504b0: ffff0000698504e0 ffff000000606998 So it appears: pmap_fault`esr =3D=3D 0x96000004 pmap_fault`pmap =3D=3D 0xffff000000c6a658 (vmspace0+0x130) pmap_fault`far =3D=3D 0x37e I'll note that 0x37e =3D 894 so it matches up with x8 =3D=3D 0x0 for the likes of: elr 0xffff00000033f0dc sched_switch+0x2bc . . . ssched_switch+0x2b8: ldrb w9, [x8, #894] matching: db> show reg . . . x8 0 . . . So apparently sched_switch tried to access 0x37e db> x/gx 0xffff000000606998=20=20=20 pmap_fault+0x44: f100111f927e0ec8 Part of the back trace is (for the example debug kernel): kassert_panic() at witness_checkorder+0x160 pc =3D 0xffff0000003174e4 lr =3D 0xffff000000374990 sp =3D 0xffff0000698503f0 fp =3D 0xffff000069850470 witness_checkorder() at __mtx_lock_flags+0xa8 pc =3D 0xffff000000374990 lr =3D 0xffff0000002f8b7c sp =3D 0xffff000069850480 fp =3D 0xffff0000698504b0 __mtx_lock_flags() at pmap_fault+0x40 pc =3D 0xffff0000002f8b7c lr =3D 0xffff000000606994 sp =3D 0xffff0000698504c0 fp =3D 0xffff0000698504e0 pmap_fault() at data_abort+0xb8 pc =3D 0xffff000000606994 lr =3D 0xffff000000608a9c sp =3D 0xffff0000698504f0 fp =3D 0xffff0000698505a0 data_abort() at do_el1h_sync+0xfc pc =3D 0xffff000000608a9c lr =3D 0xffff0000006088f0 sp =3D 0xffff0000698505b0 fp =3D 0xffff0000698505e0 do_el1h_sync() at handle_el1h_sync+0x74 pc =3D 0xffff0000006088f0 lr =3D 0xffff0000005f1874 sp =3D 0xffff0000698505f0 fp =3D 0xffff000069850700 handle_el1h_sync() at sched_switch+0x2a8 pc =3D 0xffff0000005f1874 lr =3D 0xffff00000033f0c8 sp =3D 0xffff000069850710 fp =3D 0xffff0000698507f0 sched_switch() at mi_switch+0x1b8 pc =3D 0xffff00000033f0c8 lr =3D 0xffff00000032161c sp =3D 0xffff000069850800 fp =3D 0xffff000069850820 mi_switch() at taskqgroup_binder+0x7c pc =3D 0xffff00000032161c lr =3D 0xffff00000035510c sp =3D 0xffff000069850830 fp =3D 0xffff000069850860 taskqgroup_binder() at gtaskqueue_run_locked+0x104 pc =3D 0xffff00000035510c lr =3D 0xffff000000354f74 sp =3D 0xffff000069850870 fp =3D 0xffff0000698508e0 gtaskqueue_run_locked() at gtaskqueue_thread_loop+0x9c pc =3D 0xffff000000354f74 lr =3D 0xffff000000354d10 sp =3D 0xffff0000698508f0 fp =3D 0xffff000069850910 gtaskqueue_thread_loop() at fork_exit+0x7c pc =3D 0xffff000000354d10 lr =3D 0xffff0000002dbd3c sp =3D 0xffff000069850920 fp =3D 0xffff000069850950 fork_exit() at fork_trampoline+0x10 pc =3D 0xffff0000002dbd3c lr =3D 0xffff000000608664 sp =3D 0xffff000069850960 fp =3D 0x0000000000000000 Note: ffff00000033f0bc ldr w0, [x19, #1292] ffff00000033f0c0 ldrb w27, [x19, #894] ffff00000033f0c4 str w0, [sp, #12] ffff00000033f0c8 bl ffff000000359708 ffff00000033f0cc ldr x8, [x0] ffff00000033f0d0 mov w11, w27 ffff00000033f0d4 adrp x27, ffff000000c85000 ffff00000033f0d8 ldrb w9, [x8, #894] ffff00000033f0dc cmp w11, w9 ffff00000033f0e0 b.cs ffff00000033f150 // b.hs, b.nlast This is code for the later part of what is shown below: static void tdq_notify(struct tdq *tdq, struct thread *td) { struct thread *ctd; int pri; int cpu; if (tdq->tdq_ipipending) return; cpu =3D td_get_sched(td)->ts_cpu; pri =3D td->td_priority; ctd =3D pcpu_find(cpu)->pc_curthread; if (!sched_shouldpreempt(pri, ctd->td_priority, 1)) return; . . . } (Where: sched_shouldpreempt has been inlined and some of it is interlaced.) The failing [x8, #894] is the attempt to access: ctd->td_priority In other words: ctd =3D=3D NULL resulted from the pcpu_find (i.e., x8 =3D=3D 0 ). As for how it got to be zero: db> show reg spsr 0x9600000440000085 x0 0xffff000000ac1000 __pcpu+0x200 . . . db> x/gx cpuid_to_pcpu,32 cpuid_to_pcpu: ffff000000ac0e00 ffff000000ac1000 cpuid_to_pcpu+0x10: ffff000000ac1200 ffff000000ac1400 cpuid_to_pcpu+0x20: 0 0 . . . (So cpu =3D=3D 1 .) db> x/gx 0xffff000000ac1000,8 __pcpu+0x200: 0 fffffd00005dda80 __pcpu+0x210: 0 0 __pcpu+0x220: 0 0 __pcpu+0x230: 100000000 ffff000000ac1200 Thus it seems that at the time for cpu=3D=3D1 : pcpu_find(cpu)->pc_curthread =3D=3D NULL (at __pcpu+0x200). --=20 You are receiving this mail because: You are the assignee for the bug.=