From owner-freebsd-ppc@freebsd.org Tue Jun 6 18:16:01 2017 Return-Path: Delivered-To: freebsd-ppc@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 24178BF229C for ; Tue, 6 Jun 2017 18:16:01 +0000 (UTC) (envelope-from markmi@dsl-only.net) Received: from asp.reflexion.net (outbound-mail-210-12.reflexion.net [208.70.210.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id D212C75D29 for ; Tue, 6 Jun 2017 18:16:00 +0000 (UTC) (envelope-from markmi@dsl-only.net) Received: (qmail 29943 invoked from network); 6 Jun 2017 18:09:19 -0000 Received: from unknown (HELO rtc-sm-01.app.dca.reflexion.local) (10.81.150.1) by 0 (rfx-qmail) with SMTP; 6 Jun 2017 18:09:19 -0000 Received: by rtc-sm-01.app.dca.reflexion.local (Reflexion email security v8.40.0) with SMTP; Tue, 06 Jun 2017 14:09:19 -0400 (EDT) Received: (qmail 1220 invoked from network); 6 Jun 2017 18:09:18 -0000 Received: from unknown (HELO iron2.pdx.net) (69.64.224.71) by 0 (rfx-qmail) with (AES256-SHA encrypted) SMTP; 6 Jun 2017 18:09:18 -0000 Received: from [192.168.1.114] (c-76-115-7-162.hsd1.or.comcast.net [76.115.7.162]) by iron2.pdx.net (Postfix) with ESMTPSA id 11DB4EC8B8B; Tue, 6 Jun 2017 11:09:18 -0700 (PDT) From: Mark Millard Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: A different 32-bit powerpc head -r317820 panic on old PowerMac G5: dual backtraces from "timeout stopping cpus" (dump failed though): any comments? Date: Tue, 6 Jun 2017 11:09:17 -0700 References: To: Justin Hibbits , Nathan Whitehorn , FreeBSD PowerPC ML , freebsd-hackers@freebsd.org In-Reply-To: Message-Id: <1F1E52BD-375E-47CC-BF06-ECB1092121B4@dsl-only.net> X-Mailer: Apple Mail (2.3273) X-BeenThere: freebsd-ppc@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 06 Jun 2017 18:16:01 -0000 On 2017-Jun-6, at 3:46 AM, Mark Millard wrote: > I'm not sure what to make of this. May be someone has > some idea. >=20 > One of the bt's is for "irq53: smudoorbell0" (tid 100055) the other is = for usb. >=20 > spinlock 0xefbbf00 (sched lock 1) held by 0x593d6c0 (tid 100055) too = long > spinlock 0xxf2e1f8 (sleepq chain) help by 0x17606c0 (tid 100049) too = long > timeout stopping cpus >=20 > usb. . . > (I omit any "at kdb_backtrace" lines) >=20 > 0xe56636d0: at vpanic+0x210 > 0xe5663740: at panic+0x54 > 0xe5663790: at _mtx_lock_spin_failed_+0x54 > 0xe56637b0: at _mtx_lock_spin_cookie+0x19c > 0xe5663810: at sched_add+0x140 > 0xe5663850: at sched_wakeup+0xa4 > 0xe5663870: at setrunnable+0x9c > 0xe5663890: at sleepq_resume_thread+0x17c > 0xe56638c0: at sleepq_signal+0xa4 > 0xe56638f0: at cv_signal+0x94 > 0xe5663910: at usbd_do_request_callback+0x80 > 0xe5663930: at usb_request_callback_0x28 > 0xe5663950: at usbd_callback_wrapper+0xa98 > 0xe5663990: at usb_command_wrapper+0x14c > 0xe56639d0: at usb_callback_proc+0x190 > 0xe5663a00: at usb_process+0x1d0 > 0xe5663a50: at fork_exit+0xf4 > 0xe5663a80: at fork_trampoline+0xc >=20 > irq53: smudoorbell0. . . >=20 > 0xe56e4460: at vpanic+0x22c > 0xe56e44d0: at panic+0x54 > 0xe56e4520: at _mtx_lock_spin_failed+0x54 > 0xe56e4540: at thread_lock_flags_0x1bc > 0xe56e45b0: at sleepq_timeout+0x44 > 0xe56e45f0: at softclock_call_cc+0x1bc > 0xe56e4660: at callout_process+0x27c > 0xe56e46c0: at handleevents+0x2ac > 0xe56e4710: at timercb+0x4c4 > 0xe56e4790: at decr_intr+0xf0 > 0xe56e47b0: at powerpc_intrerrupt+0xf4 > 0xe56e47e0: at kernel DECR trap > by _mtx_lock_spin_cookie+0x194 > srr1=3D 0x9032 > r1=3D 0xe56e48a0 > cr=3D 0x20000002 > xer=3D 0 > ctr=3D 0 > 0xe56e48a0: at _mtx_lock+_spin_cookie+0x190 > 0xe56e4900: at sleepq_lock+0xac > 0xe56e4930: at wakeup+0x24 > 0xe56e4950: at smu_doorbell_intr+0x128 > 0xe56e4980: at intr_event_execute_handler+0x220 > 0xe56e49f0: at ithread_loop+0xf0 > 0xe56e4a50: at fork_exit+0xf4 > 0xe56e4a80: at fork_trampoline+0xc >=20 > For reference acttrace also reported > (that looks odd after the first 2 > stack lines): >=20 > powerpd pid 960 tid 100097 td 0x5ba1999 (CPU 2) > 0xd2418510: at uma_zalloc_arg+0x178 > 0xd2418580: at mmu_unmapdev+desc+0x8 > 0x00fcf9dc: at primes+0x68 > 0x00d026b0: at moe64_page_exists_quick+0x1d8 > 0x00ce0ff8: at mmu_unmapdev_desc+0x8 >=20 > (CPU3 had an empty backtrace for pid 11's > tid 100006.) >=20 >=20 > I had another example of sorts but it did not > get far before hanging up: >=20 > spinlock 0xefbbf00 (sched lock 1) held by 0x146d360 (tid 100004) too = long > spinlock 0xefbbf00 (sched lock 1) held by 0x146d360 (tid 100004) too = long >=20 > (Note: Two different colored lines above but the > same otherwise.) >=20 > timeout stopping cpus >=20 > Note: The partial backtrace was: > ("cpuid =3D 2", all in the first color above) >=20 > 0xdf6bb5c0: at vpanic+0x210 > 0xdf6bb630: at panic++0x54 > 0xdf6bb680: at _mtx_lock_spin_failed+0x54 > 0xdf6bb6a0: at _mtx_lock_spin_cookie+0x190 > 0xdf6bb700: at sched_add+0x140 > 0xdf6bb740: at sched_wakeup+0xa4 > 0xdf6bb760: at=20 >=20 > (and that is as far as it got for the backtrace(s)) FYI: I'm currently doing an approximate binary search for localizing part of the panic problem. This is based on the classic panics that are instead from jumping to a non-code area. . . At a given point in my other experiments I was getting: srr0=3D0x90a0f0 etext+0xb8fc Adding (unused) code somewhat before that etext (so increasing etext) got: srr0=3D0x90a0f0 etext+0xb8a8 (The additional code was larger than I now use.) But instead adding some code earlier (by around 0x100000 in this example) got: srr0=3D0x90a110 etext+0xb8fc So comparing to the starting conditions in each case: The bad-address accessed in one case stayed constant but the etext offset decreased: in essence the only thing that happened is etext increased (matching the offset decrease). In the other case the etext offset stayed constant but the bad-address and etext increased by the same amount. Unfortunately this will take a while: A) It typically takes hours for a failure to happen, I've had up to between 7 hours and 8 hours. (I've had minutes-to-failure but it is rare.) B) I'd let it go 24 hours before concluding that a test combination was not likely to produce an obvious form of failure (such as the panic). (I've had (B) in some earlier forms of testing variations of things.) Another time-taker is that I need to be around because I avoid leaving the PowerMac G5 panicked with the fans going fill-speed for long periods. Currently I'm adding code by adding: void HACKISH_EXTRA_CODE(void) {} to one .c file from /usr/src/sys/. . . based which file gets to within a ballpark of a more accurate binary search position. (Large binary search jumps currently: I'm not being picky about where in the .c the addition is made yet.) =3D=3D=3D Mark Millard markmi at dsl-only.net