From owner-freebsd-arm@freebsd.org Wed Feb 15 01:08:31 2017
Subject: Re: A potential fix for arm64's: sh`forkshell child-process path after fork sometimes has a bad stack pointer value
From: Mark Millard
In-Reply-To: <6EED2BFF-CAFB-4F58-8D0D-8E060319278C@dsl-only.net>
Date: Tue, 14 Feb 2017 17:08:27 -0800
Cc: freebsd-arm
Message-Id: <142FC38B-48F6-4456-8CD1-D180EDB6A73C@dsl-only.net>
References: <2D04FF37-DEC8-42CE-961D-AE8CD58A0EAA@dsl-only.net> <93064627-5F72-4167-90B1-0A98ABF4C99C@dsl-only.net>
 <3BC697B9-4A3E-49FF-AB11-1106E2EF8399@dsl-only.net> <20170214165644.15dedf6e@zapp> <6EED2BFF-CAFB-4F58-8D0D-8E060319278C@dsl-only.net>
To: Andrew Turner
List-Id: "Porting FreeBSD to ARM processors."

On 2017-Feb-14, at 9:17 AM, Mark Millard wrote:

> On 2017-Feb-14, at 8:56 AM, Andrew Turner wrote:
>
>> On Tue, 14 Feb 2017 08:35:54 -0800
>> Mark Millard wrote:
>>
>>> The following change has let my test run for 8.5 hours so far without
>>> a fork-failure in sh`forkshell :
>>>
>>> # svnlite diff /usr/src/sys/arm64/arm64/swtch.S
>>> Index: /usr/src/sys/arm64/arm64/swtch.S
>>> ===================================================================
>>> --- /usr/src/sys/arm64/arm64/swtch.S    (revision 312982)
>>> +++ /usr/src/sys/arm64/arm64/swtch.S    (working copy)
>>> @@ -241,6 +241,12 @@
>>>         mov     fp, #0          /* Stack traceback stops here. */
>>>         bl      _C_LABEL(fork_exit)
>>>
>>> +       /*
>>> +        * Disable interrupts to avoid
>>> +        * overwriting sp_el0 and spsr_el1 by an IRQ exception.
>>> +        */
>>> +       msr     daifset, #2
>>> +
>>>         /* Restore sp and lr */
>>>         ldp     x0, x1, [sp]
>>>         msr     sp_el0, x0
>>> @@ -263,12 +269,6 @@
>>>         ldp     x28, x29, [sp, #TF_X + 28 * 8]
>>>         /* Skip x30 as it was restored above as lr */
>>>
>>> -       /*
>>> -        * Disable interrupts to avoid
>>> -        * overwriting spsr_el1 by an IRQ exception.
>>> -        */
>>> -       msr     daifset, #2
>>> -
>>>         /* Restore elr and spsr */
>>>         ldp     x0, x1, [sp, #16]
>>>         msr     elr_el1, x0
>>>
>>> I'm going to switch to attempting a self-hosted buildworld
>>> buildkernel again.
>>
>> Can you try the patch in https://reviews.freebsd.org/D9593.
>> It moves loading of sp_el0 until after interrupts have been disabled.
>>
>> Andrew
>
> Sure. I'll stop the self-hosted buildworld buildkernel and
> switch over to your source.
>
> One minor point:
>
> /* Skip x30 as it was restored above as lr */
>
> now should say something like:
>
> /* Skip x30 as it is restored below as lr */

As reported on https://reviews.freebsd.org/D9593 the buildworld
buildkernel test stopped in buildworld with two sh processes failing.
But the core files do not suggest stack corruption to me, nor was
fork active. My test code recorded its before-fork and after-fork
stack addresses and they were equal, as they should be.

It appeared that simply restarting the buildworld buildkernel would
let it continue, so I restarted it. It has in fact continued and is
still building.

I see no reason to count the stoppage against the change. I'll say
so in new comments on https://reviews.freebsd.org/D9593 once the
build completes or fails and I report on that.

Failure details (both cores are basically the same for these details):

(lldb) up
frame #9: 0x000000004054c82c libc.so.7`ifree(tsd=, ptr=, tcache=, slow_path=) + 304 at jemalloc_jemalloc.c:1889
   1886                         usize = isalloc(tsd_tsdn(tsd), ptr, config_prof);
   1887                 prof_free(tsd, ptr, usize);
   1888         } else if (config_stats || config_valgrind)
-> 1889                 usize = isalloc(tsd_tsdn(tsd), ptr, config_prof);
   1890         if (config_stats)
   1891                 *tsd_thread_deallocatedp_get(tsd) += usize;
   1892
(lldb) print config_stats
(const bool) $0 = true
(lldb) print config_valgrind
(const bool) $1 = false

So the new failure was actually during config_stats activity, which
is apparently enabled by default for how I built -r312982.

The actual abort initiation was from:

(lldb) up
frame #3: 0x00000000405340fc libc.so.7`huge_node_get [inlined] __je_rtree_get(dependent=true) + 308 at rtree.h:328
   325                  RTREE_GET_LEAF(RTREE_HEIGHT_MAX-1)
   326          #undef RTREE_GET_SUBTREE
   327          #undef RTREE_GET_LEAF
-> 328                  default: not_reached();
   329                  }
   330          #undef RTREE_GET_BIAS
   331                  not_reached();

The back traces look similar to this one of the pair:

(lldb) bt
* thread #1: tid = 100137, 0x0000000040554e54 libc.so.7`_thr_kill + 8, name = 'sh', stop reason = signal SIGABRT
  * frame #0: 0x0000000040554e54 libc.so.7`_thr_kill + 8
    frame #1: 0x0000000040554e18 libc.so.7`__raise(s=6) + 64 at raise.c:52
    frame #2: 0x0000000040554d8c libc.so.7`abort + 84 at abort.c:65
    frame #3: 0x00000000405340fc libc.so.7`huge_node_get [inlined] __je_rtree_get(dependent=true) + 308 at rtree.h:328
    frame #4: 0x00000000405340dc libc.so.7`huge_node_get [inlined] __je_chunk_lookup(dependent=true) at chunk.h:89
    frame #5: 0x00000000405340dc libc.so.7`huge_node_get(ptr=) + 276 at jemalloc_huge.c:11
    frame #6: 0x0000000040534114 libc.so.7`__je_huge_salloc(tsdn=, ptr=) + 24 at jemalloc_huge.c:434
    frame #7: 0x000000004054c84c libc.so.7`ifree [inlined] __je_arena_salloc(demote=false) + 32 at arena.h:1426
    frame #8: 0x000000004054c82c libc.so.7`ifree [inlined] __je_isalloc(demote=false) at jemalloc_internal.h:1045
    frame #9: 0x000000004054c82c libc.so.7`ifree(tsd=, ptr=, tcache=, slow_path=) + 304 at jemalloc_jemalloc.c:1889
    frame #10: 0x000000004054cd94 libc.so.7`__free(ptr=0x0000000040a17520) + 148 at jemalloc_jemalloc.c:2016
    frame #11: 0x0000000000411328 sh`ckfree(p=) + 32 at memalloc.c:88
    frame #12: 0x0000000000407cd8 sh`clearcmdentry + 76 at exec.c:505
    frame #13: 0x0000000000406bfc sh`evalcommand(cmd=, flags=, backcmd=) + 3476 at eval.c:1182
    frame #14: 0x0000000000405570 sh`evaltree(n=0x0000000040a1c270, flags=) + 212 at eval.c:290
    frame #15: 0x000000000041105c sh`cmdloop(top=) + 252 at main.c:231
    frame #16: 0x0000000000410ed0 sh`main(argc=, argv=) + 660 at main.c:178
    frame #17: 0x0000000000402f30 sh`__start + 360
    frame #18: 0x0000000040434658 ld-elf.so.1`.rtld_start + 24 at rtld_start.S:41

===
Mark Millard
markmi at dsl-only.net