Date:      Tue, 14 Feb 2017 17:08:27 -0800
From:      Mark Millard <markmi@dsl-only.net>
To:        Andrew Turner <andrew@fubar.geek.nz>
Cc:        freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: A potential fix for arm64's: sh`forkshell child-process path after fork sometimes has a bad stack pointer value
Message-ID:  <142FC38B-48F6-4456-8CD1-D180EDB6A73C@dsl-only.net>
In-Reply-To: <6EED2BFF-CAFB-4F58-8D0D-8E060319278C@dsl-only.net>
References:  <DC3CC3BE-9D8C-41ED-ADD0-AFD4019B2E90@dsl-only.net> <2D04FF37-DEC8-42CE-961D-AE8CD58A0EAA@dsl-only.net> <93064627-5F72-4167-90B1-0A98ABF4C99C@dsl-only.net> <3BC697B9-4A3E-49FF-AB11-1106E2EF8399@dsl-only.net> <20170214165644.15dedf6e@zapp> <6EED2BFF-CAFB-4F58-8D0D-8E060319278C@dsl-only.net>

On 2017-Feb-14, at 9:17 AM, Mark Millard <markmi@dsl-only.net> wrote:

> On 2017-Feb-14, at 8:56 AM, Andrew Turner <andrew at fubar.geek.nz> wrote:
>
> On Tue, 14 Feb 2017 08:35:54 -0800
>> Mark Millard <markmi at dsl-only.net> wrote:
>>
>>> The following change has let my test run for 8.5 hours so far without
>>> a fork-failure in sh`forkshell :
>>>
>>> # svnlite diff /usr/src/sys/arm64/arm64/swtch.S
>>> Index: /usr/src/sys/arm64/arm64/swtch.S
>>> ===================================================================
>>> --- /usr/src/sys/arm64/arm64/swtch.S    (revision 312982)
>>> +++ /usr/src/sys/arm64/arm64/swtch.S    (working copy)
>>> @@ -241,6 +241,12 @@
>>>       mov     fp, #0  /* Stack traceback stops here. */
>>>       bl      _C_LABEL(fork_exit)
>>>
>>> +       /*
>>> +        * Disable interrupts to avoid
>>> +        * overwriting sp_el0 and spsr_el1 by an IRQ exception.
>>> +        */
>>> +       msr     daifset, #2
>>> +
>>>       /* Restore sp and lr */
>>>       ldp     x0, x1, [sp]
>>>       msr     sp_el0, x0
>>> @@ -263,12 +269,6 @@
>>>       ldp     x28, x29, [sp, #TF_X + 28 * 8]
>>>       /* Skip x30 as it was restored above as lr */
>>>
>>> -       /*
>>> -        * Disable interrupts to avoid
>>> -        * overwriting spsr_el1 by an IRQ exception.
>>> -        */
>>> -       msr     daifset, #2
>>> -
>>>       /* Restore elr and spsr */
>>>       ldp     x0, x1, [sp, #16]
>>>       msr     elr_el1, x0
>>>
>>> I'm going to switch to attempting a self-hosted buildworld
>>> buildkernel again.
>>
>> Can you try the patch in https://reviews.freebsd.org/D9593. It moves
>> loading of sp_el0 until after interrupts have been disabled.
>>
>> Andrew
>
> Sure. I'll stop the self-hosted buildworld buildkernel and
> switch over to your source.
>
> One minor point:
>
> /* Skip x30 as it was restored above as lr */
>
> now should say something like:
>
> /* Skip x30 as it is restored below as lr */

As reported on https://reviews.freebsd.org/D9593 the
buildworld buildkernel test stopped in buildworld
with two sh processes failing.

But the core files do not suggest a stack corruption
to me, nor was fork active. My test code
recorded its before and after fork stack address
examples and they were equal as they should be.
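
The before-and-after comparison described above can be sketched roughly
as follows. This is not the test code used in this thread, only a
minimal illustration: current_stack_address and
stack_address_survives_fork are hypothetical names, and sampling the
address of a local in a non-inlined helper is just one assumed way to
observe where the stack currently is.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the address of one of its own locals,
 * which is derived from the stack pointer at the time of the call.
 * __attribute__((noinline)) is a GCC/Clang extension (an assumption
 * about the compiler); it keeps the address from being computed once
 * and cached in the caller.
 */
__attribute__((noinline)) static uintptr_t
current_stack_address(void)
{
	volatile char probe = 0;

	return ((uintptr_t)&probe);
}

/*
 * Hypothetical check: sample the stack address before fork(), sample
 * it again in the child, and report whether the two match.
 * Returns 1 if equal, 0 if different, -1 on fork/wait errors.
 */
int
stack_address_survives_fork(void)
{
	uintptr_t before, after;
	pid_t pid;
	int status;

	before = current_stack_address();
	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0) {
		/* Child: same frame, so the sample should be unchanged. */
		after = current_stack_address();
		_exit(after == before ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
		return (-1);
	return (WEXITSTATUS(status) == 0 ? 1 : 0);
}
```

Calling stack_address_survives_fork() repeatedly from a small main(),
and treating any result other than 1 as a failure, would approximate a
long-running check of this kind.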

It appeared that simply restarting the buildworld
buildkernel would let it continue, so I restarted it.
It has in fact continued and is still building.

I see no reason to take the stoppage as something
to count against the change. And I'll say so in
new comments in https://reviews.freebsd.org/D9593
once the build completes or fails and I report on
that.



Failure details (both cores are basically the same
for these details):

(lldb) up
frame #9: 0x000000004054c82c libc.so.7`ifree(tsd=<unavailable>, ptr=<unavailable>, tcache=<unavailable>, slow_path=<unavailable>) + 304 at jemalloc_jemalloc.c:1889
   1886			usize = isalloc(tsd_tsdn(tsd), ptr, config_prof);
   1887			prof_free(tsd, ptr, usize);
   1888		} else if (config_stats || config_valgrind)
-> 1889			usize = isalloc(tsd_tsdn(tsd), ptr, config_prof);
   1890		if (config_stats)
   1891			*tsd_thread_deallocatedp_get(tsd) += usize;
   1892

(lldb) print config_stats
(const bool) $0 = true

(lldb) print config_valgrind
(const bool) $1 = false

So the new failure was actually during config_stats activity,
which is apparently enabled by default for how I built
-r312982 .
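
The branch reasoning can be made concrete with a small sketch. It only
models the control flow visible in the listing at lines 1886-1889 and
is not jemalloc's code: ifree_size_path and the enum are hypothetical
names, and the prof_enabled guard on the first branch is an assumption,
since the listing starts at line 1886 and does not show that condition.

```c
#include <stdbool.h>

/* Hypothetical labels for the ways ifree() could compute usize. */
enum size_path {
	PATH_PROF,		/* lines 1886-1887 in the listing */
	PATH_STATS_OR_VALGRIND,	/* line 1889 in the listing */
	PATH_NONE		/* neither branch taken */
};

/*
 * Hypothetical model of the if / else-if chain: given the flags,
 * report which isalloc() call site would be reached.
 */
enum size_path
ifree_size_path(bool prof_enabled, bool config_stats, bool config_valgrind)
{
	if (prof_enabled)
		return (PATH_PROF);
	else if (config_stats || config_valgrind)
		return (PATH_STATS_OR_VALGRIND);
	return (PATH_NONE);
}
```

With config_stats true and config_valgrind false, as printed by lldb,
and the first branch not taken (the stop is at line 1889), the sketch
selects PATH_STATS_OR_VALGRIND. That matches the isalloc() call being
reached only because of config_stats.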

The actual abort initiation was from:

(lldb) up
frame #3: 0x00000000405340fc libc.so.7`huge_node_get [inlined] __je_rtree_get(dependent=true) + 308 at rtree.h:328
   325 		RTREE_GET_LEAF(RTREE_HEIGHT_MAX-1)
   326 	#undef RTREE_GET_SUBTREE
   327 	#undef RTREE_GET_LEAF
-> 328 		default: not_reached();
   329 		}
   330 	#undef RTREE_GET_BIAS
   331 		not_reached();

The back traces look similar to this one of the pair:

(lldb) bt
* thread #1: tid = 100137, 0x0000000040554e54 libc.so.7`_thr_kill + 8, name = 'sh', stop reason = signal SIGABRT
  * frame #0: 0x0000000040554e54 libc.so.7`_thr_kill + 8
    frame #1: 0x0000000040554e18 libc.so.7`__raise(s=6) + 64 at raise.c:52
    frame #2: 0x0000000040554d8c libc.so.7`abort + 84 at abort.c:65
    frame #3: 0x00000000405340fc libc.so.7`huge_node_get [inlined] __je_rtree_get(dependent=true) + 308 at rtree.h:328
    frame #4: 0x00000000405340dc libc.so.7`huge_node_get [inlined] __je_chunk_lookup(dependent=true) at chunk.h:89
    frame #5: 0x00000000405340dc libc.so.7`huge_node_get(ptr=<unavailable>) + 276 at jemalloc_huge.c:11
    frame #6: 0x0000000040534114 libc.so.7`__je_huge_salloc(tsdn=<unavailable>, ptr=<unavailable>) + 24 at jemalloc_huge.c:434
    frame #7: 0x000000004054c84c libc.so.7`ifree [inlined] __je_arena_salloc(demote=false) + 32 at arena.h:1426
    frame #8: 0x000000004054c82c libc.so.7`ifree [inlined] __je_isalloc(demote=false) at jemalloc_internal.h:1045
    frame #9: 0x000000004054c82c libc.so.7`ifree(tsd=<unavailable>, ptr=<unavailable>, tcache=<unavailable>, slow_path=<unavailable>) + 304 at jemalloc_jemalloc.c:1889
    frame #10: 0x000000004054cd94 libc.so.7`__free(ptr=0x0000000040a17520) + 148 at jemalloc_jemalloc.c:2016
    frame #11: 0x0000000000411328 sh`ckfree(p=<unavailable>) + 32 at memalloc.c:88
    frame #12: 0x0000000000407cd8 sh`clearcmdentry + 76 at exec.c:505
    frame #13: 0x0000000000406bfc sh`evalcommand(cmd=<unavailable>, flags=<unavailable>, backcmd=<unavailable>) + 3476 at eval.c:1182
    frame #14: 0x0000000000405570 sh`evaltree(n=0x0000000040a1c270, flags=<unavailable>) + 212 at eval.c:290
    frame #15: 0x000000000041105c sh`cmdloop(top=<unavailable>) + 252 at main.c:231
    frame #16: 0x0000000000410ed0 sh`main(argc=<unavailable>, argv=<unavailable>) + 660 at main.c:178
    frame #17: 0x0000000000402f30 sh`__start + 360
    frame #18: 0x0000000040434658 ld-elf.so.1`.rtld_start + 24 at rtld_start.S:41



===
Mark Millard
markmi at dsl-only.net



