Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 9 Mar 2021 22:51:59 -0800
From:      Mark Millard <marklmi@yahoo.com>
To:        freebsd-arm <freebsd-arm@freebsd.org>, freebsd-current <freebsd-current@freebsd.org>
Cc:        Konstantin Belousov <kib@freebsd.org>
Subject:   Re: FYI: main (bad9fa56620e based): some unexpected SIGSEGV's are tied to interrupted system calls for armv7 poudriere target (cortex-a53/a57/a72 fail, cortex-a7 works(?))
Message-ID:  <8C348A15-D967-46C1-8BEB-1E41A7490E11@yahoo.com>
In-Reply-To: <1C468B92-E53F-4CBC-A6C0-05FEB887F45B@yahoo.com>
References:  <B376F90B-57E8-42C4-93EB-EDABC962C380@yahoo.com> <C3E560D9-F861-4887-A027-952D6DBCC929@yahoo.com> <1C468B92-E53F-4CBC-A6C0-05FEB887F45B@yahoo.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On 2021-Mar-9, at 22:00, Mark Millard <marklmi at yahoo.com> wrote:

[Trying to be timely about reporting new information
because of 13.0 not having much time left.]

> On 2021-Mar-9, at 21:11, Mark Millard <marklmi at yahoo.com> wrote:
>=20
>> On 2021-Mar-9, at 19:17, Mark Millard <marklmi at yahoo.com> wrote:
>>=20
>> [My only testing context for this has been main, not 13.0.
>> But it might be a 13.0 worry.]
>>=20
>>> Using the quickest to so-far-reliably-fail type of example from
>>> another thread I used truss to see what happens, here filtered
>>> down to two processes that appear to be involved and only
>>> near the failure. (The overall truss output is huge from the
>>> prior activity in the poudriere bulk relatated activity). Also,
>>> this initiated watching from aarch64 but the failing code is
>>> armv7.
>>>=20
>>> 83630 100199: #340(0x1,0xffffd18c,0xffffd17c)    =3D 0 (0x0)
>>> 83630 100199: #416(0x14,0xffffd1b4,0xffffd19c)   =3D 0 (0x0)
>>> 83630 100199: #7(0xffffffff,0xffffd178,0x1,0x0)  =3D 0 (0x0)
>>> 83731 100161: #240(0xffffd5f0,0xffffd5f0)        =3D 0 (0x0)
>>> 83731 100161: #1(0x0)                          =20
>>> 83731 100161: process exit, rval =3D 0
>>> 83630 100199: SIGNAL 20 (SIGCHLD) code=3DCLD_EXITED pid=3D83731 =
uid=3D0 status=3D0
>>> 83630 100199: #341(0xffffd17c)                   ERR#4 'Interrupted =
system call'
>>> 83630 100199: SIGNAL 11 (SIGSEGV) code=3DSEGV_MAPERR trapno=3D36 =
addr=3D0xffffffe1
>>> 83630 100199: process killed, signal =3D 11 (core dumped)
>>>=20
>>> As a reminder of the lldb backtrace of the sh.core
>>> and the like:
>>>=20
>>> (lldb) bt
>>> * thread #1, name =3D 'sh', stop reason =3D signal SIGSEGV
>>> * frame #0: 0xffffe190
>>>  frame #1: 0x00031aa8 sh`waitcmdloop(job=3D0x00064230) at =
jobs.c:608:11
>>>  frame #2: 0x00031a24 sh`waitcmd(argc=3D<unavailable>, =
argv=3D<unavailable>) at jobs.c:554:13
>>>  frame #3: 0x00028f54 sh`evalcommand(cmd=3D0x400ad0e4, =
flags=3D<unavailable>, backcmd=3D0x00000000) at eval.c:1107:16
>>>  frame #4: 0x00027800 sh`evaltree(n=3D0x400ad0e4, =
flags=3D<unavailable>) at eval.c:289:4
>>>  frame #5: 0x000344d0 sh`cmdloop(top=3D1) at main.c:221:4
>>>  frame #6: 0x000342f4 sh`main(argc=3D<unavailable>, =
argv=3D<unavailable>) at main.c:168:3
>>>  frame #7: 0x0002480c sh`__start(argc=3D8, argv=3D<unavailable>, =
env=3D<unavailable>, ps_strings=3D<unavailable>, obj=3D0x400b4004, =
cleanup=3D0x40081aa0) at crt1_c.c:92:7
>>> (lldb) up
>>> frame #1: 0x00031aa8 sh`waitcmdloop(job=3D0x00064230) at =
jobs.c:608:11
>>> 605 						break;
>>> 606 				}
>>> 607 			}
>>> -> 608 		} while (dowait(DOWAIT_BLOCK | DOWAIT_SIG, =
(struct job *)NULL) !=3D -1);
>>> 609 =09
>>> 610 		sig =3D pendingsig_waitcmd;
>>> 611 		pendingsig_waitcmd =3D 0;
>>>=20
>>> (lldb) disass
>>> sh`waitcmdloop:
>>>  0x31a54 <+0>:   push   {r4, r5, r6, r7, r8, r9, r10, r11, lr}
>>>  0x31a58 <+4>:   add    r11, sp, #28
>>>  0x31a5c <+8>:   sub    sp, sp, #4
>>>  0x31a60 <+12>:  movw   r6, #0x3ea0
>>>  0x31a64 <+16>:  movw   r7, #0x3e9c
>>>  0x31a68 <+20>:  movw   r9, #0x4040
>>>  0x31a6c <+24>:  movw   r8, #0x3ea4
>>>  0x31a70 <+28>:  mov    r4, r0
>>>  0x31a74 <+32>:  movt   r6, #0x6
>>>  0x31a78 <+36>:  movt   r7, #0x6
>>>  0x31a7c <+40>:  movt   r9, #0x6
>>>  0x31a80 <+44>:  mov    r10, #0
>>>  0x31a84 <+48>:  movt   r8, #0x6
>>>  0x31a88 <+52>:  cmp    r4, #0
>>>  0x31a8c <+56>:  beq    0x31ab4                   ; <+96> at =
jobs.c:590:37
>>>  0x31a90 <+60>:  ldrb   r0, [r4, #0x18]
>>>  0x31a94 <+64>:  cmp    r0, #2
>>>  0x31a98 <+68>:  beq    0x31b84                   ; <+304> [inlined] =
getjobstatus at jobs.c:575
>>>  0x31a9c <+72>:  mov    r0, #3
>>>  0x31aa0 <+76>:  mov    r1, #0
>>>  0x31aa4 <+80>:  bl     0x32bcc                   ; dowait at =
jobs.c:1142
>>> ->  0x31aa8 <+84>:  cmn    r0, #1
>>>=20
>>>=20
>>> For reference a local context around the
>>> SIGSEGV looks like (all lines in the range
>>> selected):
>>>=20
>>> . . .
>>> 83833 102738: fcntl(2,F_DUPFD_CLOEXEC,0xa)       =3D 10 (0xa)
>>> 83833 102738: =
openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) =3D 3 (0x3)
>>> 83833 102738: dup2(3,2)                          =3D 2 (0x2)
>>> 83833 102738: close(3)                           =3D 0 (0x0)
>>> 83833 102738: unlink("./.data.json.SYR1bCaL")    =3D 0 (0x0)
>>> 83833 102738: dup2(10,2)                         =3D 2 (0x2)
>>> 83833 102738: close(10)                          =3D 0 (0x0)
>>> 83833 102738: exit(0x0)                        =20
>>> 83833 102738: process exit, rval =3D 0
>>> 77872 100638: wait4(-1,{ EXITED,val=3D0 },0x0,0x0) =3D 83833 =
(0x14779)
>>> 77872 100638: fcntl(0,F_DUPFD_CLOEXEC,0xa)       =3D 10 (0xa)
>>> 77872 100638: =
openat(AT_FDCWD,"/var/run/poudriere/lock-poudriere-shared-json_top.pid",O_=
RDONLY,00) =3D 3 (0x3)
>>> 77872 100638: dup2(3,0)                          =3D 0 (0x0)
>>> 77872 100638: close(3)                           =3D 0 (0x0)
>>> 77872 100638: fcntl(2,F_DUPFD_CLOEXEC,0xa)       =3D 11 (0xb)
>>> 77872 100638: =
openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) =3D 3 (0x3)
>>> 77872 100638: dup2(3,2)                          =3D 2 (0x2)
>>> 77872 100638: close(3)                           =3D 0 (0x0)
>>> 77872 100638: lseek(0,0x0,SEEK_CUR)              =3D 0 (0x0)
>>> 77872 100638: read(0,"77563",1024)               =3D 5 (0x5)
>>> 77872 100638: read(0,0xffffffffb9e8,1024)        =3D 0 (0x0)
>>> 77872 100638: dup2(10,0)                         =3D 0 (0x0)
>>> 77872 100638: close(10)                          =3D 0 (0x0)
>>> 77872 100638: dup2(11,2)                         =3D 2 (0x2)
>>> 77872 100638: close(11)                          =3D 0 (0x0)
>>> 77872 100638: fcntl(2,F_DUPFD_CLOEXEC,0xa)       =3D 10 (0xa)
>>> 77872 100638: =
openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) =3D 3 (0x3)
>>> 77872 100638: dup2(3,2)                          =3D 2 (0x2)
>>> 77872 100638: close(3)                           =3D 0 (0x0)
>>> 77872 100638: =
rmdir("/var/run/poudriere/lock-poudriere-shared-json_top") =3D 0 (0x0)
>>> 77872 100638: dup2(10,2)                         =3D 2 (0x2)
>>> 77872 100638: close(10)                          =3D 0 (0x0)
>>> 77872 100638: sigprocmask(SIG_SETMASK,{ },0x0)   =3D 0 (0x0)
>>> 77872 100638: fcntl(2,F_DUPFD_CLOEXEC,0xa)       =3D 10 (0xa)
>>> 77872 100638: =
openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) =3D 3 (0x3)
>>> 77872 100638: dup2(3,2)                          =3D 2 (0x2)
>>> 77872 100638: close(3)                           =3D 0 (0x0)
>>> 77872 100638: sigaction(SIGINFO,{ 0x239c30 SA_RESTART ss_t },{ =
SIG_DFL 0x0 ss_t }) =3D 0 (0x0)
>>> 83731 100161: #240(0xffffd5f0,0xffffd5f0)        =3D 0 (0x0)
>>> -- UNKNOWN FreeBSD32 SYSCALL 1 --
>>> 83731 100161: #1(0x0)                          =20
>>> 83731 100161: process exit, rval =3D 0
>>> 83630 100199: SIGNAL 20 (SIGCHLD) code=3DCLD_EXITED pid=3D83731 =
uid=3D0 status=3D0
>>> 83630 100199: #341(0xffffd17c)                   ERR#4 'Interrupted =
system call'
>>> 83630 100199: SIGNAL 11 (SIGSEGV) code=3DSEGV_MAPERR trapno=3D36 =
addr=3D0xffffffe1
>>> 83630 100199: process killed, signal =3D 11 (core dumped)
>>> 83316 100123: #7(0xffffffff,0xffffca58,0x0,0x0)  =3D 83630 (0x146ae)
>>> -- UNKNOWN FreeBSD32 SYSCALL 477 --
>>> 83316 100123: =
#477(0x0,0x7000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) =3D =
1077833728 (0x403e7000)
>>> -- UNKNOWN FreeBSD32 SYSCALL 552 --
>>> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No =
such file or directory'
>>> -- UNKNOWN FreeBSD32 SYSCALL 552 --
>>> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No =
such file or directory'
>>> -- UNKNOWN FreeBSD32 SYSCALL 552 --
>>> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No =
such file or directory'
>>> -- UNKNOWN FreeBSD32 SYSCALL 552 --
>>> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No =
such file or directory'
>>> -- UNKNOWN FreeBSD32 SYSCALL 477 --
>>> 83316 100123: =
#477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) =3D =
1077862400 (0x403ee000)
>>> -- UNKNOWN FreeBSD32 SYSCALL 4 --
>>> 83316 100123: #4(0x2,0x403ee000,0x21)            =3D 33 (0x21)
>>> -- UNKNOWN FreeBSD32 SYSCALL 477 --
>>> 83316 100123: =
#477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) =3D =
1077866496 (0x403ef000)
>>> -- UNKNOWN FreeBSD32 SYSCALL 477 --
>>> 83316 100123: =
#477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) =3D =
1077870592 (0x403f0000)
>>> -- UNKNOWN FreeBSD32 SYSCALL 477 --
>>> 83316 100123: =
#477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) =3D =
1077874688 (0x403f1000)
>>> -- UNKNOWN FreeBSD32 SYSCALL 4 --
>>> 83316 100123: #4(0x1,0x403ef000,0x2e)            =3D 46 (0x2e)
>>> -- UNKNOWN FreeBSD32 SYSCALL 542 --
>>> 83316 100123: #542(0xffffcd54,0x0)               =3D 0 (0x0)
>>> -- UNKNOWN FreeBSD32 SYSCALL 2 --
>>> 83842 100199: <new process>
>>> 83316 100123: #2()                               =3D 83842 (0x14782)
>>> -- UNKNOWN FreeBSD32 SYSCALL 6 --
>>> -- UNKNOWN FreeBSD32 SYSCALL 6 --
>>> 83316 100123: #6(0x7)                            =3D 0 (0x0)
>>> 83842 100199: #6(0x5)                            =3D 0 (0x0)
>>> . . .
>>=20
>> Turns out that the failure happens on the
>> processors with out-of-order execution and
>> the like but works on the strictly in-order
>> cortex-a53. (For as much testing as I've
>> done.)

Further testing has shown the problem on the
cortex-a53 as well, at least when all 5 ports
known to hit the problem(s) are considered.

Out of order execution or the like is not
required; processors with in order execution
can get the problem.

>> So it looks like some form of synchronization
>> is missing that in-order-only does not need.
>> (This would be the 2nd time I've run into such
>> for FreeBSD aarch64 if it holds true. The
>> prior example was fixed a fair time ago.)

Since further testing has shown the problem on
the cortex-a53, the above does not apply.

>> The testing status . . .
>>=20
>>=20
>> Problem replicated using the following contexts
>> to attempt the textproc/itstool build, targeting
>> armv7 (cortex-a7):
>>=20
>> cortex-a72 aarch64 MACHHIATObin Double Shot
>> cortex-a57 aarch64 OverDrive 1000
>>=20
>> (No successful builds for the above 2,
>> all stopping in configure the same way.)

cortex-a53 aarch64 Rock64    (armv7 on aarch64 case)
(also fails, at least sometimes)
(again seems racy)

>> No problem using the following to build
>> textproc/itstool, targeting armv7:
>>=20
>> cortex-a53 aarch64 Rock64    (armv7 on aarch64 case)

Wrong above: cortex-a53 does (sometimes) get the
problem(s), at least when all 5 examples of failing
ports are considered.

>> cortext-a7 armv7   OrangePi+ 2ed (native armv7 case)

It likely will be some time before I'll have
significantly more evidence for cortex-a7 native
builds: slower builds.

>> It will take a long time to run a full
>> poudriere bulk that will build about
>> 200 ports, targeting the cortex-a7 on
>> the slower cortex-a53: days. So
>> further evidence that the cortex-a53
>> does not get the problem will take a
>> while.

I lucked out on the order of port
builds and getting early failures
on further cortex-a53 testing. So
reporting that failures happen did
not take all that long.

cortex-a7 native-testing takes much
more time to build things. If no port
builds fail, the around 200 ports
might take a week or more to build.

> I have confirmed that I still get the problem on
> the cortex-a72 when I substitute the kernel.txz
> and kernel-dbg.txz content from:
>=20
> =
https://artifact.ci.freebsd.org/snapshot/main/bad9fa56620eb82395c5ab66d300=
e91a0222dde2/arm64/aarch64/
>=20
> and boot with it. ( My non-debug build is also of
> bad9fa56 .) This avoids worries about my kernel
> build being involved.
>=20
> The debug kernel did not report anything special
> while the port was building. Earlier it did report
> the classic:
>=20
> [00:00:06] Mounting system devices for FBSDFSSDjailArmV7-default
> lock order reversal:
> 1st 0xffffa0017b9915b0 ufs (ufs, lockmgr) @ =
/usr/src/sys/kern/vfs_mount.c:1071
> 2nd 0xffffa0017bdb8070 devfs (devfs, lockmgr) @ =
/usr/src/sys/kern/vfs_mount.c:1083
> lock order devfs -> ufs established at:
> . . .
>=20
> I have not tried substituting base.txz and
> base-dbg.txz content.




=3D=3D=3D
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?8C348A15-D967-46C1-8BEB-1E41A7490E11>