Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 9 Mar 2017 15:57:06 -0800
From:      Bryan Drewery <bdrewery@FreeBSD.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        current@FreeBSD.org
Subject:   Re: r314708: panic: tdsendsignal: ksi on queue
Message-ID:  <b7068678-e593-f4ed-647c-036ff0bc0576@FreeBSD.org>
In-Reply-To: <5e56d3d6-9e92-1b50-f720-ff16c58b74dd@FreeBSD.org>
References:  <d510a9da-8293-ba22-a1e6-75b3ea7ffa1d@FreeBSD.org> <20170309144646.GB16105@kib.kiev.ua> <20170309231151.GA49720@stack.nl> <5e56d3d6-9e92-1b50-f720-ff16c58b74dd@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
--6oKB8GMj0DjIucTvXVtofbvdNFRn8VAKd
Content-Type: multipart/mixed; boundary="N5VRPxGWQn1N9mKI3n26Sr4waQKxOLkPd";
 protected-headers="v1"
From: Bryan Drewery <bdrewery@FreeBSD.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: current@FreeBSD.org
Message-ID: <b7068678-e593-f4ed-647c-036ff0bc0576@FreeBSD.org>
Subject: Re: r314708: panic: tdsendsignal: ksi on queue
References: <d510a9da-8293-ba22-a1e6-75b3ea7ffa1d@FreeBSD.org>
 <20170309144646.GB16105@kib.kiev.ua> <20170309231151.GA49720@stack.nl>
 <5e56d3d6-9e92-1b50-f720-ff16c58b74dd@FreeBSD.org>
In-Reply-To: <5e56d3d6-9e92-1b50-f720-ff16c58b74dd@FreeBSD.org>

--N5VRPxGWQn1N9mKI3n26Sr4waQKxOLkPd
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

On 3/9/2017 3:47 PM, Bryan Drewery wrote:
> On 3/9/2017 3:11 PM, Jilles Tjoelker wrote:
>> On Thu, Mar 09, 2017 at 04:46:46PM +0200, Konstantin Belousov wrote:
>>> Yes, there is a race, apparently, with the child zombie still not fin=
ishing
>>> sending the SIGCHLD to the parent and parent exiting.  The following =
should
>>> fix the issue, but I do not think that reproducing the problem is eas=
y.
>>
>>> diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c
>>> index c524fe5df37..ba5ff84e9de 100644
>>> --- a/sys/kern/kern_exit.c
>>> +++ b/sys/kern/kern_exit.c
>>> @@ -189,6 +189,7 @@ exit1(struct thread *td, int rval, int signo)
>>>  {
>>>  	struct proc *p, *nq, *q, *t;
>>>  	struct thread *tdt;
>>> +	ksiginfo_t ksi;
>>> =20
>>>  	mtx_assert(&Giant, MA_NOTOWNED);
>>>  	KASSERT(rval =3D=3D 0 || signo =3D=3D 0, ("exit1 rv %d sig %d", rva=
l, signo));
>>> @@ -456,7 +457,12 @@ exit1(struct thread *td, int rval, int signo)
>>>  			proc_reparent(q, q->p_reaper);
>>>  			if (q->p_state =3D=3D PRS_ZOMBIE) {
>>>  				PROC_LOCK(q->p_reaper);
>>> -				pksignal(q->p_reaper, SIGCHLD, q->p_ksi);
>>> +				if (q->p_ksi !=3D NULL) {
>>> +					ksiginfo_init(&ksi);
>>> +					ksiginfo_copy(q->p_ksi, &ksi);
>>> +				}
>>> +				pksignal(q->p_reaper, SIGCHLD, q->p_ksi !=3D
>>> +				    NULL ? &ksi : NULL);
>>>  				PROC_UNLOCK(q->p_reaper);
>>>  			}
>>>  		} else {
>=20
> I just got something weird with this patch that wasn't happening before=
:
>=20
> /usr/bin/time -l src/bin/poudriere -e /usr/local/etc testport -j
> exp-10amd64 -p commit -z test devel/ccache
> [poudriere runs and completes with exit status 0]
>> time: command terminated abnormally                                   =
                     =20
>>        28.08 real         9.92 user        10.38 sys                  =
                     =20
>>      23464  maximum resident set size                                 =
                     =20
>>       4996  average shared memory size                                =
                     =20
>>         88  average unshared data size                                =
                     =20
>>        127  average unshared stack size                               =
                     =20
>>     282705  page reclaims                                             =
                     =20
>>       5623  page faults                                               =
                     =20
>>          0  swaps                                                     =
                     =20
>>       2673  block input operations                                    =
                     =20
>>       4836  block output operations                                   =
                     =20
>>         33  messages sent                                             =
                     =20
>>          0  messages received                                         =
                     =20
>>         37  signals received                                          =
                     =20
>>      11226  voluntary context switches                                =
                     =20
>>        780  involuntary context switches                              =
                     =20
>> zsh: alarm      /usr/bin/time -l src/bin/poudriere -e /usr/local/etc t=
estport -j exp-10amd64
> exit status: 142 (SIGALRM).
>=20
> I don't see time(1) using SIGALRM or proc reaper at all.
>=20
> Rerunning it, and trying other simpler test cases, does not produce the=

> same result.  It may be some race unrelated to this patch, dunno.
>=20

I'm consistently getting foreground processes getting the wrong signals
now. I'm removing this patch for now.

>=20
>>
>> This patch introduces a subtle correctness bug. A real SIGCHLD ksiginf=
o
>> should always be the zombie's p_ksi; otherwise, the siginfo may be los=
t
>> if there are too many signals pending for the target process or in the=

>> system. If the siginfo is lost and the reaper normally passes si_pid t=
o
>> waitpid() or similar (instead of passing WAIT_ANY or P_ALL), a zombie
>> will remain until the reaper terminates.
>>
>> Conceptually the siginfo is sent to one process at a time only, so the=

>> bug is an artifact of the implementation. Perhaps the piece of code
>> added in r309886 can be moved or the ksiginfo can be removed from the
>> parent's queue.
>>
>> If such a fix is not possible, it may be better to send a bare SIGCHLD=

>> (si_code is SI_KERNEL or 0, depending on how many signals are pending)=

>> in this situation and document that reapers must use WAIT_ANY or P_ALL=
=2E
>> (However, compared to the pre-r309886 situation they can still use
>> SIGCHLD to get notified when to call waitpid() or similar.)
>>
>=20
>=20


--=20
Regards,
Bryan Drewery


--N5VRPxGWQn1N9mKI3n26Sr4waQKxOLkPd--

--6oKB8GMj0DjIucTvXVtofbvdNFRn8VAKd
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBAgAGBQJYwevSAAoJEDXXcbtuRpfPOMAH/ihoupaKtVFBnbUb+BNBmeq6
cfYTx2JXxnoK6Hlok/pIUIwrpstUUzIGPfVpBQE8j/JM7tnYkDdsw3KNtdF5BAsu
zx/oIPgv3FH3U0IkT1G+/MXUFXlb4pC+qhM47SJkm/2KfkZGRwgB9EwXVGovDpAj
yHg4IwHfhCrYXXWjr5KSU3aqBkArFPrjG7NtA+yJ83DQjO+GhLvJcCrNwi0FJd5G
fq4F4CwYWCz0klnPv9jU1BsO28P+b/U5iov0mYFG6g1cnYY0bltmrfme5tmnRUrv
KSHGpWPlIQ7qO/v66e18YioGY8BGCITm3W/F90FwFZUrEqpaNxQngVSOEBHYmnc=
=o3Z9
-----END PGP SIGNATURE-----

--6oKB8GMj0DjIucTvXVtofbvdNFRn8VAKd--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?b7068678-e593-f4ed-647c-036ff0bc0576>