Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 7 Sep 2012 09:35:52 -0700
From:      Garrett Cooper <yanegomi@gmail.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        arch@freebsd.org
Subject:   Re: [PATCH] Close a race between exit1() and SIGSTOP
Message-ID:  <CAGH67wRu91eOmB-cxWOkVDvLPZ-BpLs=GuYhwwbD9s697xpkFA@mail.gmail.com>
In-Reply-To: <201209071217.47439.jhb@freebsd.org>
References:  <201209071217.47439.jhb@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, Sep 7, 2012 at 9:17 AM, John Baldwin <jhb@freebsd.org> wrote:
> We ran into a bug at work recently where processes were reporting an exit
> status from wait(2) of WIFSIGNALED() with WTERMSIG() of SIGSTOP.  I narrowed
> this down to a race in exit1().  Specifically, exit1() sets p->p_xstat to the
> passed in exit status and sets P_WEXIT.  It then drops the PROC_LOCK() to do
> other work such as freeing file descriptors, etc.  Later it reacquires the
> PROC_LOCK(), marks the process as a zombie, and terminates.  During the window
> where the PROC_LOCK() is not held, if a stop signal arrives (SIGSTOP, SIGTSTP,
> etc.), then the signal code will overwrite p->p_xstat with the signal that
> initiated the stop.  The end result is that setting from SIGSTOP stays in
> p->p_xstat and is reported via wait(2).
>
> My first attempt at a fix was to simply ignore all signals once P_WEXIT is
> set by adding a check for P_WEXIT to the existing check for PRS_ZOMBIE.
> However, I think this is not quite right.  For example, if an exiting process
> gets hung on an interruptible NFS mount while doing fdfree() I think we want a
> user to be able to kill the process to let it break out of NFS and finish
> exiting.
>
> The second fix I have is to explicitly clear P_STOPPED_SIGNAL to discard any
> pending SIGSTOP when setting P_WEXIT and to explicitly disallow any stop
> signals for a process that is currently exiting (that is, P_WEXIT is set).
> This fixes the race I ran into while still allowing other signals during
> process exit.  The patch to do that is below.  Below that is a test program
> that reproduces the issue.
>
> I would really like to get some feedback on which approach is best and if my
> concerns about allowing other signals during exit1() is a legitimate concern.
> (For some reason I feel like I've seen an argument for allowing that before.)

Does this case continue to function normally?

$ sh -c 'while :; do sleep 1; done'
^Z
[1]+  Stopped                 sh -c 'while :; do sleep 1; done'
$ kill %1
$ fg
bash: fg: job has terminated
[1]+  Terminated: 15          sh -c 'while :; do sleep 1; done'

$ sh -c 'while :; do sleep 1; done'
^Z
[1]+  Stopped                 sh -c 'while :; do sleep 1; done'
$ kill %1
$ kill %1
bash: kill: (3522) - No such process
[1]+  Terminated: 15          sh -c 'while :; do sleep 1; done'

In particular, a single kill results in the signal being sent after
the process wakes up, and a double kill results in the immediate death
of the process (in this case a shell job) (in part because of the
signals /bin/sh masks).

Thanks!
-Garrett



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAGH67wRu91eOmB-cxWOkVDvLPZ-BpLs=GuYhwwbD9s697xpkFA>