Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 7 Sep 2012 14:18:58 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        arch@freebsd.org
Subject:   Re: [PATCH] Close a race between exit1() and SIGSTOP
Message-ID:  <201209071418.58576.jhb@freebsd.org>
In-Reply-To: <20120907165210.GC33100@deviant.kiev.zoral.com.ua>
References:  <201209071217.47439.jhb@freebsd.org> <20120907165210.GC33100@deviant.kiev.zoral.com.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Friday, September 07, 2012 12:52:10 pm Konstantin Belousov wrote:
> On Fri, Sep 07, 2012 at 12:17:47PM -0400, John Baldwin wrote:
> > We ran into a bug at work recently where processes were reporting an
> > exit status from wait(2) of WIFSIGNALED() with WTERMSIG() of SIGSTOP.
> > I narrowed this down to a race in exit1(). Specifically, exit1()
> > sets p->p_xstat to the passed in exit status and sets P_WEXIT. It
> > then drops the PROC_LOCK() to do other work such as freeing file
> > descriptors, etc. Later it reacquires the PROC_LOCK(), marks the
> > process as a zombie, and terminates. During the window where the
> > PROC_LOCK() is not held, if a stop signal arrives (SIGSTOP, SIGTSTP,
> > etc.), then the signal code will overwrite p->p_xstat with the signal
> > that initiated the stop. The end result is that setting from SIGSTOP
> > stays in p->p_xstat and is reported via wait(2).
> >
> > My first attempt at a fix was to simply ignore all signals once
> > P_WEXIT is set by adding a check for P_WEXIT to the existing check for
> > PRS_ZOMBIE. However, I think this is not quite right. For example, if
> > an exiting process gets hung on an interruptible NFS mount while doing
> > fdfree() I think we want a user to be able to kill the process to let
> > it break out of NFS and finish exiting.
> >
> > The second fix I have is to explicitly clear P_STOPPED_SIGNAL to
> > discard any pending SIGSTOP when setting P_WEXIT and to explicitly
> > disallow any stop signals for a process that is currently exiting
> > (that is, P_WEXIT is set). This fixes the race I ran into while still
> > allowing other signals during process exit. The patch to do that is
> > below. Below that is a test program that reproduces the issue.
> >
> > I would really like to get some feedback on which approach is best
> > and if my concerns about allowing other signals during exit1() is a
> > legitimate concern. (For some reason I feel like I've seen an argument
> > for allowing that before.)
> >
> 
> Argument ? Please see r163541. I hope you agree with the author, who
> basically provided the same reasoning about interruptible NFS mounts.

I was using "argument" as in definition 4 here:

http://dictionary.reference.com/browse/argument

rather than definitions 1 or 2.

As I said, I thought I recalled there being a good reason for allowing this. 
:)

> So I agree with r163541, which log message says that I requested it.
> And I do not see any other possible solution except case 2. Shouldn't
> the fix actually prevent any overwrite of p_xstat for P_WEXIT process
> from the signal delivery code ?

The solution does do that AFAICT.  SIGCONT should be a nop since the 
process should never be in a stopped state once P_WEXIT is set, and
the patch explicitly turns SIGTOP into a nop.

I added assertions to make sure that no signal is delivered to a
stopped process while P_WEXIT is set as well as in the other places
that overwrite p_xstat but that I think are unreachable (e.g. in
ptracestop()).

-- 
John Baldwin



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201209071418.58576.jhb>