Date:      Sun, 23 Jan 2005 01:10:33 GMT
From:      Bruce Evans <bde@zeta.org.au>
To:        freebsd-bugs@FreeBSD.org
Subject:   Re: kern/76525: Subsequent calls to select() a FIFO after a previous FIFO EOF causes calling process to hang
Message-ID:  <200501230110.j0N1AXiH012781@freefall.freebsd.org>

The following reply was made to PR kern/76525; it has been noted by GNATS.

From: Bruce Evans <bde@zeta.org.au>
To: Daniel Fuller / Greg Ward <defuller@lbl.gov>
Cc: freebsd-gnats-submit@freebsd.org, freebsd-bugs@freebsd.org
Subject: Re: kern/76525: Subsequent calls to select() a FIFO after a previous
 FIFO EOF causes calling process to hang
Date: Sun, 23 Jan 2005 12:05:18 +1100 (EST)

 On Fri, 21 Jan 2005, Daniel Fuller / Greg Ward wrote:
 
 > Well, it only took me 8 hours, but I found the problem with the latest version of [Free]BSD.  I tracked down several false leads before I got on the right track -- it's only named FIFO's that seem to exhibit this problem.  I depend on them for the -P and -PP option of rtrace, which is needed for memory sharing as I've set it up in Radiance.  I don't think named FIFO's are used very often, which might explain why this has gone undetected (or at least unfixed).
 
 Please limit line lengths to considerably less than 463 characters.
 
 This is essentially the same bug as the one for poll() that was reported
 recently in PR 76144.  The problem is that some users want poll() and
 select() on a FIFO with no writer (and no data) to block in some contexts,
 and some systems implement this.  FreeBSD used to never block, but was
 changed to always block.  I knew that this might break things and tried
 to limit the breakage, but somehow missed that it broke the most important
 case of a writer going away.  It seems that I only limited the breakage
 for read() but made things worse for select() and poll().
 
 Other OSes apparently have more context (and associated races) so that
 they can handle the EOF from a writer going away differently from EOF
 when there has "never" been a writer.  A hangup flag is obviously
 needed to implement POLLHUP for poll(), but select() cannot report
 hangups properly.  The semantics of "never" are unclear.  I think the
 flag should be per-file so that new opens don't see the hangup/EOF
 condition just because another reader has seen a hangup.  The
 implementation only has a per-device hangup flag, so fixing the bug
 involves more than just making the behaviour depend on that flag.
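 
 The two no-writer contexts described above can be probed from userland.
 The following is only a rough sketch, not code from the PR: the helper
 name probe_fifo_poll(), the result struct and the /tmp path are all
 made up for illustration.  It polls a FIFO reader once before any
 writer has existed and once after the only writer has closed.  What
 each call reports depends on the OS and kernel version, so no
 particular output is claimed.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical result record: one poll() per no-writer context. */
struct fifo_poll_result {
    int   n_never;     /* poll() return value before any writer existed */
    short rev_never;   /* revents from that call */
    int   n_hangup;    /* poll() return value after the only writer closed */
    short rev_hangup;  /* revents from that call */
};

/* Hypothetical helper: returns 0 on success, -1 if setup failed. */
int probe_fifo_poll(struct fifo_poll_result *out)
{
    char dir[] = "/tmp/fifo-poll-XXXXXX";   /* assumed scratch location */
    char path[64];
    struct pollfd pfd;
    int rfd, wfd, rc = -1;

    assert(out != NULL);
    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof(path), "%s/fifo", dir);
    if (mkfifo(path, 0600) == -1)
        goto out_dir;

    /* O_NONBLOCK lets the reader open without waiting for a writer. */
    rfd = open(path, O_RDONLY | O_NONBLOCK);
    if (rfd == -1)
        goto out_fifo;

    /* Context 1: there has "never" been a writer.  100 ms timeout. */
    pfd.fd = rfd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    out->n_never = poll(&pfd, 1, 100);
    out->rev_never = pfd.revents;

    /* Context 2: a writer appears and goes away again (hangup). */
    wfd = open(path, O_WRONLY | O_NONBLOCK);
    if (wfd == -1)
        goto out_rfd;
    close(wfd);

    pfd.revents = 0;
    out->n_hangup = poll(&pfd, 1, 100);
    out->rev_hangup = pfd.revents;
    rc = 0;

out_rfd:
    close(rfd);
out_fifo:
    unlink(path);
out_dir:
    rmdir(dir);
    return rc;
}
```

 A caller would print rev_never and rev_hangup (testing for POLLIN and
 POLLHUP) to see whether the system under test distinguishes the two
 contexts at all.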
 
 History of related bugs, all from wrong setting and wrong use of the
 per-device hangup flag:
 (1) in rev.1.1 of fifo_vnops.c, the flag was set on the wrong half of
     the socket pair at open() time, so everything starting with read()
     was potentially broken for the context where there has never been
     a writer.  read() was plain broken -- it blocked waiting for a
     writer, but must return 0 immediately in the O_NONBLOCK case.
     select() accidentally blocked, which is apparently what is wanted.
     After fixing this:
 (2) in rev.1.1, the flag was cleared on the first successful read().  This
     broke subsequent reads in the case of no writer (nonblocking reads must
     return 0 immediately, but only did so for the first read).
 (3) the above 2 bugs were fixed in rev.1.40 in December 1997.  FIFOs are
     apparently not often used, since no one seemed to notice the presence
     or absence of these bugs.  I only noticed because some POSIX conformance
     test programs reported the bugs.
 (4) bug (2) was reimplemented in rev.1.56 in November 2001 by undoing part
     of 1.40.
 (5) bug (1) was reimplemented in rev.1.57 in November 2001 by deleting the
     initialization of the hangup flag instead of by initializing the flag
     in the wrong place.  Thus after (4) and (5), read() was broken but
     select() sort of worked, as in rev.1.1, and the new poll() syscall
     sort of worked, like select().  POLLHUP has never been implemented
     for FIFOs or sockets, so poll()'s reporting of the hangup condition
     has never worked (see PR 76144).
 (6) read() was unbroken in rev.1.62 in January 2002 by restoring the fixes
     for (1) and (2), but select() and poll() were broken by ignoring the
     flag for them.  For poll(), it is possible to get the old behaviour (3)
     using a new poll flag to request not ignoring the hangup flag.
 (7) 3 years later, some PRs about (6) were filed.  FIFOs are apparently
     still not often used.
 
 > There are two test programs that demonstrate the problem.  The first is called pipe.c, and on OS X, it produces the following (correct) output:
 >
 > pipe available for read
 > Read 4 bytes from pipe: 'TEST'
 > pipe available for read
 > Read 0 bytes from pipe: ''
 
 I believe this stuff is all implemented correctly for nameless pipes,
 but only in old versions of FreeBSD, apparently including the one that
 OS X is based on.  The case of a reader with no writer doesn't occur
 initially and writers don't come back after they are closed, so things
 are simpler.
 
 > Under FreeBSD 5.3, for some reason I get an exception condition on my pipe every time, which is strange but not fatal:
 >
 > Exception on pipe
 > pipe available for read
 > Read 4 bytes from pipe: 'TEST'
 > Exception on pipe
 > pipe available for read
 > Read 0 bytes from pipe: ''
 >
 > On FreeBSD 4.10, I only get an exception at EOF, which I might expect:
 >
 > pipe available for read
 > Read 4 bytes from pipe: 'TEST'
 > Exception on pipe
 > pipe available for read
 > Read 0 bytes from pipe: ''
 
 I think the exception is for hangup.  select() and poll() use the same low-
 level interface.  This interface gives poll() semantics, and select()
 semantics are derived.  After a hangup, poll() always sets POLLHUP and
 doesn't block.  Also, POLLHUP is actually implemented for nameless pipes,
 unlike for named pipes.  After a hangup on a pipe, select() on an
 exception descriptor cannot block so it must report an exception.
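 
 For the nameless pipe case, the hangup reporting can be observed with
 poll() directly.  This is a minimal sketch and not the submitter's
 pipe.c; probe_pipe_hangup() and its result struct are hypothetical
 names.  It writes 4 bytes, closes the write end, and records revents
 both while data is still buffered and after the pipe is drained.
 Whether POLLHUP already appears in the first revents (the 5.3
 behaviour discussed below) is system dependent.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result record for the pipe probe. */
struct pipe_probe {
    short   rev_with_data;  /* revents while "TEST" is still buffered */
    ssize_t first_read;     /* bytes returned by the first read() */
    short   rev_drained;    /* revents after the data has been read */
    ssize_t second_read;    /* bytes returned by the second read() */
};

/* Hypothetical helper: returns 0 on success, -1 if setup failed. */
int probe_pipe_hangup(struct pipe_probe *out)
{
    int fds[2];
    char buf[16];
    struct pollfd pfd;

    assert(out != NULL);
    if (pipe(fds) == -1)
        return -1;
    if (write(fds[1], "TEST", 4) != 4) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);              /* the only writer goes away: hangup */

    pfd.fd = fds[0];
    pfd.events = POLLIN;        /* POLLHUP is reported without being asked for */

    /* First poll: data is pending and the writer is already gone. */
    pfd.revents = 0;
    if (poll(&pfd, 1, 1000) == -1)
        pfd.revents = 0;
    out->rev_with_data = pfd.revents;
    out->first_read = read(fds[0], buf, sizeof(buf));

    /* Second poll: no data left, only the hangup/EOF condition. */
    pfd.revents = 0;
    if (poll(&pfd, 1, 1000) == -1)
        pfd.revents = 0;
    out->rev_drained = pfd.revents;
    out->second_read = read(fds[0], buf, sizeof(buf));

    close(fds[0]);
    return 0;
}
```

 An application that stops reading as soon as it sees POLLHUP would
 lose the 4 buffered bytes on a system that sets POLLHUP in
 rev_with_data, so checking POLLIN first is the safer order.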
 
 The extra exception in 5.3 is a bug.  I've debugged it before in
 connection with piping input to gdb.  5.3 returns POLLHUP from poll()
 (and exceptions for select()) as soon as the writer is closed, despite
 there being data in the pipe.  This breaks applications like gdb which
 stop reading input when they see POLLHUP.  See PR 53447 for more
 details.  PR 53447 is also primarily about select/poll hangup handling.
 Your test shows that 4.10 is somehow missing this bug.
 
 > The real trouble begins with the second FIFO test in fifo.c.  Under OS X, I get the correct output:
 >
 > FIFO available for read
 > Read 4 bytes from FIFO: 'TEST'
 > FIFO available for read
 > Read 0 bytes from FIFO: ''
 >
 > Under FreeBSD 4.10, I get exactly the same output -- even the exception condition is gone:
 >
 > FIFO available for read
 > Read 4 bytes from FIFO: 'TEST'
 > FIFO available for read
 > Read 0 bytes from FIFO: ''
 
 This is because these systems are based on versions of fifo_vnops.c older
 than rev.1.56 (so only POLLHUP for hangup and arguably blocking for no
 writer && no data && no hangup are broken).
 
 > However, under FreeBSD 5.3-STABLE, the poor thing hangs at the EOF, and select(2) never returns:
 >
 > FIFO available for read
 > Read 4 bytes from FIFO: 'TEST'
 > (process hangs in second call to select)
 
 This is because not just POLLHUP for hangup is broken; not blocking for
 hangup is broken too.
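 
 The FIFO case can be reproduced without risking a hung process by
 giving select() a timeout.  The sketch below is an assumption about
 what a test like fifo.c might do, not the submitter's program: it uses
 a single process, a nonblocking reader, and the hypothetical helpers
 wait_readable() and probe_fifo_select().  A second_select value of 0
 (timeout) would correspond to the hang reported for 5.3-STABLE; a
 value of 1 means EOF was reported as readable.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical result record for the FIFO probe. */
struct fifo_select_probe {
    int     first_select;   /* 1 = readable, 0 = timed out, -1 = error */
    ssize_t first_read;     /* expected to be the 4 bytes of "TEST" */
    int     second_select;  /* select() result after the writer is gone */
    ssize_t second_read;    /* nonblocking read() at EOF */
};

/* Wait up to one second for fd to become readable. */
int wait_readable(int fd)
{
    fd_set rset;
    struct timeval tv = { 1, 0 };

    FD_ZERO(&rset);
    FD_SET(fd, &rset);
    return select(fd + 1, &rset, NULL, NULL, &tv);
}

/* Hypothetical helper: returns 0 on success, -1 if setup failed. */
int probe_fifo_select(struct fifo_select_probe *out)
{
    char dir[] = "/tmp/fifo-select-XXXXXX";  /* assumed scratch location */
    char path[64];
    char buf[16];
    int rfd, wfd, rc = -1;

    assert(out != NULL);
    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof(path), "%s/fifo", dir);
    if (mkfifo(path, 0600) == -1)
        goto out_dir;

    rfd = open(path, O_RDONLY | O_NONBLOCK);
    if (rfd == -1)
        goto out_fifo;
    wfd = open(path, O_WRONLY);     /* does not block: a reader exists */
    if (wfd == -1)
        goto out_rfd;
    if (write(wfd, "TEST", 4) != 4) {
        close(wfd);
        goto out_rfd;
    }
    close(wfd);                     /* writer goes away */

    out->first_select = wait_readable(rfd);
    out->first_read = read(rfd, buf, sizeof(buf));

    /* The call under discussion: select() again once only EOF is left. */
    out->second_select = wait_readable(rfd);
    out->second_read = read(rfd, buf, sizeof(buf));
    rc = 0;

out_rfd:
    close(rfd);
out_fifo:
    unlink(path);
out_dir:
    rmdir(dir);
    return rc;
}
```

 The one-second timeout is arbitrary; it only bounds the wait so that
 an affected kernel shows up as second_select == 0 instead of a
 process that never returns.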
 
 > Keep in mind that there should be no difference in the behavior between a named FIFO and a pipe -- the only difference is how they are mechanically connected by the two processes.  Having the select() call hang when an EOF condition exists is not acceptable.
 
 I agree for select(), but this is nonstandard for the EOF that occurs
 when there is no writer && no data && no hangup, and for poll() the
 hangup condition can be reported separately so there is no need to
 overload the EOF condition.  For poll(), the difficulty is clearing
 the hangup condition: if it is cleared on open() of a reader, then
 open() races with clearing the flag and select() after open() may block
 when we don't want it to after losing a race, but if the hangup condition
 is not cleared until all readers and writers are closed, then select()
 after open() may return immediately when we want it to block.
 
 Bruce


