Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 26 Mar 2006 13:00:36 GMT
From:      Bruce Evans <bde@zeta.org.au>
To:        freebsd-bugs@FreeBSD.org
Subject:   Re: kern/94772: FIFOs (named pipes) + select() == broken
Message-ID:  <200603261300.k2QD0a2c067502@freefall.freebsd.org>

next in thread | raw e-mail | index | archive | help
The following reply was made to PR kern/94772; it has been noted by GNATS.

From: Bruce Evans <bde@zeta.org.au>
To: Oliver Fromme <olli@lurza.secnetix.de>
Cc: bug-followup@freebsd.org
Subject: Re: kern/94772: FIFOs (named pipes) + select() == broken
Date: Sun, 26 Mar 2006 23:57:15 +1100 (EST)

 On Thu, 23 Mar 2006, Oliver Fromme wrote:
 
 > OK, here are new patches.  I wrote and tested them on
 > RELENG_6, but they should apply to HEAD as well, AFAICT.
 >
 > With these patches, all of the test programs pass with
 > success (no output), i.e. the select test and the poll
 > test.  My own test program from the beginning of this
 > PR passes without problems, too.
 >
 > --- ./fs/fifofs/fifo_vnops.c.orig	Tue Mar 21 09:42:32 2006
 > +++ ./fs/fifofs/fifo_vnops.c	Thu Mar 23 19:57:21 2006
 > @@ -231,6 +231,12 @@
 > 				wakeup(&fip->fi_writers);
 > 				sowwakeup(fip->fi_writesock);
 > 			}
 > +			else if (ap->a_mode & O_NONBLOCK) {
 > +				SOCKBUF_LOCK(&fip->fi_readsock->so_rcv);
 > +				fip->fi_readsock->so_rcv.sb_state |=
 > +				    SBS_EOFNOHUP;
 > +				SOCKBUF_UNLOCK(&fip->fi_readsock->so_rcv);
 > +			}
 > 		}
 > 	}
 > 	if (ap->a_mode & FWRITE) {
 
 This flags seems to be necessary (I couldn't recover it from the SBS_CANT*
 clags like I hoped to).
 
 The flag needs to be set even in the !O_NONBLOCK case since otherwise it
 doesn't get set if the first reader is !O_NONBLOCK and a later reader
 (before any writers) is O_NONBLOCK.
 
 Clearing of this flag seems to be missing in cases where all readers
 go away before any writers appear.
 
 I cnaged the sense of the flag and renamed it to SBS_COULDRCV.
 
 > @@ -241,7 +247,8 @@
 > 		fip->fi_writers++;
 > 		if (fip->fi_writers == 1) {
 > 			SOCKBUF_LOCK(&fip->fi_readsock->so_rcv);
 > -			fip->fi_readsock->so_rcv.sb_state &= ~SBS_CANTRCVMORE;
 > +			fip->fi_readsock->so_rcv.sb_state &=
 > +			    ~(SBS_CANTRCVMORE | SBS_EOFNOHUP);
 > 			SOCKBUF_UNLOCK(&fip->fi_readsock->so_rcv);
 > 			if (fip->fi_readers > 0) {
 > 				wakeup(&fip->fi_readers);
 
 OK.
 
 > @@ -661,37 +668,23 @@
 > 	int levents, revents = 0;
 >
 > 	fip = fp->f_data;
 > -	levents = events &
 > -	    (POLLIN | POLLINIGNEOF | POLLPRI | POLLRDNORM | POLLRDBAND);
 > +	levents = events & (POLLIN | POLLPRI | POLLRDNORM | POLLRDBAND);
 > 	if ((fp->f_flag & FREAD) && levents) {
 > -		/*
 > -		 * If POLLIN or POLLRDNORM is requested and POLLINIGNEOF is
 > -		 * not, then convert the first two to the last one.  This
 > -		 * tells the socket poll function to ignore EOF so that we
 > -		 * block if there is no writer (and no data).  Callers can
 > -		 * set POLLINIGNEOF to get non-blocking behavior.
 > -		 */
 > -		if (levents & (POLLIN | POLLRDNORM) &&
 > -		    !(levents & POLLINIGNEOF)) {
 > -			levents &= ~(POLLIN | POLLRDNORM);
 > -			levents |= POLLINIGNEOF;
 > -		}
 > -
 > 		filetmp.f_data = fip->fi_readsock;
 > 		filetmp.f_cred = cred;
 > 		revents |= soo_poll(&filetmp, levents, cred, td);
 > -
 > -		/* Reverse the above conversion. */
 > -		if ((revents & POLLINIGNEOF) && !(events & POLLINIGNEOF)) {
 > -			revents |= (events & (POLLIN | POLLRDNORM));
 > -			revents &= ~POLLINIGNEOF;
 > -		}
 > 	}
 > 	levents = events & (POLLOUT | POLLWRNORM | POLLWRBAND);
 > 	if ((fp->f_flag & FWRITE) && levents) {
 > 		filetmp.f_data = fip->fi_writesock;
 > 		filetmp.f_cred = cred;
 > 		revents |= soo_poll(&filetmp, levents, cred, td);
 > +	}
 
 OK.
 
 > +	if (revents & POLLHUP) {
 > +		SOCKBUF_LOCK(&fip->fi_readsock->so_rcv);
 > +		if (fip->fi_readsock->so_rcv.sb_state & SBS_EOFNOHUP)
 > +			revents = (revents & ~POLLHUP) | POLLHUPIGNEOF;
 > +		SOCKBUF_UNLOCK(&fip->fi_readsock->so_rcv);
 > 	}
 > 	return (revents);
 > }
 
 I think the locking here isn't useful (locking for reading a flag generally
 isn't since there is little difference if you lose a race before, during
 or after reading the flag.  Locking the whole function might be needed.
 There is now locking for the whole of sopoll().  In the old version of
 -current that I use there isn't any locking at all there.
 
 The above became more complicated when I set POLLIN together with POLLHUP
 in sopoll().  Then the above needed to clear POLLIN together with POLLHUP
 but only if there is no data.
 
 Clearing flags in callers is ugly and causes the following bug: sopoll()
 has decided not to call selrecord() if revents != 0; the above mainains
 revents != 0 but pollscan() may clear the final flag POLLHUPIGNEOF out
 of revents and then we sleep without having called selrecord().  I didn't
 notice anything breaking from this.  The zero timeout in the regression
 tests prevents problems there and probably nothing else in my system has
 used a fifo lately.
 
 I think I fixed these problems by moving all the decisions to sopoll().
 pollscan() sets a flag POLLPOLL and fifofs maintains SBS_COULDRCV so
 that sopoll() can decide.
 
 Now I wonder about sleeping with or without selrecord() for silly
 combinations of poll flags.  I think events == 0 causes a sleep that
 is not terminated by hangup since sopoll() is not called so it doesn't
 even get a chance to set POLLHUP.  I think this case should cause
 a sleep that is terminated by hangup (but nothing else except a timeout
 or signal).
 
 > --- ./kern/uipc_socket.c.orig	Wed Dec 28 19:05:13 2005
 > +++ ./kern/uipc_socket.c	Thu Mar 23 22:50:33 2006
 > @@ -2033,16 +2033,15 @@
 > 	SOCKBUF_LOCK(&so->so_snd);
 > 	SOCKBUF_LOCK(&so->so_rcv);
 
 The version that you use has the locking here too.
 
 > 	if (events & (POLLIN | POLLRDNORM))
 > -		if (soreadable(so))
 > -			revents |= events & (POLLIN | POLLRDNORM);
 > -
 > -	if (events & POLLINIGNEOF)
 > 		if (so->so_rcv.sb_cc >= so->so_rcv.sb_lowat ||
 > 		    !TAILQ_EMPTY(&so->so_comp) || so->so_error)
 > -			revents |= POLLINIGNEOF;
 > +			revents |= events & (POLLIN | POLLRDNORM);
 >
 > -	if (events & (POLLOUT | POLLWRNORM))
 > -		if (sowriteable(so))
 > +	if ((so->so_rcv.sb_state & SBS_CANTRCVMORE) ||
 > +	    (so->so_snd.sb_state & SBS_CANTSENDMORE))
 > +		revents |= POLLHUP;
 > +	else
 > +		if (events & (POLLOUT | POLLWRNORM) && sowriteable(so))
 > 			revents |= events & (POLLOUT | POLLWRNORM);
 >
 > 	if (events & (POLLPRI | POLLRDBAND))
 > @@ -2050,9 +2049,7 @@
 > 			revents |= events & (POLLPRI | POLLRDBAND);
 >
 > 	if (revents == 0) {
 > -		if (events &
 > -		    (POLLIN | POLLINIGNEOF | POLLPRI | POLLRDNORM |
 > -		     POLLRDBAND)) {
 > +		if (events & (POLLIN | POLLPRI | POLLRDNORM | POLLRDBAND)) {
 > 			selrecord(td, &so->so_rcv.sb_sel);
 > 			so->so_rcv.sb_flags |= SB_SEL;
 > 		}
 
 This all worked as intended.  It broke lat_rpc because (as intended) the
 condition for setting POLLIN became a subset of soreadable(), so POLLIN
 no longer gets set at pure EOF, but lat_rpc depends on it being set.
 
 The rest of the changes are simple and worked.
 
 I forgot to mention another source of inconsistencies: kqueue.  kqueue
 mainly uses the SB_CANT* flags for sockets and fifos to decide EOF, so
 it should work reasonably provided sopoll() does, but it will behave
 like select() in the special case unless it is changed to behave like
 poll().
 
 Here is my work-in-progress version.  The patch is relative to a very
 old version of FreeBSD.
 
 %%%
 Index: fifo_vnops.c
 ===================================================================
 RCS file: /home/ncvs/src/sys/fs/fifofs/fifo_vnops.c,v
 retrieving revision 1.100
 diff -u -2 -r1.100 fifo_vnops.c
 --- fifo_vnops.c	23 Jun 2004 00:35:50 -0000	1.100
 +++ fifo_vnops.c	26 Mar 2006 09:42:47 -0000
 @@ -232,4 +232,22 @@
   		fip->fi_readers++;
   		if (fip->fi_readers == 1) {
 +			SOCKBUF_LOCK(&fip->fi_readsock->so_rcv);
 +			if (fip->fi_writers > 0)
 +				fip->fi_readsock->so_rcv.sb_state |=
 +				    SBS_COULDRCV;
 +			else
 +				/*
 +				 * Sloppy?  Might be necessary to clear it
 +				 * in all the places where fi_readers is
 +				 * decremented to 0.  I think only writers
 +				 * polling for input could be confused by
 +				 * having it not set, and there is a problem
 +				 * with these anyway now that we have
 +				 * reversed the sense of the flag -- they
 +				 * now block (?), but shouldn't.
 +				 */
 +				fip->fi_readsock->so_rcv.sb_state &=
 +				    ~SBS_COULDRCV;
 +			SOCKBUF_UNLOCK(&fip->fi_readsock->so_rcv);
   			SOCKBUF_LOCK(&fip->fi_writesock->so_snd);
   			fip->fi_writesock->so_snd.sb_state &= ~SBS_CANTSENDMORE;
 @@ -248,7 +266,8 @@
   		fip->fi_writers++;
   		if (fip->fi_writers == 1) {
 -			SOCKBUF_LOCK(&fip->fi_writesock->so_rcv);
 +			SOCKBUF_LOCK(&fip->fi_readsock->so_rcv);
   			fip->fi_readsock->so_rcv.sb_state &= ~SBS_CANTRCVMORE;
 -			SOCKBUF_UNLOCK(&fip->fi_writesock->so_rcv);
 +			fip->fi_readsock->so_rcv.sb_state |= SBS_COULDRCV;
 +			SOCKBUF_UNLOCK(&fip->fi_readsock->so_rcv);
   			if (fip->fi_readers > 0) {
   				wakeup(&fip->fi_readers);
 @@ -521,32 +540,11 @@
   	int events, revents = 0;
 
 -	events = ap->a_events &
 -	    (POLLIN | POLLINIGNEOF | POLLPRI | POLLRDNORM | POLLRDBAND);
 +	events = ap->a_events & (POLLIN | POLLPRI | POLLRDNORM | POLLRDBAND);
   	if (events) {
 -		/*
 -		 * If POLLIN or POLLRDNORM is requested and POLLINIGNEOF is
 -		 * not, then convert the first two to the last one.  This
 -		 * tells the socket poll function to ignore EOF so that we
 -		 * block if there is no writer (and no data).  Callers can
 -		 * set POLLINIGNEOF to get non-blocking behavior.
 -		 */
 -		if (events & (POLLIN | POLLRDNORM) &&
 -		    !(events & POLLINIGNEOF)) {
 -			events &= ~(POLLIN | POLLRDNORM);
 -			events |= POLLINIGNEOF;
 -		}
 -
   		filetmp.f_data = ap->a_vp->v_fifoinfo->fi_readsock;
   		filetmp.f_cred = ap->a_cred;
 -		if (filetmp.f_data)
 -			revents |= soo_poll(&filetmp, events,
 -			    ap->a_td->td_ucred, ap->a_td);
 -
 -		/* Reverse the above conversion. */
 -		if ((revents & POLLINIGNEOF) &&
 -		    !(ap->a_events & POLLINIGNEOF)) {
 -			revents |= (ap->a_events & (POLLIN | POLLRDNORM));
 -			revents &= ~POLLINIGNEOF;
 -		}
 +		revents |= soo_poll(&filetmp,
 +		    events | (ap->a_events & POLLPOLL), ap->a_td->td_ucred,
 +		    ap->a_td);
   	}
   	events = ap->a_events & (POLLOUT | POLLWRNORM | POLLWRBAND);
 @@ -554,8 +552,6 @@
   		filetmp.f_data = ap->a_vp->v_fifoinfo->fi_writesock;
   		filetmp.f_cred = ap->a_cred;
 -		if (filetmp.f_data) {
 -			revents |= soo_poll(&filetmp, events,
 -			    ap->a_td->td_ucred, ap->a_td);
 -		}
 +		revents |= soo_poll(&filetmp, events, ap->a_td->td_ucred,
 +		    ap->a_td);
   	}
   	return (revents);
 Index: sys_generic.c
 ===================================================================
 RCS file: /home/ncvs/src/sys/kern/sys_generic.c,v
 retrieving revision 1.131
 diff -u -2 -r1.131 sys_generic.c
 --- sys_generic.c	5 Apr 2004 21:03:35 -0000	1.131
 +++ sys_generic.c	26 Mar 2006 05:41:48 -0000
 @@ -1093,6 +1080,6 @@
   				 * POLLERR if appropriate.
   				 */
 -				fds->revents = fo_poll(fp, fds->events,
 -				    td->td_ucred, td);
 +				fds->revents = fo_poll(fp,
 +				    fds->events | POLLPOLL, td->td_ucred, td);
   				if (fds->revents != 0)
   					n++;
 Index: uipc_socket.c
 ===================================================================
 RCS file: /home/ncvs/src/sys/kern/uipc_socket.c,v
 retrieving revision 1.189
 diff -u -2 -r1.189 uipc_socket.c
 --- uipc_socket.c	24 Jun 2004 04:28:30 -0000	1.189
 +++ uipc_socket.c	26 Mar 2006 09:27:47 -0000
 @@ -1872,13 +1870,43 @@
   			revents |= events & (POLLIN | POLLRDNORM);
 
 -	if (events & POLLINIGNEOF)
 -		if (so->so_rcv.sb_cc >= so->so_rcv.sb_lowat ||
 -		    !TAILQ_EMPTY(&so->so_comp) || so->so_error)
 -			revents |= POLLINIGNEOF;
 -
   	if (events & (POLLOUT | POLLWRNORM))
   		if (sowriteable(so))
   			revents |= events & (POLLOUT | POLLWRNORM);
 
 +	/*
 +	 * SBS_CANTRCVMORE (which is checked by soreadable()) normally
 +	 * implies EOF (and thus readable) and hung up, but for
 +	 * compatibility with other systems and to obtain behavior that
 +	 * is otherwise unavailable we make the case of poll() on a fifo
 +	 * that has never had any writers during the lifetime of any
 +	 * current reader special: then we pretend that the fifo is
 +	 * unreadable unless it contains non-null data, and that it is
 +	 * not hung up.  The POLLPOLL flag is set by poll() to identify
 +	 * poll() here, and the SBS_COULDRCV flag is set by the fifo
 +	 * layer to indicate a fifo that is not in the special state.
 +	 */
 +	if (so->so_rcv.sb_state & SBS_CANTRCVMORE) {
 +		if (!(events & POLLPOLL) || so->so_rcv.sb_state & SBS_COULDRCV)
 +			revents |= POLLHUP;	/* finish settings */
 +		else if (!(so->so_rcv.sb_cc >= so->so_rcv.sb_lowat ||
 +		    !TAILQ_EMPTY(&so->so_comp) || so->so_error))
 +			revents &= ~(POLLIN | POLLRDNORM); /* undo settings */
 +	}
 +
 +	/*
 +	 * Testing of hangup for writers could be optimized by combining
 +	 * it with testing for writeability, but we keep the test separate
 +	 * and with the same organization as the test for readers for
 +	 * clarity.  Note that writeable implies not hung up, so if POLLHUP
 +	 * is set here then (POLLOUT | POLLWRNORM) is not set above, as
 +	 * standards require.  Less obviously, if POLLHUP was set above for
 +	 * a reader, then the output flags cannot have been set above for
 +	 * a writer.  Even less obviously, we cannot end up with both
 +	 * POLLHUP output flags set in revents after ORing the revents for
 +	 * the read and write socket in fifo_poll().
 +	 */
 +	if (so->so_snd.sb_state & SBS_CANTSENDMORE)
 +		revents |= POLLHUP;
 +
   	if (events & (POLLPRI | POLLRDBAND))
   		if (so->so_oobmark || (so->so_rcv.sb_state & SBS_RCVATMARK))
 @@ -1886,7 +1914,5 @@
 
   	if (revents == 0) {
 -		if (events &
 -		    (POLLIN | POLLINIGNEOF | POLLPRI | POLLRDNORM |
 -		     POLLRDBAND)) {
 +		if (events & (POLLIN | POLLPRI | POLLRDNORM | POLLRDBAND)) {
   			SOCKBUF_LOCK(&so->so_rcv);
   			selrecord(td, &so->so_rcv.sb_sel);
 Index: poll.h
 ===================================================================
 RCS file: /home/ncvs/src/sys/sys/poll.h,v
 retrieving revision 1.13
 diff -u -2 -r1.13 poll.h
 --- poll.h	10 Jul 2002 04:47:25 -0000	1.13
 +++ poll.h	26 Mar 2006 07:56:52 -0000
 @@ -67,7 +67,6 @@
   #define	POLLWRBAND	0x0100		/* OOB/Urgent data can be written */
 
 -#if __BSD_VISIBLE
 -/* General FreeBSD extension (currently only supported for sockets): */
 -#define	POLLINIGNEOF	0x2000		/* like POLLIN, except ignore EOF */
 +#ifdef _KERNEL
 +#define	POLLPOLL	0x8000		/* system call is actually poll() */
   #endif
 
 Index: socketvar.h
 ===================================================================
 RCS file: /home/ncvs/src/sys/sys/socketvar.h,v
 retrieving revision 1.130
 diff -u -2 -r1.130 socketvar.h
 --- socketvar.h	24 Jun 2004 04:27:10 -0000	1.130
 +++ socketvar.h	26 Mar 2006 08:35:56 -0000
 @@ -212,4 +212,5 @@
   #define	SBS_CANTRCVMORE		0x0020	/* can't receive more data from peer */
   #define	SBS_RCVATMARK		0x0040	/* at mark on input */
 +#define	SBS_COULDRCV		0x0080	/* could receive previously (or now) */
 
   /*
 Index: syscalls.c
 ===================================================================
 RCS file: /home/ncvs/src/usr.bin/truss/syscalls.c,v
 retrieving revision 1.39
 diff -u -2 -r1.39 syscalls.c
 --- syscalls.c	11 Jun 2004 11:58:07 -0000	1.39
 +++ syscalls.c	25 Mar 2006 13:25:13 -0000
 @@ -402,5 +402,5 @@
   #define POLLKNOWN_EVENTS \
   	(POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP | POLLNVAL | \
 -	 POLLRDNORM |POLLRDBAND | POLLWRBAND | POLLINIGNEOF) 
 +	 POLLRDNORM |POLLRDBAND | POLLWRBAND)
 
   	  u += snprintf(tmp + used, per_fd,
 %%%
 
 Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200603261300.k2QD0a2c067502>