Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 26 Jul 2016 16:40:54 -0700 (PDT)
From:      Don Lewis <truckman@FreeBSD.org>
To:        brde@optusnet.com.au
Cc:        freebsd-net@freebsd.org
Subject:   Re: IPv6 -> IPv4 fallback broken in serf, kernel bug?
Message-ID:  <201607262340.u6QNes2t082436@gw.catspoiler.org>
In-Reply-To: <20160727054616.X990@besplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On 27 Jul, Bruce Evans wrote:
> On Tue, 26 Jul 2016, Don Lewis wrote:
> 
>> Serf has some code to fall back from IPv4 if an IPv6 and more generally
>> try different addresses on multi-homed servers if connection attempts
>> fail, but it does not work properly on recent versions of FreeBSD. I've
>> tested both recent FreeBSD 10.3-STABLE and HEAD.
>>
>> The way that it is supposed to work is that serf creates a socket, sets
>> it non-blocking, calls connect(), and then passes the fd to poll(). When
>> the connection attempt fails, it expects to see a POLLERR event.  The
>> POLLERR event handler will then call getsockopt(fd, SOL_SOCKET,
>> SO_ERROR, &error, ...).  If the returned error is ECONNREFUSED or one of
>> a couple of other errors, then serf will move on to the next address.
>>
>> Instead what happens is that serf also(?) sees POLLIN set, which it
>> processes first by calling read(), which returns an ECONNREFUSED error.
>> That not a documented error return from read().
> 
> FreeBSD still bogusly returns POLLIN (and POLLRDNORM) together with
> POLLHUP at EOF when there is no data (both set should mean both), and
> still has the bogus POLLINIGNEOF, but it it almost never returns POLLERR.
> My regression tests in tools/regression/poll check for not having this
> bug
> 
> The only setting of POLLERR in kern is in kqueue_poll() for errors in
> initialization, and this doesn't set the other flags.
> 
> The only uses of POLLERR in kern are:
> - in select(), to turn POLLERR into "set" for any backend that sets it
>    (and there seems to be only 1 backend that sets it)
> - in vop_stdpoll() and poll_no_poll(), there is inconsistent bogus masking
>    using POLLSTANDARD to obfuscate that standard flags which must be
>    ignored are _not_ masked.
> 
> So I don't see how you can get POLLIN with POLLERR.
> 
>> An easy way to test this is to truss svn and attempt to do an http
>> checkout from a host that has both IPv6 and IPv4 addresses, but is not
>> listening on port 80.  The only connection attempt will be to the IPv6
>> address.
>>
>> socket(PF_INET6,SOCK_STREAM|SOCK_CLOEXEC,6)	 = 4 (0x4)
>> fcntl(4,F_GETFL,)				 = 2 (0x2)
>> fcntl(4,F_SETFL,O_NONBLOCK|0x2)			 = 0 (0x0)
>> setsockopt(0x4,0x6,0x1,0x7fffffffdda4,0x4)	 = 0 (0x0)
>> gettimeofday({ 1469515046.979461 },0x0)		 = 0 (0x0)
>> connect(4,{ AF_INET6 [xxxx:xxxx:xxxx:xxxx::xxxx]:80 },28) ERR#36 'Operation now in progress'
>> gettimeofday({ 1469515046.979614 },0x0)		 = 0 (0x0)
>> kevent(3,{ 4,EVFILT_READ,EV_ADD,0x0,0x0,0x805491300 },1,0x0,0,0x0) = 0 (0x0)
>> kevent(3,{ 4,EVFILT_WRITE,EV_ADD,0x0,0x0,0x805491300 },1,0x0,0,0x0) = 0 (0x0)
>> kevent(3,0x0,0,{ 4,EVFILT_READ,EV_EOF,NOTE_LOWAT|0x3c,0x0,0x805491300 4,EVFILT_WRITE,EV_EOF,NOTE_LOWAT|0x3c,0x8000,0x805491300 },32,{ 0.500000000 }) = 2 (0x2)
> 
> I don't see any POLL* there or completely understand the notation or kqueue,
> but this looks like the poll() bug with POLLIN together with POLLHUP, not
> POLLIN together with POLLERR.

I didn't try to decipher out the kqueue stuff.  I was thinking that our
poll() was using kqueue under the hood, but it turns out that the poll
emulation is actually being done by apr.  Sigh ...

A comment in the emulation code says:

    /* APR_POLLPRI, APR_POLLERR, and APR_POLLNVAL are not handled by this
     * implementation.

... double sigh.

> Everything here seems to be correct.  Not very good, but good enough here.
> 
> EV_EOF is set by filt_soread() when SBS_CANTRECVMORE is set.
> SBS_CANTRECVMORE means hangup, not EOF, and I think there can be
> readable data from a socket in general but not after a connection
> error.  So this translation is incorrect in general but correct after
> a connection error.  kqueue just can't represent hangup and conflates
> it with EOF.

But should there be a hangup or EOF if we never got connected in the
first place?

> When filt_soread() sets EV_EOF, it doesn't clear other flags, so
> NOTE_LOWAT remains set.  This happens to be correct.  But since NOTE_LOWAT
> really means low water, you can't use it to determine if (non-null) data
> can be read.  (POSIX is unclear about whether the "data" for select() and
> poll() is actual data or just EOF.)
> 
> poll() has almost the opposite problems.  It can represent hangup but
> can't represent EOF.  It can represent no data, but this doesn't mean
> EOF when the file is open.  It can't represent low-water.
> so_poll_generic() starts carefully by setting POLLIN iff soreadable().
> soreadable() is true above the watermark.  So POLLIN for a socket
> normally means that (non-null) data above the watermark can be read
> (without blocking because it is above the watermark).  This is correct
> semantics.  But then so_poll_generic() sets POLLIN if it sets POLLHUP.
> This makes POLLIN worse than useless.  A naive reader won't look at
> POLLHUP, but will trust POLLIN and spin reading at EOF.  A non-naive
> reader will see POLLHUP but can't trust POLLIN then.   It must spin
> reading until read returns EOF, and poll() is useless for avoiding
> this busy-waiting.  Turning off O_NONBLOCK to avoid spinning is unsafe
> if the EOF is not sticky.
> 
> Just having watermarks further complicates the idea of what "data" is.
> Null data is a special case of data that it is too small to be worth
> reading.  It corresponds to a low watermark of 0 or 1.  With watermarks,
> non-null datai below low water should be considered as not being there
> for the purposes of select() and poll(), but there if you try to read
> it.  POSIX is unclear about this too.  kqueue has the opposite problem.
> It handles watermarks directly, but seems to be missing support for
> transient EOF.
> 
> This causes problems for tty devices too.  In Net/2, select() basically
> uses a hard-coded watermark of 1, and this doesn't even work to give
> tinygrams because read() blocks after select() returns "set" for certain
> MIN/TIME combinations where the watermark should be MIN.  This was fixed
> in FreeBSD-1, basically by copying the socket code.  This was broken in
> 4.4BSD.  This was broken in FreeBSD-2.early by copying 4.4BSD.  This was
> fixed in FreeBSD-2 by restoring fixes.  The fixes were refined in
> FreeBSD-[2-7].  All of the fixes were lost in FreeBSD-8.  Most of the
> fixes are restored in my version.
> 
>> read(4,0x80549c064,8000)			 ERR#61 'Connection refused'
>> kevent(3,{ 4,EVFILT_READ,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) = 0 (0x0)
>> kevent(3,{ 4,EVFILT_WRITE,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) = 0 (0x0)
>> kevent(3,{ 4,EVFILT_READ,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) ERR#2 'No such file or directory'
>> kevent(3,{ 4,EVFILT_WRITE,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) ERR#2 'No such file or directory'
>> close(4)					 = 0 (0x0)
>> close(3)					 = 0 (0x0)
>> svn: E170013: Unable to connect to a repository at URL ...
>>
>> It looks like it should be possible to patch serf to handle this, but:
>>  * Should POLLIN be set for this event?
> 
> I think there never was any data, so no for poll().  kqueue just cannot
> represent the no-data condition.
> 
>>  * What errno value should read() return in this case, if it is
>>    ECONNREFUSED, then that should be documented.
> 
> Don't know.
> 
> Bruce




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201607262340.u6QNes2t082436>