Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 12 Oct 2011 09:11:05 +0300
From:      Mikolaj Golub <trociny@freebsd.org>
To:        dave jones <s.dave.jones@gmail.com>
Cc:        Adrian Chadd <adrian@freebsd.org>, "K. Macy" <kmacy@freebsd.org>, "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>, bz@freebsd.org, Robert Watson <rwatson@freebsd.org>, Arnaud Lacombe <lacombar@gmail.com>
Subject:   Re: Kernel panic on FreeBSD 9.0-beta2
Message-ID:  <86fwiy91ty.fsf@in138.ua3>
In-Reply-To: <CANf5e8bLcxYDe%2BmHssUndOqh2B0j-V28Ox2dZCfy6%2Bo7aURw=w@mail.gmail.com> (dave jones's message of "Wed, 12 Oct 2011 09:53:34 %2B0800")
References:  <CANf5e8aG4go4M_vsRExUsJB_sjaN5x-QK-TCDAhSH64JSo0mdQ@mail.gmail.com> <CACqU3MXStMMEoppvDtZS6hV4WGttbdJiF8E-ORwJ%2BQSmnTy-Yg@mail.gmail.com> <CACqU3MV-t4Va6VWUoXy1Y9FYnNJTUw1X%2BE7ik-2%2BtMVuVOV3RA@mail.gmail.com> <CAJ-Vmom-177OkdUXjz%2BZLqbaqn=p%2BuTGypiVuMqdeXgdOgb4hQ@mail.gmail.com> <CAHM0Q_Mmn3z1V6AtZHQMpgbdY7oQqOChiNt=8NJrZQDnravb7A@mail.gmail.com> <CACqU3MU9ZZtOsdBOa%2BF3SqUaYgO%2BEo0v1ACjY0S4rY4fRQyv5Q@mail.gmail.com> <CAHM0Q_PZD9_0ZkELZ5XL8Ebh8eD-uFuSjXWKKVpGDeM_JDaqMA@mail.gmail.com> <8662kcigif.fsf@kopusha.home.net> <alpine.BSF.2.00.1109301432570.65269@fledge.watson.org> <CANf5e8ab=mUw-AJuRXZy1T6%2BZcryxjKfuCOsakPPfqatuA3HdA@mail.gmail.com> <86y5x0ooik.fsf@in138.ua3> <CANf5e8YtQ5P2euF7E-D6Wt7U38UuLc8KVU-NCehq74XV_WTvBg@mail.gmail.com> <CANf5e8bLcxYDe%2BmHssUndOqh2B0j-V28Ox2dZCfy6%2Bo7aURw=w@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
--=-=-=
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 8bit


On Wed, 12 Oct 2011 09:53:34 +0800 dave jones wrote:

 dj> On Fri, Oct 7, 2011 at 9:12 AM, dave jones  wrote:
 >> 2011/10/4 Mikolaj Golub :
 >>>
 >>> On Sat, 1 Oct 2011 14:15:45 +0800 dave jones wrote:
 >>>
 >>> šdj> On Fri, Sep 30, 2011 at 9:41 PM, Robert Watson wrote:
 >>> š>>
 >>> š>> On Wed, 28 Sep 2011, Mikolaj Golub wrote:
 >>> š>>
 >>> š>>> On Mon, 26 Sep 2011 16:12:55 +0200 K. Macy wrote:
 >>> š>>>
 >>> š>>> KM> Sorry, didn't look at the images (limited bw), I've seen something KM>
 >>> š>>> like this before in timewait. This "can't happen" with UDP so will be KM>
 >>> š>>> interested in learning more about the bug.
 >>> š>>>
 >>> š>>> The panic can be easily triggered by this:
 >>> š>>
 >>> š>> Hi:
 >>> š>>
 >>> š>> Just catching up on this thread. šI think the analysis here is generally
 >>> š>> right: in 9.0, you're much more likely to see an inpcb with its in_socket
 >>> š>> pointer cleared in the hash list than in prior releases, and
 >>> š>> in_pcbbind_setup() trips over this.
 >>> š>>
 >>> š>> However, at least on first glance (and from the perspective of invariants
 >>> š>> here), I think the bug is actualy that in_pcbbind_setup() is asking
 >>> š>> in_pcblookup_local() for an inpcb and then access the returned inpcb's
 >>> š>> in_socket pointer without acquiring a lock on the inpcb. šStructurally, it
 >>> š>> can't acquire this lock for lock order reasons -- it already holds the lock
 >>> š>> on its own inpcb. šTherefore, we should only access fields that are safe to
 >>> š>> follow in an inpcb when you hold a reference via the hash lock and not a
 >>> š>> lock on the inpcb itself, which appears generally OK (+/-) for all the
 >>> š>> fields in that clause but the t->inp_socket->so_options dereference.
 >>> š>>
 >>> š>> A preferred fix would cache the SO_REUSEPORT flag in an inpcb-layer field,
 >>> š>> such as inp_flags2, giving us access to its value without having to walk
 >>> š>> into the attached (or not) socket.
 >>> š>>
 >>> š>> This raises another structural question, which is whether we need a new
 >>> š>> inp_foo flags field that is protected explicitly by the hash lock, and not
 >>> š>> by the inpcb lock, which could hold fields relevant to address binding. šI
 >>> š>> don't think we need to solve that problem in this context, as a slightly
 >>> š>> race on SO_REUSEPORT is likely acceptable.
 >>> š>>
 >>> š>> The suggested fix does perform the desired function of explicitly detaching
 >>> š>> the inpcb from the hash list before the socket is disconnected from the
 >>> š>> inpcb. However, it's incomplete in that the invariant that's being broken is
 >>> š>> also relied on for other protocols (such as raw sockets). šThe correct
 >>> š>> invariant is that inp_socket is safe to follow unconditionally if an inpcb
 >>> š>> is locked and INP_DROPPED isn't set -- the bug is in "locked" not in
 >>> š>> "INP_DROPPED", which is why I think this is the wrong fix, even though it
 >>> š>> prevents a panic :-).
 >>>
 >>> šdj> Hello Robert,
 >>>
 >>> šdj> Thank you for taking your valuable time to find out the problem.
 >>> šdj> Since I don't have idea about network internals, would you have a patch
 >>> šdj> about this? I'd be glad to test it, thanks again.
 >>>
 >>> Here is the patch that implements what Robert suggests.
 >>>
 >>> Dave, could you test it?
 >>
 >> Sure. Thanks for cooking the patch.
 >> Machines have been running two days now without panic.

Thank you for testing it.

 dj> Is there any plan to commit your fix? Thank you.
 dj> I'd upgrade to 9.0-release from beta-2 once it's released.

I have an upgraded version of the patch, which is under review now. I have
been waiting for the response before asking you to test it, but it would be
great if you try it not waiting :-).

As it was pointed by Robert the previous version introduced a regression:
SO_REUSEPORT was ignored if setsockopt was called after bind (the old cached
value was still used). So the updated version fixes this and also contains
several other fixes, the most important among them is that it fixes the panic
for IPv6 bind case too.

-- 
Mikolaj Golub


--=-=-=
Content-Type: text/x-diff
Content-Disposition: attachment; filename=INP_REUSEPORT.patch

Index: sys/netinet/in_pcb.c
===================================================================
--- sys/netinet/in_pcb.c	(revision 226165)
+++ sys/netinet/in_pcb.c	(working copy)
@@ -575,8 +575,7 @@ in_pcbbind_setup(struct inpcb *inp, struct sockadd
 				     ntohl(t->inp_faddr.s_addr) == INADDR_ANY) &&
 				    (ntohl(sin->sin_addr.s_addr) != INADDR_ANY ||
 				     ntohl(t->inp_laddr.s_addr) != INADDR_ANY ||
-				     (t->inp_socket->so_options &
-					 SO_REUSEPORT) == 0) &&
+				     (t->inp_flags2 & INP_REUSEPORT) == 0) &&
 				    (inp->inp_cred->cr_uid !=
 				     t->inp_cred->cr_uid))
 					return (EADDRINUSE);
@@ -590,19 +589,19 @@ in_pcbbind_setup(struct inpcb *inp, struct sockadd
 				 * being in use (for now).  This is better
 				 * than a panic, but not desirable.
 				 */
-				tw = intotw(inp);
+				tw = intotw(t);
 				if (tw == NULL ||
 				    (reuseport & tw->tw_so_options) == 0)
 					return (EADDRINUSE);
-			} else if (t &&
-			    (reuseport & t->inp_socket->so_options) == 0) {
+			} else if (t && (reuseport == 0 ||
+			    (t->inp_flags2 & INP_REUSEPORT) == 0)) {
 #ifdef INET6
 				if (ntohl(sin->sin_addr.s_addr) !=
 				    INADDR_ANY ||
 				    ntohl(t->inp_laddr.s_addr) !=
 				    INADDR_ANY ||
-				    INP_SOCKAF(so) ==
-				    INP_SOCKAF(t->inp_socket))
+				    (inp->inp_vflag & INP_IPV6PROTO) == 0 ||
+				    (t->inp_vflag & INP_IPV6PROTO) == 0)
 #endif
 				return (EADDRINUSE);
 			}
Index: sys/netinet/in_pcb.h
===================================================================
--- sys/netinet/in_pcb.h	(revision 226165)
+++ sys/netinet/in_pcb.h	(working copy)
@@ -540,6 +540,7 @@ void 	inp_4tuple_get(struct inpcb *inp, uint32_t *
 #define	INP_LLE_VALID		0x00000001 /* cached lle is valid */	
 #define	INP_RT_VALID		0x00000002 /* cached rtentry is valid */
 #define	INP_PCBGROUPWILD	0x00000004 /* in pcbgroup wildcard list */
+#define	INP_REUSEPORT		0x00000008 /* socket SO_REUSEPORT option is set */
 
 /*
  * Flags passed to in_pcblookup*() functions.
Index: sys/netinet/ip_output.c
===================================================================
--- sys/netinet/ip_output.c	(revision 226165)
+++ sys/netinet/ip_output.c	(working copy)
@@ -895,12 +895,43 @@ ip_ctloutput(struct socket *so, struct sockopt *so
 
 	error = optval = 0;
 	if (sopt->sopt_level != IPPROTO_IP) {
-		if ((sopt->sopt_level == SOL_SOCKET) &&
-		    (sopt->sopt_name == SO_SETFIB)) {
-			inp->inp_inc.inc_fibnum = so->so_fibnum;
-			return (0);
+		error = EINVAL;
+
+		if (sopt->sopt_level == SOL_SOCKET &&
+		    sopt->sopt_dir == SOPT_SET) {
+			switch (sopt->sopt_name) {
+			case SO_REUSEADDR:
+				INP_WLOCK(inp);
+				if (IN_MULTICAST(ntohl(inp->inp_laddr.s_addr))) {
+					if ((so->so_options &
+					    (SO_REUSEADDR | SO_REUSEPORT)) != 0)
+						inp->inp_flags2 |= INP_REUSEPORT;
+					else
+						inp->inp_flags2 &= ~INP_REUSEPORT;
+				}
+				INP_WUNLOCK(inp);
+				error = 0;
+				break;
+			case SO_REUSEPORT:
+				INP_WLOCK(inp);
+				if ((so->so_options & SO_REUSEPORT) != 0)
+					inp->inp_flags2 |= INP_REUSEPORT;
+				else
+					inp->inp_flags2 &= ~INP_REUSEPORT;
+				INP_WUNLOCK(inp);
+				error = 0;
+				break;
+			case SO_SETFIB:
+				INP_WLOCK(inp);
+				inp->inp_inc.inc_fibnum = so->so_fibnum;
+				INP_WUNLOCK(inp);
+				error = 0;
+				break;
+			default:
+				break;
+			}
 		}
-		return (EINVAL);
+		return (error);
 	}
 
 	switch (sopt->sopt_dir) {
Index: sys/netinet6/ip6_output.c
===================================================================
--- sys/netinet6/ip6_output.c	(revision 226145)
+++ sys/netinet6/ip6_output.c	(working copy)
@@ -1421,7 +1421,38 @@ ip6_ctloutput(struct socket *so, struct sockopt *s
 	optval = 0;
 	uproto = (int)so->so_proto->pr_protocol;
 
-	if (level == IPPROTO_IPV6) {
+	if (level != IPPROTO_IPV6) {
+		error = EINVAL;
+
+		if (sopt->sopt_level == SOL_SOCKET &&
+		    sopt->sopt_dir == SOPT_SET) {
+			switch (sopt->sopt_name) {
+			case SO_REUSEADDR:
+				INP_WLOCK(in6p);
+				if (IN_MULTICAST(ntohl(in6p->inp_laddr.s_addr))) {
+					if ((so->so_options &
+					    (SO_REUSEADDR | SO_REUSEPORT)) != 0)
+						in6p->inp_flags2 |= INP_REUSEPORT;
+					else
+						in6p->inp_flags2 &= ~INP_REUSEPORT;
+				}
+				INP_WUNLOCK(in6p);
+				error = 0;
+				break;
+			case SO_REUSEPORT:
+				INP_WLOCK(in6p);
+				if ((so->so_options & SO_REUSEPORT) != 0)
+					in6p->inp_flags2 |= INP_REUSEPORT;
+				else
+					in6p->inp_flags2 &= ~INP_REUSEPORT;
+				INP_WUNLOCK(in6p);
+				error = 0;
+				break;
+			default:
+				break;
+			}
+		}
+	} else {		/* level == IPPROTO_IPV6 */
 		switch (op) {
 
 		case SOPT_SET:
@@ -2044,8 +2075,6 @@ do { \
 			}
 			break;
 		}
-	} else {		/* level != IPPROTO_IPV6 */
-		error = EINVAL;
 	}
 	return (error);
 }
Index: sys/netinet6/in6_pcb.c
===================================================================
--- sys/netinet6/in6_pcb.c	(revision 226145)
+++ sys/netinet6/in6_pcb.c	(working copy)
@@ -187,6 +187,7 @@ in6_pcbbind(register struct inpcb *inp, struct soc
 		}
 		if (lport) {
 			struct inpcb *t;
+			struct tcptw *tw;
 
 			/* GROSS */
 			if (ntohs(lport) <= V_ipport_reservedhigh &&
@@ -206,8 +207,8 @@ in6_pcbbind(register struct inpcb *inp, struct soc
 				     IN6_IS_ADDR_UNSPECIFIED(&t->in6p_faddr)) &&
 				    (!IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr) ||
 				     !IN6_IS_ADDR_UNSPECIFIED(&t->in6p_laddr) ||
-				     (t->inp_socket->so_options & SO_REUSEPORT)
-				      == 0) && (inp->inp_cred->cr_uid !=
+				     (t->inp_flags2 & INP_REUSEPORT) == 0) &&
+				    (inp->inp_cred->cr_uid !=
 				     t->inp_cred->cr_uid))
 					return (EADDRINUSE);
 #ifdef INET
@@ -233,10 +234,21 @@ in6_pcbbind(register struct inpcb *inp, struct soc
 			}
 			t = in6_pcblookup_local(pcbinfo, &sin6->sin6_addr,
 			    lport, lookupflags, cred);
-			if (t && (reuseport & ((t->inp_flags & INP_TIMEWAIT) ?
-			    intotw(t)->tw_so_options :
-			    t->inp_socket->so_options)) == 0)
+			if (t && (t->inp_flags & INP_TIMEWAIT)) {
+				/*
+				 * XXXRW: If an incpb has had its timewait
+				 * state recycled, we treat the address as
+				 * being in use (for now).  This is better
+				 * than a panic, but not desirable.
+				 */
+				tw = intotw(t);
+				if (tw == NULL ||
+				    (reuseport & tw->tw_so_options) == 0)
+					return (EADDRINUSE);
+			} else if (t && (reuseport == 0 ||
+			    (t->inp_flags2 & INP_REUSEPORT) == 0)) {
 				return (EADDRINUSE);
+			}
 #ifdef INET
 			if ((inp->inp_flags & IN6P_IPV6_V6ONLY) == 0 &&
 			    IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) {
@@ -246,19 +258,19 @@ in6_pcbbind(register struct inpcb *inp, struct soc
 				t = in_pcblookup_local(pcbinfo, sin.sin_addr,
 				    lport, lookupflags, cred);
 				if (t && t->inp_flags & INP_TIMEWAIT) {
-					if ((reuseport &
-					    intotw(t)->tw_so_options) == 0 &&
-					    (ntohl(t->inp_laddr.s_addr) !=
+					tw = intotw(t);
+					if (tw == NULL)
+						return (EADDRINUSE);
+					if ((reuseport & tw->tw_so_options) == 0
+					    && (ntohl(t->inp_laddr.s_addr) !=
 					     INADDR_ANY || ((inp->inp_vflag &
 					     INP_IPV6PROTO) ==
 					     (t->inp_vflag & INP_IPV6PROTO))))
 						return (EADDRINUSE);
-				}
-				else if (t &&
-				    (reuseport & t->inp_socket->so_options)
-				    == 0 && (ntohl(t->inp_laddr.s_addr) !=
-				    INADDR_ANY || INP_SOCKAF(so) ==
-				     INP_SOCKAF(t->inp_socket)))
+				} else if (t && (reuseport == 0 ||
+				    (t->inp_flags2 & INP_REUSEPORT) == 0) &&
+				    (ntohl(t->inp_laddr.s_addr) != INADDR_ANY ||
+				    (t->inp_vflag & INP_IPV6PROTO) == 0))
 					return (EADDRINUSE);
 			}
 #endif

--=-=-=--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?86fwiy91ty.fsf>