Date:      Fri, 28 Sep 2007 09:59:37 -0700 (PDT)
From:      Doug Ambrisko <ambrisko@ambrisko.com>
To:        Ivan Voras <ivoras@freebsd.org>
Cc:        freebsd-net@freebsd.org
Subject:   Re: Panic in rt_check
Message-ID:  <200709281659.l8SGxbBv072053@ambrisko.com>
In-Reply-To: <fddd8n$s82$1@sea.gmane.org>

Ivan Voras writes:
-- Start of PGP signed section.
| Hi,
| 
| I have a machine that panics almost daily in route.c, in rt_check(). 
| This panic has been reported by several users, including Marcel 
| Moolenaar for a machine in freebsd.org.
| 
| The problem is present in both 6-STABLE and 7-CURRENT, and apparently it 
| manifests on SMP machines, both i386 and AMD64.
| 
| The panic backtrace looks like this:
| 
| panic: mtx_lock() of destroyed mutex @ /usr/src/sys/net/route.c:1305
| cpuid = 1
| KDB: stack backtrace:
| db_trace_self_wrapper(c091bcf0,e38b690c,c0659fc1,c093f3cf,1,...) at db_trace_self_wrapper+0x26
| kdb_backtrace(c093f3cf,1,c0917de2,e38b6918,1,...) at kdb_backtrace+0x29
| panic(c0917de2,c0925d40,519,0,0,...) at panic+0x111
| _mtx_lock_flags(c5d333a8,0,c0925d40,519,0,...) at _mtx_lock_flags+0x59
| rt_check(e38b6970,e38b698c,c55b7d10,0,0,...) at rt_check+0x11e
| arpresolve(c4e27000,c5d33d98,c50dbe00,c55b7d10,e38b69a6,...) at arpresolve+0xaf
| ether_output(c4e27000,c50dbe00,c55b7d10,c5d33d98,ccf8b348,...) at ether_output+0x7e
| ip_output(c50dbe00,0,e38b6a1c,0,0,...) at ip_output+0xa09
| tcp_output(ccefbac8,0,c0929785,91d,0,...) at tcp_output+0x1463
| tcp_do_segment(ccefbac8,28,0,1dd,901f,...) at tcp_do_segment+0x1c97
| tcp_input(c6095100,14,c4ea3c00,1,0,...) at tcp_input+0xd5e
| ip_input(c6095100,0,c09258bd,8c,c09efc38,...) at ip_input+0x662
| netisr_processqueue(e38b6cc4,c064df85,c09eb940,1,c4d03480,...) at netisr_processqueue+0x98
| swi_net(0,0,c0915aee,471,c4d0bd64,...) at swi_net+0xdb
| ithread_loop(c4d0c270,e38b6d38,c0915862,315,c4d56558,...) at ithread_loop+0x1c5
| fork_exit(c063e2d0,c4d0c270,e38b6d38) at fork_exit+0xc5
| fork_trampoline() at fork_trampoline+0x8
| 
| ...
| 
| #0  doadump () at pcpu.h:195
| 195     pcpu.h: No such file or directory.
|          in pcpu.h
| (kgdb) bt
| #0  doadump () at pcpu.h:195
| #1  0xc0659d2c in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
| #2  0xc0659ff0 in panic (fmt=Variable "fmt" is not available.
| ) at /usr/src/sys/kern/kern_shutdown.c:563
| #3  0xc064e699 in _mtx_lock_flags (m=0x0, opts=0, file=0xc0925d40 "/usr/src/sys/net/route.c", line=1305)
|      at /usr/src/sys/kern/kern_mutex.c:178
| #4  0xc06fe28e in rt_check (lrt=0xe38b6970, lrt0=0xe38b698c, dst=0xc55b7d10) at /usr/src/sys/net/route.c:1305
| #5  0xc070282f in arpresolve (ifp=0xc4e27000, rt0=0xc5d33d98, m=0xc50dbe00, dst=0xc55b7d10, desten=0xe38b69a6 "")
|      at /usr/src/sys/netinet/if_ether.c:373
| #6  0xc06f019e in ether_output (ifp=0xc4e27000, m=0xc50dbe00, dst=0xc55b7d10, rt0=0xc5d33d98) at /usr/src/sys/net/if_ethersubr.c:175
| #7  0xc07127a9 in ip_output (m=0xc50dbe00, opt=0x0, ro=0xe38b6a1c, flags=Variable "flags" is not available.
| ) at /usr/src/sys/netinet/ip_output.c:547
| #8  0xc076d6e3 in tcp_output (tp=0xccefbac8) at /usr/src/sys/netinet/tcp_output.c:1125
| #9  0xc076ab87 in tcp_do_segment (m=0xc6095100, th=0xc6095158, so=0xccdb67bc, tp=0xccefbac8, drop_hdrlen=40, tlen=0)
|      at /usr/src/sys/netinet/tcp_input.c:2345
| #10 0xc076bb0e in tcp_input (m=0xc6095100, off0=20) at /usr/src/sys/netinet/tcp_input.c:843
| #11 0xc0710c42 in ip_input (m=0xc6095100) at /usr/src/sys/netinet/ip_input.c:663
| #12 0xc06f9148 in netisr_processqueue (ni=0xc09efc38) at /usr/src/sys/net/netisr.c:143
| #13 0xc06f925b in swi_net (dummy=0x0) at /usr/src/sys/net/netisr.c:256
| #14 0xc063e495 in ithread_loop (arg=0xc4d0c270) at /usr/src/sys/kern/kern_intr.c:1036
| #15 0xc063b845 in fork_exit (callout=0xc063e2d0 <ithread_loop>, arg=0xc4d0c270, frame=0xe38b6d38) at /usr/src/sys/kern/kern_fork.c:797
| #16 0xc0896f80 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:205
| 
| I've been trying to solve this with Craig Rodrigues, and I've tried 
| several patches, without success. The backtrace above happens on the 
| following code from net/route.c:
| 
| 1299     /* XXX BSD/OS checks dst->sa_family != AF_NS */
| 1300     if (rt->rt_flags & RTF_GATEWAY) {
| 1301         struct rtentry *temp_rt_gwroute = rt->rt_gwroute;
| 1302         if (temp_rt_gwroute == NULL)
| 1303             goto lookup;
| 1304         rt = rt->rt_gwroute;
| 1305         RT_LOCK(rt);        /* NB: gwroute */
| 1306         if(rt0->rt_flags & 0x80000000U){
| 1307             /*This rt is under process...*/
| 1308             RT_UNLOCK(rt);
| 1309             RT_UNLOCK(rt0);
| 1310             goto try_again;
| 1311         }
| 1312         if ((rt->rt_flags & RTF_UP) == 0) {
| 1313             rt0->rt_flags |= 0x80000000U;
| 1314             RTFREE_LOCKED(rt);  /* unlock gwroute */
| 1315             rt = rt0;
| 1316         lookup:
| 1317             RT_UNLOCK(rt0);
| 1318             rt = rtalloc1(rt->rt_gateway, 1, 0UL);
| 1319             if (rt == rt0) {
| 1320                 rt0->rt_gwroute = NULL;
| 1321                 RT_REMREF(rt0);
| 1322                 RT_UNLOCK(rt0);
| 1323                 return (ENETUNREACH);
| 1324             }
| 1325             RT_LOCK(rt0);
| 1326             rt0->rt_gwroute = rt;
| 1327             rt0->rt_flags &= (~0x80000000U);
| 1328             if (rt == NULL) {
| 1329                 RT_UNLOCK(rt0);
| 1330                 return (EHOSTUNREACH);
| 1331             }
| 1332         }
| 1333         RT_UNLOCK(rt0);
| 1334     }
| 
| This code contains several workaround patches we tried, without any
| success. The panic is always at the RT_LOCK(rt) line: sometimes it's a NULL
| pointer dereference, sometimes it's an operation on a destroyed mutex.
| 
| This is a critical problem for me, but I believe it's also critical for 
| other users.
| 
| Does anyone have more ideas about how to solve this problem?

Something along the lines of:

Index: sys/net/route.c
===================================================================
RCS file: /usr/local/cvsroot/freebsd/src/sys/net/route.c,v
retrieving revision 1.109.2.3
diff -u -p -r1.109.2.3 route.c
--- sys/net/route.c	25 Feb 2007 05:36:25 -0000	1.109.2.3
+++ sys/net/route.c	27 Sep 2007 02:03:05 -0000
@@ -615,7 +615,8 @@ rtexpunge(struct rtentry *rt)
 	 * we held its last reference.
 	 */
 	if (rt->rt_gwroute) {
-		RTFREE(rt->rt_gwroute);
+		if (rt->rt_gwroute->rt_refcnt)
+			RTFREE(rt->rt_gwroute);
 		rt->rt_gwroute = NULL;
 	}
 
@@ -701,7 +702,8 @@ rtrequest1(int req, struct rt_addrinfo *
 		 * we held its last reference.
 		 */
 		if (rt->rt_gwroute) {
-			RTFREE(rt->rt_gwroute);
+			if (rt->rt_gwroute->rt_refcnt)
+				RTFREE(rt->rt_gwroute);
 			rt->rt_gwroute = NULL;
 		}
 
@@ -822,9 +824,11 @@ rtrequest1(int req, struct rt_addrinfo *
 		 */
 		if (rn == NULL) {
 			if (rt->rt_gwroute)
-				RTFREE(rt->rt_gwroute);
+				if (rt->rt_gwroute->rt_refcnt)
+					RTFREE(rt->rt_gwroute);
 			if (rt->rt_ifa)
-				IFAFREE(rt->rt_ifa);
+				if (rt->rt_ifa->ifa_refcnt)
+					IFAFREE(rt->rt_ifa);
 			Free(rt_key(rt));
 			RT_LOCK_DESTROY(rt);
 			uma_zfree(rtzone, rt);
@@ -1039,7 +1043,8 @@ rt_setgate(struct rtentry *rt, struct so
 			if (rt->rt_gwroute == gwrt) {
 				RT_REMREF(rt->rt_gwroute);
 			} else
-				RTFREE(rt->rt_gwroute);
+				if (rt->rt_gwroute->rt_refcnt)
+					RTFREE(rt->rt_gwroute);
 		}
 
 		if ((rt->rt_gwroute = gwrt) != NULL)

might help.  The problem here was a stale gateway route going away in flight.
You might try to check the refcnt of the route.  This is common to -current
and -stable.  In -stable you can "fix" it by turning off debug.mpsafenet.  Your
panic looks different than this, but it might raise some more questions that
could lead to a solution.  I'd be looking at rt_gwroute->rt_refcnt.
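
To make the refcount guard in the patch above concrete, here is a minimal user-space sketch under stated assumptions: struct route_model, route_free(), route_drop_gateway() and simulate_stale_gateway() are hypothetical names, not kernel APIs, and they model only rt_refcnt, rt_gwroute and RTFREE(). The scenario is two routes that both point at one gateway entry while only one reference was actually counted. The model never frees memory, so reading refcnt after the last reference is always defined here; in the kernel, reading rt_refcnt on an rtentry that has already gone back to its zone is itself unsafe, so the guard avoids dropping a reference nobody holds but does not by itself make following a stale pointer safe.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical user-space model; NOT the kernel's struct rtentry,
 * only enough state to show the refcount guard from the patch.
 */
struct route_model {
	int			refcnt;		/* stands in for rt_refcnt */
	struct route_model	*gwroute;	/* stands in for rt_gwroute */
	int			released;	/* 1 once the last reference is dropped */
};

/* Stand-in for RTFREE(): drop one reference, "release" on the last one. */
static void
route_free(struct route_model *rt)
{
	assert(rt->refcnt > 0);		/* dropping a reference nobody holds */
	if (--rt->refcnt == 0)
		rt->released = 1;	/* the kernel would free the entry here */
}

/*
 * Stand-in for the patched teardown: only drop the gateway reference
 * if one is still counted, then forget the pointer either way.
 */
static void
route_drop_gateway(struct route_model *rt)
{
	if (rt->gwroute != NULL) {
		if (rt->gwroute->refcnt)
			route_free(rt->gwroute);
		rt->gwroute = NULL;
	}
}

/*
 * Two routes share one gateway entry but only one reference was taken
 * (the stale gateway case). Returns the gateway's final refcount.
 */
static int
simulate_stale_gateway(void)
{
	struct route_model gw = { 1, NULL, 0 };
	struct route_model a = { 1, &gw, 0 };
	struct route_model b = { 1, &gw, 0 };

	route_drop_gateway(&a);		/* drops the only counted reference */
	route_drop_gateway(&b);		/* guarded: refcnt is already 0 */
	assert(gw.released == 1);
	assert(a.gwroute == NULL && b.gwroute == NULL);
	return (gw.refcnt);
}
```

Without the `if (rt->gwroute->refcnt)` check, the second teardown would hit the assert in route_free(), which is the user-space analogue of releasing an already-released gateway route.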

Note that I did get a panic like yours before I settled on the above
patch for another issue.  After that, neither that problem nor my others
occurred any more (well, on 6.1 I had to merge in jhb's bpf race fix).  So
you might want to revert your other patches and try just this one.

You should be able to poke around the route structure via kgdb.  On a cool
note, I was using kgdb over IPMI serial-over-LAN to the remote host, and
I could "flip" between various remote hosts :-)

Doug A.


