Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 24 Nov 1999 12:36:08 -0800 (PST)
From:      John Polstra <jdp@polstra.com>
To:        wollman@khavrinen.lcs.mit.edu
Cc:        current@freebsd.org
Subject:   Re: Route table leaks
Message-ID:  <199911242036.MAA24659@vashon.polstra.com>
In-Reply-To: <199911221552.KAA84691@khavrinen.lcs.mit.edu>
References:  <199911220150.UAA78559@khavrinen.lcs.mit.edu> <XFMail.991121195840.jdp@polstra.com> <199911221552.KAA84691@khavrinen.lcs.mit.edu>

next in thread | previous in thread | raw e-mail | index | archive | help
I've been working on the cvsup-master route table leaks.  I haven't
found the bug yet, but I've got some clues now.  If this info inspires
a Eureka! from any of you, please let me know.

I started by running this script to print out key information every
2 seconds while I ran a test:

    #! /bin/sh

    while :; do
	    date "+%H:%M:%S"
	    vmstat -m | grep 'routetbl[     ]' || exit
	    netstat -ran | egrep 'default|206\.213\.73\.12' || exit
	    echo "=============================" || exit
	    sleep 2 || exit
    done

(Yes, this is crude.  But remember, the machine is 900 miles away and
I'm trying not to disrupt service too much.)

The output began like this:

19:41:56
     routetbl  3682   518K    522K 21221K    11473    0     0  16,32,64,128,256
default            204.216.27.17      UGc      1823       60      wb0
=============================
19:41:58
     routetbl  3682   518K    522K 21221K    11473    0     0  16,32,64,128,256
default            204.216.27.17      UGc      1823       60      wb0

I.e., there were 3682 route table structures in use, and 1823
references to the default route.  Then I made a connection to the
CVSup server from one of my own machines (206.213.73.12):

19:42:00
     routetbl  3684   518K    522K 21221K    11475    0     0  16,32,64,128,256
default            204.216.27.17      UGc      1824       60      wb0
206.213.73.12      204.216.27.17      UGHW        1       64      wb0

So far, so good.  A cloned route was created and the refcount on the
default route was incremented by one.  Two new route table entries
were allocated, and that seems to be normal and OK.

I immediately closed the connection, and it entered the TIME_WAIT
state on cvsup-master.  The script output remained as above for 60
seconds (2 * MSL), after which it changed to this:

19:42:59
     routetbl  3684   518K    522K 21221K    11475    0     0  16,32,64,128,256
default            204.216.27.17      UGc      1824       62      wb0
206.213.73.12      204.216.27.17      UGHW3       0       64      wb0   3600

This still looks OK.  The cloned route has gained the "3" flag and
a 1-hour expiration time.  That is because the TIME_WAIT state has
ended, the PCB has been discarded, and the cloned route is now being
managed by the caching code in "sys/netinet/in_rmx.c".  (That's what
the "3" flag signifies.)  The basic idea of this code (as I understand
it) is to keep cloned routes for dead connections around for awhile
in case they are needed again soon.  That's useful since the cloned
routes contain RTT estimates and so forth.

Now we get to the interesting part.  One would expect the route to
remain cached for 3600 seconds.  There are ways that in_rmx.c can
expire it sooner than that, but I confirmed that those situations
(e.g., too many cached routes) aren't arising.  Nevertheless, the
route is deleted after roughly another 200 seconds:

19:46:14
     routetbl  3684   518K    522K 21221K    11478    0     0  16,32,64,128,256
default            204.216.27.17      UGc      1824       64      wb0
206.213.73.12      204.216.27.17      UGHW3       0       64      wb0   3405
=============================
19:46:16
     routetbl  3682   518K    522K 21221K    11480    0     0  16,32,64,128,256
default            204.216.27.17      UGc      1823       64      wb0

Using DDB and some of routed's tracing options I determined that
routed is deleting the route.  More on that later.  Anyway, given
that the route is being deleted, things still look OK in the above.
The route is deleted, the 2 route table entries are freed again, and
the refcount on the default route is decremented back to its
original value.  But look what happens in the next 2 seconds:

19:46:18
     routetbl  3684   518K    522K 21221K    11482    0     0  16,32,64,128,256
default            204.216.27.17      UGc      1824       64      wb0

The 2 entries were allocated again and the refcount on the default
route was incremented.  Why?  I don't know (yet).  But the numbers
remain that way thereafter.  That's the leak, and I can reproduce it
reliably on cvsup-master.

Unfortunately, I cannot reproduce the problem here on my own -current
machine.  I tried to simulate the environment as accurately as
possible, including running routed.  On my machine, routed deletes the
cached route before it has expired too, but the leak doesn't happen.

One other thing.  Back on cvsup-master, I changed rc.conf so that it
sets the default route statically, and I disabled routed.  That has
completely eliminated the route table leak.

Any ideas?  Using DDB remotely through a console switch really isn't
much fun.  I'd prefer a Eureka! from somebody. :-)

John
-- 
  John Polstra                                               jdp@polstra.com
  John D. Polstra & Co., Inc.                        Seattle, Washington USA
  "No matter how cynical I get, I just can't keep up."        -- Nora Ephron


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199911242036.MAA24659>