Date:      Wed, 16 Jan 2002 15:50:47 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Chad David <davidc@acns.ab.ca>
Cc:        current@freebsd.org
Subject:   Re: socket shutdown delay?
Message-ID:  <3C4611D7.F99A5147@mindspring.com>
References:  <20020116070908.A803@colnta.acns.ab.ca> <3C45F32A.5B517F7E@mindspring.com> <20020116152908.A1476@colnta.acns.ab.ca>

Chad David wrote:
> > A connection goes into FIN_WAIT_2 when it has received the ACK
> > of the FIN, but not received a FIN (or sent an ACK) itself, thus
> > permitting it to enter TIME_WAIT state for 2MSL before proceeding
> > to the CLOSED state, as a result of a server initiated close.
> >
> > A connection goes into LAST_ACK when it has sent a FIN and not
> > received the ACK of the FIN before proceeding to the CLOSED
> > state, as a result of a client initiated close.
> 
> I've got TCP/IP Illustrated V1 right beside me, so I basically
> knew what was happening.  Just not why.
> 
> Like I said in the original email, connections from another machine
> end up in TIME_WAIT right away; it is only local connections.

Maybe there is a bug in the interrupt thread code, or in the
scheduler for NETISR processing.  Like I said before, I think
this is unlikely.

The other possibility is a bug in simultaneous client and
server closes, but without information about how your client
and server programs operate (e.g. if it's an HTTP session,
and the client closes without waiting for a response, or the
server responds and closes), that's as close as I can get
you.  I *really* doubt that, since I think it would have
shown up before.

The other possibility might be the sequence numbers on a
re-used connection going backwards.  If that were to happen,
you might see the state machine push back into LAST_ACK when
it shouldn't.

Be sure that you use the sysctl to set the sequence number
algorithm to the one specified in the RFC, instead of the
broken OpenBSD version that supposedly prevents predictive
session hijack (which should be an application level thing
about verification of the peer, anyway).

Also make sure that the keepalive sysctl is set on (1).
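
If you want to do it from a test program instead of the command
line, here is a minimal sketch using sysctlbyname(3).  The keepalive
knob I mean is net.inet.tcp.always_keepalive; the name of the
ISN-generation knob has moved around, so the commented-out call uses
a placeholder you would have to look up under "sysctl net.inet.tcp"
on your box first:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

/* Set an integer sysctl by name; complains and returns -1 on failure. */
static int
set_int_sysctl(const char *name, int value)
{
    if (sysctlbyname(name, NULL, NULL, &value, sizeof(value)) == -1) {
        perror(name);
        return (-1);
    }
    return (0);
}

int
main(void)
{
    /* Force keepalive on for every TCP connection. */
    set_int_sysctl("net.inet.tcp.always_keepalive", 1);

    /*
     * PLACEHOLDER: select the RFC-style ISN algorithm.  The real
     * sysctl name depends on your kernel; check before using.
     */
    /* set_int_sysctl("net.inet.tcp.<isn-knob>", 1); */

    return (0);
}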


> > Since it's showing IP addresses, you appear to be using real
> > network connections, rather than loopback connections.
> 
> In this case yes.  Connections to 127.0.0.1 result in the same thing.

OK, so it's not lost packets because of the use of the network
driver.  This makes me lean toward the sequence number or RST
with no mbufs available problem.

[ ... test net intentionally lossy ... ]
> Nothing like that on the box.

OK.  It was low-hanging fruit; unlikely, but it had to be
asked.


> > 2)    You have intentionally disabled KEEPALIVE, so that
> >       a close results in an RST instead of a normal
> >       shutdown of the TCP connection (I can't tell if
> >       you are doing a real call to "shutdown(2)", or if
> >       you are just relying on the OS resource tracking
> >       behaviour that is implicit to "close(2)" (but only
> >       if you don't set KEEPALIVE, and have disabled the
> >       sysctl default of always doing KEEPALIVE on every
> >       connection)).  In this case, it's possible that the
> >       RST was lost on the wire, and since RSTs are not
> >       retransmitted, you have shot yourself in the foot.
> >
> >       Note:   You often see this type of foolish foot
> >               shooting when running MAST, WAST, or
> >               webbench, which try to factor out response
> >               speed and measure connection speed, so that
> >               they benchmark the server, not the FS or
> >               other OS latencies in the document delivery
> >               path (which is why these tools suck as real
> >               world benchmarks go).  You could also cause
> >               this (unlikely) with a bad firewall rule.
> 
> I haven't changed any sysctls, and other than SO_REUSEADDR,
> the default sockopts are being used.

This doesn't tell me the setting of the keepalive sysctl.  By
default, KEEPALIVE isn't set on a socket unless the sysctl forces
it on; that sysctl defaults to on, unless it has been changed
locally, or the default has been changed in -current (I don't
know).  So check this one.

> I also do not call
> shutdown() on either end, and both the client and server
> processes have exited and the connections still do not clear
> up (in time they do, around 10 minutes).

You should probably call shutdown(2), if you want your code
to be mostly correct.

You also didn't say that they in fact drain after that
period of time.
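
Something like this in the client is what I have in mind; a sketch
only, assuming "s" is a connected TCP socket and that you don't care
about the data still in flight:

#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Orderly close: send our FIN with shutdown(2), drain the peer's
 * remaining data until we see its FIN (read returns 0), then close(2).
 */
static void
orderly_close(int s)
{
    char buf[512];
    ssize_t n;

    shutdown(s, SHUT_WR);               /* we are done sending */
    while ((n = read(s, buf, sizeof(buf))) > 0)
        ;                               /* discard until peer's FIN */
    close(s);
}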

I suspect that you are just doing a large number of connections.

I frequently ended up with 50,000+ connections in TIME_WAIT
state (I rarely use the same machine for both the client and
the server, since that is not representative of real world
use), and, of course, it takes 2MSL for TIME_WAIT to drain
connections out.

My guess is that you have run out of mbufs (your usage stats
tell me nothing about the available number of real mbufs;
even the "0 requests for memory denied" is not really as
useful as it would appear in the stats), or you just have
an incredibly large number of files open.

FreeBSD's allocation of open file table entries degrades badly
with a large number of simultaneously open files.

Similarly, the FreeBSD allocation of the port space is
a linear lookup, so its cost climbs steeply as the number
of connections goes up.  The same is true of the lookup of
the INPCB and TCPCB on incoming packets.

It would be useful to log state transitions for a connection
case known to be bad -- that is, log the states starting after
the problem has started with a new connection pair or ten, in
order to see what's getting lost where.

> > 3)    You've exhausted your mbufs before you've exhausted
> >       the number of simultaneous connections you are
> >       permitted, because you have incorrectly tuned your
> >       kernel, and therefore all your connections are sitting
> >       in a starvation deadlock, waiting for packets that can
> >       never be sent because there are no mbufs available.
> 
> The client eventually fails with EADDRNOTAVAIL.

Yes, this is the outbound connection limitation caused by running
out of ports.  There are three bugs there in FreeBSD as well, but
they generally just limit the number of outbound connections,
rather than causing other problems.

One tuning variable you probably want on the machine making the
connections is to up the TCP port range to 65535; you will have
to do two sysctls in order to do this.  This will delay your
client failure by roughly a factor of 8-10 in the number of
connections (outbound connections count against the total, but
inbound connections do not, since they do not use up socket/port
pairs on the source side).
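
The two sysctls I mean are net.inet.ip.portrange.first and
net.inet.ip.portrange.last.  A sketch, same sysctlbyname(3) pattern
as the keepalive snippet above (the values are just an example):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
    int first = 1024, last = 65535;

    /* Widen the ephemeral port range used for outbound connects. */
    if (sysctlbyname("net.inet.ip.portrange.first",
        NULL, NULL, &first, sizeof(first)) == -1)
        perror("portrange.first");
    if (sysctlbyname("net.inet.ip.portrange.last",
        NULL, NULL, &last, sizeof(last)) == -1)
        perror("portrange.last");
    return (0);
}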

>         Allocated mbuf types:
>           102 mbufs allocated to data

These are probably TCP options on otherwise idle connections.

>         0% of mbuf map consumed
> mbuf cluster usage:
>         GEN list:       0/0 (in use/in pool)
>         CPU #0 list:    58/86 (in use/in pool)
>         CPU #1 list:    43/88 (in use/in pool)
>         Total:          101/174 (in use/in pool)
>         Maximum number allowed on each CPU list: 128
>         Maximum possible: 33792
>         0% of cluster map consumed
> 420 KBytes of wired memory reserved (54% in use)

I'm not sure if the 54% is of the available or max wired.  If
the max, this could be your problem.

> colnta->netstat -an | grep FIN_WAIT_2 | wc
>     2814   16884  219492
> 
> and a few minutes later:
> colnta->netstat -an | grep FIN_WAIT_2 | wc
>     1434    8604  111852

This indicates a 2MSL draining.  The resource track close could
also be slow.  You could probably get an incredible speedup by
doing explicit closes in the client program, starting with the
highest used fd and working down, instead of going the other
way (it's probably a good idea to modify the FreeBSD resource
track close to do the same thing).
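
By "starting with the highest used fd", I mean something like this
in the client; just an illustration, assuming everything from fd 3
up to "maxfd" is a socket the client owns:

#include <unistd.h>

static void
close_all_descending(int maxfd)
{
    int fd;

    /* Close from the top down, per the suggestion above. */
    for (fd = maxfd; fd >= 3; fd--)
        (void)close(fd);
}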

There are some other inefficiencies in the fd code that can be
addressed... nominally, the allocation is a linear search starting
at the last valid fd and going higher.  For most servers, this could
be significantly improved by linking free fd's in a sparse
list onto a "freelist", and maintaining a pointer to that,
instead of the index to the first free one; but that should only
impact you on allocation (like the inpcb hash, which fails
pretty badly even when you tune the hash size up to some
unreasonable amount, and the port allocation for outbound
connections, which is, frankly, broken.  Both could benefit from
a nice btree overhaul).

The timer code is also pretty sucky, even with a very large
callout wheel.  It would be incredibly valuable to have fixed
interval timers ordered by entry on interval specific lists
(e.g. MSL and 2MSL lists, as well as other common ones), so
that the scan of the timer entries could be stopped at the
first one whose expiration time was after the current time for
the given interval callout.  This would save you almost all of
your callout list traversals, which, with the wheel, have to be
ordered (see the Rice University paper on opportunistic timers
for a glancing approach at solving the real problem here).
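
A rough sketch of the per-interval ordered list idea (userland C,
not real kernel code): because every entry on a given list uses the
same interval, appending at arm time keeps the list sorted by
expiration, and the expiry scan stops at the first unexpired entry.

#include <sys/queue.h>

struct tmo {
    TAILQ_ENTRY(tmo) link;
    long             expire;            /* absolute expiration time */
};
TAILQ_HEAD(tmo_list, tmo);

/* One list per fixed interval; just the 2MSL list is shown here. */
static struct tmo_list msl2_list = TAILQ_HEAD_INITIALIZER(msl2_list);

static void
tmo_arm(struct tmo *t, long now, long interval)
{
    t->expire = now + interval;
    TAILQ_INSERT_TAIL(&msl2_list, t, link); /* list stays sorted */
}

static void
tmo_run(long now)
{
    struct tmo *t;

    /* Stop at the first entry that has not expired yet. */
    while ((t = TAILQ_FIRST(&msl2_list)) != NULL && t->expire <= now) {
        TAILQ_REMOVE(&msl2_list, t, link);
        /* fire the timeout for t here */
    }
}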

These aren't biting you, though, because the quick draining is
happening, indicating that it's not really the timer code or
the other code that's your immediate problem (though you might
speed draining by a factor of 3 just by fixing the timers to
use ordered lists per interval, rather than the callout wheel).


> The box currently has 630MB free memory, and is 98.8% idle.

OK, this means that you aren't getting anywhere near the KVA
limits, and that you aren't eating as much of core as you might
be otherwise.

In practice, you can reserve as much as 50% of physical memory
for use in mbufs, if you are tuned correctly.

The limit that implies, assuming you are sending a lot of data,
is 315MB/32k ~= 10,000 combined client and server connections, or
20,000 server-only connections (if the client is on another machine).
After that, your transmit windows on the server and receive
windows on the client are full.

OOPS.

Halve that for -current, since the default window size was
doubled.

OK... you very well could be hitting the limits here, with
the number of sockets available and the amount of memory you
have to burn in mbufs.

Try setting your max window size down to 16k, or even 8k (you
really want to set the transmit windows large and the receive
windows small for a server, but that's not an option in the
current code, I think; and anyway, since you are running both
on the same machine, that makes it impossible for you to tune
a single machine for optimal performance as only a client or
only a server).
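
The window knobs are net.inet.tcp.sendspace and net.inet.tcp.recvspace;
the same sysctlbyname(3) pattern as above works, e.g.:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
    int space = 16384;          /* the 16k example from above */

    if (sysctlbyname("net.inet.tcp.sendspace",
        NULL, NULL, &space, sizeof(space)) == -1)
        perror("sendspace");
    if (sysctlbyname("net.inet.tcp.recvspace",
        NULL, NULL, &space, sizeof(space)) == -1)
        perror("recvspace");
    return (0);
}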


> I'm not sure what other information would be useful?

See above.

> > 4)    You've got local hacks that you aren't telling us
> >       about (shame on you!).
> 
> Nope.  Stock -current, none of my patches applied.

Heh... "not useful information without a date of cvsup,
and then possibly not even then".  Moving target problems...

Can you repeat this on 4.5RC?  If so, try 4.4-RELEASE.  It
may be related to the SYN cache code.

The SYN-cookie code is vulnerable to the "ACK gun" attack,
and since the SYN cache code falls back into SYN cookie
(it assumes that the reason it didn't find the corresponding
SYN in the SYN cache is that it overflowed and was discarded,
turning naked ACK attempts into SYN-cookie attempts completely
automatically), you might be hitting it that way.

If that's the case, then I suggest leaving the SYN cache
enabled, and disabling the SYN cookie.  If that doesn't fix
it, then you may also want to try disabling the SYN cache.
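
Turning the cookies off should just be another sysctl
(net.inet.tcp.syncookies); whether the SYN cache itself can be
toggled at run time on your -current I don't remember offhand, so
check "sysctl net.inet.tcp" for it.  Same pattern as before:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
    int off = 0;

    /* Leave the SYN cache alone; just turn the cookie fallback off. */
    if (sysctlbyname("net.inet.tcp.syncookies",
        NULL, NULL, &off, sizeof(off)) == -1)
        perror("syncookies");
    return (0);
}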

Other than that, once you've tried this, then I will need to
know what the failure modes are, and then more about the
client and server code (kqueue based?  Standard sockets
based?), and then I can suggest more to narrow it down.

Another thing you may want to try is delaying the close of
the server side of the connection for 1-2 seconds after the
last write.  This is the canonical way of forcing a client
to do the close first in all cases, which totally avoids
the server-side-close-first case, which also avoids the
FIN_WAIT_2.  For real code, you would have to add a "close
cache" and timer.

Hope this helps...

-- Terry




