From owner-freebsd-hackers  Sat May 25 15:37:47 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.5/8.7.3) id PAA18167
          for hackers-outgoing; Sat, 25 May 1996 15:37:47 -0700 (PDT)
Received: from Root.COM (implode.Root.COM [198.145.90.17])
          by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id PAA18162
          for <hackers@FreeBSD.ORG>; Sat, 25 May 1996 15:37:43 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1]) by Root.COM (8.7.5/8.6.5) with SMTP id PAA23150; Sat, 25 May 1996 15:37:47 -0700 (PDT)
Message-Id: <199605252237.PAA23150@Root.COM>
X-Authentication-Warning: implode.Root.COM: Host localhost [127.0.0.1] didn't use HELO protocol
To: "Karl Denninger, MCSNet" <karl@mcs.com>
cc: hackers@FreeBSD.ORG
Subject: Re: Grrr.. is this is a FreeBSD problem (TIME_WAIT again) 
In-reply-to: Your message of "Sat, 25 May 1996 16:20:41 CDT."
             <m0uNQlN-000IDOC@venus.mcs.com> 
From: David Greenman <davidg@Root.COM>
Reply-To: davidg@Root.COM
Date: Sat, 25 May 1996 15:37:47 -0700
Sender: owner-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

>If the caller and callee are on DIFFERENT machines, I get no stale sockets.
>This is reliable even if there are tens of new connections per minute.
>
>If the caller and callee are on the SAME machine, I get sockets in TIME_WAIT
>for 2 minutes each (grrrr) which, if the traffic is heavy enough, eventually
>blocks new connections for a few minutes until they clear up.  None of the 
>sockets in TIME_WAIT has output or input pending; both counts show zero.
>
>This is a serious problem!
>
>Interestingly enough, I can switch the end of the link which "netstat" thinks
>is the "local" end by changing who calls shutdown() first!  This is also
>unexpected; I would have thought that the caller ALWAYS would be the "local"
>side of the connection.
>
>I've checked and rechecked -- the same code, running across two machines,
>does not do this.  But when the calling and called code are on the same
>system (2.1-STABLE) it does -- repeatedly and reliably.
>
>Any ideas?  While one solution would be to get the code off the same
>(common) machine, there are reasons that I don't want to do this in normal
>production.  But, I need to use TCP (rather than local Unix domain sockets)
>because the BACKUP server is on a different system (in the event the first
>one crashes).
>
>Why would this happen when the caller and callee are on the same box, but
>not when the traffic actually goes across the network?  Has anyone else seen
>anything like this in their experience?  Due to the structure of this module
>(it's a drop-in into a stock daemon from another source) I cannot leave the
>socket open across requests, and I'd like to understand the reason for
>this behavior anyway.

   Based on what you've said thus far, it's working as it is supposed to.
There is a good discussion of the 2MSL wait ("TIME_WAIT") in "TCP/IP
Illustrated Volume 1", page 242, by W. Richard Stevens. Depending on how
your program handles its ports/connections, you might be able to use the
SO_REUSEADDR socket option to avoid the problem. See page 244.
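
   For reference, here is a minimal sketch of setting SO_REUSEADDR, assuming
the failure is a bind() that is refused while old connections on the same
port are still in TIME_WAIT. The helper name make_reusable_listener and the
choice of the loopback address are only illustrative and are not taken from
your code. Note that the option does not remove the TIME_WAIT entries; it
only allows the local address to be bound again while they exist.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: return a listening TCP socket bound to the
 * loopback address on the given port (0 lets the kernel pick one),
 * or -1 on failure.
 */
int
make_reusable_listener(unsigned short port)
{
	struct sockaddr_in sin;
	int fd, on = 1;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	/* The option has to be set before bind() to affect the bind. */
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
		close(fd);
		return -1;
	}

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(port);

	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    listen(fd, 5) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

   A caller would use the returned descriptor with accept() as usual and
close() it when done. Whether this helps depends on which call is actually
failing in your module, so treat it as a starting point only.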

-DG

David Greenman
Core-team/Principal Architect, The FreeBSD Project