Date:      Thu, 14 Nov 1996 17:50:51 -0600 (CST)
From:      Karl Denninger  <karl@Mcs.Net>
To:        current@freebsd.org
Subject:   SERIOUS TCP problem in 3.0 and the new compiler
Message-ID:  <199611142350.RAA08015@Jupiter.Mcs.Net>

Hi folks,

I have uncovered a very serious problem in the new compiler and/or libraries 
(at least, I think it is) in the -CURRENT branch.

Unfortunately, I haven't been able to run it down as of yet.

This is what happens:

1)	Open a socket to a server, which forks off a copy of itself after
	accepting the socket connection.
2)	Send LOTS (thousands) of transactions (a "transaction" is defined
	as transmission of one packet of data with a known size and prefix;
	the server end reads it, does something, and responds in some way 
	with data).
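
For readers who want to build a comparable test case, here is a minimal
sketch of one such lock-step transaction. It is not the code from the
affected application. The 4-byte length prefix, and the names HEADER,
recv_exact, send_msg, recv_msg and transact, are all assumptions made for
illustration; the real protocol's prefix format is not given here.

```python
import socket
import struct

# ASSUMPTION: a 4-byte big-endian length prefix. The real prefix is unknown.
HEADER = struct.Struct("!I")

def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv() may return fewer."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

def send_msg(sock, payload):
    """Send one length-prefixed message; sendall() retries short writes."""
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_msg(sock):
    """Receive one length-prefixed message."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)

def transact(sock, request):
    """One lock-step transaction: send a request, then block for the reply."""
    send_msg(sock, request)
    return recv_msg(sock)
```

Calling transact() a few thousand times in a loop against a forked server
gives the traffic pattern described above. Because only one request is ever
outstanding, a single missing reply blocks the loop indefinitely.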

At some point a few thousand transactions into the process, you "lose" one
of the responses.  That is, the process which is doing the serving THINKS it
wrote a response, but the CLIENT never gets it!

Since this is a lock-step protocol, and we're relying on TCP to do the
reliability part of data delivery, and no more than one request can ever be
outstanding in this protocol, you're screwed.  The process locks up hard.
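
When trying to reproduce a hang like this, it can help to make the client
report the stall instead of blocking forever. The following is a sketch
only, under the assumption that the expected reply size is known in
advance; await_reply and its parameters are hypothetical names, not part
of the software in question.

```python
import socket

def await_reply(sock, nbytes, timeout):
    """Read exactly nbytes from sock.

    Returns None if any single recv() waits longer than `timeout` seconds,
    which is how a "lost" response would show up on the client side.
    """
    sock.settimeout(timeout)      # applies to each recv() call below
    buf = b""
    try:
        while len(buf) < nbytes:
            chunk = sock.recv(nbytes - len(buf))
            if not chunk:
                raise ConnectionError("peer closed the connection")
            buf += chunk
    except socket.timeout:
        return None               # stalled: the reply never arrived in time
    finally:
        sock.settimeout(None)     # restore blocking mode
    return buf
```

Logging the transaction count whenever this returns None shows how far
into the run the stall occurs, which can then be lined up against a packet
trace taken on both ends.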

If we recompile under gcc 2.6.3, even running with a 3.0 (-current) kernel,
the problem DOES NOT happen.  If you compile under the current release (as
of 11/11 at least) it *DOES* -- reliably.

In both cases we're linked -static so that the binaries are portable, so 
the shared libraries should not be involved in this at all.

The problem appears to be that somewhere along the line one packet in the
TCP stream gets "lost" and never is delivered to the other end.

Now, I've seen this in rlogin before -- an rlogin'd connection will just
"hang" for no apparent reason, and sending something the other direction
(i.e. "~<CR>", not "~.", which would disconnect me) gets it going again.

In this case the socket in question is full-duplex; we both read and write
from it.  Unfortunately, since the protocol is lock-step, I *can't* prime the
channel at that point with a message in the opposite direction.

This is a pretty serious problem, folks.  I've never seen anything like it
before, but it's definitely real and definitely a problem in the current
source base.

Again, this is NOT a library mismatch issue -- I linked the software in
question static on our codebase machine and was able to reliably reproduce
the problem.

--
Karl Denninger (karl@MCS.Net)| MCSNet - The Finest Internet Connectivity
http://www.mcs.net/~karl     | T1's from $600 monthly to FULL DS-3 Service
			     | 32 Analog Prefixes, 13 ISDN, Web servers $75/mo
Voice: [+1 312 803-MCS1 x219]| Email to "info@mcs.net" WWW: http://www.mcs.net/
Fax:   [+1 312 248-9865]     | 2 FULL DS-3 Internet links; 400Mbps B/W Internal


