From owner-freebsd-current Thu Nov 14 15:53:34 1996
Return-Path: owner-current
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id PAA21838 for current-outgoing; Thu, 14 Nov 1996 15:53:34 -0800 (PST)
Received: from Kitten.mcs.com (Kitten.mcs.com [192.160.127.90]) by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id PAA21824 for ; Thu, 14 Nov 1996 15:53:23 -0800 (PST)
Received: from Mailbox.mcs.com (Mailbox.mcs.com [192.160.127.87]) by Kitten.mcs.com (8.8.2/8.8.2) with ESMTP id RAA03473 for ; Thu, 14 Nov 1996 17:50:53 -0600 (CST)
Received: from Jupiter.Mcs.Net (karl@Jupiter.mcs.net [192.160.127.88]) by Mailbox.mcs.com (8.8.2/8.8.2) with ESMTP id RAA19049 for ; Thu, 14 Nov 1996 17:50:52 -0600 (CST)
Received: (from karl@localhost) by Jupiter.Mcs.Net (8.8.2/8.8.2) id RAA08015 for current@freebsd.org; Thu, 14 Nov 1996 17:50:51 -0600 (CST)
From: Karl Denninger
Message-Id: <199611142350.RAA08015@Jupiter.Mcs.Net>
Subject: SERIOUS TCP problem in 3.0 and the new compiler
To: current@freebsd.org
Date: Thu, 14 Nov 1996 17:50:51 -0600 (CST)
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Hi folks,

I have uncovered a very serious problem in the new compiler and/or libraries (at least, I think it is) in the -CURRENT branch. Unfortunately, I haven't been able to run it down as of yet.

This is what happens:

1) Open a socket to a server, which forks off a copy of itself after accepting the socket connection.

2) Send LOTS (thousands) of transactions (a "transaction" is defined as transmission of one packet of data with a known size and prefix, the server end reads it, does something, and responds in some way with data).

At some point a few thousand transactions into the process, you "lose" one of the responses. That is, the process which is doing the serving THINKS it wrote a response, but the CLIENT never gets it!
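
For anyone who wants to poke at this, the transaction pattern in steps 1 and 2 can be sketched roughly as below. This is not the software in question, only a minimal illustration under stated assumptions: the 4-byte length prefix, the upper-casing "server", the function names, and the use of a thread over a local socket pair (instead of a forked server on a TCP connection) are all made up for the example. The one thing it adds is a receive timeout on the client, so a missing response is reported instead of hanging the process.

```python
import socket
import struct
import threading

# Assumed framing: a 4-byte big-endian length prefix before each payload.
HEADER = struct.Struct("!I")


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes first."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def send_frame(sock, payload):
    """Send one length-prefixed packet."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_frame(sock):
    """Receive one length-prefixed packet and return its payload."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


def serve(sock):
    """Hypothetical server: one reply per request until the client leaves."""
    try:
        while True:
            send_frame(sock, recv_frame(sock).upper())
    except ConnectionError:
        pass
    finally:
        sock.close()


def run_transactions(count, timeout=5.0):
    """Run `count` lock-step transactions; return how many completed."""
    # A local socket pair stands in for the real TCP connection here.
    client, server = socket.socketpair()
    worker = threading.Thread(target=serve, args=(server,), daemon=True)
    worker.start()
    client.settimeout(timeout)
    done = 0
    try:
        for i in range(count):
            request = b"req %d" % i
            send_frame(client, request)
            try:
                # Lock-step: nothing else is sent until this reply arrives.
                reply = recv_frame(client)
            except socket.timeout:
                print("transaction %d: no reply within %.1fs" % (i, timeout))
                break
            if reply != request.upper():
                raise RuntimeError("unexpected reply to transaction %d" % i)
            done += 1
    finally:
        client.close()
    worker.join(timeout)
    return done
```

On a healthy system `run_transactions(5000)` should simply return 5000. A smaller return value together with the timeout message would correspond to the kind of stall described above, although this sketch by itself says nothing about where such a stall comes from.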

Since this is a lock-step protocol, and we're relying on TCP to do the reliability part of data delivery, and no more than one request can ever be outstanding in this protocol, you're screwed. The process locks up hard.

If we recompile under gcc 2.6.3, even running with a 3.0 (-current) kernel, the problem DOES NOT happen. If you compile under the current release (as of 11/11 at least) it *DOES* -- reliably.

In both cases we're linked -static so that the binaries are portable, so the shared libraries should not be involved in this at all.

The problem appears to be that somewhere along the line one packet in the TCP stream gets "lost" and never is delivered to the other end.

Now, I've seen this in rlogin before -- an rlogin'd connection will just "hang" for no apparent reason, and sending something the other direction (i.e. "~", not "~.", which would disconnect me) gets it going again.

In this case the socket in question is full-duplex; we both read and write from it. Unfortunately, since the protocol is lock-step, I *can't* prime the channel at that point with a message in the opposite direction.

This is a pretty serious problem, folks. I've never seen anything like it before, but it's definitely real and definitely a problem in the current source base.

Again, this is NOT a library mismatch issue -- I linked the software in question static on our codebase machine and was able to reliably reproduce the problem.

--
Karl Denninger (karl@MCS.Net)| MCSNet - The Finest Internet Connectivity
http://www.mcs.net/~karl     | T1's from $600 monthly to FULL DS-3 Service
                             | 32 Analog Prefixes, 13 ISDN, Web servers $75/mo
Voice: [+1 312 803-MCS1 x219]| Email to "info@mcs.net" WWW: http://www.mcs.net/
Fax:   [+1 312 248-9865]     | 2 FULL DS-3 Internet links; 400Mbps B/W Internal