From: Terry Lambert
Reply-To: tlambert2@mindspring.com
Date: Wed, 05 Sep 2001 02:20:10 -0700
To: Mike Silbersack
Cc: "Vladimir A. Jakovenko", freebsd-net@freebsd.org,
    freebsd-hackers@freebsd.org
Subject: Re: SO_REUSEPORT on unicast UDP sockets
Message-ID: <3B95EE4A.EF204095@mindspring.com>

Mike Silbersack wrote:
> > Similarly, there are a number of bugs in the TCP sockets as
> > well; specifically, there's a problem with all sockets being
> > treated as being in the same collision domain, when doing
> > automatic port assignment.  This limits you to 65535 outbound
> > TCP connections, even though you have multiple IP aliases on
> > an interface (theoretically, you should get 64k connections
> > per IP address, if you bind _not_ to INADDR_ANY, but instead
> > use a specific port, but the hash is broken).
>
> I like this problem's evil sibling: client side TIME_WAITs.  If
> you build them up, you just sit there unable to allocate outgoing
> ports until they time out.

If you fix or work around the source IP address problem, and
patch/tune the kernel for enough outbound sockets, you can go to
250,000 outbound connections very easily.  I used a couple of 1GB
memory systems in this configuration to get my 1M (actually, closer
to 2M) inbound server connections... obviously, a server doesn't
have the port limitation, when it comes to accepting connections.

The client TIME_WAIT problem is more an issue for port reuse; for a
2MSL timer in the standard 60 second range, this will basically
limit you to 65535/60, or ~1000 outbound connections a second per IP
address, as a sustained rate, with a total outstanding count of
65535 * IP_address_count, unless you set SO_REUSEPORT/SO_REUSEADDR.

So for the client side, you are pretty much limited by the bug (or
your fix), and whatever you set the 2MSL timer down to, as a
sustained rate top end.

For most real world uses, apart from test equipment, which will
usually just use raw sockets directly, fake the SYN/ACK for the SYN,
and then accept the ACK without an RST, you never really get up into
this number of client connections on a single box.

> Maybe net or openbsd handle these situations better, I'll have
> to check later.

I doubt it.  Until I did testing on 4.3, no one had really run over
32,766 open sockets in a production server, since at that point, the
ucred reference count overflowed, which would result in some strange
and very hard to identify crashes, when closing those connections.

Alfred fixed this in -current, but it wasn't done consciously to
address a known problem, it was done "just in case" (Alfred finds
problems like that, and fixes them without necessarily being aware
of it... 8-)).  It hadn't been MFC'ed back to 4.3 until I identified
an actual problem, and the root cause.

NetBSD and OpenBSD have some hacks on the server side of the scaling
problem (e.g.
they have each implemented a SYN cache, which is OK as far as it
goes, but is really inferior to the SYN cache and rate halving
algorithm code (also against FreeBSD) out of the Pittsburgh
Supercomputing Center).  I've done a preliminary port of the PSC
code to 4.x, actually, though I would need to strip out a number of
local changes.

One interesting thing about the SYN cache code is that it could use
the tcptmpl allocation until it saw the ACK (or even the first data,
as was suggested by some of the researchers at that startup in
India, a while back, though that's very aggressive).

Mostly, you aren't going to see the hashing on both source and
destination IPs and ports -- what you'd see in an L2/L3 switch, if
you were building one -- which would let you reuse the local pair,
so long as it was associated with a different remote pair.  That's
probably the real long term fix, if there is one.

-- Terry
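
The idea in the last paragraph above (hash on both the local and the
remote address/port, so that a local pair can be reused against a
different remote pair) can be illustrated with a toy table.  This is
only a sketch in Python, not the FreeBSD in_pcb code or the PSC
patch; ConnectionTable, add and lookup are hypothetical names made up
for the example, and the addresses are documentation placeholders.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]             # (IP address, port)
FourTuple = Tuple[str, int, str, int]  # local IP, local port, remote IP, remote port


class ConnectionTable:
    """Toy connection table keyed on the full 4-tuple (hypothetical)."""

    def __init__(self) -> None:
        self._conns: Dict[FourTuple, str] = {}

    def add(self, local: Endpoint, remote: Endpoint,
            state: str = "ESTABLISHED") -> bool:
        """Register a connection; refuse only an exact 4-tuple collision."""
        key = (local[0], local[1], remote[0], remote[1])
        if key in self._conns:
            return False
        self._conns[key] = state
        return True

    def lookup(self, local: Endpoint, remote: Endpoint) -> Optional[str]:
        """Return the state for this 4-tuple, or None if it is unknown."""
        return self._conns.get((local[0], local[1], remote[0], remote[1]))


table = ConnectionTable()
local = ("10.0.0.1", 40000)

# Same local pair, two different remote pairs: both are accepted.
first = table.add(local, ("192.0.2.10", 80))
second = table.add(local, ("192.0.2.11", 80))

# Same local pair AND same remote pair: this is the only real collision.
duplicate = table.add(local, ("192.0.2.10", 80))
```

With a table keyed only on the local pair, the second add() would be
refused; keying on all four values is what lets the local port be
shared across distinct remote endpoints.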