From: Pieter de Goeje <pdegoeje@service2media.com>
Organization: Service2Media
To: freebsd-hackers@freebsd.org
Date: Mon, 12 Jul 2010 13:33:48 +0200
References: <4C386208.291D2FB5@verizon.net>
In-Reply-To: <4C386208.291D2FB5@verizon.net>
Cc: Sergey Babkin <babkin@verizon.net>
Subject: Re: TCP over UDP

On Saturday 10 July 2010 14:05:29 Sergey Babkin wrote:
> Hi guys,
> 
> I've got this idea, and I wonder if anyone has done it already,
> and if not then why. The idea is to put the TCP logic over UDP.
> 
> I've done some googling and all I've found is some academic
> user-space implementations of TCP that actually try to interoperate
> with "real" TCP. What I'm thinking about is different. It's
> to use the TCP-derived logic as a portable library that would
> do the good flow control, retransmitting, delivery confirmations
> etc over UDP.
> 
> Basically, every time you use UDP, you've got to reinvent your
> own retransmission and reliability protocol. And these protocols
> are typically no good at all, as the story with NFS switching
> from UDP to TCP and improving the performance shows. At the same
> time TCP provides a very good transport control logic, so why not
> just reuse this logic in a library to solve the UDP issues once
> and for all?
> 
> Then of course, why not just use TCP? The problem of TCP is that
> it's expensive. It uses the kernel memory for its contexts.
> It also requires a file descriptor per each connection. The file
> descriptors are an expensive resource, and besides, even if
> the limit is raised, there is the issue with historic select()
> fd_set allocating only 1024 bits and nobody checking for the
> overflow. Even if your own code is carefully designed to avoid using
> select() at all and/or create large enough bitmasks, it could
> always happen to use some stupid library that doesn't do that
> and causes the interesting one-bit memory corruptions.
> 
> Moving the connection logic to the user space makes the connections
> cheap. A hundred bytes or so per connection state is no big
> deal, you can easily create a million of these connections to
> the same process. All the state stays in the user-space pageable
> memory. Well, all of them sending data at the same time
> might not work so well, but caching a large number of currently
> inactive connections becomes cheap. Think of XMLRPC or SOAP
> or anything else over HTTP reusing the same TCP connection for
> multiple sequential requests. Now there is a painful balance
> of inactivity timeouts: make them too long and you
> overload the server, make them too short and the connections
> get dropped all the time. Cheap connections would allow
> keeping much longer timeouts.
> 
> Then there are other interesting possibilities arising from the easy
> access to the protocol state. The underlying datagramness can be
> exposed to the top level, and this immediately gives the transactional
> TCP. Or we could look at the state and find out if the data has
> been actually delivered to and confirmed by the other side.
> Or we can even drop the inactive connections at the server without
> notifying the client. Then if the client sends more requests on this
> connection, the server could semi-transparently re-establish it
> (OK, this would require an extension from TCP). Or we can do
> the better keep-alives, not the TCP's hour-long ones, but
> something within a few seconds (would not work too well with
> millions of connections, but it's a different use case where
> we want to detect the lost peer fast). Or having "sub-channels",
> each with its own sequence number. If the data gets transferred
> over 100 parallel logical connections, few bytes at a time for
> each of them, combining the whole bunch into one datagram would
> be much more efficient than sending 100 datagrams. These are just
> the ideas off the bat, there's got to be more of these interesting
> usages.
> 
> It all looks like such an obviously good idea, that I wonder,
> why didn't anyone else try it before? Or have they tried
> it and found that it's not such a good idea after all?
> 
> -SB

TCP actually scales pretty well. All modern operating systems provide an 
efficient alternative to select(), for example FreeBSD's kqueue. With a 
little tuning, a single system can handle 100k+ TCP connections. The tuning 
mainly consists of raising the maximum number of file descriptors and 
lowering the maximum send/receive buffer sizes to conserve memory.
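
To make that concrete, here is a minimal sketch of such an event loop. It 
uses Python's standard selectors module, whose DefaultSelector picks kqueue 
on FreeBSD (and epoll on Linux). The names make_listener, shrink_buffers and 
serve_once, and the 4096-byte buffer size, are my own assumptions for 
illustration, not tuned values.

```python
import selectors
import socket

BUF_SIZE = 4096  # assumed small per-socket buffer, for illustration only


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening TCP socket (port 0 = any free port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(128)
    srv.setblocking(False)
    return srv


def shrink_buffers(conn, size=BUF_SIZE):
    """Ask the kernel for smaller send/receive buffers to save memory.

    The kernel may round or clamp the requested size.
    """
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)


def serve_once(sel, srv, timeout=1.0):
    """Run one pass of the event loop; return the number of ready sockets."""
    events = sel.select(timeout)
    for key, _mask in events:
        if key.fileobj is srv:
            # New connection: accept it and watch it for incoming data.
            conn, _addr = srv.accept()
            conn.setblocking(False)
            shrink_buffers(conn)
            sel.register(conn, selectors.EVENT_READ)
        else:
            conn = key.fileobj
            data = conn.recv(BUF_SIZE)
            if data:
                # Echo back. A real server would queue the data and wait
                # for EVENT_WRITE instead of assuming the send succeeds.
                conn.sendall(data)
            else:
                # Peer closed the connection.
                sel.unregister(conn)
                conn.close()
    return len(events)
```

Only the sockets that are actually ready are returned by sel.select(), so 
the cost per pass does not grow with the number of idle connections the way 
a select() bitmask scan does.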

TCP provides very good throughput, and it achieves this by using large send 
and receive buffers, since unacknowledged data has to be kept around for 
retransmission. Your userspace implementation will need something similar, 
so a few hundred bytes per connection is simply not enough.
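
A rough back-of-the-envelope calculation shows why. The sender can have at 
most one window of unacknowledged data in flight per round trip, so 
throughput is bounded by window / RTT. The function names and the example 
figures (100 Mbit/s link, 50 ms RTT, 16 KiB buffers) below are assumptions 
picked only for illustration.

```python
def bandwidth_delay_product(bits_per_second, rtt_ms):
    """Bytes that must be in flight (and buffered) to keep the link full."""
    return bits_per_second * rtt_ms // 8000


def max_throughput(window_bytes, rtt_ms):
    """Upper bound on throughput in bits/s for a given window size."""
    return window_bytes * 8 * 1000 // rtt_ms


def total_buffer_memory(connections, send_buf, recv_buf):
    """Worst-case buffer memory in bytes if every buffer is full."""
    return connections * (send_buf + recv_buf)


# Assumed example: 100 Mbit/s link with a 50 ms round-trip time.
print(bandwidth_delay_product(100_000_000, 50))   # 625000 bytes
# A 200-byte window on the same path caps out far below the link speed.
print(max_throughput(200, 50))                    # 32000 bits/s
# 100k connections with 16 KiB send and receive buffers each.
print(total_buffer_memory(100_000, 16384, 16384)) # 3276800000 bytes
```

So the per-connection cost is dominated by the buffers, not by the protocol 
state, regardless of whether they live in the kernel or in userspace.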

If you want to deal with millions of clients, your protocol had better not 
keep any per-client state at all. A good example of this is DNS.
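
As a sketch of what "no state" means in practice: each datagram carries 
everything needed to answer it, and the server forgets the client as soon as 
the reply is sent. The toy protocol here (reply with the request in upper 
case) and the names handle and serve_one are hypothetical, chosen only to 
keep the example short. This is not how a DNS server is implemented.

```python
import socket


def handle(request: bytes) -> bytes:
    """Compute the reply from the request alone (toy protocol: uppercase)."""
    return request.upper()


def serve_one(sock):
    """Answer a single datagram; nothing about the client is remembered."""
    data, addr = sock.recvfrom(2048)
    sock.sendto(handle(data), addr)


if __name__ == "__main__":
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0 = any free port

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(5.0)
    client.sendto(b"hello", server.getsockname())

    serve_one(server)
    reply, _ = client.recvfrom(2048)
    print(reply)

    client.close()
    server.close()
```

One socket and one file descriptor serve every client, and memory use does 
not depend on how many clients exist. Retransmission is then the client's 
problem: it simply asks again if no reply arrives.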

I think most applications can either use TCP directly, with or without 
tuning, or have such specialized needs that a custom protocol is the only 
solution.

Regards,
Pieter de Goeje