From: Pieter de Goeje
Organization: Service2Media
To: freebsd-hackers@freebsd.org
Cc: Sergey Babkin
Date: Mon, 12 Jul 2010 13:33:48 +0200
Subject: Re: TCP over UDP
Message-Id: <201007121333.49017.pdegoeje@service2media.com>
In-Reply-To: <4C386208.291D2FB5@verizon.net>
List-Id: Technical Discussions relating to FreeBSD

On Saturday 10 July 2010 14:05:29 Sergey Babkin wrote:
> Hi guys,
>
> I've got this idea, and I wonder if anyone has done it already,
> and if not then why. The idea is to put the TCP logic over UDP.
>
> I've done some googling and all I've found is some academical
> user-space implementations of TCP that actually try to interoperate
> with "real" TCP. What I'm thinking about is different.
> It's to use the TCP-derived logic as a portable library that would
> do the good flow control, retransmitting, delivery confirmations
> etc over UDP.
>
> Basically, every time you use UDP, you've got to reinvent your
> own retransmission and reliability protocol. And these protocols
> are typically no good at all, as the story with NFS switching
> from UDP to TCP and improving the performance shows. At the same
> time TCP provides a very good transport control logic, so why not
> just reuse this logic in a library to solve the UDP issues once
> and for all?
>
> Then of course, why not just use TCP? The problem of TCP is that
> it's expensive. It uses the kernel memory for its contexts.
> It also requires a file descriptor per each connection. The file
> descriptors are an expensive resource, and besides, even if
> the limit is raised, there is the issue with historic select()
> fd_set allocating only 1024 bits and nobody checking for the
> overflow. Even if your own code is carefully designed to avoid using
> select() at all and/or create large enough bitmasks, it could
> always happen to use some stupid library that doesn't do that
> and causes the interesting one-bit memory corruptions.
>
> Moving the connection logic to the user space makes the connections
> cheap. A hundred bytes or so per connection state is no big
> deal, you can easily create a million of these connections to
> the same process. All the state stays in the user-space pageable
> memory. Well, all of them sending data at the same time
> might not work so well, but caching a large number of currently
> inactive connections becomes cheap. Think of XMLRPC or SOAP
> or anything else over HTTP reusing the same TCP connection for
> multiple sequential requests. Now there is a painful balance
> of inactivity timeouts: make them too long and you
> overload the server, make them too short and the connections
> get dropped all the time.
> The cheap connections would allow
> to keep the much longer timeouts.
>
> Then there are other interesting possibilities arising from the easy
> access to the protocol state. The underlying datagramness can be
> exposed to the top level, and this immediately gives the transactional
> TCP. Or we could look at the state and find out if the data has
> been actually delivered to and confirmed by the other side.
> Or we can even drop the inactive connections at the server without
> notifying the client. Then if the client sends more requests on this
> connection, the server could semi-transparently re-establish it
> (OK, this would require an extension from TCP). Or we can do
> the better keep-alives, not the TCP's hour-long ones, but
> something within a few seconds (would not work too well with
> millions of connections, but it's a different use case where
> we want to detect the lost peer fast). Or having "sub-channels",
> each with its own sequence number. If the data gets transferred
> over 100 parallel logical connections, few bytes at a time for
> each of them, combining the whole bunch into one datagram would
> be much more efficient than sending 100 datagrams. These are just
> the ideas off the bat, there's got to be more of these interesting
> usages.
>
> It all looks like such an obviously good idea, that I wonder,
> why didn't anyone else try it before? Or have they tried
> it and found that it's not such a good idea after all?
>
> -SB

TCP actually scales pretty well. All modern operating systems provide
an efficient replacement for select(), for example FreeBSD's kqueue.
With a small bit of tuning one can effectively deal with 100k+ TCP
connections on a single system. The tuning mainly consists of increasing
the maximum number of file descriptors and decreasing the maximum
send/receive buffer sizes to conserve memory.

TCP provides very good throughput, and it achieves this using large send
and receive buffers.
Your userspace implementation will need to implement something similar.
A few hundred bytes per connection is simply not enough.

If you want to deal with millions of clients, your protocol had better
not keep any state at all. A good example of this is DNS.

I think that most applications can either use TCP directly, with or
without tuning, or they have such specialized needs that a custom
protocol is the only solution.

Regards,

Pieter de Goeje
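
The tuning mentioned above (more file descriptors, smaller socket
buffers) can be sketched roughly as follows. This is only an
illustration, not a recipe from the thread: the function names are
made up for this example, the 8 KiB buffer size is an arbitrary
assumption, and the kernel is free to round or clamp whatever buffer
size is requested. System-wide limits (such as sysctl settings) are
outside the scope of this sketch.

```python
import resource
import socket


def raise_fd_limit(wanted):
    """Raise this process's soft file-descriptor limit towards `wanted`.

    Hypothetical helper: never goes above the hard limit, and does
    nothing if the soft limit is already high enough.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


def shrink_buffers(sock, size):
    """Request smaller send/receive buffers on one socket.

    Returns the sizes the kernel reports afterwards, which may differ
    from the requested size.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))


def buffer_memory_bytes(connections, sndbuf, rcvbuf):
    """Worst-case buffer memory if every buffer on every socket is full."""
    return connections * (sndbuf + rcvbuf)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("granted buffers:", shrink_buffers(s, 8192))
    print("worst case for 100k connections:",
          buffer_memory_bytes(100_000, 8192, 8192), "bytes")
```

With these assumed numbers, 100,000 connections at 8 KiB per direction
come to 1,638,400,000 bytes (about 1.5 GiB) in the worst case, which is
why shrinking the buffers matters and why a few hundred bytes per
connection cannot cover buffering for a protocol that wants TCP-like
throughput.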