Date:      Mon, 12 Jul 2010 13:33:48 +0200
From:      Pieter de Goeje <pdegoeje@service2media.com>
To:        freebsd-hackers@freebsd.org
Cc:        Sergey Babkin <babkin@verizon.net>
Subject:   Re: TCP over UDP
Message-ID:  <201007121333.49017.pdegoeje@service2media.com>
In-Reply-To: <4C386208.291D2FB5@verizon.net>
References:  <4C386208.291D2FB5@verizon.net>

On Saturday 10 July 2010 14:05:29 Sergey Babkin wrote:
> Hi guys,
> 
> I've got this idea, and I wonder if anyone has done it already,
> and if not then why. The idea is to put the TCP logic over UDP.
> 
> I've done some googling and all I've found is some academic
> user-space implementations of TCP that actually try to interoperate
> with "real" TCP. What I'm thinking about is different. It's
> to use the TCP-derived logic as a portable library that would
> do the good flow control, retransmitting, delivery confirmations
> etc over UDP.
> 
> Basically, every time you use UDP, you've got to reinvent your
> own retransmission and reliability protocol. And these protocols
> are typically no good at all, as the story of NFS switching
> from UDP to TCP and improving performance shows. At the same
> time TCP provides a very good transport control logic, so why not
> just reuse this logic in a library to solve the UDP issues once
> and for all?
> 
> Then of course, why not just use TCP? The problem of TCP is that
> it's expensive. It uses the kernel memory for its contexts.
> It also requires a file descriptor for each connection. File
> descriptors are an expensive resource, and besides, even if
> the limit is raised, there is the issue of the historic select()
> fd_set allocating only 1024 bits and nobody checking for the
> overflow. Even if your own code is carefully designed to avoid
> select() entirely and/or to create large enough bitmasks, it could
> always pull in some careless library that doesn't, causing
> interesting one-bit memory corruptions.
> 
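To make that overflow hazard concrete, a minimal sketch (the descriptor
value here is made up; FD_SETSIZE defaults to 1024):

    #include <sys/select.h>
    #include <stdio.h>

    int main(void) {
        fd_set set;
        int fd = 1500;  /* hypothetical descriptor above FD_SETSIZE */

        FD_ZERO(&set);
        if (fd >= FD_SETSIZE) {
            /* FD_SET(fd, &set) here would flip one bit past the end
               of 'set', the silent corruption described above. */
            fprintf(stderr, "fd %d >= FD_SETSIZE %d\n", fd, FD_SETSIZE);
            return 1;
        }
        FD_SET(fd, &set);
        return 0;
    }
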
> Moving the connection logic to the user space makes the connections
> cheap. A hundred bytes or so per connection state is no big
> deal, you can easily create a million of these connections to
> the same process. All the state stays in the user-space pageable
> memory. Well, all of them sending data at the same time
> might not work so well, but caching a large number of currently
> inactive connections becomes cheap. Think of XMLRPC or SOAP
> or anything else over HTTP reusing the same TCP connection for
> multiple sequential requests. Now there is a painful balancing act
> with inactivity timeouts: make them too long and you
> overload the server; make them too short and the connections
> get dropped all the time. Cheap connections would allow
> keeping much longer timeouts.
> 
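As a sketch of how small such a state could be (the fields here are
illustrative, not a real protocol definition):

    #include <sys/types.h>
    #include <stdint.h>
    #include <netinet/in.h>

    /* Hypothetical per-connection state for a TCP-over-UDP library. */
    struct udp_conn {
        struct sockaddr_in peer;  /* remote address and port        */
        uint32_t snd_nxt;         /* next sequence number to send   */
        uint32_t snd_una;         /* oldest unacknowledged sequence */
        uint32_t rcv_nxt;         /* next sequence number expected  */
        uint32_t cwnd;            /* congestion window, in bytes    */
        uint32_t rto_ms;          /* current retransmission timeout */
        uint64_t last_active;     /* timestamp for idle caching     */
    };  /* roughly 50 bytes: pageable, and no file descriptor needed */
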
> Then there are other interesting possibilities arising from the easy
> access to the protocol state. The underlying datagramness can be
> exposed to the top level, and this immediately gives transactional
> TCP. Or we could look at the state and find out whether the data has
> been actually delivered to and confirmed by the other side.
> Or we can even drop the inactive connections at the server without
> notifying the client. Then if the client sends more requests on this
> connection, the server could semi-transparently re-establish it
> (OK, this would require an extension to TCP). Or we can do
> better keep-alives, not TCP's hours-long ones, but
> something within a few seconds (this would not work too well with
> millions of connections, but that's a different use case, where
> we want to detect a lost peer fast). Or we could have "sub-channels",
> each with its own sequence number: if the data gets transferred
> over 100 parallel logical connections, a few bytes at a time for
> each of them, combining the whole bunch into one datagram would
> be much more efficient than sending 100 datagrams. These are just
> ideas off the top of my head; there are bound to be more interesting
> uses.
> 
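On the wire, the sub-channel idea might look something like this (a
hypothetical record layout; a real implementation would serialize the
fields explicitly):

    #include <stdint.h>

    /* One UDP datagram would carry a sequence of (subchan_hdr,
       payload) records, so 100 small logical writes cost one
       datagram instead of 100. */
    struct subchan_hdr {
        uint16_t channel;  /* logical sub-channel id      */
        uint32_t seq;      /* per-channel sequence number */
        uint16_t len;      /* payload bytes that follow   */
    };
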
> It all looks like such an obviously good idea that I wonder
> why nobody else has tried it before. Or have they tried
> it and found that it's not such a good idea after all?
> 
> -SB

TCP actually scales pretty well. All modern operating systems provide a way
to do efficient select()-style event multiplexing, for example FreeBSD's
kqueue. With a bit of tuning one can effectively deal with 100k+ TCP
connections on a single system. This mainly comes down to increasing the
maximum number of file descriptors and decreasing the maximum send/receive
buffer sizes to conserve memory.

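For reference, the kqueue pattern looks roughly like this: a minimal
accept-and-read loop (the port and the 16 KB buffer size are arbitrary
example values; raising kern.maxfiles and kern.maxfilesperproc covers
the descriptor-limit side of the tuning):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <unistd.h>
    #include <err.h>

    int main(void) {
        struct sockaddr_in sin;
        struct kevent ev, events[64];
        char buf[4096];

        int lsock = socket(AF_INET, SOCK_STREAM, 0);
        if (lsock < 0) err(1, "socket");
        memset(&sin, 0, sizeof sin);
        sin.sin_len = sizeof sin;
        sin.sin_family = AF_INET;
        sin.sin_port = htons(8080);  /* arbitrary example port */
        if (bind(lsock, (struct sockaddr *)&sin, sizeof sin) < 0)
            err(1, "bind");
        if (listen(lsock, 128) < 0)
            err(1, "listen");

        int kq = kqueue();
        if (kq < 0) err(1, "kqueue");
        EV_SET(&ev, lsock, EVFILT_READ, EV_ADD, 0, 0, NULL);
        if (kevent(kq, &ev, 1, NULL, 0, NULL) < 0) err(1, "kevent");

        for (;;) {
            int n = kevent(kq, NULL, 0, events, 64, NULL);
            if (n < 0) err(1, "kevent");
            for (int i = 0; i < n; i++) {
                int fd = (int)events[i].ident;
                if (fd == lsock) {
                    int c = accept(lsock, NULL, NULL);
                    if (c < 0) continue;
                    /* shrink per-connection buffers to save memory */
                    int sz = 16 * 1024;
                    setsockopt(c, SOL_SOCKET, SO_RCVBUF, &sz, sizeof sz);
                    setsockopt(c, SOL_SOCKET, SO_SNDBUF, &sz, sizeof sz);
                    EV_SET(&ev, c, EVFILT_READ, EV_ADD, 0, 0, NULL);
                    kevent(kq, &ev, 1, NULL, 0, NULL);
                } else {
                    ssize_t r = read(fd, buf, sizeof buf);
                    if (r <= 0)
                        close(fd);  /* close also removes the kevent */
                    /* else: handle r bytes of data */
                }
            }
        }
    }
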
TCP provides very good throughput, and it achieves this using large send and
receive buffers. Your userspace implementation would need something similar;
a few hundred bytes per connection is simply not enough.

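(The buffer sizes follow from the bandwidth-delay product: to keep, say, a
100 Mbit/s path with a 50 ms round-trip time busy, about 100e6 / 8 * 0.05 =
625 kilobytes have to sit in the send buffer awaiting acknowledgement, for
that one connection alone.)
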
If you want to deal with millions of clients, your protocol had better not
keep any state at all. A good example of this is DNS.

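In the DNS style, the whole interaction fits in one datagram exchange, so
the server loop can look like this (a sketch; 's' is assumed to be a bound
SOCK_DGRAM socket and make_reply() is a hypothetical handler):

    /* Stateless request/response loop: each answer is computed
       from the incoming datagram alone, so nothing at all is
       kept between clients. */
    for (;;) {
        struct sockaddr_in from;
        socklen_t flen = sizeof from;
        char req[512], rep[512];
        ssize_t n = recvfrom(s, req, sizeof req, 0,
                             (struct sockaddr *)&from, &flen);
        if (n < 0)
            continue;
        ssize_t rn = make_reply(req, n, rep, sizeof rep);
        if (rn > 0)
            sendto(s, rep, rn, 0, (struct sockaddr *)&from, flen);
    }
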
I think most applications can either use TCP directly, with or without
tuning, or they have such specialized needs that a custom protocol is the
only solution.

Regards,
Pieter de Goeje
