Date:      Fri, 28 Feb 1997 21:26:57 -0500
From:      "David S. Miller" <davem@jenolan.rutgers.edu>
To:        netdev@roxanne.nuclecu.unam.mx
Subject:   overview of socket hashing changes, plus new patch
Message-ID:  <199703010226.VAA08431@jenolan.caipgeneral>


Eric Schenk pointed out some problems to me that caused IP transparent
proxy to not even compile; this is believed to be fixed and it should
actually work now.  Fetch the new patch from:

vger.rutgers.edu:/pub/linux/Net/patches/fordyson2.diff.gz

A brief overview of the changes, with some insight into my design
decisions where applicable:

	1) Anything referring to SOCK_ARRAY_SIZE, and the code which
	   used it, was essentially just deleted.

	2) Socket demultiplexing, I determined, was a per protocol
	   problem.  Therefore I added the following methods to
	   struct proto:

	void hash(struct sock *sk)

	   This is where a socket is added to a protocol's lookup
	   tables.  The method by which this is done and the tables
	   used are completely opaque to the af_inet/af_inet6 code;
	   it really doesn't care how you arrange things.  For example,
	   if you wanted to use a DP trie for TCP sockets (I considered
	   this approach, but decided against it, see below), the
	   callers simply would not care.

	void unhash(struct sock *sk)

	   This removes a sock from the protocol's lookup tables.

	void rehash(struct sock *sk)

	   Here we get called when the "identity" of a sock has
	   changed in some way, yet it still exists and lookups
	   should find it just fine.  The theory here is that
	   a quick routine needs to be present when we just change
	   the address/port elements of a sock, since the hash
	   is probably different then.

	unsigned short good_socknum(unsigned short base)

	   The idea here is that a protocol knows best how to choose
	   the best possible port number.  For example, it can exploit
	   some properties of its lookup tables and thus choose a
	   port which will seed better in those tables.

	int verify_bind(struct sock *sk, unsigned short snum)

	   This just verifies that attaching SK to the passed local
	   port is "ok" and doesn't conflict with any other existing
	   connection in that protocol.
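
	   As a rough sketch (not the literal header; surrounding
	   members and exact field placement are omitted), the
	   additions to struct proto look something like this:

	struct proto {
		/* ... existing protocol operations ... */
		void		(*hash)(struct sock *sk);
		void		(*unhash)(struct sock *sk);
		void		(*rehash)(struct sock *sk);
		unsigned short	(*good_socknum)(unsigned short base);
		int		(*verify_bind)(struct sock *sk,
					       unsigned short snum);
	};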

	   To aid in the support of these things, two elements were
	   added to struct sock:

	struct sock **hashtable;
	int hashent;

	   This allows the af_inet code to do things more freely,
	   forcing it to call the rehash routine only after all of its
	   identity changes have been completed.  This way the per-
	   protocol hashing code, during a rehash, can easily find
	   where the sock was beforehand even though the new hash
	   computation might be different.
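
	   A minimal sketch of how a per-protocol rehash can use these
	   two fields (the function and table names are illustrative,
	   not the actual code in the patch; sk->next is the usual
	   chain pointer):

	static void example_rehash(struct sock *sk)
	{
		struct sock **skp;

		/* Unlink from the chain the sock hashed to before
		 * af_inet changed its address/port identity.
		 */
		for (skp = &sk->hashtable[sk->hashent]; *skp;
		     skp = &(*skp)->next)
			if (*skp == sk) {
				*skp = sk->next;
				break;
			}

		/* Recompute the slot from the new identity, relink. */
		sk->hashent   = example_hashfn(sk);
		sk->hashtable = example_hash_table;
		sk->next      = sk->hashtable[sk->hashent];
		sk->hashtable[sk->hashent] = sk;
	}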

	   Alas it was found that there still needed to exist a way
	   to walk the entire socket list.  But happily, this is only
	   needed in one place, the procfs stuff.  So who cares ;-)
	   This was implemented by adding:

	struct sock *sklist_next, *sklist_prev;

	   to the front of both struct sock and struct proto.
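
	   Putting the pointers at the front of both structures lets
	   the proto itself serve as the list head.  A hypothetical
	   procfs-style walk then looks roughly like:

	struct sock *sk;

	for (sk = prot->sklist_next;
	     sk != (struct sock *) prot;	/* wrapped back to head */
	     sk = sk->sklist_next) {
		/* dump this socket's state into the /proc buffer */
	}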

	3) Socket hash table protection.  It is very simple: when
	   things need to be changed, a BH atomic section is entered
	   (a small sketch of this follows item 4 below).  For the
	   lookup code, during incoming packet processing, since you
	   know you are running within a BH you can safely avoid even
	   doing this and run full speed with no locking.  Eric Schenk
	   has informed me that he thinks this can be made to scream
	   for fine-grained SMP as well.  This is good.

	4) The af_inet code was fast pathed, and some special cases
	   were completely removed.  The code is generally much
	   cleaner now and easier to understand.  In particular
	   the RAW protocols for v4 and v6 were given their own
	   bind method, eliminating a lot of excess comparisons
	   in the af_inet bind code.
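
	   Regarding the locking in (3), a minimal sketch of the BH
	   protection around a hash table update (the call through
	   sk->prot is just an illustration of where the methods from
	   (2) get invoked):

	start_bh_atomic();
	sk->prot->hash(sk);	/* chain update now safe from BHs */
	end_bh_atomic();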

The implementations:

	SOCK_RAW

	Raw sockets did not need the full 256-entry hash tables they
	previously had.  The size of the hash was chosen to
	be MAX_INET_PROTOS; this simplified and sped up
	raw packet delivery.  The hash is on sk->num: simple,
	stupid, nothing fancy necessary at all.
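
	A sketch of the idea (assuming MAX_INET_PROTOS is a power of
	2; the function name is illustrative):

	static __inline__ int raw_hashfn(unsigned short num)
	{
		/* Fold the protocol number into the small table. */
		return num & (MAX_INET_PROTOS - 1);
	}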

	SOCK_PACKET

	No lookups, in the sense we are discussing here, are ever
	necessary.  Therefore this protocol lacks a table and
	also lacks the hash methods, etc.

	SOCK_UDP

	The algorithm is the same as before: hash on the local
	port number.  The one-behind sock cache was also retained,
	as for UDP there are two things to exploit (a small lookup
	sketch follows this list):

		1) Nearly all UDP sockets on a system are wildcarded
		   out the wazoo, so hashing on anything other than
		   the local port will only make it more complex and
		   cause the lookups to take longer.

		2) Numerous research papers and studies have shown
		   that the one-behind UDP socket cache can have hit
		   rates approaching 80%; this is the main reason it
		   was in fact retained.
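
	A sketch of the sort of lookup this implies (the table size,
	cache variable, and function names are purely illustrative):

	#define UDP_HTABLE_SIZE 128		/* assumed, power of 2 */

	static struct sock *udp_hash[UDP_HTABLE_SIZE];
	static struct sock *udp_cache;		/* one-behind cache   */

	struct sock *example_udp_lookup(unsigned short lport)
	{
		struct sock *sk = udp_cache;

		/* One-behind cache first, hit rates approach 80%. */
		if (sk && sk->num == lport)
			return sk;

		/* Miss: hash on the local port only, since nearly
		 * everything else is wildcarded anyway.
		 */
		for (sk = udp_hash[lport & (UDP_HTABLE_SIZE - 1)];
		     sk != NULL; sk = sk->next)
			if (sk->num == lport)
				break;

		udp_cache = sk;
		return sk;
	}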

	SOCK_TCP

	Ok, this one is a doozy, and it was the incentive behind all
	of this work I did; hold on to your seats.

	There are some nice properties about TCP which you can
	exploit, in particular:

	1) The overwhelming majority of TCP sockets have a full
	   identity, that is they are completely bound to a unique
	   [local address, local port], [foreign address, foreign
	   port] identity.  Furthermore, the only TCP sockets which
	   possess wildcards and can get hit during a demultiplex for
	   an incoming packet are those in the listening state; we
	   exploit this heavily.

	2) And for these listening connections, due to limitations
	   in the BSD sockets API, only the local address and port
	   can be non-wildcard.  The foreign port and address can
	   never be bound to anything for a listening TCP socket, so
	   don't even bother checking for them.  Note the protocol does
	   allow a listening socket to be bound to a foreign identity,
	   yet as stated the API lacks any way in which you can
	   actually do this.

	With that in mind, there are three hash tables for TCP
	sockets: one for LISTENING sockets, one for sockets with
	a full identity yet not dead/TCP_CLOSE, and one further hash
	whose purpose will be described shortly.

	Sockets in the full-identity hash are hashed based upon all
	four of laddr, lport, faddr, and fport.  This seems to fill
	the tables rather nicely.  Listening sockets are hashed only
	upon the local port, since this is the only thing we
	can guarantee is not wildcarded; in fact, as just stated, we
	know that the foreign port and address are wildcarded.
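
	As an illustration only (table sizes, names, and the simple
	XOR fold are assumed here, not taken from the patch), the two
	hashes are along these lines:

	#define TCP_HTABLE_SIZE  128	/* established hash, assumed */
	#define TCP_LHTABLE_SIZE 32	/* listener hash, assumed    */

	static struct sock *tcp_established_hash[TCP_HTABLE_SIZE];
	static struct sock *tcp_listening_hash[TCP_LHTABLE_SIZE];

	static __inline__ int tcp_establish_hashfn(__u32 laddr, __u16 lport,
						   __u32 faddr, __u16 fport)
	{
		/* Fold the full four-tuple identity into a power-of-2
		 * sized table, no modulo needed.
		 */
		return (laddr ^ faddr ^ lport ^ fport) &
		       (TCP_HTABLE_SIZE - 1);
	}

	static __inline__ int tcp_listen_hashfn(__u16 lport)
	{
		/* Listeners always have a wildcard foreign identity,
		 * so only the local port is worth hashing on.
		 */
		return lport & (TCP_LHTABLE_SIZE - 1);
	}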

	The algorithm for incoming socket lookup is essentially:

	1) Try to find perfect match in established hash table.
	2) If nothing found, look in listening table for closest match.

	Step (1) need only check for perfect matches (that is all it
	could ever find in there!), and step (2) need only prioritize
	a lookup by "local port and local address match" then
	"local port match, yet local address wildcarded".  And this
	is all you need to check.
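
	Roughly, in code form (using the illustrative hashes and
	tables above; the struct sock field names for the foreign
	half of the identity are hypothetical, and byte-order
	conversions are omitted for clarity):

	struct sock *example_tcp_lookup(__u32 saddr, __u16 sport,
					__u32 daddr, __u16 dport)
	{
		struct sock *sk, *wildcard = NULL;
		int hash = tcp_establish_hashfn(daddr, dport, saddr, sport);

		/* 1) Perfect match among fully-bound identities. */
		for (sk = tcp_established_hash[hash]; sk; sk = sk->next)
			if (sk->num == dport && sk->rcv_saddr == daddr &&
			    sk->faddr == saddr && sk->fport == sport)
				return sk;

		/* 2) Fall back to listeners: prefer an exact local
		 * address match, else a wildcard listener on the port.
		 */
		for (sk = tcp_listening_hash[tcp_listen_hashfn(dport)];
		     sk; sk = sk->next) {
			if (sk->num != dport)
				continue;
			if (sk->rcv_saddr == daddr)
				return sk;
			if (sk->rcv_saddr == 0 && wildcard == NULL)
				wildcard = sk;
		}
		return wildcard;
	}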

	Possible improvement for the listening lookups.  When we add
	a socket to the listening hash (a rare occurrence, it does not
	happen all that often, except for ftp clients in non-passive
	mode) we look for a "perfect listener" on our selected port.
	If one is found, we place ourselves after him.  The lookup
	code for the listening case can thus be simplified to return
	the first socket where the local port matches, and this will
	always be correct due to the way in which we have laid things
	out.

	Ok, that made incoming packet demultiplexing scale extremely
	well.  Let's see how I covered some other issues where
	performance sucked previously.

	TCP timers: one was tricky to solve, the other was simple.

	Every 75 seconds the keepalive timer goes off, and previously
	(FreeBSD does this as well, but much worse; I'll describe this
	in more detail later) the entire socket list was walked
	to find TCPs which need a keepalive probe sent.

	The fix here is to walk the established hash chains 4 times
	every 75 seconds, each time examining only 1/4 of the chains.
	1/4 was chosen because, for a HZ of at least 100, this has no
	loss of accuracy whatsoever.  Also note how it is only the
	established hash which is examined; listening sockets do not
	have keepalives go off, and this saves a bunch of stupid
	checks.
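
	A sketch of the slice-walking idea (table size and the field
	tested are illustrative; the real code keys off each socket's
	keepalive state):

	static int keepalive_slot = 0;

	static void example_keepalive_timer(unsigned long data)
	{
		int i;
		struct sock *sk;

		/* Fires every 75/4 seconds; each firing covers one
		 * quarter of the established chains, so every chain
		 * is visited once per 75-second keepalive period.
		 */
		for (i = 0; i < TCP_HTABLE_SIZE / 4; i++, keepalive_slot++) {
			keepalive_slot &= TCP_HTABLE_SIZE - 1;
			for (sk = tcp_established_hash[keepalive_slot];
			     sk; sk = sk->next)
				if (sk->keepopen)
					; /* send a keepalive probe */
		}
	}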

	Second issue: the SYN reception timeout.  Again, we previously
	walked the entire socket list when this fired off; no longer!
	Only listening sockets need to be checked (and as I mentioned
	this is an extremely small portion of the TCP socket space on
	a machine) so we only walk the TCP listening hash table.  This
	case was a piece of cake to deal with. ;-)

	I will describe the good socket number selection and bind
	verification techniques at some other point in time.  They are
	quick, but there are some bad corner cases I'd like to
	eliminate before I explain my intentions ;-)  Suffice it to
	say, though, that the third hash not yet described helps make
	these operations go like smoke.

Comparison with FreeBSD for TCP (I love this, in fact once you read
this you will wonder where their claims of scalability even come
from):

	1) Socket demultiplexing:

		FreeBSD -	Single hash for entire TCP socket
				space.  Listening sockets pollute
				the established connections.  Hash
				function involves a mod instruction
				because the table size is a prime.
				Extraneous testing in the hash lookup
				because this code in in_pcb.c must be
				generic and cannot exploit nice
				properties present in TCP.

		Linux -		Dual hash, listening sockets do not
				pollute the perfectly unique sockets.
				For the listening case, only the
				necessary pieces of the socket
				identity are examined to determine a
				match.  Hash table size is a power of
				2, eliminating the need for a costly
				multi-cycle modulo instruction.

	(Some minutiae: I have approximately 50 or so samples from some
	 of the largest ftp and web servers that I know about (some
	 of these samples have 5,000 or more connections), listing
	 the socket connections during high load.  A comparison of the
	 current bound hash function under Linux and the one
	 under FreeBSD shows them to distribute about the same, if not
	 better for Linux (in particular, Linux is better on machines
	 which run many clients as well as large single-port servers).
	 The prime size of the hash does not seem to help FreeBSD, and
	 they eat the multi-cycle modulo instruction on top of it all,
	 so stupid.)

	2) Timers:

		FreeBSD -	Every 500ms, and every 200ms, the
				_ENTIRE_ TCP socket list is walked,
				yes, in its entirety.  How the fuck
				does this scale?

		Linux -		Most TCP timers go off with the socket
				in question as an argument to the
				timer.  And due to Finne Gangstad's
				scalable timer implementation, the
				setting and deletion of these timers
				are near zero overhead.  This also has
				the nice effect that our timers are
				much more accurate than BSD's.  (This
				avoids the retransmit "explosions of
				death" every 200ms as seen on large
				BSD-derived servers.)

				For the timers which need not be that
				accurate, we walk the small listener
				hash chains every half second, and
				every (75 / 4) seconds we walk
				(128 / 4) of the established hash
				chains.

				Now those are scalable TCP timers, beat
				that. ;-)

	3) Port selection, and bind verification.

	   I will discuss this issue at a later time; suffice it to say
	   that for FreeBSD they essentially walk the entire socket
	   list for both operations looking for a match/miss.

---------------------------------------------////
Yow! 11.26 MB/s remote host TCP bandwidth & ////
199 usec remote TCP latency over 100Mb/s   ////
ethernet.  Beat that!                     ////
-----------------------------------------////__________  o
David S. Miller, davem@caip.rutgers.edu /_____________/ / // /_/ ><



