Date:      Thu, 17 Oct 2002 17:46:37 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Julian Elischer <julian@elischer.org>
Cc:        Vincent Jardin <vjardin@wanadoo.fr>, freebsd-net@freebsd.org, freebsd-hackers@freebsd.org
Subject:   Re: Netgraph TCP/IP
Message-ID:  <3DAF59ED.D14BD089@mindspring.com>
References:  <Pine.BSF.4.21.0210171450440.2971-100000@InterJet.elischer.org>

Julian Elischer wrote:
> > There is also the m_pullup() issue of the TCP protocol that is
> > being passed IP datagrams which may be frags of TCP packets, in
> > order to get the full TCP header, with options.
> 
> The tcp code should handle this anyway.

It should, but it won't.  The issue is when you need to make a
decision based on TCP packet contents, but you don't have a
complete packet.  The expected behaviour is to call m_pullup.
For a Netgraph version of this, you will either need a context
(there isn't one at that point -- it runs at NETISR), or you
will need to be able to restart tcp_input().  The problem with
that is that it's expensive.  Effectively, you almost need to
separate out the frag code beforehand, and assemble whole packets
before going into the traditional tcp_input().  Lemon has some
good ideas in this area; so do I.  I've got code here, where I've
moved around some of the operations to delay computations until
full data is available (which would avoid recomputation).
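
To make the "complete header" requirement concrete, here is a minimal
userspace sketch.  It is not the kernel's m_pullup(); struct seg and
chain_pullup() are hypothetical names, and the chain is a toy stand-in
for an mbuf chain.  It only shows the decision being discussed: either
the first `need` bytes can be made contiguous, or the caller has to
wait for more fragments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an mbuf: one contiguous piece of a packet. */
struct seg {
    const uint8_t *data;
    size_t len;
    const struct seg *next;
};

/*
 * Copy the first `need` bytes of the chain into `out` so a header can be
 * read contiguously.  Returns 1 on success, 0 if the chain is too short
 * (the caller must wait for more data before deciding anything).
 */
static int
chain_pullup(const struct seg *s, size_t need, uint8_t *out)
{
    size_t got = 0;

    assert(out != NULL);
    for (; s != NULL && got < need; s = s->next) {
        size_t n = s->len;

        if (n > need - got)
            n = need - got;     /* take only what is still missing */
        memcpy(out + got, s->data, n);
        got += n;
    }
    return got == need;
}
```

A caller that gets 0 back is in exactly the position described above:
it cannot decide yet, and something has to hold the partial data until
the rest arrives.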


> > Minimally, the approach has to be a separate TCP stack, which is
> > given a different protocol number for the purposes of experiment,
> > so that you can have a duplicate TCP stack on both sides using
> > the normal mechanism, and replace it on one side with the Netgraph
> > version equivalent.
> 
> Not necessarily.. if each stack can 'reject' a packet.. ("not mine").

The problem with this one is that the packet in this case is TCP.

Ideally, what you would like for the development case is to be able
to tag particular flows as going to one TCP stack vs. the other; then
by examining the flow tag, you would be able to decide where to handle
the packet (this also implies moving the frag reassembly to a separate
"layer").  Then you could flag a flow, and have it dealt with that way.

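A rough sketch of the flow-tagging idea, in userspace C.  All of the
names here (stack_id, flow_key, flow_tag, flow_select_stack) are
hypothetical, and the linear scan is only for brevity; a real version
would presumably use the same hash that identifies the flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical identifiers for the two TCP implementations. */
enum stack_id { STACK_DEFAULT = 0, STACK_NETGRAPH = 1 };

/* A flow is named here by its address/port 4-tuple. */
struct flow_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* A tag binds one flow to the stack that should handle it. */
struct flow_tag {
    struct flow_key key;
    enum stack_id stack;
};

static int
flow_key_eq(const struct flow_key *a, const struct flow_key *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
        a->sport == b->sport && a->dport == b->dport;
}

/* Return the stack a flow was tagged for; untagged flows go to the default. */
static enum stack_id
flow_select_stack(const struct flow_tag *tags, size_t ntags,
    const struct flow_key *k)
{
    size_t i;

    assert(k != NULL);
    for (i = 0; i < ntags; i++)
        if (flow_key_eq(&tags[i].key, k))
            return tags[i].stack;
    return STACK_DEFAULT;
}
```

The point of the sketch is that the decision is made once, from the
tag, after reassembly, rather than by offering the packet to each
stack in turn.
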
In FreeBSD's lower level code, though, IP flows aren't really treated
as flows; this is partly an artifact of the routing code, itself, and
partly an artifact of inpcbhash(), which is actually broken.  The hash
on input for a flow vs. output on a flow allocation is also broken;
consider that you can make a connection on an outbound socket which
is not bound, from a specific source IP address, with no specific
source port, and the source port contention is handled globally, rather
than locally -- thus limiting you to 65535 maximum outbound connections
on a single machine (the number of ports in a single globally contended
IP address space, despite the fact that your source IP was specified).
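
As an illustration of "locally" rather than "globally" contended
source ports, here is a toy allocator that keeps one port bitmap per
local address.  This is a sketch only, not how the in_pcb code works;
struct port_space, port_alloc(), and the PORT_FIRST/PORT_LAST range
are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed ephemeral range for the example; not FreeBSD's defaults. */
#define PORT_FIRST 1024u
#define PORT_LAST  65535u

/* One port namespace per local (source) IP address. */
struct port_space {
    uint32_t laddr;
    uint8_t used[65536 / 8];    /* one bit per port */
};

/* Allocate the lowest free port in this address's space; 0 if exhausted. */
static uint16_t
port_alloc(struct port_space *ps)
{
    unsigned int p;

    assert(ps != NULL);
    for (p = PORT_FIRST; p <= PORT_LAST; p++) {
        if ((ps->used[p / 8] & (1u << (p % 8))) == 0) {
            ps->used[p / 8] |= (uint8_t)(1u << (p % 8));
            return (uint16_t)p;
        }
    }
    return 0;
}
```

With two port_space instances for two source addresses, both can hand
out the same port number, so the port limit applies per source
address instead of per machine.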

What this adds up to is that if you want to run stacks in parallel,
they can't share protocol numbers, because the code does not really
distinguish them at the proper layer, but instead, distinguishes them
off-by-one.

This actually makes processing slower overall, as well; consider that
the fast forwarding code does a lookup, which, on a miss, is then
passed up to the TCP to do another lookup, rather than passing the
lookup result as part of the context.

What this basically means is that the hash entries for the values are
not shared, with a single hit-per-flow, and the more "fast forwarding"
you do, the slower normal processing goes.  The same happens for the
SYN cache context lookup.  It basically really slows down the code, to
not do the hash lower down, and then make the decision on the basis of
the result of identifying the flow.
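
A sketch of what "passing the lookup result as part of the context"
could look like.  The types and names (struct flow, struct pkt,
pkt_flow) are hypothetical, and the table scan stands in for whatever
hash is actually used; the only point is that the result, hit or miss,
travels with the packet so later layers don't repeat the lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct flow {
    uint32_t id;
};

struct pkt {
    uint32_t flow_id;
    const struct flow *flow;    /* cached lookup result, NULL on a miss */
    int classified;             /* nonzero once the lookup has been done */
};

static unsigned int lookups;    /* counts real table lookups, for illustration */

static const struct flow *
flow_lookup(const struct flow *tbl, size_t n, uint32_t id)
{
    size_t i;

    lookups++;
    for (i = 0; i < n; i++)
        if (tbl[i].id == id)
            return &tbl[i];
    return NULL;
}

/* Look the flow up at most once per packet; later callers reuse the result. */
static const struct flow *
pkt_flow(struct pkt *p, const struct flow *tbl, size_t n)
{
    assert(p != NULL);
    if (!p->classified) {
        p->flow = flow_lookup(tbl, n, p->flow_id);
        p->classified = 1;      /* a miss is remembered too */
    }
    return p->flow;
}
```

If the forwarding path and the TCP input path both go through
pkt_flow(), the second call costs a flag test instead of another
hash lookup.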

In a general sense, what you would need to do to do what you suggest is
to pass all packets through all possible stacks, until you hit the
"default" one -- the standard TCP -- and rely on all the other stacks
to not eat the packet.

In practice, this means that every IP protocol, except TCP, ends up
getting rewritten for Netgraph use, until TCP gets rewritten, and
then your lookup is O(N*MAX(1,flow_count/hash_size)), where N is the
number of IP protocols (TCP, UDP, RTP, etc.).
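
To put numbers on that bound, a small helper that just evaluates
N*MAX(1,flow_count/hash_size).  The function name and the sample
values below are made up for illustration; they are not measurements.

```c
#include <assert.h>

/*
 * Evaluate N * MAX(1, flow_count / hash_size): each of the N stacks does
 * its own lookup, and each lookup walks at least one hash chain entry.
 */
static double
lookup_cost(unsigned int n_protocols, unsigned int flow_count,
    unsigned int hash_size)
{
    double chain;

    assert(hash_size > 0);
    chain = (double)flow_count / (double)hash_size;
    if (chain < 1.0)
        chain = 1.0;
    return n_protocols * chain;
}
```

For example, with 3 protocols, 8192 flows and a 4096-bucket hash the
expression comes to 3 * 2 = 6, against 2 for a single shared lookup
over the same table.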

Also, in practice, this still doubles the overhead for things that need
to be pre-decided (flow identification for IP fast forwarding, DSR,
splicing, etc.), and none of these things can really be safely
implemented as Netgraph modules.

Yeah, it can be made to function correctly that way, but it won't
function quickly.

-- Terry




