From owner-freebsd-arch@FreeBSD.ORG  Sat Jan  5 20:30:08 2008
Date: Sun, 06 Jan 2008 02:30:02 +0600
To: "Julian Elischer" <julian@elischer.org>
References: <4772F123.5030303@elischer.org>	<f85d6aa70712261728h331eadb8p205d350dc7fb7f4c@mail.gmail.com>	<477416CC.4090906@elischer.org>
	<opt4c0imk24fjv08@nuclight.avtf.net>
	<477D2EF3.2060909@elischer.org>
From: "Vadim Goncharov" <vadim_nuclight@mail.ru>
Organization: AVTF TPU Hostel
Content-Type: text/plain; format=flowed; delsp=yes; charset=koi8-r
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Message-ID: <opt4g4kcis17d6mn@nuclight.avtf.net>
In-Reply-To: <477D2EF3.2060909@elischer.org>
User-Agent: Opera M2/7.54 (Win32, build 3865)
Cc: arch@freebsd.org, Ivo Vachkov <ivo.vachkov@gmail.com>,
	Robert Watson <rwatson@freebsd.org>, Qing Li <qingli@freebsd.org>,
	FreeBSD Net <freebsd-net@freebsd.org>
Subject: Re: resend: multiple routing table roadmap (format fix)

04.01.08 @ 00:52 Julian Elischer wrote:

>>> By the way, I might add that in the 6.x compat. version I may end up
>>> limiting the feature to 8 tables. This is because I need to store some
>>> stuff in an efficient way in the mbuf, and in a compatible manner this  
>>> is easiest done by stealing the top 4 bits in the mbuf flags word
>>> and defining them as:
>>>
>>>   #define M_HAVEFIB    0x10000000
>>>   #define M_FIBMASK    0x07
>>>   #define M_FIBNUM    0xe0000000
>>>   #define M_FIBSHIFT    29
>>>   #define m_getfib(_m, _default) (((_m)->m_flags & M_HAVEFIB) ? \
>>>     (((_m)->m_flags >> M_FIBSHIFT) & M_FIBMASK) : (_default))
>>>   #define M_SETFIB(_m, _fib) do { \
>>>     (_m)->m_flags &= ~M_FIBNUM; \
>>>     (_m)->m_flags |= (M_HAVEFIB | (((_fib) & M_FIBMASK) << M_FIBSHIFT)); \
>>>   } while (0)
>>>
>>> This then becomes very easy to change to use a tag or
>>> whatever is needed in later versions , and the number can
>>> be expanded past 8 predefined  FIBs at that time..
>>  If you want it to be a tag, why spend bits in m_flags instead of just
>> doing it as a tag at once? Or is the 6.x (possibly 7.x too)
>> implementation supposed to be thrown away completely in favor of the
>> right thing in 8.0?
>
> basically yes..
>
> I'm looking at just doing tags to start with, but haven't done it yet..  
> I'm looking for a good bit of tag code to copy :-)

Look at ipfw's O_ALTQ/O_TAG/O_TAGGED (and some other parts), ng_tag.c,
ng_ipfw.c, ng_ksocket.c and some other stuff :-) Tags are simple: if 16
bits are enough for you, then you do not even have to allocate data, just
use the tag_id member. They are also easy to manipulate within netgraph
with ng_tag, etc. But as a drawback, you have to allocate memory for them,
and as it is M_NOWAIT, malloc() can return NULL in interrupt threads... So
a new field in the mbuf (or flags) would be better in terms of
performance, but it will break the ABI :(
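
For comparison, here is a minimal userland sketch of the flag-bits variant quoted above, so the encoding can be tried without a kernel. Everything prefixed with mock_ is hypothetical: struct mock_mbuf stands in for the real mbuf, the flags word is made unsigned so the shifts are well defined, and mock_clrfib() is my own addition to show that "no FIB" and "FIB 0" are different states under this scheme.

```c
#include <assert.h>
#include <stdint.h>

/* Mock of the one mbuf field used here; the real struct mbuf is much larger. */
struct mock_mbuf {
	uint32_t m_flags;	/* unsigned here so the shifts are well defined */
};

#define M_HAVEFIB	0x10000000u	/* a FIB number is stored in the flags */
#define M_FIBNUM	0xe0000000u	/* the three bits that hold the number */
#define M_FIBMASK	0x07u
#define M_FIBSHIFT	29

/* Return the FIB stored in the flags, or def if none was ever set. */
static inline unsigned
mock_getfib(const struct mock_mbuf *m, unsigned def)
{
	if ((m->m_flags & M_HAVEFIB) == 0)
		return (def);
	return ((m->m_flags >> M_FIBSHIFT) & M_FIBMASK);
}

/* Store fib (0..7) in the top bits, leaving all other flags untouched. */
static inline void
mock_setfib(struct mock_mbuf *m, unsigned fib)
{
	assert(fib <= M_FIBMASK);
	m->m_flags &= ~M_FIBNUM;
	m->m_flags |= M_HAVEFIB | ((fib & M_FIBMASK) << M_FIBSHIFT);
}

/* Drop the association entirely; this is distinct from storing FIB 0. */
static inline void
mock_clrfib(struct mock_mbuf *m)
{
	m->m_flags &= ~(M_HAVEFIB | M_FIBNUM);
}
```

No allocation can fail here, which is the performance argument for flags; the price is the hard limit of 8 tables and four flag bits gone.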

I don't have m_tag_alloc() measurements, though. Doing 'ipfw add 1 tag 1  
ip from any to any' on a 15 kpps 6.2 router didn't cause any noticeable  
slowdown while looking for half a minute at 'systat -vm 1'...

>   setfib 3 /bin/sh
>
> now by default everything you do uses table 3.
> or even
>
> setfib 3 jail {blah}
>
> and all the procs in the jail use table 3. You also need to do
> setfib 3 jexec xxx
> for extra processes you add to the jail afterwards.

Maybe introduce a field in struct prison to make this possible without
additional commands?

>>>>> 2/ packets received on an interface for forwarding.
>>>>>     By default these packets would use table 0,
>>>>>     (or possibly a number settable in a sysctl(not yet)).
>>>>>     but prior to routing the firewall can inspect them (see below).
>>>>>
>>>>> 3/ packets inspected by a packet classifier, which can arbitrarily
>>>>>     associate a fib with it on a packet by packet basis.
>>>>>     A fib assigned to a packet by a packet classifier
>>>>>     (such as ipfw) would over-ride a fib associated by
>>>>>     a more default source. (such as cases 1 or 2).
>>  Sounds good. I like the idea of doing routing decisions in the firewall,
>> so as not to duplicate kernel code and userspace utilities, as Linux'
>> iproute2 does (which, however, still has a few parameters and relies on
>> firewall marks for others). However, there are some cases, I think,
>> where it could be done outside the firewall. For example, make an
>> ifconfig option to use a specific FIB as a default for all packets
>> outgoing from this interface's address. But here another related
>> question arises: Linux allows selecting a specific src IP based on a
>> routing table entry, i.e. on the destination address (thoughts about pf
>> reply-to/route-to, huh).
>
> that is default here too if I understand what you are talking about.
> the src address is selected from the routing table's exit interface.
> In the code I'm showing in perforce, that address would depend on which  
> table your process was associated with. (or just the socket if you have  
> used the socket option on it before doing the bind/connect)

What I'm talking about is leaving room for future MPLS/VRF/etc. For
example, if we make an interface option to use a specific FIB for every
packet incoming on that interface (put a tag on it early in input?), then
ARP replies, ICMP redirects (yes, make the stack apply them to the
particular FIB if one is specified, not to the main one) and so on will
affect only that table. Then it will be possible, say, to have
192.168.0.0/24 on em0 and also 192.168.0.0/24 on em1, with the two
networks completely independent of each other at both L2 and L3 (different
customers). After that, a change allowing the same IP address on different
interfaces will lead to complete virtual independence. And all without any
vimages: why would we need separate copies of the TCP stack etc. on a
router without any jails, under a single administrator's control?
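
A toy illustration of the overlapping-prefix case, as a userland sketch only. The tables, the if_fib[] binding of an input interface to a FIB and all mock_ names are hypothetical, and a linear scan stands in for the real radix tree lookup:

```c
#include <stddef.h>
#include <stdint.h>

#define NFIB	2	/* hypothetical number of tables */

struct mock_route {
	uint32_t	dst;		/* network, host byte order */
	uint32_t	mask;		/* netmask, host byte order */
	const char	*ifname;	/* exit interface */
};

/* One small table per FIB; both carry 192.168.0.0/24, via different NICs. */
static const struct mock_route fib0[] = {
	{ 0xc0a80000u, 0xffffff00u, "em0" },
	{ 0x00000000u, 0x00000000u, "em2" },	/* default route */
};
static const struct mock_route fib1[] = {
	{ 0xc0a80000u, 0xffffff00u, "em1" },
};

static const struct mock_route *const fibs[NFIB] = { fib0, fib1 };
static const size_t fib_len[NFIB] = { 2, 1 };

/* Hypothetical binding of an input interface index to a FIB. */
static const unsigned if_fib[] = { 0 /* em0 */, 1 /* em1 */ };

/* Longest-prefix match inside one table only; NULL if nothing matches. */
static const struct mock_route *
mock_lookup(unsigned fibnum, uint32_t addr)
{
	const struct mock_route *best = NULL;
	size_t i;

	if (fibnum >= NFIB)
		return (NULL);
	for (i = 0; i < fib_len[fibnum]; i++) {
		const struct mock_route *r = &fibs[fibnum][i];

		if ((addr & r->mask) != r->dst)
			continue;
		if (best == NULL || r->mask > best->mask)
			best = r;
	}
	return (best);
}

/* Pick the table from the input interface, then look up in it alone. */
static const struct mock_route *
mock_input_lookup(size_t ifindex, uint32_t addr)
{
	if (ifindex >= sizeof(if_fib) / sizeof(if_fib[0]))
		return (NULL);
	return (mock_lookup(if_fib[ifindex], addr));
}
```

The same destination 192.168.0.5 resolves to em0 for packets that came in on em0 and to em1 for packets that came in on em1, and neither table ever sees the other's entries.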

Yes, this may be difficult with the planned L2/L3 separation (currently the
ARP table is in fact part of the FIB), but it is solvable, say, by binding
an ARP table to one or several FIBs. Moreover, I think that complete stack
virtualization in each jail/vimage is a waste of resources; instead, one or
several FIBs/interfaces/ARP tables can be bound to each vimage/jail,
possibly with write permissions.

And even if all of the above is considered far future and/or will be done
in a different way, binding a FIB to an interface is still useful for (both
incoming and) outgoing packets, as it makes a firewall ruleset simpler.
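
How an interface binding could coexist with the firewall is roughly the following, as a sketch only. The order (classifier first, then interface, then the system default) is my assumption, extrapolated from the rule quoted above that a classifier's FIB overrides a more default source. FIB_UNSET and mock_effective_fib() are hypothetical names:

```c
#define FIB_UNSET	(-1)	/* hypothetical marker: nothing assigned */

/*
 * Pick the FIB for a packet: a classifier (e.g. ipfw) wins over an
 * interface binding, which in turn wins over the system default.
 */
static int
mock_effective_fib(int classifier_fib, int ifp_fib, int default_fib)
{
	if (classifier_fib != FIB_UNSET)
		return (classifier_fib);
	if (ifp_fib != FIB_UNSET)
		return (ifp_fib);
	return (default_fib);
}
```

With this, a ruleset needs a setfib rule only for the exceptions, not one per interface.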

>> In relation to this I can remember multipath routing (different  
>> metrics?), addresses from one subnet on different ifaces (mask wider  
>> /32) and so on.
>> Also it is interesting, how multiple FIBs would interact with host-wide  
>> events, such as ICMP redirects (which table should be updated?),  
>> storing of TCP stack metrics (MTU, etc.) and hostcache, and so on. How  
>> these and above will be solved?..
>
> I'm not really too knowledgeable about multicast..

Are multicast and multipath routing the same?

>> per ifconfig (>1 host per subnet)/icmp redirects/src to prefer,  
>> multipath/metrics, tcp stack parameters interaction, iproute2
>
> I'm not trying to solve problems that need vimage to solve them..

Umm, what vimage?.. :) I forgot to remove these keywords, which I had
written for myself while drafting, after explaining them in detail, sorry :)

>>>>> Routing messages would be associated with their
>>>>> process, and thus select one FIB or another.
>>  This is not clear. How should the 'route' command work with different  
>> FIBs, if they are supposed by admin to be used for forwarding, and not  
>> the straight per-process? I think a setfib option is more consistent  
>> than running route under setfib command. Also, routing sockets and  
>> routing daemons - should they work with only one table?..
>
> if you do
> setfib 3 route get 1.1.1.1
>
> you may get a different result from
>
> setfib 2 route get 1.1.1.1
>
> I will add a fibnum argument to route itself as well but it's not needed  
> immediately as long as I have the setfib command.

OK, but we should think about it in the future. In theory, routing socket
messages are easily extendable with a FIB number in a uint16_t, as the
message keeps its length...
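
A sketch of what I mean, reading "keeps its length" as "every message starts with its own total length" (the real routing message header does begin with length, version and type fields). The two mock_ structs and the fibnum field are hypothetical and are not the real rt_msghdr layout:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mock header: only the leading fields matter for this illustration. */
struct mock_rtmsg_v1 {
	uint16_t msglen;	/* total length of this message in bytes */
	uint8_t  version;
	uint8_t  type;
};

/* Hypothetical extension: same leading fields plus a 16-bit FIB number. */
struct mock_rtmsg_v2 {
	uint16_t msglen;
	uint8_t  version;
	uint8_t  type;
	uint16_t fibnum;
	uint16_t pad;
};

/*
 * Count the messages in buf by stepping msglen bytes at a time.  A reader
 * that knows only the v1 header never looks past it, yet still finds the
 * next message boundary when a longer v2 message is in the stream.
 */
static size_t
mock_count_msgs(const unsigned char *buf, size_t len)
{
	struct mock_rtmsg_v1 hdr;
	size_t n = 0;

	while (len >= sizeof(hdr)) {
		memcpy(&hdr, buf, sizeof(hdr));
		if (hdr.msglen < sizeof(hdr) || hdr.msglen > len)
			break;		/* malformed or truncated */
		buf += hdr.msglen;
		len -= hdr.msglen;
		n++;
	}
	return (n);
}

/* Return the FIB from a message, or 0 for a short (old-format) one. */
static unsigned
mock_msg_fib(const unsigned char *msg)
{
	struct mock_rtmsg_v2 m;
	uint16_t msglen;

	memcpy(&msglen, msg, sizeof(msglen));
	if (msglen < sizeof(m))
		return (0);
	memcpy(&m, msg, sizeof(m));
	return (m.fibnum);
}
```

Old messages fall back to table 0, which would match today's behaviour for tools that know nothing about FIBs.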

>>>>> I have not yet added the changes to ipfw.
>>  Action modifier, like 'ipfw add count setfib 3 ip from any to any' ?
>> There were thoughts (I heard it as a hack before multiple FIBs) about
>> making an additional, say, 'nexthop' ipfw action, which acts like fwd,
>> but does not accept the packet, allowing it to continue through the
>> firewall ruleset - thus making it more comfortable to separate routing
>> (imagine 'nexthop tablearg') and filtering. There are questions with
>> both fwd and the new supposed option: will fwd still survive? Will it
>> change the output interface, like a complete rerouting before calling
>> pfil(9) hooks, so that *oif will be changed to be matched in rules
>> below? pf route-to/reply-to is hanging around...
>
> The 'nexthop' call you suggest is problematic because it needs to return
> information immediately. which is why it is terminal.

Um, why? Why can't it continue through the ruleset? I don't know the
implementation details of routing and 'ipfw fwd', alas.

> As for the setfib ipfw action, I have now done this in p4.
>
> ipfw add 200 setfib 3 ip from any to any in receive em0
>
> now works.
> This lessens the need for associating a fib with an interface as the  
> firewall can do that too..
>
> the setfib rule is not terminal. (hmm need to check I did that right.)

Oh, if it works, that's cool.

> you can also do
> ipfw add 200 skipto 300 ip from any to any hasfib
>   # to select on a packet that has a fib associated with it already.
> ipfw add 200 skipto 300 ip from any to any fib 4
>   # to select packets that are associated with fib 4
> ipfw add 200 clrfib ip from any to any
>   # to remove a fib association from the packet.

Do we need a separate keyword 'clrfib' when it could be 'setfib 0'? Or
at least save one opcode in the kernel's ipfw. Also, it would be nice to
have 'setfib tablearg' together with reserving 16 bits for the FIB number:
some systems with hundreds of vlans will want to have more than 256
tables, I think...

>>>>> Interaction with the ARP layer/ LL layer would need to be
>>>>> revisited as well. Qing Li has been working on this already.
>>  Oh yes, L2 interaction is interesting. How should it work in case of
>> the planned separation of routing and ARP tables?..

I've explained my views about it above...

-- 
WBR, Vadim Goncharov