Date:      Fri, 22 Mar 2013 14:58:26 +0400
From:      "Alexander V. Chernikov" <melifaro@ipfw.ru>
To:        Andre Oppermann <andre@freebsd.org>
Cc:        "Alexander V. Chernikov" <melifaro@FreeBSD.org>, Sami Halabi <sodynet1@gmail.com>, "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject:   Re: MPLS
Message-ID:  <514C3952.2010903@ipfw.ru>
In-Reply-To: <51471974.3090300@freebsd.org>
References:  <CAEW%2Bogb_b6fYLvcEJdhzRnoyjr0ORto9iNyJ-iiNfniBRnPxmA@mail.gmail.com> <CAEW%2BogZTE4Uw-0ROEoSex=VtC%2B0tChupE2RAW5RFOn=OQEuLLw@mail.gmail.com> <CAEW%2BogYbCkCfbFHT0t2v-VmqUkXLGVHgAHPET3X5c2DnsT=Enw@mail.gmail.com> <5146121B.5080608@FreeBSD.org> <514649A5.4090200@freebsd.org> <3659B942-7C37-431F-8945-C8A5BCD8DC67@ipfw.ru> <51471974.3090300@freebsd.org>

On 18.03.2013 17:41, Andre Oppermann wrote:
> On 18.03.2013 13:20, Alexander V. Chernikov wrote:
>> On 17.03.2013, at 23:54, Andre Oppermann <andre@freebsd.org> wrote:
>>
>>> On 17.03.2013 19:57, Alexander V. Chernikov wrote:
>>>> On 17.03.2013 13:20, Sami Halabi wrote:
>>>>>> ITOH OpenBSD has a complete implementation of MPLS out of the 
>>>>>> box, maybe
>>>> Their control plane code is mostly useless due to design approach 
>>>> (routing daemons talk via kernel).
>>>
>>> What's your approach?
>> It is actually not mine. We have discussed this a bit in 
>> radix-related thread. Generally quagga/bird (and other hiperf 
>> hardware-accelerated and software routers) have feature-rich RIb from 
>> which best routes (possibly multipath) are installed to kernel/fib. 
>> Kernel main task should be to do efficient lookups while every other 
>> advanced feature should be implemented in userland.
>
> Yes, we have started discussing it but haven't reached a conclusion 
> among the
> two philosophies.  We have also agreed that the current radix code is 
> horrible
> in terms of cache misses per lookup.  That however doesn't preclude an 
> agnostic
> FIB+RIB approach.  It's mostly a matter of structure layout to keep it 
> efficient.
Yes. Additionally, we have problems with misuse of the rtalloc API (rte 
grabbing).
My point of view is to use a separate FIB for the 'data plane', i.e. 
forwarding, while keeping some kind of higher-level kernel RIB used for 
the route socket, multipath, other subsystem interaction and so on.
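To make the split a bit more concrete, here is a rough sketch of what I 
mean (plain C with made-up structures, not the real kernel ones): the RIB 
entry keeps everything the control plane cares about, while the FIB entry 
is reduced to what a per-packet lookup actually needs, so it stays small 
and cache-friendly.

#include <stdint.h>

/* Hypothetical illustration only -- not the actual FreeBSD structures. */
struct nhop {                   /* one fully-described next hop */
    uint32_t gw;
    uint16_t ifidx;
    uint16_t weight;
};

struct rib_entry {              /* control plane: route socket, multipath, policy */
    uint32_t     prefix;
    uint8_t      plen;
    uint32_t     flags;
    uint32_t     metric;
    int          nhop_count;    /* multipath candidates */
    struct nhop *nhops;
};

struct fib_entry {              /* data plane: small and read-mostly */
    uint32_t prefix;
    uint8_t  plen;
    uint16_t ifidx;             /* selected egress interface */
    uint32_t gw;                /* selected next hop */
};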
>
>>>> Their data plane code, well.. Yes, we can use some defines from 
>>>> their headers, but that's all :)
>>>>>> porting it would be short and more straight forward than porting 
>>>>>> linux LDP
>>>>>> implementation of BIRD.
>>>>
>>>> It is not 'linux' implementation. LDP itself is cross-platform.
>>>> The most tricky place here is control plane.
>>>> However, making _fast_ MPLS switching is tricky too, since it 
>>>> requires changes in our netisr/ethernet handling code.
>>>
>>> Can you explain what changes you think are necessary and why?
> >
>> We definitely need the ability to dispatch a chain of mbufs - this was 
>> already discussed in the Intel RX ring lock thread in -net.
>
> Actually I'm not so convinced of that.  Packet handling is a tradeoff 
> between
Yes. But I'm talking about a mixed way (one part as batches, to eliminate 
contention, and then process-to-completion).
> doing process-to-completion on each packet and doing context switches 
> on batches
> of packets.
Context switches?

Batches are efficient; this is noted explicitly in:
1) Luigi's VALE paper 
http://info.iet.unipi.it/~luigi/papers/20121026-vale.pdf (Section 5.2)
2) Intel/6wind, who use batches to move packets to their 'netisr' rings in 
their DPDK
3) PacketShader ( http://shader.kaist.edu/packetshader/ ), which uses 
batches too.
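Just to illustrate why batching helps with queue locking, a simplified 
userland-style sketch (mock types and names, not the real netisr code): 
one lock/unlock pair then covers a whole chain instead of a single packet.

#include <stddef.h>
#include <pthread.h>

struct mbuf {
    struct mbuf *m_nextpkt;     /* packet chaining, as in the real mbuf */
    /* payload omitted */
};

struct pkt_queue {
    pthread_mutex_t lock;
    struct mbuf *head, *tail;
};

/* Per-packet enqueue: one lock/unlock per packet. */
void
enqueue_one(struct pkt_queue *q, struct mbuf *m)
{
    pthread_mutex_lock(&q->lock);
    m->m_nextpkt = NULL;
    if (q->tail != NULL)
        q->tail->m_nextpkt = m;
    else
        q->head = m;
    q->tail = m;
    pthread_mutex_unlock(&q->lock);
}

/* Batched enqueue: one lock/unlock per chain of N packets. */
void
enqueue_chain(struct pkt_queue *q, struct mbuf *head, struct mbuf *tail)
{
    pthread_mutex_lock(&q->lock);
    tail->m_nextpkt = NULL;
    if (q->tail != NULL)
        q->tail->m_nextpkt = head;
    else
        q->head = head;
    q->tail = tail;
    pthread_mutex_unlock(&q->lock);
}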

>
> Every few years the balance tilts forth and back between 
> process-to-completion
> and batch processing.  DragonFly went with a batch-lite token-passing 
> approach
> throughout their kernel.  It seems it didn't work out to the extent 
> they expected.
There are other, more successful solutions with _much_ better results 
(100x faster than our code, for example).
>
> Now many parts are moving back to the more traditional locking approach.
>
>> Currently significant number of drivers support interrupt moderation 
>> permitting several/tens/hundreds of packets to be received on interrupt.
>
> But they've also started to provide multiple queues.
Yes, but the hashing function is pre-defined, and bursty flows can still 
fall on a single CPU.
>
>> For each packet we have to run some basic checks, PFIL hooks, netisr 
>> code, l3 code resulting in many locks being acquired/released per 
>> each packet.
>
> Right, on the other hand you'll likely run into serious interlock and 
> latency
> issues when large batches of packets monopolize certain locks 
> preventing other
> interfaces from sending their batches up.
>
>> Typically we rely on NIC to put packet in given queue (direct isr), 
>> which works bad for non-hashable types of traffic like gre, PPPoE, 
>> MPLS. Additionally, hashing function is either standard (from M$ 
>> NDIS) or documented permitting someone malicious to generate 
>> 'special' traffic matching single queue.
>
> Malicious traffic is always a problem, no matter how many queues you 
> have.
>
>> Currently even if we can add an m2flowid/m2cpu function able to hash, 
>> say, gre or MPLS, it is inefficient since we have to lock/unlock 
>> netisr queues for every packet.
>
> Yes, however I'm arguing that our locking strategy may be broken or 
> sub-optimal.
Various solutions report, say, 50 Mpps (or a scalable 10-15 Mpps per core) 
of IPv4 forwarding. Currently the stock kernel can do ~1 Mpps. There are, 
of course, other reasons (structure alignment, radix, ARP code), but the 
difference is _too_ huge.
>
>> I'm thinking of
>> * utilizing m_nextpkt field in mbuf header
>
> OK.  That's what it is there for.
>
>> * adding some nh_chain flag to netisr
>> If a given netisr does not support the flag and nextpkt is not null, we 
>> simply call that netisr in a loop.
>> * netisr hash function accepts an mbuf 'chain' and a pointer to an array 
>> (sizeof N * ptr), sorts the mbufs into N netisr queues, saving the list 
>> heads to the supplied array. After that we put the lists on the 
>> appropriate queues.
>> * teach ethersubr RX code to deal with mbuf chains (not an easy one)
>> * add some partial support of handling chains to fastfwd code
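To make the hash/sort step above a bit more concrete, a rough userland-style 
sketch (all names are made up, this is not a patch): the hash function walks 
the incoming chain once and builds one list per destination queue, so the 
caller takes each netisr queue lock once per batch instead of once per packet.

#include <stdint.h>
#include <stddef.h>

#define NQUEUES 4               /* illustration only */

struct mbuf {                   /* minimal stand-in; m_nextpkt as in the real mbuf */
    struct mbuf *m_nextpkt;
};

struct mbuf_list {
    struct mbuf *head, *tail;
};

/*
 * Walk the chain once and classify each packet.  'hash_mbuf' stands in
 * for whatever m2cpu/m2flowid function is used (e.g. hashing the MPLS
 * label stack or the GRE key).  The caller then appends each non-empty
 * list to its netisr queue under a single lock acquisition per queue.
 */
static void
sort_chain(struct mbuf *chain, struct mbuf_list lists[NQUEUES],
    uint32_t (*hash_mbuf)(const struct mbuf *))
{
    struct mbuf *m, *next;

    for (m = chain; m != NULL; m = next) {
        next = m->m_nextpkt;
        m->m_nextpkt = NULL;

        struct mbuf_list *l = &lists[hash_mbuf(m) % NQUEUES];
        if (l->tail != NULL)
            l->tail->m_nextpkt = m;
        else
            l->head = m;
        l->tail = m;
    }
}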
>
> I really don't think this is going to help much.  You're just adding a 
> lot of latency and context switches to the whole packet path.  Also 
> you're making it much more complicated.
>
> The interface drivers and how they manage the boundary between RX ring 
> and
> the stack is not optimal yet.  I think there's a lot of potential 
> there.  In
> my tcp_workqueue branch I started to experiment with a couple of 
> approaches.
> It's not complete yet though.
>
> The big advantage of having the interface RX thread pushing the 
> packets is
> that it provides a natural feedback loop regarding system load. Once you
> have more packets coming in than you can process, the RX dma ring gets
> naturally starved and the load is stabilized on the input side preventing
I see no difference here.
>
> a live-lock that can easily happen in batch mode.  Only a well-adjusted
> driver works properly though and we don't have any yet in that regard.
That's true.
>
>
> Before we start to invent complicated mbuf batching methods let's make 
> sure that the single packet path is at its maximal possible efficiency. 
> And only then evaluate more complicated approaches on whether they 
> deliver additional gains.
>
> From that follows that we should:
>
>  1. fix longest prefix match radix to minimize cache misses.
Personally I think that 'fix' means rewriting this entirely.
For example, a common solution for IPv6 lookup is to use the fact that you 
have either /64-or-wider routes or host routes, so the radix lookup is done 
on the first 64 bits (and a given element can hold another tree if there 
are several more-specific routes).
That's why I'm talking about a RIB/FIB approach (one common 'academic' 
implementation for control, and family-dependent efficient lookup code).
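Roughly, the idea looks like this (illustrative C only, all names invented, 
the two lookup stages are just stand-in declarations): look the first 64 
bits up in the primary structure, and descend into a per-entry subtree only 
when more-specific routes exist.

#include <stdint.h>

struct nexthop;                 /* opaque for this sketch */

struct v6_leaf {
    struct nexthop *nh;         /* result for the /64-or-wider match */
    void           *subtree;    /* non-NULL only if more-specific routes exist */
};

/* Stand-ins for the two lookup stages. */
struct v6_leaf *lookup64(const uint8_t hi64[8]);
struct nexthop *lookup_subtree(void *subtree, const uint8_t lo64[8]);

/*
 * Two-stage IPv6 lookup: most lookups terminate after the first 64-bit
 * stage; the second stage runs only for the (rare) prefixes that carry
 * more-specific or host routes.
 */
struct nexthop *
v6_lookup(const uint8_t addr[16])
{
    struct v6_leaf *leaf = lookup64(addr);

    if (leaf == NULL)
        return (NULL);
    if (leaf->subtree != NULL)
        return (lookup_subtree(leaf->subtree, addr + 8));
    return (leaf->nh);
}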

We also need to fix the fundamental rte usage paradigm; currently it is 
unsafe for both the ingress and egress interfaces.
>
>  2. fix drivers to optimize RX dequeuing and TX enqueuing.
>
>  3. have a critical look at other parts of the packet path to avoid
For IPv4 fastforwarding we take, per packet:
1) RX ring mtx lock,
2) (BPF rlock),
3) (L3 PFIL in, FW lock),
4) ifaddr rlock,
5) radix rlock,
6) rte mtx_lock (twice by default),
7) (L3 PFIL out, FW lock),
8) ARP rlock,
9) ARP entry rlock,
10) TX ring lock?
(And +2 rlocks for VLAN, and another 2 for LAGG.)

There was an RX ring lock/unlock thread for 8299 that ended with nothing.
I've changed BPF from 2 mtx_locks to 1 rlock.
There are patches permitting IPFW to use the PFIL lock (however, they are 
not committed due to the possibility that we can make PFIL lockless).
I've removed 1 rte mtx_lock (dynamic routes, turned off if forwarding is ON).
I'm working on an ARP stack rewrite: the first stage enables L2 multipath 
(lagg-aware) and removes the ARP entry lock from the forwarding path; the 
second stage stores a pointer to the full L2 prepend header in the rte 
(yes, that was removed in 7.x).
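For that second stage, roughly what I have in mind (illustrative only, 
field names invented): the forwarding path copies a pre-built Ethernet 
header cached in the route entry instead of taking the ARP locks per packet.

#include <stdint.h>
#include <string.h>

#define ETHER_HDR_LEN 14

struct rte_l2_cache {
    int     valid;                      /* cleared when the ARP entry changes */
    uint8_t prepend[ETHER_HDR_LEN];     /* dst MAC, src MAC, ethertype */
};

/*
 * Fast path: if the route entry carries a valid prepend, copy it in
 * front of the payload and transmit without touching the ARP tables.
 * Returns 0 on success, -1 if the slow path (ARP resolution) is needed.
 */
static int
prepend_l2(const struct rte_l2_cache *c, uint8_t *frame_hdr)
{
    if (!c->valid)
        return (-1);                    /* fall back to ARP lookup + lock */
    memcpy(frame_hdr, c->prepend, ETHER_HDR_LEN);
    return (0);
}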

I'm a bit stuck for other ideas to eliminate the remaining locks (apart 
from simply moving the forwarding code to a userland netmap-based solution, 
as others do).
>
>
