Date:      Tue, 24 Sep 2013 11:42:24 +0200
From:      Marko Zec <zec@fer.hr>
To:        <freebsd-net@freebsd.org>
Cc:        Joe Holden <lists@rewt.org.uk>
Subject:   Re: Network stack changes
Message-ID:  <201309241142.24542.zec@fer.hr>
In-Reply-To: <5241519C.9040908@rewt.org.uk>
References:  <521E41CB.30700@yandex-team.ru> <201309240958.06172.zec@fer.hr> <5241519C.9040908@rewt.org.uk>

On Tuesday 24 September 2013 10:47:24 Joe Holden wrote:
> On 24/09/2013 08:58, Marko Zec wrote:
> > On Tuesday 24 September 2013 00:46:46 Sami Halabi wrote:
> >> Hi,
> >>
> >>> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
> >>> http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
> >>
> >> I've tried the diff on 10-CURRENT; it applied cleanly, but I got errors
> >> compiling the new kernel... Is there any work underway to make it build?
> >> I'd love to test it.
> >
> > Even if you made it compile on CURRENT, you could only run synthetic
> > tests measuring lookup performance using streams of random keys, as
> > outlined in the paper (btw, the paper at Luigi's site is an older
> > draft; the final version with slightly revised benchmarks is available
> > here:
> > http://www.sigcomm.org/sites/default/files/ccr/papers/2012/October/2378956-2378961.pdf)
> >
> > I.e. the code only hooks into the routing API for testing purposes, but
> > is completely disconnected from the forwarding path.
>
> aha!  How much work would it be to enable it to be used?

The inefficiency of our forwarding path is caused by so many factors
combined (as others have pointed out previously in this thread) that simply
plugging DXR into it wouldn't make a noticeable difference, except perhaps
in scenarios with full BGP views loaded into the FIB, where our radix tree
lookups can cause extreme cache thrashing.
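
To illustrate why a DXR-style lookup behaves so differently from a radix
walk under a full view: the D16R variant from the paper resolves most
destinations with a single direct-table access indexed by the upper 16 bits
of the address, and falls back to a binary search over a short range table
otherwise.  A rough userspace sketch of that idea follows; the structures
and field names here are mine, not taken from the actual patch:

#include <stdint.h>

struct dxr_range {
	uint16_t start;		/* lower 16 bits where this range begins */
	uint16_t nexthop;	/* next-hop index for the range */
};

struct dxr_direct {
	uint32_t is_range:1;	/* 0: nexthop is final, 1: consult range table */
	uint32_t nexthop:15;
	uint32_t base:16;	/* index of the first range entry for this chunk */
	uint16_t nranges;	/* number of range entries for this chunk */
};

static uint16_t
dxr_lookup(const struct dxr_direct *direct, const struct dxr_range *ranges,
    uint32_t dst)
{
	const struct dxr_direct *d = &direct[dst >> 16];
	uint16_t low = dst & 0xffff;
	uint32_t lo, hi, mid;

	if (!d->is_range)
		return (d->nexthop);	/* resolved with one table access */

	/* Binary search the (typically short) range list for this chunk. */
	lo = 0;
	hi = d->nranges;
	while (hi - lo > 1) {
		mid = (lo + hi) / 2;
		if (ranges[d->base + mid].start <= low)
			lo = mid;
		else
			hi = mid;
	}
	return (ranges[d->base + lo].nexthop);
}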

Once we come closer to lockless operation from the RX to the TX queue, start
processing packets and allocating/freeing buffers in batches, and simplify
mbufs or replace them entirely with a more efficient abstraction, then
replacing radix tree lookups with DXR or something else might start to make
sense.
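
As a rough illustration of the batching direction, here is a minimal
userspace sketch (plain C with pthreads standing in for kernel locks;
fib_lock, fib_lookup() and struct pkt are placeholders, not existing
interfaces): the point is to amortize one lock acquisition over a whole RX
burst instead of paying it per packet.

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct pkt {
	uint32_t dst;		/* IPv4 destination, host byte order */
	uint16_t nexthop;	/* filled in by the batched lookup */
};

static pthread_rwlock_t fib_lock = PTHREAD_RWLOCK_INITIALIZER;

static uint16_t
fib_lookup(uint32_t dst)
{
	return ((uint16_t)(dst >> 24));	/* trivial stub standing in for the FIB */
}

static void
forward_batch(struct pkt *batch, size_t n)
{
	size_t i;

	/* One lock acquisition per RX burst instead of one per packet. */
	pthread_rwlock_rdlock(&fib_lock);
	for (i = 0; i < n; i++)
		batch[i].nexthop = fib_lookup(batch[i].dst);
	pthread_rwlock_unlock(&fib_lock);

	/* Batched TX enqueue of the whole burst would follow here. */
}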

Marko

> > We have a prototype in the works which combines DXR with netmap in
> > userspace and is capable of sustaining forwarding well above line rate
> > with full-sized BGP views, using Intel 10G cards on commodity multicore
> > machines. The work somewhat stalled during the summer, but I plan to
> > wrap it up and release the code by the end of this year.  With recent
> > advances in netmap it might also be feasible to merge DXR and netmap
> > entirely inside the kernel, but I've not explored that path yet...
>
> mmm, forwarding using netmap would be pretty awesome...
>
> > Marko
> >
> >> Sami
> >>
> >>
> >> On Sun, Sep 22, 2013 at 11:12 PM, Alexander V. Chernikov
> >> <melifaro@yandex-team.ru> wrote:
> >>> On 29.08.2013 15:49, Adrian Chadd wrote:
> >>>> Hi,
> >>>
> >>> Hello Adrian!
> >>> I'm very sorry for the looong reply.
> >>>
> >>>> There's a lot of good stuff to review here, thanks!
> >>>>
> >>>> Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless
> >>>> to keep locking things like that on a per-packet basis. We should be
> >>>> able to do this in a cleaner way - we can defer RX into a CPU pinned
> >>>> taskqueue and convert the interrupt handler to a fast handler that
> >>>> just schedules that taskqueue. We can ignore the ithread entirely
> >>>> here.
> >>>>
> >>>> What do you think?
> >>>
> >>> Well, it sounds good :) But performance numbers and Jack's opinion are
> >>> more important :)
> >>>
> >>> Are you going to Malta?
> >>>
> >>>> Totally pie in the sky handwaving at this point:
> >>>>
> >>>> * create an array of mbuf pointers for completed mbufs;
> >>>> * populate the mbuf array;
> >>>> * pass the array up to ether_demux().
> >>>>
> >>>> For VLAN handling, it may end up populating its own list of mbufs to
> >>>> push up to ether_demux(). So maybe we should extend the API to have
> >>>> a bitmap of packets to actually handle from the array, so we can
> >>>> pass up a larger array of mbufs, note which ones are for the
> >>>> destination, and then the upcall can mark which frames it has consumed.
> >>>>
> >>>> I specifically wonder how much work/benefit we may see by doing:
> >>>>
> >>>> * batching packets into lists so various steps can batch process
> >>>> things rather than run to completion;
> >>>> * batching the processing of a list of frames under a single lock
> >>>> instance - eg, if the forwarding code could do the forwarding lookup
> >>>> for 'n' packets under a single lock, then pass that list of frames
> >>>> up to inet_pfil_hook() to do the work under one lock, etc, etc.
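
(For reference, the "array of mbufs plus a bitmap of consumed frames" API
sketched above could look roughly like the following; all types and names
here are made up for illustration, and this is not the existing
ether_demux() interface:)

#include <stdint.h>

struct mbuf;				/* opaque here */

struct rx_batch {
	struct mbuf	*pkts[64];	/* completed mbufs from the RX ring */
	uint64_t	 consumed;	/* bit i set once pkts[i] has been taken */
	int		 count;		/* number of valid entries in pkts[] */
};

/* The upcall marks the frames it consumed by setting bits in b->consumed. */
typedef void demux_batch_fn(struct rx_batch *b);

static void
rx_deliver(struct rx_batch *b, demux_batch_fn *demux)
{
	int i;

	demux(b);		/* e.g. a batched VLAN or ether demux layer */

	/* Frames the upcall left unconsumed fall back to the caller. */
	for (i = 0; i < b->count; i++) {
		if ((b->consumed & (1ULL << i)) == 0) {
			/* drop, or hand to the next handler in line */
		}
	}
}
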
> >>>
> >>> I'm thinking the same way, but we're stuck with the 'forwarding lookup'
> >>> due to the problem with the egress interface pointer, as I mentioned
> >>> earlier. However, it would be interesting to see how much it helps,
> >>> regardless of locking.
> >>>
> >>> Currently I'm thinking that we should try to change the radix code to
> >>> something different (it seems that this can be tested quickly) and see
> >>> what happens. Luigi's performance numbers for our radix are awful,
> >>> and there is a patch implementing an alternative trie:
> >>> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
> >>> http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
> >>>
> >>>> Here, the processing would look less like "grab lock and process to
> >>>> completion" and more like "mark and sweep" - ie, we have a list of
> >>>> frames that we mark as needing processing and mark as having been
> >>>> processed at each layer, so we know where to next dispatch them.
> >>>>
> >>>> I still have some tool coding to do with PMC before I even think
> >>>> about tinkering with this as I'd like to measure stuff like
> >>>> per-packet latency as well as top-level processing overhead (ie,
> >>>> CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC
> >>>> interrupts on that core, etc.)
> >>>
> >>> That will be great to see!
> >>>
> >>>> Thanks,
> >>>>
> >>>>
> >>>>
> >>>> -adrian
> >>>
> >>
> >> --
> >> Sami Halabi
> >> Information Systems Engineer
> >> NMS Projects Expert
> >> FreeBSD SysAdmin Expert
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"