Date:      Tue, 24 Sep 2013 09:55:10 -0700
From:      Vijay Singh <vijju.singh@gmail.com>
To:        "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject:   Re: Network stack changes
Message-ID:  <CALCNsJRe3GvmFHx5gUgpGnRPTp0pK6wwRBjL-E0xvqvemb6k6w@mail.gmail.com>
In-Reply-To: <201309241142.24542.zec@fer.hr>
References:  <521E41CB.30700@yandex-team.ru> <201309240958.06172.zec@fer.hr> <5241519C.9040908@rewt.org.uk> <201309241142.24542.zec@fer.hr>

Hi, Robert Watson and Adrian have asked me to share some details of an MP
design that we have at $work, based on Robert's PCBGROUP work. Here is what
I have put together.

Design of the network MP system
-----------------------------------------------

The design goals:

1. Reduce locking in the stack.
2. Do not require the use of specialized HW (NICs).
3. Be able to leverage the capabilities of modern HW.

The basic design is to introduce parallelism in the stack using the
connection groups design implemented by Robert. We have a number of
critical differences from the original design, though.

1. The overall idea is not so much to provide CPU locality in our case,
though the design doesn't necessarily preclude that. Since we run a complex
workload and have a run-to-completion model in the kernel, pinning threads
causes unpredictable latencies. We find that a thread-level CPU affinity
implementation works quite well.

2. The number of connection groups created in the system is a constant,
and as an optimization it is possible to set this number to the number of
Rx/Tx queues in the NIC, but we don't do this. I will talk more about this
later.

3. The number of threads created to service these queues is derived from
the number of CPUs in the system. The thread library we used for our design
is close to, but not quite, taskqueue(9). I can provide details if needed.
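
(Illustration only: this is not our thread library -- stock taskqueue(9)
is used as a stand-in here, and NCG, cg_tq and cg_workers_init() are
made-up names -- but the sizing described in (2) and (3) would look
roughly like this.)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/priority.h>
#include <sys/smp.h>
#include <sys/taskqueue.h>

#define NCG     128     /* fixed connection-group count, independent of NIC queues */

static struct taskqueue *cg_tq;

static void
cg_workers_init(void)
{

        /* Worker count follows the CPU count, not the NIC queue count. */
        cg_tq = taskqueue_create_fast("cg", M_WAITOK,
            taskqueue_thread_enqueue, &cg_tq);
        taskqueue_start_threads(&cg_tq, mp_ncpus, PI_NET, "cg worker");
}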

4. In this design, both input and output for a given CG are processed by
the thread currently assigned to that CG. Robert's design uses netisr, but
we found that it had two big issues. First, it handles only mbufs as the
objects that are queued. Second, and more importantly, it handles
unidirectional traffic only. We felt that the original design was
susceptible to lock contention at the connection level between the input
and output thread contexts.

5. In order to avoid running too many threads, we use a fast interrupt
handler and enqueue the interrupt work as a task for one of the threads
created for CG handling, as mentioned above.
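
(To make the interrupt path concrete, here is a minimal sketch, again with
taskqueue(9) standing in for our thread library; the rx_ring fields and the
handler names are hypothetical. The fast filter does no packet work itself,
it only schedules the CG's worker; rx_task would be TASK_INIT'ed at attach
time to point at the real Rx processing routine, and the filter would be
registered through bus_setup_intr() with a NULL ithread handler.)

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/taskqueue.h>

struct rx_ring {
        struct taskqueue        *cg_tq;         /* thread servicing this CG */
        struct task             rx_task;        /* the real Rx processing */
};

static int
cg_intr_filter(void *arg)
{
        struct rx_ring *rxr = arg;

        /* A real driver would also mask this queue's interrupt here. */
        taskqueue_enqueue(rxr->cg_tq, &rxr->rx_task);
        return (FILTER_HANDLED);
}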

6. We have changed the drivers to collect a fixed maximum number of mbufs,
linked with m_nextpkt, and pass them up. I think this idea has been
discussed previously on this list. The advantage we see is reduced locking
in the Rx routine, as well as some possible optimizations using software
prefetching.
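
(A rough sketch of what the batched Rx path looks like; rxeof_next() and
cg_input() are made-up helper names, and rx_ring is the same hypothetical
structure as in the previous sketch.)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

#define RX_BATCH        32              /* fixed maximum per upcall */

struct mbuf    *rxeof_next(struct rx_ring *);   /* driver specific, hypothetical */
void            cg_input(struct mbuf *);        /* hands the chain to the CG thread */

static void
cg_rxeof(struct rx_ring *rxr)
{
        struct mbuf *m, *head, **tailp;
        int count;

        head = NULL;
        tailp = &head;
        for (count = 0; count < RX_BATCH; count++) {
                /* Pull the next completed packet off the descriptor ring. */
                m = rxeof_next(rxr);
                if (m == NULL)
                        break;
                *tailp = m;             /* link with m_nextpkt */
                tailp = &m->m_nextpkt;
        }
        if (head != NULL)
                cg_input(head);         /* one upcall for the whole list */
}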

7. The driver also fills the RSS hash computed by the HW into the mbuf.
The code looks something like this:

                        u16 pkt_info;
                        pkt_info = (u16)
                            (le16toh(cur->wb.lower.lo_dword.hs_rss.pkt_info) &
                            IXGBE_RXDADV_RSSTYPE_MASK);
                        if ((pkt_info == IXGBE_RXDADV_RSSTYPE_IPV4_TCP) ||
                            (pkt_info == IXGBE_RXDADV_RSSTYPE_IPV6_TCP)) {
                                sendmp->m_pkthdr.flowid =
                                    cur->wb.lower.hi_dword.rss;
                                M_HASHTYPE_SET(sendmp, M_HASHTYPE_NIC);
                                sendmp->m_flags |= M_FLOWID;
                        }

Here M_HASHTYPE_NIC is something we've added.

8. As packets flow up the stack, we've added an extension to the PCBGROUP
design that allows a netisr module to provide a custom classifier
(nh_m2flow) and a dispatch routine (nh_dispatcher).

static struct netisr_handler ip_nh = {
        .nh_name = "ip",
        .nh_handler = ip_input,
        .nh_proto = NETISR_IP,
        .nh_policy = NETISR_POLICY_FLOW,
        .nh_dispatch = NETISR_DISPATCH_HYBRID,
        .nh_m2flow = bsd_cg_m2flow,
        .nh_dispatcher = bsd_cg_dispatcher,
};
9. In the m2flow routine we look at the mbuf. If the driver has supplied
the RSS hash we use it; otherwise (e.g., for the loopback path) we compute
the RSS hash in software. Something like:

                if ((m != NULL) && (m->m_flags & M_FLOWID) &&
                    (M_HASHTYPE_GET(m) == M_HASHTYPE_NIC)) {
                        if (is_v4) {
                                M_HASHTYPE_SET(m, M_HASHTYPE_RSS_TCP_IPV4);
...
                else if (m) {
                        th = (struct tcphdr *)(m->m_data + hlen);

                        if (is_v4) {
                                hash = bsd_cg_toeplitz_ipv4_hash(
                                    ip->ip_src.s_addr, ip->ip_dst.s_addr,
                                    th->th_sport, th->th_dport);
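
(For the software path, any standard Toeplitz implementation over the
4-tuple will do. The following is a generic, textbook version using the
usual default RSS key -- not our production bsd_cg_toeplitz_ipv4_hash(),
and the function names here are illustrative only.)

#include <sys/param.h>
#include <sys/systm.h>
#include <netinet/in.h>

static const uint8_t rss_key[40] = {
        0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
        0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
        0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
        0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
        0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

static uint32_t
toeplitz_hash(const uint8_t *data, int datalen)
{
        uint32_t hash, v;
        int i, b;

        /* v slides a 32-bit window across the secret key. */
        hash = 0;
        v = ((uint32_t)rss_key[0] << 24) | ((uint32_t)rss_key[1] << 16) |
            ((uint32_t)rss_key[2] << 8) | rss_key[3];
        for (i = 0; i < datalen; i++) {
                for (b = 0; b < 8; b++) {
                        if (data[i] & (0x80 >> b))
                                hash ^= v;
                        v <<= 1;
                        if (i + 4 < (int)sizeof(rss_key) &&
                            (rss_key[i + 4] & (0x80 >> b)))
                                v |= 1;
                }
        }
        return (hash);
}

static uint32_t
toeplitz_ipv4_tcp_hash(in_addr_t src, in_addr_t dst, uint16_t sport,
    uint16_t dport)
{
        uint8_t buf[12];

        /* RSS input order: src addr, dst addr, src port, dst port (network byte order). */
        memcpy(&buf[0], &src, sizeof(src));
        memcpy(&buf[4], &dst, sizeof(dst));
        memcpy(&buf[8], &sport, sizeof(sport));
        memcpy(&buf[10], &dport, sizeof(dport));
        return (toeplitz_hash(buf, sizeof(buf)));
}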

10. In netisr_select_cpuid() we convert the flow id to a CPU/CG ID using a
simple modulo with the number of connection groups. Then we invoke the
custom dispatcher to send the packets (with some batching) to the
taskqueue-like threads. The handler routine is the netisr handler of the
protocol being processed, usually IP or IPv6.

                if (m->m_flags & M_FLOWID) {
                        if (((dispatch_policy == NETISR_DISPATCH_HYBRID) ||
                             (dispatch_policy == NETISR_DISPATCH_DEFERRED)) &&
                            (npp->np_dispatcher != NULL)) {
                                *cpuidp =
                                    netisr_custom_flow2cpu(m->m_pkthdr.flowid);
                                m = npp->np_dispatcher(m, *cpuidp, proto);
                                if (m == NULL)
                                        return (NULL);
                        }
                }
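
(The flowid-to-CG conversion itself is trivial; something along these
lines, with NCG being the fixed connection-group count mentioned in (2) --
the body shown here is only illustrative.)

static u_int
netisr_custom_flow2cpu(uint32_t flowid)
{

        return (flowid % NCG);
}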

11. The hash placed in the mbuf is then used to look up the pcb in the
corresponding pcbgroup table:

static __inline u_int
in_pcbgroup_getbucket(struct inpcbinfo *pcbinfo, uint32_t hash)
{

        return (hash % pcbinfo->ipi_npcbgroups);
}

/*
 * Map a (hashtype, hash) tuple into a connection group, or NULL if the hash
 * information is insufficient to identify the pcbgroup.
 */
struct inpcbgroup *
in_pcbgroup_byhash(struct inpcbinfo *pcbinfo, u_int hashtype, uint32_t hash)
{

        if ((pcbinfo->ipi_hashfields == IPI_HASHFIELDS_4TUPLE &&
            hashtype == M_HASHTYPE_RSS_TCP_IPV4) ||
            (pcbinfo->ipi_hashfields == IPI_HASHFIELDS_2TUPLE &&
            hashtype == M_HASHTYPE_RSS_IPV4)) {
                return (&pcbinfo->ipi_pcbgroups[
                        in_pcbgroup_getbucket(pcbinfo, hash)]);
        }
        return (NULL);
}

I wanted to pass on this information as a starting point. I'll be at the
vendor summit in Nov. I think it has been suggested that network
performance be one of the discussion points. I'd be happy to participate
and provide inputs based on our own experience.

-vijay


On Tue, Sep 24, 2013 at 2:42 AM, Marko Zec <zec@fer.hr> wrote:

> On Tuesday 24 September 2013 10:47:24 Joe Holden wrote:
> > On 24/09/2013 08:58, Marko Zec wrote:
> > > On Tuesday 24 September 2013 00:46:46 Sami Halabi wrote:
> > >> Hi,
> > >>
> > >>> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
> > >>> http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
> > >>
> > >> I've tried the diff in 10-current, applied cleanly but had errors
> > >> compiling new kernel... is there any work to make it work? i'd love
> > >> to test it.
> > >
> > > Even if you'd make it compile on current, you could only run synthetic
> > > tests measuring lookup performance using streams of random keys, as
> > > outlined in the paper (btw. the paper at Luigi's site is an older
> > > draft, the final version with slightly revised benchmarks is available
> > > here:
> > > http://www.sigcomm.org/sites/default/files/ccr/papers/2012/October/2378956-2378961.pdf)
> > >
> > > I.e. the code only hooks into the routing API for testing purposes, but
> > > is completely disconnected from the forwarding path.
> >
> > aha!  How much work would it be to enable it to be used?
>
> The inefficiency of our forwarding path is caused by so many factors
> combined (as others have put out previously in this thread) that simply
> plugging DXR into it wouldn't make a noticeable difference, except probably
> in scenarios with full BGP views loaded into the FIB, where our radix tree
> lookups can cause extreme cache thrashing.
>
> Once we come closer to lockless operation from RX to TX queue, start
> processing packets and allocating / freeing buffers in batches, simplify
> mbufs or replace them entirely with a more efficient abstraction, then
> replacing radix tree lookups with DXR or something else might start making
> sense.
>
> Marko
>
> > > We have a prototype in the works which combines DXR with Netmap in
> > > userspace and is capable of sustaining well above line-rate forwarding
> > > with full-sized BGP views using Intel 10G cards on commodity multicore
> > > machines. The work was somewhat stalled during the summer but I plan to
> > > wrap it up and release the code by the end of this year.  With
> > > recent advances in netmap it might also be feasible to merge DXR and
> > > netmap entirely inside the kernel but I've not explored that path
> > > yet...
> >
> > mmm, forwarding using netmap would be pretty awesome...
> >
> > > Marko
> > >
> > >> Sami
> > >>
> > >>
> > >> On Sun, Sep 22, 2013 at 11:12 PM, Alexander V. Chernikov <
> > >>
> > >> melifaro@yandex-team.ru> wrote:
> > >>> On 29.08.2013 15:49, Adrian Chadd wrote:
> > >>>> Hi,
> > >>>
> > >>> Hello Adrian!
> > >>> I'm very sorry for the looong reply.
> > >>>
> > >>>> There's a lot of good stuff to review here, thanks!
> > >>>>
> > >>>> Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless
> > >>>> to keep locking things like that on a per-packet basis. We should be
> > >>>> able to do this in a cleaner way - we can defer RX into a CPU pinned
> > >>>> taskqueue and convert the interrupt handler to a fast handler that
> > >>>> just schedules that taskqueue. We can ignore the ithread entirely
> > >>>> here.
> > >>>>
> > >>>> What do you think?
> > >>>
> > >>> Well, it sounds good :) But performance numbers and Jack's opinion
> > >>> are more important :)
> > >>>
> > >>> Are you going to Malta?
> > >>>
> > >>>> Totally pie in the sky handwaving at this point:
> > >>>>
> > >>>> * create an array of mbuf pointers for completed mbufs;
> > >>>> * populate the mbuf array;
> > >>>> * pass the array up to ether_demux().
> > >>>>
> > >>>> For vlan handling, it may end up populating its own list of mbufs to
> > >>>> push up to ether_demux(). So maybe we should extend the API to have
> > >>>> a bitmap of packets to actually handle from the array, so we can
> > >>>> pass up a larger array of mbufs, note which ones are for the
> > >>>> destination and then the upcall can mark which frames it's consumed.
> > >>>>
> > >>>> I specifically wonder how much work/benefit we may see by doing:
> > >>>>
> > >>>> * batching packets into lists so various steps can batch process
> > >>>> things rather than run to completion;
> > >>>> * batching the processing of a list of frames under a single lock
> > >>>> instance - eg, if the forwarding code could do the forwarding lookup
> > >>>> for 'n' packets under a single lock, then pass that list of frames
> > >>>> up to inet_pfil_hook() to do the work under one lock, etc, etc.
> > >>>
> > >>> I'm thinking the same way, but we're stuck with the 'forwarding lookup'
> > >>> due to the problem with the egress interface pointer, as I mentioned
> > >>> earlier. However, it would be interesting to see how much it helps,
> > >>> regardless of locking.
> > >>>
> > >>> Currently I'm thinking that we should try to change radix to
> > >>> something different (it seems that it can be checked fast) and see
> > >>> what happens. Luigi's performance numbers for our radix are too
> > >>> awful, and there is a patch implementing an alternative trie:
> > >>> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
> > >>> http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
> > >>>
> > >>>> Here, the processing would look less like "grab lock and process to
> > >>>> completion" and more like "mark and sweep" - ie, we have a list of
> > >>>> frames that we mark as needing processing and mark as having been
> > >>>> processed at each layer, so we know where to next dispatch them.
> > >>>>
> > >>>> I still have some tool coding to do with PMC before I even think
> > >>>> about tinkering with this as I'd like to measure stuff like
> > >>>> per-packet latency as well as top-level processing overhead (ie,
> > >>>> CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NI=
C
> > >>>> interrupts on that core, etc.)
> > >>>
> > >>> That will be great to see!
> > >>>
> > >>>> Thanks,
> > >>>>
> > >>>>
> > >>>>
> > >>>> -adrian
> > >>>
> > >>
> > >> --
> > >> Sami Halabi
> > >> Information Systems Engineer
> > >> NMS Projects Expert
> > >> FreeBSD SysAdmin Expert
> > >>
> > >>
>
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>


