Date:      Wed, 28 Aug 2013 12:37:10 -0700
From:      Jack Vogel <jfvogel@gmail.com>
To:        "Alexander V. Chernikov" <melifaro@yandex-team.ru>
Cc:        Adrian Chadd <adrian@freebsd.org>, Andre Oppermann <andre@freebsd.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org>, FreeBSD Net <net@freebsd.org>, Luigi Rizzo <luigi@freebsd.org>, "Andrey V. Elsukov" <ae@freebsd.org>, Gleb Smirnoff <glebius@freebsd.org>, freebsd-arch@freebsd.org
Subject:   Re: Network stack changes
Message-ID:  <CAFOYbcnbcp4z60SeDXTQ%2BacPGC55DCYfhZZuRvHvu7HhyWTang@mail.gmail.com>
In-Reply-To: <521E41CB.30700@yandex-team.ru>
References:  <521E41CB.30700@yandex-team.ru>

Very interesting material, Alexander. I've only had time to glance at it now;
I will look at it in more depth later, thanks!

Jack



On Wed, Aug 28, 2013 at 11:30 AM, Alexander V. Chernikov <
melifaro@yandex-team.ru> wrote:

> Hello list!
>
> There are constantly recurring discussions related to networking stack
> performance/changes.
>
> I'll try to summarize current problems and possible solutions from my
> point of view.
> (Generally this is one problem: the stack is slooooooooooooooooooooooooooow,
> but we need to know why and what to do about it).
>
> Let's start with the current IPv4 packet flow on a typical router:
> http://static.ipfw.ru/images/freebsd_ipv4_flow.png
>
> (I'm sorry I can't provide this as text, since Visio doesn't have any
> 'ascii-art' exporter).
>
> Note that we are using a process-to-completion model, i.e. we process any
> packet in the ISR until it is either
> consumed by the L4+ stack, dropped, or put on an egress NIC queue.
>
> (There is also a deferred ISR model implemented inside netisr, but it does
> not change much:
> it can help to do more fine-grained hashing (for GRE or other similar
> traffic), but
> 1) it uses per-packet mutex locking, which kills all performance
> 2) it currently does not have _any_ hashing functions (see the absence of
> flags in `netstat -Q`)
> People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or a
> modified PPPoE/GRE version) report some gain, but without fixing (1) it
> can't help much
> )
>
> So, let's start:
>
> 1) Ixgbe uses a mutex to protect each RX ring, which is perfectly fine since
> there is nearly no contention
> (the only thing that can happen is driver reconfiguration, which is rare
> and, more significantly, we do this once
> for the batch of packets received in a given interrupt). However, due to
> some (im)possible deadlocks the current code
> does a per-packet ring unlock/lock (see ixgbe_rx_input()).
> There was a discussion that ended with nothing:
> http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html
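>
> A rough sketch of what I mean (ixgbe_refill_and_get() is an invented
> placeholder, not the actual driver code): keep the ring lock across the
> whole batch, queue the mbufs locally, and only hand them to the stack
> after a single unlock:
>
>         /*
>          * Sketch only: collect the batch under the RX ring lock,
>          * push it upstream after unlocking exactly once.
>          */
>         struct mbuf *m, *head = NULL, **tail = &head;
>         int budget = 128;               /* packets per interrupt pass */
>
>         IXGBE_RX_LOCK(rxr);
>         while (budget-- > 0 &&
>             (m = ixgbe_refill_and_get(rxr)) != NULL) {
>                 *tail = m;              /* queue locally, no per-packet unlock */
>                 tail = &m->m_nextpkt;
>         }
>         IXGBE_RX_UNLOCK(rxr);
>
>         while ((m = head) != NULL) {    /* now push the whole batch upstream */
>                 head = m->m_nextpkt;
>                 m->m_nextpkt = NULL;
>                 (*ifp->if_input)(ifp, m);
>         }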
>
> 1*) Possible BPF users. Here we take one rlock if there are any readers
> present
> (and a mutex for any matching packets), but this is more or less OK.
> Additionally, there is WIP to implement multiqueue BPF,
> and there is a chance that we can reduce lock contention there. There is
> also an "optimize_writers" hack permitting applications
> like CDP to use BPF as writers without registering them as receivers
> (which would imply the rlock).
>
> 2/3) Virtual interfaces (laggs/vlans over lagg and other similar
> constructions).
> Currently we simply take an rlock to do s/ix0/lagg0/ and, even funnier,
> we use a complex vlan_hash with another rlock to
> get the vlan interface from the underlying one.
>
> This is definitely not how things should be done, and it can be changed
> more or less easily.
>
> There are some useful terms/techniques in the world of software/hardware
> routing: they have a clear 'control plane' and 'data plane' separation.
> The former deals with control traffic (IGP, MLD, IGMP snooping, lagg
> hellos, ARP/NDP, etc.) and some data traffic (packets with TTL=1, with
> options, destined to hosts without an ARP/NDP record, and similar). The
> latter is done in hardware (or an efficient software implementation).
> The control plane is responsible for providing the data needed for
> efficient data-plane operation. This is the point we are missing nearly
> everywhere.
>
> What I want to say is: lagg is pure control-plane stuff and vlan is nearly
> the same. We can't apply this approach to complex cases like
> lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0),
> but we definitely can do it for the most common setups like (igb* or ix* in
> a lagg, with or without vlans on top of the lagg).
>
> We already have some capabilities like VLANHWFILTER/VLANHWTAG, and we can
> add some more. We even have per-driver hooks to program HW filtering.
>
> One small step is to throw the packet to the vlan interface directly (P1);
> proof-of-concept (working in production):
> http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html
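>
> Roughly, the (P1) idea in the driver's input path looks like this
> (vlan_ifp_from_tag() is a made-up placeholder for a lookup table that the
> VLANHWFILTER hooks could maintain; this is a sketch, not the patch itself):
>
>         if (m->m_flags & M_VLANTAG) {
>                 struct ifnet *vifp;
>
>                 /* NIC already extracted the tag; resolve the vlan ifnet once */
>                 vifp = vlan_ifp_from_tag(ifp, m->m_pkthdr.ether_vtag);
>                 if (vifp != NULL) {
>                         m->m_flags &= ~M_VLANTAG;
>                         m->m_pkthdr.rcvif = vifp;
>                         /* skip the trunk + vlan_hash rlock path entirely */
>                         (*vifp->if_input)(vifp, m);
>                         return;
>                 }
>         }
>         (*ifp->if_input)(ifp, m);       /* fall back to the normal path */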
>
> Another is to change lagg packet accounting:
> http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
> Again, this is more like what HW boxes do (aggregate all counters, including
> errors), and I can't imagine what real error we could get from _lagg_.
>
> 4) If we are a router, we can do either the slooow ip_input() ->
> ip_forward() -> ip_output() cycle or use the optimized ip_fastfwd(), which
> falls back to the 'slow' path for multicast/options/local traffic (i.e. it
> works exactly like the 'data plane' part).
> (Btw, we can consider turning net.inet.ip.fastforwarding on by default, at
> least for non-IPSEC kernels.)
>
> Here we have to determine whether this is a local packet or not, i.e.
> F(dst_ip) returning 1 or 0. Currently we simply use a standard rlock + hash
> of iface addresses.
> (And some consumers like ipfw(4) do the same, but without the lock.)
> We don't need to do this! We can build a sorted array of IPv4 addresses (or
> another efficient structure) on every address change and use it unlocked,
> with delayed garbage collection (proof-of-concept attached).
> (There is another thing to discuss: maybe we can do this once somewhere in
> ip_input() and mark the mbuf as 'local/non-local'?)
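>
> To illustrate the idea (names invented, simplified; the real thing would
> live in netinet and be rebuilt from address-change events):
>
>         /*
>          * Sketch: sorted array of local addresses, searched without locks.
>          * On every address change a new copy is built and published; the
>          * old copy is freed only after a grace period (the delayed-GC part).
>          */
>         struct laddr_tbl {
>                 size_t          count;
>                 in_addr_t       addr[];         /* sorted */
>         };
>
>         static struct laddr_tbl *volatile laddr_cur;
>
>         static int
>         in_localip_fast(in_addr_t dst)
>         {
>                 struct laddr_tbl *t = laddr_cur;        /* unlocked snapshot */
>                 size_t lo = 0, hi = t->count;
>
>                 while (lo < hi) {                        /* binary search */
>                         size_t mid = (lo + hi) / 2;
>
>                         if (t->addr[mid] == dst)
>                                 return (1);
>                         if (t->addr[mid] < dst)
>                                 lo = mid + 1;
>                         else
>                                 hi = mid;
>                 }
>                 return (0);
>         }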
>
> 5, 9) Currently we have L3 ingress/egress PFIL hooks protected by rmlocks.
> This is OK.
>
> However, 6) and 7) are not.
> The firewall can use the same pfil lock as reader protection without
> imposing its own lock; the pfil and ipfw code is currently ready to do this.
>
> 8) Radix/rt* API. This is probably the worst place in the entire stack. It
> is too generic, too slow, and buggy (do you use IPv6? then you definitely
> know what I'm talking about).
> A) It really is too generic, and the assumption that it can be used
> (effectively) for every family is wrong. Two examples:
> we don't need to look up all 128 bits of an IPv6 address. Subnets with masks
> >/64 are not widely used (actually the only reason to use them is p2p
> links, due to potential ND problems).
> One common solution is to look up 64 bits and build another trie (or
> other structure) in case of collision.
> Another example is MPLS, where we can simply do a direct array lookup based
> on the ingress label.
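>
> For the MPLS case the whole lookup can be as dumb as this (struct mpls_nhop
> is a placeholder; a real implementation would size the table to the
> configured label space instead of the full 2^20):
>
>         #define MPLS_LABEL_MAX  (1 << 20)       /* 20-bit label space */
>
>         static struct mpls_nhop *mpls_ilm[MPLS_LABEL_MAX]; /* ingress label map */
>
>         static inline struct mpls_nhop *
>         mpls_lookup(uint32_t label)
>         {
>                 return (mpls_ilm[label & (MPLS_LABEL_MAX - 1)]);
>         }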
>
> B) It is terribly slow (AFAIR luigi@ did some performance measurements;
> numbers are available in one of the netmap PDFs).
> C) It is not multipath-capable. Stateful (and non-working) multipath is
> definitely not the right way.
>
> 8*) rtentry
> We are doing it wrong.
> Currently _every_ lookup locks/unlocks a given rte twice.
> The first lock is related to an old, old story about trusting IP redirects
> (and auto-adding host routes for them). Fortunately it is now disabled
> automatically when you turn forwarding on.
> The second one is much more complicated: we assume that rte's with a
> non-zero refcount can stop the egress interface from being destroyed.
> This is a wrong (but widely used) assumption.
>
> We can use delayed GC instead of locking for rte's, and this won't break
> things more than they are broken now (patch attached).
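>
> The delayed-GC idea, very roughly (this is an illustration of the concept,
> not the attached patch; the retire list itself would need a mutex or
> per-CPU lists, which is fine since it is off the hot path):
>
>         struct rt_gc_item {
>                 SLIST_ENTRY(rt_gc_item) link;
>                 struct rtentry          *rt;
>         };
>         static SLIST_HEAD(, rt_gc_item) rt_retired =
>             SLIST_HEAD_INITIALIZER(rt_retired);
>
>         static void
>         rt_retire(struct rtentry *rt)
>         {
>                 struct rt_gc_item *it;
>
>                 /* instead of freeing the unlinked rte right away... */
>                 it = malloc(sizeof(*it), M_TEMP, M_WAITOK);
>                 it->rt = rt;
>                 SLIST_INSERT_HEAD(&rt_retired, it, link);
>                 /*
>                  * ...a callout/taskqueue frees everything on rt_retired
>                  * after a grace period, i.e. once no data-path thread can
>                  * still be dereferencing an old pointer.
>                  */
>         }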
> We can't do the same for ifp structures since
> a) virtual ones can assume some state in the underlying physical NIC
> b) physical ones just _can_ be destroyed (regardless of whether the user
> wants this or not, e.g. an SFP being unplugged from the NIC) or can simply
> lead to a kernel crash due to SW/HW inconsistency.
>
> One possible solution is to implement stable refcounts based on PCPU
> counters and apply those counters to ifp, but this seems to be non-trivial.
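>
> Something along these lines, using the counter(9) facility from head
> (a sketch; the hard part is the drain/detach logic, which is not shown):
>
>         struct pcpu_ref {
>                 counter_u64_t   cnt;    /* per-CPU counter, no shared cache line */
>         };
>
>         static inline void
>         pcpu_ref_acquire(struct pcpu_ref *r)
>         {
>                 counter_u64_add(r->cnt, 1);
>         }
>
>         static inline void
>         pcpu_ref_release(struct pcpu_ref *r)
>         {
>                 counter_u64_add(r->cnt, -1);
>         }
>
>         static inline uint64_t
>         pcpu_ref_value(struct pcpu_ref *r)
>         {
>                 /* only meaningful once new acquires are blocked (detach time) */
>                 return (counter_u64_fetch(r->cnt));
>         }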
>
>
> Another rtalloc(9) problem is the fact that the radix is used as both the
> 'control plane' and the 'data plane' structure/API. Some users always want
> to put more information into the rte, while others
> want to make the rte more compact. We just need _different_ structures for
> that:
> a feature-rich, lots-of-data control-plane one (to store everything we want
> to store, including, for example, the PID of the process originating the
> route) - the current radix can be modified to do this -
> and another, address-family-dependent structure (array, trie, or anything)
> which contains _only_ the data necessary to put the packet on the wire.
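>
> Schematically (field names are purely illustrative, not a layout proposal):
>
>         /* data plane: small, read-only, ideally one cache line */
>         struct nhop_data {
>                 struct ifnet    *nd_ifp;        /* egress interface */
>                 uint16_t        nd_mtu;
>                 uint8_t         nd_flags;
>                 uint8_t         nd_hdrlen;      /* valid bytes in nd_hdr */
>                 char            nd_hdr[64];     /* prebuilt L2 header to prepend */
>         };
>
>         /* control plane: as fat as we like, never touched per packet */
>         struct rt_ctl {
>                 struct sockaddr  *rc_dst;
>                 struct sockaddr  *rc_gateway;
>                 struct nhop_data *rc_nhop;      /* what the data plane sees */
>                 pid_t            rc_origin_pid; /* e.g. who installed the route */
>                 /* ...statistics, timestamps, protocol metadata, etc. */
>         };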
>
> 11) arpresolve. Currently (this was decoupled in 8.x) we take
> a) an ifaddr rlock
> b) an lle rlock.
>
> We don't need those locks.
> We need to
> a) make the lle layer per-interface instead of global (this can also solve
> the issue of multiple FIBs and L2 mappings being done in fib 0)
> b) use the rtalloc(9)-provided lock instead of separate locking
> c) actually, we need to rewrite this layer, because
> d) lle is actually the place to do real multipath:
>
> briefly,
> you have an rte pointing to some special nexthop structure pointing to an
> lle, which has the following data:
> num_of_egress_ifaces: [ifindex1, ifindex2, ifindex3] | L2 data to prepend
> to the header (see the sketch below).
> A separate post will follow.
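>
> As a teaser, the nexthop could look roughly like this (layout and names are
> invented here; the real proposal will be in that post):
>
>         struct mpath_nhop {
>                 uint32_t        n_legs;
>                 struct {
>                         uint16_t ifindex;       /* egress interface */
>                         uint8_t  hdrlen;
>                         char     hdr[32];       /* L2 data to prepend */
>                 } leg[4];
>         };
>
>         static inline int
>         mpath_select(const struct mpath_nhop *nh, uint32_t flowid)
>         {
>                 return (flowid % nh->n_legs);   /* same flow -> same leg */
>         }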
>
> With this in place, we can achieve lagg traffic distribution without
> actually using lagg_transmit and similar stuff (at least in the most common
> scenarios).
> (For example, TCP output can definitely benefit from this, since we can
> compute the flowid once per TCP session and use it in every mbuf.)
>
>
> So. Imagine we have done all this. How can we estimate the difference?
>
> There was a thread, started a year ago, describing 'stock' performance and
> the difference made by various modifications.
> It was done on 8.x; however, I've got similar results on recent 9.x:
>
> http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032680.html
>
> Briefly:
>
> 2 x E5645 @ Intel 82599 NIC.
> Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE,
> no firewall. Ixia XM2 (traffic generator) <> ix0 (FreeBSD). Ixia sends
> 64-byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to destinations
> in vlan11 (10.100.1.128 - 10.100.1.192). Static ARPs are configured for all
> destination addresses. The traffic level is slightly above or slightly
> below system performance.
>
> We start from 1.4 MPPS (if we are using several routes to minimize mutex
> contention).
>
> My 'current' result for the same test, on the same HW, with the following
> modifications:
>
> * 1) ixgbe per-packet ring unlock removed
> * P1) ixgbe is modified to do direct vlan input (so 2,3 are not used)
> * 4) separate lockless in_localip() version
> * 6) - using existing pfil lock
> * 7) using lockless version
> * 8) radix converted to use rmlock instead of rlock. Delayed GC is used
> instead of mutexes
> * 10) - using existing pfil lock
> * 11) using radix lock to do arpresolve(). Not using lle rlock
>
> (so the rmlocks are the only locks used on data path).
>
> Additionally, ipstat counters are converted to PCPU (no real performance
> implications).
> ixgbe does not do per-packet accounting (as in head).
> if_vlan counters are converted to PCPU.
> lagg is converted to rmlock, and per-packet accounting is removed (using
> stats from the underlying interfaces).
> The lle hash size is bumped to 1024 instead of 32 (not applicable here, but
> the small size slows things down for large L2 domains).
>
> The result is 5.6 MPPS for a single port (11 cores) and 6.5 MPPS for lagg
> (16 cores), and nearly the same with HT on and 22 cores.
>
> ..
> while Intel DPDK claims 80 MPPS (and 6WINDGate talks about 160 or so) on
> the same class of hardware, with _userland_ forwarding.
>
> One of the key features making all such products possible (DPDK, netmap,
> PacketShader, Cisco SW forwarding) is the use of batching instead of the
> process-to-completion model.
> Batching mitigates locking cost, batching does not wash out the CPU cache,
> and so on.
>
> So maybe we can consider passing batches from the NIC to at least the L2
> layer via netisr? Or even up to ip_input()?
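>
> A hypothetical batched entry point could be as simple as this (no such KPI
> exists today; drivers would link the packets from one interrupt via
> m_nextpkt and hand the chain up in a single call):
>
>         static void
>         ether_input_batch(struct ifnet *ifp, struct mbuf *chain)
>         {
>                 struct mbuf *m, *next;
>
>                 /* take per-call protection/statistics once for the batch */
>                 for (m = chain; m != NULL; m = next) {
>                         next = m->m_nextpkt;
>                         m->m_nextpkt = NULL;
>                         ether_demux(ifp, m);    /* per-packet demux still needed */
>                 }
>                 /*
>                  * release once; ideally IP packets would be queued here and
>                  * flushed to ip_input() as a batch as well
>                  */
>         }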
>
> Another question is about making some sort of reliable GC, like 'passive
> serialization' (or other similarly hard-to-pronounce words about Linux and
> lockless objects).
>
>
> P.S. The attached patches are 1) for 8.x and 2) mostly 'hacks' showing
> roughly how this can be done and what benefit can be achieved.
>
>
> _______________________________________________
> freebsd-arch@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"
>


