Date:      Mon, 21 Jul 2014 11:34:38 -0400
From:      John Jasen <jjasen@gmail.com>
To:        FreeBSD Net <freebsd-net@freebsd.org>, Navdeep Parhar <nparhar@gmail.com>
Subject:   packet forwarding and possible mitigation of Intel QuickPath Interconnect ugliness in multi cpu systems
Message-ID:  <53CD330E.6090407@gmail.com>

Executive Summary:

Appropriate use of cpuset(1) can mitigate performance bottlenecks over
the Intel QPI processor interconnect, and improve the packets-per-second
forwarding rate by over 100%.

Test Environment:

My test system is a dual-CPU Dell R820, populated with evaluation cards
graciously provided by Chelsio. Currently, each dual-port Chelsio card
is installed in an x16 slot, one physically attached to each CPU.

My load generators are 20 CentOS-based Linux systems, using Mellanox VPI
ConnectX-3 cards in Ethernet mode.

The test environment divides the load generators into 4 distinct subnets
of 5 systems. Each subnet uses one Chelsio interface as its route to the
other networks. I use iperf3 on the Linux systems to generate packets.

Each test run selects two systems on each subnet as senders and three
as receivers. The sending systems establish 4 UDP streams to each
receiver.

Results:

I started netstat -w 1 -q 100 -d before each run to collect per-second
counters.

I summarized the results with the following:
awk '{ipackets+=$1} {idrops+=$3} {opackets+=$5} {odrops+=$9}
     END {print "input " ipackets/NR, "idrops " idrops/NR,
          "opackets " opackets/NR, "odrops " odrops/NR}'
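
For anyone who would rather post-process saved output, the same averaging
can be sketched in Python. This is a rough equivalent, not the script used
for the numbers in this message. The function name summarize and the sample
rows are made up for illustration. It assumes the nine-column layout read by
the awk command (fields 1, 3, 5 and 9) and, unlike the awk version, it skips
header lines instead of counting them in the average.

```python
def summarize(lines):
    """Average input packets, input drops, output packets and output drops."""
    totals = [0.0, 0.0, 0.0, 0.0]
    columns = (0, 2, 4, 8)  # awk fields $1, $3, $5, $9 (zero-based here)
    samples = 0
    for line in lines:
        fields = line.split()
        # Skip rows with fewer than nine fields (e.g. the first header line).
        if len(fields) < 9:
            continue
        # Skip rows whose selected columns are not numbers (column headers).
        try:
            values = [float(fields[c]) for c in columns]
        except ValueError:
            continue
        totals = [t + v for t, v in zip(totals, values)]
        samples += 1
    if samples == 0:
        raise ValueError("no numeric samples found")
    names = ("input", "idrops", "opackets", "odrops")
    return {n: t / samples for n, t in zip(names, totals)}


# Made-up sample rows, for illustration only (not measured data).
sample = [
    "            input        (Total)           output",
    "   packets  errs idrops      bytes    packets  errs      bytes colls drops",
    "      1000     0    100      64000        900     0      57600     0     2",
    "      3000     0    300     192000       2700     0     172800     0     4",
]
print(summarize(sample))
```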

Without any cpuset tuning at all:

input 7.25464e+06 idrops 5.89939e+06 opackets 1.34888e+06 odrops 947.409

With cpuset assigning interrupts equally to each physical processor:

input 1.10886e+07 idrops 9.85347e+06 opackets 1.22887e+06 odrops 3384.86

cpuset assigning interrupts across cores on the first physical processor:

input 1.14046e+07 idrops 8.6674e+06 opackets 2.73365e+06 odrops 2420.75

cpuset assigning interrupts across cores on the second physical processor:

input 1.16746e+07 idrops 8.96412e+06 opackets 2.70652e+06 odrops 3076.52
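
To put the four runs side by side, the short sketch below computes the
change in forwarded packets (opackets) relative to the untuned run. It uses
only the averages reported above; the labels are shorthand chosen for this
sketch.

```python
# Average opackets per second, copied from the four results above.
opackets = {
    "no cpuset tuning": 1.34888e+06,
    "interrupts split across both processors": 1.22887e+06,
    "interrupts on first processor": 2.73365e+06,
    "interrupts on second processor": 2.70652e+06,
}

baseline = opackets["no cpuset tuning"]


def percent_change(value, reference):
    """Relative change of value against reference, in percent."""
    return (value - reference) / reference * 100.0


for label, rate in opackets.items():
    change = percent_change(rate, baseline)
    print(f"{label}: {rate:.0f} pps, {change:+.1f}% vs. untuned")
```

By this measure, keeping the interrupts on a single processor roughly
doubles the forwarded rate (about +103% and +101%), while splitting them
equally across both processors comes out about 9% below the untuned run.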

I will follow this up with results for both cards in PCIe slots
physically connected to the first CPU. As a rule-of-thumb comparison,
that configuration, with the interrupts assigned appropriately via
cpuset, was usually about 10-15% higher than the
cpuset-one-processor-low and cpuset-one-processor-high cases.

Conclusion:

The best solution for highest performance is still to avoid QPI as much
as possible, by appropriate physical placement of the PCIe cards.
However, in cases where that may not be possible or desirable, using
cpuset to assign all the interrupt affinity to one processor will help
mitigate performance loss.
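
One way to apply this is to spread the NIC interrupts round-robin over the
cores of a single processor with cpuset -l <cpu> -x <irq>. The sketch below
only builds and prints the command lines; it does not run them. The IRQ
numbers and CPU IDs are hypothetical placeholders, and the real values have
to come from the system itself (vmstat -i lists the interrupt sources, for
example). It is an illustration, not the script used for the tests above.

```python
from itertools import cycle


def cpuset_commands(irqs, cpus):
    """Pair each IRQ with a CPU, round-robin, and build cpuset(1) command lines."""
    if not cpus:
        raise ValueError("need at least one CPU")
    return [f"cpuset -l {cpu} -x {irq}" for irq, cpu in zip(irqs, cycle(cpus))]


# Hypothetical values, for illustration only.
nic_irqs = range(264, 272)        # assumed IRQ numbers of the NIC queues
first_package_cpus = range(0, 4)  # assumed CPU IDs on the first processor

for command in cpuset_commands(nic_irqs, first_package_cpus):
    print(command)
```

Passing the CPU IDs of the second processor instead gives the equivalent
of the second single-processor case.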

Credits:

Thanks to Dell for the loan of the Dell R820 used for testing; thanks
to Chelsio for the loan of the two T580-CR cards; and thanks to the
CXGBE maintainer, Navdeep Parhar, for his assistance and patience during
debugging and testing.

Feedback is always welcome.

I can provide detailed results upon request.

The scripts were provided by a vendor; I need their permission to
redistribute or publish them, but I do not think that's a problem.