From: John Jasen
Date: Mon, 21 Jul 2014 13:16:05 -0400
To: FreeBSD Net <freebsd-net@freebsd.org>
Subject: Re: packet forwarding and possible mitigation of Intel QuickPath Interconnect ugliness in multi cpu systems

For completeness, the following are numbers for putting both dual-port
cards on PCIe lanes attached to physical CPU0.

I'll note that the numbers for cpuset-1cpu-low are skewed: during
testing, over 3.5 million packets per second were observed via netstat.

I'll also note that, in this configuration, the untuned case and the
case with cpuset spread across all available logical processors came in
at around 1 million and 887k packets per second forwarded, respectively.

$ test3.cpuset-1cpu-low
input 1.17679e+07 idrops 9.06871e+06 opackets 2.69356e+06 odrops 5577.45

$ test3.cpuset-1cpu-high
input 1.34393e+07 idrops 1.05269e+07 opackets 2910347 odrops 1943.65
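For anyone wanting to replicate the pinning: cpuset(1)'s -x flag sets
the affinity of a single IRQ. Something along these lines should be
close (a rough sketch, not the exact script used here -- the t5nex
interrupt-name match and the assumption that cores 0-7 sit on the first
physical package are specific to this box; check vmstat -ai and your
CPU topology before copying it):

#!/bin/sh
# Pin each Chelsio (t5nex) interrupt round-robin across cores 0-7,
# i.e. one physical package on this R820 (assumption: cores 0-7 = CPU0).
core=0
for irq in $(vmstat -ai | awk '/t5nex/ { sub(/^irq/, "", $1); sub(/:$/, "", $1); print $1 }'); do
    cpuset -l $core -x $irq        # bind this IRQ to a single core
    core=$(( (core + 1) % 8 ))
done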
On 07/21/2014 11:34 AM, John Jasen wrote:
> Executive Summary:
>
> Appropriate use of cpuset(1) can mitigate performance bottlenecks over
> the Intel QPI processor interconnect, and improve the packets-per-second
> processing rate by over 100%.
>
> Test Environment:
>
> My test system is a Dell dual-CPU R820, populated with evaluation cards
> graciously provided by Chelsio. Currently, each dual-port Chelsio card
> sits in a 16x slot, one physically attached to each CPU.
>
> My load generators are 20 CentOS-based Linux systems, using Mellanox VPI
> ConnectX-3 cards in ethernet mode.
>
> The test environment divides the load generators into 4 distinct subnets
> of 5 systems each, with each subnet using a Chelsio interface as its
> route to the other networks. I use iperf3 on the Linux systems to
> generate packets.
>
> The test runs select two systems on each subnet to be senders, and
> three on each to be receivers. The sending systems establish 4 UDP
> streams to each receiver.
>
> Results:
>
> I ran netstat -w 1 -q 100 -d before each run, and summarized the
> results with the following awk script:
>
> awk '{ipackets+=$1} {idrops+=$3} {opackets+=$5} {odrops+=$9} END {print
> "input " ipackets/NR, "idrops " idrops/NR, "opackets " opackets/NR,
> "odrops " odrops/NR}'
>
> Without any cpuset tuning at all:
>
> input 7.25464e+06 idrops 5.89939e+06 opackets 1.34888e+06 odrops 947.409
>
> With cpuset assigning interrupts equally to each physical processor:
>
> input 1.10886e+07 idrops 9.85347e+06 opackets 1.22887e+06 odrops 3384.86
>
> With cpuset assigning interrupts across cores on the first physical
> processor:
>
> input 1.14046e+07 idrops 8.6674e+06 opackets 2.73365e+06 odrops 2420.75
>
> With cpuset assigning interrupts across cores on the second physical
> processor:
>
> input 1.16746e+07 idrops 8.96412e+06 opackets 2.70652e+06 odrops 3076.52
>
> I will follow this up with both cards in PCIe slots physically
> connected to the first CPU, but as a rule-of-thumb comparison, with the
> interrupts cpuset appropriately, that configuration was usually about
> 10-15% higher than cpuset-one-processor-low and
> cpuset-one-processor-high.
>
> Conclusion:
>
> The best solution for highest performance is still to avoid QPI as much
> as possible, through appropriate physical placement of the PCIe cards.
> However, in cases where that may not be possible or desirable, using
> cpuset to assign all the interrupt affinity to one processor will help
> mitigate the performance loss.
>
> Credits:
>
> Thanks to Dell for the loan of the R820 used for testing; thanks to
> Chelsio for the loan of the two T580-CR cards; and thanks to the
> cxgbe(4) maintainer, Navdeep Parhar, for his assistance and patience
> during debugging and testing.
>
> Feedback is always welcome.
>
> I can provide detailed results upon request.
>
> The scripts were provided by a vendor; I need to get their permission
> to redistribute/publish them, but I do not think that's a problem.
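To reproduce the per-test summaries, the netstat capture and the awk
one-liner quoted above combine roughly like this (the file name is
illustrative; substitute one capture per test):

# capture 100 one-second samples during a run
netstat -w 1 -q 100 -d > test3.cpuset-1cpu-low

# then average the columns -- same one-liner as quoted above
# (netstat's header lines also count toward NR, so the averages
# run slightly low; that matches the quoted numbers)
awk '{ipackets+=$1} {idrops+=$3} {opackets+=$5} {odrops+=$9} END {print
"input " ipackets/NR, "idrops " idrops/NR, "opackets " opackets/NR,
"odrops " odrops/NR}' test3.cpuset-1cpu-low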