Date:      Mon, 14 Jul 2014 12:03:54 -0700
From:      Navdeep Parhar <nparhar@gmail.com>
To:        John Jasem <jjasen@gmail.com>, FreeBSD Net <freebsd-net@freebsd.org>
Subject:   Re: tuning routing using cxgbe and T580-CR cards?
Message-ID:  <53C4299A.3000900@gmail.com>
In-Reply-To: <53C3EFDC.2030100@gmail.com>
References:  <53C01EB5.6090701@gmail.com> <53C03BB4.2090203@gmail.com> <53C3EFDC.2030100@gmail.com>

Use UDP if you want more control over your experiments.
- It's easier to directly control the frame size on the wire.  No TSO,
  LRO, or segmentation to worry about.
- UDP has no flow control so the transmitters will not let up even if a
  frame goes missing.  TCP will go into recovery.  Lack of protocol
  level flow control also means the transmitters cannot be influenced by
  the receivers in any way.
- Frames go only in the direction you want them to.  With TCP you have
  the receiver transmitting all the time too (ACKs).
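
For example, a UDP run along those lines might look something like this
(the server address, stream count, duration, and payload size here are
purely illustrative, not taken from your setup):

    # 4 parallel UDP streams, ~1470-byte datagrams so each one fits in
    # a single 1500-byte MTU frame; -b 0 tells iperf3 not to pace the
    # sender, i.e. transmit as fast as it can
    iperf3 -u -b 0 -l 1470 -P 4 -t 60 -c 10.0.1.1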

Regards,
Navdeep

On 07/14/14 07:57, John Jasem wrote:
> The two physical CPUs are: 
> Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz (2400.05-MHz K8-class CPU)
> 
> Hyperthreading, at least from initial appearances, seems to offer no
> benefits or drawbacks.
> 
> I tested iperf3, using a packet generator on each subnet, each sending 4
> streams to a server on another subnet.
> 
> Maximum segment sizes of 128 and 1460 were used (iperf3 -M), with
> little variance in the results.
> 
> A snapshot of netstat -d -b -w1 -W -h is included. Midway through, the
> numbers dropped; this was when I launched 16 more streams: 4 new
> clients, 4 new servers on different nets, 4 streams each.
> 
>             input        (Total)           output
>    packets  errs idrops      bytes    packets  errs      bytes colls drops
>       1.6M     0   514       254M       1.6M     0       252M     0     5
>       1.6M     0   294       244M       1.6M     0       246M     0     6
>       1.6M     0    95       255M       1.5M     0       236M     0     6
>       1.4M     0     0       216M       1.5M     0       224M     0     3
>       1.5M     0     0       225M       1.4M     0       219M     0     4
>       1.4M     0   389       214M       1.4M     0       216M     0     1
>       1.4M     0   270       207M       1.4M     0       207M     0     1
>       1.4M     0   279       210M       1.4M     0       209M     0     2
>       1.4M     0    12       207M       1.3M     0       204M     0     1
>       1.4M     0   303       206M       1.4M     0       214M     0     2
>       1.3M     0  2.3K       190M       1.4M     0       212M     0     1
>       1.1M     0  1.1K       175M       1.1M     0       176M     0     1
>       1.1M     0  1.6K       176M       1.1M     0       175M     0     1
>       1.1M     0   830       176M       1.1M     0       174M     0     0
>       1.2M     0  1.5K       187M       1.2M     0       187M     0     0
>       1.2M     0  1.1K       183M       1.2M     0       184M     0     1
>       1.2M     0  1.5K       197M       1.2M     0       196M     0     2
>       1.3M     0  2.2K       199M       1.2M     0       196M     0     0
>       1.3M     0  2.8K       200M       1.3M     0       202M     0     4
>       1.3M     0  1.5K       199M       1.2M     0       198M     0     1
> 
> 
> vmstat output is also included. You can see similar drops under the
> faults columns.
> 
> 
>  procs      memory      page                    disks     faults         cpu
>  r b w     avm    fre   flt  re  pi  po    fr  sr mf0 cd0   in   sy   cs us sy id
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 188799  224 387419  0 74 26
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 207447  150 425576  0 72 28
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 205638  202 421659  0 75 25
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 200292  150 411257  0 74 26
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 200338  197 411537  0 77 23
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 199289  156 409092  0 75 25
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 200504  200 411992  0 76 24
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 165042  152 341207  0 78 22
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 171360  200 353776  0 78 22
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 197557  150 405937  0 74 26
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 170696  204 353197  0 78 22
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 174927  150 361171  0 77 23
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 153836  200 319227  0 79 21
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 159056  150 329517  0 78 22
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 155240  200 321819  0 78 22
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 166422  156 344184  0 78 22
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 162065  200 335215  0 79 21
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 172857  150 356852  0 78 22
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 81267  197 176539  0 92  8
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 82151  150 177434  0 91  9
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 73904  204 160887  0 91  9
>  0 0 0    574M    15G     2   0   0   0     8   6   0   0 73820  150 161201  0 91  9
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 73926  196 161850  0 92  8
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 77215  150 166886  0 91  9
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 77509  198 169650  0 91  9
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 69993  156 154783  0 90 10
>  0 0 0    574M    15G    82   0   0   0     0   6   0   0 69722  199 153525  0 91  9
>  0 0 0    574M    15G     2   0   0   0     0   6   0   0 66353  150 147027  0 91  9
>  0 0 0    550M    15G   102   0   0   0   101   6   0   0 67906  259 149365  0 90 10
>  0 0 0    550M    15G     0   0   0   0     0   6   0   0 71837  125 157253  0 92  8
>  0 0 0    550M    15G    80   0   0   0     0   6   0   0 73508  179 161498  0 92  8
>  0 0 0    550M    15G     0   0   0   0     0   6   0   0 72673  125 159449  0 92  8
>  0 0 0    550M    15G    80   0   0   0     0   6   0   0 75630  175 164614  0 91  9
> 
> 
> 
> 
> On 07/11/2014 03:32 PM, Navdeep Parhar wrote:
>> On 07/11/14 10:28, John Jasem wrote:
>>> In testing two Chelsio T580-CR dual port cards with FreeBSD 10-STABLE,
>>> I've been able to use a collection of clients to generate approximately
>>> 1.5-1.6 million TCP packets per second sustained, and routinely hit
>>> 10GB/s, both measured by netstat -d -b -w1 -W (I usually use -h for the
>>> quick read, accepting the loss of granularity).
>> When forwarding, the pps rate is often more interesting, and almost
>> always the limiting factor, as compared to the total amount of data
>> being passed around.  10GB at this pps probably means 9000 MTU.  Try
>> with 1500 too if possible.
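>>
>> As a rough sketch (interface names assumed; on a T5 card the ports
>> typically show up as cxl0..cxl3), dropping the MTU for a test run
>> would just be something like:
>>
>>   ifconfig cxl0 mtu 1500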
>>
>> "netstat -d 1" and "vmstat 1" for a few seconds when your system is
>> under maximum load would be useful.  And what kind of CPU is in this system?
>>
>>> While performance has so far been stellar, and I'm honestly speculating
>>> I will need more CPU depth and horsepower to get much faster, I'm
>>> curious if there is any gain to tweaking performance settings. I'm
>>> seeing, under multiple streams, with N targets connecting to N servers,
>>> interrupts on all CPUs peg at 99-100%, and I'm curious if tweaking
>>> configs will help, or it's a free clue to get more horsepower.
>>>
>>> So, far, except for temporarily turning off pflogd, and setting the
>>> following sysctl variables, I've not done any performance tuning on the
>>> system yet.
>>>
>>> /etc/sysctl.conf
>>> net.inet.ip.fastforwarding=1
>>> kern.random.sys.harvest.ethernet=0
>>> kern.random.sys.harvest.point_to_point=0
>>> kern.random.sys.harvest.interrupt=0
>>>
>>> a) One of the first things I did in prior testing was to turn
>>> hyperthreading off. I presume this is still prudent, as HT doesn't help
>>> with interrupt handling?
>> It is always worthwhile to try your workload with and without
>> hyperthreading.
>>
>>> b) I briefly experimented with using cpuset(1) to stick interrupts to
>>> physical CPUs, but it offered no performance enhancements, and indeed,
>>> appeared to decrease performance by 10-20%. Has anyone else tried this?
>>> What were your results?
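>>> (For concreteness, the sort of invocation in question, with the IRQ
>>> number below just a placeholder that would come from vmstat -i, is:
>>>
>>>   cpuset -l 2 -x 266
>>>
>>> i.e. bind the interrupt with IRQ 266 to CPU 2.)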
>>>
>>> c) the defaults for the cxgbe driver appear to be 8 rx queues, and N tx
>>> queues, with N being the number of CPUs detected. For a system running
>>> multiple cards, routing or firewalling, does this make sense, or would
>>> balancing tx and rx queues be better? And would reducing queues per card
>>> based on NUMBER-CPUS and NUM-CHELSIO-PORTS make sense at all?
>> The defaults are nrxq = min(8, ncores) and ntxq = min(16, ncores).  The
>> man page mentions this.  The reason for 8 vs. 16 is that tx queues are
>> "cheaper": they don't have to be backed by rx buffers.  A tx queue only
>> needs some memory for its descriptor ring and some hardware resources.
>>
>> It appears that your system has >= 16 cores.  For forwarding it probably
>> makes sense to have nrxq = ntxq.  If you're left with 8 or fewer cores
>> after disabling hyperthreading you'll automatically get 8 rx and tx
>> queues.  Otherwise you'll have to fiddle with the hw.cxgbe.nrxq10g and
>> ntxq10g tunables (documented in the man page).
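>>
>> As a sketch, the loader.conf entries would look something like this
>> (the queue counts here are only an example, not a recommendation):
>>
>>   hw.cxgbe.nrxq10g="16"
>>   hw.cxgbe.ntxq10g="16"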
>>
>>
>>> d) dev.cxl.$PORT.qsize_rxq: 1024 and dev.cxl.$PORT.qsize_txq: 1024.
>>> These appear to not be writeable when if_cxgbe is loaded, so I speculate
>>> they are not to be messed with, or are loader.conf variables? Is there
>>> any benefit to messing with them?
>> Can't change them after the port has been administratively brought up
>> even once.  This is mentioned in the man page.  I don't really recommend
>> changing them anyway.
>>
>>> e) dev.t5nex.$CARD.toe.sndbuf: 262144. These are writeable, but messing
>>> with values did not yield an immediate benefit. Am I barking up the
>>> wrong tree, trying?
>> The TOE tunables won't make a difference unless you have enabled TOE,
>> the TCP endpoints lie on the system, and the connections are being
>> handled by the TOE on the chip.  This is not the case on your systems.
>> The driver does not enable TOE by default and the only way to use it is
>> to switch it on explicitly.  There is no possibility that you're using
>> it without knowing that you are.
>>
>>> f) based on prior experiments with other vendors, I tried tweaks to
>>> net.isr.* settings, but did not see any benefits worth discussing. Am I
>>> correct in this speculation, based on others' experience?
>>>
>>> g) Are there other settings I should be looking at, that may squeeze out
>>> a few more packets?
>> The pps rates that you've observed are at least an order of magnitude
>> below the chip's hardware limits.  Tuning the kernel rather than the
>> driver may be the best bang for your buck.
>>
>> Regards,
>> Navdeep
>>
>>> Thanks in advance!
>>>
>>> -- John Jasen (jjasen@gmail.com)
> 



