Date:      Tue, 22 Jul 2014 12:30:27 -0700
From:      Adrian Chadd <adrian@freebsd.org>
To:        John Jasen <jjasen@gmail.com>, FreeBSD Net <freebsd-net@freebsd.org>
Cc:        Navdeep Parhar <nparhar@gmail.com>
Subject:   Re: fastforward/routing: a 3 million packet-per-second system?
Message-ID:  <CAJ-Vmokje1m-LGm6B9M9t5Q4BW8JcVWbkDXyKMEVzVa+8reDBw@mail.gmail.com>
In-Reply-To: <53CEB9B5.7020609@gmail.com>
References:  <53CE80DD.9090109@gmail.com> <CAJ-VmomWpc=3dtasbDhhrUpGywPio3_9W2b-RTAeJjq3nahhOQ@mail.gmail.com> <53CEB090.7030701@gmail.com> <CAJ-Vmok8eu-GhaNa+i+BLv1ZLtKQt4yNfU7ZXW3H+Y=2HFj=1w@mail.gmail.com> <53CEB670.9060600@gmail.com> <CAJ-VmonhCg9TvQArtP51rAUjFSe4FpFL8SNCTS6jNwk_Esk+EA@mail.gmail.com> <53CEB9B5.7020609@gmail.com>

hi!

You can use 'pmcstat -S CPU_CLK_UNHALTED_CORE -O pmc.out' (then Ctrl-C
it after, say, 5 seconds), which will log the sample data to pmc.out;
then run 'pmcannotate -k /boot/kernel pmc.out /boot/kernel/kernel' to
find out where the most CPU cycles are being spent.
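
For example, the whole sequence is roughly this (a sketch; adjust the
sampling window to taste):

kldload hwpmc
pmcstat -S CPU_CLK_UNHALTED_CORE -O pmc.out
# ... let it run for ~5 seconds, then Ctrl-C ...
pmcannotate -k /boot/kernel pmc.out /boot/kernel/kernel | less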

That should give us the exact location(s) inside the top CPU consumers.

Hopefully it'll then be much more obvious!

I'm glad you're digging into it!

-a



On 22 July 2014 12:21, John Jasen <jjasen@gmail.com> wrote:
> Navdeep;
>
> I was struck by how much time is being spent in transmit as well.
>
> Adrian's suggestion about mining the lock-profiling data gave me an
> excuse to bump up the number of tx queues in /boot/loader.conf. Our
> prior conversations indicated that up to 64 should be acceptable?
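>
> For reference, the bump itself would just be something like this in
> loader.conf (hypothetical value, mirroring the existing 10g tunables
> below):
>
> hw.cxgbe.ntxq10g=64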
>
>
>
>
>
> On 07/22/2014 03:10 PM, Adrian Chadd wrote:
>> Hi
>>
>> Right. Time to figure out why you're spending so much time in
>> cxgbe_transmit() and t4_eth_tx(). Maybe ask Navdeep for some ideas?
>>
>>
>> -a
>>
>> On 22 July 2014 12:07, John Jasen <jjasen@gmail.com> wrote:
>>> The first is pretty easy to turn around. I'm reading up on dtrace now.
>>> Thanks for the pointers and help!
>>>
>>> PMC: [CPU_CLK_UNHALTED_CORE] Samples: 142654 (100.0%) , 124560 unresolved
>>>
>>> %SAMP IMAGE      FUNCTION             CALLERS
>>>  34.0 if_cxgbe.k t4_eth_tx            cxgbe_transmit:24.0 t4_tx_task:10.0
>>>  28.8 if_cxgbe.k cxgbe_transmit
>>>   7.6 if_cxgbe.k service_iq           t4_intr
>>>   6.4 if_cxgbe.k get_scatter_segment  service_iq
>>>   4.9 if_cxgbe.k reclaim_tx_descs     t4_eth_tx
>>>   3.2 if_cxgbe.k write_sgl_to_txd     t4_eth_tx
>>>   2.8 hwpmc.ko   pmclog_process_callc pmc_process_samples
>>>   2.1 libc.so.7  bcopy                pmclog_read
>>>   1.9 if_cxgbe.k t4_eth_rx            service_iq
>>>   1.7 hwpmc.ko   pmclog_reserve       pmclog_process_callchain
>>>   1.2 libpmc.so. pmclog_read
>>>   1.0 if_cxgbe.k write_txpkts_wr      t4_eth_tx
>>>   0.8 kernel     e1000_read_i2c_byte_ e1000_set_i2c_bb
>>>   0.6 libc.so.7  memset
>>>   0.5 if_cxgbe.k refill_fl            service_iq
>>>
>>>
>>>
>>>
>>> On 07/22/2014 02:45 PM, Adrian Chadd wrote:
>>>> Hi,
>>>>
>>>> Well, start with CPU profiling with pmc:
>>>>
>>>> kldload hwpmc
>>>> pmcstat -S CPU_CLK_UNHALTED_CORE -T -w 1
>>>>
>>>> .. look at the FreeBSD dtrace one-liners (google that) for lock
>>>> contention and CPU usage.
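>>>>
>>>> For example, one of the standard kernel on-CPU one-liners (a sketch;
>>>> samples at 997 Hz and exits after 10 seconds):
>>>>
>>>> dtrace -n 'profile-997 /arg0/ { @[func(arg0)] = count(); } tick-10s { exit(0); }'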
>>>>
>>>> You could compile the kernel with LOCK_PROFILING (which will slow
>>>> things down a little even when not in use) and then enable it for a few
>>>> seconds (which will definitely slow things down) to gather fine-grained
>>>> lock contention data.
>>>>
>>>> (The knobs appear under sysctl debug.lock.prof once it's compiled in:
>>>> sysctl debug.lock.prof.enable=1; wait 10 seconds; sysctl
>>>> debug.lock.prof.enable=0; then sysctl debug.lock.prof.stats to dump
>>>> the results.)
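>>>>
>>>> (The kernel side of that is just an
>>>>
>>>> options LOCK_PROFILING
>>>>
>>>> line in the kernel config, plus a rebuild and reinstall.)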
>>>>
>>>>
>>>> -a
>>>>
>>>>
>>>> On 22 July 2014 11:42, John Jasen <jjasen@gmail.com> wrote:
>>>>> If you have ideas on what to instrument, I'll be happy to do it.
>>>>>
>>>>> I'm only faintly familiar with dtrace et al., so it might take me a few
>>>>> tries -- or some bludgeoning of the documentation -- to get it right.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 07/22/2014 02:07 PM, Adrian Chadd wrote:
>>>>>> Hi!
>>>>>>
>>>>>> Well, what's missing is some dtrace/pmc/lock-debugging investigation
>>>>>> into the system to see where it's currently maxing out.
>>>>>>
>>>>>> I wonder if you're seeing contention on the transmit paths as drivers
>>>>>> queue frames from one set of driver threads/queues to another
>>>>>> potentially completely different set of driver transmit
>>>>>> threads/queues.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -a
>>>>>>
>>>>>>
>>>>>> On 22 July 2014 08:18, John Jasen <jjasen@gmail.com> wrote:
>>>>>>> Feedback and/or tips and tricks are more than welcome.
>>>>>>>
>>>>>>> Outstanding questions:
>>>>>>>
>>>>>>> Would increasing the number of processor cores help?
>>>>>>>
>>>>>>> Would a system where both processor QPI ports connect to each other
>>>>>>> mitigate QPI bottlenecks?
>>>>>>>
>>>>>>> Are there further performance optimizations I am missing?
>>>>>>>
>>>>>>> Server Description:
>>>>>>>
>>>>>>> The system in question is a Dell Poweredge R820, 16GB of RAM, and two
>>>>>>> Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz.
>>>>>>>
>>>>>>> In a PCIe x16 slot I have one Chelsio T-580-CR two-port 40GbE NIC,
>>>>>>> and in an x8 slot, another dual-port T-580-CR.
>>>>>>>
>>>>>>> I am running FreeBSD 10.0-STABLE.
>>>>>>>
>>>>>>> BIOS tweaks:
>>>>>>>
>>>>>>> Hyperthreading (or Logical Processors) is turned off.
>>>>>>> Memory Node Interleaving is turned off, though toggling it did not
>>>>>>> appear to impact performance.
>>>>>>>
>>>>>>> /boot/loader.conf contents:
>>>>>>> #for CARP+PF testing
>>>>>>> carp_load="YES"
>>>>>>> #load cxgbe drivers.
>>>>>>> cxgbe_load="YES"
>>>>>>> #maxthreads does not appear to be allowed to exceed the CPU count.
>>>>>>> net.isr.maxthreads=12
>>>>>>> #bindthreads may be indicated when using cpuset(1) on interrupts
>>>>>>> net.isr.bindthreads=1
>>>>>>> #random guess based on googling
>>>>>>> net.isr.maxqlimit=60480
>>>>>>> net.link.ifqmaxlen=90000
>>>>>>> #discussions with the cxgbe maintainer and the list led me to try this.
>>>>>>> #Allows more interrupts to be pinned to CPUs, which in some cases
>>>>>>> #improves interrupt balancing.
>>>>>>> hw.cxgbe.ntxq10g=16
>>>>>>> hw.cxgbe.nrxq10g=16
>>>>>>>
>>>>>>> /etc/sysctl.conf contents:
>>>>>>>
>>>>>>> #the following is also enabled by rc.conf gateway_enable.
>>>>>>> net.inet.ip.fastforwarding=1
>>>>>>> #recommendations from BSD router project
>>>>>>> kern.random.sys.harvest.ethernet=0
>>>>>>> kern.random.sys.harvest.point_to_point=0
>>>>>>> kern.random.sys.harvest.interrupt=0
>>>>>>> #probably should be removed, as cxgbe does not seem to affect/be
>>>>>>> #affected by irq storm settings
>>>>>>> hw.intr_storm_threshold=25000000
>>>>>>> #based on Calomel.org performance suggestions. With 4x40GbE it seemed
>>>>>>> #reasonable to use the 100GbE settings
>>>>>>> kern.ipc.maxsockbuf=1258291200
>>>>>>> net.inet.tcp.recvbuf_max=1258291200
>>>>>>> net.inet.tcp.sendbuf_max=1258291200
>>>>>>> #attempting to play with the ULE scheduler, making it serve packets
>>>>>>> #versus netstat
>>>>>>> kern.sched.slice=1
>>>>>>> kern.sched.interact=1
>>>>>>>
>>>>>>> /etc/rc.conf contains:
>>>>>>>
>>>>>>> hostname="fbge1"
>>>>>>> #should remove, especially given the duplicate entry below
>>>>>>> ifconfig_igb0="DHCP"
>>>>>>> sshd_enable="YES"
>>>>>>> #ntpd_enable="YES"
>>>>>>> # Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
>>>>>>> dumpdev="AUTO"
>>>>>>> # OpenBSD PF options to play with later. Very bad for raw packet rates.
>>>>>>> #pf_enable="YES"
>>>>>>> #pflog_enable="YES"
>>>>>>> # enable packet forwarding
>>>>>>> # these enable the forwarding and fastforwarding sysctls. inet6 does
>>>>>>> # not have fastforward
>>>>>>> gateway_enable="YES"
>>>>>>> ipv6_gateway_enable="YES"
>>>>>>> # enable OpenBSD ftp-proxy
>>>>>>> # should comment out until actively playing with PF
>>>>>>> ftpproxy_enable="YES"
>>>>>>> #left in place, commented out from prior testing
>>>>>>> #ifconfig_mlxen1="inet 172.16.2.1 netmask 255.255.255.0 mtu 9000"
>>>>>>> #ifconfig_mlxen0="inet 172.16.1.1 netmask 255.255.255.0 mtu 9000"
>>>>>>> #ifconfig_mlxen3="inet 172.16.7.1 netmask 255.255.255.0 mtu 9000"
>>>>>>> #ifconfig_mlxen2="inet 172.16.8.1 netmask 255.255.255.0 mtu 9000"
>>>>>>> # -lro and -tso options added per mailing list suggestion from Bjoern
>>>>>>> # A. Zeeb (bzeeb-lists at lists.zabbadoz.net)
>>>>>>> ifconfig_cxl0="inet 172.16.3.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>> ifconfig_cxl1="inet 172.16.4.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>> ifconfig_cxl2="inet 172.16.5.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>> ifconfig_cxl3="inet 172.16.6.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>> # aliases instead of reconfiguring test clients. See the commented-out
>>>>>>> # entries above
>>>>>>> ifconfig_cxl0_alias0="172.16.7.1 netmask 255.255.255.0"
>>>>>>> ifconfig_cxl1_alias0="172.16.8.1 netmask 255.255.255.0"
>>>>>>> ifconfig_cxl2_alias0="172.16.1.1 netmask 255.255.255.0"
>>>>>>> ifconfig_cxl3_alias0="172.16.2.1 netmask 255.255.255.0"
>>>>>>> # for remote monitoring/admin of the test device
>>>>>>> ifconfig_igb0="inet 172.30.60.60 netmask 255.255.0.0"
>>>>>>>
>>>>>>> Additional configurations:
>>>>>>> cpuset-chelsio-6cpu-high:
>>>>>>> #!/usr/local/bin/bash
>>>>>>> # Original provided by Navdeep Parhar <nparhar@gmail.com>
>>>>>>> # Takes the vmstat -ai output as a list and assigns the interrupts in
>>>>>>> # order to the available CPU cores.
>>>>>>> # Modified to assign only to the 'high' CPUs, i.e. on core1.
>>>>>>> # See: http://lists.freebsd.org/pipermail/freebsd-net/2014-July/039317.html
>>>>>>> ncpu=12
>>>>>>> irqlist=$(vmstat -ia | egrep 't4nex|t5nex|cxgbc' | cut -f1 -d: | cut -c4-)
>>>>>>> i=6
>>>>>>> for irq in $irqlist; do
>>>>>>>         cpuset -l $i -x $irq
>>>>>>>         i=$((i+1))
>>>>>>>         [ $i -ge $ncpu ] && i=6
>>>>>>> done
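>>>>>>>
>>>>>>> (Run after the interfaces are up; the pinning can be verified with
>>>>>>> "vmstat -ai" and "cpuset -g -x <irq>".)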
>>>>>>>
>>>>>>> Client Description:
>>>>>>>
>>>>>>> Two Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz processors
>>>>>>> 64 GB ram
>>>>>>> Mellanox Technologies MT27500 Family [ConnectX-3]
>>>>>>> CentOS 6.4 with updates
>>>>>>> iperf3 installed from yum repositories: iperf3-3.0.3-3.el6.x86_64
>>>>>>>
>>>>>>> Test setup:
>>>>>>>
>>>>>>> I've found that about 3 streams between the CentOS clients is the best
>>>>>>> way to get the most out of them.
>>>>>>> Above a certain point, the -b flag does not change the results.
>>>>>>> -N is an artifact from using TCP.
>>>>>>> -l is needed, as -M doesn't work for UDP.
>>>>>>>
>>>>>>> I usually use launch scripts similar to the following:
>>>>>>>
>>>>>>>  for i in `seq 41 60`; do ssh loader$i "export TIME=120; export
>>>>>>> STREAMS=1; export PORT=52$i; export PKT=64; export RATE=2000m;
>>>>>>> /root/iperf-test-8port-udp" & done
>>>>>>>
>>>>>>> The scripts execute the following on each host.
>>>>>>>
>>>>>>> #!/bin/bash
>>>>>>> PORT1=$PORT
>>>>>>> PORT2=$(($PORT+1000))
>>>>>>> PORT3=$(($PORT+2000))
>>>>>>> iperf3 -c loader41-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
>>>>>>> iperf3 -c loader42-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
>>>>>>> iperf3 -c loader43-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
>>>>>>> ... (through all clients and all three ports) ...
>>>>>>> iperf3 -c loader60-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT3 &
>>>>>>>
>>>>>>>
>>>>>>> Results:
>>>>>>>
>>>>>>> Summarized: the output of 'netstat -w 1 -q 240 -bd', run through:
>>>>>>> cat test4-tuning | egrep -v 'packets | input ' | awk '{ipackets+=$1}
>>>>>>> {idrops+=$3} {opackets+=$5} {odrops+=$9} END {print "input "
>>>>>>> ipackets/NR, "idrops " idrops/NR, "opackets " opackets/NR, "odrops "
>>>>>>> odrops/NR}'
>>>>>>>
>>>>>>> input 1.10662e+07 idrops 8.01783e+06 opackets 3.04516e+06 odrops 3152.4
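>>>>>>>
>>>>>>> (That is roughly 11.07M pps in, of which about 8.02M pps are dropped
>>>>>>> on input and about 3.05M pps get forwarded out.)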
>>>>>>>
>>>>>>> Snapshot of raw output:
>>>>>>>
>>>>>>>            input        (Total)           output
>>>>>>>    packets  errs idrops      bytes    packets  errs      bytes colls drops
>>>>>>>   11189148     0 7462453 1230805216    3725006     0  409750710     0   799
>>>>>>>   10527505     0 6746901 1158024978    3779096     0  415700708     0   127
>>>>>>>   10606163     0 6850760 1166676673    3751780     0  412695761     0  1535
>>>>>>>   10749324     0 7132014 1182425799    3617558     0  397930956     0  5972
>>>>>>>   10695667     0 7022717 1176521907    3669342     0  403627236     0  1461
>>>>>>>   10441173     0 6762134 1148528662    3675048     0  404255540     0  6021
>>>>>>>   10683773     0 7005635 1175215014    3676962     0  404465671     0  2606
>>>>>>>   10869859     0 7208696 1195683372    3658432     0  402427698     0   979
>>>>>>>   11948989     0 8310926 1314387881    3633773     0  399714986     0   725
>>>>>>>   12426195     0 8864415 1366877194    3562311     0  391853156     0  2762
>>>>>>>   13006059     0 9432389 1430661751    3570067     0  392706552     0  5158
>>>>>>>   12822243     0 9098871 1410443600    3715177     0  408668500     0  4064
>>>>>>>   13317864     0 9683602 1464961374    3632156     0  399536131     0  3684
>>>>>>>   13701905     0 10182562 1507207982    3523101     0  387540859     0  8690
>>>>>>>   13820227     0 10244870 1520221820    3562038     0  391823322     0  2426
>>>>>>>   14437060     0 10955483 1588073033    3480105     0  382810557     0  2619
>>>>>>>   14518471     0 11119573 1597028105    3397439     0  373717355     0  5691
>>>>>>>   14890287     0 11675003 1637926521    3199812     0  351978304     0 11007
>>>>>>>   14923610     0 11749091 1641594441    3171436     0  348857468     0  7389
>>>>>>>   14738704     0 11609730 1621254991    3117715     0  342948394     0  2597
>>>>>>>   14753975     0 11549735 1622935026    3207393     0  352812846     0  4798
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> freebsd-net@freebsd.org mailing list
>>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>>>> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>


