Date:      Sun, 27 Jul 2014 21:42:17 +0100
From:      "George Neville-Neil" <gnn@neville-neil.com>
To:        "Adrian Chadd" <adrian@freebsd.org>
Cc:        FreeBSD Net <freebsd-net@freebsd.org>, Navdeep Parhar <nparhar@gmail.com>, John Jasen <jjasen@gmail.com>
Subject:   Re: fastforward/routing: a 3 million packet-per-second system?
Message-ID:  <83597B15-63B3-4AD7-A458-00B67C9E5396@neville-neil.com>
In-Reply-To: <CAJ-Vmokje1m-LGm6B9M9t5Q4BW8JcVWbkDXyKMEVzVa+8reDBw@mail.gmail.com>
References:  <53CE80DD.9090109@gmail.com> <CAJ-VmomWpc=3dtasbDhhrUpGywPio3_9W2b-RTAeJjq3nahhOQ@mail.gmail.com> <53CEB090.7030701@gmail.com> <CAJ-Vmok8eu-GhaNa+i+BLv1ZLtKQt4yNfU7ZXW3H+Y=2HFj=1w@mail.gmail.com> <53CEB670.9060600@gmail.com> <CAJ-VmonhCg9TvQArtP51rAUjFSe4FpFL8SNCTS6jNwk_Esk+EA@mail.gmail.com> <53CEB9B5.7020609@gmail.com> <CAJ-Vmokje1m-LGm6B9M9t5Q4BW8JcVWbkDXyKMEVzVa+8reDBw@mail.gmail.com>

On 22 Jul 2014, at 20:30, Adrian Chadd wrote:

> hi!
>
> You can use 'pmcstat -S CPU_CLK_UNHALTED_CORE -O pmc.out' (then ctrl-C
> it after say 5 seconds), which will log the data to pmc.out;
> then 'pmcannotate -k /boot/kernel pmc.out /boot/kernel/kernel' to find
> out where the most cpu cycles are being spent.
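>
> (Putting those steps together, a minimal end-to-end sketch -- assuming
> hwpmc is not already loaded and the kernel in /boot/kernel still has its
> symbols, as in a stock install:
>
> kldload hwpmc
> pmcstat -S CPU_CLK_UNHALTED_CORE -O pmc.out   # ctrl-C after ~5 seconds under load
> pmcannotate -k /boot/kernel pmc.out /boot/kernel/kernel | less
> )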
>

Chiming in late, but don't you mean instruction-retired instead of 
CPU_CLK_UNHALTED_CORE?
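
(For what it's worth, a rough sketch of how one might check and use the
instructions-retired event instead; the exact event name varies by CPU, so
treat INSTR_RETIRED_ANY below as a placeholder and use whatever
'pmcstat -L' reports on that machine:

pmcstat -L | grep -i instr
pmcstat -S INSTR_RETIRED_ANY -O pmc-instr.out
)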

Best,
George


> It should give us the location(s) inside the top CPU users.
>
> Hopefully it'll then be much more obvious!
>
> I'm glad you're digging into it!
>
> -a
>
>
>
> On 22 July 2014 12:21, John Jasen <jjasen@gmail.com> wrote:
>> Navdeep;
>>
>> I was struck by spending so much time in transmit as well.
>>
>> Adrian's suggestion on mining lock profiling gave me an excuse to up
>> the tx queues in /boot/loader.conf. Our prior conversations indicated
>> that up to 64 should be acceptable?
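>>
>> (For concreteness, the kind of loader.conf change being discussed -- the
>> value 64 here is only the upper bound mentioned, not a tested setting:
>>
>> # /boot/loader.conf
>> hw.cxgbe.ntxq10g=64
>> )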
>>
>>
>>
>>
>>
>> On 07/22/2014 03:10 PM, Adrian Chadd wrote:
>>> Hi
>>>
>>> Right. Time to figure out why you're spending so much time in
>>> cxgbe_transmit() and t4_eth_tx(). Maybe ask Navdeep for some ideas?
>>>
>>>
>>> -a
>>>
>>> On 22 July 2014 12:07, John Jasen <jjasen@gmail.com> wrote:
>>>> The first is pretty easy to turn around. Reading on dtrace now.
>>>> Thanks for the pointers and help!
>>>>
>>>> PMC: [CPU_CLK_UNHALTED_CORE] Samples: 142654 (100.0%), 124560 unresolved
>>>>
>>>> %SAMP IMAGE      FUNCTION             CALLERS
>>>> 34.0 if_cxgbe.k t4_eth_tx            cxgbe_transmit:24.0 t4_tx_task:10.0
>>>> 28.8 if_cxgbe.k cxgbe_transmit
>>>> 7.6 if_cxgbe.k service_iq           t4_intr
>>>> 6.4 if_cxgbe.k get_scatter_segment  service_iq
>>>> 4.9 if_cxgbe.k reclaim_tx_descs     t4_eth_tx
>>>> 3.2 if_cxgbe.k write_sgl_to_txd     t4_eth_tx
>>>> 2.8 hwpmc.ko   pmclog_process_callc pmc_process_samples
>>>> 2.1 libc.so.7  bcopy                pmclog_read
>>>> 1.9 if_cxgbe.k t4_eth_rx            service_iq
>>>> 1.7 hwpmc.ko   pmclog_reserve       pmclog_process_callchain
>>>> 1.2 libpmc.so. pmclog_read
>>>> 1.0 if_cxgbe.k write_txpkts_wr      t4_eth_tx
>>>> 0.8 kernel     e1000_read_i2c_byte_ e1000_set_i2c_bb
>>>> 0.6 libc.so.7  memset
>>>> 0.5 if_cxgbe.k refill_fl            service_iq
>>>>
>>>>
>>>>
>>>>
>>>> On 07/22/2014 02:45 PM, Adrian Chadd wrote:
>>>>> Hi,
>>>>>
>>>>> Well, start with CPU profiling with pmc:
>>>>>
>>>>> kldload hwpmc
>>>>> pmcstat -S CPU_CLK_UNHALTED_CORE -T -w 1
>>>>>
>>>>> .. look at the FreeBSD dtrace one-liners (google that) for lock
>>>>> contention and CPU usage.
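>>>>>
>>>>> (One such one-liner, as a sketch -- it samples kernel stacks at 997 Hz
>>>>> for 10 seconds and prints them, hottest last; you may need
>>>>> 'kldload dtraceall' first:
>>>>>
>>>>> dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-10s { exit(0); }'
>>>>> )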
>>>>>
>>>>> You could compile with LOCK_PROFILING (which will slow things down a
>>>>> little even when not in use), then enable it for a few seconds (which
>>>>> will definitely slow things down) to gather fine-grained lock
>>>>> contention data.
>>>>>
>>>>> (sysctl debug.lock.prof when you have it compiled in. sysctl
>>>>> debug.lock.prof.enable=1; wait 10 seconds; sysctl
>>>>> debug.lock.prof.enable=0; sysctl debug.lock.prof.stats)
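>>>>>
>>>>> (For the kernel side, a minimal sketch assuming a custom config derived
>>>>> from GENERIC -- the LOCKPROF name is just an example -- built from /usr/src:
>>>>>
>>>>> # sys/amd64/conf/LOCKPROF
>>>>> include GENERIC
>>>>> ident   LOCKPROF
>>>>> options LOCK_PROFILING
>>>>>
>>>>> make buildkernel KERNCONF=LOCKPROF && make installkernel KERNCONF=LOCKPROF
>>>>> )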
>>>>>
>>>>>
>>>>> -a
>>>>>
>>>>>
>>>>> On 22 July 2014 11:42, John Jasen <jjasen@gmail.com> wrote:
>>>>>> If you have ideas on what to instrument, I'll be happy to do it.
>>>>>>
>>>>>> I'm faintly familiar with dtrace, et al, so it might take me a few
>>>>>> tries to get it right -- or bludgeoning with the documentation.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 07/22/2014 02:07 PM, Adrian Chadd wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> Well, what's missing is some dtrace/pmc/lock-debugging investigations
>>>>>>> into the system to see where it's currently maxing out.
>>>>>>>
>>>>>>> I wonder if you're seeing contention on the transmit paths as drivers
>>>>>>> queue frames from one set of driver threads/queues to another
>>>>>>> potentially completely different set of driver transmit
>>>>>>> threads/queues.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -a
>>>>>>>
>>>>>>>
>>>>>>> On 22 July 2014 08:18, John Jasen <jjasen@gmail.com> wrote:
>>>>>>>> Feedback and/or tips and tricks more than welcome.
>>>>>>>>
>>>>>>>> Outstanding questions:
>>>>>>>>
>>>>>>>> Would increasing the number of processor cores help?
>>>>>>>>
>>>>>>>> Would a system where both processor QPI ports connect to each other
>>>>>>>> mitigate QPI bottlenecks?
>>>>>>>>
>>>>>>>> Are there further performance optimizations I am missing?
>>>>>>>>
>>>>>>>> Server Description:
>>>>>>>>
>>>>>>>> The system in question is a Dell PowerEdge R820, 16GB of RAM, and two
>>>>>>>> Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz processors.
>>>>>>>>
>>>>>>>> In a 16x PCIe slot I have one Chelsio T-580-CR two-port 40GbE NIC,
>>>>>>>> and in an 8x slot, another T-580-CR dual port.
>>>>>>>>
>>>>>>>> I am running FreeBSD 10.0-STABLE.
>>>>>>>>
>>>>>>>> BIOS tweaks:
>>>>>>>>
>>>>>>>> Hyperthreading (or Logical Processors) is turned off.
>>>>>>>> Memory Node Interleaving is turned off, but did not appear to impact
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> /boot/loader.conf contents:
>>>>>>>> #for CARP+PF testing
>>>>>>>> carp_load="YES"
>>>>>>>> #load cxgbe drivers.
>>>>>>>> cxgbe_load="YES"
>>>>>>>> #maxthreads appears to not exceed CPU.
>>>>>>>> net.isr.maxthreads=12
>>>>>>>> #bindthreads may be indicated when using cpuset(1) on interrupts
>>>>>>>> net.isr.bindthreads=1
>>>>>>>> #random guess based on googling
>>>>>>>> net.isr.maxqlimit=60480
>>>>>>>> net.link.ifqmaxlen=90000
>>>>>>>> #discussions with cxgbe maintainer and list led me to trying this.
>>>>>>>> #Allows more interrupts to be fixed to CPUs, which in some cases
>>>>>>>> #improves interrupt balancing.
>>>>>>>> hw.cxgbe.ntxq10g=16
>>>>>>>> hw.cxgbe.nrxq10g=16
>>>>>>>>
>>>>>>>> /etc/sysctl.conf contents:
>>>>>>>>
>>>>>>>> #the following is also enabled by rc.conf gateway_enable.
>>>>>>>> net.inet.ip.fastforwarding=1
>>>>>>>> #recommendations from BSD router project
>>>>>>>> kern.random.sys.harvest.ethernet=0
>>>>>>>> kern.random.sys.harvest.point_to_point=0
>>>>>>>> kern.random.sys.harvest.interrupt=0
>>>>>>>> #probably should be removed, as cxgbe does not seem to affect/be
>>>>>>>> #affected by irq storm settings
>>>>>>>> hw.intr_storm_threshold=25000000
>>>>>>>> #based on Calomel.org performance suggestions. With 4x40GbE it seemed
>>>>>>>> #reasonable to use 100GbE settings
>>>>>>>> kern.ipc.maxsockbuf=1258291200
>>>>>>>> net.inet.tcp.recvbuf_max=1258291200
>>>>>>>> net.inet.tcp.sendbuf_max=1258291200
>>>>>>>> #attempting to play with ULE scheduler, making it serve packets
>>>>>>>> #versus netstat
>>>>>>>> kern.sched.slice=1
>>>>>>>> kern.sched.interact=1
>>>>>>>>
>>>>>>>> /etc/rc.conf contains:
>>>>>>>>
>>>>>>>> hostname="fbge1"
>>>>>>>> #should remove, especially given below duplicate entry
>>>>>>>> ifconfig_igb0="DHCP"
>>>>>>>> sshd_enable="YES"
>>>>>>>> #ntpd_enable="YES"
>>>>>>>> # Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
>>>>>>>> dumpdev="AUTO"
>>>>>>>> # OpenBSD PF options to play with later. Very bad for raw packet rates.
>>>>>>>> #pf_enable="YES"
>>>>>>>> #pflog_enable="YES"
>>>>>>>> # enable packet forwarding
>>>>>>>> # these enable forwarding and fastforwarding sysctls. inet6 does not
>>>>>>>> # have fastforward
>>>>>>>> gateway_enable="YES"
>>>>>>>> ipv6_gateway_enable="YES"
>>>>>>>> # enable OpenBSD ftp-proxy
>>>>>>>> # should comment out until actively playing with PF
>>>>>>>> ftpproxy_enable="YES"
>>>>>>>> #left in place, commented out from prior testing
>>>>>>>> #ifconfig_mlxen1="inet 172.16.2.1 netmask 255.255.255.0 mtu 9000"
>>>>>>>> #ifconfig_mlxen0="inet 172.16.1.1 netmask 255.255.255.0 mtu 9000"
>>>>>>>> #ifconfig_mlxen3="inet 172.16.7.1 netmask 255.255.255.0 mtu 9000"
>>>>>>>> #ifconfig_mlxen2="inet 172.16.8.1 netmask 255.255.255.0 mtu 9000"
>>>>>>>> # -lro and -tso options added per mailing list suggestion from
>>>>>>>> # Bjoern A. Zeeb (bzeeb-lists at lists.zabbadoz.net)
>>>>>>>> ifconfig_cxl0="inet 172.16.3.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>>> ifconfig_cxl1="inet 172.16.4.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>>> ifconfig_cxl2="inet 172.16.5.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>>> ifconfig_cxl3="inet 172.16.6.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>>> # aliases instead of reconfiguring test clients. See above commented
>>>>>>>> # out entries.
>>>>>>>> ifconfig_cxl0_alias0="172.16.7.1 netmask 255.255.255.0"
>>>>>>>> ifconfig_cxl1_alias0="172.16.8.1 netmask 255.255.255.0"
>>>>>>>> ifconfig_cxl2_alias0="172.16.1.1 netmask 255.255.255.0"
>>>>>>>> ifconfig_cxl3_alias0="172.16.2.1 netmask 255.255.255.0"
>>>>>>>> # for remote monitoring/admin of the test device
>>>>>>>> ifconfig_igb0="inet 172.30.60.60 netmask 255.255.0.0"
>>>>>>>>
>>>>>>>> Additional configurations:
>>>>>>>> cpuset-chelsio-6cpu-high
>>>>>>>> #!/usr/local/bin/bash
>>>>>>>> # Original provided by Navdeep Parhar <nparhar@gmail.com>
>>>>>>>> # Takes the vmstat -ai output as a list and assigns the interrupts,
>>>>>>>> # in order, to the available CPU cores.
>>>>>>>> # Modified: to assign only to the 'high' CPUs, i.e. those on core1.
>>>>>>>> # See: http://lists.freebsd.org/pipermail/freebsd-net/2014-July/039317.html
>>>>>>>> ncpu=12
>>>>>>>> irqlist=$(vmstat -ia | egrep 't4nex|t5nex|cxgbc' | cut -f1 -d: | cut -c4-)
>>>>>>>> i=6
>>>>>>>> for irq in $irqlist; do
>>>>>>>>      cpuset -l $i -x $irq
>>>>>>>>      i=$((i+1))
>>>>>>>>      [ $i -ge $ncpu ] && i=6
>>>>>>>> done
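>>>>>>>>
>>>>>>>> # (One way to run it -- an assumption, not from the original setup:
>>>>>>>> # once after boot, when the cxl interfaces and their interrupts exist,
>>>>>>>> # either by hand or from /etc/rc.local:)
>>>>>>>> # bash /path/to/cpuset-chelsio-6cpu-high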
>>>>>>>>
>>>>>>>> Client Description:
>>>>>>>>
>>>>>>>> Two Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz processors
>>>>>>>> 64 GB ram
>>>>>>>> Mellanox Technologies MT27500 Family [ConnectX-3]
>>>>>>>> Centos 6.4 with updates
>>>>>>>> iperf3 installed from yum repositories: iperf3-3.0.3-3.el6.x86_64
>>>>>>>>
>>>>>>>> Test setup:
>>>>>>>>
>>>>>>>> I've found that about 3 streams between the CentOS clients is the
>>>>>>>> best way to get the most out of them.
>>>>>>>> Above certain points, the -b flag does not change results.
>>>>>>>> -N is an artifact from using TCP
>>>>>>>> -l is needed, as -M doesn't work for UDP.
>>>>>>>>
>>>>>>>> I usually use launch scripts similar to the following:
>>>>>>>>
>>>>>>>> for i in `seq 41 60`; do ssh loader$i "export TIME=120; export STREAMS=1; export PORT=52$i; export PKT=64; export RATE=2000m; /root/iperf-test-8port-udp" & done
>>>>>>>>
>>>>>>>> The scripts execute the following on each host.
>>>>>>>>
>>>>>>>> #!/bin/bash
>>>>>>>> PORT1=$PORT
>>>>>>>> PORT2=$(($PORT+1000))
>>>>>>>> PORT3=$(($PORT+2000))
>>>>>>>> iperf3 -c loader41-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
>>>>>>>> iperf3 -c loader42-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
>>>>>>>> iperf3 -c loader43-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
>>>>>>>> ... (through all clients and all three ports) ...
>>>>>>>> iperf3 -c loader60-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT3 &
>>>>>>>>
>>>>>>>>
>>>>>>>> Results:
>>>>>>>>
>>>>>>>> Summarized, netstat -w 1 -q 240 -bd, run through:
>>>>>>>> cat test4-tuning | egrep -v {'packets | input '} | awk '{ipackets+=$1}
>>>>>>>> {idrops+=$3} {opackets+=$5} {odrops+=$9} END {print "input " ipackets/NR,
>>>>>>>> "idrops " idrops/NR, "opackets " opackets/NR, "odrops " odrops/NR}'
>>>>>>>>
>>>>>>>> input 1.10662e+07 idrops 8.01783e+06 opackets 3.04516e+06 odrops 3152.4
>>>>>>>>
>>>>>>>> Snapshot of raw output:
>>>>>>>>
>>>>>>>>             input        (Total)           output
>>>>>>>>  packets  errs   idrops      bytes    packets  errs      bytes colls drops
>>>>>>>> 11189148     0  7462453 1230805216    3725006     0  409750710     0   799
>>>>>>>> 10527505     0  6746901 1158024978    3779096     0  415700708     0   127
>>>>>>>> 10606163     0  6850760 1166676673    3751780     0  412695761     0  1535
>>>>>>>> 10749324     0  7132014 1182425799    3617558     0  397930956     0  5972
>>>>>>>> 10695667     0  7022717 1176521907    3669342     0  403627236     0  1461
>>>>>>>> 10441173     0  6762134 1148528662    3675048     0  404255540     0  6021
>>>>>>>> 10683773     0  7005635 1175215014    3676962     0  404465671     0  2606
>>>>>>>> 10869859     0  7208696 1195683372    3658432     0  402427698     0   979
>>>>>>>> 11948989     0  8310926 1314387881    3633773     0  399714986     0   725
>>>>>>>> 12426195     0  8864415 1366877194    3562311     0  391853156     0  2762
>>>>>>>> 13006059     0  9432389 1430661751    3570067     0  392706552     0  5158
>>>>>>>> 12822243     0  9098871 1410443600    3715177     0  408668500     0  4064
>>>>>>>> 13317864     0  9683602 1464961374    3632156     0  399536131     0  3684
>>>>>>>> 13701905     0 10182562 1507207982    3523101     0  387540859     0  8690
>>>>>>>> 13820227     0 10244870 1520221820    3562038     0  391823322     0  2426
>>>>>>>> 14437060     0 10955483 1588073033    3480105     0  382810557     0  2619
>>>>>>>> 14518471     0 11119573 1597028105    3397439     0  373717355     0  5691
>>>>>>>> 14890287     0 11675003 1637926521    3199812     0  351978304     0 11007
>>>>>>>> 14923610     0 11749091 1641594441    3171436     0  348857468     0  7389
>>>>>>>> 14738704     0 11609730 1621254991    3117715     0  342948394     0  2597
>>>>>>>> 14753975     0 11549735 1622935026    3207393     0  352812846     0  4798
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"


