Date: Sun, 27 Jul 2014 21:42:17 +0100
From: "George Neville-Neil" <gnn@neville-neil.com>
To: "Adrian Chadd" <adrian@freebsd.org>
Cc: FreeBSD Net <freebsd-net@freebsd.org>, Navdeep Parhar <nparhar@gmail.com>, John Jasen <jjasen@gmail.com>
Subject: Re: fastforward/routing: a 3 million packet-per-second system?
Message-ID: <83597B15-63B3-4AD7-A458-00B67C9E5396@neville-neil.com>
In-Reply-To: <CAJ-Vmokje1m-LGm6B9M9t5Q4BW8JcVWbkDXyKMEVzVa%2B8reDBw@mail.gmail.com>
References: <53CE80DD.9090109@gmail.com> <CAJ-VmomWpc=3dtasbDhhrUpGywPio3_9W2b-RTAeJjq3nahhOQ@mail.gmail.com> <53CEB090.7030701@gmail.com> <CAJ-Vmok8eu-GhaNa%2Bi%2BBLv1ZLtKQt4yNfU7ZXW3H%2BY=2HFj=1w@mail.gmail.com> <53CEB670.9060600@gmail.com> <CAJ-VmonhCg9TvQArtP51rAUjFSe4FpFL8SNCTS6jNwk_Esk%2BEA@mail.gmail.com> <53CEB9B5.7020609@gmail.com> <CAJ-Vmokje1m-LGm6B9M9t5Q4BW8JcVWbkDXyKMEVzVa%2B8reDBw@mail.gmail.com>
On 22 Jul 2014, at 20:30, Adrian Chadd wrote:

> hi!
>
> You can use 'pmcstat -S CPU_CLK_UNHALTED_CORE -O pmc.out' (then ctrl-C
> it after say 5 seconds), which will log the data to pmc.out;
> then 'pmcannotate -k /boot/kernel pmc.out /boot/kernel/kernel' to find
> out where the most cpu cycles are being spent.
>

Chiming in late, but don't you mean instruction-retired instead of
CPU_CLK_UNHALTED_CORE?

Best,
George

> It should give us the location(s) inside the top CPU users.
>
> Hopefully it'll then be much more obvious!
>
> I'm glad you're digging into it!
>
> -a
>
>
> On 22 July 2014 12:21, John Jasen <jjasen@gmail.com> wrote:
>> Navdeep;
>>
>> I was struck by spending so much time in transmit as well.
>>
>> Adrian's suggestion on mining lock profiling gave me an excuse to up the
>> tx queues in /boot/loader.conf. Our prior conversations indicated that
>> up to 64 should be acceptable?
>>
>>
>> On 07/22/2014 03:10 PM, Adrian Chadd wrote:
>>> Hi
>>>
>>> Right. Time to figure out why you're spending so much time in
>>> cxgbe_transmit() and t4_eth_tx(). Maybe ask Navdeep for some ideas?
>>>
>>>
>>> -a
>>>
>>> On 22 July 2014 12:07, John Jasen <jjasen@gmail.com> wrote:
>>>> The first is pretty easy to turn around. Reading on dtrace now. Thanks
>>>> for the pointers and help!
>>>>
>>>> PMC: [CPU_CLK_UNHALTED_CORE] Samples: 142654 (100.0%) , 124560 unresolved
>>>>
>>>> %SAMP IMAGE      FUNCTION             CALLERS
>>>>  34.0 if_cxgbe.k t4_eth_tx            cxgbe_transmit:24.0 t4_tx_task:10.0
>>>>  28.8 if_cxgbe.k cxgbe_transmit
>>>>   7.6 if_cxgbe.k service_iq           t4_intr
>>>>   6.4 if_cxgbe.k get_scatter_segment  service_iq
>>>>   4.9 if_cxgbe.k reclaim_tx_descs     t4_eth_tx
>>>>   3.2 if_cxgbe.k write_sgl_to_txd     t4_eth_tx
>>>>   2.8 hwpmc.ko   pmclog_process_callc pmc_process_samples
>>>>   2.1 libc.so.7  bcopy                pmclog_read
>>>>   1.9 if_cxgbe.k t4_eth_rx            service_iq
>>>>   1.7 hwpmc.ko   pmclog_reserve       pmclog_process_callchain
>>>>   1.2 libpmc.so. pmclog_read
>>>>   1.0 if_cxgbe.k write_txpkts_wr      t4_eth_tx
>>>>   0.8 kernel     e1000_read_i2c_byte_ e1000_set_i2c_bb
>>>>   0.6 libc.so.7  memset
>>>>   0.5 if_cxgbe.k refill_fl            service_iq
>>>>
>>>>
>>>> On 07/22/2014 02:45 PM, Adrian Chadd wrote:
>>>>> Hi,
>>>>>
>>>>> Well, start with CPU profiling with pmc:
>>>>>
>>>>> kldload hwpmc
>>>>> pmcstat -S CPU_CLK_UNHALTED_CORE -T -w 1
>>>>>
>>>>> .. look at the freebsd dtrace onliners (google that) for lock
>>>>> contention and CPU usage.
>>>>>
>>>>> You could compile with LOCK_PROFILING (which will slow things down a
>>>>> little even when not in use) then enable it for a few seconds (which
>>>>> will definitely slow things down) to gather fine grained lock
>>>>> contention data.
>>>>>
>>>>> (sysctl debug.lock.prof when you have it compiled in. sysctl
>>>>> debug.lock.prof.enable=1; wait 10 seconds; sysctl
>>>>> debug.lock.prof.enable=0; sysctl debug.lock.prof.stats)
>>>>>
>>>>>
>>>>> -a
>>>>>
>>>>>
>>>>> On 22 July 2014 11:42, John Jasen <jjasen@gmail.com> wrote:
>>>>>> If you have ideas on what to instrument, I'll be happy to do it.
>>>>>>
>>>>>> I'm faintly familiar with dtrace, et al, so it might take me a few tries
>>>>>> to get it right -- or bludgeoning with the documentation.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>> On 07/22/2014 02:07 PM, Adrian Chadd wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> Well, what's missing is some dtrace/pmc/lockdebugging investigations
>>>>>>> into the system to see where it's currently maxing out at.
>>>>>>>
>>>>>>> I wonder if you're seeing contention on the transmit paths as drivers
>>>>>>> queue frames from one set of driver threads/queues to another
>>>>>>> potentially completely different set of driver transmit
>>>>>>> threads/queues.
>>>>>>>
>>>>>>>
>>>>>>> -a
>>>>>>>
>>>>>>>
>>>>>>> On 22 July 2014 08:18, John Jasen <jjasen@gmail.com> wrote:
>>>>>>>> Feedback and/or tips and tricks more than welcome.
>>>>>>>>
>>>>>>>> Outstanding questions:
>>>>>>>>
>>>>>>>> Would increasing the number of processor cores help?
>>>>>>>>
>>>>>>>> Would a system where both processor QPI ports connect to each other
>>>>>>>> mitigate QPI bottlenecks?
>>>>>>>>
>>>>>>>> Are there further performance optimizations I am missing?
>>>>>>>>
>>>>>>>> Server Description:
>>>>>>>>
>>>>>>>> The system in question is a Dell Poweredge R820, 16GB of RAM, and two
>>>>>>>> Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz.
>>>>>>>>
>>>>>>>> Onboard, in a 16x PCIe slot, I have one Chelsio T-580-CR two-port 40GbE
>>>>>>>> NIC, and in an 8x slot, another T-580-CR dual port.
>>>>>>>>
>>>>>>>> I am running FreeBSD 10.0-STABLE.
>>>>>>>>
>>>>>>>> BIOS tweaks:
>>>>>>>>
>>>>>>>> Hyperthreading (or Logical Processors) is turned off.
>>>>>>>> Memory Node Interleaving is turned off, but did not appear to impact
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> /boot/loader.conf contents:
>>>>>>>> #for CARP+PF testing
>>>>>>>> carp_load="YES"
>>>>>>>> #load cxgbe drivers.
>>>>>>>> cxgbe_load="YES"
>>>>>>>> #maxthreads appears to not exceed CPU.
>>>>>>>> net.isr.maxthreads=12
>>>>>>>> #bindthreads may be indicated when using cpuset(1) on interrupts
>>>>>>>> net.isr.bindthreads=1
>>>>>>>> #random guess based on googling
>>>>>>>> net.isr.maxqlimit=60480
>>>>>>>> net.link.ifqmaxlen=90000
>>>>>>>> #discussions with cxgbe maintainer and list led me to trying this. Allows more interrupts
>>>>>>>> #to be fixed to CPUs, which in some cases, improves interrupt balancing.
>>>>>>>> hw.cxgbe.ntxq10g=16
>>>>>>>> hw.cxgbe.nrxq10g=16
>>>>>>>>
>>>>>>>> /etc/sysctl.conf contents:
>>>>>>>>
>>>>>>>> #the following is also enabled by rc.conf gateway_enable.
>>>>>>>> net.inet.ip.fastforwarding=1
>>>>>>>> #recommendations from BSD router project
>>>>>>>> kern.random.sys.harvest.ethernet=0
>>>>>>>> kern.random.sys.harvest.point_to_point=0
>>>>>>>> kern.random.sys.harvest.interrupt=0
>>>>>>>> #probably should be removed, as cxgbe does not seem to affect/be affected by irq storm settings
>>>>>>>> hw.intr_storm_threshold=25000000
>>>>>>>> #based on Calomel.Org performance suggestions. 4x40GbE, seemed reasonable to use 100GbE settings
>>>>>>>> kern.ipc.maxsockbuf=1258291200
>>>>>>>> net.inet.tcp.recvbuf_max=1258291200
>>>>>>>> net.inet.tcp.sendbuf_max=1258291200
>>>>>>>> #attempting to play with ULE scheduler, making it serve packets versus netstat
>>>>>>>> kern.sched.slice=1
>>>>>>>> kern.sched.interact=1
>>>>>>>>
>>>>>>>> /etc/rc.conf contains:
>>>>>>>>
>>>>>>>> hostname="fbge1"
>>>>>>>> #should remove, especially given below duplicate entry
>>>>>>>> ifconfig_igb0="DHCP"
>>>>>>>> sshd_enable="YES"
>>>>>>>> #ntpd_enable="YES"
>>>>>>>> # Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
>>>>>>>> dumpdev="AUTO"
>>>>>>>> # OpenBSD PF options to play with later. very bad for raw packet rates.
>>>>>>>> #pf_enable="YES"
>>>>>>>> #pflog_enable="YES"
>>>>>>>> # enable packet forwarding
>>>>>>>> # these enable forwarding and fastforwarding sysctls. inet6 does not have fastforward
>>>>>>>> gateway_enable="YES"
>>>>>>>> ipv6_gateway_enable="YES"
>>>>>>>> # enable OpenBSD ftp-proxy
>>>>>>>> # should comment out until actively playing with PF
>>>>>>>> ftpproxy_enable="YES"
>>>>>>>> #left in place, commented out from prior testing
>>>>>>>> #ifconfig_mlxen1="inet 172.16.2.1 netmask 255.255.255.0 mtu 9000"
>>>>>>>> #ifconfig_mlxen0="inet 172.16.1.1 netmask 255.255.255.0 mtu 9000"
>>>>>>>> #ifconfig_mlxen3="inet 172.16.7.1 netmask 255.255.255.0 mtu 9000"
>>>>>>>> #ifconfig_mlxen2="inet 172.16.8.1 netmask 255.255.255.0 mtu 9000"
>>>>>>>> # -lro and -tso options added per mailing list suggestion from Bjoern A. Zeeb (bzeeb-lists at lists.zabbadoz.net)
>>>>>>>> ifconfig_cxl0="inet 172.16.3.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>>> ifconfig_cxl1="inet 172.16.4.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>>> ifconfig_cxl2="inet 172.16.5.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>>> ifconfig_cxl3="inet 172.16.6.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>>> # aliases instead of reconfiguring test clients. See above commented out entries
>>>>>>>> ifconfig_cxl0_alias0="172.16.7.1 netmask 255.255.255.0"
>>>>>>>> ifconfig_cxl1_alias0="172.16.8.1 netmask 255.255.255.0"
>>>>>>>> ifconfig_cxl2_alias0="172.16.1.1 netmask 255.255.255.0"
>>>>>>>> ifconfig_cxl3_alias0="172.16.2.1 netmask 255.255.255.0"
>>>>>>>> # for remote monitoring/admin of the test device
>>>>>>>> ifconfig_igb0="inet 172.30.60.60 netmask 255.255.0.0"
>>>>>>>>
>>>>>>>> Additional configurations:
>>>>>>>> cpuset-chelsio-6cpu-high
>>>>>>>> # Original provided by Navdeep Parhar <nparhar@gmail.com>
>>>>>>>> # takes vmstat -ai output into a list, and assigns interrupts in order to
>>>>>>>> # the available CPU cores.
>>>>>>>> # Modified: to assign only to the 'high CPUs', ie: on core1.
>>>>>>>> # See: http://lists.freebsd.org/pipermail/freebsd-net/2014-July/039317.html
>>>>>>>> #!/usr/local/bin/bash
>>>>>>>> ncpu=12
>>>>>>>> irqlist=$(vmstat -ia | egrep 't4nex|t5nex|cxgbc' | cut -f1 -d: | cut -c4-)
>>>>>>>> i=6
>>>>>>>> for irq in $irqlist; do
>>>>>>>>     cpuset -l $i -x $irq
>>>>>>>>     i=$((i+1))
>>>>>>>>     [ $i -ge $ncpu ] && i=6
>>>>>>>> done
>>>>>>>>
>>>>>>>> Client Description:
>>>>>>>>
>>>>>>>> Two Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz processors
>>>>>>>> 64 GB ram
>>>>>>>> Mellanox Technologies MT27500 Family [ConnectX-3]
>>>>>>>> Centos 6.4 with updates
>>>>>>>> iperf3 installed from yum repositories: iperf3-3.0.3-3.el6.x86_64
>>>>>>>>
>>>>>>>> Test setup:
>>>>>>>>
>>>>>>>> I've found about 3 streams between Centos clients is about the best way
>>>>>>>> to get the most out of them.
>>>>>>>> Above certain points, the -b flag does not change results.
>>>>>>>> -N is an artifact from using TCP
>>>>>>>> -l is needed, as -M doesn't work for UDP.
>>>>>>>>
>>>>>>>> I usually use launch scripts similar to the following:
>>>>>>>>
>>>>>>>> for i in `seq 41 60`; do ssh loader$i "export TIME=120; export STREAMS=1; export PORT=52$i; export PKT=64; export RATE=2000m; /root/iperf-test-8port-udp" & done
>>>>>>>>
>>>>>>>> The scripts execute the following on each host.
>>>>>>>>
>>>>>>>> #!/bin/bash
>>>>>>>> PORT1=$PORT
>>>>>>>> PORT2=$(($PORT+1000))
>>>>>>>> PORT3=$(($PORT+2000))
>>>>>>>> iperf3 -c loader41-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
>>>>>>>> iperf3 -c loader42-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
>>>>>>>> iperf3 -c loader43-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
>>>>>>>> ... (through all clients and all three ports) ...
>>>>>>>> iperf3 -c loader60-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT3 &
>>>>>>>>
>>>>>>>>
>>>>>>>> Results:
>>>>>>>>
>>>>>>>> Summarized, netstat -w 1 -q 240 -bd, run through:
>>>>>>>> cat test4-tuning | egrep -v {'packets | input '} | awk '{ipackets+=$1} {idrops+=$3} {opackets+=$5} {odrops+=$9} END {print "input " ipackets/NR, "idrops " idrops/NR, "opackets " opackets/NR, "odrops " odrops/NR}'
>>>>>>>>
>>>>>>>> input 1.10662e+07 idrops 8.01783e+06 opackets 3.04516e+06 odrops 3152.4
>>>>>>>>
>>>>>>>> Snapshot of raw output:
>>>>>>>>
>>>>>>>>             input         (Total)               output
>>>>>>>>  packets  errs    idrops       bytes  packets  errs      bytes  colls  drops
>>>>>>>> 11189148     0   7462453  1230805216  3725006     0  409750710      0    799
>>>>>>>> 10527505     0   6746901  1158024978  3779096     0  415700708      0    127
>>>>>>>> 10606163     0   6850760  1166676673  3751780     0  412695761      0   1535
>>>>>>>> 10749324     0   7132014  1182425799  3617558     0  397930956      0   5972
>>>>>>>> 10695667     0   7022717  1176521907  3669342     0  403627236      0   1461
>>>>>>>> 10441173     0   6762134  1148528662  3675048     0  404255540      0   6021
>>>>>>>> 10683773     0   7005635  1175215014  3676962     0  404465671      0   2606
>>>>>>>> 10869859     0   7208696  1195683372  3658432     0  402427698      0    979
>>>>>>>> 11948989     0   8310926  1314387881  3633773     0  399714986      0    725
>>>>>>>> 12426195     0   8864415  1366877194  3562311     0  391853156      0   2762
>>>>>>>> 13006059     0   9432389  1430661751  3570067     0  392706552      0   5158
>>>>>>>> 12822243     0   9098871  1410443600  3715177     0  408668500      0   4064
>>>>>>>> 13317864     0   9683602  1464961374  3632156     0  399536131      0   3684
>>>>>>>> 13701905     0  10182562  1507207982  3523101     0  387540859      0   8690
>>>>>>>> 13820227     0  10244870  1520221820  3562038     0  391823322      0   2426
>>>>>>>> 14437060     0  10955483  1588073033  3480105     0  382810557      0   2619
>>>>>>>> 14518471     0  11119573  1597028105  3397439     0  373717355      0   5691
>>>>>>>> 14890287     0  11675003  1637926521  3199812     0  351978304      0  11007
>>>>>>>> 14923610     0  11749091  1641594441  3171436     0  348857468      0   7389
>>>>>>>> 14738704     0  11609730  1621254991  3117715     0  342948394      0   2597
>>>>>>>> 14753975     0  11549735  1622935026  3207393     0  352812846      0   4798
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> freebsd-net@freebsd.org mailing list
>>>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>>>>> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
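
A side note for anyone re-running the averaging step quoted above: the following is only a minimal sketch, assuming the nine-column `netstat -w 1 -bd` layout shown in the snapshot. The function name `summarize` is made up for illustration and is not part of any FreeBSD tool. It selects data rows by shape (nine fields, numeric first field) instead of matching header text, which sidesteps any question about how the brace-quoted egrep pattern gets parsed, and it also prints input drops as a percentage of input packets.

```sh
#!/bin/sh
# Sketch: average per-second netstat counters and report the input drop ratio.
# Assumed column layout (from the snapshot in the quoted message):
#   $1 input packets, $3 input drops, $5 output packets, $9 output drops
# summarize is a hypothetical helper name chosen for this example.
summarize() {
    awk '
        # Keep only data rows: exactly nine fields and a numeric first field.
        # Both header lines fail this test and are skipped.
        NF == 9 && $1 ~ /^[0-9]+$/ {
            ipackets += $1; idrops += $3; opackets += $5; odrops += $9; n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "samples %d input %.0f idrops %.0f opackets %.0f odrops %.1f idrop_pct %.1f\n",
                n, ipackets / n, idrops / n, opackets / n, odrops / n,
                100 * idrops / ipackets
        }'
}

# Usage with the first three rows of the snapshot quoted above.
summarize <<'EOF'
            input         (Total)               output
 packets  errs    idrops       bytes  packets  errs      bytes  colls  drops
11189148     0   7462453  1230805216  3725006     0  409750710      0    799
10527505     0   6746901  1158024978  3779096     0  415700708      0    127
10606163     0   6850760  1166676673  3751780     0  412695761      0   1535
EOF
```

Dividing by a count of data rows rather than by NR keeps any header lines that slip through the filter from diluting the averages. On the 240-second averages John posted (idrops 8.01783e+06 against input 1.10662e+07), the same ratio comes out at roughly 72% of input packets dropped.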