From: Paul <paul@gtcomm.net>
Date: Thu, 03 Jul 2008 16:25:04 -0400
To: Bruce Evans
Cc: FreeBSD Net <freebsd-net@freebsd.org>, Ingo Flaschberger
Subject: Re: Freebsd IP Forwarding performance (question, and some info)
 [7-stable, current, em, smp]

Opteron 2222, UP mode, no polling:

            input          (em0)           output
   packets  errs      bytes    packets  errs      bytes colls
   1071020     0   66403248          2     0        404     0
   1049793     0   65087174          2     0        356     0
   1040320     0   64499848          2     0        356     0
   1049712     0   65082152          2     0        356     0
   1039504     0   64449256          2     0        356     0
    933118     0   57853324          2     0        356     0

It still has some CPU left, and I can't generate any more packets.

Polling turned on provided better performance on 32-bit, but it gets
strange errors on 64-bit..  Even at low pps I get small amounts of
errors, and at high pps the same thing..  You would think that if it got
errors at low pps it would get more errors at high pps, but that isn't
the case..

Polling on:

   packets  errs      bytes    packets  errs      bytes colls
    979736   963   60743636          1     0        226     0
    991838   496   61493960          1     0        178     0
    996125   460   61759754          1     0        178     0
    979381   326   60721626          1     0        178     0
   1022249   379   63379442          1     0        178     0
    991468   557   61471020          1     0        178     0

Lowering pps a little:

            input          (em0)           output
   packets  errs      bytes    packets  errs      bytes colls
    818688   151   50758660          1     0        226     0
    837920   179   51951044          1     0        178     0
    826217   168   51225458          1     0        178     0
    801017   100   49663058          1     0        178     0
    761857   287   47235138          1     0        178     0

What could cause this?  If I'm going to use a uniprocessor system I
NEED polling to work, because I have to have CPU cycles left over for
userspace processes and I can't afford to have interrupt load lock
them out.  SMP would be no big deal if it actually worked..
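For reference, the polling runs above are just the stock DEVICE_POLLING
machinery; a minimal sketch of the knobs involved is below.  The HZ and
burst values shown are only the ones Bruce asks about further down, not
confirmed settings for this box, and the counters themselves are
presumably plain netstat -w 1 -I em0 output.

   # kernel config:
   options DEVICE_POLLING
   options HZ=2000                      # assumed; matches the hz=2000 asked about below

   # enable polling on the interface (FreeBSD 6/7 style):
   ifconfig em0 polling

   # the runtime knobs that matter here:
   sysctl kern.polling.burst_max=256    # max descriptors read per poll (assumed value)
   sysctl kern.polling.each_burst=256   # descriptors per inner burst (assumed value)
   sysctl kern.polling.idle_poll=0      # 1 = keep polling from the idle loop
   sysctl kern.polling.user_frac=50     # % of each tick reserved for userland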
I'm going to do an SMP test with this CPU now with polling off/on, and
then I'm going to apply the polling patch and try that.

Bruce Evans wrote:
> On Thu, 3 Jul 2008, Paul wrote:
>
>> Bruce Evans wrote:
>>>> No polling:
>>>> 843762 25337 52313248  1  0  178  0
>>>> 763555     0 47340414  1  0  178  0
>>>> 830189     0 51471722  1  0  178  0
>>>> 838724     0 52000892  1  0  178  0
>>>> 813594   939 50442832  1  0  178  0
>>>> 807303   763 50052790  1  0  178  0
>>>> 791024     0 49043492  1  0  178  0
>>>> 768316  1106 47635596  1  0  178  0
>>>> Machine is maxed and is unresponsive..
>>>
>>> That's the most interesting one.  Even 1% packet loss would probably
>>> destroy performance, so the benchmarks that give 10-50% packet loss
>>> are uninteresting.
>>>
>> But you realize that it's outputting all of these packets on em3, and
>> I'm watching them coming out, and they are consistent with the packets
>> received on em0 that netstat shows are 'good' packets.
>
> Well, output is easier.  I don't remember seeing the load on a taskq
> for em3.  If there is a memory bottleneck, it might or might not be
> more related to running only 1 taskq per interrupt, depending on how
> independent the memory system is for different CPUs.  I think Opterons
> have more independence here than most x86's.
>
>> I'm using a server Opteron, which supposedly has the best memory
>> performance of any CPU right now.  Plus Opterons have the biggest L1
>> cache, but a small L2 cache.  Do you think the larger L2 cache on the
>> Xeon (6 MB for 2 cores) would be better?  I have a 2222 Opteron coming
>> which is 1 GHz faster, so we will see what happens.
>
> I suspect lower-latency memory would help more.  Big memory systems
> have inherently higher latency.  My little old A64 workstation and
> laptop have main memory latencies 3 times smaller than freebsd.org's
> new Core2 servers according to lmbench2 (42 nsec for the overclocked
> DDR PC3200 one and 55 for the DDR2 PC5400 (?) one, vs 145-155 nsec).
> If there are a lot of cache misses, then the extra 100 nsec can be
> important.  Profiling of sendto() using hwpmc or perfmon shows a
> significant number of cache misses per packet (2 or 10?).
>
>>>> Polling ON:
>>>>             input          (em0)           output
>>>>    packets  errs      bytes    packets  errs      bytes colls
>>>>     784138 179079  48616564          1     0        226     0
>>>>     788815 129608  48906530          2     0        356     0
>>>> Machine is responsive and has 40% idle CPU..  Why ALWAYS 40%?  I'm
>>>> really mystified by this..
>>>
>>> Is this with hz=2000 and 256/256 and no polling in idle?  40% is easy
>>> to explain (perhaps incorrectly).  Polling can then read at most 256
>>> descriptors every 1/2000 second, giving a max throughput of 512 kpps.
>>> Packets < descriptors in general, but they might be equal here (for
>>> small packets).  You seem to actually get 784 kpps, which is too high
>>> even in descriptors, but matches exactly if the errors are counted
>>> twice (784 - 179 - 505 ~= 512).  CPU is getting short too, but 40%
>>> still happens to be left over after giving up at 512 kpps.  Most of
>>> the errors are probably handled by the hardware at low cost in CPU by
>>> dropping packets.  There are other types of errors, but none except
>>> dropped packets is likely.
>>>
>> Read above: it's actually transmitting 770 kpps out of em3, so it
>> can't just be 512 kpps.
>
> Transmitting is easier, but with polling it's even harder to send
> faster than hz * queue_length than it is to receive.  This is without
> polling in idle.
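To spell out the ceiling Bruce describes: each clock tick the polling
code reads at most burst_max descriptors, so the hard receive limit is
hz * burst_max packets per second.  The quick check below uses the
hz=2000 and 256 figures from his question above, which are assumptions
about this setup rather than measured values.

   # packets/second the rx path can see under polling =
   #   ticks per second (hz) * descriptors read per tick (burst_max)
   $ echo $((2000 * 256))
   512000               # ~512 kpps, the ceiling mentioned above
   $ echo $((1000 * 256))
   256000               # same budget if hz were left at 1000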
>> I was thinking of trying 4 or 5.. but how would that work with this
>> new hardware?
>
> Poorly, except possibly with polling in FreeBSD-4.  FreeBSD-4 generally
> has lower overheads and latency, but is missing important improvements
> (mainly TCP optimizations in upper layers, better DMA and/or mbuf
> handling, and support for newer NICs).  FreeBSD-5 is also missing the
> overhead+latency advantage.
>
> Here are some benchmarks.  (ttcp mainly tests sendto().  4.10 em needed
> a 2-line change to support a not-so-new PCI em NIC.)  Summary:
> - my bge NIC can handle about 600 kpps on my faster machine, but only
>   achieves 300 in 4.10 unpatched.
> - my em NIC can handle about 400 kpps on my slower machine, except in
>   later versions it can receive at about 600 kpps.
> - only 6.x and later can achieve near wire throughput for 1500-MTU
>   packets (81 kpps vs 76 kpps).  This depends on better DMA or mbuf
>   handling...  I now remember the details -- it is mainly better mbuf
>   handling: old versions split the 1500-MTU packets into 2 mbufs and
>   this causes 2 descriptors per packet, which causes extra software
>   overheads and even larger overheads for the hardware.
>
> %%%
> Results of benchmarks run on 23 Feb 2007:
>
> my~5.2 bge --> ~4.10 em
>                            tx                  rx
>                     kpps load%  ips     kpps load%  ips
> ttcp -l5 -u -t       639    98 1660     398*    77   8k
> ttcp -l5 -t          6.0   100 3960      6.0     6 5900
> ttcp -l1472 -u -t     76    27  395       76    40   8k
> ttcp -l1472 -t        51    40  11k       51    26   8k
>
> (*) Same as sender according to netstat -I, but systat -ip shows that
>     almost half aren't delivered to upper layers.
>
> my~5.2 bge --> 4.11 em
>                            tx                  rx
>                     kpps load%  ips     kpps load%  ips
> ttcp -l5 -u -t       635    98 1650     399*    74   8k
> ttcp -l5 -t          5.8   100 3900      5.8     6 5800
> ttcp -l1472 -u -t     76    27  395       76    32   8k
> ttcp -l1472 -t        51    40  11k       51    25   8k
>
> (*) Same as sender according to netstat -I, but systat -ip shows that
>     almost half aren't delivered to upper layers.
>
> my~5.2 bge --> my~5.2 em
>                            tx                  rx
>                     kpps load%  ips     kpps load%  ips
> ttcp -l5 -u -t       638    98 1660     394*  100-   8k
> ttcp -l5 -t          5.8   100 3900      5.8     9 6000
> ttcp -l1472 -u -t     76    27  395       76    46   8k
> ttcp -l1472 -t        51    40  11k       51    35   8k
>
> (*) Same as sender according to netstat -I, but systat -ip shows that
>     almost half aren't delivered to upper layers.  With the em rate
>     limit on ips changed from 8k to 80k, about 95% are delivered up.
>
> my~5.2 bge --> 6.2 em
>                            tx                  rx
>                     kpps load%  ips     kpps load%  ips
> ttcp -l5 -u -t       637    98 1660      637  100-  15k
> ttcp -l5 -t          5.8   100 3900      5.8     8  12k
> ttcp -l1472 -u -t     76    27  395       76    36  16k
> ttcp -l1472 -t        51    40  11k       51    37  16k
>
> my~5.2 bge --> ~current em-fastintr
>                            tx                  rx
>                     kpps load%  ips     kpps load%  ips
> ttcp -l5 -u -t       641    98 1670      641    99   8k
> ttcp -l5 -t          5.9   100 2670      5.9     7   6k
> ttcp -l1472 -u -t     76    27  395       76    35   8k
> ttcp -l1472 -t        52    43  11k       52    30   8k
>
> ~6.2 bge --> ~current em-fastintr
>                            tx                  rx
>                     kpps load%  ips     kpps load%  ips
> ttcp -l5 -u -t       309    62 1600      309    64   8k
> ttcp -l5 -t          4.9   100 3000      4.9     6   7k
> ttcp -l1472 -u -t     76    27  395       76    34   8k
> ttcp -l1472 -t        54    28 6800       54    30   8k
>
> ~current bge --> ~current em-fastintr
>                            tx                  rx
>                     kpps load%  ips     kpps load%  ips
> ttcp -l5 -u -t       602   100 1570      602    99   8k
> ttcp -l5 -t          5.3   100 2660      5.3     5 5300
> ttcp -l1472 -u -t     81#   19  212      81#    38   8k
> ttcp -l1472 -t        53    34  11k       53    30   8k
>
> (#) Wire speed to within 0.5%.  This is the only kpps in this set of
>     benchmarks that is close to wire speed.  Older kernels apparently
>     lose relative to -current because mbufs for mtu-sized packets are
>     not contiguous in older kernels.
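For anyone wanting to reproduce rows like these: the left-hand column
entries are the literal ttcp sender commands (with the target host
appended), but the receiver side isn't shown anywhere above, so the sink
invocation sketched below is an assumption based on classic ttcp usage;
rxhost is a placeholder.

   # receiver ("sink"), started first on the rx box -- assumed invocation
   # (drop -u on both ends for the TCP rows):
   ttcp -r -u -s                  # -r receive, -u UDP, -s discard the data
   # senders, matching the table rows:
   ttcp -l5 -u -t rxhost          # tiny 5-byte UDP writes: worst-case pps
   ttcp -l5 -t rxhost             # the same small writes over TCP
   ttcp -l1472 -u -t rxhost       # 1472-byte UDP, i.e. full 1500-MTU frames
   ttcp -l1472 -t rxhost          # MTU-sized writes over TCP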
> Old results:
>
> ~4.10 bge --> my~5.2 em
>                            tx                  rx
>                     kpps load%  ips     kpps load%  ips
> ttcp -l5 -u -t       n/a   n/a  n/a      346    79   8k
> ttcp -l5 -t          n/a   n/a  n/a      5.4    10 6800
> ttcp -l1472 -u -t    n/a   n/a  n/a       67    40   8k
> ttcp -l1472 -t       n/a   n/a  n/a       51    36   8k
>
> ~4.10 kernel, =4 bge --> ~current em
>                            tx                  rx
>                     kpps load%  ips     kpps load%  ips
> ttcp -l5 -u -t       n/a   n/a  n/a      347    96  14k
> ttcp -l5 -t          n/a   n/a  n/a      5.8    10  14k
> ttcp -l1472 -u -t    n/a   n/a  n/a       67    62  14K
> ttcp -l1472 -t       n/a   n/a  n/a       52    40  16k
>
> ~4.10 kernel, =4+ bge --> ~current em
>                            tx                  rx
>                     kpps load%  ips     kpps load%  ips
> ttcp -l5 -u -t       n/a   n/a  n/a      627   100   9k
> ttcp -l5 -t          n/a   n/a  n/a      5.6     9  13k
> ttcp -l1472 -u -t    n/a   n/a  n/a       68    63  14k
> ttcp -l1472 -t       n/a   n/a  n/a       54    44  16k
> %%%
>
> %%%
> Results of benchmarks run on 28 Dec 2007:
>
> ~5.2 epsplex (em) ttcp:
>                         Csw  Trp   Sys   Int   Sof   Sys  Intr User Idle
> local no sink:         825k    3  206k   229  412k  52.1  45.1  2.8
> local with sink:       659k    3  263k   231  131k  66.5  27.3  6.2
> tx remote no sink:      35k    3  273k  8237  266k  42.0  52.1  2.3  3.6
> tx remote with sink:    26k    3  394k  8224   100  60.0  5.41  3.4 11.2
> rx remote no sink:      25k    4    26  8237  373k  20.6  79.4  0.0  0.0
> rx remote with sink:    30k    3  203k  8237  398k  36.5  60.7  2.8  0.0
>
> 6.3-PR besplex (em) ttcp:
>                         Csw  Trp   Sys   Int   Sof   Sys  Intr User Idle
> local no sink:         417k    1  208k  418k     2  49.5  48.5  2.0
> local with sink:       420k    1  276k  145k     2  70.0  23.6  6.4
> tx remote no sink:      19k    2  250k  8144     2  58.5  38.7  2.8  0.0
> tx remote with sink:    16k    2  361k  8336     2  72.9  24.0  3.1  4.4
> rx remote no sink:      429    3    49   888     2   0.3 99.33  0.0  0.4
> rx remote with sink:    13k    2  316k  5385     2  31.7  63.8  3.6  0.8
>
> 8.0-C epsplex (em-fast) ttcp:
>                         Csw  Trp   Sys   Int   Sof   Sys  Intr User Idle
> local no sink:         442k    3  221k   230  442k  47.2  49.6  2.7
> local with sink:       394k    3  262k   228  131k  72.1  22.6  5.3
> tx remote no sink:      17k    3  226k  7832   100  94.1   0.2  3.0  0.0
> tx remote with sink:    17k    3  360k  7962   100  91.7   0.2  3.7  4.4
> rx remote no sink:     saturated -- cannot update systat display
> rx remote with sink:    15k    6  358k  8224   100  97.0   0.0  2.5  0.5
>
> ~4.10 besplex (bge) ttcp:
>                         Csw  Trp   Sys   Int   Sof   Sys  Intr User Idle
> local no sink:           15    0  425k   228    11  96.3   0.0  3.7
> local with sink:         **    0  622k   229    **  94.7   0.3  5.0
> tx remote no sink:       29    1  490k  7024    11  47.9  29.8  4.4 17.9
> tx remote with sink:     26    1  635k  1883    11  65.7  11.4  5.6 17.3
> rx remote no sink:        5    1    68  7025     1   0.0  47.3  0.0 52.7
> rx remote with sink:   6679    2  365k  6899    12  19.7  29.2  2.5 48.7
>
> ~5.2-C besplex (bge) ttcp:
>                         Csw  Trp   Sys   Int   Sof   Sys  Intr User Idle
> local no sink:           1M    3  271k   229  543k  50.7  46.8  2.5
> local with sink:         1M    3  406k   229  203k  67.4  28.2  4.4
> tx remote no sink:      49k    3  474k   11k  167k  52.3  42.7  5.0  0.0
> tx remote with sink:   6371    3  641k  1900   100  76.0  16.8  6.2  0.9
> rx remote no sink:      34k    3    25   11k  270k   0.8  65.4  0.0 33.8
> rx remote with sink:    41k    3  365k   10k  370k  31.5  47.1  2.3 19.0
>
> 6.3-PR besplex (bge) ttcp (hz = 1000, else stathz broken):
>                         Csw  Trp   Sys   Int   Sof   Sys  Intr User Idle
> local no sink:         540k    0  270k  540k     0  50.5  46.0  3.5
> local with sink:       628k    0  417k  210k     0  68.8  27.9  3.3
> tx remote no sink:      15k    1  222k  7190     1  28.4  29.3  1.7 40.6
> tx remote with sink:   5947    1  315k  2825     1  39.9  14.7  2.6 42.8
> rx remote no sink:      13k    1    23  6943     0   0.3  49.5  0.2 50.0
> rx remote with sink:    20k    1  371k  6819     0  29.5  30.1  3.9 36.5
>
> 8.0-C besplex (bge) ttcp:
>                         Csw  Trp   Sys   Int   Sof   Sys  Intr User Idle
> local no sink:         649k    3  324k   100  649k  53.9  42.9  3.2
> local with sink:       649k    3  433k   100  216k  75.2  18.8  6.0
> tx remote no sink:      24k    3  432k   10k   100  49.7  41.3  2.4  6.6
> tx remote with sink:   3199    3  568k  1580   100  64.3  19.6  4.0 12.2
> rx remote no sink:      20k    3    27   10k   100   0.0  46.1  0.0 53.9
> rx remote with sink:    31k    3  370k   10k   100  30.7  30.9  4.8 33.5
> %%%
>
> Bruce