From: Bruce Evans <brde@optusnet.com.au>
Date: Thu, 3 Jul 2008 20:41:54 +1000 (EST)
To: Paul
Cc: FreeBSD Net <freebsd-net@freebsd.org>, Ingo Flaschberger
Subject: Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Message-ID: <20080703195521.O6973@delplex.bde.org>
In-Reply-To: <486C7F93.7010308@gtcomm.net>

On Thu, 3 Jul 2008, Paul wrote:

> Bruce Evans wrote:
>>> No polling:
>>> 843762 25337 52313248 1 0 178 0
>>> 763555     0 47340414 1 0 178 0
>>> 830189     0 51471722 1 0 178 0
>>> 838724     0 52000892 1 0 178 0
>>> 813594   939 50442832 1 0 178 0
>>> 807303   763 50052790 1 0 178 0
>>> 791024     0 49043492 1 0 178 0
>>> 768316  1106 47635596 1 0 178 0
>>> Machine is maxed and is unresponsive..
>>
>> That's the most interesting one.  Even 1% packet loss would probably
>> destroy performance, so the benchmarks that give 10-50% packet loss
>> are uninteresting.
>>
> But you realize that it's outputting all of these packets on em3 and I'm
> watching them coming out, and they are consistent with the packets
> received on em0 that netstat shows are 'good' packets.

Well, output is easier.  I don't remember seeing the load on a taskq for
em3.  If there is a memory bottleneck, it might or might not be more
related to running only 1 taskq per interrupt, depending on how
independent the memory system is for different CPUs.  I think Opterons
have more independence here than most x86's.

> I'm using a server Opteron which supposedly has the best memory
> performance out of any CPU right now.  Plus Opterons have the biggest L1
> cache, but a small L2 cache.  Do you think a larger L2 cache on the Xeon
> (6 MB for 2 cores) would be better?  I have a 2222 Opteron coming which
> is 1 GHz faster, so we will see what happens.

I suspect lower latency memory would help more.  Big memory systems have
inherently higher latency.  My little old A64 workstation and laptop have
main memory latencies 3 times smaller than freebsd.org's new Core2
servers according to lmbench2 (42 nsec for the overclocked DDR PC3200 one
and 55 for the DDR2 PC5400 (?) one, vs 145-155 nsec).  If there are a lot
of cache misses, then the extra 100 nsec can be important.  Profiling of
sendto() using hwpmc or perfmon shows a significant number of cache
misses per packet (2 or 10?).

>>> Polling ON:
>>>            input          (em0)           output
>>>   packets  errs      bytes    packets  errs      bytes colls
>>>    784138 179079  48616564          1     0        226     0
>>>    788815 129608  48906530          2     0        356     0
>>> Machine is responsive and has 40% idle cpu.. Why ALWAYS 40% ? I'm
>>> really mystified by this..
>>
>> Is this with hz=2000 and 256/256 and no polling in idle?  40% is easy
>> to explain (perhaps incorrectly).  Polling can then read at most 256
>> descriptors every 1/2000 second, giving a max throughput of 512 kpps.
>> Packets < descriptors in general but might be equal here (for small
>> packets).  You seem to actually get 784 kpps, which is too high even
>> in descriptors unless, but matches exactly if the errors are counted
>> twice (784 - 179 - 505 ~= 512).  CPU is getting short too, but 40%
>> still happens to be left over after giving up at 512 kpps.  Most of
>> the errors are probably handled by the hardware at low cost in CPU by
>> dropping packets.  There are other types of errors but none except
>> dropped packets is likely.
>>
> Read above, it's actually transmitting 770kpps out of em3 so it can't
> just be 512kpps.

Transmitting is easier, but with polling it's even harder to send faster
than hz * queue_length than it is to receive.  This is without polling in
idle.
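To make the ceiling arithmetic above concrete, here is a back-of-envelope
sketch.  The hz and per-poll burst values (2000 and 256) are just the
settings assumed in the quoted explanation, not values read from a
running kernel:

%%%
/*
 * Back-of-envelope check of the polling receive ceiling discussed above.
 * hz and the per-poll rx burst are assumptions (2000 and 256), not values
 * queried from a kernel.
 */
#include <stdio.h>

int
main(void)
{
	long hz = 2000;		/* assumed polling rate */
	long rx_burst = 256;	/* assumed descriptors drained per poll */

	/* At most rx_burst descriptors can be read every 1/hz seconds. */
	printf("polling rx ceiling: %ld kpps\n", hz * rx_burst / 1000);
	return (0);
}
%%%

It prints 512 kpps, which is the ceiling referred to above.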
> I was thinking of trying 4 or 5.. but how would that work with this new
> hardware?

Poorly, except possibly with polling in FreeBSD-4.  FreeBSD-4 generally
has lower overheads and latency, but is missing important improvements
(mainly TCP optimizations in upper layers, better DMA and/or mbuf
handling, and support for newer NICs).  FreeBSD-5 is also missing the
overhead+latency advantage.

Here are some benchmarks (ttcp mainly tests sendto() -- a rough sketch of
such a loop follows the summary; 4.10 em needed a 2-line change to
support a not-so-new PCI em NIC).  Summary:
- my bge NIC can handle about 600 kpps on my faster machine, but only
  achieves 300 in 4.10 unpatched.
- my em NIC can handle about 400 kpps on my slower machine, except in
  later versions it can receive at about 600 kpps.
- only 6.x and later can achieve near wire throughput for 1500-MTU
  packets (81 kpps vs 76 kpps).  This depends on better DMA or mbuf
  handling... I now remember the details -- it is mainly better mbuf
  handling: old versions split the 1500-MTU packets into 2 mbufs, and
  this causes 2 descriptors per packet, which causes extra software
  overheads and even larger overheads for the hardware.
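The kind of UDP sendto() loop that "ttcp -l5 -u -t" exercises looks
roughly like the sketch below.  It is only an illustration, not ttcp
itself; the destination address, port and packet count are placeholders:

%%%
/*
 * Rough sketch of the kind of UDP sendto() loop that "ttcp -l5 -u -t"
 * exercises.  Not ttcp itself; the destination address, port and the
 * packet count are placeholders.
 */
#include <sys/socket.h>

#include <arpa/inet.h>
#include <netinet/in.h>

#include <err.h>
#include <string.h>

int
main(void)
{
	struct sockaddr_in sin;
	char buf[5];			/* -l5: 5-byte payload */
	int i, s;

	memset(buf, 0, sizeof(buf));
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(5001);			/* placeholder port */
	sin.sin_addr.s_addr = inet_addr("10.0.0.2");	/* placeholder dst */

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
		err(1, "socket");
	for (i = 0; i < 1000000; i++)	/* blast 1M small packets */
		if (sendto(s, buf, sizeof(buf), 0,
		    (struct sockaddr *)&sin, sizeof(sin)) == -1)
			warn("sendto");
	return (0);
}
%%%

In the tables below, the -l5 rows essentially measure how fast a loop
like this (and the driver behind it) can run, while the -l1472 rows run
up against the wire limit.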
%%%
Results of benchmarks run on 23 Feb 2007:

my~5.2 bge --> ~4.10 em
                        tx                     rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       639     98  1660      398*    77    8k
ttcp -l5 -t          6.0    100  3960      6.0      6  5900
ttcp -l1472 -u -t     76     27   395       76     40    8k
ttcp -l1472 -t        51     40   11k       51     26    8k
(*) Same as sender according to netstat -I, but systat -ip shows that
    almost half aren't delivered to upper layers.

my~5.2 bge --> 4.11 em
                        tx                     rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       635     98  1650      399*    74    8k
ttcp -l5 -t          5.8    100  3900      5.8      6  5800
ttcp -l1472 -u -t     76     27   395       76     32    8k
ttcp -l1472 -t        51     40   11k       51     25    8k
(*) Same as sender according to netstat -I, but systat -ip shows that
    almost half aren't delivered to upper layers.

my~5.2 bge --> my~5.2 em
                        tx                     rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       638     98  1660      394*  100-    8k
ttcp -l5 -t          5.8    100  3900      5.8      9  6000
ttcp -l1472 -u -t     76     27   395       76     46    8k
ttcp -l1472 -t        51     40   11k       51     35    8k
(*) Same as sender according to netstat -I, but systat -ip shows that
    almost half aren't delivered to upper layers.  With the em rate
    limit on ips changed from 8k to 80k, about 95% are delivered up.

my~5.2 bge --> 6.2 em
                        tx                     rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       637     98  1660      637   100-   15k
ttcp -l5 -t          5.8    100  3900      5.8      8   12k
ttcp -l1472 -u -t     76     27   395       76     36   16k
ttcp -l1472 -t        51     40   11k       51     37   16k

my~5.2 bge --> ~current em-fastintr
                        tx                     rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       641     98  1670      641     99    8k
ttcp -l5 -t          5.9    100  2670      5.9      7    6k
ttcp -l1472 -u -t     76     27   395       76     35    8k
ttcp -l1472 -t        52     43   11k       52     30    8k

~6.2 bge --> ~current em-fastintr
                        tx                     rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       309     62  1600      309     64    8k
ttcp -l5 -t          4.9    100  3000      4.9      6    7k
ttcp -l1472 -u -t     76     27   395       76     34    8k
ttcp -l1472 -t        54     28  6800       54     30    8k

~current bge --> ~current em-fastintr
                        tx                     rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       602    100  1570      602     99    8k
ttcp -l5 -t          5.3    100  2660      5.3      5  5300
ttcp -l1472 -u -t     81#    19   212       81#    38    8k
ttcp -l1472 -t        53     34   11k       53     30    8k
(#) Wire speed to within 0.5%.  This is the only kpps in this set of
    benchmarks that is close to wire speed.  Older kernels apparently
    lose relative to -current because mbufs for mtu-sized packets are
    not contiguous in older kernels.
Old results:

~4.10 bge --> my~5.2 em
                        tx                     rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       n/a    n/a   n/a      346     79    8k
ttcp -l5 -t          n/a    n/a   n/a      5.4     10  6800
ttcp -l1472 -u -t    n/a    n/a   n/a       67     40    8k
ttcp -l1472 -t       n/a    n/a   n/a       51     36    8k

~4.10 kernel, =4 bge --> ~current em
                        tx                     rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       n/a    n/a   n/a      347     96   14k
ttcp -l5 -t          n/a    n/a   n/a      5.8     10   14k
ttcp -l1472 -u -t    n/a    n/a   n/a       67     62   14k
ttcp -l1472 -t       n/a    n/a   n/a       52     40   16k

~4.10 kernel, =4+ bge --> ~current em
                        tx                     rx
                    kpps  load%   ips     kpps  load%   ips
ttcp -l5 -u -t       n/a    n/a   n/a      627    100    9k
ttcp -l5 -t          n/a    n/a   n/a      5.6      9   13k
ttcp -l1472 -u -t    n/a    n/a   n/a       68     63   14k
ttcp -l1472 -t       n/a    n/a   n/a       54     44   16k
%%%

%%%
Results of benchmarks run on 28 Dec 2007:

~5.2 epsplex (em) ttcp:
                       Csw  Trp   Sys   Int   Sof   Sys  Intr  User  Idle
local no sink:        825k    3  206k   229  412k  52.1  45.1   2.8
local with sink:      659k    3  263k   231  131k  66.5  27.3   6.2
tx remote no sink:     35k    3  273k  8237  266k  42.0  52.1   2.3   3.6
tx remote with sink:   26k    3  394k  8224   100  60.0  5.41   3.4  11.2
rx remote no sink:     25k    4    26  8237  373k  20.6  79.4   0.0   0.0
rx remote with sink:   30k    3  203k  8237  398k  36.5  60.7   2.8   0.0

6.3-PR besplex (em) ttcp:
                       Csw  Trp   Sys   Int   Sof   Sys  Intr  User  Idle
local no sink:        417k    1  208k  418k     2  49.5  48.5   2.0
local with sink:      420k    1  276k  145k     2  70.0  23.6   6.4
tx remote no sink:     19k    2  250k  8144     2  58.5  38.7   2.8   0.0
tx remote with sink:   16k    2  361k  8336     2  72.9  24.0   3.1   4.4
rx remote no sink:     429     3    49   888     2   0.3 99.33   0.0   0.4
rx remote with sink:   13k    2  316k  5385     2  31.7  63.8   3.6   0.8

8.0-C epsplex (em-fast) ttcp:
                       Csw  Trp   Sys   Int   Sof   Sys  Intr  User  Idle
local no sink:        442k    3  221k   230  442k  47.2  49.6   2.7
local with sink:      394k    3  262k   228  131k  72.1  22.6   5.3
tx remote no sink:     17k    3  226k  7832   100  94.1   0.2   3.0   0.0
tx remote with sink:   17k    3  360k  7962   100  91.7   0.2   3.7   4.4
rx remote no sink:     saturated -- cannot update systat display
rx remote with sink:   15k    6  358k  8224   100  97.0   0.0   2.5   0.5

~4.10 besplex (bge) ttcp:
                       Csw  Trp   Sys   Int   Sof   Sys  Intr  User  Idle
local no sink:          15    0  425k   228    11  96.3   0.0   3.7
local with sink:        **    0  622k   229    **  94.7   0.3   5.0
tx remote no sink:      29    1  490k  7024    11  47.9  29.8   4.4  17.9
tx remote with sink:    26    1  635k  1883    11  65.7  11.4   5.6  17.3
rx remote no sink:       5    1    68  7025     1   0.0  47.3   0.0  52.7
rx remote with sink:  6679    2  365k  6899    12  19.7  29.2   2.5  48.7

~5.2-C besplex (bge) ttcp:
                       Csw  Trp   Sys   Int   Sof   Sys  Intr  User  Idle
local no sink:          1M    3  271k   229  543k  50.7  46.8   2.5
local with sink:        1M    3  406k   229  203k  67.4  28.2   4.4
tx remote no sink:     49k    3  474k   11k  167k  52.3  42.7   5.0   0.0
tx remote with sink:  6371    3  641k  1900   100  76.0  16.8   6.2   0.9
rx remote no sink:     34k    3    25   11k  270k   0.8  65.4   0.0  33.8
rx remote with sink:   41k    3  365k   10k  370k  31.5  47.1   2.3  19.0

6.3-PR besplex (bge) ttcp (hz = 1000 else stathz broken):
                       Csw  Trp   Sys   Int   Sof   Sys  Intr  User  Idle
local no sink:        540k    0  270k  540k     0  50.5  46.0   3.5
local with sink:      628k    0  417k  210k     0  68.8  27.9   3.3
tx remote no sink:     15k    1  222k  7190     1  28.4  29.3   1.7  40.6
tx remote with sink:  5947    1  315k  2825     1  39.9  14.7   2.6  42.8
rx remote no sink:     13k    1    23  6943     0   0.3  49.5   0.2  50.0
rx remote with sink:   20k    1  371k  6819     0  29.5  30.1   3.9  36.5

8.0-C besplex (bge) ttcp:
                       Csw  Trp   Sys   Int   Sof   Sys  Intr  User  Idle
local no sink:        649k    3  324k   100  649k  53.9  42.9   3.2
local with sink:      649k    3  433k   100  216k  75.2  18.8   6.0
tx remote no sink:     24k    3  432k   10k   100  49.7  41.3   2.4   6.6
tx remote with sink:  3199    3  568k  1580   100  64.3  19.6   4.0  12.2
rx remote no sink:     20k    3    27   10k   100   0.0  46.1   0.0  53.9
rx remote with sink:   31k    3  370k   10k   100  30.7  30.9   4.8  33.5
%%%

Bruce