From: Kevin Day <toasty@dragondata.com>
Date: Wed, 5 Oct 2005 07:52:16 -0500
To: ferdinand.goldmann@jku.at
Cc: net@freebsd.org
Subject: Re: dummynet, em driver, device polling issues :-((

On Oct 5, 2005, at 7:21 AM, Ferdinand Goldmann wrote:

>> In one case, we had a system acting as a router. It was a Dell
>> PowerEdge 2650, with two dual "server" adapters, each on a separate
>> PCI bus. Three were "LAN" links, and one was a "WAN" link. The LAN
>> links were receiving about 300mbps each, all going out the "WAN"
>> link at near 900mbps at peak. We were never able to get above
>> 944mbps, but I never cared enough to figure out where the
>> bottleneck was.
>
> 944mbps is a very good value, anyway. What we see in our setup are
> throughput rates around 300mbps or below. When testing with
> tcpspray, throughput hardly exceeded 13MB/s.
>
> Are you running vlans on your interface? Our em0 card connects
> several sites together, which all sit on separate vlan interfaces
> for which em0 acts as the parent interface.

Two of the interfaces had vlans, two didn't.

>> This was with PCI-X, and a pretty stripped config on the server
>> side.
>
> Maybe this makes a difference, too. We only have a quite old
> xSeries 330 with PCI and a 1.2GHz CPU.

I think that's a really important point. If you're running a "normal"
32-bit, 33MHz PCI bus, the math just doesn't work for high speeds. The
entire bandwidth of the bus is just a tad over 1gbps. Assuming 100%
efficiency (you receive a packet, then turn around and resend it
immediately), you'll only be able to reach about 500mbps.
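
Rough numbers, just to show where those figures come from. This little
C sketch only does the raw bus arithmetic; it ignores all of the
per-transaction overhead I get into next, so treat it as an upper
bound, not a prediction:

#include <stdio.h>

/*
 * Back-of-the-envelope PCI math for a software router: the raw bus
 * rate is width (bits) x clock (MHz) megabits per second, and a
 * forwarded packet has to cross the shared, half-duplex bus twice
 * (once coming in, once going out), so the usable rate is halved.
 */
int
main(void)
{
	struct { const char *name; int bits; int mhz; } bus[] = {
		{ "32-bit/33MHz PCI", 32, 33 },
		{ "64-bit/33MHz PCI", 64, 33 },
	};
	size_t i;

	for (i = 0; i < sizeof(bus) / sizeof(bus[0]); i++) {
		double raw = (double)bus[i].bits * bus[i].mhz;	/* Mbit/s */

		printf("%s: ~%.0f mbps raw, ~%.0f mbps when forwarding\n",
		    bus[i].name, raw, raw / 2.0);
	}
	return (0);
}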
When you add in the overhead of each PCI transaction, the fact that
the CPU can't instantly turn around and send data out the same cycle
that the last packet finished being received, and other
inefficiencies, you will probably only see something in the
250-300mbps range at MOST, if that.

I believe the xSeries 330 uses 64-bit, 33MHz slots, though. That gives
you double the bandwidth to play with. But I'm still not convinced
that the CPU isn't the bottleneck there. If you know you're running
64/33 in the slot you have the card in, I'd be willing to say you
could do 500mbps or so at peak. A bunch of IPFW rules, the CPU just
not being able to keep up, other activity on the system, or a complex
routing table will reduce that.

Just to sum up:

A 64/33MHz bus has a theoretical speed of 2gbps.

If you're forwarding packets in one interface and out another, you
have to cut that in half: PCI is half duplex, so you can't receive
and send at the same time. That leaves 1gbps.

PCI itself isn't 100% efficient. You burn cycles setting up each PCI
transaction. When the card busmasters to dump a packet into RAM, it
frequently has to wait for the memory controller. The ethernet card
itself also needs some PCI bandwidth just to operate - the kernel has
to check its registers, the card has to update pointers in RAM for
the busmaster circular buffer, and so on. All of that takes time on
the PCI bus, leaving maybe 750-800mbps for actual data.

The rest of the system isn't 100% efficient either. The
CPU/kernel/etc. can't turn a packet around and send it out the
instant it's received, which lowers your overall limit further.

I've done a lot of work on custom ethernet interfaces, both in
FreeBSD and in custom embedded OS projects. The safe bet is to assume
that you can route/forward 250mbps on 32/33 and 500mbps on 64/33, if
you have enough CPU to fill the bus.

>> Nothing fancy on polling, I think we set HZ to 10000
>
> Ten-thousand? Or is this a typo, and did you mean thousand?
>
> This is weird. :-( Please, is there any good documentation on
> tuning device polling? The man page does not give any useful
> pointers about values to use for Gbit cards. I have already read
> about people using 2000, 4000 HZ ... Gaaah!
>
> I tried with 1000 and 2000 so far, without good results. It seems
> like everybody makes wild assumptions about what values to use for
> polling.

We arrived at 10000 by experimentation. A large number of interfaces,
a ton of traffic... I'm not sure of all the reasons why it helped,
but it did.

>> , turned on idle_poll, and set user_frac to 10 because we had some
>> CPU-hungry tasks that were not a high priority.
>
> I think I read somewhere about problems with idle_poll. How high is
> your burst_max value? Are you seeing a lot of ierrs?

No ierrs at all, and we never touched burst_max.

In the end, if you're getting "Receive No Buffers" incrementing, it
basically means what it implies: the ethernet chip received a packet
and had no room left to store it, because the CPU hadn't drained the
receive buffers from previous packets yet. Either the CPU is too busy
and can't keep up, the PCI bus is saturated and the ethernet chip
can't move packets out of its tiny internal memory fast enough, or
there is some polling problem being hit here.
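
To make the "out of room" part concrete, here's a toy sketch of a
fixed-size receive descriptor ring. This is NOT the em(4) driver - the
ring size, names and rates are all made up for the example - it just
shows how a NIC that fills descriptors faster than the host drains
them ends up bumping a no-buffers counter:

#include <stdio.h>

#define RING_SIZE 8		/* real rings are hundreds of descriptors */

static int ring_used;		/* descriptors holding unprocessed packets */
static int rx_no_buffers;	/* what the hardware counter would count */

/* "hardware" side: a frame arrives off the wire */
static void
nic_receive_frame(void)
{
	if (ring_used == RING_SIZE) {
		/* no free descriptor: the frame is dropped on the chip */
		rx_no_buffers++;
		return;
	}
	ring_used++;
}

/* "host" side: the kernel (interrupt or polling) drains descriptors */
static void
host_poll(int budget)
{
	while (budget-- > 0 && ring_used > 0)
		ring_used--;
}

int
main(void)
{
	int tick;

	/* 3 frames arrive per tick, but the host only drains 2 per tick */
	for (tick = 0; tick < 100; tick++) {
		nic_receive_frame();
		nic_receive_frame();
		nic_receive_frame();
		host_poll(2);
	}
	printf("dropped for lack of buffers: %d\n", rx_no_buffers);
	return (0);
}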
If I had to bet, I think what's happening is that you've got a
bottleneck somewhere (ipfw rules, not enough CPU, too much PCI
activity, or you're in a 32/33 PCI slot) and the box can't keep up
with what you're asking of it. Turning polling on just exposes
different symptoms of that than having polling off. Polling may be
raising your overall speed enough that, instead of packets backing up
in the kernel and capping you at XXmbps, you now get to XXXmbps and
run into a new symptom of the same bottleneck.

> Forgot to ask - do you have fastforwarding enabled in your sysctl?

No. But we were running either 2.8 or 3.2GHz P4 Xeons, so we had the
CPU to burn.
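
For reference, the knobs that have come up in this thread can also be
read from userland with sysctlbyname(). A minimal sketch - it assumes
a kernel built with DEVICE_POLLING (otherwise the kern.polling.* OIDs
won't exist) and a release that still exposes
net.inet.ip.fastforwarding, so missing OIDs are simply reported rather
than treated as errors:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	/* integer sysctls mentioned in this thread */
	const char *oids[] = {
		"kern.hz",
		"kern.polling.idle_poll",
		"kern.polling.user_frac",
		"kern.polling.burst_max",
		"net.inet.ip.fastforwarding",
	};
	size_t i;

	for (i = 0; i < sizeof(oids) / sizeof(oids[0]); i++) {
		int val;
		size_t len = sizeof(val);

		if (sysctlbyname(oids[i], &val, &len, NULL, 0) == 0)
			printf("%-28s = %d\n", oids[i], val);
		else
			printf("%-28s : %s\n", oids[i], strerror(errno));
	}
	return (0);
}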