From owner-freebsd-net@FreeBSD.ORG Sat Jul 9 00:40:09 2011
Date: Fri, 8 Jul 2011 20:39:59 -0400 (EDT)
From: Charles Sprickman <spork@bway.net>
To: David Christensen
Cc: YongHyeon PYUN, "freebsd-net@freebsd.org", David Christensen
Subject: RE: bce packet loss

On Fri, 8 Jul 2011, David Christensen wrote:

>> I was able to reproduce the drops in very large numbers on the internal
>> network today.  I simply scp'd some large files from 1000/FD hosts to a
>> 100/FD host (i.e., scp bigfile.tgz oldhost.i:/dev/null).  The 1000/FD
>> hosts sending the files immediately showed massive numbers of drops on
>> the switch.  This makes me suspect the switch is garbage, in that it
>> doesn't have enough buffer space to move large amounts of traffic from
>> the GigE ports to the FE ports without randomly dropping packets.
>> Granted, I don't really understand how a "good" switch handles this
>> either; I would have thought TCP just took care of throttling itself.
>
> If you have flow control enabled end-to-end I wouldn't expect to see
> such behavior; frames should not be dropped.  If you're seeing drops
> at the switch then I'd suspect that the traffic source connected to
> that switch doesn't honor flow control.  Check if either the switch or
> traffic source keeps statistics on flow control frames generated/received.

I'm running 8.1, and at least on the bce hosts it looks like flow control
isn't supported; it was added on 4/30/2010:

http://svnweb.freebsd.org/base/head/sys/dev/bce/if_bce.c?r1=206268&r2=207411

In my 8.1 sources I still see this comment, which was removed in the
above commit:

/* ToDo: Enable flow control support in brgphy and bge. */

So at least on the bce hosts (and bge, it seems), I do not have flow
control available on the NIC.  The sysctl stats do show that it has
received "XON/XOFF" frames, which I assume are flow control messages,
but there's no indication that the NIC does anything with them.
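For reference, the counters I'm reading live in the bce sysctl tree;
I'm quoting the stat names from memory, so they may differ slightly
between driver versions:

[root@h15 /home/spork]# sysctl dev.bce.0 | egrep -i 'xon|xoff'
dev.bce.0.stat_XonPauseFramesReceived: ...
dev.bce.0.stat_XoffPauseFramesReceived: ...
dev.bce.0.stat_OutXonSent: ...
dev.bce.0.stat_OutXoffSent: ...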
It also looks like a major change landed on 11/14/2010 that brought flow
control into the mii(4) layer.  I believe this made it into 8.2.

On my em interfaces, I can't tell whether they support flow control or
not, and sysctl has little info.  There's a sysctl that suggests more
info is available, but I'm probably misunderstanding it, as it cannot
be changed:

[root@h15 /home/spork]# sysctl dev.em.0.stats=1
dev.em.0.stats: -1 -> -1

If anyone can clarify the flow control status of em(4) in 8.1, I'd
really appreciate it.  The source for the Intel drivers is a bit hard
to poke around in since it supports so many hardware variations.

>> Bear in mind that on the external switch our port to our ISP, which
>> is the destination of almost all the traffic, is 100/FD and not
>> 1000/FD.
>>
>> This of course does not explain why the original setup, where I'd
>> locked the switch ports and the host ports to 100/FD, showed the same
>> behavior.
>>
>> I'm stumped.
>>
>> We are running 8.1; am I correct that flow control is not implemented
>> there?  We do have an 8.2-STABLE image from a month or so ago that we
>> are testing with zfs v28; might that implement flow control?
>
> Flow control will depend on the NIC driver implementation.  Older
> versions of the bce(4) firmware will rarely generate pause frames
> (frames would be dropped by firmware, but statistics should show the
> drops occurring) and should always honor pause frames from the link
> partner when flow control is enabled.

I think my NICs probably lack it.  I am also guessing that if any
high-traffic host ignores flow control frames, that's going to hurt
other hosts as well, since the host filling the buffers won't throttle
and the overflow will continue, correct?

>> Although reading this:
>>
>> http://en.wikipedia.org/wiki/Ethernet_flow_control
>>
>> it sounds like flow control is not terribly optimal, since it forces
>> the host to block all traffic.  Not sure if this means drops are
>> eliminated, reduced, or just shuffled around.
>
> When congestion is detected the switch should buffer up to a certain
> limit (say 80% of full) and then start sending pause frames to avoid
> dropping frames.  This will affect all hosts connecting through the
> switch, so congestion at one host can spread to other hosts (see
> http://www.ieee802.org/3/cm_study/public/september04/thaler_3_0904.pdf).

Wow.  I did not catch that.  I do recall that pause frames are sent to
a reserved multicast address (01:80:C2:00:00:01), though as I understand
it they're consumed by the directly attached device rather than being
forwarded, so the pausing spreads hop by hop as buffers back up rather
than every host pausing at once.  That's... interesting, isn't it?

> Small networks with a few hosts should be OK with flow control, but
> if you have dozens of switches and hundreds of hosts then it's not a
> good idea.

We only have 2/3 of a cabinet of hosts, and we've been consolidating
that further as we replace old hardware (i.e., 3 physical hosts become
1 physical host with 3 jails).

If anyone knows whether the bce driver from 8.2 would port cleanly to
8.1, let me know.  Those are my highest-traffic hosts, and I'd like to
see what happens here once the NICs support flow control.  Another
stopgap I'm looking at is having our upstream put us on a GigE port -
I imagine that would help if the switch is running out of buffer space.

Thanks again all,

Charles
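P.S.  If the mii(4) flow control work is in your tree, my understanding
(untested here, so treat this as a sketch rather than a recipe) is that
you request flow control as a media option and then check the media
line for the result:

# ifconfig bce0 media auto mediaopt flowcontrol
# ifconfig bce0

Look for "flowcontrol" (and, if negotiation succeeded, rxpause/txpause)
among the media options.  This should only work where the driver and
PHY have been converted to the new mii flow control plumbing.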