From: "David Christensen" <davidch@broadcom.com>
To: "Charles Sprickman", "YongHyeon PYUN"
Cc: "freebsd-net@freebsd.org"
Date: Fri, 8 Jul 2011 11:00:23 -0700
Subject: RE: bce packet loss
List-Id: Networking and TCP/IP with FreeBSD

> I was able to reproduce the drops in very large numbers on the internal
> network today.
> I simply scp'd some large files from 1000/FD hosts to a 100/FD host
> (i.e. scp bigfile.tgz oldhost.i:/dev/null). Immediately the 1000/FD
> hosts sending the files showed massive amounts of drops on the switch.
> This makes me suspect that this switch might be garbage in that it
> doesn't have enough buffer space to handle sending large amounts of
> traffic from the GigE ports to the FE ports without randomly dropping
> packets. Granted, I don't really understand how a "good" switch does
> this either; I would have thought TCP just took care of throttling
> itself.

If you have flow control enabled end-to-end I wouldn't expect to see
such behavior; frames should not be dropped. If you're seeing drops at
the switch, then I'd suspect that the traffic source connected to that
switch doesn't honor flow control. Check whether either the switch or
the traffic source keeps statistics on flow control frames
generated/received.

> Bear in mind that on the external switch our port to our ISP, which is
> the destination of almost all the traffic, is 100/FD and not 1000/FD.
>
> This of course does not explain why the original setup, where I'd locked
> the switch ports and the host ports to 100/FD, showed the same behavior.
>
> I'm stumped.
>
> We are running 8.1, am I correct in that flow control is not implemented
> there? We do have an 8.2-STABLE image from a month or so ago that we
> are testing with zfs v28; might that implement flow control?

Flow control support depends on the NIC driver implementation. Older
versions of the bce(4) firmware will rarely generate pause frames
(frames would be dropped by the firmware, but statistics should show
the drops occurring), and the firmware should always honor pause frames
from the link partner when flow control is enabled.

> Although reading this:
>
> http://en.wikipedia.org/wiki/Ethernet_flow_control
>
> It sounds like flow control is not terribly optimal since it forces the
> host to block all traffic.
> Not sure if this means drops are eliminated, reduced, or just shuffled
> around.

When congestion is detected, the switch should buffer up to a certain
limit (say 80% full) and then start sending pause frames to avoid
dropping frames. This affects all hosts connecting through the switch,
so congestion at one host can spread to other hosts (see
http://www.ieee802.org/3/cm_study/public/september04/thaler_3_0904.pdf).
Small networks with a few hosts should be OK with flow control, but if
you have dozens of switches and hundreds of hosts then it's not a good
idea.

Dave
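
[Editor's note: the XOFF/XON buffering behavior Dave describes can be
sketched as a toy model. The queue size, thresholds, and rates below are
illustrative assumptions, not values from bce(4) or any real switch; the
point is only to show how pause frames trade tail drops for backpressure
on the sender.]

```python
# Toy model of a switch egress queue with IEEE 802.3x-style flow control.
# All numbers are illustrative assumptions, not real switch parameters.

QUEUE_LIMIT = 100      # frames the egress port can buffer
XOFF_THRESHOLD = 80    # ~80% full: ask the sender to pause
XON_THRESHOLD = 40     # drained enough: let the sender resume

def run(flow_control, arrivals_per_tick=3, departures_per_tick=1, ticks=200):
    """Simulate a fast sender feeding a slow port; return (drops, pause_frames)."""
    queue = 0
    paused = False
    drops = 0
    pause_frames = 0
    for _ in range(ticks):
        # Sender offers frames unless it has been paused by the switch.
        if not (flow_control and paused):
            for _ in range(arrivals_per_tick):
                if queue < QUEUE_LIMIT:
                    queue += 1
                else:
                    drops += 1          # tail drop: no flow control backpressure
        queue = max(0, queue - departures_per_tick)
        if flow_control:
            if queue >= XOFF_THRESHOLD and not paused:
                paused = True
                pause_frames += 1       # switch emits a PAUSE (XOFF) frame
            elif queue <= XON_THRESHOLD and paused:
                paused = False
                pause_frames += 1       # zero-quanta PAUSE (XON) resumes sender
    return drops, pause_frames

print("no flow control:  ", run(flow_control=False))
print("with flow control:", run(flow_control=True))
```

Without flow control the queue overflows and frames are tail-dropped;
with it, the switch pauses the sender before the buffer fills, so drops
go to zero at the cost of stalling the sender (and, on a real network,
anything else funneling through that switch).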