From owner-freebsd-net@FreeBSD.ORG  Sun Oct 27 12:13:47 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 3F3CCF0
 for <freebsd-net@freebsd.org>; Sun, 27 Oct 2013 12:13:47 +0000 (UTC)
 (envelope-from eocallaghan@alterapraxis.com)
Received: from smtp.alterapraxis.com (unknown [101.164.33.212])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 8502A2563
 for <freebsd-net@freebsd.org>; Sun, 27 Oct 2013 12:13:46 +0000 (UTC)
Received: from smtp.alterapraxis.com (tony [127.0.0.1])
 by smtp.alterapraxis.com (Postfix) with ESMTP id A7948634852
 for <freebsd-net@freebsd.org>; Sun, 27 Oct 2013 23:11:19 +1100 (EST)
Received: from tinkerbell.alterapraxis.com (unknown [101.164.33.212])
 (using SSLv3 with cipher ECDHE-RSA-AES128-SHA (128/128 bits))
 (No client certificate requested)
 (Authenticated sender: eocallaghan@alterapraxis.com)
 by smtp.alterapraxis.com (Postfix) with ESMTPSA id 6212E63484A
 for <freebsd-net@freebsd.org>; Sun, 27 Oct 2013 23:11:18 +1100 (EST)
Date: Sun, 27 Oct 2013 23:13:25 +1100
From: Edward O'Callaghan <eocallaghan@alterapraxis.com>
To: freebsd-net@freebsd.org
Subject: re(4) resync. Adds preliminary support for 8168G, 8168EP, 8168GU,
 8411B and 8106EUS.
Message-ID: <20131027231325.2719b3c9.eocallaghan@alterapraxis.com>
Organization: Altera Praxis Pty Ltd
X-Mailer: Claws Mail 3.9.2 (GTK+ 2.24.22; x86_64-unknown-linux-gnu)
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=PGP-SHA512;
 boundary="Sig_/+_L+578aFh1LcEL6OaLqt2L"; protocol="application/pgp-signature"
X-Virus-Scanned: ClamAV using ClamSMTP
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Oct 2013 12:13:47 -0000

--Sig_/+_L+578aFh1LcEL6OaLqt2L
Content-Type: multipart/mixed; boundary="MP_/0wFFVR_dOtdshJmu5wgtM3."

--MP_/0wFFVR_dOtdshJmu5wgtM3.
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

Hi,

This is a follow-up. I have tested most of these NICs now and this
patch _should_ be fine to commit to HEAD. Could someone please help me
mediate this? This also fixes kern/183167. Please disregard the
patches in the PR.

Kind Regards,
Edward.

--MP_/0wFFVR_dOtdshJmu5wgtM3.
Content-Type: text/x-patch
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
 filename=0001-re-4-resync.-Adds-preliminary-support-for-8168G-8168.patch

=46rom 5357870e5d9129a3f098e48d47e34f1c40924485 Mon Sep 17 00:00:00 2001
From: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Date: Sun, 27 Oct 2013 23:03:53 +1100
Subject: [PATCH] re(4) resync. Adds preliminary support for 8168G, 8168EP,
 8168GU, 8411B and 8106EUS.
Organization: Altera Praxis Pty Ltd.

Signed-off-by: Edward O'Callaghan <eocallaghan@alterapraxis.com>
---
 sys/dev/re/if_re.c | 8 ++++++++
 sys/pci/if_rlreg.h | 6 +++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/sys/dev/re/if_re.c b/sys/dev/re/if_re.c
index 381fa87..0de569f 100644
--- a/sys/dev/re/if_re.c
+++ b/sys/dev/re/if_re.c
@@ -234,7 +234,11 @@ static const struct rl_hwrev re_hwrevs[] =3D {
 	{ RL_HWREV_8168E, RL_8169, "8168E/8111E", RL_JUMBO_MTU_9K},
 	{ RL_HWREV_8168E_VL, RL_8169, "8168E/8111E-VL", RL_JUMBO_MTU_6K},
 	{ RL_HWREV_8168F, RL_8169, "8168F/8111F", RL_JUMBO_MTU_9K},
+	{ RL_HWREV_8168G, RL_8169, "8168G/8111G", RL_JUMBO_MTU_9K},
+	{ RL_HWREV_8168EP, RL_8169, "8168EP/8111EP", RL_JUMBO_MTU_9K},
+	{ RL_HWREV_8168GU, RL_8169, "8168GU/8111GU", RL_JUMBO_MTU_9K},
 	{ RL_HWREV_8411, RL_8169, "8411", RL_JUMBO_MTU_9K},
+	{ RL_HWREV_8411B, RL_8169, "8411B", RL_JUMBO_MTU_9K},
 	{ 0, 0, NULL, 0 }
 };
=20
@@ -1451,6 +1455,7 @@ re_attach(device_t dev)
 		    RL_FLAG_DESCV2 | RL_FLAG_MACSTAT | RL_FLAG_AUTOPAD |
 		    RL_FLAG_JUMBOV2 | RL_FLAG_WAIT_TXPOLL | RL_FLAG_WOL_MANLINK;
 		break;
+	case RL_HWREV_8168GU:
 	case RL_HWREV_8168E:
 		sc->rl_flags |=3D RL_FLAG_PHYWAKE | RL_FLAG_PHYWAKE_PM |
 		    RL_FLAG_PAR | RL_FLAG_DESCV2 | RL_FLAG_MACSTAT |
@@ -1458,8 +1463,11 @@ re_attach(device_t dev)
 		    RL_FLAG_WOL_MANLINK;
 		break;
 	case RL_HWREV_8168E_VL:
+	case RL_HWREV_8168EP:
 	case RL_HWREV_8168F:
+	case RL_HWREV_8168G:
 	case RL_HWREV_8411:
+	case RL_HWREV_8411B:
 		sc->rl_flags |=3D RL_FLAG_PHYWAKE | RL_FLAG_PAR |
 		    RL_FLAG_DESCV2 | RL_FLAG_MACSTAT | RL_FLAG_CMDSTOP |
 		    RL_FLAG_AUTOPAD | RL_FLAG_JUMBOV2 |
diff --git a/sys/pci/if_rlreg.h b/sys/pci/if_rlreg.h
index 142fe48..89440e3 100644
--- a/sys/pci/if_rlreg.h
+++ b/sys/pci/if_rlreg.h
@@ -174,7 +174,7 @@
 #define	RL_HWREV_8102EL_SPIN1	0x24C00000
 #define	RL_HWREV_8168D		0x28000000
 #define	RL_HWREV_8168DP		0x28800000
-#define	RL_HWREV_8168E		0x2C000000
+#define	RL_HWREV_8168E		0x2C000000     /* 8105E */
 #define	RL_HWREV_8168E_VL	0x2C800000
 #define	RL_HWREV_8168B_SPIN1	0x30000000
 #define	RL_HWREV_8100E		0x30800000
@@ -192,6 +192,10 @@
 #define	RL_HWREV_8106E		0x44800000
 #define	RL_HWREV_8168F		0x48000000
 #define	RL_HWREV_8411		0x48800000
+#define	RL_HWREV_8168G		0x4C000000
+#define	RL_HWREV_8168EP		0x50000000
+#define	RL_HWREV_8168GU		0x50800000	/* 8106EUS */
+#define	RL_HWREV_8411B		0x5C800000
 #define	RL_HWREV_8139		0x60000000
 #define	RL_HWREV_8139A		0x70000000
 #define	RL_HWREV_8139AG		0x70800000
--=20
1.8.4.1


--MP_/0wFFVR_dOtdshJmu5wgtM3.--

--Sig_/+_L+578aFh1LcEL6OaLqt2L
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)

iQIcBAEBCgAGBQJSbQNqAAoJENeyf/ug44dtLmUP/3yHCmLJEt2EH26TtxHj+Ozq
GpIWSQu7y2kHGzq4co5EFzD0pY7R/6sUjesO9R8Cudfyp99/dud+wvpMzGI+uXpq
v90uNk2gwZp7OWuToK7d5h4zs171eshdnWZyBGmTtR7RjfZIWtNBVeOba8Bm+RNG
rg+NiSjSQdIhro3PMFToSLqoPZMGavB7G3Wd5oRCbtHaVNOC4bLBNE9ShB8IzShX
RWocmGcQIvvBO3rI27npmwQB0nwo1liLdxhsrL1dt0Px7WLPlZy4+Z12pVxQ/9VT
F9M7RpsBcSIfGKxzwZuQRL8NeUMGxHJIk5z3WNCyEJpjy4N2xF/b4rxZETpgXGyy
cUoAs4QBKvwA+g0OPhwQXVR8gkRUQ3dZWPbc30aRlZQyRONvKU7CQhENB6jlsyIS
mq+W5NAdh6kOeB3oUkcwOlwDfdR2BsJBneklIwww68VcZhfTTT91ifDUUpyEXlVo
3aozV1zdBQFlWg7mdFW42SzgEUTD+yyRwtzXx/F8F+zhrG7pM9fDrF2oWfWptKsC
Bnh+mqb8+wKpRUsFo44S7wNBNJS6LXWSEvvsZ4liUmo6GfSKKxintWiVRhQkv00T
ucNrIatOeADR4nuXEiJMnBqTFxf4prfobK3+D/KYx/dhXGOR60/RYxdQ45KHoroD
aCQcdq0/uaP+pI8V7Rtp
=FBep
-----END PGP SIGNATURE-----

--Sig_/+_L+578aFh1LcEL6OaLqt2L--

From owner-freebsd-net@FreeBSD.ORG  Mon Oct 28 02:27:28 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 5DDEC673
 for <freebsd-net@freebsd.org>; Mon, 28 Oct 2013 02:27:28 +0000 (UTC)
 (envelope-from pyunyh@gmail.com)
Received: from mail-pb0-x231.google.com (mail-pb0-x231.google.com
 [IPv6:2607:f8b0:400e:c01::231])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 346A12AF8
 for <freebsd-net@freebsd.org>; Mon, 28 Oct 2013 02:27:28 +0000 (UTC)
Received: by mail-pb0-f49.google.com with SMTP id xb4so2986996pbc.22
 for <freebsd-net@freebsd.org>; Sun, 27 Oct 2013 19:27:26 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=from:date:to:cc:subject:message-id:reply-to:references:mime-version
 :content-type:content-disposition:in-reply-to:user-agent;
 bh=R+FgeKFNk546gakYuAVJarxmq84daGBQv/PrvWJrpvc=;
 b=OPiXhWDiNm80zEWd4/98XWJqO4aFQ9fgqF1VijZGznW3DZaZJHrKVbYGqpNMaBGU94
 4zkp1b8mTnSMnKbczx31bjMp1U01cTx9k9q3ypCnqMkeUf8PULzVa17KetrqtrSzm+LQ
 y/9gY5KNzNuJl+uJdXZ+699N+xWDEq1Ltm1p0NrmCftvZGThpUGQU8s8UmFcQufQS9n8
 ytTGxakxdTIIM/XMw/k+aUxKfV3f+j0I1NGsTF4Pt0DBGPgwqg8Gp/HaOm4e8J03ypla
 +b1lpmi0PTS5fQtT9octmZ4HWBFg7cxFnkPvLdD3QaLqIrMLHw/WVzejwj+1kFhzn6HH
 OTEg==
X-Received: by 10.66.163.164 with SMTP id yj4mr23419537pab.91.1382927246133;
 Sun, 27 Oct 2013 19:27:26 -0700 (PDT)
Received: from pyunyh@gmail.com (lpe4.p59-icn.cdngp.net. [114.111.62.249])
 by mx.google.com with ESMTPSA id yh1sm24865208pbc.21.2013.10.27.19.27.23
 for <multiple recipients>
 (version=TLSv1 cipher=RC4-SHA bits=128/128);
 Sun, 27 Oct 2013 19:27:25 -0700 (PDT)
Received: by pyunyh@gmail.com (sSMTP sendmail emulation);
 Mon, 28 Oct 2013 11:27:23 +0900
From: Yonghyeon PYUN <pyunyh@gmail.com>
Date: Mon, 28 Oct 2013 11:27:23 +0900
To: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Subject: Re: re(4) resync. Adds preliminary support for 8168G, 8168EP, 8168GU,
 8411B and 8106EUS.
Message-ID: <20131028022723.GA4367@michelle.cdnetworks.com>
References: <20131027231325.2719b3c9.eocallaghan@alterapraxis.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20131027231325.2719b3c9.eocallaghan@alterapraxis.com>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-net@freebsd.org
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: pyunyh@gmail.com
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Oct 2013 02:27:28 -0000

On Sun, Oct 27, 2013 at 11:13:25PM +1100, Edward O'Callaghan wrote:
> Hi,
> 
> This is a follow-up. I have tested most of these NICs now and this
> patch _should_ be fine to commit to HEAD. Could someone please help me
> mediate this? This also fixes kern/183167. Please disregard the
> patches in the PR.
> 

I can handle this. Actually, I have been working on supporting these
newer controllers for a while. It seems that just adding the 8168GU ID
does not work. Did you test the patch on an 8168GU controller?
If yes, please let me know the OUI and model number of the PHY.
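
For reference, the OUI and model number asked about above are packed into the PHY's two MII identifier registers (registers 2 and 3). The sketch below decodes them with the same bit layout as the MII_OUI(), MII_MODEL() and MII_REV() macros in FreeBSD's <dev/mii/mii.h>. The function and macro names here are illustrative assumptions, and the sample register values in the note that follows are examples only, not values read from an 8168GU.

```c
#include <stdint.h>

/* Bit fields of MII PHY identifier register 2 (PHYIDR2, register 3). */
#define	PHY_IDR2_MODEL	0x03f0u	/* vendor model number */
#define	PHY_IDR2_REV	0x000fu	/* vendor revision number */

/* OUI bits: all 16 bits of PHYIDR1 followed by the top 6 bits of PHYIDR2. */
static uint32_t
phy_oui(uint16_t id1, uint16_t id2)
{
	return (((uint32_t)id1 << 6) | ((uint32_t)id2 >> 10));
}

/* Model number: bits 9..4 of PHYIDR2. */
static uint32_t
phy_model(uint16_t id2)
{
	return (((uint32_t)id2 & PHY_IDR2_MODEL) >> 4);
}

/* Revision number: bits 3..0 of PHYIDR2. */
static uint32_t
phy_rev(uint16_t id2)
{
	return ((uint32_t)id2 & PHY_IDR2_REV);
}
```

With example register values id1 = 0x001c and id2 = 0xc912, these helpers give OUI 0x000732, model 0x11 and revision 0x2.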

From owner-freebsd-net@FreeBSD.ORG  Mon Oct 28 05:48:49 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id BB481B22
 for <freebsd-net@freebsd.org>; Mon, 28 Oct 2013 05:48:49 +0000 (UTC)
 (envelope-from eocallaghan@alterapraxis.com)
Received: from smtp.alterapraxis.com (unknown [101.164.33.212])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 73ABB238E
 for <freebsd-net@freebsd.org>; Mon, 28 Oct 2013 05:48:49 +0000 (UTC)
Received: from smtp.alterapraxis.com (tony [127.0.0.1])
 by smtp.alterapraxis.com (Postfix) with ESMTP id 1AF67634852;
 Mon, 28 Oct 2013 16:46:27 +1100 (EST)
Received: from tinkerbell.alterapraxis.com (unknown [101.164.33.212])
 (using SSLv3 with cipher ECDHE-RSA-AES128-SHA (128/128 bits))
 (No client certificate requested)
 (Authenticated sender: eocallaghan@alterapraxis.com)
 by smtp.alterapraxis.com (Postfix) with ESMTPSA id 1F51463484A;
 Mon, 28 Oct 2013 16:46:25 +1100 (EST)
Date: Mon, 28 Oct 2013 16:48:35 +1100
From: Edward O'Callaghan <eocallaghan@alterapraxis.com>
To: pyunyh@gmail.com
Subject: Re: re(4) resync. Adds preliminary support for 8168G, 8168EP,
 8168GU, 8411B and 8106EUS.
Message-ID: <20131028164835.298646d5.eocallaghan@alterapraxis.com>
In-Reply-To: <20131028022723.GA4367@michelle.cdnetworks.com>
References: <20131027231325.2719b3c9.eocallaghan@alterapraxis.com>
 <20131028022723.GA4367@michelle.cdnetworks.com>
Organization: Altera Praxis Pty Ltd
X-Mailer: Claws Mail 3.9.2 (GTK+ 2.24.22; x86_64-unknown-linux-gnu)
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=PGP-SHA512;
 boundary="Sig_/0tWS0zraPu/45De8hAEzUAu"; protocol="application/pgp-signature"
X-Virus-Scanned: ClamAV using ClamSMTP
Cc: freebsd-net@freebsd.org
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Oct 2013 05:48:49 -0000

--Sig_/0tWS0zraPu/45De8hAEzUAu
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

On Mon, 28 Oct 2013 11:27:23 +0900
Yonghyeon PYUN <pyunyh@gmail.com> wrote:

> On Sun, Oct 27, 2013 at 11:13:25PM +1100, Edward O'Callaghan wrote:
> > Hi,
> >=20
> > This is a follow-up. I have tested most of these NICs now and this
> > patch _should_ be fine to commit to HEAD. Could someone please help
> > me mediate this? This also fixes kern/183167. Please disregard the
> > patches in the PR.
> >=20
>=20
> I can handle this. Actually, I have been working on supporting these
> newer controllers for a while. It seems that just adding the 8168GU ID
> does not work. Did you test the patch on an 8168GU controller?
> If yes, please let me know the OUI and model number of the PHY.

Hi Yonghyeon,

Many thanks! Not the 8168GU; however, I did find out that it's the same
as an 8106EUS. I don't know if this may shed some light if you have the
hw to test it. What exactly did not work on the 8168GU? What is it
doing?

My main concern is to get a board here working that has an 8168G onboard.

Kind Regards,
Edward.

--Sig_/0tWS0zraPu/45De8hAEzUAu
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)

iQIcBAEBCgAGBQJSbfq3AAoJENeyf/ug44dtJnUP/2dax3g2HyZf5EBa82EjEPRK
JsWqJGCXvBnfcwRouOKr12hdZdNwPI7kmmjDxHIc5BH66hdPbSrvtvdh0aa4daSm
BTyv2Ycdj36I7znZcWsGkeZ5NHL+iwk0o7tnRpOqp8g111/fVDLwuzMij/wULU6c
e77G1Z5V31g6t/DENh0UOBbayJ/3NJ0twgdLwoQewdbA2UYk6IhJeA6gOFGSwJC1
7TMuLO/CLnY6wUU8x7rLtGJb7HOftIjUqmYlR6rmUdSJyrmiHTBaZ5R+JxTgcJJn
S4GVvFJC96e9eV8sbsq1SjV0ExkDO43tnLh8q5b/OFTMmSMcoUANVExml3JvWPuk
vLmgCr82YTJCNHNnyjD0jmTuzZW4eqRw/WrdQ+z/spu1vmxus6HqHZpy+dBFmjoX
INniKCYmqsJenpPTPNxdxpTOyj9woR74UAzb19fXlmJ7IuobtPD181lw64rrb8+K
jdsk0h/yLk9KkDpWdb0LXS0XAfIq0Ky1jYSX6VTVxUFqKPso4pHvImtzQXrn6MCM
ma9pYbMcISgi7pAiVnXzK8AY0Xk/txwPCbHP+dnptU2QDMe1N2qkwikAigs01Ybq
cQfEtEIomHzCHqmaAk+ijHi1FmV5XssNxjyRrlFhDnokUxOf5gA4uU+96yH8ghJt
3x1J+4ndt2be3vkb95Ya
=+ZPK
-----END PGP SIGNATURE-----

--Sig_/0tWS0zraPu/45De8hAEzUAu--

From owner-freebsd-net@FreeBSD.ORG  Mon Oct 28 06:11:08 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 17D5F1D4
 for <freebsd-net@freebsd.org>; Mon, 28 Oct 2013 06:11:08 +0000 (UTC)
 (envelope-from pyunyh@gmail.com)
Received: from mail-pd0-x230.google.com (mail-pd0-x230.google.com
 [IPv6:2607:f8b0:400e:c02::230])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id E33F424B5
 for <freebsd-net@freebsd.org>; Mon, 28 Oct 2013 06:11:07 +0000 (UTC)
Received: by mail-pd0-f176.google.com with SMTP id g10so6587459pdj.35
 for <freebsd-net@freebsd.org>; Sun, 27 Oct 2013 23:11:06 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=from:date:to:cc:subject:message-id:reply-to:references:mime-version
 :content-type:content-disposition:in-reply-to:user-agent;
 bh=8/aUAwTJavH4cys1CiDzmNREE/RXpzy/akL3rOs7IEA=;
 b=vsXHlpD56uGlt4WiKLALZXfLx/IPJp7PuPWEauIjxO/9Df2gzmXqkY8NtW5BhIuw8O
 nXutdU7xWnRNn/wLz7GmdW/jlxLmgEI2mCGuLowfskAeXJ7dwf7NG1dr7ojxnvOO0gGI
 C6pINwEgMyqFND4b9M9q/5AUvHrtI+99b67n19HdHRavxSTnMS/c2uLOb58pgA11gONU
 FzTh1L7PshUYhKjn1AV0TIEv/UFynyfCDbk5vJ5tjGKlM2+19oYB6xxHzB5QarOMaxYM
 EslOfispjZCf9n9MlHS3tcj0vs4Pk/fTDGxRswg+dVIwwr9XJkke1EqEsUuAWPeBM25+
 Ud8g==
X-Received: by 10.66.149.231 with SMTP id ud7mr24221077pab.8.1382940666472;
 Sun, 27 Oct 2013 23:11:06 -0700 (PDT)
Received: from pyunyh@gmail.com (lpe4.p59-icn.cdngp.net. [114.111.62.249])
 by mx.google.com with ESMTPSA id lm2sm32345952pab.2.2013.10.27.23.11.03
 for <multiple recipients>
 (version=TLSv1 cipher=RC4-SHA bits=128/128);
 Sun, 27 Oct 2013 23:11:05 -0700 (PDT)
Received: by pyunyh@gmail.com (sSMTP sendmail emulation);
 Mon, 28 Oct 2013 15:11:00 +0900
From: Yonghyeon PYUN <pyunyh@gmail.com>
Date: Mon, 28 Oct 2013 15:11:00 +0900
To: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Subject: Re: re(4) resync. Adds preliminary support for 8168G, 8168EP, 8168GU,
 8411B and 8106EUS.
Message-ID: <20131028061100.GC1350@michelle.cdnetworks.com>
References: <20131027231325.2719b3c9.eocallaghan@alterapraxis.com>
 <20131028022723.GA4367@michelle.cdnetworks.com>
 <20131028164835.298646d5.eocallaghan@alterapraxis.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20131028164835.298646d5.eocallaghan@alterapraxis.com>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-net@freebsd.org
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: pyunyh@gmail.com
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Oct 2013 06:11:08 -0000

On Mon, Oct 28, 2013 at 04:48:35PM +1100, Edward O'Callaghan wrote:
> On Mon, 28 Oct 2013 11:27:23 +0900
> Yonghyeon PYUN <pyunyh@gmail.com> wrote:
> 
> > On Sun, Oct 27, 2013 at 11:13:25PM +1100, Edward O'Callaghan wrote:
> > > Hi,
> > > 
> > > This is a follow-up. I have tested most of these NICs now and this
> > > patch _should_ be fine to commit to HEAD. Could someone please help
> > > me mediate this? This also fixes kern/183167. Please disregard the
> > > patches in the PR.
> > > 
> > 
> > I can handle this. Actually, I have been working on supporting these
> > newer controllers for a while. It seems that just adding the 8168GU ID
> > does not work. Did you test the patch on an 8168GU controller?
> > If yes, please let me know the OUI and model number of the PHY.
> 
> Hi Yonghyeon,
> 
> Many thanks! Not the 8168GU; however, I did find out that it's the same
> as an 8106EUS. I don't know if this may shed some light if you have the
> hw to test it. What exactly did not work on the 8168GU? What is it
> doing?

Intermittent packet drops and a slightly high number of RX
interrupts.

> 
> My main concern is to get a board here working that has an 8168G onboard.
> 

Just adding the RTL8168G ID would use ukphy(4). rgephy(4) probably
should be taught to pick up the PHY, but I don't have a copy of the
data sheet. I'm testing a patched rgephy(4) at the moment, so give me
some time.
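
The point above is that the hardware revision table in re(4) and the choice of PHY driver are separate. Adding a revision ID lets re(4) attach, but the PHY is still matched by its own OUI/model pair, and a PHY that no dedicated driver recognizes ends up on the generic ukphy(4). A much simplified sketch of that fallback follows. The table, its single entry and the function name are hypothetical; real selection in FreeBSD happens through miibus(4) probe priorities, not a lookup table like this.

```c
#include <stddef.h>
#include <stdint.h>

struct phy_match {
	uint32_t	oui;	/* OUI decoded from the PHY ID registers */
	uint32_t	model;	/* model number decoded from the PHY ID registers */
	const char	*driver;	/* driver that claims this PHY */
};

/* Hypothetical list of PHYs a dedicated driver claims (values illustrative). */
static const struct phy_match known_phys[] = {
	{ 0x000732, 0x11, "rgephy" },
};

static const char *
phy_driver_for(uint32_t oui, uint32_t model)
{
	size_t i;

	for (i = 0; i < sizeof(known_phys) / sizeof(known_phys[0]); i++) {
		if (known_phys[i].oui == oui && known_phys[i].model == model)
			return (known_phys[i].driver);
	}
	/* No dedicated driver claimed the PHY: fall back to the generic one. */
	return ("ukphy");
}
```

In this sketch, teaching rgephy(4) about a new PHY corresponds to adding its OUI/model pair to known_phys; until then the lookup returns "ukphy" for it.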

> Kind Regards,
> Edward.


From owner-freebsd-net@FreeBSD.ORG  Mon Oct 28 11:06:52 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id BA650AEC
 for <freebsd-net@FreeBSD.org>; Mon, 28 Oct 2013 11:06:52 +0000 (UTC)
 (envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id A17A62475
 for <freebsd-net@FreeBSD.org>; Mon, 28 Oct 2013 11:06:52 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r9SB6qDj055167
 for <freebsd-net@FreeBSD.org>; Mon, 28 Oct 2013 11:06:52 GMT
 (envelope-from owner-bugmaster@FreeBSD.org)
Received: (from gnats@localhost)
 by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r9SB6qPn055165
 for freebsd-net@FreeBSD.org; Mon, 28 Oct 2013 11:06:52 GMT
 (envelope-from owner-bugmaster@FreeBSD.org)
Date: Mon, 28 Oct 2013 11:06:52 GMT
Message-Id: <201310281106.r9SB6qPn055165@freefall.freebsd.org>
X-Authentication-Warning: freefall.freebsd.org: gnats set sender to
 owner-bugmaster@FreeBSD.org using -f
From: FreeBSD bugmaster <bugmaster@freebsd.org>
To: freebsd-net@FreeBSD.org
Subject: Current problem reports assigned to freebsd-net@FreeBSD.org
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Oct 2013 11:06:52 -0000

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.


S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/182847  net        [netinet6] [patch] Remove dead code
o kern/182665  net        [wlan] Kernel panic when creating second wlandev.
o kern/182382  net        [tcp] sysctl to set TCP CC method on BIG ENDIAN system
o kern/182297  net        [cm] ArcNet driver fails to detect the link address - 
o kern/182212  net        [patch] [ng_mppc] ng_mppc(4) blocks on network errors 
o kern/181970  net        [re] LAN Realtek 8111G is not supported by re driver
o kern/181931  net        [vlan] [lagg] vlan over lagg over mlxen crashes the ke
o kern/181823  net        [ip6] [patch] make ipv6 mroute return same error code
o kern/181741  net        [kernel] [patch] Packet loss when 'control' messages a
o kern/181703  net        [re] [patch] Fix Realtek 8111G Ethernet controller not
o kern/181657  net        [bpf] [patch] BPF_COP/BPF_COPX instruction reservation
o kern/181257  net        [bge] bge link status change
o kern/181236  net        [igb] igb driver unstable work
o kern/181225  net        [infiniband] [patch] unloading ipoib crashes the kerne
o kern/181135  net        [netmap] [patch] sys/dev/netmap patch for Linux compat
o kern/181131  net        [netmap] [patch] sys/dev/netmap memory allocation impr
o kern/181006  net        [run] [patch] mbuf leak in run(4) driver
o kern/180893  net        [if_ethersubr] [patch] Packets received with own LLADD
o kern/180844  net        [panic] [re] Intermittent panic (re driver?)
o kern/180775  net        [bxe] if_bxe driver broken with Broadcom BCM57711 card
o kern/180722  net        [bluetooth] bluetooth takes 30-50 attempts to pair to 
s kern/180468  net        [request] LOCAL_PEERCRED support for PF_INET
o kern/180065  net        [netinet6] [patch] Multicast loopback to own host brok
o kern/179926  net        [lacp] [patch] active aggregator selection bug
o kern/179824  net        [ixgbe] System (9.1-p4) hangs on heavy ixgbe network t
o kern/179733  net        [lagg] [patch] interface loses capabilities when proto
o kern/179429  net        [tap] STP enabled tap bridge
o kern/179299  net        [igb] Intel X540-T2 - unstable driver
a kern/179264  net        [vimage] [pf] Core dump with Packet filter and VIMAGE 
o kern/178947  net        [arp] arp rejecting not working
o kern/178782  net        [ixgbe] 82599EB SFP does not work with passthrough und
o kern/178612  net        [run] kernel panic due the problems with run driver
o kern/178472  net        [ip6] [patch] make return code consistent with IPv4 co
o kern/178079  net        [tcp] Switching TCP CC algorithm panics on sparc64 wit
s kern/178071  net        FreeBSD unable to recognize Kontron (Industrial Comput
o kern/177905  net        [xl] [panic] ifmedia_set when plugging CardBus LAN card
o kern/177618  net        [bridge] Problem with bridge firewall with trunk ports
o kern/177417  net        [ip6] Invalid protocol value in ipsec6_common_input_cb
o kern/177402  net        [igb] [pf] problem with ethernet driver igb + pf / alt
o kern/177400  net        [jme] JMC25x 1000baseT establishment issues
o kern/177366  net        [ieee80211] negative malloc(9) statistics for 80211nod
f kern/177362  net        [netinet] [patch] Wrong control used to return TOS
o kern/177194  net        [netgraph] Unnamed netgraph nodes for vlan interfaces
o kern/177184  net        [bge] [patch] enable wake on lan
o kern/177139  net        [igb] igb drops ethernet ports 2 and 3
o kern/176884  net        [re] re0 flapping up/down
o kern/176671  net        [epair] MAC address for epair device not unique
o kern/176484  net        [ipsec] [enc] [patch] panic: IPsec + enc(4); device na
o kern/176446  net        [netinet] [patch] Concurrency in ixgbe driving out-of-
o kern/176420  net        [kernel] [patch] incorrect errno for LOCAL_PEERCRED
o kern/176419  net        [kernel] [patch] socketpair support for LOCAL_PEERCRED
o kern/176401  net        [netgraph] page fault  in netgraph
o kern/176167  net        [ipsec][lagg] using lagg and ipsec causes immediate pa
o kern/176027  net        [em] [patch] flow control sysctl consistency for em dr
o kern/176026  net        [tcp] [patch] TCP wrappers caused quite a lot of warni
o kern/175864  net        [re] Intel MB D510MO, onboard ethernet not working aft
o kern/175852  net        [amd64] [patch] in_cksum_hdr() behaves differently on 
o kern/175734  net        no ethernet detected on system with EG20T PCH chipset 
o kern/175267  net        [pf] [tap] pf + tap keep state problem
o kern/175236  net        [epair] [gif] epair and gif Devices On Bridge
o kern/175182  net        [panic] kernel panic on RADIX_MPATH when deleting rout
o kern/175153  net        [tcp] will there miss a FIN when do TSO?
o kern/174959  net        [net] [patch] rnh_walktree_from visits spurious nodes
o kern/174958  net        [net] [patch] rnh_walktree_from makes unreasonable ass
o kern/174897  net        [route] Interface routes are broken
o kern/174851  net        [bxe] [patch] UDP checksum offload is wrong in bxe dri
o kern/174850  net        [bxe] [patch] bxe driver does not receive multicasts
o kern/174849  net        [bxe] [patch] bxe driver can hang kernel when reset
o kern/174822  net        [tcp] Page fault in tcp_discardcb under high traffic
o kern/174602  net        [gif] [ipsec] traceroute issue on gif tunnel with ipse
o kern/174535  net        [tcp] TCP fast retransmit feature works strange
o kern/173871  net        [gif] process of 'ifconfig gif0 create hangs' when if_
o kern/173475  net        [tun] tun(4) stays opened by PID after process is term
o kern/173201  net        [ixgbe] [patch] Missing / broken ixgbe sysctl's and tu
o kern/173137  net        [em] em(4) unable to run at gigabit with 9.1-RC2
o kern/173002  net        [patch] data type size problem in if_spppsubr.c
o kern/172895  net        [ixgb] [ixgbe] do not properly determine link-state
o kern/172683  net        [ip6] Duplicate IPv6 Link Local Addresses
o kern/172675  net        [netinet] [patch] sysctl_tcp_hc_list (net.inet.tcp.hos
p kern/172113  net        [panic] [e1000] [patch] 9.1-RC1/amd64 panics in igb(4
o kern/171840  net        [ip6] IPv6 packets transmitting only on queue 0
o kern/171739  net        [bce] [panic] bce related kernel panic
o kern/171711  net        [dummynet] [panic] Kernel panic in dummynet
o kern/171532  net        [ndis] ndis(4) driver includes 'pccard'-specific code,
o kern/171531  net        [ndis] undocumented dependency for ndis(4)
o kern/171524  net        [ipmi] ipmi driver crashes kernel by reboot or shutdow
s kern/171508  net        [epair] [request] Add the ability to name epair device
o kern/171228  net        [re] [patch] if_re - eeprom write issues
o kern/170701  net        [ppp] kill ppp or reboot with active ppp connection c
o kern/170267  net        [ixgbe] IXGBE_LE32_TO_CPUS is probably an unintentiona
o kern/170081  net        [fxp] pf/nat/jails not working if checksum offloading 
o kern/169898  net        ifconfig(8) fails to set MTU on multiple interfaces.
o kern/169676  net        [bge] [hang] system hangs, fully or partially after re
o kern/169620  net        [ng] [pf] ng_l2tp incoming packet bypass pf firewall
o kern/169459  net        [ppp] umodem/ppp/3g stopped working after update from 
o kern/169438  net        [ipsec] ipv4-in-ipv6 tunnel mode IPsec does not work
p kern/168294  net        [ixgbe] [patch] ixgbe driver compiled in kernel has no
o kern/168246  net        [em] Multiple em(4) not working with qemu
o kern/168245  net        [arp] [regression] Permanent ARP entry not deleted on 
o kern/168244  net        [arp] [regression] Unable to manually remove permanent
o kern/168183  net        [bce] bce driver hang system
o kern/167603  net        [ip] IP fragment reassembly's broken: file transfer ov
o kern/167500  net        [em] [panic] Kernel panics in em driver
o kern/167325  net        [netinet] [patch] sosend sometimes return EINVAL with 
o kern/167202  net        [igmp]: Sending multiple IGMP packets crashes kernel
o kern/166462  net        [gre] gre(4) when using a tunnel source address from c
o kern/166285  net        [arp] FreeBSD v8.1 REL p8 arp: unknown hardware addres
o kern/166255  net        [net] [patch] It should be possible to disable "promis
p kern/165903  net        mbuf leak
o kern/165622  net        [ndis][panic][patch] Unregistered use of FPU in kernel
s kern/165562  net        [request] add support for Intel i350 in FreeBSD 7.4
o kern/165526  net        [bxe] UDP packets checksum calculation within if_bxe 
o kern/165488  net        [ppp] [panic] Fatal trap 12 jails and ppp , kernel wit
o kern/165305  net        [ip6] [request] Feature parity between IP_TOS and IPV6
o kern/165296  net        [vlan] [patch] Fix EVL_APPLY_VLID, update EVL_APPLY_PR
o kern/165181  net        [igb] igb freezes after about 2 weeks of uptime
o kern/165174  net        [patch] [tap] allow tap(4) to keep its address on clos
o kern/165152  net        [ip6] Does not work through the issue of ipv6 addresse
o kern/164495  net        [igb] connect double head igb to switch cause system t
o kern/164490  net        [pfil] Incorrect IP checksum on pfil pass from ip_outp
o kern/164475  net        [gre] gre misses RUNNING flag after a reboot
o kern/164265  net        [netinet] [patch] tcp_lro_rx computes wrong checksum i
o kern/163903  net        [igb] "igb0:tx(0)","bpf interface lock" v2.2.5 9-STABL
o kern/163481  net        freebsd do not add itself to ping route packet
o kern/162927  net        [tun] Modem-PPP error ppp[1538]: tun0: Phase: Clearing
o kern/162558  net        [dummynet] [panic] seldom dummynet panics
o kern/162153  net        [em] intel em driver 7.2.4 don't compile
o kern/162110  net        [igb] [panic] RELENG_9 panics on boot in IGB driver - 
o kern/162028  net        [ixgbe] [patch] misplaced #endif in ixgbe.c
o kern/161277  net        [em] [patch] BMC cannot receive IPMI traffic after loa
o kern/160873  net        [igb] igb(4) from HEAD fails to build on 7-STABLE
o kern/160750  net        Intel PRO/1000 connection breaks under load until rebo
o kern/160693  net        [gif] [em] Multicast packet are not passed from GIF0 t
o kern/160293  net        [ieee80211] [panic] kernel panic during network setup 
o kern/160206  net        [gif] gifX stops working after a while (IPv6 tunnel)
o kern/159817  net        [udp] write UDPv4: No buffer space available (code=55)
o kern/159629  net        [ipsec] [panic] kernel panic with IPsec in transport m
o kern/159621  net        [tcp] [panic] panic: soabort: so_count
o kern/159603  net        [netinet] [patch] in_ifscrubprefix() - network route c
o kern/159601  net        [netinet] [patch] in_scrubprefix() - loopback route re
o kern/159294  net        [em] em watchdog timeouts
o kern/159203  net        [wpi] Intel 3945ABG Wireless LAN not support IBSS
o kern/158930  net        [bpf] BPF element leak in ifp->bpf_if->bif_dlist
o kern/158726  net        [ip6] [patch] ICMPv6 Router Announcement flooding limi
o kern/158694  net        [ix] [lagg] ix0 is not working within lagg(4)
o kern/158665  net        [ip6] [panic] kernel pagefault in in6_setscope()
o kern/158635  net        [em] TSO breaks BPF packet captures with em driver
f kern/157802  net        [dummynet] [panic] kernel panic in dummynet
o kern/157785  net        amd64 + jail + ipfw + natd = very slow outbound traffi
o kern/157418  net        [em] em driver lockup during boot on Supermicro X9SCM-
o kern/157410  net        [ip6] IPv6 Router Advertisements Cause Excessive CPU U
o kern/157287  net        [re] [panic] INVARIANTS panic (Memory modified after f
o kern/157200  net        [network.subr] [patch] stf(4) can not communicate betw
o kern/157182  net        [lagg] lagg interface not working together with epair 
o kern/156877  net        [dummynet] [panic] dummynet move_pkt() null ptr derefe
o kern/156667  net        [em] em0 fails to init on CURRENT after March 17
o kern/156408  net        [vlan] Routing failure when using VLANs vs. Physical e
o kern/156328  net        [icmp]: host can ping other subnet but no have IP from
o kern/156317  net        [ip6] Wrong order of IPv6 NS DAD/MLD Report
o kern/156283  net        [ip6] [patch] nd6_ns_input - rtalloc_mpath does not re
o kern/156279  net        [if_bridge][divert][ipfw] unable to correctly re-injec
o kern/156226  net        [lagg]: failover does not announce the failover to swi
o kern/156030  net        [ip6] [panic] Crash in nd6_dad_start() due to null ptr
o kern/155680  net        [multicast] problems with multicast
s kern/155642  net        [new driver] [request] Add driver for Realtek RTL8191S
o kern/155597  net        [panic] Kernel panics with "sbdrop" message
o kern/155420  net        [vlan] adding vlan break existent vlan
o kern/155177  net        [route] [panic] Panic when inject routes in kernel
o kern/155010  net        [msk] ntfs-3g via iscsi using msk driver cause kernel 
o kern/154943  net        [gif] ifconfig gifX create on existing gifX clears IP
s kern/154851  net        [new driver] [request]: Port brcm80211 driver from Lin
o kern/154850  net        [netgraph] [patch] ng_ether fails to name nodes when t
o kern/154679  net        [em] Fatal trap 12: "em1 taskq" only at startup (8.1-R
o kern/154600  net        [tcp] [panic] Random kernel panics on tcp_output
o kern/154557  net        [tcp] Freeze tcp-session of the clients, if in the gat
o kern/154443  net        [if_bridge] Kernel module bridgestp.ko missing after u
o kern/154286  net        [netgraph] [panic] 8.2-PRERELEASE panic in netgraph
o kern/154255  net        [nfs] NFS not responding
o kern/154214  net        [stf] [panic] Panic when creating stf interface
o kern/154185  net        race condition in mb_dupcl
p kern/154169  net        [multicast] [ip6] Node Information Query multicast add
o kern/154134  net        [ip6] stuck kernel state in LISTEN on ipv6 daemon whic
o kern/154091  net        [netgraph] [panic] netgraph, unaligned mbuf?
o conf/154062  net        [vlan] [patch] change to way of auto-generatation of v
o kern/153937  net        [ral] ralink panics the system (amd64 freeBSDD 8.X) wh
o kern/153936  net        [ixgbe] [patch] MPRC workaround incorrectly applied to
o kern/153816  net        [ixgbe] ixgbe doesn't work properly with the Intel 10g
o kern/153772  net        [ixgbe] [patch] sysctls reference wrong XON/XOFF varia
o kern/153497  net        [netgraph] netgraph panic due to race conditions
o kern/153454  net        [patch] [wlan] [urtw] Support ad-hoc and hostap modes 
o kern/153308  net        [em] em interface use 100% cpu
o kern/153244  net        [em] em(4) fails to send UDP to port 0xffff
o kern/152893  net        [netgraph] [panic] 8.2-PRERELEASE panic in netgraph
o kern/152853  net        [em] tftpd (and likely other udp traffic) fails over e
o kern/152828  net        [em] poor performance on 8.1, 8.2-PRE
o kern/152569  net        [net]: Multiple ppp connections and routing table prob
o kern/152235  net        [arp] Permanent local ARP entries are not properly upd
o kern/152141  net        [vlan] [patch] encapsulate vlan in ng_ether before out
o kern/152036  net        [libc] getifaddrs(3) returns truncated sockaddrs for n
o kern/151690  net        [ep] network connectivity won't work until dhclient is
o kern/151681  net        [nfs] NFS mount via IPv6 leads to hang on client with 
o kern/151593  net        [igb] [panic] Kernel panic when bringing up igb networ
o kern/150920  net        [ixgbe][igb] Panic when packets are dropped with heade
o kern/150557  net        [igb] igb0: Watchdog timeout -- resetting
o kern/150251  net        [patch] [ixgbe] Late cable insertion broken
o kern/150249  net        [ixgbe] Media type detection broken
o bin/150224   net        ppp(8) does not reassign static IP after kill -KILL co
f kern/149969  net        [wlan] [ral] ralink rt2661 fails to maintain connectio
o kern/149643  net        [rum] device not sending proper beacon frames in ap mo
o kern/149609  net        [panic] reboot after adding second default route
o kern/149117  net        [inet] [patch] in_pcbbind: redundant test
o kern/149086  net        [multicast] Generic multicast join failure in 8.1
o kern/148018  net        [flowtable] flowtable crashes on ia64
o kern/147912  net        [boot] FreeBSD 8 Beta won't boot on Thinkpad i1300  11
o kern/147894  net        [ipsec] IPv6-in-IPv4 does not work inside an ESP-only 
o kern/147155  net        [ip6] setfb not work with ipv6
o kern/146845  net        [libc] close(2) returns error 54 (connection reset by 
f kern/146792  net        [flowtable] flowcleaner 100% cpu's core load
o kern/146719  net        [pf] [panic] PF or dumynet kernel panic
o kern/146534  net        [icmp6] wrong source address in echo reply
o kern/146427  net        [mwl] Additional virtual access points don't work on m
f kern/146394  net        [vlan] IP source address for outgoing connections
o bin/146377   net        [ppp] [tun] Interface doesn't clear addresses when PPP
o kern/146358  net        [vlan] wrong destination MAC address
o kern/146165  net        [wlan] [panic] Setting bssid in adhoc mode causes pani
o kern/146082  net        [ng_l2tp] a false invaliant check was performed in ng_
o kern/146037  net        [panic] mpd + CoA = kernel panic
o kern/145825  net        [panic] panic: soabort: so_count
o kern/145728  net        [lagg] Stops working lagg between two servers.
p kern/145600  net        TCP/ECN behaves different to CE/CWR than ns2 reference
f kern/144917  net        [flowtable] [panic] flowtable crashes system [regressi
o kern/144882  net        MacBookPro =>4.1 does not connect to BSD in hostap wit
o kern/144874  net        [if_bridge] [patch] if_bridge frees mbuf after pfil ho
o conf/144700  net        [rc.d] async dhclient breaks stuff for too many people
o kern/144616  net        [nat] [panic] ip_nat panic FreeBSD 7.2
f kern/144315  net        [ipfw] [panic] freebsd 8-stable reboot after add ipfw 
o kern/144231  net        bind/connect/sendto too strict about sockaddr length
o kern/143846  net        [gif] bringing gif3 tunnel down causes gif0 tunnel to 
s kern/143673  net        [stf] [request] there should be a way to support multi
s kern/143666  net        [ip6] [request] PMTU black hole detection not implemen
o kern/143622  net        [pfil] [patch] unlock pfil lock while calling firewall
o kern/143593  net        [ipsec] When using IPSec, tcpdump doesn't show outgoin
o kern/143591  net        [ral] RT2561C-based DLink card (DWL-510) fails to work
o kern/143208  net        [ipsec] [gif] IPSec over gif interface not working
o kern/143034  net        [panic] system reboots itself in tcp code [regression]
o kern/142877  net        [hang] network-related repeatable 8.0-STABLE hard hang
o kern/142774  net        Problem with outgoing connections on interface with mu
o kern/142772  net        [libc] lla_lookup: new lle malloc failed
f kern/142518  net        [em] [lagg] Problem on 8.0-STABLE with em and lagg
o kern/142018  net        [iwi] [patch] Possibly wrong interpretation of beacon-
o kern/141861  net        [wi] data garbled with WEP and wi(4) with Prism 2.5
f kern/141741  net        Etherlink III NIC won't work after upgrade to FBSD 8, 
o kern/140742  net        rum(4) Two asus-WL167G adapters cannot talk to each ot
o kern/140682  net        [netgraph] [panic] random panic in netgraph
f kern/140634  net        [vlan] destroying if_lagg interface with if_vlan membe
o kern/140619  net        [ifnet] [patch] refine obsolete if_var.h comments desc
o kern/140346  net        [wlan] High bandwidth use causes loss of wlan connecti
o kern/140142  net        [ip6] [panic] FreeBSD 7.2-amd64 panic w/IPv6
o kern/140066  net        [bwi] install report for 8.0 RC 2 (multiple problems)
o kern/139387  net        [ipsec] Wrong lenth of PF_KEY messages in promiscuous 
o bin/139346   net        [patch] arp(8) add option to remove static entries lis
o kern/139268  net        [if_bridge] [patch] allow if_bridge to forward just VL
p kern/139204  net        [arp] DHCP server replies rejected, ARP entry lost bef
o kern/139117  net        [lagg] + wlan boot timing (EBUSY)
o kern/138850  net        [dummynet] dummynet doesn't work correctly on a bridge
o kern/138782  net        [panic] sbflush_internal: cc 0 || mb 0xffffff004127b00
o kern/138688  net        [rum] possibly broken on 8 Beta 4 amd64: able to wpa a
o kern/138678  net        [lo] FreeBSD does not assign linklocal address to loop
o kern/138407  net        [gre] gre(4) interface does not come up after reboot
o kern/138332  net        [tun] [lor] ifconfig tun0 destroy causes LOR if_adata/
o kern/138266  net        [panic] kernel panic when udp benchmark test used as r
f kern/138029  net        [bpf] [panic] periodically kernel panic and reboot
o kern/137881  net        [netgraph] [panic] ng_pppoe fatal trap 12
p bin/137841   net        [patch] wpa_supplicant(8) cannot verify SHA256 signed 
p kern/137776  net        [rum] panic in rum(4) driver on 8.0-BETA2
o bin/137641   net        ifconfig(8): various problems with "vlan_device.vlan_i
o kern/137392  net        [ip] [panic] crash in ip_nat.c line 2577
o kern/137372  net        [ral] FreeBSD doesn't support wireless interface from 
o kern/137089  net        [lagg] lagg falsely triggers IPv6 duplicate address de
o kern/136911  net        [netgraph] [panic] system panic on kldload ng_bpf.ko t
o kern/136618  net        [pf][stf] panic on cloning interface without unit numb
o kern/135502  net        [periodic] Warning message raised by rtfree function i
o kern/134583  net        [hang] Machine with jail freezes after random amount o
o kern/134531  net        [route] [panic] kernel crash related to routes/zebra
o kern/134157  net        [dummynet] dummynet loads cpu for 100% and make a syst
o kern/133969  net        [dummynet] [panic] Fatal trap 12: page fault while in 
o kern/133968  net        [dummynet] [panic] dummynet kernel panic
o kern/133736  net        [udp] ip_id not protected ...
o kern/133595  net        [panic] Kernel Panic at pcpu.h:195
o kern/133572  net        [ppp] [hang] incoming PPTP connection hangs the system
o kern/133490  net        [bpf] [panic] 'kmem_map too small' panic on Dell r900 
o kern/133235  net        [netinet] [patch] Process SIOCDLIFADDR command incorre
f kern/133213  net        arp and sshd errors on 7.1-PRERELEASE
o kern/133060  net        [ipsec] [pfsync] [panic] Kernel panic with ipsec + pfs
o kern/132889  net        [ndis] [panic] NDIS kernel crash on load BCM4321 AGN d
o conf/132851  net        [patch] rc.conf(5): allow to setfib(1) for service run
o kern/132734  net        [ifmib] [panic] panic in net/if_mib.c
o kern/132705  net        [libwrap] [patch] libwrap - infinite loop if hosts.all
o kern/132672  net        [ndis] [panic] ndis with rt2860.sys causes kernel pani
o kern/132354  net        [nat] Getting some packages to ipnat(8) causes crash
o kern/132277  net        [crypto] [ipsec] poor performance using cryptodevice f
o kern/131781  net        [ndis] ndis keeps dropping the link
o kern/131776  net        [wi] driver fails to init
o kern/131753  net        [altq] [panic] kernel panic in hfsc_dequeue
o bin/131365   net        route(8): route add changes interpretation of network 
f kern/130820  net        [ndis] wpa_supplicant(8) returns 'no space on device'
o kern/130628  net        [nfs] NFS / rpc.lockd deadlock on 7.1-R
o kern/130525  net        [ndis] [panic] 64 bit ar5008 ndisgen-erated driver cau
o kern/130311  net        [wlan_xauth] [panic] hostapd restart causing kernel pa
o kern/130109  net        [ipfw] Can not set fib for packets originated from loc
f kern/130059  net        [panic] Leaking 50k mbufs/hour
f kern/129719  net        [nfs] [panic] Panic during shutdown, tcp_ctloutput: in
o kern/129517  net        [ipsec] [panic] double fault / stack overflow
f kern/129508  net        [carp] [panic] Kernel panic with EtherIP (may be relat
o kern/129219  net        [ppp] Kernel panic when using kernel mode ppp
o kern/129197  net        [panic] 7.0 IP stack related panic
o bin/128954   net        ifconfig(8) deletes valid routes
o bin/128602   net        [an] wpa_supplicant(8) crashes with an(4)
o kern/128448  net        [nfs] 6.4-RC1 Boot Fails if NFS Hostname cannot be res
o bin/128295   net        [patch] ifconfig(8) does not print TOE4 or TOE6 capabi
o bin/128001   net        wpa_supplicant(8), wlan(4), and wi(4) issues
o kern/127826  net        [iwi] iwi0 driver has reduced performance and connecti
o kern/127815  net        [gif] [patch] if_gif does not set vlan attributes from
o kern/127724  net        [rtalloc] rtfree: 0xc5a8f870 has 1 refs
f bin/127719   net        [arp] arp: Segmentation fault (core dumped)
f kern/127528  net        [icmp]: icmp socket receives icmp replies not owned by
p kern/127360  net        [socket] TOE socket options missing from sosetopt()
o bin/127192   net        routed(8) removes the secondary alias IP of interface 
f kern/127145  net        [wi]: prism (wi) driver crash at bigger traffic
o kern/126895  net        [patch] [ral] Add antenna selection (marked as TBD)
o kern/126874  net        [vlan]: Zebra problem if ifconfig vlanX destroy
o kern/126695  net        rtfree messages and network disruption upon use of if_
o kern/126339  net        [ipw] ipw driver drops the connection
o kern/126075  net        [inet] [patch] internet control accesses beyond end of
o bin/125922   net        [patch] Deadlock in arp(8)
o kern/125920  net        [arp] Kernel Routing Table loses Ethernet Link status 
o kern/125845  net        [netinet] [patch] tcp_lro_rx() should make use of hard
o kern/125258  net        [socket] socket's SO_REUSEADDR option does not work
o kern/125239  net        [gre] kernel crash when using gre
o kern/124341  net        [ral] promiscuous mode for wireless device ral0 looses
o kern/124225  net        [ndis] [patch] ndis network driver sometimes loses net
o kern/124160  net        [libc] connect(2) function loops indefinitely
o kern/124021  net        [ip6] [panic] page fault in nd6_output()
o kern/123968  net        [rum] [panic] rum driver causes kernel panic with WPA.
o kern/123892  net        [tap] [patch] No buffer space available
o kern/123890  net        [ppp] [panic] crash & reboot on work with PPP low-spee
o kern/123858  net        [stf] [patch] stf not usable behind a NAT
o kern/123758  net        [panic] panic while restarting net/freenet6
o bin/123633   net        ifconfig(8) doesn't set inet and ether address in one 
o kern/123559  net        [iwi] iwi periodically disassociates/associates [regre
o bin/123465   net        [ip6] route(8): route add -inet6 <ipv6_addr> -interfac
o kern/123463  net        [ipsec] [panic] repeatable crash related to ipsec-tool
o conf/123330  net        [nsswitch.conf] Enabling samba wins in nsswitch.conf c
o kern/123160  net        [ip] Panic and reboot at sysctl kern.polling.enable=0
o kern/122989  net        [swi] [panic] 6.3 kernel panic in swi1: net
o kern/122954  net        [lagg] IPv6 EUI64 incorrectly chosen for lagg devices
f kern/122780  net        [lagg] tcpdump on lagg interface during high pps wedge
o kern/122685  net        It is not visible passing packets in tcpdump(1)
o kern/122319  net        [wi] imposible to enable ad-hoc demo mode with Orinoco
o kern/122290  net        [netgraph] [panic] Netgraph related "kmem_map too smal
o kern/122252  net        [ipmi] [bge] IPMI problem with BCM5704 (does not work 
o kern/122033  net        [ral] [lor] Lock order reversal in ral0 at bootup ieee
o bin/121895   net        [patch] rtsol(8)/rtsold(8) doesn't handle managed netw
s kern/121774  net        [swi] [panic] 6.3 kernel panic in swi1: net
o kern/121555  net        [panic] Fatal trap 12: current process = 12 (swi1: net
o kern/121534  net        [ipl] [nat] FreeBSD Release 6.3 Kernel Trap 12:
o kern/121443  net        [gif] [lor] icmp6_input/nd6_lookup
o kern/121437  net        [vlan] Routing to layer-2 address does not work on VLA
o bin/121359   net        [patch] [security] ppp(8): fix local stack overflow in
o kern/121257  net        [tcp] TSO + natd  -> slow outgoing tcp traffic
o kern/121181  net        [panic] Fatal trap 3: breakpoint instruction fault whi
o kern/120966  net        [rum] kernel panic with if_rum and WPA encryption
o kern/120566  net        [request]: ifconfig(8) make order of arguments more fr
o kern/120304  net        [netgraph] [patch] netgraph source assumes 32-bit time
o kern/120266  net        [udp] [panic] gnugk causes kernel panic when closing U
o bin/120060   net        routed(8) deletes link-level routes in the presence of
o kern/119945  net        [rum] [panic] rum device in hostap mode, cause kernel 
o kern/119791  net        [nfs] UDP NFS mount of aliased IP addresses from a Sol
o kern/119617  net        [nfs] nfs error on wpa network when reseting/shutdown
f kern/119516  net        [ip6] [panic] _mtx_lock_sleep: recursed on non-recursi
o kern/119432  net        [arp] route add -host <host> -iface <nic> causes arp e
o kern/119225  net        [wi] 7.0-RC1 no carrier with Prism 2.5 wifi card [regr
o kern/118727  net        [netgraph] [patch] [request] add new ng_pf module
o kern/117423  net        [vlan] Duplicate IP on different interfaces
o bin/117339   net        [patch] route(8): loading routing management commands 
o bin/116643   net        [patch] [request] fstat(1): add INET/INET6 socket deta
o kern/116185  net        [iwi] if_iwi driver leads system to reboot
o kern/115239  net        [ipnat] panic with 'kmem_map too small' using ipnat
o kern/115019  net        [netgraph] ng_ether upper hook packet flow stops on ad
o kern/115002  net        [wi] if_wi timeout. failed allocation (busy bit). ifco
o kern/114915  net        [patch] [pcn] pcn (sys/pci/if_pcn.c) ethernet driver f
o kern/113432  net        [ucom] WARNING: attempt to net_add_domain(netgraph) af
o kern/112722  net        [ipsec] [udp] IP v4 udp fragmented packet reject
o kern/112686  net        [patm] patm driver freezes System (FreeBSD 6.2-p4) i38
o bin/112557   net        [patch] ppp(8) lock file should not use symlink name
o kern/112528  net        [nfs] NFS over TCP under load hangs with "impossible p
o kern/111537  net        [inet6] [patch] ip6_input() treats mbuf cluster wrong
o kern/111457  net        [ral] ral(4) freeze
o kern/110284  net        [if_ethersubr] Invalid Assumption in SIOCSIFADDR in et
o kern/110249  net        [kernel] [regression] [patch] setsockopt() error regre
o kern/109470  net        [wi] Orinoco Classic Gold PC Card Can't Channel Hop
o bin/108895   net        pppd(8): PPPoE dead connections on 6.2 [regression]
f kern/108197  net        [panic] [gif] [ip6] if_delmulti reference counting pan
o kern/107944  net        [wi] [patch] Forget to unlock mutex-locks
o conf/107035  net        [patch] bridge(8): bridge interface given in rc.conf n
o kern/106444  net        [netgraph] [panic] Kernel Panic on Binding to an ip to
o kern/106316  net        [dummynet] dummynet with multipass ipfw drops packets 
o kern/105945  net        Address can disappear from network interface
s kern/105943  net        Network stack may modify read-only mbuf chain copies
o bin/105925   net        problems with ifconfig(8) and vlan(4) [regression]
o kern/104851  net        [inet6] [patch] On link routes not configured when usi
o kern/104751  net        [netgraph] kernel panic, when getting info about my tr
o kern/104738  net        [inet] [patch] Reentrant problem with inet_ntoa in the
o kern/103191  net        Unpredictable reboot
o kern/103135  net        [ipsec] ipsec with ipfw divert (not NAT) encodes a pac
o kern/102540  net        [netgraph] [patch] supporting vlan(4) by ng_fec(4)
o conf/102502  net        [netgraph] [patch] ifconfig name does't rename netgrap
o kern/102035  net        [plip] plip networking disables parallel port printing
o kern/100709  net        [libc] getaddrinfo(3) should return TTL info
o kern/100519  net        [netisr] suggestion to fix suboptimal network polling
o kern/98597   net        [inet6] Bug in FreeBSD 6.1 IPv6 link-local DAD procedu
o bin/98218    net        wpa_supplicant(8) blacklist not working
o kern/97306   net        [netgraph] NG_L2TP locks after connection with failed 
o conf/97014   net        [gif] gifconfig_gif? in rc.conf does not recognize IPv
f kern/96268   net        [socket] TCP socket performance drops by 3000% if pack
o kern/95519   net        [ral] ral0 could not map mbuf
o kern/95288   net        [pppd] [tty] [panic] if_ppp panic in sys/kern/tty_subr
o kern/95277   net        [netinet] [patch] IP Encapsulation mask_match() return
o kern/95267   net        packet drops periodically appear
f kern/93378   net        [tcp] Slow data transfer in Postfix and Cyrus IMAP (wo
o kern/93019   net        [ppp] ppp and tunX problems: no traffic after restarti
o kern/92880   net        [libc] [patch] almost rewritten inet_network(3) functi
s kern/92279   net        [dc] Core faults everytime I reboot, possible NIC issu
o kern/91859   net        [ndis] if_ndis does not work with Asus WL-138
o kern/91364   net        [ral] [wep] WF-511 RT2500 Card PCI and WEP
o kern/91311   net        [aue] aue interface hanging
o kern/87421   net        [netgraph] [panic]: ng_ether + ng_eiface + if_bridge
o kern/86871   net        [tcp] [patch] allocation logic for PCBs in TIME_WAIT s
o kern/86427   net        [lor] Deadlock with FASTIPSEC and nat
o kern/85780   net        'panic: bogus refcnt 0' in routing/ipv6
o bin/85445    net        ifconfig(8): deprecated keyword to ifconfig inoperativ
o bin/82975    net        route change does not parse classfull network as given
o kern/82881   net        [netgraph] [panic] ng_fec(4) causes kernel panic after
o kern/82468   net        Using 64MB tcp send/recv buffers, trafficflow stops, i
o bin/82185    net        [patch] ndp(8) can delete the incorrect entry
o kern/81095   net        IPsec connection stops working if associated network i
o kern/78968   net        FreeBSD freezes on mbufs exhaustion (network interface
o kern/78090   net        [ipf] ipf filtering on bridged packets doesn't work if
o kern/77341   net        [ip6] problems with IPV6 implementation
o kern/75873   net        Usability problem with non-RFC-compliant IP spoof prot
s kern/75407   net        [an] an(4): no carrier after short time
a kern/71474   net        [route] route lookup does not skip interfaces marked d
o kern/71469   net        default route to internet magically disappears with mu
o kern/68889   net        [panic] m_copym, length > size of mbuf chain
o kern/66225   net        [netgraph] [patch] extend ng_eiface(4) control message
o kern/65616   net        IPSEC can't detunnel GRE packets after real ESP encryp
s kern/60293   net        [patch] FreeBSD arp poison patch
a kern/56233   net        IPsec tunnel (ESP) over IPv6: MTU computation is wrong
s bin/41647    net        ifconfig(8) doesn't accept lladdr along with inet addr
o kern/39937   net        ipstealth issue
a kern/38554   net        [patch] changing interface ipaddress doesn't seem to w
o kern/31940   net        ip queue length too short for >500kpps
o kern/31647   net        [libc] socket calls can return undocumented EINVAL
o kern/30186   net        [libc] getaddrinfo(3) does not handle incorrect servna
f kern/24959   net        [patch] proper TCP_NOPUSH/TCP_CORK compatibility
o conf/23063   net        [arp] [patch] for static ARP tables in rc.network
o kern/21998   net        [socket] [patch] ident only for outgoing connections
o kern/5877    net        [socket] sb_cc counts control data as well as data dat

468 problems total.


From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 06:03:49 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 4D295FF7
 for <freebsd-net@freebsd.org>; Tue, 29 Oct 2013 06:03:49 +0000 (UTC)
 (envelope-from pyunyh@gmail.com)
Received: from mail-oa0-x234.google.com (mail-oa0-x234.google.com
 [IPv6:2607:f8b0:4003:c02::234])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 127732B9B
 for <freebsd-net@freebsd.org>; Tue, 29 Oct 2013 06:03:49 +0000 (UTC)
Received: by mail-oa0-f52.google.com with SMTP id j1so1203171oag.39
 for <freebsd-net@freebsd.org>; Mon, 28 Oct 2013 23:03:47 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=from:date:to:cc:subject:message-id:reply-to:references:mime-version
 :content-type:content-disposition:in-reply-to:user-agent;
 bh=sNGIKSsL8Iil63MEJ62EtFhTZwPvvVBMYYir8pib/xo=;
 b=Ut5DhMJQf+K7SSErHk5kMQC2orjbTMAqIMZyyGFpoTOxrA5+V8acZrcFKnU3Hrm+i3
 Ja9csrFwAQN0ivHg1f3RCCOSvI2Zx1RvC2OHbgPDTzWQy2C4cEmMJg+fUqwohbQZjROw
 7leqkzsCkqrA3Z3i+rXQl5QV5tqM1Afmyul4vINQXjMV2ZKR4TvBU6JOUh1xoEiK0Hq3
 60f2nyp75RK1Ti78AgYZHa7w4UTFuXLeywbG1nz8iRvn8ZUABb+3WZPtDTYCMJXmW5BM
 QIqHc4kdLqQ2YhbrXsiDtJYZP1E6CZYhcbxmbgl7B1PgTvjldHwPmBrzeYHVZxeKIAYr
 rrMQ==
X-Received: by 10.182.66.164 with SMTP id g4mr4807666obt.47.1383026627524;
 Mon, 28 Oct 2013 23:03:47 -0700 (PDT)
Received: from pyunyh@gmail.com (lpe4.p59-icn.cdngp.net. [114.111.62.249])
 by mx.google.com with ESMTPSA id xx9sm32857193obc.6.2013.10.28.23.03.44
 for <multiple recipients>
 (version=TLSv1 cipher=RC4-SHA bits=128/128);
 Mon, 28 Oct 2013 23:03:46 -0700 (PDT)
Received: by pyunyh@gmail.com (sSMTP sendmail emulation);
 Tue, 29 Oct 2013 15:03:40 +0900
From: Yonghyeon PYUN <pyunyh@gmail.com>
Date: Tue, 29 Oct 2013 15:03:40 +0900
To: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Subject: Re: re(4) resync. Adds preliminary support for 8168G, 8168EP, 8168GU,
 8411B and 8106EUS.
Message-ID: <20131029060340.GA1390@michelle.cdnetworks.com>
References: <20131027231325.2719b3c9.eocallaghan@alterapraxis.com>
 <20131028022723.GA4367@michelle.cdnetworks.com>
 <20131028164835.298646d5.eocallaghan@alterapraxis.com>
 <20131028061100.GC1350@michelle.cdnetworks.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20131028061100.GC1350@michelle.cdnetworks.com>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-net@freebsd.org
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: pyunyh@gmail.com
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 06:03:49 -0000

On Mon, Oct 28, 2013 at 03:11:00PM +0900, Yonghyeon PYUN wrote:
> On Mon, Oct 28, 2013 at 04:48:35PM +1100, Edward O'Callaghan wrote:
> > On Mon, 28 Oct 2013 11:27:23 +0900
> > Yonghyeon PYUN <pyunyh@gmail.com> wrote:
> > 
> > > On Sun, Oct 27, 2013 at 11:13:25PM +1100, Edward O'Callaghan wrote:
> > > > Hi,
> > > > 
> > > > This is a follow-up. I have tested most of these NICs now and this
> > > > patch _should_ be fine to commit to HEAD. Could someone please help
> > > > me mediate this? This also fixes kern/183167. Please disregard the
> > > > patches in the PR.
> > > > 
> > > 
> > > I can handle this. Actually I had been working on supporting these
> > > newer controllers for a while. It seems just adding 8168GU id does
> > > not work. Did you test the patch on 8168GU controller?
> > > If yes, please let me know the OUI id and model number of the PHY.
> > 
> > Hi Yonghyeon,
> > 
> > Many thanks! Not the 8168GU; however, I did find out that it's the same
> > as an 8106EUS. I don't know if this may shed some light if you have the
> > hw to test it. What exactly did not work about the 8168GU? What is it
> > doing?
> 
> Intermittent packet drops and slightly high number of RX
> interrupts.
> 
> > 
> > My main concern is to get a board here working that has a 8168G onboard.
> > 
> 
> Just adding the RTL8168G id would use ukphy(4). Probably rgephy(4)
> should be taught to pick up the PHY but I don't have a copy of the data
> sheet. I'm testing a patched rgephy(4) at the moment so give me some
> time.
> 

FYI: Committed in r257304-257306.
These commits do not address the high number of RX interrupts, though.

From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 10:51:20 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id BF1A174B
 for <net@freebsd.org>; Tue, 29 Oct 2013 10:51:20 +0000 (UTC)
 (envelope-from rrs@lakerest.net)
Received: from lakerest.net (lakerest.net [162.235.35.161])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id A20A32B92
 for <net@freebsd.org>; Tue, 29 Oct 2013 10:51:19 +0000 (UTC)
Received: from [10.1.1.103] (bsd4.lakerest.net [162.235.35.162])
 (authenticated bits=0)
 by lakerest.net (8.14.4/8.14.3) with ESMTP id r9TAouW2068631
 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NOT)
 for <net@freebsd.org>; Tue, 29 Oct 2013 06:50:56 -0400 (EDT)
 (envelope-from rrs@lakerest.net)
From: Randall Stewart <rrs@lakerest.net>
Content-Type: multipart/mixed;
 boundary="Apple-Mail=_49C5FDEE-E4BA-44F6-8F6A-342853C67ED4"
Subject: MQ Patch.
Date: Tue, 29 Oct 2013 06:50:56 -0400
Message-Id: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
To: net@freebsd.org
Mime-Version: 1.0 (Apple Message framework v1283)
X-Mailer: Apple Mail (2.1283)
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 10:51:20 -0000


--Apple-Mail=_49C5FDEE-E4BA-44F6-8F6A-342853C67ED4
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=windows-1252

Hi:

As discussed at vBSDcon with andre/emaste and gnn, I am sending
this patch out to all of you ;-)

I have previously sent it to gnn, andre, jhb, rwatson, and several others
of the usual suspects (as gnn put it) and received dead silence.

What does this patch do?

Well, it adds the ability to do multi-queue at the driver level.
Basically, any driver that uses the new interface gets N software queues
(default is 8) under each physical transmit ring it has. The driver
services queue 0 first, then queue 1, and so on up to the max.
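
As a rough illustration of the queue selection described above, here is
a minimal userspace sketch in C. It is not the code in net/drbr.c. The
mq_* names, MQ_DEPTH and the fixed-size arrays are hypothetical. Only
the "lowest-numbered queue first" rule and the qused out-parameter
(seen in the drbr_peek/drbr_advance/drbr_putback calls in the attached
patch) are taken from this mail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MQ_NQUEUES 8    /* default queue count named in the mail */
#define MQ_DEPTH   16   /* hypothetical per-queue depth */

struct mq_queue {
	void     *slot[MQ_DEPTH];
	unsigned  head;         /* index of the next entry to hand out */
	unsigned  count;        /* entries currently queued */
};

struct mq_ring {
	struct mq_queue q[MQ_NQUEUES];
};

/* Append pkt to software queue qnum; -1 if the queue is full or bad. */
static int
mq_enqueue(struct mq_ring *r, uint8_t qnum, void *pkt)
{
	struct mq_queue *q;

	if (qnum >= MQ_NQUEUES)
		return (-1);
	q = &r->q[qnum];
	if (q->count == MQ_DEPTH)
		return (-1);
	q->slot[(q->head + q->count) % MQ_DEPTH] = pkt;
	q->count++;
	return (0);
}

/*
 * Return the head of the lowest-numbered non-empty queue without
 * removing it, and report which queue it came from in *qused.
 */
static void *
mq_peek(struct mq_ring *r, uint8_t *qused)
{
	uint8_t i;

	for (i = 0; i < MQ_NQUEUES; i++) {
		if (r->q[i].count > 0) {
			*qused = i;
			return (r->q[i].slot[r->q[i].head]);
		}
	}
	return (NULL);
}

/* Consume the entry that mq_peek() last returned for queue qused. */
static void
mq_advance(struct mq_ring *r, uint8_t qused)
{
	struct mq_queue *q = &r->q[qused];

	assert(q->count > 0);
	q->head = (q->head + 1) % MQ_DEPTH;
	q->count--;
}

/* Store a (possibly modified) packet back at the head of queue qused. */
static void
mq_putback(struct mq_ring *r, uint8_t qused, void *pkt)
{
	assert(r->q[qused].count > 0);
	r->q[qused].slot[r->q[qused].head] = pkt;
}
```

A transmit loop in this model would call mq_peek(), try to post the
packet, and then call either mq_advance() on success or mq_putback()
when the hardware ring is full. Because mq_peek() always scans from
queue 0, a packet placed on queue 0 goes out before anything waiting on
queues 1 and up.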

This allows you to prioritize packets. Also in here is the start of some
work I will be doing for AQM.. think either Pi or Codel ;-)

Right now that's pretty simple: it is just the ability (in a few drivers)
to limit the amount of data on the ring, which can help reduce
bufferbloat. That needs to be refined into a lot more.
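
The ring byte limit can be pictured with a small userspace model in C.
This is only a sketch under assumptions: struct tx_model, tx_start()
and tx_complete() are hypothetical names. The fields ring_bytes_max and
bytes_on_ring, and the check-after-post ordering, follow what the em
and igb hunks of the attached patch appear to do. The atomics and
locking used in the drivers are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct tx_model {
	uint64_t ring_bytes_max;  /* 0 means "no limit" */
	uint64_t bytes_on_ring;   /* posted but not yet completed */
};

/*
 * Post packets to the (imaginary) hardware ring until the list runs
 * out or the byte limit is reached. Returns the number posted.
 */
static size_t
tx_start(struct tx_model *tx, const uint32_t *pktlen, size_t npkts)
{
	size_t sent = 0;

	while (sent < npkts) {
		tx->bytes_on_ring += pktlen[sent];
		sent++;
		/* The limit is tested after posting, so one packet may overshoot. */
		if (tx->ring_bytes_max != 0 &&
		    tx->bytes_on_ring >= tx->ring_bytes_max)
			break;
	}
	return (sent);
}

/* Transmit completion: the hardware is done with a packet of len bytes. */
static void
tx_complete(struct tx_model *tx, uint32_t len)
{
	assert(tx->bytes_on_ring >= len);
	tx->bytes_on_ring -= len;
}
```

With ring_bytes_max set to 3000 and four 1500-byte packets waiting,
tx_start() posts two packets and stops. The rest stay in the software
queues until tx_complete() has released some bytes, which keeps the
standing queue inside the hardware ring short.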

This work is donated by Adara Networks and has been discussed in several
of the past vendor summits.

I plan on committing this before the IETF unless I hear major =
objections.

Please have a look ;-)

Best wishes

R


--Apple-Mail=_49C5FDEE-E4BA-44F6-8F6A-342853C67ED4
Content-Disposition: attachment;
	filename=patch_mq.txt
Content-Type: text/plain;
	x-unix-mode=0644;
	name="patch_mq.txt"
Content-Transfer-Encoding: quoted-printable

Index: sys/conf/files
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/conf/files	(revision 257322)
+++ sys/conf/files	(working copy)
@@ -3062,6 +3062,7 @@ net/bridgestp.c			optional bridge =
| if_bridge
 net/flowtable.c			optional flowtable inet | =
flowtable inet6
 net/ieee8023ad_lacp.c		optional lagg
 net/if.c			standard
+net/drbr.c			standard
 net/if_arcsubr.c		optional arcnet
 net/if_atmsubr.c		optional atm
 net/if_bridge.c			optional bridge inet | if_bridge =
inet
Index: sys/dev/bxe/bxe.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/bxe/bxe.c	(revision 257322)
+++ sys/dev/bxe/bxe.c	(working copy)
@@ -5935,10 +5935,11 @@ bxe_tx_mq_start_locked(struct bxe_softc    *sc,
                        struct bxe_fastpath *fp,
                        struct mbuf         *m)
 {
-    struct buf_ring *tx_br =3D fp->tx_br;
+    struct drbr_ring *tx_br =3D fp->tx_br;
     struct mbuf *next;
     int depth, rc, tx_count;
     uint16_t tx_bd_avail;
+    uint8_t qused;
=20
     rc =3D tx_count =3D 0;
=20
@@ -5955,25 +5956,16 @@ bxe_tx_mq_start_locked(struct bxe_softc    *sc,
=20
     BXE_FP_TX_LOCK_ASSERT(fp);
=20
-    if (m =3D=3D NULL) {
-        /* no new work, check for pending frames */
-        next =3D drbr_dequeue(ifp, tx_br);
-    } else if (drbr_needs_enqueue(ifp, tx_br)) {
-        /* have both new and pending work, maintain packet order */
-        rc =3D drbr_enqueue(ifp, tx_br, m);
-        if (rc !=3D 0) {
-            fp->eth_q_stats.tx_soft_errors++;
-            goto bxe_tx_mq_start_locked_exit;
-        }
-        next =3D drbr_dequeue(ifp, tx_br);
-    } else {
-        /* new work only and nothing pending */
-        next =3D m;
+    if (m !=3D NULL) {
+	    rc =3D drbr_enqueue(ifp, tx_br, m);
+	    if (rc !=3D 0) {
+		    fp->eth_q_stats.tx_soft_errors++;
+		    goto bxe_tx_mq_start_locked_exit;
+	    }
     }
=20
     /* keep adding entries while there are frames to send */
-    while (next !=3D NULL) {
-
+    while ((next =3D drbr_peek(ifp, fp->tx_br, &qused)) !=3D NULL) {
         /* the mbuf now belongs to us */
         fp->eth_q_stats.mbuf_alloc_tx++;
=20
@@ -5985,19 +5977,22 @@ bxe_tx_mq_start_locked(struct bxe_softc    *sc,
         rc =3D bxe_tx_encap(fp, &next);
         if (__predict_false(rc !=3D 0)) {
             fp->eth_q_stats.tx_encap_failures++;
-            if (next !=3D NULL) {
-                /* mark the TX queue as full and save the frame */
-                ifp->if_drv_flags |=3D IFF_DRV_OACTIVE;
-                /* XXX this may reorder the frame */
-                rc =3D drbr_enqueue(ifp, tx_br, next);
-                fp->eth_q_stats.mbuf_alloc_tx--;
-                fp->eth_q_stats.tx_frames_deferred++;
-            }
-
+	    if (next =3D=3D NULL) {
+		    drbr_advance(ifp, fp->tx_br, qused);
+	    } else {
+		    drbr_putback(ifp, fp->tx_br, next, qused);
+		    /*
+		     * Mark the TX queue as full and save
+		     * the frame.
+		     */
+		    ifp->if_drv_flags |=3D IFF_DRV_OACTIVE;
+		    fp->eth_q_stats.mbuf_alloc_tx--;
+		    fp->eth_q_stats.tx_frames_deferred++;
+	    }
             /* stop looking for more work */
             break;
         }
-
+	drbr_advance(ifp, fp->tx_br, qused);
         /* the transmit frame was enqueued successfully */
         tx_count++;
=20
@@ -6078,7 +6073,6 @@ bxe_mq_flush(struct ifnet *ifp)
 {
     struct bxe_softc *sc =3D ifp->if_softc;
     struct bxe_fastpath *fp;
-    struct mbuf *m;
     int i;
=20
     for (i =3D 0; i < sc->num_queues; i++) {
@@ -6093,9 +6087,7 @@ bxe_mq_flush(struct ifnet *ifp)
         if (fp->tx_br !=3D NULL) {
             BLOGD(sc, DBG_LOAD, "Clearing fp[%02d] buf_ring\n", =
fp->index);
             BXE_FP_TX_LOCK(fp);
-            while ((m =3D buf_ring_dequeue_sc(fp->tx_br)) !=3D NULL) {
-                m_freem(m);
-            }
+	    drbr_flush(ifp, fp->tx_br);
             BXE_FP_TX_UNLOCK(fp);
         }
     }
@@ -6496,12 +6488,9 @@ bxe_free_fp_buffers(struct bxe_softc *sc)
=20
 #if __FreeBSD_version >=3D 800000
         if (fp->tx_br !=3D NULL) {
-            struct mbuf *m;
             /* just in case bxe_mq_flush() wasn't called */
-            while ((m =3D buf_ring_dequeue_sc(fp->tx_br)) !=3D NULL) {
-                m_freem(m);
-            }
-            buf_ring_free(fp->tx_br, M_DEVBUF);
+	    drbr_flush(sc->ifnet, fp->tx_br);
+            drbr_free(fp->tx_br, M_DEVBUF);
             fp->tx_br =3D NULL;
         }
 #endif
@@ -6762,8 +6751,7 @@ bxe_alloc_fp_buffers(struct bxe_softc *sc)
         fp =3D &sc->fp[i];
=20
 #if __FreeBSD_version >=3D 800000
-        fp->tx_br =3D buf_ring_alloc(BXE_BR_SIZE, M_DEVBUF,
-                                   M_DONTWAIT, &fp->tx_mtx);
+        fp->tx_br =3D drbr_alloc(M_DEVBUF, M_DONTWAIT, &fp->tx_mtx);
         if (fp->tx_br =3D=3D NULL) {
             BLOGE(sc, "buf_ring alloc fail for fp[%02d]\n", i);
             goto bxe_alloc_fp_buffers_error;
Index: sys/dev/bxe/bxe.h
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/bxe/bxe.h	(revision 257322)
+++ sys/dev/bxe/bxe.h	(working copy)
@@ -69,6 +69,7 @@ __FBSDID("$FreeBSD$");
 #include <net/if_vlan_var.h>
 #include <net/zlib.h>
 #include <net/bpf.h>
+#include <net/drbr.h>
=20
 #include <netinet/in.h>
 #include <netinet/ip.h>
@@ -734,7 +735,7 @@ struct bxe_fastpath {
=20
 #if __FreeBSD_version >=3D 800000
 #define BXE_BR_SIZE 4096
-    struct buf_ring *tx_br;
+    struct drbr_ring *tx_br;
 #endif
 }; /* struct bxe_fastpath */
=20
Index: sys/dev/cesa/cesa.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/cesa/cesa.c	(revision 257322)
+++ sys/dev/cesa/cesa.c	(working copy)
@@ -995,11 +995,17 @@ cesa_attach(device_t dev)
 	sc->sc_dev =3D dev;
=20
 	/* Check if CESA peripheral device has power turned on */
+#if defined(SOC_MV_KIRKWOOD)
+	if (soc_power_ctrl_get(CPU_PM_CTRL_CRYPTO) =3D=3D =
CPU_PM_CTRL_CRYPTO) {
+		device_printf(dev, "not powered on\n");
+		return (ENXIO);
+	}
+#else
 	if (soc_power_ctrl_get(CPU_PM_CTRL_CRYPTO) !=3D =
CPU_PM_CTRL_CRYPTO) {
 		device_printf(dev, "not powered on\n");
 		return (ENXIO);
 	}
-
+#endif
 	soc_id(&d, &r);
=20
 	switch (d) {
Index: sys/dev/cxgb/cxgb_adapter.h
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/cxgb/cxgb_adapter.h	(revision 257322)
+++ sys/dev/cxgb/cxgb_adapter.h	(working copy)
@@ -252,7 +252,7 @@ struct sge_txq {
 	bus_dma_tag_t   entry_tag;
 	struct mbuf_head sendq;
=20
-	struct buf_ring *txq_mr;
+	struct drbr_ring *txq_mr;
 	struct ifaltq	*txq_ifq;
 	struct callout	txq_timer;
 	struct callout	txq_watchdog;
Index: sys/dev/cxgb/cxgb_main.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/cxgb/cxgb_main.c	(revision 257322)
+++ sys/dev/cxgb/cxgb_main.c	(working copy)
@@ -66,6 +66,7 @@ __FBSDID("$FreeBSD$");
 #include <net/if_media.h>
 #include <net/if_types.h>
 #include <net/if_vlan_var.h>
+#include <net/drbr.h>
=20
 #include <netinet/in_systm.h>
 #include <netinet/in.h>
@@ -2361,7 +2362,7 @@ cxgb_tick_handler(void *arg, int count)
=20
 		drops =3D 0;
 		for (j =3D pi->first_qset; j < pi->first_qset + =
pi->nqsets; j++)
-			drops +=3D =
sc->sge.qs[j].txq[TXQ_ETH].txq_mr->br_drops;
+			drops +=3D =
drbr_get_dropcnt(sc->sge.qs[j].txq[TXQ_ETH].txq_mr);
 		ifp->if_snd.ifq_drops =3D drops;
=20
 		ifp->if_oerrors =3D
Index: sys/dev/cxgb/cxgb_sge.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/cxgb/cxgb_sge.c	(revision 257322)
+++ sys/dev/cxgb/cxgb_sge.c	(working copy)
@@ -61,6 +61,7 @@ __FBSDID("$FreeBSD$");
 #include <net/bpf.h>=09
 #include <net/ethernet.h>
 #include <net/if_vlan_var.h>
+#include <net/drbr.h>
=20
 #include <netinet/in_systm.h>
 #include <netinet/in.h>
@@ -1684,7 +1685,7 @@ cxgb_transmit_locked(struct ifnet *ifp, struct sge
 {
 	struct port_info *pi =3D qs->port;
 	struct sge_txq *txq =3D &qs->txq[TXQ_ETH];
-	struct buf_ring *br =3D txq->txq_mr;
+	struct drbr_ring *br =3D txq->txq_mr;
 	int error, avail;
=20
 	avail =3D txq->size - txq->in_use;
@@ -1980,7 +1981,7 @@ t3_free_qset(adapter_t *sc, struct sge_qset *q)
 =09
 	reclaim_completed_tx(q, 0, TXQ_ETH);
 	if (q->txq[TXQ_ETH].txq_mr !=3D NULL)=20
-		buf_ring_free(q->txq[TXQ_ETH].txq_mr, M_DEVBUF);
+		drbr_free(q->txq[TXQ_ETH].txq_mr, M_DEVBUF);
 	if (q->txq[TXQ_ETH].txq_ifq !=3D NULL) {
 		ifq_delete(q->txq[TXQ_ETH].txq_ifq);
 		free(q->txq[TXQ_ETH].txq_ifq, M_DEVBUF);
@@ -2430,8 +2431,8 @@ t3_sge_alloc_qset(adapter_t *sc, u_int id, int npo
 	q->port =3D pi;
 	q->adap =3D sc;
=20
-	if ((q->txq[TXQ_ETH].txq_mr =3D =
buf_ring_alloc(cxgb_txq_buf_ring_size,
-	    M_DEVBUF, M_WAITOK, &q->lock)) =3D=3D NULL) {
+	if ((q->txq[TXQ_ETH].txq_mr =3D drbr_alloc(M_DEVBUF, M_WAITOK,=20=

+	    &q->lock)) =3D=3D NULL) {
 		device_printf(sc->dev, "failed to allocate mbuf =
ring\n");
 		goto err;
 	}
@@ -3523,9 +3524,9 @@ t3_add_configured_sysctls(adapter_t *sc)
 			    CTLTYPE_STRING | CTLFLAG_RD, &qs->rspq,
 			    0, t3_dump_rspq, "A", "dump of the response =
queue");
=20
-			SYSCTL_ADD_UQUAD(ctx, txqpoidlist, OID_AUTO, =
"dropped",
+/* RRS FIXME    	SYSCTL_ADD_UQUAD(ctx, txqpoidlist, OID_AUTO, =
"dropped",
 			    CTLFLAG_RD, =
&qs->txq[TXQ_ETH].txq_mr->br_drops,
-			    "#tunneled packets dropped");
+			    "#tunneled packets dropped");*/
 			SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, =
"sendqlen",
 			    CTLFLAG_RD, &qs->txq[TXQ_ETH].sendq.qlen,
 			    0, "#tunneled packets waiting to be sent");
Index: sys/dev/cxgbe/adapter.h
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/cxgbe/adapter.h	(revision 257322)
+++ sys/dev/cxgbe/adapter.h	(working copy)
@@ -419,7 +419,7 @@ struct sge_txq {
=20
 	struct ifnet *ifp;	/* the interface this txq belongs to */
 	bus_dma_tag_t tx_tag;	/* tag for transmit buffers */
-	struct buf_ring *br;	/* tx buffer ring */
+	struct drbr_ring *br;	/* tx buffer ring */
 	struct tx_sdesc *sdesc;	/* KVA of software descriptor ring */
 	struct mbuf *m;		/* held up due to temporary resource =
shortage */
=20
Index: sys/dev/cxgbe/t4_main.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/cxgbe/t4_main.c	(revision 257322)
+++ sys/dev/cxgbe/t4_main.c	(working copy)
@@ -54,6 +54,7 @@ __FBSDID("$FreeBSD$");
 #include <net/if.h>
 #include <net/if_types.h>
 #include <net/if_dl.h>
+#include <net/drbr.h>
 #include <net/if_vlan_var.h>
 #if defined(__i386__) || defined(__amd64__)
 #include <vm/vm.h>
@@ -1254,7 +1255,7 @@ cxgbe_transmit(struct ifnet *ifp, struct mbuf *m)
 	struct port_info *pi =3D ifp->if_softc;
 	struct adapter *sc =3D pi->adapter;
 	struct sge_txq *txq =3D &sc->sge.txq[pi->first_txq];
-	struct buf_ring *br;
+	struct drbr_ring *br;
 	int rc;
=20
 	M_ASSERTPKTHDR(m);
@@ -1295,7 +1296,7 @@ cxgbe_transmit(struct ifnet *ifp, struct mbuf *m)
 	 */
=20
 	TXQ_LOCK_ASSERT_OWNED(txq);
-	if (drbr_needs_enqueue(ifp, br) || txq->m) {
+	if (txq->m) {
=20
 		/* Queued for transmission. */
=20
@@ -1321,7 +1322,6 @@ cxgbe_qflush(struct ifnet *ifp)
 	struct port_info *pi =3D ifp->if_softc;
 	struct sge_txq *txq;
 	int i;
-	struct mbuf *m;
=20
 	/* queues do not exist if !PORT_INIT_DONE. */
 	if (pi->flags & PORT_INIT_DONE) {
@@ -1329,8 +1329,7 @@ cxgbe_qflush(struct ifnet *ifp)
 			TXQ_LOCK(txq);
 			m_freem(txq->m);
 			txq->m =3D NULL;
-			while ((m =3D buf_ring_dequeue_sc(txq->br)) !=3D =
NULL)
-				m_freem(m);
+			drbr_flush(ifp, txq->br);
 			TXQ_UNLOCK(txq);
 		}
 	}
@@ -4042,7 +4041,7 @@ cxgbe_tick(void *arg)
=20
 	drops =3D s->tx_drop;
 	for_each_txq(pi, i, txq)
-		drops +=3D txq->br->br_drops;
+		drops +=3D drbr_get_dropcnt(txq->br);
 	ifp->if_snd.ifq_drops =3D drops;
=20
 	ifp->if_oerrors =3D s->tx_error_frames;
@@ -6493,7 +6492,7 @@ sysctl_wcwr_stats(SYSCTL_HANDLER_ARGS)
 static inline void
 txq_start(struct ifnet *ifp, struct sge_txq *txq)
 {
-	struct buf_ring *br;
+	struct drbr_ring *br;
 	struct mbuf *m;
=20
 	TXQ_LOCK_ASSERT_OWNED(txq);
@@ -7509,7 +7508,6 @@ t4_ioctl(struct cdev *dev, unsigned long cmd, cadd
 				txq->txpkt_wrs =3D 0;
 				txq->txpkts_wrs =3D 0;
 				txq->txpkts_pkts =3D 0;
-				txq->br->br_drops =3D 0;
 				txq->no_dmamap =3D 0;
 				txq->no_desc =3D 0;
 			}
Index: sys/dev/cxgbe/t4_sge.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/cxgbe/t4_sge.c	(revision 257322)
+++ sys/dev/cxgbe/t4_sge.c	(working copy)
@@ -47,6 +47,7 @@ __FBSDID("$FreeBSD$");
 #include <net/ethernet.h>
 #include <net/if.h>
 #include <net/if_vlan_var.h>
+#include <net/drbr.h>
 #include <netinet/in.h>
 #include <netinet/ip.h>
 #include <netinet/ip6.h>
@@ -1844,9 +1845,10 @@ t4_eth_tx(struct ifnet *ifp, struct sge_txq *txq,
 	struct port_info *pi =3D (void *)ifp->if_softc;
 	struct adapter *sc =3D pi->adapter;
 	struct sge_eq *eq =3D &txq->eq;
-	struct buf_ring *br =3D txq->br;
+	struct drbr_ring *br =3D txq->br;
 	struct mbuf *next;
 	int rc, coalescing, can_reclaim;
+	uint8_t qused;
 	struct txpkts txpkts;
 	struct sgl sgl;
=20
@@ -1873,8 +1875,7 @@ t4_eth_tx(struct ifnet *ifp, struct sge_txq *txq,
=20
 	if (__predict_false(eq->flags & EQ_DOOMED)) {
 		m_freem(m);
-		while ((m =3D buf_ring_dequeue_sc(txq->br)) !=3D NULL)
-			m_freem(m);
+		drbr_flush(ifp, br);
 		return (ENETDOWN);
 	}
=20
@@ -1889,7 +1890,7 @@ t4_eth_tx(struct ifnet *ifp, struct sge_txq *txq,
 		next =3D m->m_nextpkt;
 		m->m_nextpkt =3D NULL;
=20
-		if (next || buf_ring_peek(br))
+		if (next || drbr_peek(ifp, br, &qused))
 			coalescing =3D 1;
=20
 		rc =3D get_pkt_sgl(txq, &m, &sgl, coalescing);
@@ -2936,7 +2937,7 @@ alloc_txq(struct port_info *pi, struct sge_txq *tx
=20
 	txq->sdesc =3D malloc(eq->cap * sizeof(struct tx_sdesc), =
M_CXGBE,
 	    M_ZERO | M_WAITOK);
-	txq->br =3D buf_ring_alloc(eq->qsize, M_CXGBE, M_WAITOK, =
&eq->eq_lock);
+	txq->br =3D drbr_alloc(M_CXGBE, M_WAITOK, &eq->eq_lock);
=20
 	rc =3D bus_dma_tag_create(sc->dmat, 1, 0, BUS_SPACE_MAXADDR,
 	    BUS_SPACE_MAXADDR, NULL, NULL, 64 * 1024, TX_SGL_SEGS,
@@ -2991,8 +2992,8 @@ alloc_txq(struct port_info *pi, struct sge_txq *tx
 	SYSCTL_ADD_UQUAD(&pi->ctx, children, OID_AUTO, "txpkts_pkts", =
CTLFLAG_RD,
 	    &txq->txpkts_pkts, "# of frames tx'd using txpkts work =
requests");
=20
-	SYSCTL_ADD_UQUAD(&pi->ctx, children, OID_AUTO, "br_drops", =
CTLFLAG_RD,
-	    &txq->br->br_drops, "# of drops in the buf_ring for this =
queue");
+/*	SYSCTL_ADD_UQUAD(&pi->ctx, children, OID_AUTO, "br_drops", =
CTLFLAG_RD,
+	&txq->br->br_drops, "# of drops in the buf_ring for this =
queue");*/
 	SYSCTL_ADD_UINT(&pi->ctx, children, OID_AUTO, "no_dmamap", =
CTLFLAG_RD,
 	    &txq->no_dmamap, 0, "# of times txq ran out of DMA maps");
 	SYSCTL_ADD_UINT(&pi->ctx, children, OID_AUTO, "no_desc", =
CTLFLAG_RD,
@@ -3021,7 +3022,7 @@ free_txq(struct port_info *pi, struct sge_txq *txq
 	if (txq->txmaps.maps)
 		t4_free_tx_maps(&txq->txmaps, txq->tx_tag);
=20
-	buf_ring_free(txq->br, M_CXGBE);
+	drbr_free(txq->br, M_CXGBE);
=20
 	if (txq->tx_tag)
 		bus_dma_tag_destroy(txq->tx_tag);
Index: sys/dev/e1000/if_em.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/e1000/if_em.c	(revision 257322)
+++ sys/dev/e1000/if_em.c	(working copy)
@@ -67,6 +67,7 @@
 #include <net/if_arp.h>
 #include <net/if_dl.h>
 #include <net/if_media.h>
+#include <net/drbr.h>
=20
 #include <net/if_types.h>
 #include <net/if_vlan_var.h>
@@ -273,6 +274,9 @@ static int 	em_is_valid_ether_addr(u8 *);
 static int	em_sysctl_int_delay(SYSCTL_HANDLER_ARGS);
 static void	em_add_int_delay_sysctl(struct adapter *, const char *,
 		    const char *, struct em_int_delay_info *, int, int);
+static void 	em_max_bytes(struct ifnet *, uint64_t max);
+static struct drbr_ring *em_get_ring(struct ifnet *ifp, int num);
+static int	em_ring_query(struct ifnet *ifp, struct mbuf *);
 /* Management and WOL Support */
 static void	em_init_manageability(struct adapter *);
 static void	em_release_manageability(struct adapter *);
@@ -897,7 +901,38 @@ em_resume(device_t dev)
 	return bus_generic_resume(dev);
 }
=20
+void
+em_max_bytes(struct ifnet *ifp, uint64_t max)
+{
+	struct adapter	*adapter =3D ifp->if_softc;
+	adapter->ring_bytes_max =3D max;
+}
=20
+struct drbr_ring *
+em_get_ring(struct ifnet *ifp, int num)
+{
+	struct adapter	*adapter =3D ifp->if_softc;
+	struct tx_ring	*txr;
+	if (num >=3D adapter->num_queues) {
+		return (NULL);
+	}
+	if (adapter->tx_rings) {
+		txr =3D &adapter->tx_rings[num];
+		return (txr->br);
+	} else {
+		return (NULL);
+	}
+}
+=20
+int
+em_ring_query(struct ifnet *ifp, struct mbuf *m)
+{
+	struct adapter *adapter =3D ifp->if_softc;
+	struct tx_ring	*txr;
+	txr =3D &adapter->tx_rings[0];
+	return(drbr_is_on_ring(txr->br, m));
+}
+
 #ifdef EM_MULTIQUEUE
 /*********************************************************************
  *  Multiqueue Transmit routines=20
@@ -913,6 +948,7 @@ em_mq_start_locked(struct ifnet *ifp, struct tx_ri
 	struct adapter  *adapter =3D txr->adapter;
         struct mbuf     *next;
         int             err =3D 0, enq =3D 0;
+	uint8_t qused;
=20
 	if ((ifp->if_drv_flags & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) !=3D=

 	    IFF_DRV_RUNNING || adapter->link_active =3D=3D 0) {
@@ -929,20 +965,26 @@ em_mq_start_locked(struct ifnet *ifp, struct tx_ri
 	}=20
=20
 	/* Process the queue */
-	while ((next =3D drbr_peek(ifp, txr->br)) !=3D NULL) {
+	while ((next =3D drbr_peek(ifp, txr->br, &qused)) !=3D NULL) {
 		if ((err =3D em_xmit(txr, &next)) !=3D 0) {
 			if (next =3D=3D NULL)
-				drbr_advance(ifp, txr->br);
+				drbr_advance(ifp, txr->br, qused);
 			else=20
-				drbr_putback(ifp, txr->br, next);
+				drbr_putback(ifp, txr->br, next, qused);
 			break;
 		}
-		drbr_advance(ifp, txr->br);
+		drbr_advance(ifp, txr->br, qused);
+ 		atomic_add_long(&txr->bytes_on_ring,=20
+ 			(uint64_t)next->m_pkthdr.len);
 		enq++;
 		ifp->if_obytes +=3D next->m_pkthdr.len;
 		if (next->m_flags & M_MCAST)
 			ifp->if_omcasts++;
 		ETHER_BPF_MTAP(ifp, next);
+		if (adapter->ring_bytes_max &&=20
+		    (txr->bytes_on_ring >=3D adapter->ring_bytes_max)) {
+			break;
+		}
 		if ((ifp->if_drv_flags & IFF_DRV_RUNNING) =3D=3D 0)
                         break;
 	}
@@ -991,8 +1033,7 @@ em_qflush(struct ifnet *ifp)
=20
 	for (int i =3D 0; i < adapter->num_queues; i++, txr++) {
 		EM_TX_LOCK(txr);
-		while ((m =3D buf_ring_dequeue_sc(txr->br)) !=3D NULL)
-			m_freem(m);
+		drbr_flush(ifp, txr->br);
 		EM_TX_UNLOCK(txr);
 	}
 	if_qflush(ifp);
@@ -2984,6 +3025,9 @@ em_setup_interface(device_t dev, struct adapter *a
 	ifp->if_softc =3D adapter;
 	ifp->if_flags =3D IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST;
 	ifp->if_ioctl =3D em_ioctl;
+	ifp->if_maxbytes =3D em_max_bytes;
+	ifp->if_getdrbr_ring =3D em_get_ring;
+	ifp->if_mbuf_on_ring =3D em_ring_query;
 #ifdef EM_MULTIQUEUE
 	/* Multiqueue stack interface */
 	ifp->if_transmit =3D em_mq_start;
@@ -3222,7 +3266,7 @@ em_allocate_queues(struct adapter *adapter)
         	}
 #if __FreeBSD_version >=3D 800000
 		/* Allocate a buf ring */
-		txr->br =3D buf_ring_alloc(4096, M_DEVBUF,
+		txr->br =3D drbr_alloc(M_DEVBUF,
 		    M_WAITOK, &txr->tx_mtx);
 #endif
 	}
@@ -3272,7 +3316,7 @@ err_tx_desc:
 	free(adapter->rx_rings, M_DEVBUF);
 rx_fail:
 #if __FreeBSD_version >=3D 800000
-	buf_ring_free(txr->br, M_DEVBUF);
+	drbr_free(txr->br, M_DEVBUF);
 #endif
 	free(adapter->tx_rings, M_DEVBUF);
 fail:
@@ -3396,6 +3440,7 @@ em_setup_transmit_ring(struct tx_ring *txr)
=20
 	/* Set number of descriptors available */
 	txr->tx_avail =3D adapter->num_tx_desc;
+	txr->bytes_on_ring =3D 0;
 	txr->queue_status =3D EM_QUEUE_IDLE;
=20
 	/* Clear checksum offload context. */
@@ -3579,7 +3624,7 @@ em_free_transmit_buffers(struct tx_ring *txr)
 	}
 #if __FreeBSD_version >=3D 800000
 	if (txr->br !=3D NULL)
-		buf_ring_free(txr->br, M_DEVBUF);
+		drbr_free(txr->br, M_DEVBUF);
 #endif
 	if (txr->tx_buffers !=3D NULL) {
 		free(txr->tx_buffers, M_DEVBUF);
@@ -3877,6 +3922,8 @@ em_txeof(struct tx_ring *txr)
 			++processed;
=20
 			if (tx_buffer->m_head) {
+				=
atomic_subtract_long(&txr->bytes_on_ring,
+						     =
(u_long)tx_buffer->m_head->m_pkthdr.len);
 				bus_dmamap_sync(txr->txtag,
 				    tx_buffer->map,
 				    BUS_DMASYNC_POSTWRITE);
@@ -5329,7 +5376,7 @@ em_add_hw_stats(struct adapter *adapter)
 		queue_node =3D SYSCTL_ADD_NODE(ctx, child, OID_AUTO, =
namebuf,
 					    CTLFLAG_RD, NULL, "Queue =
Name");
 		queue_list =3D SYSCTL_CHILDREN(queue_node);
-
+		drbr_add_sysctl_stats(dev, queue_list, txr->br);
 		SYSCTL_ADD_PROC(ctx, queue_list, OID_AUTO, "txd_head",=20=

 				CTLTYPE_UINT | CTLFLAG_RD, adapter,
 				E1000_TDH(txr->me),
Index: sys/dev/e1000/if_em.h
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/e1000/if_em.h	(revision 257322)
+++ sys/dev/e1000/if_em.h	(working copy)
@@ -298,8 +298,9 @@ struct tx_ring {
 	u8			last_hw_tucso;
 	u8			last_hw_tucss;
 #if __FreeBSD_version >=3D 800000
-	struct buf_ring         *br;
+	struct drbr_ring        *br;
 #endif
+	volatile u_long		bytes_on_ring;
 	/* Interrupt resources */
         bus_dma_tag_t           txtag;
 	void                    *tag;
@@ -346,6 +347,7 @@ struct rx_ring {
 /* Our adapter structure */
 struct adapter {
 	struct ifnet	*ifp;
+	uint64_t	ring_bytes_max;
 	struct e1000_hw	hw;
=20
 	/* FreeBSD operating-system-specific structures. */
Index: sys/dev/e1000/if_igb.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/e1000/if_igb.c	(revision 257322)
+++ sys/dev/e1000/if_igb.c	(working copy)
@@ -72,6 +72,7 @@
 #include <net/if_arp.h>
 #include <net/if_dl.h>
 #include <net/if_media.h>
+#include <net/drbr.h>
=20
 #include <net/if_types.h>
 #include <net/if_vlan_var.h>
@@ -216,6 +217,9 @@ static void	igb_reset(struct adapter *);
 static int	igb_setup_interface(device_t, struct adapter *);
 static int	igb_allocate_queues(struct adapter *);
 static void	igb_configure_queues(struct adapter *);
+static void 	igb_max_bytes(struct ifnet *, uint64_t max);
+static struct drbr_ring *igb_get_ring(struct ifnet *ifp, int num);
+static int	igb_ring_query(struct ifnet *ifp, struct mbuf *m);
=20
 static int	igb_allocate_transmit_buffers(struct tx_ring *);
 static void	igb_setup_transmit_structures(struct adapter *);
@@ -883,7 +887,43 @@ igb_resume(device_t dev)
 	return bus_generic_resume(dev);
 }
=20
+void
+igb_max_bytes(struct ifnet *ifp, uint64_t max)
+{
+	struct adapter	*adapter =3D ifp->if_softc;
+	adapter->ring_bytes_max =3D max;
=20
+}
+
+struct drbr_ring *
+igb_get_ring(struct ifnet *ifp, int num)
+{
+	struct adapter	*adapter =3D ifp->if_softc;
+	struct tx_ring *txr;
+
+	if (num >=3D adapter->num_queues) {
+		return (NULL);
+	}
+	if (adapter->tx_rings) {
+		txr =3D &adapter->tx_rings[num];
+		return (txr->br);
+	} else {
+		return (NULL);
+	}
+}
+
+int
+igb_ring_query(struct ifnet *ifp, struct mbuf *m)
+{
+	struct adapter *adapter =3D ifp->if_softc;
+	struct tx_ring	*txr;
+	/* For this hack, we only use 0, since adara stuff
+	 * sends out on queue 0 always.
+	 */
+	txr =3D &adapter->tx_rings[0];
+	return(drbr_is_on_ring(txr->br, m));
+}
+
 #ifdef IGB_LEGACY_TX
=20
 /*********************************************************************
@@ -1003,6 +1043,7 @@ igb_mq_start_locked(struct ifnet *ifp, struct tx_r
 	struct adapter  *adapter =3D txr->adapter;
         struct mbuf     *next;
         int             err =3D 0, enq =3D 0;
+	uint8_t		qused;
=20
 	IGB_TX_LOCK_ASSERT(txr);
=20
@@ -1012,11 +1053,11 @@ igb_mq_start_locked(struct ifnet *ifp, struct =
tx_r
=20
=20
 	/* Process the queue */
-	while ((next =3D drbr_peek(ifp, txr->br)) !=3D NULL) {
+	while ((next =3D drbr_peek(ifp, txr->br, &qused)) !=3D NULL) {
 		if ((err =3D igb_xmit(txr, &next)) !=3D 0) {
 			if (next =3D=3D NULL) {
 				/* It was freed, move forward */
-				drbr_advance(ifp, txr->br);
+				drbr_advance(ifp, txr->br, qused);
 			} else {
 				/*=20
 				 * Still have one left, it may not be
@@ -1023,11 +1064,13 @@ igb_mq_start_locked(struct ifnet *ifp, struct =
tx_r
 				 * the same since the transmit function
 				 * may have changed it.
 				 */
-				drbr_putback(ifp, txr->br, next);
+				drbr_putback(ifp, txr->br, next, qused);
 			}
 			break;
 		}
-		drbr_advance(ifp, txr->br);
+		drbr_advance(ifp, txr->br, qused);
+		atomic_add_long(&txr->bytes_on_ring,=20
+			 (u_long)next->m_pkthdr.len);
 		enq++;
 		ifp->if_obytes +=3D next->m_pkthdr.len;
 		if (next->m_flags & M_MCAST)
@@ -1035,6 +1078,11 @@ igb_mq_start_locked(struct ifnet *ifp, struct =
tx_r
 		ETHER_BPF_MTAP(ifp, next);
 		if ((ifp->if_drv_flags & IFF_DRV_RUNNING) =3D=3D 0)
 			break;
+		if (adapter->ring_bytes_max &&=20
+		    (txr->bytes_on_ring >=3D adapter->ring_bytes_max)) {
+			break;
+		}
+
 	}
 	if (enq > 0) {
 		/* Set the watchdog */
@@ -1072,12 +1120,10 @@ igb_qflush(struct ifnet *ifp)
 {
 	struct adapter	*adapter =3D ifp->if_softc;
 	struct tx_ring	*txr =3D adapter->tx_rings;
-	struct mbuf	*m;
=20
 	for (int i =3D 0; i < adapter->num_queues; i++, txr++) {
 		IGB_TX_LOCK(txr);
-		while ((m =3D buf_ring_dequeue_sc(txr->br)) !=3D NULL)
-			m_freem(m);
+		drbr_flush(ifp, txr->br);
 		IGB_TX_UNLOCK(txr);
 	}
 	if_qflush(ifp);
@@ -3117,6 +3163,9 @@ igb_setup_interface(device_t dev, struct adapter *
 #ifndef IGB_LEGACY_TX
 	ifp->if_transmit =3D igb_mq_start;
 	ifp->if_qflush =3D igb_qflush;
+	ifp->if_maxbytes =3D igb_max_bytes;
+	ifp->if_getdrbr_ring =3D igb_get_ring;
+	ifp->if_mbuf_on_ring =3D igb_ring_query;
 #else
 	ifp->if_start =3D igb_start;
 	IFQ_SET_MAXLEN(&ifp->if_snd, adapter->num_tx_desc - 1);
@@ -3361,7 +3410,7 @@ igb_allocate_queues(struct adapter *adapter)
         	}
 #ifndef IGB_LEGACY_TX
 		/* Allocate a buf ring */
-		txr->br =3D buf_ring_alloc(igb_buf_ring_size, M_DEVBUF,
+		txr->br =3D drbr_alloc(M_DEVBUF,
 		    M_WAITOK, &txr->tx_mtx);
 #endif
 	}
@@ -3421,7 +3470,7 @@ err_tx_desc:
 	free(adapter->rx_rings, M_DEVBUF);
 rx_fail:
 #ifndef IGB_LEGACY_TX
-	buf_ring_free(txr->br, M_DEVBUF);
+	drbr_free(txr->br, M_DEVBUF);
 #endif
 	free(adapter->tx_rings, M_DEVBUF);
 tx_fail:
@@ -3539,6 +3588,7 @@ igb_setup_transmit_ring(struct tx_ring *txr)
=20
 	/* Set number of descriptors available */
 	txr->tx_avail =3D adapter->num_tx_desc;
+	txr->bytes_on_ring =3D 0;
=20
 	bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map,
 	    BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
@@ -3680,7 +3730,7 @@ igb_free_transmit_buffers(struct tx_ring *txr)
 	}
 #ifndef IGB_LEGACY_TX
 	if (txr->br !=3D NULL)
-		buf_ring_free(txr->br, M_DEVBUF);
+		drbr_free(txr->br, M_DEVBUF);
 #endif
 	if (txr->tx_buffers !=3D NULL) {
 		free(txr->tx_buffers, M_DEVBUF);
@@ -4016,6 +4066,8 @@ igb_txeof(struct tx_ring *txr)
 			if (buf->m_head) {
 				txr->bytes +=3D
 				    buf->m_head->m_pkthdr.len;
+				=
atomic_subtract_long(&txr->bytes_on_ring,
+				    =
(uint64_t)buf->m_head->m_pkthdr.len);
 				bus_dmamap_sync(txr->txtag,
 				    buf->map,
 				    BUS_DMASYNC_POSTWRITE);
@@ -5636,7 +5688,7 @@ igb_add_hw_stats(struct adapter *adapter)
 		queue_node =3D SYSCTL_ADD_NODE(ctx, child, OID_AUTO, =
namebuf,
 					    CTLFLAG_RD, NULL, "Queue =
Name");
 		queue_list =3D SYSCTL_CHILDREN(queue_node);
-
+		drbr_add_sysctl_stats(dev, queue_list, txr->br);
 		SYSCTL_ADD_PROC(ctx, queue_list, OID_AUTO, =
"interrupt_rate",=20
 				CTLFLAG_RD, &adapter->queues[i],
 				sizeof(&adapter->queues[i]),
Index: sys/dev/e1000/if_igb.h
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/e1000/if_igb.h	(revision 257322)
+++ sys/dev/e1000/if_igb.h	(working copy)
@@ -309,12 +309,14 @@ struct tx_ring {
 	    IGB_QUEUE_DEPLETED =3D 8,
 	}			queue_status;
 	u32			txd_cmd;
-	bus_dma_tag_t		txtag;
 	char			mtx_name[16];
 #ifndef IGB_LEGACY_TX
-	struct buf_ring		*br;
+	struct drbr_ring	*br;
 	struct task		txq_task;
 #endif
+	bus_dma_tag_t		txtag;
+	volatile u_long		bytes_on_ring;
+
 	u32			bytes;  /* used for AIM */
 	u32			packets;
 	/* Soft Stats */
@@ -371,17 +373,17 @@ struct adapter {
 	struct device		*dev;
 	struct cdev		*led_dev;
=20
-	struct resource		*pci_mem;
-	struct resource		*msix_mem;
-	int			memrid;
-
+	struct resource *pci_mem;
+	struct resource *msix_mem;
+	uint64_t	ring_bytes_max;
+	int		memrid;
 	/*
 	 * Interrupt resources: this set is
 	 * either used for legacy, or for Link
 	 * when doing MSIX
 	 */
-	void			*tag;
-	struct resource 	*res;
+	void		*tag;
+	struct resource	*res;
=20
 	struct ifmedia		media;
 	struct callout		timer;
Index: sys/dev/fdt/fdt_common.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/fdt/fdt_common.c	(revision 257322)
+++ sys/dev/fdt/fdt_common.c	(working copy)
@@ -183,7 +183,6 @@ fdt_is_compatible(phandle_t node, const char *comp
 		compat +=3D l;
 		len -=3D l;
 	}
-
 	return (rv);
 }
=20
@@ -585,15 +584,18 @@ fdt_get_phyaddr(phandle_t node, device_t dev, int
 	if (OF_getencprop(node, "phy-handle", (void *)&phy_handle,
 	    sizeof(phy_handle)) <=3D 0)
 		return (ENXIO);
-
 	phy_node =3D OF_xref_phandle(phy_handle);
+	device_printf(dev, "phy-handle:0x%x phy_ihandle:0x%x =
phy_node:0x%x\n",=20
+		      (uint32_t)phy_handle, (uint32_t)phy_ihandle,
+		      (uint32_t)phy_node);
=20
 	if (OF_getprop(phy_node, "reg", (void *)&phy_reg,
 	    sizeof(phy_reg)) <=3D 0)
 		return (ENXIO);
=20
+	device_printf(dev, "reg:0x%x\n", (uint32_t)phy_reg);
 	*phy_addr =3D fdt32_to_cpu(phy_reg);
-
+	device_printf(dev, "tran to reg:0x%x\n", (uint32_t)*phy_addr);
 	/*
 	 * Search for softc used to communicate with phy.
 	 */
Index: sys/dev/fdt/simplebus.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/fdt/simplebus.c	(revision 257322)
+++ sys/dev/fdt/simplebus.c	(working copy)
@@ -154,6 +154,8 @@ simplebus_probe(device_t dev)
 	return (BUS_PROBE_GENERIC);
 }
=20
+extern uint32_t simp_bus_debug;
+
 static int
 simplebus_attach(device_t dev)
 {
@@ -161,6 +163,7 @@ simplebus_attach(device_t dev)
 	struct simplebus_devinfo *di;
 	struct simplebus_softc *sc;
 	phandle_t dt_node, dt_child;
+	int ret;
=20
 	sc =3D device_get_softc(dev);
=20
@@ -215,13 +218,15 @@ simplebus_attach(device_t dev)
 			free(di, M_SIMPLEBUS);
 			continue;
 		}
-#ifdef DEBUG
+/*#ifdef DEBUG*/
 		device_printf(dev, "added child: %s\n\n", =
di->di_ofw.obd_name);
-#endif
+/*#endif*/
 		device_set_ivars(dev_child, di);
 	}
-
-	return (bus_generic_attach(dev));
+	simp_bus_debug =3D 1;
+	ret =3D bus_generic_attach(dev);
+	simp_bus_debug =3D 0;
+	return (ret);
 }
=20
 static int
Index: sys/dev/ixgbe/ixgbe.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/ixgbe/ixgbe.c	(revision 257322)
+++ sys/dev/ixgbe/ixgbe.c	(working copy)
@@ -845,7 +845,8 @@ ixgbe_mq_start_locked(struct ifnet *ifp, struct tx
 	struct adapter  *adapter =3D txr->adapter;
         struct mbuf     *next;
         int             enqueued =3D 0, err =3D 0;
-
+	uint8_t		qused;
+=09
 	if (((ifp->if_drv_flags & IFF_DRV_RUNNING) =3D=3D 0) ||
 	    adapter->link_active =3D=3D 0)
 		return (ENETDOWN);
@@ -858,18 +859,18 @@ ixgbe_mq_start_locked(struct ifnet *ifp, struct tx
 			if (next !=3D NULL)
 				err =3D drbr_enqueue(ifp, txr->br, =
next);
 #else
-	while ((next =3D drbr_peek(ifp, txr->br)) !=3D NULL) {
+	while ((next =3D drbr_peek(ifp, txr->br, &qused)) !=3D NULL) {
 		if ((err =3D ixgbe_xmit(txr, &next)) !=3D 0) {
 			if (next =3D=3D NULL) {
-				drbr_advance(ifp, txr->br);
+				drbr_advance(ifp, txr->br, qused);
 			} else {
-				drbr_putback(ifp, txr->br, next);
+				drbr_putback(ifp, txr->br, next, qused);
 			}
 #endif
 			break;
 		}
 #if __FreeBSD_version >=3D 901504
-		drbr_advance(ifp, txr->br);
+		drbr_advance(ifp, txr->br, qused);
 #endif
 		enqueued++;
 		/* Send a copy of the frame to the BPF listener */
@@ -917,12 +918,10 @@ ixgbe_qflush(struct ifnet *ifp)
 {
 	struct adapter	*adapter =3D ifp->if_softc;
 	struct tx_ring	*txr =3D adapter->tx_rings;
-	struct mbuf	*m;
=20
 	for (int i =3D 0; i < adapter->num_queues; i++, txr++) {
 		IXGBE_TX_LOCK(txr);
-		while ((m =3D buf_ring_dequeue_sc(txr->br)) !=3D NULL)
-			m_freem(m);
+		drbr_flush(ifp, txr->br);
 		IXGBE_TX_UNLOCK(txr);
 	}
 	if_qflush(ifp);
@@ -2891,7 +2890,7 @@ ixgbe_allocate_queues(struct adapter *adapter)
         	}
 #ifndef IXGBE_LEGACY_TX
 		/* Allocate a buf ring */
-		txr->br =3D buf_ring_alloc(IXGBE_BR_SIZE, M_DEVBUF,
+		txr->br =3D drbr_alloc(M_DEVBUF,
 		    M_WAITOK, &txr->tx_mtx);
 		if (txr->br =3D=3D NULL) {
 			device_printf(dev,
@@ -3253,7 +3252,7 @@ ixgbe_free_transmit_buffers(struct tx_ring *txr)
 	}
 #ifdef IXGBE_LEGACY_TX
 	if (txr->br !=3D NULL)
-		buf_ring_free(txr->br, M_DEVBUF);
+		drbr_free(txr->br, M_DEVBUF);
 #endif
 	if (txr->tx_buffers !=3D NULL) {
 		free(txr->tx_buffers, M_DEVBUF);
Index: sys/dev/ixgbe/ixgbe.h
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/ixgbe/ixgbe.h	(revision 257322)
+++ sys/dev/ixgbe/ixgbe.h	(working copy)
@@ -58,6 +58,7 @@
 #include <net/ethernet.h>
 #include <net/if_dl.h>
 #include <net/if_media.h>
+#include <net/drbr.h>
=20
 #include <net/bpf.h>
 #include <net/if_types.h>
@@ -313,7 +314,7 @@ struct tx_ring {
 	bus_dma_tag_t		txtag;
 	char			mtx_name[16];
 #ifndef IXGBE_LEGACY_TX
-	struct buf_ring		*br;
+	struct drbr_ring	*br;
 	struct task		txq_task;
 #endif
 #ifdef IXGBE_FDIR
Index: sys/dev/ixgbe/ixv.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/ixgbe/ixv.c	(revision 257322)
+++ sys/dev/ixgbe/ixv.c	(working copy)
@@ -603,6 +603,7 @@ ixv_mq_start_locked(struct ifnet *ifp, struct tx_r
 	struct adapter  *adapter =3D txr->adapter;
         struct mbuf     *next;
         int             enqueued, err =3D 0;
+	uint8_t 	qused;
=20
 	if ((ifp->if_drv_flags & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) !=3D=

 	    IFF_DRV_RUNNING || adapter->link_active =3D=3D 0) {
@@ -623,16 +624,16 @@ ixv_mq_start_locked(struct ifnet *ifp, struct tx_r
 		}
 	}
 	/* Process the queue */
-	while ((next =3D drbr_peek(ifp, txr->br)) !=3D NULL) {
+	while ((next =3D drbr_peek(ifp, txr->br, &qused)) !=3D NULL) {
 		if ((err =3D ixv_xmit(txr, &next)) !=3D 0) {
 			if (next =3D=3D NULL) {
-				drbr_advance(ifp, txr->br);
+				drbr_advance(ifp, txr->br, qused);
 			} else {
-				drbr_putback(ifp, txr->br, next);
+				drbr_putback(ifp, txr->br, next, qused);
 			}
 			break;
 		}
-		drbr_advance(ifp, txr->br);
+		drbr_advance(ifp, txr->br, qused);
 		enqueued++;
 		ifp->if_obytes +=3D next->m_pkthdr.len;
 		if (next->m_flags & M_MCAST)
@@ -664,12 +665,10 @@ ixv_qflush(struct ifnet *ifp)
 {
 	struct adapter  *adapter =3D ifp->if_softc;
 	struct tx_ring  *txr =3D adapter->tx_rings;
-	struct mbuf     *m;
=20
 	for (int i =3D 0; i < adapter->num_queues; i++, txr++) {
 		IXV_TX_LOCK(txr);
-		while ((m =3D buf_ring_dequeue_sc(txr->br)) !=3D NULL)
-			m_freem(m);
+		drbr_flush(ifp, txr->br);
 		IXV_TX_UNLOCK(txr);
 	}
 	if_qflush(ifp);
@@ -2053,8 +2052,7 @@ ixv_allocate_queues(struct adapter *adapter)
         	}
 #if __FreeBSD_version >=3D 800000
 		/* Allocate a buf ring */
-		txr->br =3D buf_ring_alloc(IXV_BR_SIZE, M_DEVBUF,
-		    M_WAITOK, &txr->tx_mtx);
+		txr->br =3D drbr_alloc(M_DEVBUF, M_WAITOK, =
&txr->tx_mtx);
 		if (txr->br =3D=3D NULL) {
 			device_printf(dev,
 			    "Critical Failure setting up buf ring\n");
@@ -2355,7 +2353,7 @@ ixv_free_transmit_buffers(struct tx_ring *txr)
 	}
 #if __FreeBSD_version >=3D 800000
 	if (txr->br !=3D NULL)
-		buf_ring_free(txr->br, M_DEVBUF);
+		drbr_free(txr->br, M_DEVBUF);
 #endif
 	if (txr->tx_buffers !=3D NULL) {
 		free(txr->tx_buffers, M_DEVBUF);
Index: sys/dev/ixgbe/ixv.h
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/ixgbe/ixv.h	(revision 257322)
+++ sys/dev/ixgbe/ixv.h	(working copy)
@@ -61,6 +61,7 @@
 #include <net/bpf.h>
 #include <net/if_types.h>
 #include <net/if_vlan_var.h>
+#include <net/drbr.h>
=20
 #include <netinet/in_systm.h>
 #include <netinet/in.h>
@@ -267,7 +268,7 @@ struct tx_ring {
 	u32			txd_cmd;
 	bus_dma_tag_t		txtag;
 	char			mtx_name[16];
-	struct buf_ring		*br;
+	struct drbr_ring	*br;
 	/* Soft Stats */
 	u32			bytes;
 	u32			packets;
Index: sys/dev/mxge/if_mxge.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/mxge/if_mxge.c	(revision 257322)
+++ sys/dev/mxge/if_mxge.c	(working copy)
@@ -59,6 +59,7 @@ __FBSDID("$FreeBSD$");
 #include <net/if_types.h>
 #include <net/if_vlan_var.h>
 #include <net/zlib.h>
+#include <net/drbr.h>
=20
 #include <netinet/in_systm.h>
 #include <netinet/in.h>
@@ -2243,14 +2244,12 @@ mxge_qflush(struct ifnet *ifp)
 {
 	mxge_softc_t *sc =3D ifp->if_softc;
 	mxge_tx_ring_t *tx;
-	struct mbuf *m;
 	int slice;
=20
 	for (slice =3D 0; slice < sc->num_slices; slice++) {
 		tx =3D &sc->ss[slice].tx;
 		mtx_lock(&tx->mtx);
-		while ((m =3D buf_ring_dequeue_sc(tx->br)) !=3D NULL)
-			m_freem(m);
+		drbr_flush(ifp, tx->br);
 		mtx_unlock(&tx->mtx);
 	}
 	if_qflush(ifp);
@@ -4060,7 +4059,7 @@ mxge_update_stats(mxge_softc_t *sc)
 #ifdef IFNET_BUF_RING
 		obytes +=3D ss->obytes;
 		omcasts +=3D ss->omcasts;
-		odrops +=3D ss->tx.br->br_drops;
+		odrops +=3D drbr_get_dropcnt(ss->tx.br);
 #endif
 		oerrors +=3D ss->oerrors;
 	}
@@ -4436,7 +4435,7 @@ mxge_alloc_slices(mxge_softc_t *sc)
 			 "%s:tx(%d)", device_get_nameunit(sc->dev), i);
 		mtx_init(&ss->tx.mtx, ss->tx.mtx_name, NULL, MTX_DEF);
 #ifdef IFNET_BUF_RING
-		ss->tx.br =3D buf_ring_alloc(2048, M_DEVBUF, M_WAITOK,
+		ss->tx.br =3D drbr_alloc(M_DEVBUF, M_WAITOK,
 					   &ss->tx.mtx);
 #endif
 	}
Index: sys/dev/mxge/if_mxge_var.h
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/mxge/if_mxge_var.h	(revision 257322)
+++ sys/dev/mxge/if_mxge_var.h	(working copy)
@@ -167,7 +167,7 @@ typedef struct
 {
 	struct mtx mtx;
 #ifdef IFNET_BUF_RING
-	struct buf_ring *br;
+	struct drbr_ring *br;
 #endif
 	volatile mcp_kreq_ether_send_t *lanai;	/* lanai ptr for sendq	=
*/
 	volatile uint32_t *send_go;		/* doorbell for sendq */
Index: sys/dev/oce/oce_hw.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/oce/oce_hw.c	(revision 257322)
+++ sys/dev/oce/oce_hw.c	(working copy)
@@ -360,8 +360,8 @@ oce_hw_shutdown(POCE_SOFTC sc)
 	/* release PCI resources */
 	oce_hw_pci_free(sc);
 	/* free mbox specific resources */
-	LOCK_DESTROY(&sc->bmbx_lock);
-	LOCK_DESTROY(&sc->dev_lock);
+	LOCK_DESTROY_OCE(&sc->bmbx_lock);
+	LOCK_DESTROY_OCE(&sc->dev_lock);
=20
 	oce_dma_free(sc, &sc->bsmbx);
 }
Index: sys/dev/oce/oce_if.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/oce/oce_if.c	(revision 257322)
+++ sys/dev/oce/oce_if.c	(working copy)
@@ -296,8 +296,8 @@ oce_attach(device_t dev)
 	sc->flow_control =3D OCE_DEFAULT_FLOW_CONTROL;
 	sc->promisc	 =3D OCE_DEFAULT_PROMISCUOUS;
=20
-	LOCK_CREATE(&sc->bmbx_lock, "Mailbox_lock");
-	LOCK_CREATE(&sc->dev_lock,  "Device_lock");
+	LOCK_CREATE_OCE(&sc->bmbx_lock, "Mailbox_lock");
+	LOCK_CREATE_OCE(&sc->dev_lock,  "Device_lock");
=20
 	/* initialise the hardware */
 	rc =3D oce_hw_init(sc);
@@ -372,8 +372,8 @@ mbox_free:
 	oce_dma_free(sc, &sc->bsmbx);
 pci_res_free:
 	oce_hw_pci_free(sc);
-	LOCK_DESTROY(&sc->dev_lock);
-	LOCK_DESTROY(&sc->bmbx_lock);
+	LOCK_DESTROY_OCE(&sc->dev_lock);
+	LOCK_DESTROY_OCE(&sc->bmbx_lock);
 	return rc;
=20
 }
@@ -384,9 +384,9 @@ oce_detach(device_t dev)
 {
 	POCE_SOFTC sc =3D device_get_softc(dev);
=20
-	LOCK(&sc->dev_lock);
+	LOCK_OCE(&sc->dev_lock);
 	oce_if_deactivate(sc);
-	UNLOCK(&sc->dev_lock);
+	UNLOCK_OCE(&sc->dev_lock);
=20
 	callout_drain(&sc->timer);
 =09
@@ -447,13 +447,13 @@ oce_ioctl(struct ifnet *ifp, u_long command, caddr
 			}
 			device_printf(sc->dev, "Interface Up\n");=09
 		} else {
-			LOCK(&sc->dev_lock);
+			LOCK_OCE(&sc->dev_lock);
=20
 			sc->ifp->if_drv_flags &=3D
 			    ~(IFF_DRV_RUNNING | IFF_DRV_OACTIVE);
 			oce_if_deactivate(sc);
=20
-			UNLOCK(&sc->dev_lock);
+			UNLOCK_OCE(&sc->dev_lock);
=20
 			device_printf(sc->dev, "Interface Down\n");
 		}
@@ -543,7 +543,7 @@ oce_init(void *arg)
 {
 	POCE_SOFTC sc =3D arg;
 =09
-	LOCK(&sc->dev_lock);
+	LOCK_OCE(&sc->dev_lock);
=20
 	if (sc->ifp->if_flags & IFF_UP) {
 		oce_if_deactivate(sc);
@@ -550,7 +550,7 @@ oce_init(void *arg)
 		oce_if_activate(sc);
 	}
 =09
-	UNLOCK(&sc->dev_lock);
+	UNLOCK_OCE(&sc->dev_lock);
=20
 }
=20
@@ -571,9 +571,9 @@ oce_multiq_start(struct ifnet *ifp, struct mbuf *m
=20
 	wq =3D sc->wq[queue_index];
=20
-	LOCK(&wq->tx_lock);
+	LOCK_OCE(&wq->tx_lock);
 	status =3D oce_multiq_transmit(ifp, m, wq);
-	UNLOCK(&wq->tx_lock);
+	UNLOCK_OCE(&wq->tx_lock);
=20
 	return status;
=20
@@ -584,12 +584,10 @@ static void
 oce_multiq_flush(struct ifnet *ifp)
 {
 	POCE_SOFTC sc =3D ifp->if_softc;
-	struct mbuf     *m;
 	int i =3D 0;
=20
 	for (i =3D 0; i < sc->nwqs; i++) {
-		while ((m =3D buf_ring_dequeue_sc(sc->wq[i]->br)) !=3D =
NULL)
-			m_freem(m);
+		drbr_flush(ifp, sc->wq[i]->br);
 	}
 	if_qflush(ifp);
 }
@@ -1136,13 +1134,13 @@ oce_tx_task(void *arg, int npending)
 	int rc =3D 0;
=20
 #if __FreeBSD_version >=3D 800000
-	LOCK(&wq->tx_lock);
+	LOCK_OCE(&wq->tx_lock);
 	rc =3D oce_multiq_transmit(ifp, NULL, wq);
 	if (rc) {
 		device_printf(sc->dev,
 				"TX[%d] restart failed\n", =
wq->queue_index);
 	}
-	UNLOCK(&wq->tx_lock);
+	UNLOCK_OCE(&wq->tx_lock);
 #else
 	oce_start(ifp);
 #endif
@@ -1170,9 +1168,9 @@ oce_start(struct ifnet *ifp)
 		if (m =3D=3D NULL)
 			break;
=20
-		LOCK(&sc->wq[def_q]->tx_lock);
+		LOCK_OCE(&sc->wq[def_q]->tx_lock);
 		rc =3D oce_tx(sc, &m, def_q);
-		UNLOCK(&sc->wq[def_q]->tx_lock);
+		UNLOCK_OCE(&sc->wq[def_q]->tx_lock);
 		if (rc) {
 			if (m !=3D NULL) {
 				sc->wq[def_q]->tx_stats.tx_stops ++;
@@ -1247,7 +1245,8 @@ oce_multiq_transmit(struct ifnet *ifp, struct mbuf
 	POCE_SOFTC sc =3D ifp->if_softc;
 	int status =3D 0, queue_index =3D 0;
 	struct mbuf *next =3D NULL;
-	struct buf_ring *br =3D NULL;
+	struct drbr_ring *br =3D NULL;
+	uint8_t qused;
=20
 	br  =3D wq->br;
 	queue_index =3D wq->queue_index;
@@ -1263,12 +1262,12 @@ oce_multiq_transmit(struct ifnet *ifp, struct =
mbuf
 		if ((status =3D drbr_enqueue(ifp, br, m)) !=3D 0)
 			return status;
 	}=20
-	while ((next =3D drbr_peek(ifp, br)) !=3D NULL) {
+	while ((next =3D drbr_peek(ifp, br, &qused)) !=3D NULL) {
 		if (oce_tx(sc, &next, queue_index)) {
 			if (next =3D=3D NULL) {
-				drbr_advance(ifp, br);
+				drbr_advance(ifp, br, qused);
 			} else {
-				drbr_putback(ifp, br, next);
+				drbr_putback(ifp, br, next, qused);
 				wq->tx_stats.tx_stops ++;
 				ifp->if_drv_flags |=3D IFF_DRV_OACTIVE;
 				status =3D drbr_enqueue(ifp, br, next);
@@ -1275,7 +1274,7 @@ oce_multiq_transmit(struct ifnet *ifp, struct mbuf
 			} =20
 			break;
 		}
-		drbr_advance(ifp, br);
+		drbr_advance(ifp, br, qused);
 		ifp->if_obytes +=3D next->m_pkthdr.len;
 		if (next->m_flags & M_MCAST)
 			ifp->if_omcasts++;
@@ -2078,13 +2077,13 @@ oce_if_deactivate(POCE_SOFTC sc)
 	   any other lock. So unlock device lock and require after
 	   completing taskqueue_drain.
 	*/
-	UNLOCK(&sc->dev_lock);
+	UNLOCK_OCE(&sc->dev_lock);
 	for (i =3D 0; i < sc->intr_count; i++) {
 		if (sc->intrs[i].tq !=3D NULL) {
 			taskqueue_drain(sc->intrs[i].tq, =
&sc->intrs[i].task);
 		}
 	}
-	LOCK(&sc->dev_lock);
+	LOCK_OCE(&sc->dev_lock);
=20
 	/* Delete RX queue in card with flush param */
 	oce_stop_rx(sc);
Index: sys/dev/oce/oce_if.h
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/oce/oce_if.h	(revision 257322)
+++ sys/dev/oce/oce_if.h	(working copy)
@@ -70,6 +70,7 @@
 #include <net/if_media.h>
 #include <net/if_vlan_var.h>
 #include <net/if_dl.h>
+#include <net/drbr.h>
=20
 #include <netinet/in.h>
 #include <netinet/in_systm.h>
@@ -528,18 +529,18 @@ struct oce_lock {
 };
 #define OCE_LOCK				struct oce_lock
=20
-#define LOCK_CREATE(lock, desc) 		{ \
+#define LOCK_CREATE_OCE(lock, desc) 		{ \
 	strncpy((lock)->name, (desc), MAX_LOCK_DESC_LEN); \
 	(lock)->name[MAX_LOCK_DESC_LEN] =3D '\0'; \
 	mtx_init(&(lock)->mutex, (lock)->name, NULL, MTX_DEF); \
 }
-#define LOCK_DESTROY(lock) 			\
+#define LOCK_DESTROY_OCE(lock) 			\
 		if (mtx_initialized(&(lock)->mutex))\
 			mtx_destroy(&(lock)->mutex)
-#define TRY_LOCK(lock)				=
mtx_trylock(&(lock)->mutex)
-#define LOCK(lock)				mtx_lock(&(lock)->mutex)
-#define LOCKED(lock)				=
mtx_owned(&(lock)->mutex)
-#define UNLOCK(lock)				=
mtx_unlock(&(lock)->mutex)
+#define TRY_LOCK_OCE(lock)			=
mtx_trylock(&(lock)->mutex)
+#define LOCK_OCE(lock)				mtx_lock(&(lock)->mutex)
+#define LOCKED_OCE(lock)			=
mtx_owned(&(lock)->mutex)
+#define UNLOCK_OCE(lock)			=
mtx_unlock(&(lock)->mutex)
=20
 #define	DEFAULT_MQ_MBOX_TIMEOUT			(5 * 1000 * =
1000)
 #define	MBX_READY_TIMEOUT			(1 * 1000 * =
1000)
@@ -702,7 +703,7 @@ struct oce_wq {
 	struct wq_config cfg;
 	int queue_index;
 	struct oce_tx_queue_stats tx_stats;
-	struct buf_ring *br;
+	struct drbr_ring *br;
 	struct task txtask;
 	uint32_t db_offset;
 };
Index: sys/dev/oce/oce_mbox.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/oce/oce_mbox.c	(revision 257322)
+++ sys/dev/oce/oce_mbox.c	(working copy)
@@ -345,7 +345,7 @@ oce_mbox_post(POCE_SOFTC sc, struct oce_mbx *mbx,
 	uint32_t cstatus =3D 0;
 	uint32_t xstatus =3D 0;
=20
-	LOCK(&sc->bmbx_lock);
+	LOCK_OCE(&sc->bmbx_lock);
=20
 	mb =3D OCE_DMAPTR(&sc->bsmbx, struct oce_bmbx);
 	mb_mbx =3D &mb->mbx;
@@ -387,7 +387,7 @@ oce_mbox_post(POCE_SOFTC sc, struct oce_mbx *mbx,
 		}
 	}
=20
-	UNLOCK(&sc->bmbx_lock);
+	UNLOCK_OCE(&sc->bmbx_lock);
=20
 	return rc;
 }
Index: sys/dev/oce/oce_queue.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/oce/oce_queue.c	(revision 257322)
+++ sys/dev/oce/oce_queue.c	(working copy)
@@ -253,12 +253,11 @@ oce_wq *oce_wq_init(POCE_SOFTC sc, uint32_t q_len,
 		goto free_wq;
=20
=20
-	LOCK_CREATE(&wq->tx_lock, "TX_lock");
+	LOCK_CREATE_OCE(&wq->tx_lock, "TX_lock");
 =09
 #if __FreeBSD_version >=3D 800000
 	/* Allocate buf ring for multiqueue*/
-	wq->br =3D buf_ring_alloc(4096, M_DEVBUF,
-			M_WAITOK, &wq->tx_lock.mutex);
+	wq->br =3D drbr_alloc(M_DEVBUF, M_WAITOK, &wq->tx_lock.mutex);
 	if (!wq->br)
 		goto free_wq;
 #endif
@@ -301,9 +300,9 @@ oce_wq_free(struct oce_wq *wq)
 	if (wq->tag !=3D NULL)
 		bus_dma_tag_destroy(wq->tag);
 	if (wq->br !=3D NULL)
-		buf_ring_free(wq->br, M_DEVBUF);
+		drbr_free(wq->br, M_DEVBUF);
=20
-	LOCK_DESTROY(&wq->tx_lock);
+	LOCK_DESTROY_OCE(&wq->tx_lock);
 	free(wq, M_DEVBUF);
 }
=20
@@ -451,7 +450,7 @@ oce_rq *oce_rq_init(POCE_SOFTC sc,
 	if (!rq->ring)
 		goto free_rq;
=20
-	LOCK_CREATE(&rq->rx_lock, "RX_lock");
+	LOCK_CREATE_OCE(&rq->rx_lock, "RX_lock");
=20
 	return rq;
=20
@@ -493,7 +492,7 @@ oce_rq_free(struct oce_rq *rq)
 	if (rq->tag !=3D NULL)
 		bus_dma_tag_destroy(rq->tag);
=20
-	LOCK_DESTROY(&rq->rx_lock);
+	LOCK_DESTROY_OCE(&rq->rx_lock);
 	free(rq, M_DEVBUF);
 }
=20
Index: sys/dev/virtio/network/if_vtnet.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/virtio/network/if_vtnet.c	(revision 257322)
+++ sys/dev/virtio/network/if_vtnet.c	(working copy)
@@ -57,6 +57,7 @@ __FBSDID("$FreeBSD$");
 #include <net/if_types.h>
 #include <net/if_media.h>
 #include <net/if_vlan_var.h>
+#include <net/drbr.h>
=20
 #include <net/bpf.h>
=20
@@ -685,7 +686,7 @@ vtnet_init_txq(struct vtnet_softc *sc, int id)
 	txq->vtntx_id =3D id;
=20
 #ifndef VTNET_LEGACY_TX
-	txq->vtntx_br =3D buf_ring_alloc(VTNET_DEFAULT_BUFRING_SIZE, =
M_DEVBUF,
+	txq->vtntx_br =3D drbr_alloc(M_DEVBUF,
 	    M_NOWAIT, &txq->vtntx_mtx);
 	if (txq->vtntx_br =3D=3D NULL)
 		return (ENOMEM);
@@ -749,7 +750,7 @@ vtnet_destroy_txq(struct vtnet_txq *txq)
=20
 #ifndef VTNET_LEGACY_TX
 	if (txq->vtntx_br !=3D NULL) {
-		buf_ring_free(txq->vtntx_br, M_DEVBUF);
+		drbr_free(txq->vtntx_br, M_DEVBUF);
 		txq->vtntx_br =3D NULL;
 	}
 #endif
@@ -2211,9 +2212,10 @@ vtnet_txq_mq_start_locked(struct vtnet_txq *txq, =
s
 {
 	struct vtnet_softc *sc;
 	struct virtqueue *vq;
-	struct buf_ring *br;
+	struct drbr_ring *br;
 	struct ifnet *ifp;
 	int enq, error;
+	uint8_t qnum;
=20
 	sc =3D txq->vtntx_sc;
 	vq =3D txq->vtntx_vq;
@@ -2239,16 +2241,16 @@ vtnet_txq_mq_start_locked(struct vtnet_txq *txq, =
s
=20
 	vtnet_txq_eof(txq);
=20
-	while ((m =3D drbr_peek(ifp, br)) !=3D NULL) {
+	while ((m =3D drbr_peek(ifp, br, &qnum)) !=3D NULL) {
 		error =3D vtnet_txq_encap(txq, &m);
 		if (error) {
 			if (m !=3D NULL)
-				drbr_putback(ifp, br, m);
+				drbr_putback(ifp, br, m, qnum);
 			else
-				drbr_advance(ifp, br);
+				drbr_advance(ifp, br, qnum);
 			break;
 		}
-		drbr_advance(ifp, br);
+		drbr_advance(ifp, br, qnum);
=20
 		enq++;
 		ETHER_BPF_MTAP(ifp, m);
@@ -2458,7 +2460,6 @@ vtnet_qflush(struct ifnet *ifp)
 {
 	struct vtnet_softc *sc;
 	struct vtnet_txq *txq;
-	struct mbuf *m;
 	int i;
=20
 	sc =3D ifp->if_softc;
@@ -2467,8 +2468,7 @@ vtnet_qflush(struct ifnet *ifp)
 		txq =3D &sc->vtnet_txqs[i];
=20
 		VTNET_TXQ_LOCK(txq);
-		while ((m =3D buf_ring_dequeue_sc(txq->vtntx_br)) !=3D =
NULL)
-			m_freem(m);
+		drbr_flush(ifp, txq->vtntx_br);
 		VTNET_TXQ_UNLOCK(txq);
 	}
=20
Index: sys/dev/virtio/network/if_vtnetvar.h
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/virtio/network/if_vtnetvar.h	(revision 257322)
+++ sys/dev/virtio/network/if_vtnetvar.h	(working copy)
@@ -100,7 +100,7 @@ struct vtnet_txq {
 	struct vtnet_softc	*vtntx_sc;
 	struct virtqueue	*vtntx_vq;
 #ifndef VTNET_LEGACY_TX
-	struct buf_ring		*vtntx_br;
+	struct drbr_ring	*vtntx_br;
 #endif
 	int			 vtntx_id;
 	int			 vtntx_watchdog;
Index: sys/dev/vxge/vxge.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/vxge/vxge.c	(revision 257322)
+++ sys/dev/vxge/vxge.c	(working copy)
@@ -31,6 +31,7 @@
 /*$FreeBSD$*/
=20
 #include <dev/vxge/vxge.h>
+#include <net/drbr.h>
=20
 static int vxge_pci_bd_no =3D -1;
 static u32 vxge_drv_copyright =3D 0;
@@ -729,7 +730,6 @@ void
 vxge_mq_qflush(ifnet_t ifp)
 {
 	int i;
-	mbuf_t m_head;
 	vxge_vpath_t *vpath;
=20
 	vxge_dev_t *vdev =3D (vxge_dev_t *) ifp->if_softc;
@@ -740,9 +740,7 @@ vxge_mq_qflush(ifnet_t ifp)
 			continue;
=20
 		VXGE_TX_LOCK(vpath);
-		while ((m_head =3D buf_ring_dequeue_sc(vpath->br)) !=3D =
NULL)
-			vxge_free_packet(m_head);
-
+		drbr_flush(ifp, vpath->br);
 		VXGE_TX_UNLOCK(vpath);
 	}
 	if_qflush(ifp);
@@ -2294,7 +2292,7 @@ vxge_vpath_open(vxge_dev_t *vdev)
 			break;
 		}
 #if __FreeBSD_version >=3D 800000
-		vpath->br =3D buf_ring_alloc(VXGE_DEFAULT_BR_SIZE, =
M_DEVBUF,
+		vpath->br =3D drbr_alloc(M_DEVBUF,
 		    M_WAITOK, &vpath->mtx_tx);
 		if (vpath->br =3D=3D NULL) {
 			err =3D ENOMEM;
@@ -2433,7 +2431,7 @@ vxge_vpath_close(vxge_dev_t *vdev)
=20
 #if __FreeBSD_version >=3D 800000
 		if (vpath->br !=3D NULL)
-			buf_ring_free(vpath->br, M_DEVBUF);
+			drbr_free(vpath->br, M_DEVBUF);
 #endif
 		/* Free LRO memory */
 		if (vpath->lro_enable)
Index: sys/dev/vxge/vxge.h
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/dev/vxge/vxge.h	(revision 257322)
+++ sys/dev/vxge/vxge.h	(working copy)
@@ -337,7 +337,7 @@ typedef struct _vxge_vpath_t {
 	struct		lro_ctrl lro;
=20
 #if __FreeBSD_version >=3D 800000
-	struct		buf_ring *br;
+	struct		drbr_ring *br;
 #endif
=20
 } vxge_vpath_t;
Index: sys/kern/kern_mbuf.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/kern/kern_mbuf.c	(revision 257322)
+++ sys/kern/kern_mbuf.c	(working copy)
@@ -653,7 +653,6 @@ m_pkthdr_init(struct mbuf *m, int how)
 	m->m_pkthdr.flowid =3D 0;
 	m->m_pkthdr.csum_flags =3D 0;
 	m->m_pkthdr.fibnum =3D 0;
-	m->m_pkthdr.cosqos =3D 0;
 	m->m_pkthdr.rsstype =3D 0;
 	m->m_pkthdr.l2hlen =3D 0;
 	m->m_pkthdr.l3hlen =3D 0;
@@ -661,6 +660,7 @@ m_pkthdr_init(struct mbuf *m, int how)
 	m->m_pkthdr.l5hlen =3D 0;
 	m->m_pkthdr.PH_per.sixtyfour[0] =3D 0;
 	m->m_pkthdr.PH_loc.sixtyfour[0] =3D 0;
+	m->m_pkthdr.cosqos =3D 0xff; /*drbr_maxq-1;*/
 #ifdef MAC
 	/* If the label init fails, fail the alloc */
 	error =3D mac_mbuf_init(m, how);
Index: sys/kern/subr_bufring.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/kern/subr_bufring.c	(revision 257322)
+++ sys/kern/subr_bufring.c	(working copy)
@@ -34,6 +34,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/malloc.h>
 #include <sys/ktr.h>
 #include <sys/buf_ring.h>
+#include <sys/mbuf.h>
=20
=20
 struct buf_ring *
@@ -63,3 +64,317 @@ buf_ring_free(struct buf_ring *br, struct malloc_t
 {
 	free(br, type);
 }
+
+/*
+ * multi-producer safe lock-free ring buffer enqueue
+ *
+ */
+extern uint32_t panic_on_dup_buf;
+
+int
+buf_ring_mbufon(struct buf_ring *br, void *buf)
+{
+	int i;
+	/* We don't count what the driver is peeking at */
+	for (i =3D br->br_cons_head; i !=3D br->br_prod_head;
+	     i =3D ((i + 1) & br->br_cons_mask)) {
+		if(br->br_ring[i] =3D=3D buf) {
+			return(1);
+		}
+	}
+	return(0);
+}
+
+__attribute__((noinline))
+int
+buf_ring_enqueue(struct buf_ring *br, void *buf)
+{
+	uint32_t prod_head, prod_next;
+	uint32_t cons_tail;
+#ifdef DEBUG_BUFRING
+	int i;
+	critical_enter();
+	mb();
+	for (i =3D br->br_cons_head; i !=3D br->br_prod_head;
+	     i =3D ((i + 1) & br->br_cons_mask))
+		if(br->br_ring[i] =3D=3D buf) {
+			if (panic_on_dup_buf)
+				panic("help br:%p buf:%p", br, buf);
+			critical_exit();
+			return(0);
+		}
+#else
+	critical_enter();
+#endif=09
+	do {
+		prod_head =3D br->br_prod_head;
+		cons_tail =3D br->br_cons_tail;
+
+		prod_next =3D (prod_head + 1) & br->br_prod_mask;
+	=09
+		if (prod_next =3D=3D cons_tail) {
+			br->br_drops++;
+			critical_exit();
+			return (ENOBUFS);
+		}
+	} while (!atomic_cmpset_int(&br->br_prod_head, prod_head, =
prod_next));
+#ifdef DEBUG_BUFRING
+	if (br->br_ring[prod_head] !=3D NULL) {
+		printf("Dangling value in enqueue %d br:%p\n",=20
+		       prod_head, br);
+	}
+#endif=09
+	br->br_ring[prod_head] =3D buf;
+
+	/*
+	 * The full memory barrier also prevents the br_prod_tail store
+	 * from being reordered before br_ring[prod_head] is fully set up.
+	 */
+	mb();
+
+	/*
+	 * If there are other enqueues in progress
+	 * that preceded us, we need to wait for them
+	 * to complete=20
+	 */  =20
+	while (br->br_prod_tail !=3D prod_head)
+		cpu_spinwait();
+	br->br_prod_tail =3D prod_next;
+	critical_exit();
+	return (0);
+}
+
+/*
+ * multi-consumer safe dequeue=20
+ *
+ */
+void *
+buf_ring_dequeue_mc(struct buf_ring *br)
+{
+	uint32_t cons_head, cons_next;
+	uint32_t prod_tail;
+	void *buf;
+	int success;
+
+	critical_enter();
+	do {
+		cons_head =3D br->br_cons_head;
+		prod_tail =3D br->br_prod_tail;
+
+		cons_next =3D (cons_head + 1) & br->br_cons_mask;
+	=09
+		if (cons_head =3D=3D prod_tail) {
+			critical_exit();
+			return (NULL);
+		}
+	=09
+		success =3D atomic_cmpset_int(&br->br_cons_head, =
cons_head,
+		    cons_next);
+	} while (success =3D=3D 0);	=09
+
+	buf =3D br->br_ring[cons_head];
+#ifdef DEBUG_BUFRING
+	br->br_ring[cons_head] =3D NULL;
+#endif
+	/*
+	 * The full memory barrier also prevents the br_ring[cons_head]
+	 * load from being reordered after br_cons_tail is set.
+	 */
+	mb();
+=09
+	/*
+	 * If there are other dequeues in progress
+	 * that preceded us, we need to wait for them
+	 * to complete=20
+	 */  =20
+	while (br->br_cons_tail !=3D cons_head)
+		cpu_spinwait();
+
+	br->br_cons_tail =3D cons_next;
+	critical_exit();
+
+	return (buf);
+}
+
+/*
+ * single-consumer dequeue=20
+ * use where dequeue is protected by a lock
+ * e.g. a network driver's tx queue lock
+ */
+void *
+buf_ring_dequeue_sc(struct buf_ring *br)
+{
+	uint32_t cons_head, cons_next, cons_next_next;
+	uint32_t prod_tail;
+	void *buf;
+=09
+	cons_head =3D br->br_cons_head;
+	prod_tail =3D br->br_prod_tail;
+=09
+	cons_next =3D (cons_head + 1) & br->br_cons_mask;
+	cons_next_next =3D (cons_head + 2) & br->br_cons_mask;
+=09
+	if (cons_head =3D=3D prod_tail)=20
+		return (NULL);
+
+#ifdef PREFETCH_DEFINED=09
+	if (cons_next !=3D prod_tail) {	=09
+		prefetch(br->br_ring[cons_next]);
+		if (cons_next_next !=3D prod_tail)=20
+			prefetch(br->br_ring[cons_next_next]);
+	}
+#endif
+	br->br_cons_head =3D cons_next;
+	buf =3D br->br_ring[cons_head];
+
+#ifdef DEBUG_BUFRING
+	br->br_ring[cons_head] =3D NULL;
+#endif
+	br->br_cons_tail =3D cons_next;
+	return (buf);
+}
+
+/*
+ * single-consumer advance after a peek
+ * use where it is protected by a lock
+ * e.g. a network driver's tx queue lock
+ */
+void
+buf_ring_advance_sc(struct buf_ring *br)
+{
+	uint32_t cons_head, cons_next;
+	uint32_t prod_tail;
+=09
+	cons_head =3D br->br_cons_head;
+	prod_tail =3D br->br_prod_tail;
+=09
+	cons_next =3D (cons_head + 1) & br->br_cons_mask;
+	if (cons_head =3D=3D prod_tail)=20
+		return;
+	br->br_cons_head =3D cons_next;
+	br->br_cons_tail =3D cons_next;
+}
+
+void
+buf_ring_advance_mc(struct buf_ring *br)
+{
+	uint32_t cons_head, cons_next;
+	uint32_t prod_tail;
+	int success;
+
+	critical_enter();
+	do {
+		cons_head =3D br->br_cons_head;
+		prod_tail =3D br->br_prod_tail;
+
+		cons_next =3D (cons_head + 1) & br->br_cons_mask;
+	=09
+		if (cons_head =3D=3D prod_tail) {
+			critical_exit();
+			return;
+		}
+	=09
+		success = atomic_cmpset_int(&br->br_cons_head, cons_head,
+		    cons_next);
+	} while (success == 0);
+	/*
+	 * The full memory barrier orders any earlier access to the
+	 * peeked br_ring[cons_head] slot before br_cons_tail is set.
+	 */
+	mb();
+
+	/*
+	 * If there are other dequeues in progress
+	 * that preceded us, we need to wait for them
+	 * to complete.
+	 */
+	while (br->br_cons_tail != cons_head)
+		cpu_spinwait();
+
+	br->br_cons_tail =3D cons_next;
+	critical_exit();
+}
+
+
+/*
+ * Used to return a buffer (most likely already there)
+ * to the top of the ring. The caller should *not*
+ * have used any dequeue to pull it out of the ring
+ * but instead should have used the peek() function.
+ * This is normally used where the transmit queue
+ * of a driver is full, and an mbuf must be returned.
+ * Most likely what's in the ring-buffer is what
+ * is being put back (since it was not removed), but
+ * sometimes the lower transmit function may have
+ * done a pullup or other function that will have
+ * changed it. As an optimization we always put it
+ * back (since jhb says the store is probably cheaper),
+ * if we have to do a multi-queue version we will need
+ * the compare and an atomic.
+ */
+void
+buf_ring_putback_mc(struct buf_ring *br, void *new)
+{
+	KASSERT(br->br_cons_head != br->br_prod_tail,
+		("Buf-Ring has none in putback"));
+	critical_enter();
+	br->br_ring[br->br_cons_head] =3D new;
+	mb();
+	critical_exit();
+}
+
+void
+buf_ring_putback_sc(struct buf_ring *br, void *new)
+{
+	KASSERT(br->br_cons_head != br->br_prod_tail,
+		("Buf-Ring has none in putback"));
+	br->br_ring[br->br_cons_head] =3D new;
+}
+
+/*
+ * return a pointer to the first entry in the ring
+ * without modifying it, or NULL if the ring is empty
+ * race-prone if not protected by a lock
+ */
+void *
+buf_ring_peek(struct buf_ring *br)
+{
+	struct mbuf *m;
+#ifdef DEBUG_BUFRING
+	if ((br->br_lock != NULL) && !mtx_owned(br->br_lock)) {
+		printf("br:%p lock not held on single consumer peek\n",
+		       br);
+	}
+
+#endif
+	if (br->br_cons_head == br->br_prod_tail)
+		return (NULL);
+	m =3D br->br_ring[br->br_cons_head];
+#ifdef DEBUG_BUFRING
+	br->br_ring[br->br_cons_head] =3D NULL;
+	mb();
+#endif
+	return (m);
+}
+
+int
+buf_ring_full(struct buf_ring *br)
+{
+
+	return (((br->br_prod_head + 1) & br->br_prod_mask) == br->br_cons_tail);
+}
+
+int
+buf_ring_empty(struct buf_ring *br)
+{
+
+	return (br->br_cons_head =3D=3D br->br_prod_tail);
+}
+
+int
+buf_ring_count(struct buf_ring *br)
+{
+
+	return ((br->br_prod_size + br->br_prod_tail - br->br_cons_tail)
+	    & br->br_prod_mask);
+}
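
The index arithmetic used by buf_ring_full(), buf_ring_empty() and
buf_ring_count() above can be tried outside the kernel. The following is
only a sketch under stated assumptions: struct ring_idx and the ring_idx_*
helpers are hypothetical names, not part of the patch, and the code is
single-threaded, so none of the atomics or memory barriers are modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the index fields of struct buf_ring. */
struct ring_idx {
	uint32_t prod_head, prod_tail;
	uint32_t cons_head, cons_tail;
	uint32_t size, mask;	/* size is a power of 2, mask = size - 1 */
};

static void
ring_idx_init(struct ring_idx *r, uint32_t size)
{
	/* The masking below is only correct for power-of-2 sizes. */
	assert(size != 0 && (size & (size - 1)) == 0);
	r->prod_head = r->prod_tail = 0;
	r->cons_head = r->cons_tail = 0;
	r->size = size;
	r->mask = size - 1;
}

static int
ring_idx_empty(const struct ring_idx *r)
{
	return (r->cons_head == r->prod_tail);
}

/* One slot is always left unused so that full and empty differ. */
static int
ring_idx_full(const struct ring_idx *r)
{
	return (((r->prod_head + 1) & r->mask) == r->cons_tail);
}

static uint32_t
ring_idx_count(const struct ring_idx *r)
{
	return ((r->size + r->prod_tail - r->cons_tail) & r->mask);
}

/* Advance the producer indices by one slot; -1 if the ring is full. */
static int
ring_idx_produce(struct ring_idx *r)
{
	if (ring_idx_full(r))
		return (-1);
	r->prod_head = r->prod_tail = (r->prod_head + 1) & r->mask;
	return (0);
}

/* Advance the consumer indices by one slot; -1 if the ring is empty. */
static int
ring_idx_consume(struct ring_idx *r)
{
	if (ring_idx_empty(r))
		return (-1);
	r->cons_head = r->cons_tail = (r->cons_head + 1) & r->mask;
	return (0);
}
```

With these definitions a ring of size 8 holds at most 7 entries, which is
why drbr_queue_depth is rounded to a power of 2 in drbr_alloc().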
Index: sys/kern/subr_bus.c
===================================================================
--- sys/kern/subr_bus.c	(revision 257322)
+++ sys/kern/subr_bus.c	(working copy)
@@ -2722,7 +2722,7 @@ device_probe(device_t dev)
 	}
 	return (0);
 }
-
+uint32_t simp_bus_debug=3D0;
 /**
  * @brief Probe a device and attach a driver if possible
  *
@@ -2742,6 +2742,11 @@ device_probe_and_attach(device_t dev)
 		return (error);
=20
 	CURVNET_SET_QUIET(vnet0);
+	if (simp_bus_debug) {
+		printf("%s:Attach for device %p\n",
+		       __func__,
+		       dev);
+	}
 	error =3D device_attach(dev);
 	CURVNET_RESTORE();
 	return error;
@@ -2778,12 +2783,20 @@ device_attach(device_t dev)
 			 device_printf(dev, "disabled via hints =
entry\n");
 		return (ENXIO);
 	}
-
+	if (simp_bus_debug) {
+		device_printf(dev, "init its sysctl info\n");
+	}
 	device_sysctl_init(dev);
 	if (!device_is_quiet(dev))
 		device_print_child(dev->parent, dev);
 	attachtime =3D get_cyclecount();
 	dev->state =3D DS_ATTACHING;
+	if (simp_bus_debug) {
+		device_printf(dev, "Calling attach\n");
+	}
+	if (simp_bus_debug) {
+		device_printf(dev, "call the attach\n");
+	}
 	if ((error =3D DEVICE_ATTACH(dev)) !=3D 0) {
 		printf("device_attach: %s%d attach returned %d\n",
 		    dev->driver->name, dev->unit, error);
@@ -2812,6 +2825,9 @@ device_attach(device_t dev)
 	else
 		dev->state =3D DS_ATTACHED;
 	dev->flags &=3D ~DF_DONENOMATCH;
+	if (simp_bus_debug) {
+		device_printf(dev, "finish out...\n");
+	}
 	devadded(dev);
 	return (0);
 }
Index: sys/net/drbr.c
===================================================================
--- sys/net/drbr.c	(revision 0)
+++ sys/net/drbr.c	(working copy)
@@ -0,0 +1,507 @@
+#include <net/drbr.h>
+
+SYSCTL_DECL(_net_link);
+uint32_t drbr_maxq=3DDRBR_MAXQ_DEFAULT;
+
+TUNABLE_INT("net.link.drbr_maxq", &drbr_maxq);
+SYSCTL_NODE(_net, OID_AUTO, drbr, CTLFLAG_RD, 0, "DRBR Parameters");
+SYSCTL_INT(_net_drbr, OID_AUTO, drbr_maxq, CTLFLAG_RDTUN,
+    &drbr_maxq, 0, "max number of priority queues per interface");
+
+uint8_t set_up_drbr_depth=3D0;
+uint32_t drbr_max_priority=3DDRBR_MAXQ_DEFAULT-1;
+uint32_t drbr_queue_depth=3DDRBR_MIN_DEPTH;
+uint32_t panic_on_dup_buf =3D 0;
+uint32_t use_drbr_lock =3D 0;
+
+SYSCTL_INT(_net_drbr, OID_AUTO, drbr_queue_depth, CTLFLAG_RD,
+    &drbr_queue_depth, 0, "Queue length configured via ifqmaxlen");
+
+SYSCTL_INT(_net_drbr, OID_AUTO, drbr_max_priority, CTLFLAG_RD,
+    &drbr_max_priority, 0, "Index of the last priority queue");
+
+SYSCTL_INT(_net_drbr, OID_AUTO, drbr_panicdup, CTLFLAG_RW,
+    &panic_on_dup_buf, 0, "Panic on dup buf into br ring");
+
+SYSCTL_INT(_net_drbr, OID_AUTO, drbr_usemtx, CTLFLAG_RW,
+    &use_drbr_lock, 0, "Use drbr mtx");
+
+struct drbr_ring *
+drbr_alloc(struct malloc_type *type, int flags, struct mtx *tmtx)
+{
+	struct drbr_ring *rng;
+	int i;
+	if (set_up_drbr_depth =3D=3D 0) {
+		drbr_max_priority =3D drbr_maxq-1;
+		set_up_drbr_depth =3D 1;
+		drbr_queue_depth =3D 1 << ((fls(ifqmaxlen)-1));
+		if (drbr_queue_depth < DRBR_MIN_DEPTH) {
+			drbr_queue_depth =3D DRBR_MIN_DEPTH;
+		}
+	}
+	rng = (struct drbr_ring *)malloc(sizeof(struct drbr_ring), type, flags);
+	if (rng == NULL) {
+		return(NULL);
+	}
+	memset(rng, 0, sizeof(struct drbr_ring));
+	DRBR_LOCK_INIT(rng);
+	rng->re = (struct drbr_ring_entry *)malloc((sizeof(struct drbr_ring_entry)*drbr_maxq),
+			 type, flags);
+	if (rng->re == NULL) {
+		free(rng, type);
+		return(NULL);
+	}
+	memset(rng->re, 0, (sizeof(struct drbr_ring_entry) * drbr_maxq));
+	/* Ok get the queues */
+	for (i=0; i<drbr_maxq; i++) {
+		rng->re[i].re_qs = buf_ring_alloc(drbr_queue_depth, type, flags, tmtx);
+		if (rng->re[i].re_qs =3D=3D NULL) {
+			goto out_err;
+		}
+	}
+	rng->lowq_with_data =3D 0xffffffff;
+	return(rng);
+out_err:
+	for(i=3D0; i<drbr_maxq; i++) {
+		if (rng->re[i].re_qs) {
+			free(rng->re[i].re_qs, type);
+		}
+	}
+	free(rng->re, type);
+	free(rng, type);
+	return (NULL);
+}
+
+#define PRIO_NAME_LEN 32
+void
+drbr_add_sysctl_stats(device_t dev, struct sysctl_oid_list *queue_list,
+		      struct drbr_ring *rng)
+{
+	int i;
+	struct sysctl_ctx_list *ctx =3D device_get_sysctl_ctx(dev);
+	struct sysctl_oid *prio_node;
+	struct sysctl_oid_list *prio_list;
+	char namebuf[PRIO_NAME_LEN];
+
+	if (rng =3D=3D NULL)
+		/* TSNH */
+		return;
+	for (i=3D0; i<drbr_maxq; i++) {
+		snprintf(namebuf, PRIO_NAME_LEN, "prio%d", i);
+	=09
+		prio_node = SYSCTL_ADD_NODE(ctx, queue_list, OID_AUTO, namebuf,
+					    CTLFLAG_RD, NULL, "Priority Info");
+		prio_list = SYSCTL_CHILDREN(prio_node);
+		SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, "packets_sent",
+				CTLFLAG_RD, &rng->re[i].re_cnt_sent,
+				"Packets Enqueued");
+		SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, "bytes_sent",
+				CTLFLAG_RD, &rng->re[i].re_bytecnt_sent,
+				"Bytes Enqueued");
+		SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, "dropped_packets",
+				CTLFLAG_RD, &rng->re[i].re_drop_cnt,
+				"Packets Dropped");
+		SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, "dropped_bytes",
+				CTLFLAG_RD, &rng->re[i].re_bytedrop_cnt,
+				"Bytes Dropped");
+		SYSCTL_ADD_UINT(ctx, prio_list, OID_AUTO, "on_queue_now",
+				CTLFLAG_RD, &rng->re[i].re_cnt, 0,
+				"Current Queue Size");
+
+	}
+
+}
+
+u_long
+drbr_get_dropcnt(struct drbr_ring *rng)
+{
+	u_long total;
+	int i;
+
+	total =3D 0;
+	for (i=3D0; i<drbr_maxq; i++) {
+		total +=3D rng->re[i].re_drop_cnt;
+	}
+	return (total);
+}
+
+void=20
+drbr_add_sysctl_stats_nodev(struct sysctl_oid_list *queue_list,=20
+			    struct sysctl_ctx_list *ctx,
+			    struct drbr_ring *rng)
+{
+	int i;
+	struct sysctl_oid *prio_node;
+	struct sysctl_oid_list *prio_list;
+	char namebuf[PRIO_NAME_LEN];
+
+	if (rng =3D=3D NULL)
+		return;
+	for (i=3D0; i<drbr_maxq; i++) {
+		snprintf(namebuf, PRIO_NAME_LEN, "prio%d", i);
+		prio_node = SYSCTL_ADD_NODE(ctx, queue_list, OID_AUTO, namebuf,
+					CTLFLAG_RD, NULL, "Priority Info");
+		prio_list = SYSCTL_CHILDREN(prio_node);
+		SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, "packets_sent",
+				CTLFLAG_RD, &rng->re[i].re_cnt_sent,
+				"Packets Enqueued");
+		SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, "bytes_sent",
+				CTLFLAG_RD, &rng->re[i].re_bytecnt_sent,
+				"Bytes Enqueued");
+		SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, "dropped_packets",
+				CTLFLAG_RD, &rng->re[i].re_drop_cnt,
+				"Packets Dropped");
+		SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, "dropped_bytes",
+				CTLFLAG_RD, &rng->re[i].re_bytedrop_cnt,
+				"Bytes Dropped");
+		SYSCTL_ADD_UINT(ctx, prio_list, OID_AUTO, "on_queue_now",
+				CTLFLAG_RD, &rng->re[i].re_cnt, 0,
+				"Current Queue Size");
+	}
+}
+
+int
+drbr_enqueue(struct ifnet *ifp, struct drbr_ring *rng, struct mbuf *m)
+{=09
+	int error =3D 0;
+	uint8_t qused;
+	uint64_t bytecnt;
+	int locked =3D 0;
+
+#ifdef ALTQ
+	if ((ifp !=3D NULL) &&=20
+	    (ALTQ_IS_ENABLED(&ifp->if_snd))) {
+		IFQ_ENQUEUE(&ifp->if_snd, m, error);
+		return (error);
+	}
+#endif
+	if (use_drbr_lock) {
+		DRBR_LOCK(rng);
+		locked =3D 1;
+	}
+	if (m->m_pkthdr.cosqos >=3D drbr_maxq) {
+		/* Lowest priority queue */
+		qused =3D drbr_maxq - 1;
+	} else {
+		qused =3D m->m_pkthdr.cosqos;
+	}
+	bytecnt =3D m->m_pkthdr.len;
+	error = buf_ring_enqueue(rng->re[qused].re_qs, m);
+	if (error) {
+		m_freem(m);
+		atomic_add_long(&rng->re[qused].re_drop_cnt, 1);
+		atomic_add_long(&rng->re[qused].re_bytedrop_cnt, bytecnt);
+	} else {
+		if (qused < rng->lowq_with_data) {
+			atomic_clear_int(&rng->lowq_with_data, 0xffffffff);
+			atomic_set_int(&rng->lowq_with_data, qused);
+		}
+		atomic_add_int(&rng->count_on_queues, 1);
+		atomic_add_int(&rng->re[qused].re_cnt, 1);
+		atomic_add_long(&rng->re[qused].re_cnt_sent, 1);
+		atomic_add_long(&rng->re[qused].re_bytecnt_sent, bytecnt);
+	}
+	if (locked) {
+		DRBR_UNLOCK(rng);
+	}
+	return (error);
+}
+
+int
+drbr_is_on_ring(struct drbr_ring *rng, struct mbuf *m)
+{
+	int locked = 0;
+	int answer = 0; /* Not on the ring by default */
+	int i;
+	if (use_drbr_lock) {
+		DRBR_LOCK(rng);
+		locked =3D 1;
+	}
+	for(i=3D0; i<drbr_maxq;i++) {
+		if (buf_ring_empty(rng->re[i].re_qs))
+			continue;
+		if (buf_ring_mbufon(rng->re[i].re_qs, m)) {
+			answer =3D 1;
+			break;
+		}
+	}=09
+	if (locked) {
+		DRBR_UNLOCK(rng);
+	}
+	return(answer);
+}
+
+void
+drbr_putback(struct ifnet *ifp, struct drbr_ring *rng, struct mbuf *new, uint8_t qused)
+{
+	/*
+	 * The top of the list needs to be swapped=20
+	 * for this one.
+	 */
+	int locked =3D 0;
+	if (use_drbr_lock) {
+		DRBR_LOCK(rng);
+		locked =3D 1;
+	}
+	buf_ring_putback_mc(rng->re[qused].re_qs, new);
+	if (locked) {
+		DRBR_UNLOCK(rng);
+	}
+}
+
+struct mbuf *
+drbr_peek(struct ifnet *ifp, struct drbr_ring *rng, uint8_t *qused)
+{
+	int i;
+	int locked =3D 0;
+	struct mbuf *m;
+	if (use_drbr_lock) {
+		DRBR_LOCK(rng);
+		locked =3D 1;
+	}
+	if (rng->count_on_queues =3D=3D 0) {
+		/* All done now */
+		if (locked) {
+			DRBR_UNLOCK(rng);
+		}
+		return (NULL);
+	}
+	if (rng->lowq_with_data =3D=3D 0xffffffff) {
+		rng->lowq_with_data =3D 0;
+	}
+	for(i=3Drng->lowq_with_data; i<drbr_maxq;i++) {
+		if (buf_ring_empty(rng->re[i].re_qs))
+			continue;
+		rng->lowq_with_data =3D i;
+		break;
+	}
+	if (i >= drbr_maxq) {
+		/* Nothing on ring from marker up? Rescan from queue 0. */
+		rng->lowq_with_data = 0;
+		for (i=rng->lowq_with_data; i<drbr_maxq;i++) {
+			if(buf_ring_empty(rng->re[i].re_qs))
+				continue;
+			rng->lowq_with_data = i;
+			break;
+		}
+		if (i >= drbr_maxq) {
+			/* Count was off? All queues are empty. */
+			rng->count_on_queues = 0;
+			if (locked) {
+				DRBR_UNLOCK(rng);
+			}
+			return (NULL);
+		}
+	}
+	*qused =3D i;
+	m =3D buf_ring_peek(rng->re[i].re_qs);
+	if (locked) {
+		DRBR_UNLOCK(rng);
+	}
+	return(m);
+}
+
+static void
+drbr_flush_locked(struct ifnet *ifp, struct drbr_ring *rng)
+{
+	int i;
+	struct mbuf *m;
+	int locked = 0;
+	if (rng == NULL) {
+		return;
+	}
+	if (use_drbr_lock) {
+		DRBR_LOCK(rng);
+		locked = 1;
+	}
+	for(i=0; i<drbr_maxq; i++) {
+		while ((m = buf_ring_dequeue_mc(rng->re[i].re_qs)) != NULL) {
+			atomic_subtract_long(&rng->re[i].re_cnt_sent, 1);
+			if (ifp) {
+				ifp->if_oerrors++;
+			}
+			m_freem(m);
+		}
+		rng->re[i].re_cnt =3D 0;
+	}
+	rng->lowq_with_data =3D 0xffffffff;
+	rng->count_on_queues =3D 0;
+	if (locked) {
+		DRBR_UNLOCK(rng);
+	}
+}
+
+void
+drbr_flush(struct ifnet *ifp, struct drbr_ring *rng)
+{
+	drbr_flush_locked(ifp, rng);
+}
+
+void
+drbr_free(struct drbr_ring *rng, struct malloc_type *type)
+{
+	int i;
+	int locked =3D 0;
+	if (rng =3D=3D NULL) {
+		return;
+	}
+	drbr_flush_locked(NULL, rng);
+
+	if (use_drbr_lock) {
+		DRBR_LOCK(rng);
+		locked =3D 1;
+	}
+	for(i=3D0; i<drbr_maxq; i++) {
+		if (rng->re[i].re_qs) {
+			buf_ring_free(rng->re[i].re_qs, type);
+		}
+	}
+	DRBR_LOCK_DESTROY(rng);
+	free(rng->re, type);
+	free(rng, type);
+}
+
+struct mbuf *
+drbr_dequeue(struct ifnet *ifp, struct drbr_ring *rng)
+{
+	int i;
+	struct mbuf *m;
+	int locked =3D 0;
+	if (use_drbr_lock) {
+		DRBR_LOCK(rng);
+		locked =3D 1;
+	}
+	if (rng->count_on_queues =3D=3D 0) {
+		if (locked) {
+			DRBR_UNLOCK(rng);
+		}
+		return (NULL);
+	}
+	if (rng->lowq_with_data =3D=3D 0xffffffff) {
+		rng->lowq_with_data =3D 0;
+	}
+	for(i=3Drng->lowq_with_data; i<drbr_maxq;i++) {
+		if (buf_ring_empty(rng->re[i].re_qs))
+			continue;
+		rng->lowq_with_data =3D i;
+		break;
+	}
+#ifdef INVARIANTS
+	if (i >= drbr_maxq) {
+		/* Nothing on ring from marker up? */
+		rng->lowq_with_data = 0;
+		for (i=rng->lowq_with_data; i<drbr_maxq;i++) {
+			if(buf_ring_empty(rng->re[i].re_qs))
+				continue;
+			rng->lowq_with_data = i;
+			break;
+		}
+		if (i >= drbr_maxq) {
+			/* Count was off? */
+			rng->count_on_queues = 0;
+			if (locked) {
+				DRBR_UNLOCK(rng);
+			}
+			return (NULL);
+		}
+	}
+#else
+	if (i >=3D drbr_maxq) {
+		/* Huh */
+		i =3D 0;
+	}
+#endif
+	m =3D buf_ring_dequeue_mc(rng->re[i].re_qs);
+	if (m) {
+		atomic_subtract_int(&rng->re[i].re_cnt, 1);
+		atomic_subtract_int(&rng->count_on_queues, 1);
+		if (rng->count_on_queues =3D=3D 0) {
+			atomic_set_int(&rng->lowq_with_data, 0xffffffff);
+		}
+	} else {
+		/* TSNH */
+		rng->re[i].re_cnt =3D 0;
+	}
+	if (locked) {
+		DRBR_UNLOCK(rng);
+	}
+	return(m);
+}
+
+void
+drbr_advance(struct ifnet *ifp, struct drbr_ring *rng, uint8_t qused)
+{
+	int locked =3D 0;
+	if (use_drbr_lock) {
+		DRBR_LOCK(rng);
+		locked =3D 1;
+	}
+	if (rng->count_on_queues =3D=3D 0) {
+		/* Huh? */
+		if (locked) {
+			DRBR_UNLOCK(rng);
+		}
+		return;
+	}
+	atomic_subtract_int(&rng->count_on_queues, 1);
+	if (rng->count_on_queues =3D=3D 0) {
+		atomic_set_int(&rng->lowq_with_data, 0xffffffff);
+	}
+	buf_ring_advance_mc(rng->re[qused].re_qs);
+	atomic_subtract_int(&rng->re[qused].re_cnt, 1);
+	if (locked) {
+		DRBR_UNLOCK(rng);
+	}
+}
+
+struct mbuf *
+drbr_dequeue_cond(struct ifnet *ifp, struct drbr_ring *rng,
+    int (*func) (struct mbuf *, void *), void *arg)=20
+{
+	uint8_t qused;
+	struct mbuf *m;
+	int locked = 0;
+
+	/*
+	 * drbr_peek() takes the ring lock itself when use_drbr_lock is
+	 * set. The mutex is initialized without MTX_RECURSE, so it must
+	 * not be held across this call.
+	 */
+	qused = 0;
+	m = drbr_peek(ifp, rng, &qused);
+
+	if (m == NULL || func(m, arg) == 0) {
+		return (NULL);
+	}
+	if (use_drbr_lock) {
+		DRBR_LOCK(rng);
+		locked =3D 1;
+	} else {
+		locked =3D 0;
+	}
+	atomic_subtract_int(&rng->re[qused].re_cnt, 1);
+	atomic_subtract_int(&rng->count_on_queues, 1);
+	m =3D buf_ring_dequeue_mc(rng->re[qused].re_qs);
+	if (locked) {
+		DRBR_UNLOCK(rng);
+	}
+	return (m);
+}
+
+int
+drbr_empty(struct ifnet *ifp, struct drbr_ring *rng)
+{
+	return (!rng->count_on_queues);
+}
+
+int
+drbr_needs_enqueue(struct ifnet *ifp, struct drbr_ring *rng)
+{
+	return (!(rng->count_on_queues =3D=3D 0));
+}
+
+int
+drbr_inuse(struct ifnet *ifp, struct drbr_ring *rng)
+{
+	return (rng->count_on_queues);
+}

Property changes on: sys/net/drbr.c
___________________________________________________________________
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
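
For readers following the queue selection in drbr_enqueue() and
drbr_dequeue() above: a priority at or beyond the number of queues is
clamped to the last (lowest-priority) queue, and dequeue always serves the
lowest-numbered queue that has data. The userspace sketch below illustrates
only that policy. struct prio_rings, NQ, DEPTH and the prio_* functions are
hypothetical names, there is no locking or atomic accounting, and the
lowq_with_data scan hint is replaced by a plain scan from queue 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NQ	8	/* mirrors DRBR_MAXQ_DEFAULT */
#define DEPTH	4	/* tiny per-queue depth, for illustration only */

/* Hypothetical stand-in for struct drbr_ring: NQ small FIFOs. */
struct prio_rings {
	void	*slot[NQ][DEPTH];
	uint32_t head[NQ];	/* index of the oldest entry per queue */
	uint32_t count[NQ];	/* entries currently on each queue */
	uint32_t total;		/* entries on all queues */
};

/* Enqueue on the queue chosen by prio; returns -1 if that queue is full. */
static int
prio_enqueue(struct prio_rings *p, uint32_t prio, void *item)
{
	uint32_t q;

	assert(p != NULL);
	/* Out-of-range priorities fall into the lowest priority queue. */
	q = (prio >= NQ) ? NQ - 1 : prio;
	if (p->count[q] == DEPTH)
		return (-1);
	p->slot[q][(p->head[q] + p->count[q]) % DEPTH] = item;
	p->count[q]++;
	p->total++;
	return (0);
}

/* Dequeue from the lowest-numbered non-empty queue, or return NULL. */
static void *
prio_dequeue(struct prio_rings *p)
{
	void *item;
	uint32_t q;

	assert(p != NULL);
	if (p->total == 0)
		return (NULL);
	for (q = 0; q < NQ; q++) {
		if (p->count[q] == 0)
			continue;
		item = p->slot[q][p->head[q]];
		p->head[q] = (p->head[q] + 1) % DEPTH;
		p->count[q]--;
		p->total--;
		return (item);
	}
	return (NULL);
}
```

Because queue 0 is always served first, sustained traffic at priority 0 can
starve the other queues; the sketch shares that property with the patch.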
Index: sys/net/drbr.h
===================================================================
--- sys/net/drbr.h	(revision 0)
+++ sys/net/drbr.h	(working copy)
@@ -0,0 +1,89 @@
+#ifndef __drbr_h__
+#define __drbr_h__
+#include <sys/param.h>
+#ifdef _KERNEL
+#include <sys/systm.h>
+#include <sys/buf_ring.h>
+#include <sys/endian.h>
+#include <sys/kernel.h>
+#include <sys/malloc.h>
+#include <sys/mbuf.h>
+#include <sys/pcpu.h>
+#include <sys/smp.h>
+#include <sys/bus.h>
+#include <machine/smp.h>
+#include <machine/bus.h>
+#include <machine/resource.h>
+#endif
+#include <sys/socket.h>
+#include <sys/sockio.h>
+#include <sys/sysctl.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <net/if_types.h>
+#include <netinet/in.h>
+
+#define DRBR_MAXQ_DEFAULT 8
+#define DRBR_MIN_DEPTH 64	/* Must be power of 2 */
+
+#define USE_LOCK
+
+#ifdef _KERNEL
+extern uint32_t drbr_maxq;
+#endif
+
+struct drbr_ring_entry {
+	struct buf_ring		*re_qs;		/* Ring itself */
+	u_long			re_drop_cnt;	/* Drop count in pkts */
+	u_long			re_bytedrop_cnt;/* Drop count in bytes */
+	u_long			re_cnt_sent;	/* Total sent in pkts */
+	u_long			re_bytecnt_sent;/* Total sent in bytes */
+	uint32_t		re_cnt;		/* Count on ring */
+};
+
+#define DRBR_LOCK_INIT(rng) mtx_init(&(rng)->rng_mtx, "drbr_lock", "drbr", MTX_DEF | MTX_DUPOK)
+#define DRBR_LOCK_DESTROY(rng) 	mtx_destroy(&(rng)->rng_mtx)
+#define DRBR_LOCK(rng) 	mtx_lock(&(rng)->rng_mtx)
+#define DRBR_UNLOCK(rng) mtx_unlock(&(rng)->rng_mtx)
+#define DRBR_LOCK_OWNED(rng) mtx_owned(&(rng)->rng_mtx)
+
+struct drbr_ring {
+#ifdef _KERNEL
+	struct mtx 		rng_mtx;
+#endif
+	struct drbr_ring_entry *re;
+	uint32_t		count_on_queues;
+	uint32_t		lowq_with_data;
+};
+
+#ifdef _KERNEL
+struct drbr_ring *
+drbr_alloc(struct malloc_type *type, int flags, struct mtx *tmtx);
+int drbr_enqueue(struct ifnet *ifp, struct drbr_ring *rng, struct mbuf *m);
+void drbr_putback(struct ifnet *ifp, struct drbr_ring *rng, struct mbuf *new,
+	uint8_t qused);
+struct mbuf *drbr_peek(struct ifnet *ifp, struct drbr_ring *rng,
+	uint8_t *qused);
+void drbr_flush(struct ifnet *ifp, struct drbr_ring *rng);
+void drbr_free(struct drbr_ring *rng, struct malloc_type *type);
+struct mbuf *drbr_dequeue(struct ifnet *ifp, struct drbr_ring *rng);
+void drbr_advance(struct ifnet *ifp, struct drbr_ring *rng, uint8_t qused);
+struct mbuf *
+drbr_dequeue_cond(struct ifnet *ifp, struct drbr_ring *rng,
+	int (*func) (struct mbuf *, void *), void *arg) ;
+int drbr_empty(struct ifnet *ifp, struct drbr_ring *rng);
+int drbr_needs_enqueue(struct ifnet *ifp, struct drbr_ring *rng);
+int drbr_inuse(struct ifnet *ifp, struct drbr_ring *rng);
+void drbr_add_sysctl_stats(device_t dev, struct sysctl_oid_list *queue_list,
+      struct drbr_ring *rng);
+void
+drbr_add_sysctl_stats_nodev(struct sysctl_oid_list *queue_list,
+      struct sysctl_ctx_list *ctx,
+      struct drbr_ring *rng);
+
+int drbr_is_on_ring(struct drbr_ring *rng, struct mbuf *m);
+u_long drbr_get_dropcnt(struct drbr_ring *rng);
+
+#endif
+
+#endif

Property changes on: sys/net/drbr.h
___________________________________________________________________
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Index: sys/net/if_var.h
===================================================================
--- sys/net/if_var.h	(revision 257322)
+++ sys/net/if_var.h	(working copy)
@@ -205,7 +205,14 @@ struct ifnet {
 	 */
 	char	if_cspare[3];
 	int	if_ispare[4];
-	void	*if_pspare[8];		/* 1 netmap, 7 TDB */
+	/* Set max bytes on ring - buffer bloat management */
+	void    (*if_maxbytes)(struct ifnet *, uint64_t maxbytes);
+	/* Get a drbr ring to peek at */
+	struct drbr_ring * (*if_getdrbr_ring)(struct ifnet *, int queuenum);
+	/* Is this mbuf on one of your rings? */
+	int    (*if_mbuf_on_ring)(struct ifnet *, struct mbuf *);
+
+	void	*if_pspare[5];		/* 1 netmap, 4 TDB */
 };
=20
 /*
@@ -599,165 +606,7 @@ if_initbaudrate(struct ifnet *ifp, uintmax_t baud)
 	ifp->if_baudrate =3D baud;
 }
=20
-static __inline int
-drbr_enqueue(struct ifnet *ifp, struct buf_ring *br, struct mbuf *m)
-{=09
-	int error =3D 0;
-
-#ifdef ALTQ
-	if (ALTQ_IS_ENABLED(&ifp->if_snd)) {
-		IFQ_ENQUEUE(&ifp->if_snd, m, error);
-		return (error);
-	}
 #endif
-	error =3D buf_ring_enqueue(br, m);
-	if (error)
-		m_freem(m);
-
-	return (error);
-}
-
-static __inline void
-drbr_putback(struct ifnet *ifp, struct buf_ring *br, struct mbuf *new)
-{
-	/*
-	 * The top of the list needs to be swapped=20
-	 * for this one.
-	 */
-#ifdef ALTQ
-	if (ifp !=3D NULL && ALTQ_IS_ENABLED(&ifp->if_snd)) {
-		/*=20
-		 * Peek in altq case dequeued it
-		 * so put it back.
-		 */
-		IFQ_DRV_PREPEND(&ifp->if_snd, new);
-		return;
-	}
-#endif
-	buf_ring_putback_sc(br, new);
-}
-
-static __inline struct mbuf *
-drbr_peek(struct ifnet *ifp, struct buf_ring *br)
-{
-#ifdef ALTQ
-	struct mbuf *m;
-	if (ifp !=3D NULL && ALTQ_IS_ENABLED(&ifp->if_snd)) {
-		/*=20
-		 * Pull it off like a dequeue
-		 * since drbr_advance() does nothing
-		 * for altq and drbr_putback() will
-		 * use the old prepend function.
-		 */
-		IFQ_DEQUEUE(&ifp->if_snd, m);
-		return (m);
-	}
-#endif
-	return(buf_ring_peek(br));
-}
-
-static __inline void
-drbr_flush(struct ifnet *ifp, struct buf_ring *br)
-{
-	struct mbuf *m;
-
-#ifdef ALTQ
-	if (ifp !=3D NULL && ALTQ_IS_ENABLED(&ifp->if_snd))
-		IFQ_PURGE(&ifp->if_snd);
-#endif=09
-	while ((m =3D buf_ring_dequeue_sc(br)) !=3D NULL)
-		m_freem(m);
-}
-
-static __inline void
-drbr_free(struct buf_ring *br, struct malloc_type *type)
-{
-
-	drbr_flush(NULL, br);
-	buf_ring_free(br, type);
-}
-
-static __inline struct mbuf *
-drbr_dequeue(struct ifnet *ifp, struct buf_ring *br)
-{
-#ifdef ALTQ
-	struct mbuf *m;
-
-	if (ifp !=3D NULL && ALTQ_IS_ENABLED(&ifp->if_snd)) {=09
-		IFQ_DEQUEUE(&ifp->if_snd, m);
-		return (m);
-	}
-#endif
-	return (buf_ring_dequeue_sc(br));
-}
-
-static __inline void
-drbr_advance(struct ifnet *ifp, struct buf_ring *br)
-{
-#ifdef ALTQ
-	/* Nothing to do here since peek dequeues in altq case */
-	if (ifp !=3D NULL && ALTQ_IS_ENABLED(&ifp->if_snd))
-		return;
-#endif
-	return (buf_ring_advance_sc(br));
-}
-
-
-static __inline struct mbuf *
-drbr_dequeue_cond(struct ifnet *ifp, struct buf_ring *br,
-    int (*func) (struct mbuf *, void *), void *arg)=20
-{
-	struct mbuf *m;
-#ifdef ALTQ
-	if (ALTQ_IS_ENABLED(&ifp->if_snd)) {
-		IFQ_LOCK(&ifp->if_snd);
-		IFQ_POLL_NOLOCK(&ifp->if_snd, m);
-		if (m !=3D NULL && func(m, arg) =3D=3D 0) {
-			IFQ_UNLOCK(&ifp->if_snd);
-			return (NULL);
-		}
-		IFQ_DEQUEUE_NOLOCK(&ifp->if_snd, m);
-		IFQ_UNLOCK(&ifp->if_snd);
-		return (m);
-	}
-#endif
-	m =3D buf_ring_peek(br);
-	if (m =3D=3D NULL || func(m, arg) =3D=3D 0)
-		return (NULL);
-
-	return (buf_ring_dequeue_sc(br));
-}
-
-static __inline int
-drbr_empty(struct ifnet *ifp, struct buf_ring *br)
-{
-#ifdef ALTQ
-	if (ALTQ_IS_ENABLED(&ifp->if_snd))
-		return (IFQ_IS_EMPTY(&ifp->if_snd));
-#endif
-	return (buf_ring_empty(br));
-}
-
-static __inline int
-drbr_needs_enqueue(struct ifnet *ifp, struct buf_ring *br)
-{
-#ifdef ALTQ
-	if (ALTQ_IS_ENABLED(&ifp->if_snd))
-		return (1);
-#endif
-	return (!buf_ring_empty(br));
-}
-
-static __inline int
-drbr_inuse(struct ifnet *ifp, struct buf_ring *br)
-{
-#ifdef ALTQ
-	if (ALTQ_IS_ENABLED(&ifp->if_snd))
-		return (ifp->if_snd.ifq_len);
-#endif
-	return (buf_ring_count(br));
-}
-#endif
 /*
  * 72 was chosen below because it is the size of a TCP/IP
  * header (40) + the minimum mss (32).
Index: sys/netinet/if_ether.c
===================================================================
--- sys/netinet/if_ether.c	(revision 257322)
+++ sys/netinet/if_ether.c	(working copy)
@@ -283,6 +283,7 @@ arprequest(struct ifnet *ifp, const struct in_addr
 	sa.sa_len =3D 2;
 	m->m_flags |=3D M_BCAST;
 	m_clrprotoflags(m);	/* Avoid confusing lower layers. */
+	m->m_pkthdr.cosqos =3D 0; /* Highest Priority */
 	(*ifp->if_output)(ifp, m, &sa, NULL);
 	ARPSTAT_INC(txrequests);
 }
Index: sys/ofed/drivers/net/mlx4/en_tx.c
===================================================================
--- sys/ofed/drivers/net/mlx4/en_tx.c	(revision 257322)
+++ sys/ofed/drivers/net/mlx4/en_tx.c	(working copy)
@@ -39,6 +39,7 @@
=20
 #include <net/ethernet.h>
 #include <net/if_vlan_var.h>
+#include <net/drbr.h>
 #include <sys/mbuf.h>
=20
 #include <netinet/in_systm.h>
@@ -78,7 +79,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *pr
 	mtx_init(&ring->comp_lock.m, "mlx4 comp", NULL, MTX_DEF);
=20
 	/* Allocate the buf ring */
-	ring->br =3D buf_ring_alloc(MLX4_EN_DEF_TX_QUEUE_SIZE, M_DEVBUF,
+	ring->br =3D drbr_alloc(M_DEVBUF,
 	    M_WAITOK, &ring->tx_lock.m);
 	if (ring->br =3D=3D NULL) {
 		en_err(priv, "Failed allocating tx_info ring\n");
@@ -155,7 +156,7 @@ err_bounce:
 	kfree(ring->bounce_buf);
 	ring->bounce_buf =3D NULL;
 err_tx:
-	buf_ring_free(ring->br, M_DEVBUF);
+	drbr_free(ring->br, M_DEVBUF);
 	kfree(ring->tx_info);
 	ring->tx_info =3D NULL;
 	return err;
@@ -167,7 +168,7 @@ void mlx4_en_destroy_tx_ring(struct mlx4_en_priv *
 	struct mlx4_en_dev *mdev =3D priv->mdev;
 	en_dbg(DRV, priv, "Destroying tx ring, qpn: %d\n", ring->qpn);
=20
-	buf_ring_free(ring->br, M_DEVBUF);
+	drbr_free(ring->br, M_DEVBUF);
 	if (ring->bf_enabled)
 		mlx4_bf_free(mdev->dev, &ring->bf);
 	mlx4_qp_remove(mdev->dev, &ring->qp);
@@ -925,6 +926,7 @@ mlx4_en_transmit_locked(struct ifnet *dev, int tx_
 	struct mlx4_en_tx_ring *ring;
 	struct mbuf *next;
 	int enqueued, err =3D 0;
+	uint8_t queue;
=20
 	ring =3D &priv->tx_ring[tx_ind];
 	if ((dev->if_drv_flags & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) !=
@@ -940,16 +942,16 @@ mlx4_en_transmit_locked(struct ifnet *dev, int tx_
 			return (err);
 	}
 	/* Process the queue */
-	while ((next =3D drbr_peek(dev, ring->br)) !=3D NULL) {
+	while ((next =3D drbr_peek(dev, ring->br, &queue)) !=3D NULL) {
 		if ((err =3D mlx4_en_xmit(dev, tx_ind, &next)) !=3D 0) {
 			if (next =3D=3D NULL) {
-				drbr_advance(dev, ring->br);
+				drbr_advance(dev, ring->br, queue);
 			} else {
-				drbr_putback(dev, ring->br, next);
+				drbr_putback(dev, ring->br, next, queue);
 			}
 			break;
 		}
-		drbr_advance(dev, ring->br);
+		drbr_advance(dev, ring->br, queue);
 		enqueued++;
 		dev->if_obytes +=3D next->m_pkthdr.len;
 		if (next->m_flags & M_MCAST)
@@ -1027,12 +1029,10 @@ mlx4_en_qflush(struct ifnet *dev)
 {
 	struct mlx4_en_priv *priv =3D netdev_priv(dev);
 	struct mlx4_en_tx_ring *ring =3D priv->tx_ring;
-	struct mbuf *m;
=20
 	for (int i =3D 0; i < priv->tx_ring_num; i++, ring++) {
 		spin_lock(&ring->tx_lock);
-		while ((m =3D buf_ring_dequeue_sc(ring->br)) !=3D NULL)
-			m_freem(m);
+		drbr_flush(dev, ring->br);
 		spin_unlock(&ring->tx_lock);
 	}
 	if_qflush(dev);
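
The transmit loop changed above follows a peek / advance / putback pattern:
the head packet is only removed once the hardware has accepted it, and a
packet that was modified but not sent is written back into the head slot.
A minimal userspace sketch of that control flow follows. It is an
illustration under assumptions, not the driver code: struct txq, the txq_*
helpers and hw_xmit() are hypothetical, and a single queue is used, so the
extra queue argument introduced by this patch is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define QLEN	4

/* Hypothetical single transmit queue holding packet names. */
struct txq {
	const char	*slot[QLEN];
	unsigned	 head;	/* index of the oldest entry */
	unsigned	 count;	/* entries on the queue */
};

/* Look at the head entry without removing it. */
static const char *
txq_peek(struct txq *q)
{
	return (q->count != 0 ? q->slot[q->head] : NULL);
}

/* Drop the head entry after it has been consumed. */
static void
txq_advance(struct txq *q)
{
	if (q->count != 0) {
		q->head = (q->head + 1) % QLEN;
		q->count--;
	}
}

/* Overwrite the head slot with a (possibly modified) packet. */
static void
txq_putback(struct txq *q, const char *item)
{
	assert(q->count != 0);
	q->slot[q->head] = item;
}

/*
 * Hypothetical hardware: accepts packets while *room is non-zero.
 * A real driver may also free the packet and set *pkt to NULL on error;
 * this stub never does.
 */
static int
hw_xmit(unsigned *room, const char **pkt)
{
	(void)pkt;
	if (*room == 0)
		return (-1);
	(*room)--;
	return (0);
}

/* Send as much as the hardware accepts; returns the number sent. */
static unsigned
txq_drain(struct txq *q, unsigned *room)
{
	const char *next;
	unsigned sent = 0;

	while ((next = txq_peek(q)) != NULL) {
		if (hw_xmit(room, &next) != 0) {
			if (next == NULL)
				txq_advance(q);		/* packet was freed */
			else
				txq_putback(q, next);	/* retry later */
			break;
		}
		txq_advance(q);
		sent++;
	}
	return (sent);
}
```

The point of peeking rather than dequeuing is that a packet the hardware
refuses keeps its place at the head of the queue instead of being
re-enqueued at the tail.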
Index: sys/ofed/drivers/net/mlx4/mlx4_en.h
===================================================================
--- sys/ofed/drivers/net/mlx4/mlx4_en.h	(revision 257322)
+++ sys/ofed/drivers/net/mlx4/mlx4_en.h	(working copy)
@@ -285,7 +285,7 @@ struct mlx4_en_tx_ring {
 	void *buf;
 	u16 poll_cnt;
 	int blocked;
-	struct buf_ring *br;
+	struct drbr_ring *br;
 	struct mlx4_en_tx_info *tx_info;
 	u8 *bounce_buf;
 	u32 last_nr_txbb;
Index: sys/sys/buf_ring.h
===================================================================
--- sys/sys/buf_ring.h	(revision 257322)
+++ sys/sys/buf_ring.h	(working copy)
@@ -61,176 +61,25 @@ struct buf_ring {
  * multi-producer safe lock-free ring buffer enqueue
  *
  */
-static __inline int
-buf_ring_enqueue(struct buf_ring *br, void *buf)
-{
-	uint32_t prod_head, prod_next;
-	uint32_t cons_tail;
-#ifdef DEBUG_BUFRING
-	int i;
-	for (i =3D br->br_cons_head; i !=3D br->br_prod_head;
-	     i =3D ((i + 1) & br->br_cons_mask))
-		if(br->br_ring[i] =3D=3D buf)
-			panic("buf=3D%p already enqueue at %d prod=3D%d =
cons=3D%d",
-			    buf, i, br->br_prod_tail, br->br_cons_tail);
-#endif=09
-	critical_enter();
-	do {
-		prod_head =3D br->br_prod_head;
-		cons_tail =3D br->br_cons_tail;
-
-		prod_next =3D (prod_head + 1) & br->br_prod_mask;
-	=09
-		if (prod_next =3D=3D cons_tail) {
-			br->br_drops++;
-			critical_exit();
-			return (ENOBUFS);
-		}
-	} while (!atomic_cmpset_int(&br->br_prod_head, prod_head, =
prod_next));
-#ifdef DEBUG_BUFRING
-	if (br->br_ring[prod_head] !=3D NULL)
-		panic("dangling value in enqueue");
-#endif=09
-	br->br_ring[prod_head] =3D buf;
-
-	/*
-	 * The full memory barrier also avoids that br_prod_tail store
-	 * is reordered before the br_ring[prod_head] is full setup.
-	 */
-	mb();
-
-	/*
-	 * If there are other enqueues in progress
-	 * that preceeded us, we need to wait for them
-	 * to complete=20
-	 */  =20
-	while (br->br_prod_tail !=3D prod_head)
-		cpu_spinwait();
-	br->br_prod_tail =3D prod_next;
-	critical_exit();
-	return (0);
-}
-
+int buf_ring_enqueue(struct buf_ring *br, void *buf);
 /*
  * multi-consumer safe dequeue=20
  *
  */
-static __inline void *
-buf_ring_dequeue_mc(struct buf_ring *br)
-{
-	uint32_t cons_head, cons_next;
-	uint32_t prod_tail;
-	void *buf;
-	int success;
-
-	critical_enter();
-	do {
-		cons_head =3D br->br_cons_head;
-		prod_tail =3D br->br_prod_tail;
-
-		cons_next =3D (cons_head + 1) & br->br_cons_mask;
-	=09
-		if (cons_head =3D=3D prod_tail) {
-			critical_exit();
-			return (NULL);
-		}
-	=09
-		success =3D atomic_cmpset_int(&br->br_cons_head, =
cons_head,
-		    cons_next);
-	} while (success =3D=3D 0);	=09
-
-	buf =3D br->br_ring[cons_head];
-#ifdef DEBUG_BUFRING
-	br->br_ring[cons_head] =3D NULL;
-#endif
-
-	/*
-	 * The full memory barrier also avoids that br_ring[cons_read]
-	 * load is reordered after br_cons_tail is set.
-	 */
-	mb();
-=09
-	/*
-	 * If there are other dequeues in progress
-	 * that preceeded us, we need to wait for them
-	 * to complete=20
-	 */  =20
-	while (br->br_cons_tail !=3D cons_head)
-		cpu_spinwait();
-
-	br->br_cons_tail =3D cons_next;
-	critical_exit();
-
-	return (buf);
-}
-
+void *buf_ring_dequeue_mc(struct buf_ring *br);
 /*
  * single-consumer dequeue=20
  * use where dequeue is protected by a lock
  * e.g. a network driver's tx queue lock
  */
-static __inline void *
-buf_ring_dequeue_sc(struct buf_ring *br)
-{
-	uint32_t cons_head, cons_next, cons_next_next;
-	uint32_t prod_tail;
-	void *buf;
-=09
-	cons_head =3D br->br_cons_head;
-	prod_tail =3D br->br_prod_tail;
-=09
-	cons_next =3D (cons_head + 1) & br->br_cons_mask;
-	cons_next_next =3D (cons_head + 2) & br->br_cons_mask;
-=09
-	if (cons_head =3D=3D prod_tail)=20
-		return (NULL);
-
-#ifdef PREFETCH_DEFINED=09
-	if (cons_next !=3D prod_tail) {	=09
-		prefetch(br->br_ring[cons_next]);
-		if (cons_next_next !=3D prod_tail)=20
-			prefetch(br->br_ring[cons_next_next]);
-	}
-#endif
-	br->br_cons_head =3D cons_next;
-	buf =3D br->br_ring[cons_head];
-
-#ifdef DEBUG_BUFRING
-	br->br_ring[cons_head] =3D NULL;
-	if (!mtx_owned(br->br_lock))
-		panic("lock not held on single consumer dequeue");
-	if (br->br_cons_tail !=3D cons_head)
-		panic("inconsistent list cons_tail=3D%d cons_head=3D%d",
-		    br->br_cons_tail, cons_head);
-#endif
-	br->br_cons_tail =3D cons_next;
-	return (buf);
-}
-
+void *buf_ring_dequeue_sc(struct buf_ring *br);
 /*
  * single-consumer advance after a peek
  * use where it is protected by a lock
  * e.g. a network driver's tx queue lock
  */
-static __inline void
-buf_ring_advance_sc(struct buf_ring *br)
-{
-	uint32_t cons_head, cons_next;
-	uint32_t prod_tail;
-=09
-	cons_head =3D br->br_cons_head;
-	prod_tail =3D br->br_prod_tail;
-=09
-	cons_next =3D (cons_head + 1) & br->br_cons_mask;
-	if (cons_head =3D=3D prod_tail)=20
-		return;
-	br->br_cons_head =3D cons_next;
-#ifdef DEBUG_BUFRING
-	br->br_ring[cons_head] =3D NULL;
-#endif
-	br->br_cons_tail =3D cons_next;
-}
-
+void buf_ring_advance_sc(struct buf_ring *br);
+void buf_ring_advance_mc(struct buf_ring *br);
 /*
  * Used to return a buffer (most likely already there)
  * to the top od the ring. The caller should *not*
@@ -247,65 +96,27 @@ struct buf_ring {
  * if we have to do a multi-queue version we will need
  * the compare and an atomic.
  */
-static __inline void
-buf_ring_putback_sc(struct buf_ring *br, void *new)
-{
-	KASSERT(br->br_cons_head !=3D br->br_prod_tail,=20
-		("Buf-Ring has none in putback")) ;
-	br->br_ring[br->br_cons_head] =3D new;
-}
-
+void buf_ring_putback_mc(struct buf_ring *br, void *new);
+void buf_ring_putback_sc(struct buf_ring *br, void *new);
 /*
  * return a pointer to the first entry in the ring
  * without modifying it, or NULL if the ring is empty
  * race-prone if not protected by a lock
  */
-static __inline void *
-buf_ring_peek(struct buf_ring *br)
-{
+void *buf_ring_peek(struct buf_ring *br);
=20
-#ifdef DEBUG_BUFRING
-	if ((br->br_lock !=3D NULL) && !mtx_owned(br->br_lock))
-		panic("lock not held on single consumer dequeue");
-#endif=09
-	/*
-	 * I believe it is safe to not have a memory barrier
-	 * here because we control cons and tail is worst case
-	 * a lagging indicator so we worst case we might
-	 * return NULL immediately after a buffer has been enqueued
-	 */
-	if (br->br_cons_head =3D=3D br->br_prod_tail)
-		return (NULL);
-=09
-	return (br->br_ring[br->br_cons_head]);
-}
+int buf_ring_full(struct buf_ring *br);
=20
-static __inline int
-buf_ring_full(struct buf_ring *br)
-{
+int buf_ring_empty(struct buf_ring *br);
=20
-	return (((br->br_prod_head + 1) & br->br_prod_mask) =3D=3D =
br->br_cons_tail);
-}
+int buf_ring_count(struct buf_ring *br);
=20
-static __inline int
-buf_ring_empty(struct buf_ring *br)
-{
-
-	return (br->br_cons_head =3D=3D br->br_prod_tail);
-}
-
-static __inline int
-buf_ring_count(struct buf_ring *br)
-{
-
-	return ((br->br_prod_size + br->br_prod_tail - br->br_cons_tail)
-	    & br->br_prod_mask);
-}
-
 struct buf_ring *buf_ring_alloc(int count, struct malloc_type *type, =
int flags,
     struct mtx *);
+
 void buf_ring_free(struct buf_ring *br, struct malloc_type *type);
=20
+int buf_ring_mbufon(struct buf_ring *br, void *buf);
=20
=20
 #endif
Index: usr.sbin/ofwdump/ofwdump.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- usr.sbin/ofwdump/ofwdump.c	(revision 257322)
+++ usr.sbin/ofwdump/ofwdump.c	(working copy)
@@ -63,6 +63,8 @@ usage(void)
 	exit(EX_USAGE);
 }
=20
+static int query_mode =3D 0;
+
 int
 main(int argc, char *argv[])
 {
@@ -72,10 +74,13 @@ main(int argc, char *argv[])
=20
 	aflag =3D pflag =3D rflag =3D Rflag =3D Sflag =3D 0;
 	Parg =3D NULL;
-	while ((opt =3D getopt(argc, argv, "-aprP:RS")) !=3D -1) {
+	while ((opt =3D getopt(argc, argv, "-aqprP:RS")) !=3D -1) {
 		if (opt =3D=3D '-')
 			break;
 		switch (opt) {
+		case 'q':
+			query_mode =3D 1;
+			break;
 		case 'a':
 			aflag =3D 1;
 			rflag =3D 1;
@@ -209,6 +214,7 @@ ofw_dump_node(int fd, phandle_t n, int level, int
 	static int nblen =3D 0;
 	int plen;
 	phandle_t c;
+	int my_prop =3D 0;
=20
 	if (!(raw || str)) {
 		ofw_indent(level * LVLINDENT);
@@ -218,9 +224,26 @@ ofw_dump_node(int fd, phandle_t n, int level, int
 			printf(": %.*s\n", (int)plen, (char *)nbuf);
 		else
 			putchar('\n');
+		if (query_mode) {
+			char input[100];
+			fprintf(stdout, "Dump properties (y or n)?");
+			fflush(stdout);
+			input[0] =3D 0;
+			fgets(input, sizeof(input), stdin);
+			if (input[0] =3D=3D 'y') {
+				my_prop =3D 1;
+			}
+		}
+	=09
 	}
 	if (prop)
 		ofw_dump_properties(fd, n, level, pmatch, raw, str);
+	if (my_prop) {
+		ofw_dump_properties(fd, n, level, pmatch, raw, str);
+		printf("Exiting\n");
+		exit(0);
+	}
+
 	if (rec) {
 		for (c =3D ofw_child(fd, n); c !=3D 0; c =3D =
ofw_peer(fd, c)) {
 			ofw_dump_node(fd, c, level + 1, rec, prop, =
pmatch,

--Apple-Mail=_49C5FDEE-E4BA-44F6-8F6A-342853C67ED4
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
	charset=us-ascii


------------------------------
Randall Stewart
803-317-4952 (cell)


--Apple-Mail=_49C5FDEE-E4BA-44F6-8F6A-342853C67ED4--
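
The index arithmetic that the buf_ring.h hunk above moves out of the
header (buf_ring_full, buf_ring_empty, buf_ring_count) can be sketched in
isolation. This is a minimal single-threaded illustration, not the
patch's code: struct ring_idx and the ring_idx_* helpers are hypothetical
names, and the atomics, memory barriers and critical sections of the real
functions are left out. It assumes the slot count is a power of two, so
that masking acts as a modulo.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified index state; not the kernel's struct buf_ring. */
struct ring_idx {
	uint32_t prod_head, prod_tail;
	uint32_t cons_head, cons_tail;
	uint32_t size;		/* number of slots, power of two */
	uint32_t mask;		/* size - 1 */
};

static void
ring_idx_init(struct ring_idx *r, uint32_t size)
{
	/* Masking only behaves like a modulo for power-of-two sizes. */
	assert(size != 0 && (size & (size - 1)) == 0);
	r->prod_head = r->prod_tail = 0;
	r->cons_head = r->cons_tail = 0;
	r->size = size;
	r->mask = size - 1;
}

/* Full: advancing the producer head would collide with the consumer tail. */
static int
ring_idx_full(const struct ring_idx *r)
{
	return (((r->prod_head + 1) & r->mask) == r->cons_tail);
}

/* Empty: the consumer has caught up with everything published. */
static int
ring_idx_empty(const struct ring_idx *r)
{
	return (r->cons_head == r->prod_tail);
}

/* Number of published entries not yet consumed. */
static int
ring_idx_count(const struct ring_idx *r)
{
	return ((r->size + r->prod_tail - r->cons_tail) & r->mask);
}

/* Single-threaded stand-ins for enqueue/dequeue: move the indices only. */
static void
ring_idx_produce(struct ring_idx *r)
{
	r->prod_head = (r->prod_head + 1) & r->mask;
	r->prod_tail = r->prod_head;
}

static void
ring_idx_consume(struct ring_idx *r)
{
	r->cons_head = (r->cons_head + 1) & r->mask;
	r->cons_tail = r->cons_head;
}
```

One slot is always left unused so that a full ring can be told apart from
an empty one; a ring of 8 slots therefore holds at most 7 buffers.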

From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 11:04:31 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 47F4F8EF
 for <net@freebsd.org>; Tue, 29 Oct 2013 11:04:31 +0000 (UTC)
 (envelope-from rrs@lakerest.net)
Received: from lakerest.net (lakerest.net [162.235.35.161])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id CFEC82C42
 for <net@freebsd.org>; Tue, 29 Oct 2013 11:04:30 +0000 (UTC)
Received: from [10.1.1.103] (bsd4.lakerest.net [162.235.35.162])
 (authenticated bits=0)
 by lakerest.net (8.14.4/8.14.3) with ESMTP id r9TB4OPY068744
 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NOT)
 for <net@freebsd.org>; Tue, 29 Oct 2013 07:04:24 -0400 (EDT)
 (envelope-from rrs@lakerest.net)
Content-Type: text/plain; charset=windows-1252
Mime-Version: 1.0 (Apple Message framework v1283)
Subject: Re: MQ Patch.
From: Randall Stewart <rrs@lakerest.net>
In-Reply-To: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
Date: Tue, 29 Oct 2013 07:04:24 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <06B5EC19-8F81-4726-9DF1-96286B0967A5@lakerest.net>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
To: net@freebsd.org
X-Mailer: Apple Mail (2.1283)
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 11:04:31 -0000

A quick follow up note.

I will have an update to this. In my build-universe I see that if_var.h
has changed (includes and such), so I will have to touch up drbr.h
(nothing like trying to hit a moving target :-D).

I will send out an update after my build-universe completes (hopefully
today), but take a look at this one anyway, with the understanding that
a couple of includes and such may change :-)

R
On Oct 29, 2013, at 6:50 AM, Randall Stewart wrote:

> Hi:
>
> As discussed at vBSDcon with andre/emaste and gnn, I am sending
> this patch out to all of you ;-)
>
> I have previously sent it to gnn, andre, jhb, rwatson, and several
> other of the usual suspects (as gnn put it) and received dead silence.
>
> What does this patch do?
>
> Well, it adds the ability to do multi-queue at the driver level.
> Basically any driver that uses the new interface gets under it N queues
> (default is 8) for each physical transmit ring it has. The driver picks
> up its queue 0 first, then queue 1 .. up to the max.
>
> This allows you to prioritize packets. Also in here is the start of
> some work I will be doing for AQM.. think either Pi or Codel ;-)
>
> Right now that's pretty simple and just (in a few drivers) has the
> ability to limit the amount of data on the ring... which can help
> reduce buffer bloat. That needs to be refined into a lot more.
>
> This work is donated by Adara Networks and has been discussed in
> several of the past vendor summits.
>
> I plan on committing this before the IETF unless I hear major
> objections.
>
> Please have a look ;-)
>
> Best wishes
>
> R
>
> <patch_mq.txt>
> ------------------------------
> Randall Stewart
> 803-317-4952 (cell)
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

------------------------------
Randall Stewart
803-317-4952 (cell)
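
The service order described in the quoted message (queue 0 first, then
queue 1, up to the max) amounts to a strict-priority dequeue. Below is a
minimal sketch, assuming a small fixed-size FIFO per priority. The names
mq_txring, mq_fifo, mq_enqueue and mq_dequeue are hypothetical and are
not taken from patch_mq.txt; the default of 8 queues comes from the
message, while the depth of 16 is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define MQ_NQUEUES	8	/* default number of queues per the message */
#define MQ_DEPTH	16	/* hypothetical per-queue depth */

/* Hypothetical per-priority FIFO: head is next to dequeue, tail next free. */
struct mq_fifo {
	void *slot[MQ_DEPTH];
	unsigned head, tail;
};

/* Hypothetical software queues sitting above one physical transmit ring. */
struct mq_txring {
	struct mq_fifo q[MQ_NQUEUES];
};

/* Queue a packet at the given priority (0 is served first). */
static int
mq_enqueue(struct mq_txring *tx, unsigned prio, void *pkt)
{
	struct mq_fifo *f;

	assert(prio < MQ_NQUEUES);
	f = &tx->q[prio];
	if (f->tail - f->head == MQ_DEPTH)
		return (-1);		/* this priority's FIFO is full */
	f->slot[f->tail % MQ_DEPTH] = pkt;
	f->tail++;
	return (0);
}

/* Return the next packet: scan queue 0 first, then 1, and so on. */
static void *
mq_dequeue(struct mq_txring *tx)
{
	for (unsigned i = 0; i < MQ_NQUEUES; i++) {
		struct mq_fifo *f = &tx->q[i];

		if (f->head != f->tail) {
			void *pkt = f->slot[f->head % MQ_DEPTH];

			f->head++;
			return (pkt);
		}
	}
	return (NULL);			/* every queue is empty */
}
```

With a strict scan like this, sustained traffic on queue 0 can starve the
higher-numbered queues, which is the usual trade-off of strict priority.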


From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 13:00:02 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@smarthost.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 9BD81843
 for <freebsd-net@smarthost.ysv.freebsd.org>;
 Tue, 29 Oct 2013 13:00:02 +0000 (UTC)
 (envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 7A3DA2504
 for <freebsd-net@smarthost.ysv.freebsd.org>;
 Tue, 29 Oct 2013 13:00:02 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r9TD02ag040343
 for <freebsd-net@freefall.freebsd.org>; Tue, 29 Oct 2013 13:00:02 GMT
 (envelope-from gnats@freefall.freebsd.org)
Received: (from gnats@localhost)
 by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r9TD022d040342;
 Tue, 29 Oct 2013 13:00:02 GMT (envelope-from gnats)
Date: Tue, 29 Oct 2013 13:00:02 GMT
Message-Id: <201310291300.r9TD022d040342@freefall.freebsd.org>
To: freebsd-net@FreeBSD.org
Cc: 
From: dfilter@FreeBSD.ORG (dfilter service)
Subject: Re: kern/134531: commit references a PR
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: dfilter service <dfilter@FreeBSD.ORG>
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 13:00:02 -0000

The following reply was made to PR kern/134531; it has been noted by GNATS.

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/134531: commit references a PR
Date: Tue, 29 Oct 2013 12:53:33 +0000 (UTC)

 Author: melifaro
 Date: Tue Oct 29 12:53:23 2013
 New Revision: 257330
 URL: http://svnweb.freebsd.org/changeset/base/257330
 
 Log:
   MFC r256624:
   
   Fix long-standing issue with incorrect radix mask calculation.
   
   Usual symptoms are messages like
   rn_delete: inconsistent annotation
   rn_addmask: mask impossibly already in tree
   routing daemon constantly deleting IPv6 default route
   or inability to flush/delete particular prefix in ipfw table.
   
   Changes:
   * Assume 32 bytes as maximum radix key length
   * Remove rn_init()
   * Statically allocate rn_ones/rn_zeroes
   * Make separate mask tree for each "normal" tree instead of system
   global one
   * Remove "optimization" on masks reusage and key zeroying
   * Change rn_addmask() arguments to accept tree pointer (no users in base)
   
   MFC changes:
   * keep rn_init()
   * create global mask tree, protected with mutex, for old rn_addmask
   users (currently 0 in base)
   * Add new rn_addmask_r() function (rn_addmask in head) with additional
   argument to accept tree pointer
   
   PR:		kern/182851, kern/169206, kern/135476, kern/134531
   Found by:	Slawa Olhovchenkov <slw@zxy.spb.ru>
   Reviewed by:	glebius (previous versions)
   Sponsored by:	Yandex LLC
   Approved by:	re (glebius)
 
 Modified:
   stable/10/sys/net/radix.c
   stable/10/sys/net/radix.h
 
 Modified: stable/10/sys/net/radix.c
 ==============================================================================
 --- stable/10/sys/net/radix.c	Tue Oct 29 12:34:11 2013	(r257329)
 +++ stable/10/sys/net/radix.c	Tue Oct 29 12:53:23 2013	(r257330)
 @@ -66,27 +66,27 @@ static struct radix_node
  	 *rn_search(void *, struct radix_node *),
  	 *rn_search_m(void *, struct radix_node *, void *);
  
 -static int	max_keylen;
 -static struct radix_mask *rn_mkfreelist;
 -static struct radix_node_head *mask_rnhead;
 +static void rn_detachhead_internal(void **head);
 +static int rn_inithead_internal(void **head, int off);
 +
 +#define	RADIX_MAX_KEY_LEN	32
 +
 +static char rn_zeros[RADIX_MAX_KEY_LEN];
 +static char rn_ones[RADIX_MAX_KEY_LEN] = {
 +	-1, -1, -1, -1, -1, -1, -1, -1,
 +	-1, -1, -1, -1, -1, -1, -1, -1,
 +	-1, -1, -1, -1, -1, -1, -1, -1,
 +	-1, -1, -1, -1, -1, -1, -1, -1,
 +};
 +
  /*
 - * Work area -- the following point to 3 buffers of size max_keylen,
 - * allocated in this order in a block of memory malloc'ed by rn_init.
 - * rn_zeros, rn_ones are set in rn_init and used in readonly afterwards.
 - * addmask_key is used in rn_addmask in rw mode and not thread-safe.
 + * XXX: Compat stuff for old rn_addmask() users
   */
 -static char *rn_zeros, *rn_ones, *addmask_key;
 -
 -#define MKGet(m) {						\
 -	if (rn_mkfreelist) {					\
 -		m = rn_mkfreelist;				\
 -		rn_mkfreelist = (m)->rm_mklist;			\
 -	} else							\
 -		R_Malloc(m, struct radix_mask *, sizeof (struct radix_mask)); }
 - 
 -#define MKFree(m) { (m)->rm_mklist = rn_mkfreelist; rn_mkfreelist = (m);}
 +static struct radix_node_head *mask_rnhead_compat;
 +#ifdef	_KERNEL
 +static struct mtx mask_mtx;
 +#endif
  
 -#define rn_masktop (mask_rnhead->rnh_treetop)
  
  static int	rn_lexobetter(void *m_arg, void *n_arg);
  static struct radix_mask *
 @@ -230,7 +230,8 @@ rn_lookup(v_arg, m_arg, head)
  	caddr_t netmask = 0;
  
  	if (m_arg) {
 -		x = rn_addmask(m_arg, 1, head->rnh_treetop->rn_offset);
 +		x = rn_addmask_r(m_arg, head->rnh_masks, 1,
 +		    head->rnh_treetop->rn_offset);
  		if (x == 0)
  			return (0);
  		netmask = x->rn_key;
 @@ -489,53 +490,47 @@ on1:
  }
  
  struct radix_node *
 -rn_addmask(n_arg, search, skip)
 -	int search, skip;
 -	void *n_arg;
 +rn_addmask_r(void *arg, struct radix_node_head *maskhead, int search, int skip)
  {
 -	caddr_t netmask = (caddr_t)n_arg;
 +	caddr_t netmask = (caddr_t)arg;
  	register struct radix_node *x;
  	register caddr_t cp, cplim;
  	register int b = 0, mlen, j;
 -	int maskduplicated, m0, isnormal;
 +	int maskduplicated, isnormal;
  	struct radix_node *saved_x;
 -	static int last_zeroed = 0;
 +	char addmask_key[RADIX_MAX_KEY_LEN];
  
 -	if ((mlen = LEN(netmask)) > max_keylen)
 -		mlen = max_keylen;
 +	if ((mlen = LEN(netmask)) > RADIX_MAX_KEY_LEN)
 +		mlen = RADIX_MAX_KEY_LEN;
  	if (skip == 0)
  		skip = 1;
  	if (mlen <= skip)
 -		return (mask_rnhead->rnh_nodes);
 +		return (maskhead->rnh_nodes);
 +
 +	bzero(addmask_key, RADIX_MAX_KEY_LEN);
  	if (skip > 1)
  		bcopy(rn_ones + 1, addmask_key + 1, skip - 1);
 -	if ((m0 = mlen) > skip)
 -		bcopy(netmask + skip, addmask_key + skip, mlen - skip);
 +	bcopy(netmask + skip, addmask_key + skip, mlen - skip);
  	/*
  	 * Trim trailing zeroes.
  	 */
  	for (cp = addmask_key + mlen; (cp > addmask_key) && cp[-1] == 0;)
  		cp--;
  	mlen = cp - addmask_key;
 -	if (mlen <= skip) {
 -		if (m0 >= last_zeroed)
 -			last_zeroed = mlen;
 -		return (mask_rnhead->rnh_nodes);
 -	}
 -	if (m0 < last_zeroed)
 -		bzero(addmask_key + m0, last_zeroed - m0);
 -	*addmask_key = last_zeroed = mlen;
 -	x = rn_search(addmask_key, rn_masktop);
 +	if (mlen <= skip)
 +		return (maskhead->rnh_nodes);
 +	*addmask_key = mlen;
 +	x = rn_search(addmask_key, maskhead->rnh_treetop);
  	if (bcmp(addmask_key, x->rn_key, mlen) != 0)
  		x = 0;
  	if (x || search)
  		return (x);
 -	R_Zalloc(x, struct radix_node *, max_keylen + 2 * sizeof (*x));
 +	R_Zalloc(x, struct radix_node *, RADIX_MAX_KEY_LEN + 2 * sizeof (*x));
  	if ((saved_x = x) == 0)
  		return (0);
  	netmask = cp = (caddr_t)(x + 2);
  	bcopy(addmask_key, cp, mlen);
 -	x = rn_insert(cp, mask_rnhead, &maskduplicated, x);
 +	x = rn_insert(cp, maskhead, &maskduplicated, x);
  	if (maskduplicated) {
  		log(LOG_ERR, "rn_addmask: mask impossibly already in tree");
  		Free(saved_x);
 @@ -568,6 +563,23 @@ rn_addmask(n_arg, search, skip)
  	return (x);
  }
  
 +struct radix_node *
 +rn_addmask(void *n_arg, int search, int skip)
 +{
 +	struct radix_node *tt;
 +
 +#ifdef _KERNEL
 +	mtx_lock(&mask_mtx);
 +#endif
 +	tt = rn_addmask_r(n_arg, mask_rnhead_compat, search, skip);
 +
 +#ifdef _KERNEL
 +	mtx_unlock(&mask_mtx);
 +#endif
 +
 +	return (tt);
 +}
 +
  static int	/* XXX: arbitrary ordering for non-contiguous masks */
  rn_lexobetter(m_arg, n_arg)
  	void *m_arg, *n_arg;
 @@ -590,12 +602,12 @@ rn_new_radix_mask(tt, next)
  {
  	register struct radix_mask *m;
  
 -	MKGet(m);
 +	R_Malloc(m, struct radix_mask *, sizeof (struct radix_mask));
  	if (m == 0) {
 -		log(LOG_ERR, "Mask for route not entered\n");
 +		log(LOG_ERR, "Failed to allocate route mask\n");
  		return (0);
  	}
 -	bzero(m, sizeof *m);
 +	bzero(m, sizeof(*m));
  	m->rm_bit = tt->rn_bit;
  	m->rm_flags = tt->rn_flags;
  	if (tt->rn_flags & RNF_NORMAL)
 @@ -629,7 +641,8 @@ rn_addroute(v_arg, n_arg, head, treenode
  	 * nodes and possibly save time in calculating indices.
  	 */
  	if (netmask)  {
 -		if ((x = rn_addmask(netmask, 0, top->rn_offset)) == 0)
 +		x = rn_addmask_r(netmask, head->rnh_masks, 0, top->rn_offset);
 +		if (x == NULL)
  			return (0);
  		b_leaf = x->rn_bit;
  		b = -1 - x->rn_bit;
 @@ -808,7 +821,8 @@ rn_delete(v_arg, netmask_arg, head)
  	 * Delete our route from mask lists.
  	 */
  	if (netmask) {
 -		if ((x = rn_addmask(netmask, 1, head_off)) == 0)
 +		x = rn_addmask_r(netmask, head->rnh_masks, 1, head_off);
 +		if (x == NULL)
  			return (0);
  		netmask = x->rn_key;
  		while (tt->rn_mask != netmask)
 @@ -841,7 +855,7 @@ rn_delete(v_arg, netmask_arg, head)
  	for (mp = &x->rn_mklist; (m = *mp); mp = &m->rm_mklist)
  		if (m == saved_m) {
  			*mp = m->rm_mklist;
 -			MKFree(m);
 +			Free(m);
  			break;
  		}
  	if (m == 0) {
 @@ -932,7 +946,7 @@ on1:
  					struct radix_mask *mm = m->rm_mklist;
  					x->rn_mklist = 0;
  					if (--(m->rm_refs) < 0)
 -						MKFree(m);
 +						Free(m);
  					m = mm;
  				}
  			if (m)
 @@ -1128,10 +1142,8 @@ rn_walktree(h, f, w)
   * bits starting at 'off'.
   * Return 1 on success, 0 on error.
   */
 -int
 -rn_inithead(head, off)
 -	void **head;
 -	int off;
 +static int
 +rn_inithead_internal(void **head, int off)
  {
  	register struct radix_node_head *rnh;
  	register struct radix_node *t, *tt, *ttt;
 @@ -1163,8 +1175,8 @@ rn_inithead(head, off)
  	return (1);
  }
  
 -int
 -rn_detachhead(void **head)
 +static void
 +rn_detachhead_internal(void **head)
  {
  	struct radix_node_head *rnh;
  
 @@ -1176,28 +1188,60 @@ rn_detachhead(void **head)
  	Free(rnh);
  
  	*head = NULL;
 +}
 +
 +int
 +rn_inithead(void **head, int off)
 +{
 +	struct radix_node_head *rnh;
 +
 +	if (*head != NULL)
 +		return (1);
 +
 +	if (rn_inithead_internal(head, off) == 0)
 +		return (0);
 +
 +	rnh = (struct radix_node_head *)(*head);
 +
 +	if (rn_inithead_internal((void **)&rnh->rnh_masks, 0) == 0) {
 +		rn_detachhead_internal(head);
 +		return (0);
 +	}
 +
 +	return (1);
 +}
 +
 +int
 +rn_detachhead(void **head)
 +{
 +	struct radix_node_head *rnh;
 +
 +	KASSERT((head != NULL && *head != NULL),
 +	    ("%s: head already freed", __func__));
 +
 +	rnh = *head;
 +
 +	rn_detachhead_internal((void **)&rnh->rnh_masks);
 +	rn_detachhead_internal(head);
  	return (1);
  }
  
  void
  rn_init(int maxk)
  {
 -	char *cp, *cplim;
 -
 -	max_keylen = maxk;
 -	if (max_keylen == 0) {
 +	if ((maxk <= 0) || (maxk > RADIX_MAX_KEY_LEN)) {
  		log(LOG_ERR,
 -		    "rn_init: radix functions require max_keylen be set\n");
 +		    "rn_init: max_keylen must be within 1..%d\n",
 +		    RADIX_MAX_KEY_LEN);
  		return;
  	}
 -	R_Malloc(rn_zeros, char *, 3 * max_keylen);
 -	if (rn_zeros == NULL)
 -		panic("rn_init");
 -	bzero(rn_zeros, 3 * max_keylen);
 -	rn_ones = cp = rn_zeros + max_keylen;
 -	addmask_key = cplim = rn_ones + max_keylen;
 -	while (cp < cplim)
 -		*cp++ = -1;
 -	if (rn_inithead((void **)(void *)&mask_rnhead, 0) == 0)
 +
 +	/*
 +	 * XXX: Compat for old rn_addmask() users
 +	 */
 +	if (rn_inithead((void **)(void *)&mask_rnhead_compat, 0) == 0)
  		panic("rn_init 2");
 +#ifdef _KERNEL
 +	mtx_init(&mask_mtx, "radix_mask", NULL, MTX_DEF);
 +#endif
  }
 
 Modified: stable/10/sys/net/radix.h
 ==============================================================================
 --- stable/10/sys/net/radix.h	Tue Oct 29 12:34:11 2013	(r257329)
 +++ stable/10/sys/net/radix.h	Tue Oct 29 12:53:23 2013	(r257330)
 @@ -136,6 +136,7 @@ struct radix_node_head {
  #ifdef _KERNEL
  	struct	rwlock rnh_lock;		/* locks entire radix tree */
  #endif
 +	struct	radix_node_head *rnh_masks;	/* Storage for our masks */
  };
  
  #ifndef _KERNEL
 @@ -167,6 +168,7 @@ int	 rn_detachhead(void **);
  int	 rn_refines(void *, void *);
  struct radix_node
  	 *rn_addmask(void *, int, int),
 +	 *rn_addmask_r(void *, struct radix_node_head *, int, int),
  	 *rn_addroute (void *, void *, struct radix_node_head *,
  			struct radix_node [2]),
  	 *rn_delete(void *, void *, struct radix_node_head *),
 _______________________________________________
 svn-src-all@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/svn-src-all
 To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
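
The key normalization that rn_addmask_r() performs in the diff above
(zero a per-call buffer, fill the skipped prefix with ones, copy the
mask, trim trailing zero bytes, store the trimmed length in byte 0) can
be sketched in userland as follows. This is an illustration only:
mask_key_normalize() and KEY_MAX are hypothetical names, and the sketch
stops before the tree search and insert. Because the buffer belongs to
the caller, it needs none of the last_zeroed bookkeeping that the shared
static addmask_key required.

```c
#include <assert.h>
#include <string.h>

#define KEY_MAX	32	/* mirrors RADIX_MAX_KEY_LEN in the diff */

/*
 * Hypothetical helper: build a normalized mask key in the caller's buffer.
 * mask[0] holds the mask length, sockaddr style.  Returns the trimmed key
 * length, or 0 when nothing is left beyond the first 'skip' bytes.
 */
static int
mask_key_normalize(const unsigned char *mask, int skip,
    unsigned char out[KEY_MAX])
{
	int mlen;

	assert(mask != NULL && out != NULL && skip >= 0);
	mlen = mask[0];
	if (mlen > KEY_MAX)
		mlen = KEY_MAX;
	if (skip == 0)
		skip = 1;
	if (mlen <= skip)
		return (0);

	/* A fresh zeroed buffer on every call: no state shared across calls. */
	memset(out, 0, KEY_MAX);
	if (skip > 1)
		memset(out + 1, 0xff, skip - 1);
	memcpy(out + skip, mask + skip, mlen - skip);

	/* Trim trailing zero bytes. */
	while (mlen > 0 && out[mlen - 1] == 0)
		mlen--;
	if (mlen <= skip)
		return (0);

	out[0] = (unsigned char)mlen;	/* byte 0 carries the key length */
	return (mlen);
}
```

Two masks that differ only in trailing zero bytes produce the same key
here, so they would resolve to the same node in a mask tree.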
 

From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 15:25:53 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 4FC0CC3E;
 Tue, 29 Oct 2013 15:25:53 +0000 (UTC)
 (envelope-from VenkatKumar.Duvvuru@Emulex.Com)
Received: from CMEXEDGE1.ext.emulex.com (cmexedge1.ext.emulex.com
 [138.239.224.99]) (using TLSv1 with cipher AES128-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 2D2F02E67;
 Tue, 29 Oct 2013 15:25:52 +0000 (UTC)
Received: from CMEXHTCAS1.ad.emulex.com (138.239.115.217) by
 CMEXEDGE1.ext.emulex.com (138.239.224.99) with Microsoft SMTP Server (TLS) id
 14.3.146.0; Tue, 29 Oct 2013 08:11:01 -0700
Received: from CMEXMB1.ad.emulex.com ([169.254.1.123]) by
 CMEXHTCAS1.ad.emulex.com ([2002:8aef:71b7::8aef:71b7]) with mapi id
 14.03.0146.002; Tue, 29 Oct 2013 08:10:44 -0700
From: Venkata Duvvuru <VenkatKumar.Duvvuru@Emulex.Com>
To: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>,
 "freebsd-current@freebsd.org" <freebsd-current@freebsd.org>
Subject: taskqueue_enqueue_fast in freebsd 10.0-current
Thread-Topic: taskqueue_enqueue_fast in freebsd 10.0-current
Thread-Index: Ac7UuEkxdur08036S5KLJpYxkoqGeQ==
Date: Tue, 29 Oct 2013 15:10:44 +0000
Message-ID: <BF3270C86E8B1349A26C34E4EC1C44CB2C728A31@CMEXMB1.ad.emulex.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [138.239.141.147]
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 15:25:53 -0000

Hi,
In FreeBSD 10.0-CURRENT with Emulex's OCE driver, I observe that the
bottom half is hogging all the CPU, which leads to system sluggishness.
On the same hardware running 9.1-RELEASE everything is fine: the bottom
half takes no more than 10% of the CPU even at line rate. On
10.0-CURRENT, at half of line-rate throughput, all the CPUs peak, and
"top -aSCHIP" shows that it is the bottom half that is eating the CPU.
Did anything change in FreeBSD 10.0-CURRENT that I should be careful
about? Please clarify.

Thanks,
Venkat.

From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 15:38:00 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id DD1777C5;
 Tue, 29 Oct 2013 15:38:00 +0000 (UTC)
 (envelope-from adrian.chadd@gmail.com)
Received: from mail-qc0-x232.google.com (mail-qc0-x232.google.com
 [IPv6:2607:f8b0:400d:c01::232])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 8CF8E2F76;
 Tue, 29 Oct 2013 15:38:00 +0000 (UTC)
Received: by mail-qc0-f178.google.com with SMTP id x19so9239qcw.37
 for <multiple recipients>; Tue, 29 Oct 2013 08:37:59 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type:content-transfer-encoding;
 bh=pAnEiUut7GbyPU+0rRmkwbmWBKTL0AKETdvyn83Mrwo=;
 b=wXMb6d0VE4dN5Izq3XE8ooVC5TQ6X6DRTf1KAJln74Nhs0mmCNUIKnhj3Q+sDHfBwf
 NvdDIJE8bAwwPwYneUfa0okMT2e8L7JfXJSqlrOJ8x2jyZR9WJVNfbwti78xvkDc3q0D
 rLs0QlLdELA9Bi5d9Yg/p3591y8/JtoFqR1rgw7GTHr4Nqrhy5VwF6HMDCq10IXKVQGE
 CTr0w/qs7zNHG9C3+Lggr7yfYcPh4au8YPx/groUhZNAW6EFQKW8xGtiDZxTd3OCehgN
 SOHNtRHy9ugW/y1OJzu9V/h9PBzlyNomNcBQI9oeEPP+TDO2bpJVIadSJ5W+RBAe5oMC
 Mgag==
MIME-Version: 1.0
X-Received: by 10.49.62.3 with SMTP id u3mr458231qer.6.1383061079812; Tue, 29
 Oct 2013 08:37:59 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.224.207.66 with HTTP; Tue, 29 Oct 2013 08:37:59 -0700 (PDT)
In-Reply-To: <BF3270C86E8B1349A26C34E4EC1C44CB2C728A31@CMEXMB1.ad.emulex.com>
References: <BF3270C86E8B1349A26C34E4EC1C44CB2C728A31@CMEXMB1.ad.emulex.com>
Date: Tue, 29 Oct 2013 08:37:59 -0700
X-Google-Sender-Auth: Y5sXaxVSz_GBYamW0Scor6GhLzY
Message-ID: <CAJ-VmonOzAmOi6+=o0rhwsQL5gwKHUE5HBb5qRsaNMhS9eJXDA@mail.gmail.com>
Subject: Re: taskqueue_enqueue_fast in freebsd 10.0-current
From: Adrian Chadd <adrian@freebsd.org>
To: Venkata Duvvuru <VenkatKumar.Duvvuru@emulex.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>,
 "freebsd-current@freebsd.org" <freebsd-current@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 15:38:00 -0000

Hi,

On 29 October 2013 08:10, Venkata Duvvuru
<VenkatKumar.Duvvuru@emulex.com> wrote:
> Hi,
> In FreeBSD 10.0-CURRENT with Emulex's OCE driver, I observe that the
> bottom half is hogging all the CPU, which leads to system sluggishness.
> On the same hardware running 9.1-RELEASE everything is fine: the bottom
> half takes no more than 10% of the CPU even at line rate. On
> 10.0-CURRENT, at half of line-rate throughput, all the CPUs peak, and
> "top -aSCHIP" shows that it is the bottom half that is eating the CPU.
> Did anything change in FreeBSD 10.0-CURRENT that I should be careful
> about? Please clarify.


spin up hwpmc and see what the story is.

Which CPU is it?


-a

From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 18:31:12 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 438C6D5B
 for <net@freebsd.org>; Tue, 29 Oct 2013 18:31:12 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id AB3412BFE
 for <net@freebsd.org>; Tue, 29 Oct 2013 18:31:11 +0000 (UTC)
Received: (qmail 57064 invoked from network); 29 Oct 2013 19:01:41 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <rrs@lakerest.net>; 29 Oct 2013 19:01:41 -0000
Message-ID: <526FFED9.1070704@freebsd.org>
Date: Tue, 29 Oct 2013 19:30:49 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Randall Stewart <rrs@lakerest.net>, net@freebsd.org
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
In-Reply-To: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 18:31:12 -0000

On 29.10.2013 11:50, Randall Stewart wrote:
> Hi:
>
> As discussed at vBSDcon with andre/emaste and gnn, I am sending
> this patch out to all of you ;-)

I wasn't at vBSDcon but it's good that you're sending it (again). ;)

> I have previously sent it to gnn, andre, jhb, rwatson, and several other
> of the usual suspects (as gnn put it) and received dead silence.

Sorry 'bout that.  Too many things going on recently.

> What does this patch do?
>
> Well it add the ability to do multi-queue at the driver level. Basically
> any driver that uses the new interface gets under it N queues (default
> is 8) for each physical transmit ring it has. The driver picks up
> its queue 0 first, then queue 1 .. up to the max.

To make sure I understand this correctly: there are 8 soft-queues for each real
transmit ring, correct?  And the driver will dequeue the lowest numbered
queue for as long as there are packets in it.  Like a hierarchical strict
queuing discipline.

This is prone to head of line blocking and starvation by higher priority
queues.  May become a big problem under adverse traffic patterns.
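
To make the starvation concern concrete, here is a minimal sketch of a
strict-priority dequeue over the soft queues, assuming queue 0 is served
first as described above.  The names (struct softq, sched_dequeue) and the
FIFO-of-ids representation are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define NQUEUES 8      /* soft queues per transmit ring (the stated default) */
#define QDEPTH  64     /* hypothetical per-queue depth */

/* Hypothetical soft queue: a FIFO of packet ids. */
struct softq {
	int    pkt[QDEPTH];
	size_t head, count;
};

struct txring_sched {
	struct softq q[NQUEUES];   /* q[0] is the highest priority */
};

/* Append a packet id; returns 0 on success, -1 if the queue is full. */
static int
sq_enqueue(struct softq *sq, int id)
{
	if (sq->count == QDEPTH)
		return (-1);
	sq->pkt[(sq->head + sq->count) % QDEPTH] = id;
	sq->count++;
	return (0);
}

/*
 * Strict priority: always serve the lowest-numbered non-empty queue.
 * Returns the packet id, or -1 if every queue is empty.
 */
static int
sched_dequeue(struct txring_sched *s)
{
	for (int i = 0; i < NQUEUES; i++) {
		struct softq *sq = &s->q[i];

		if (sq->count > 0) {
			int id = sq->pkt[sq->head];

			sq->head = (sq->head + 1) % QDEPTH;
			sq->count--;
			return (id);
		}
	}
	return (-1);
}
```

As long as anything keeps arriving on q[0], sched_dequeue() never reaches
q[7]; a packet queued there waits until every lower-numbered queue has
drained.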

> This allows you to prioritize packets. Also in here is the start of some
> work I will be doing for AQM.. think either Pi or Codel ;-)
>
> Right now thats pretty simple and just (in a few drivers) as the ability
> to limit the amount of data on the ring... which can help reduce buffer
> bloat. That needs to be refined into a lot more.

We actually have two queues, the soft-queue and the hardware ring which
both can be rather large leading to various issues as you mention.

I've started work on an FF contract to rethink the whole IFQ* model and
to propose and benchmark different approaches.  After that, to convert all
drivers in the tree to the chosen model(s) and get rid of the legacy.  In
general the choice of model will be made in the driver and no longer by
the ifnet layer.  One or (most likely) more optimized models will be
provided by the kernel for drivers to choose from.  The idea is that most,
if not all, drivers use these standard kernel-provided models to avoid
code duplication.  However, as the pace of new features is quite high,
we leave full discretion to the driver to choose and experiment
with its own ways of dealing with it.  This is under the assumption
that once a new model has been found it is later moved to the kernel
side and subsequently used by other drivers as well.

> This work is donated by Adara Networks and has been discussed in several
> of the past vendor summits.
>
> I plan on committing this before the IETF unless I hear major objections.

There seem to be a couple of whitespace issues, where an indent starts with
a tab and is then followed by spaces, and others all over the place.

There seem to be a number of unrelated changes in sys/dev/cesa/cesa.c,
sys/dev/fdt/fdt_common.c, sys/dev/fdt/simplebus.c, sys/kern/subr_bus.c,
usr.sbin/ofwdump/ofwdump.c.

It would be good to separate the soft multi-queue changes from the ring
depth changes and commit each on its own (in one or more commits).

There are two separate changes to sys/dev/oce/, one is renaming of the lock
macros and the other the change to drbr.

The changes to sys/kern/subr_bufring.c are not style compliant and we normally
don't use Linux-style "wb()" barriers in FreeBSD native code.  The atomic_*
functions should be used instead.
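
As an illustration of expressing the ordering through the atomics rather
than through explicit barriers, here is a minimal userland sketch.  It uses
C11 <stdatomic.h> as a stand-in for the kernel's atomic(9) acquire/release
operations, and it is a single-producer/single-consumer ring, which is
simpler than buf_ring and is not its algorithm.  All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8   /* hypothetical; must be a power of two */

struct spsc_ring {
	_Atomic unsigned prod;      /* next slot the producer writes */
	_Atomic unsigned cons;      /* next slot the consumer reads */
	void *slot[RING_SIZE];
};

static void
ring_init(struct spsc_ring *r)
{
	atomic_init(&r->prod, 0);
	atomic_init(&r->cons, 0);
}

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int
ring_enqueue(struct spsc_ring *r, void *item)
{
	unsigned p = atomic_load_explicit(&r->prod, memory_order_relaxed);
	unsigned c = atomic_load_explicit(&r->cons, memory_order_acquire);

	if (p - c == RING_SIZE)
		return (-1);
	r->slot[p & (RING_SIZE - 1)] = item;
	/* Release: the slot write is visible before the new index is. */
	atomic_store_explicit(&r->prod, p + 1, memory_order_release);
	return (0);
}

/* Consumer side: returns the item, or NULL if the ring is empty. */
static void *
ring_dequeue(struct spsc_ring *r)
{
	unsigned c = atomic_load_explicit(&r->cons, memory_order_relaxed);
	unsigned p = atomic_load_explicit(&r->prod, memory_order_acquire);
	void *item;

	if (c == p)
		return (NULL);
	item = r->slot[c & (RING_SIZE - 1)];
	/* Release: the slot read completes before the slot is handed back. */
	atomic_store_explicit(&r->cons, c + 1, memory_order_release);
	return (item);
}
```

The acquire load of the other side's index pairs with that side's release
store, so no standalone barrier call is needed.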

Why would we need a multi-consumer dequeue?

The new bufring functions on a first glance do seem to be safe on architectures
with a more relaxed memory ordering / cache coherency model than x86.

The atomic dance in a number of drbr_* functions doesn't seem to make much sense
and a single spin-lock may result in atomic operations and bus lock cycles.

There is a huge amount of includes pollution in sys/net/drbr.h which we are
currently trying to get rid of and to avoid for the future.


I like the general conceptual approach but the implementation feels bumpy and
not (yet) ready for prime time.  In any case I'd like to take forward conceptual
parts for the FF sponsored IFQ* rework.

-- 
Andre


From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 19:36:03 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 17B3672F;
 Tue, 29 Oct 2013 19:36:03 +0000 (UTC)
 (envelope-from rrs@lakerest.net)
Received: from lakerest.net (lakerest.net [162.235.35.161])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 978112FFB;
 Tue, 29 Oct 2013 19:36:02 +0000 (UTC)
Received: from [10.1.1.103] (bsd4.lakerest.net [162.235.35.162])
 (authenticated bits=0)
 by lakerest.net (8.14.4/8.14.3) with ESMTP id r9TJZeCj074918
 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NOT);
 Tue, 29 Oct 2013 15:35:40 -0400 (EDT)
 (envelope-from rrs@lakerest.net)
Subject: Re: MQ Patch.
Mime-Version: 1.0 (Apple Message framework v1283)
Content-Type: text/plain; charset=windows-1252
From: Randall Stewart <rrs@lakerest.net>
In-Reply-To: <526FFED9.1070704@freebsd.org>
Date: Tue, 29 Oct 2013 15:35:40 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <A3A878D4-A157-430D-A023-EB1607DE9E5B@lakerest.net>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
To: Andre Oppermann <andre@FreeBSD.org>
X-Mailer: Apple Mail (2.1283)
Cc: net@FreeBSD.org
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 19:36:03 -0000


On Oct 29, 2013, at 2:30 PM, Andre Oppermann wrote:

> On 29.10.2013 11:50, Randall Stewart wrote:
>> Hi:
>>
>> As discussed at vBSDcon with andre/emaste and gnn, I am sending
>> this patch out to all of you ;-)
>
> I wasn't at vBSDcon but it's good that you're sending it (again). ;)
>
>> I have previously sent it to gnn, andre, jhb, rwatson, and several other
>> of the usual suspects (as gnn put it) and received dead silence.
>
> Sorry 'bout that.  Too many things going on recently.
>
>> What does this patch do?
>>
>> Well it add the ability to do multi-queue at the driver level. Basically
>> any driver that uses the new interface gets under it N queues (default
>> is 8) for each physical transmit ring it has. The driver picks up
>> its queue 0 first, then queue 1 .. up to the max.
>
> To make I understand this correctly there are 8 soft-queues for each real
> transmit ring, correct?  And the driver will dequeue the lowest numbered
> queue for as long as there are packets in it.  Like a hierarchical strict
> queuing discipline.
>
> This is prone to head of line blocking and starvation by higher priority
> queues.  May become a big problem under adverse traffic patterns.

That's the whole idea of QoS: you prioritize your traffic
if you don't have enough b/w.

The guys at the bottom get none.

If you don't want it, you can either turn QoS off, i.e. let
everything fall to the bottom bucket, or set the number of
queues to 1, and then nothing changes: queues map 1:1 to transmit rings.


>
>> This allows you to prioritize packets. Also in here is the start of some
>> work I will be doing for AQM.. think either Pi or Codel ;-)
>>
>> Right now thats pretty simple and just (in a few drivers) as the ability
>> to limit the amount of data on the ring... which can help reduce buffer
>> bloat. That needs to be refined into a lot more.
>
> We actually have two queues, the soft-queue and the hardware ring which
> both can be rather large leading to various issues as you mention.


Which is why I first of all set the soft-queue default to 64. In
some ways that is still big.

In order to get rid of the hard-queue you really just have to limit
how much you put in. I have some hooks in for igb here (and em) that
do this, but it's just a first step. The right thing (long term) is
to go to an AQM like Codel or Pi.
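
A minimal sketch of that "limit how much you put in" idea, assuming a
simple byte budget per transmit ring.  The structure and function names
are hypothetical and are not the actual igb(4)/em(4) hooks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-ring accounting; not taken from igb(4) or em(4). */
struct txring_limit {
	size_t bytes_queued;   /* bytes handed to the ring, not yet completed */
	size_t byte_limit;     /* cap on bytes outstanding on the ring */
};

/*
 * Try to reserve room for a packet of len bytes.  Returns true and
 * charges the budget if the packet may go on the ring now.
 */
static bool
txring_reserve(struct txring_limit *t, size_t len)
{
	/* Always admit one packet on an empty ring so large frames can't stall. */
	if (t->bytes_queued != 0 && t->bytes_queued + len > t->byte_limit)
		return (false);
	t->bytes_queued += len;
	return (true);
}

/* Called when the hardware reports len bytes as transmitted. */
static void
txring_complete(struct txring_limit *t, size_t len)
{
	assert(len <= t->bytes_queued);
	t->bytes_queued -= len;
}
```

Packets refused by txring_reserve() stay on the soft queue, where an AQM
can act on them, instead of sitting in the hardware ring.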

Pi would give you coverage of both queues at ingress to the first one
(thinking of a single queue model).

Codel can only handle the soft -> hard queue transition.

But Pi has the standard Cisco patent so it will probably have to be
a loadable module... sigh.

>
> I've started work on an FF contract to rethink the whole IFQ* model and

What is an FF contract?

> to propose and benchmark different approaches.  After that to convert all
> drivers in the tree to the chosen model(s) and get rid of the legacy.  In
> general the choice of model will be done in the driver and no longer by
> the ifnet layer.  One or (most likely) more optimized models will be
> provided by the kernel for drivers to chose from.  The idea that most,
> if not all drivers use these standard kernel provided models to avoid
> code duplication.  However as the pace of new features is quite high
> we provide the full discretion for the driver to choose and experiment
> with their own ways of dealing with it.  This is under the assumption
> that once a now model has been found it is later moved to the kernel
> side and subsequently used by other drivers as well.
>
>> This work is donated by Adara Networks and has been discussed in several
>> of the past vendor summits.
>>
>> I plan on committing this before the IETF unless I hear major objections.
>
> There seems to be a couple of white space issues where first there is a tab
> and then actual whitespace for the second one and others all over the place.
>
> There seem to be a number of unrelated changes in sys/dev/cesa/cesa.c,
> sys/dev/fdt/fdt_common.c, sys/dev/fdt/simplebus.c, sys/kern/subr_bus.c,
> usr.sbin/ofwdump/ofwdump.c.
>

Yeah Fabien Thomas and I have already talked on that.

I had some hold over cruft that I had thought I got out.

The cesa.c changes I committed this AM and the debug stuff was
all reverted out.

Plus a couple of other little tweaks.

I will resend an updated (cleaned up) patch once my build-universe
completes :-)

> It would be good to separate out the soft multi-queue changes from the ring
> depth changes and do each in at least one commit.

I am not sure what you are suggesting here.

>
> There are two separate changes to sys/dev/oce/, one is renaming of the lock
> macros and the other the change to drbr.
Yeah, I hit that because the LOCK name unfortunately conflicted with another,
so on one of my build-universe runs LINT would blow up ;-(

That could definitely be done separately..


>
> The changes to sys/kern/subr_bufring.c are not style compliant and we normally
> don't use Linux "wb()" barriers in FreeBSD native code.  The atomics_* should
> be used instead.
>

Those are taken *directly* from the original code put in by Kip. I just
moved them over when I was refactoring things.

> Why would we need a multi-consumer dequeue?

I can think of one reason: it's called lagg.

R


>
> The new bufring functions on a first glance do seem to be safe on architectures
> with a more relaxed memory ordering / cache coherency model than x86.
>
> The atomic dance in a number of drbr_* functions doesn't seem to make much sense
> and a single spin-lock may result in atomic operations and bus lock cycles.
>
> There is a huge amount of includes pollution in sys/net/drbr.h which we are
> currently trying to get rid of and to avoid for the future.
>
>
> I like the general conceptual approach but the implementation feels bumpy and
> not (yet) ready for prime time.  In any case I'd like to take forward conceptual
> parts for the FF sponsored IFQ* rework.

>
> -- 
> Andre

------------------------------
Randall Stewart
803-317-4952 (cell)


From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 19:39:29 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@smarthost.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 15AD898D;
 Tue, 29 Oct 2013 19:39:29 +0000 (UTC)
 (envelope-from linimon@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id DC475206F;
 Tue, 29 Oct 2013 19:39:28 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r9TJdSrc046700;
 Tue, 29 Oct 2013 19:39:28 GMT
 (envelope-from linimon@freefall.freebsd.org)
Received: (from linimon@localhost)
 by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r9TJdSQV046699;
 Tue, 29 Oct 2013 19:39:28 GMT (envelope-from linimon)
Date: Tue, 29 Oct 2013 19:39:28 GMT
Message-Id: <201310291939.r9TJdSQV046699@freefall.freebsd.org>
To: linimon@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-net@FreeBSD.org
From: linimon@FreeBSD.org
Subject: Re: conf/183407: [rc.d] [patch] Routing restart returns non-zero
 exitcode in case of no extra routing parameter or missing atm/ipx
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 19:39:29 -0000

Old Synopsis: Routing restart returns non-zero exitcode in case of no extra routing parameter or missing atm/ipx
New Synopsis: [rc.d] [patch] Routing restart returns non-zero exitcode in case of no extra routing parameter or missing atm/ipx

Responsible-Changed-From-To: freebsd-bugs->freebsd-net
Responsible-Changed-By: linimon
Responsible-Changed-When: Tue Oct 29 19:38:37 UTC 2013
Responsible-Changed-Why: 
Over to maintainer(s).

http://www.freebsd.org/cgi/query-pr.cgi?pr=183407

From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 19:58:49 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 2420A4BB;
 Tue, 29 Oct 2013 19:58:49 +0000 (UTC)
 (envelope-from rizzo.unipi@gmail.com)
Received: from mail-la0-x22b.google.com (mail-la0-x22b.google.com
 [IPv6:2a00:1450:4010:c03::22b])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 312F221D4;
 Tue, 29 Oct 2013 19:58:48 +0000 (UTC)
Received: by mail-la0-f43.google.com with SMTP id el20so311891lab.30
 for <multiple recipients>; Tue, 29 Oct 2013 12:58:46 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type;
 bh=F5XCkNO1hZrR3ouPXbNqpGHUu7N7YnzpDaE132j70SY=;
 b=mId824GIe9gPmPicnQkgsTWczZlFWQeNHFJDjbuHRlMIbD2QDbbfoJn+TCvk4xvIs9
 XPJWCDgbbwC2FWD9CABGQ18fyXCbIqzEPOyZcn6pxZ6Yk/yVLpkF+bluDMvSgFnRLYc5
 uf7Kh4ayIP8vlzhHq6mB78gaclIA8zpTIy0ycQKyrlBo6NH7eqpmVuXeQXYYD35KtGpa
 fkU2r/53JTVyJHjL7Oh99pfLPLccZzNfCYM/aziLwHuyJfXsTYOZyiTRxV750N3Ewoi6
 OggtCj8vJwkltbYKB8NPDs3BzmZcQvtlPvyO9Yc2Lwd1XfmjBJYqJIKLHuyjEdk1yp5F
 NPIg==
MIME-Version: 1.0
X-Received: by 10.112.235.3 with SMTP id ui3mr1087178lbc.44.1383076726178;
 Tue, 29 Oct 2013 12:58:46 -0700 (PDT)
Sender: rizzo.unipi@gmail.com
Received: by 10.114.172.105 with HTTP; Tue, 29 Oct 2013 12:58:46 -0700 (PDT)
In-Reply-To: <526FFED9.1070704@freebsd.org>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
Date: Tue, 29 Oct 2013 12:58:46 -0700
X-Google-Sender-Auth: ASpkNZvaZKzNaCv6n1oyTYCmwMA
Message-ID: <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
Subject: Re: MQ Patch.
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Andre Oppermann <andre@freebsd.org>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 19:58:49 -0000

my short, top-post comment is that I'd rather see some more
coordination with Andre, and especially some high level README
or other form of documentation explaining the architecture
you have in mind before this goes in.

To expand my point of view (and please do not read me as negative,
i am trying to be constructive and avoid future troubles and
volunteer to help with the design and implementation):

(i'll omit issues re. style and unrelated patches in the diff
because they are premature)

1. Having multiple separate software queues attached to a physical queue
makes sense only if we have a clear and documented plan
for scheduling traffic from these queues into the hw one.
Otherwise it ends up being just another confusing hack
that makes it difficult to reason about device drivers.

We already have something similar now (the drbr queue on top, used
in some cases when the hw ring overflows, and the ALTQ hooks),
and without documentation this does not seem to improve the
current situation.

2. QoS is not just priority scheduling or AQM a-la RED/CODEL/PI,
but a coherent framework where you can classify/partition traffic
into separate queues, apply one of several queue management
(taildrop/RED/CODEL/whatever) and scheduling (which queue to serve next)
policies in an efficient way.

Linux mostly gets this right (they even support hierarchical schedulers).

Dummynet has a reasonable architecture although not hierarchical
and it operates at the IP level (or possibly at layer 2),
which is probably too high (but not necessarily).
We can also recycle the components, i.e. the classifier in ipfw
and the scheduling algorithms. I am happy to help on this.

ALTQ is too old and complex and inefficient and unmaintained to be
considered.

And i cannot comment on your code because you don't really explain
what you want to do and how. Codel/PI are only queue management,
not qos; and strict priority is just one (and probably the worst) policy
one can have.

One comment i can make, however, is that 256 queues are
way too few for a proper system. You need the number to be
dynamic and much larger (e.g. using flowid as a key).
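
As a rough sketch of what "dynamic and keyed by flowid" could look like,
the following maps a 32-bit flow id onto a queue set whose size is chosen
at run time.  Everything here (struct flow_queues, fq_classify, the bit
mixer) is a hypothetical illustration, not code from the patch, ipfw or
dummynet.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical queue set whose size is chosen at run time. */
struct flow_queues {
	uint32_t nqueues;     /* number of queues currently allocated */
	uint32_t *backlog;    /* packets waiting per queue */
};

/* Allocate n empty queues; returns 0 on success, -1 on failure. */
static int
fq_init(struct flow_queues *fq, uint32_t n)
{
	if (n == 0)
		return (-1);
	fq->backlog = calloc(n, sizeof(*fq->backlog));
	if (fq->backlog == NULL)
		return (-1);
	fq->nqueues = n;
	return (0);
}

static void
fq_free(struct flow_queues *fq)
{
	free(fq->backlog);
	fq->backlog = NULL;
	fq->nqueues = 0;
}

/* Mix the bits so that sequential flow ids spread across queues. */
static uint32_t
fq_hash(uint32_t flowid)
{
	flowid ^= flowid >> 16;
	flowid *= 0x45d9f3bU;
	flowid ^= flowid >> 16;
	return (flowid);
}

/* Map a flow id to a queue index; the same flow always gets the same queue. */
static uint32_t
fq_classify(const struct flow_queues *fq, uint32_t flowid)
{
	return (fq_hash(flowid) % fq->nqueues);
}
```

The classifier only picks a queue; which queue to serve next, and when to
drop, remain separate policies layered on top.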

So, to conclude: i fully support any plan to design something that lets us
implement scheduling (and qos, if you want to call it this way)
in a reasonable way, but what is in your patch now does not really
seem to improve the current situation in any way.

cheers
luigi


On Tue, Oct 29, 2013 at 11:30 AM, Andre Oppermann <andre@freebsd.org> wrote:

> On 29.10.2013 11:50, Randall Stewart wrote:
>
>> Hi:
>>
>> As discussed at vBSDcon with andre/emaste and gnn, I am sending
>> this patch out to all of you ;-)
>>
>
> I wasn't at vBSDcon but it's good that you're sending it (again). ;)
>
>
>  I have previously sent it to gnn, andre, jhb, rwatson, and several other
>> of the usual suspects (as gnn put it) and received dead silence.
>>
>
> Sorry 'bout that.  Too many things going on recently.
>
>
>  What does this patch do?
>>
>> Well it add the ability to do multi-queue at the driver level. Basically
>> any driver that uses the new interface gets under it N queues (default
>> is 8) for each physical transmit ring it has. The driver picks up
>> its queue 0 first, then queue 1 .. up to the max.
>>
>
> To make I understand this correctly there are 8 soft-queues for each real
> transmit ring, correct?  And the driver will dequeue the lowest numbered
> queue for as long as there are packets in it.  Like a hierarchical strict
> queuing discipline.
>
> This is prone to head of line blocking and starvation by higher priority
> queues.  May become a big problem under adverse traffic patterns.
>
>
>  This allows you to prioritize packets. Also in here is the start of some
>> work I will be doing for AQM.. think either Pi or Codel ;-)
>>
>> Right now thats pretty simple and just (in a few drivers) as the ability
>> to limit the amount of data on the ring... which can help reduce buffer
>> bloat. That needs to be refined into a lot more.
>>
>
> We actually have two queues, the soft-queue and the hardware ring which
> both can be rather large leading to various issues as you mention.
>
> I've started work on an FF contract to rethink the whole IFQ* model and
> to propose and benchmark different approaches.  After that to convert all
> drivers in the tree to the chosen model(s) and get rid of the legacy.  In
> general the choice of model will be done in the driver and no longer by
> the ifnet layer.  One or (most likely) more optimized models will be
> provided by the kernel for drivers to chose from.  The idea that most,
> if not all drivers use these standard kernel provided models to avoid
> code duplication.  However as the pace of new features is quite high
> we provide the full discretion for the driver to choose and experiment
> with their own ways of dealing with it.  This is under the assumption
> that once a now model has been found it is later moved to the kernel
> side and subsequently used by other drivers as well.
>
>
>  This work is donated by Adara Networks and has been discussed in several
>> of the past vendor summits.
>>
>> I plan on committing this before the IETF unless I hear major objections.
>>
>
> There seems to be a couple of white space issues where first there is a tab
> and then actual whitespace for the second one and others all over the place.
>
> There seem to be a number of unrelated changes in sys/dev/cesa/cesa.c,
> sys/dev/fdt/fdt_common.c, sys/dev/fdt/simplebus.c, sys/kern/subr_bus.c,
> usr.sbin/ofwdump/ofwdump.c.
>
> It would be good to separate out the soft multi-queue changes from the ring
> depth changes and do each in at least one commit.
>
> There are two separate changes to sys/dev/oce/, one is renaming of the lock
> macros and the other the change to drbr.
>
> The changes to sys/kern/subr_bufring.c are not style compliant and we normally
> don't use Linux "wb()" barriers in FreeBSD native code.  The atomics_* should
> be used instead.
>
> Why would we need a multi-consumer dequeue?
>
> The new bufring functions on a first glance do seem to be safe on architectures
> with a more relaxed memory ordering / cache coherency model than x86.
>
> The atomic dance in a number of drbr_* functions doesn't seem to make much sense
> and a single spin-lock may result in atomic operations and bus lock cycles.
>
> There is a huge amount of includes pollution in sys/net/drbr.h which we are
> currently trying to get rid of and to avoid for the future.
>
>
> I like the general conceptual approach but the implementation feels bumpy and
> not (yet) ready for prime time.  In any case I'd like to take forward conceptual
> parts for the FF sponsored IFQ* rework.
>
> --
> Andre
>
>


-- 
-----------------------------------------+-------------------------------
 Prof. Luigi RIZZO, rizzo@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
 TEL      +39-050-2211611               . via Diotisalvi 2
 Mobile   +39-338-6809875               . 56122 PISA (Italy)
-----------------------------------------+-------------------------------

From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 20:03:55 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 2330389D
 for <net@FreeBSD.org>; Tue, 29 Oct 2013 20:03:55 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 5403B225C
 for <net@FreeBSD.org>; Tue, 29 Oct 2013 20:03:54 +0000 (UTC)
Received: (qmail 57437 invoked from network); 29 Oct 2013 20:34:24 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <rrs@lakerest.net>; 29 Oct 2013 20:34:24 -0000
Message-ID: <52701494.6050404@freebsd.org>
Date: Tue, 29 Oct 2013 21:03:32 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Randall Stewart <rrs@lakerest.net>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <A3A878D4-A157-430D-A023-EB1607DE9E5B@lakerest.net>
In-Reply-To: <A3A878D4-A157-430D-A023-EB1607DE9E5B@lakerest.net>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Cc: net@FreeBSD.org
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 20:03:55 -0000

On 29.10.2013 20:35, Randall Stewart wrote:
>
> On Oct 29, 2013, at 2:30 PM, Andre Oppermann wrote:
>
>> On 29.10.2013 11:50, Randall Stewart wrote:
>>> Hi:
>>>
>>> As discussed at vBSDcon with andre/emaste and gnn, I am sending
>>> this patch out to all of you ;-)
>>
>> I wasn't at vBSDcon but it's good that you're sending it (again). ;)
>>
>>> I have previously sent it to gnn, andre, jhb, rwatson, and several other
>>> of the usual suspects (as gnn put it) and received dead silence.
>>
>> Sorry 'bout that.  Too many things going on recently.
>>
>>> What does this patch do?
>>>
>>> Well it add the ability to do multi-queue at the driver level. Basically
>>> any driver that uses the new interface gets under it N queues (default
>>> is 8) for each physical transmit ring it has. The driver picks up
>>> its queue 0 first, then queue 1 .. up to the max.
>>
>> To make I understand this correctly there are 8 soft-queues for each real
>> transmit ring, correct?  And the driver will dequeue the lowest numbered
>> queue for as long as there are packets in it.  Like a hierarchical strict
>> queuing discipline.
>>
>> This is prone to head of line blocking and starvation by higher priority
>> queues.  May become a big problem under adverse traffic patterns.
>
> Thats the whole idea of QOS.. you take and prioritize your traffic
> if you don't have enough b/w.

That is understood.  In most cases it's done on a WFQ basis though, and
strict priority is limited to realtime (VoIP) traffic and also bounded
overall so it cannot monopolize the entire link if something goes wrong.
Almost all documentation from C and J recommends against unbounded
strict priority scheduling for that reason.
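
For comparison with strict priority, here is a minimal sketch of Deficit
Round Robin, one well-known approximation of weighted fair queuing.  It is
offered only as an illustration of weighted sharing; the class layout,
names and quanta are hypothetical, and the bounded realtime class mentioned
above is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define NCLASSES 3
#define QDEPTH   16

/* Hypothetical traffic class: a FIFO of packet lengths plus DRR state. */
struct drr_class {
	unsigned len[QDEPTH];   /* packet lengths in bytes, FIFO order */
	size_t   head, count;
	unsigned quantum;       /* bytes of credit added per round (the weight) */
	unsigned deficit;       /* unused credit carried between rounds */
};

/* Append a packet length; returns 0 on success, -1 if the class is full. */
static int
drr_enqueue(struct drr_class *c, unsigned len)
{
	if (c->count == QDEPTH)
		return (-1);
	c->len[(c->head + c->count) % QDEPTH] = len;
	c->count++;
	return (0);
}

/*
 * Run one DRR round over all classes.  Each backlogged class gets its
 * quantum and sends packets while the head packet fits in its deficit.
 * Adds the bytes sent per class to sent[] and returns the packet count.
 */
static size_t
drr_round(struct drr_class cls[NCLASSES], unsigned sent[NCLASSES])
{
	size_t npkts = 0;

	for (int i = 0; i < NCLASSES; i++) {
		struct drr_class *c = &cls[i];

		if (c->count == 0)
			continue;
		c->deficit += c->quantum;
		while (c->count > 0 && c->len[c->head] <= c->deficit) {
			c->deficit -= c->len[c->head];
			sent[i] += c->len[c->head];
			c->head = (c->head + 1) % QDEPTH;
			c->count--;
			npkts++;
		}
		if (c->count == 0)
			c->deficit = 0;   /* idle classes do not bank credit */
	}
	return (npkts);
}
```

With quanta of 3000, 1500 and 500 bytes and 1500-byte packets, the lowest
class still sends one packet every third round instead of being starved.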

> The guys at the bottom get none..

I wonder how useful an 8 level strict priority actually can be under
load for everything below level 1.  Normally strategic packet loss
as in RED or its more efficient variants together with some WFQ scheme
signals the senders not to increase pace, or actually to slow down a
bit if the link is at capacity.
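
A minimal sketch of the "strategic packet loss" decision in classic RED,
assuming the usual EWMA of queue length and a linear drop probability
between two thresholds.  It omits RED's count-based spacing of drops, and
the structure, names and parameter values are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical RED parameters; the values are illustrative only. */
struct red_state {
	double avg;       /* EWMA of the queue length, in packets */
	double wq;        /* EWMA weight, e.g. 0.002 */
	double min_th;    /* below this average, never drop */
	double max_th;    /* at or above this average, always drop */
	double max_p;     /* drop probability as avg approaches max_th */
};

/* Fold the current queue length into the moving average. */
static void
red_update(struct red_state *r, unsigned qlen)
{
	r->avg = (1.0 - r->wq) * r->avg + r->wq * (double)qlen;
}

/* Drop probability for the current average, in [0, 1]. */
static double
red_drop_prob(const struct red_state *r)
{
	if (r->avg < r->min_th)
		return (0.0);
	if (r->avg >= r->max_th)
		return (1.0);
	return (r->max_p * (r->avg - r->min_th) / (r->max_th - r->min_th));
}

/* Decide on one arrival; u is a caller-supplied uniform sample in [0, 1). */
static bool
red_should_drop(struct red_state *r, unsigned qlen, double u)
{
	red_update(r, qlen);
	return (u < red_drop_prob(r));
}
```

Because drops begin while the average is still below max_th, senders see
loss before the queue is actually full, which is the early signal to slow
down.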

In practice I've never seen a case where full starvation of lower classes
made any sense.  You'd want at least some packets to go through every now
and then, even in scavenger class.

> If you don't want it, you can either turn QOS off.. i.e. let
> everything fall to the bottom bucket. Or even set the number
> of queues to 1, and then nothing changes 1:1 queues to transmit-ring

The default setting probably should be the lowest priority available
and then only have the more important stuff get a higher level rather
than the other way around.

>>> This allows you to prioritize packets. Also in here is the start of some
>>> work I will be doing for AQM.. think either Pi or Codel ;-)
>>>
>>> Right now thats pretty simple and just (in a few drivers) as the ability
>>> to limit the amount of data on the ring... which can help reduce buffer
>>> bloat. That needs to be refined into a lot more.
>>
>> We actually have two queues, the soft-queue and the hardware ring which
>> both can be rather large leading to various issues as you mention.
>
>
> Which is why I first of all set the soft-queue default at 64.. That in
> some ways is still big.

If it's MTU sized packets it should be manageable.  If it's TSO chains
though...

> In order to get rid of the hard-queue you really just have to limit
> how much you put in. I have some hooks in for igb here (and em) that
> do this but its just a first step. The right thing (long term) is
> to go to a AQM like Codel or Pi.

I actually wonder if there is any benefit in soft-queuing at all,
except for the multiple-writer concurrency situation.  The DMA rings
are deep enough already.  If they are full just drop the packet without
tacking another soft-queue at the back of it.

> Pi would give you coverage of both queue's at ingress to the first one (thinking
> of a single queue model)
>
> Codel can only handle the soft-> hard queue transition.

Yup.

> But Pi has the standard Cisco patent so it will probably have to be
> a loadable module... sigh..

Haven't looked at Pi yet.  Do you have a pointer to a sufficiently detailed
paper on it?

>> I've started work on an FF contract to rethink the whole IFQ* model and
>
> What is an FF contract?

FreeBSD Foundation.

>> to propose and benchmark different approaches.  After that to convert all
>> drivers in the tree to the chosen model(s) and get rid of the legacy.  In
>> general the choice of model will be done in the driver and no longer by
>> the ifnet layer.  One or (most likely) more optimized models will be
>> provided by the kernel for drivers to chose from.  The idea that most,
>> if not all drivers use these standard kernel provided models to avoid
>> code duplication.  However as the pace of new features is quite high
>> we provide the full discretion for the driver to choose and experiment
>> with their own ways of dealing with it.  This is under the assumption
>> that once a now model has been found it is later moved to the kernel
>> side and subsequently used by other drivers as well.
>>
>>> This work is donated by Adara Networks and has been discussed in several
>>> of the past vendor summits.
>>>
>>> I plan on committing this before the IETF unless I hear major objections.
>>
>> There seems to be a couple of white space issues where first there is a tab
>> and then actual whitespace for the second one and others all over the place.
>>
>> There seem to be a number of unrelated changes in sys/dev/cesa/cesa.c,
>> sys/dev/fdt/fdt_common.c, sys/dev/fdt/simplebus.c, sys/kern/subr_bus.c,
>> usr.sbin/ofwdump/ofwdump.c.
>>
>
> Yeah Fabien Thomas and I have already talked on that.
>
> I had some hold over cruft that I had thought I got out.
>
> The cesa.c changes I committed this AM and the debug stuff was
> all reverted out.
>
> Plus a couple of other little tweaks.
>
> I will resend an updated (cleaned up patch) once my build-universe completes :-)

OK.

>> It would be good to separate out the soft multi-queue changes from the ring
>> depth changes and do each in at least one commit.
>
> I am not sure what you are suggesting here.

The multi-queue and the ring-depth changes in igb(4) et al should be separate
commits because they are distinct features.  The driver maintainer should sign
off on them too before committing.

>> There are two separate changes to sys/dev/oce/, one is renaming of the lock
>> macros and the other the change to drbr.
> Yeah I hit that because the LOCK name unfortunately conflicted with another so
> on one of my build-universe runs LINT would blow up ;-(
>
> That could definitely be done separately..

Please do so.  All separate functional units should be done as individual
commits to better track them and also to be able to back them out if there's
a problem with one of them.

>> The changes to sys/kern/subr_bufring.c are not style compliant and we normally
>> don't use Linux "wb()" barriers in FreeBSD native code.  The atomics_* should
>> be used instead.
>>
>
> Those are taken *directly* the original code put in by Kip.. I just moved
> them over when I was refactoring things.

Ugh...

>> Why would we need a multi-consumer dequeue?
>
> I can think of one reason.. its called lagg

Lagg should be hash based, so it could pass packets straight through to the
real interface instead of doing such a dance, which only re-orders the
packets of the same stream.

-- 
Andre

> R
>
>
>>
>> The new bufring functions on a first glance do seem to be safe on architectures
>> with a more relaxed memory ordering / cache coherency model than x86.
>>
>> The atomic dance in a number of drbr_* functions doesn't seem to make much sense
>> and a single spin-lock may result in atomic operations and bus lock cycles.
>>
>> There is a huge amount of includes pollution in sys/net/drbr.h which we are
>> currently trying to get rid of and to avoid for the future.
>>
>>
>> I like the general conceptual approach but the implementation feels bumpy and
>> not (yet) ready for prime time.  In any case I'd like to take forward conceptual
>> parts for the FF sponsored IFQ* rework.
>
>>
>> --
>> Andre
>>
>> _______________________________________________
>> freebsd-net@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>>
>
> ------------------------------
> Randall Stewart
> 803-317-4952 (cell)
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>
>


From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 20:20:36 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 94E81245;
 Tue, 29 Oct 2013 20:20:36 +0000 (UTC)
 (envelope-from rrs@lakerest.net)
Received: from lakerest.net (lakerest.net [162.235.35.161])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 02ABD23C6;
 Tue, 29 Oct 2013 20:20:35 +0000 (UTC)
Received: from [10.1.1.103] (bsd4.lakerest.net [162.235.35.162])
 (authenticated bits=0)
 by lakerest.net (8.14.4/8.14.3) with ESMTP id r9TKK8eU075478
 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NOT);
 Tue, 29 Oct 2013 16:20:19 -0400 (EDT)
 (envelope-from rrs@lakerest.net)
Subject: Re: MQ Patch.
Mime-Version: 1.0 (Apple Message framework v1283)
Content-Type: text/plain; charset=windows-1252
From: Randall Stewart <rrs@lakerest.net>
In-Reply-To: <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
Date: Tue, 29 Oct 2013 16:20:08 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
To: Luigi Rizzo <rizzo@iet.unipi.it>
X-Mailer: Apple Mail (2.1283)
Cc: Andre Oppermann <andre@freebsd.org>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 20:20:36 -0000

Luigi:

 comments in line..


On Oct 29, 2013, at 3:58 PM, Luigi Rizzo wrote:

> my short, top-post comment is that I'd rather see some more
> coordination with Andre, and especially some high level README
> or other form of documentation explaining the architecture
> you have in mind before this goes in.
>
> To expand my point of view (and please do not read me as negative,
> i am trying to be constructive and avoid future troubles and
> volunteer to help with the design and implementation):
>
> (i'll omit issues re. style and unrelated patches in the diff
> because they are premature)
>
> 1. Having multiple separate software queues attached to a physical queue
> makes sense only if we have a clear and documented plan
> for scheduling traffic from these queues into the hw one.
> Otherwise it ends up being just another confusing hack
> that makes it difficult to reason about device drivers.
>
> We already have something similar now (with the drbr queue on top
> used in some cases when the hw ring overflows), the ALTQ hooks,
> and without documentation this does not seem to improve the
> current situation.
>


Well, I can't get Adara to give up how it uses these in its product.. I
was lucky to get them to give back the low-level work.

The problem with ALTQ is that it is really broken if you want any sort
of decent performance with queueing. However, with a small bit of work
(i.e. throw away the altq queues themselves and set ALTQ to place the
ac_qos number in here and queue the packet) you could have ALTQ able to
transmit at line-rate and have proper QOS.

> 2. QoS is not just priority scheduling or AQM a-la RED/CODEL/PI,
> but a coherent framework where you can classify/partition traffic
> into separate queues, apply one of several queue management
> (taildrop/RED/CODEL/whatever) and scheduling (which queue to serve next)
> policies in an efficient way.
>
> Linux mostly gets this right (they even support hierarchical schedulers).

Which is what ALTQ attempts to do as well. Again, I can't get Adara
to give their top-level code.. but someone *could* (hint hint) hook altq
up to this and be able to have a reasonable performance model with altq...


>
> Dummynet has a reasonable architecture although not hierarchical
> and it operates at the IP level (or possibly at layer 2),
> which is probably too high (but not necessarily).
> We can also recycle the components, i.e. the classifier in ipfw
> and the scheduling algorithms. I am happy to help on this.
>
> ALTQ is too old and complex and inefficient and unmaintained to be considered.

Exactly..

>
> And i cannot comment on your code because you don't really explain
> what you want to do and how. Codel/PI are only queue management,
> not qos; and strict priority is just one (and probably the worse) policy
> one can have.

Of course but you need them if you want to prevent buffer-bloat.


>
> One comment i can make, however, on the fact that 256 queues are
> way too few for a proper system. You need the number to be
> dynamic and much larger (e.g. using flowid as a key).
>
> So, to conclude: i fully support any plan to design something that lets us
> implement scheduling (and qos, if you want to call it this way)
> in a reasonable way, but what is in your patch now does not really
> seem to improve the current situation in any way.
>


It's a step towards fixing that, and one that I am allowed to give. I can
see why companies get frustrated with trying to give anything to the
project.

R

> cheers
> luigi
>
>
>
> On Tue, Oct 29, 2013 at 11:30 AM, Andre Oppermann <andre@freebsd.org> wrote:
> On 29.10.2013 11:50, Randall Stewart wrote:
> Hi:
>
> As discussed at vBSDcon with andre/emaste and gnn, I am sending
> this patch out to all of you ;-)
>
> I wasn't at vBSDcon but it's good that you're sending it (again). ;)
>
>
> I have previously sent it to gnn, andre, jhb, rwatson, and several other
> of the usual suspects (as gnn put it) and received dead silence.
>
> Sorry 'bout that.  Too many things going on recently.
>
>
> What does this patch do?
>
> Well it add the ability to do multi-queue at the driver level. Basically
> any driver that uses the new interface gets under it N queues (default
> is 8) for each physical transmit ring it has. The driver picks up
> its queue 0 first, then queue 1 .. up to the max.
>
> To make I understand this correctly there are 8 soft-queues for each real
> transmit ring, correct?  And the driver will dequeue the lowest numbered
> queue for as long as there are packets in it.  Like a hierarchical strict
> queuing discipline.
>
> This is prone to head of line blocking and starvation by higher priority
> queues.  May become a big problem under adverse traffic patterns.
>
>
> This allows you to prioritize packets. Also in here is the start of some
> work I will be doing for AQM.. think either Pi or Codel ;-)
>
> Right now thats pretty simple and just (in a few drivers) as the ability
> to limit the amount of data on the ring... which can help reduce buffer
> bloat. That needs to be refined into a lot more.
>
> We actually have two queues, the soft-queue and the hardware ring which
> both can be rather large leading to various issues as you mention.
>
> I've started work on an FF contract to rethink the whole IFQ* model and
> to propose and benchmark different approaches.  After that to convert all
> drivers in the tree to the chosen model(s) and get rid of the legacy.  In
> general the choice of model will be done in the driver and no longer by
> the ifnet layer.  One or (most likely) more optimized models will be
> provided by the kernel for drivers to chose from.  The idea that most,
> if not all drivers use these standard kernel provided models to avoid
> code duplication.  However as the pace of new features is quite high
> we provide the full discretion for the driver to choose and experiment
> with their own ways of dealing with it.  This is under the assumption
> that once a now model has been found it is later moved to the kernel
> side and subsequently used by other drivers as well.
>
>
> This work is donated by Adara Networks and has been discussed in several
> of the past vendor summits.
>
> I plan on committing this before the IETF unless I hear major objections.
>
> There seems to be a couple of white space issues where first there is a tab
> and then actual whitespace for the second one and others all over the place.
>
> There seem to be a number of unrelated changes in sys/dev/cesa/cesa.c,
> sys/dev/fdt/fdt_common.c, sys/dev/fdt/simplebus.c, sys/kern/subr_bus.c,
> usr.sbin/ofwdump/ofwdump.c.
>
> It would be good to separate out the soft multi-queue changes from the ring
> depth changes and do each in at least one commit.
>
> There are two separate changes to sys/dev/oce/, one is renaming of the lock
> macros and the other the change to drbr.
>
> The changes to sys/kern/subr_bufring.c are not style compliant and we normally
> don't use Linux "wb()" barriers in FreeBSD native code.  The atomics_* should
> be used instead.
>
> Why would we need a multi-consumer dequeue?
>
> The new bufring functions on a first glance do seem to be safe on architectures
> with a more relaxed memory ordering / cache coherency model than x86.
>
> The atomic dance in a number of drbr_* functions doesn't seem to make much sense
> and a single spin-lock may result in atomic operations and bus lock cycles.
>
> There is a huge amount of includes pollution in sys/net/drbr.h which we are
> currently trying to get rid of and to avoid for the future.
>
>
> I like the general conceptual approach but the implementation feels bumpy and
> not (yet) ready for prime time.  In any case I'd like to take forward conceptual
> parts for the FF sponsored IFQ* rework.
>
> --
> Andre
>
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>
>
>
> --
> -----------------------------------------+-------------------------------
>  Prof. Luigi RIZZO, rizzo@iet.unipi.it  . Dip. di Ing. dell'Informazione
>  http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
>  TEL      +39-050-2211611               . via Diotisalvi 2
>  Mobile   +39-338-6809875               . 56122 PISA (Italy)
> -----------------------------------------+-------------------------------

------------------------------
Randall Stewart
803-317-4952 (cell)


From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 20:42:10 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 213012F4
 for <net@freebsd.org>; Tue, 29 Oct 2013 20:42:10 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 3DC992533
 for <net@freebsd.org>; Tue, 29 Oct 2013 20:42:08 +0000 (UTC)
Received: (qmail 57695 invoked from network); 29 Oct 2013 21:12:39 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <rizzo@iet.unipi.it>; 29 Oct 2013 21:12:39 -0000
Message-ID: <52701D8B.8050907@freebsd.org>
Date: Tue, 29 Oct 2013 21:41:47 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Luigi Rizzo <rizzo@iet.unipi.it>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>	<526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
In-Reply-To: <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Cc: Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 20:42:10 -0000

Let me jump in here and explain roughly the ideas/path I'm exploring
in creating and eventually implementing a big picture for drivers,
queues, queue management, various QoS and so on:

Situation: We're still mostly based on the old 4.4BSD IFQ model with
a couple of work-arounds (sndring, drbr), and the bit-rotten ALTQ we
have in tree isn't helpful at all.

Steps:

1. take the soft-queuing method out of the ifnet layer and make it
    a property of the driver, so that the upper stack (or actually
    protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit)
    without any queuing at that point.  It then is up to the driver
    to decide how it multiplexes multi-core access to its queue(s)
    and how they are configured.  Some hardware supports multiple
    queues and some even support WFQ models among these queues in
    hardware.  In that case any soft-queue layer would be omitted.
    For the other cases the kernel will provide one or two proven
    and optimized soft-queue and multi-writer access implementations
    to be used by the drivers.  Drivers should avoid having their
    own soft-queue implementations but they can if they really want
    to.

2. make flowids (or hashes) an integral part of the network stack.
    The mbuf header fully supports it.  If the hardware provides a
    flowid (Toeplitz for example) use it, otherwise compute a hash
    a bit up the stack for incoming packets.  Outgoing packets get
    their hash based on the inpcb or whatever.  In- and outbound
    directions are totally separate and don't have to use the same
    hash, it only has to be constant within a flow.  In theory it
    could be randomly chosen at flow setup time (eg. tcp connect).
    This way the load can be distributed among multiple hw queues
    or interfaces in the case of lagg(4) with a single mbuf header
    lookup.  When we can make sure that every packet has a flowid
    many things become possible and even easy.  Again drivers should
    not invent their own software implementations but rely on the
    kernel to provide them.

3. make QoS/CoS an integral part of the network stack.  The first
    step is done with the qoscos field in the mbuf header.  It is
    eight bits wide and its use/semantics haven't been fully
    established yet.  However the idea is to have a classifier tag
    the packet when it enters the network stack, either by coming
    in on an interface or by being generated within the stack.
    The qoscos tag can be taken from layer2 information (vlan header)
    or chosen based on more complex rules through a packet filter
    such as ipfw, pf or ipf.  There won't be any separate classifier
    as in ALTQ anymore.  This is also the path OpenBSD has taken.
    Depending on the ingress/egress encapsulation the range of
    qos/cos information may be more limited than the 8 bits we have
    in the mbuf header.  In that case the larger range has to be
    mapped into the smaller range by putting neighboring bins together.
    This is how it is done in all routers and routing switches by
    various vendors.  The administrator decides how the mapping is
    done and where it is taken from.

4. adjust the stack and drivers to do all of the above and to
    optimally make use of the hardware capabilities.  If the hardware
    supports multi-queue and SP/WFQ at once (e.g. ixgbe(4)) then there
    is no need for any soft-queuing.  Otherwise the various queuing
    and queue management disciplines will hook into (*if_transmit)
    and do their magic before the packet reaches the DMA ring.  To
    reach this level a bit of infrastructure work has to be done
    first, for example the DMA ring depth needs to be adjustable
    through a generic mechanism for all drivers, and the new-ALTQ
    should be able to hook into the driver's TX completion interrupt
    to clock out the packets.

This should give a rough outline of the path(s) to be explored in
the next weeks.

-- 
Andre


On 29.10.2013 20:58, Luigi Rizzo wrote:
> my short, top-post comment is that I'd rather see some more
> coordination with Andre, and especially some high level README
> or other form of documentation explaining the architecture
> you have in mind before this goes in.
>
> To expand my point of view (and please do not read me as negative,
> i am trying to be constructive and avoid future troubles and
> volunteer to help with the design and implementation):
>
> (i'll omit issues re. style and unrelated patches in the diff
> because they are premature)
>
> 1. Having multiple separate software queues attached to a physical queue
> makes sense only if we have a clear and documented plan
> for scheduling traffic from these queues into the hw one.
> Otherwise it ends up being just another confusing hack
> that makes it difficult to reason about device drivers.
>
> We already have something similar now (with the drbr queue on top
> used in some cases when the hw ring overflows), the ALTQ hooks,
> and without documentation this does not seem to improve the
> current situation.
>
> 2. QoS is not just priority scheduling or AQM a-la RED/CODEL/PI,
> but a coherent framework where you can classify/partition traffic
> into separate queues, apply one of several queue management
> (taildrop/RED/CODEL/whatever) and scheduling (which queue to serve next)
> policies in an efficient way.
>
> Linux mostly gets this right (they even support hierarchical schedulers).
>
> Dummynet has a reasonable architecture although not hierarchical
> and it operates at the IP level (or possibly at layer 2),
> which is probably too high (but not necessarily).
> We can also recycle the components, i.e. the classifier in ipfw
> and the scheduling algorithms. I am happy to help on this.
>
> ALTQ is too old and complex and inefficient and unmaintained to be considered.
>
> And i cannot comment on your code because you don't really explain
> what you want to do and how. Codel/PI are only queue management,
> not qos; and strict priority is just one (and probably the worse) policy
> one can have.
>
> One comment i can make, however, on the fact that 256 queues are
> way too few for a proper system. You need the number to be
> dynamic and much larger (e.g. using flowid as a key).
>
> So, to conclude: i fully support any plan to design something that lets us
> implement scheduling (and qos, if you want to call it this way)
> in a reasonable way, but what is in your patch now does not really
> seem to improve the current situation in any way.
>
> cheers
> luigi
>
>
>
> On Tue, Oct 29, 2013 at 11:30 AM, Andre Oppermann <andre@freebsd.org> wrote:
>
>     On 29.10.2013 11:50, Randall Stewart wrote:
>
>         Hi:
>
>         As discussed at vBSDcon with andre/emaste and gnn, I am sending
>         this patch out to all of you ;-)
>
>
>     I wasn't at vBSDcon but it's good that you're sending it (again). ;)
>
>
>         I have previously sent it to gnn, andre, jhb, rwatson, and several other
>         of the usual suspects (as gnn put it) and received dead silence.
>
>
>     Sorry 'bout that.  Too many things going on recently.
>
>
>         What does this patch do?
>
>         Well it add the ability to do multi-queue at the driver level. Basically
>         any driver that uses the new interface gets under it N queues (default
>         is 8) for each physical transmit ring it has. The driver picks up
>         its queue 0 first, then queue 1 .. up to the max.
>
>
>     To make I understand this correctly there are 8 soft-queues for each real
>     transmit ring, correct?  And the driver will dequeue the lowest numbered
>     queue for as long as there are packets in it.  Like a hierarchical strict
>     queuing discipline.
>
>     This is prone to head of line blocking and starvation by higher priority
>     queues.  May become a big problem under adverse traffic patterns.
>
>
>         This allows you to prioritize packets. Also in here is the start of some
>         work I will be doing for AQM.. think either Pi or Codel ;-)
>
>         Right now thats pretty simple and just (in a few drivers) as the ability
>         to limit the amount of data on the ring... which can help reduce buffer
>         bloat. That needs to be refined into a lot more.
>
>
>     We actually have two queues, the soft-queue and the hardware ring which
>     both can be rather large leading to various issues as you mention.
>
>     I've started work on an FF contract to rethink the whole IFQ* model and
>     to propose and benchmark different approaches.  After that to convert all
>     drivers in the tree to the chosen model(s) and get rid of the legacy.  In
>     general the choice of model will be done in the driver and no longer by
>     the ifnet layer.  One or (most likely) more optimized models will be
>     provided by the kernel for drivers to chose from.  The idea that most,
>     if not all drivers use these standard kernel provided models to avoid
>     code duplication.  However as the pace of new features is quite high
>     we provide the full discretion for the driver to choose and experiment
>     with their own ways of dealing with it.  This is under the assumption
>     that once a now model has been found it is later moved to the kernel
>     side and subsequently used by other drivers as well.
>
>
>         This work is donated by Adara Networks and has been discussed in several
>         of the past vendor summits.
>
>         I plan on committing this before the IETF unless I hear major objections.
>
>
>     There seems to be a couple of white space issues where first there is a tab
>     and then actual whitespace for the second one and others all over the place.
>
>     There seem to be a number of unrelated changes in sys/dev/cesa/cesa.c,
>     sys/dev/fdt/fdt_common.c, sys/dev/fdt/simplebus.c, sys/kern/subr_bus.c,
>     usr.sbin/ofwdump/ofwdump.c.
>
>     It would be good to separate out the soft multi-queue changes from the ring
>     depth changes and do each in at least one commit.
>
>     There are two separate changes to sys/dev/oce/, one is renaming of the lock
>     macros and the other the change to drbr.
>
>     The changes to sys/kern/subr_bufring.c are not style compliant and we normally
>     don't use Linux "wb()" barriers in FreeBSD native code.  The atomics_* should
>     be used instead.
>
>     Why would we need a multi-consumer dequeue?
>
>     The new bufring functions on a first glance do seem to be safe on architectures
>     with a more relaxed memory ordering / cache coherency model than x86.
>
>     The atomic dance in a number of drbr_* functions doesn't seem to make much sense
>     and a single spin-lock may result in atomic operations and bus lock cycles.
>
>     There is a huge amount of includes pollution in sys/net/drbr.h which we are
>     currently trying to get rid of and to avoid for the future.
>
>
>     I like the general conceptual approach but the implementation feels bumpy and
>     not (yet) ready for prime time.  In any case I'd like to take forward conceptual
>     parts for the FF sponsored IFQ* rework.
>
>     --
>     Andre
>
>
>     _______________________________________________
>     freebsd-net@freebsd.org mailing list
>     http://lists.freebsd.org/mailman/listinfo/freebsd-net
>     To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>
>
>
>
> --
> -----------------------------------------+-------------------------------
>   Prof. Luigi RIZZO, rizzo@iet.unipi.it  . Dip. di Ing. dell'Informazione
>   http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
>   TEL      +39-050-2211611               . via Diotisalvi 2
>   Mobile   +39-338-6809875               . 56122 PISA (Italy)
> -----------------------------------------+-------------------------------


From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 20:50:28 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 77851A24
 for <net@freebsd.org>; Tue, 29 Oct 2013 20:50:28 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id DB88625EE
 for <net@freebsd.org>; Tue, 29 Oct 2013 20:50:27 +0000 (UTC)
Received: (qmail 57757 invoked from network); 29 Oct 2013 21:20:57 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <rrs@lakerest.net>; 29 Oct 2013 21:20:57 -0000
Message-ID: <52701F7E.2060604@freebsd.org>
Date: Tue, 29 Oct 2013 21:50:06 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Randall Stewart <rrs@lakerest.net>, Luigi Rizzo <rizzo@iet.unipi.it>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net>
In-Reply-To: <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 20:50:28 -0000

On 29.10.2013 21:20, Randall Stewart wrote:
>> So, to conclude: i fully support any plan to design something that lets us
>> implement scheduling (and qos, if you want to call it this way)
>> in a reasonable way, but what is in your patch now does not really
>> seem to improve the current situation in any way.
>
> It's a step towards fixing that I am allowed to give. I can see
> why companies get frustrated with trying to give anything to the project.

Well, that we have a problem in that area is known and acknowledged and
there is active work in this area going on.

It would be very problematic if every vendor were just to throw some
stuff over the fence and have it integrated as is.  It would quickly
become very messy.  In many special-purpose products a number
of shortcuts can be taken that may not be appropriate for a general
purpose OS that does more than routing.

I believe we value the contribution by Adara and you but at the same
time want to integrate it into a bigger picture for the entire kernel.
When you pull up your product to FreeBSD 11 in the future it should
be easy to stack your functionality again on the new base infrastructure
without many/any modifications.

-- 
Andre


From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 21:02:40 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id B67EF430;
 Tue, 29 Oct 2013 21:02:40 +0000 (UTC)
 (envelope-from adrian.chadd@gmail.com)
Received: from mail-qc0-x231.google.com (mail-qc0-x231.google.com
 [IPv6:2607:f8b0:400d:c01::231])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 64751272B;
 Tue, 29 Oct 2013 21:02:40 +0000 (UTC)
Received: by mail-qc0-f177.google.com with SMTP id u18so280575qcx.36
 for <multiple recipients>; Tue, 29 Oct 2013 14:02:39 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type;
 bh=869yhwA5JFgzW7eN27uMblMpWHeA+FnAVb49v/fEe20=;
 b=eSUvg/Rsio9unPDX7nyGqYBsAgeRR65Z6u8Yoa+VfqWR1sB27iWfAx3dHfGMoHnhC/
 N/9Jjcp9orVYbAslXS3Zw5EilGz0NpNS17xG3+XWxZ1YQygcTDms39pdrav+s/Wm1bBY
 J92gWeZCQ8kW3+ap8NMXC5oHA5S+yC/nMJmLblix1C49rxThmuqCf2SLms6Uw0w/zj3e
 vhC8f0y+JQ7ty5umirJj2aVUPT/5lZLg5xUXupEYt5EUu8Mth3wngLVWfUxqp+CBln89
 ucyW7rv8cyB0e3+nfHWxEywTSIiKznd2OzUiCW/wyR27cKR+lnhoeLKog2VzI8wpnPTX
 iQLg==
MIME-Version: 1.0
X-Received: by 10.49.12.14 with SMTP id u14mr2335894qeb.74.1383080559519; Tue,
 29 Oct 2013 14:02:39 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.224.207.66 with HTTP; Tue, 29 Oct 2013 14:02:39 -0700 (PDT)
In-Reply-To: <52701F7E.2060604@freebsd.org>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net>
 <52701F7E.2060604@freebsd.org>
Date: Tue, 29 Oct 2013 14:02:39 -0700
X-Google-Sender-Auth: CM5CFZliHd3rd1Ywv_53dQLgg3I
Message-ID: <CAJ-VmokJaBhZE+3ZDsi0Yybuvtb_d7AH_RThCKs4inUM=UQrAQ@mail.gmail.com>
Subject: Re: MQ Patch.
From: Adrian Chadd <adrian@freebsd.org>
To: Andre Oppermann <andre@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
Cc: Luigi Rizzo <rizzo@iet.unipi.it>, Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 21:02:40 -0000

[snip everything]

ok, I've reviewed the work.

TL;DR - it's a clearly correct step in the right direction, but I
think we need to just think it through a tad bit more first.

There have been queue discipline and queue management discussions in
the past. Randall's work is a good step in that direction.

I think though that we can take a step back up a little further.

* In terms of queuing frames into multiple queues - yes, we absolutely
should have an if_transmit() path to the driver that obeys "a" QoS
field in the mbuf and pushes it into the relevant queue - with
Randall's work, it's in the driver, but it doesn't _have_ to be;
* In terms of queue servicing and management - we likely need to have
a variety of queue plugins that determine which frame from which queue
gets chosen next to hand to the hardware. The hardware may have
multiple queues! The hardware may have one queue! The application
developer may only want to use one queue! That should be flexible and
easy to plug into things.
* Then we need to support dropping frames during enqueue and dropping
frames during dequeue (ie, on their way to the hardware). That way we
can implement the currently interesting kinds of queue disciplines (eg
CODEL, etc.)
* Should this be done at the driver layer (ie it's a library that each
driver creates and owns), or as a layer above it, controlling the
network device (ie, the Linux queue discipline method)?

So, my comments:

* I don't like how it's hard-coding drbr's into the drivers. Yes, the
underlying state should be a drbr for now. But I'd rather we have a
queue discipline plugin API that drivers create an instance of.
* It'll have methods to init/flush the rings, queue a frame into a
ring, dequeue a frame from a ring, be notified of transmit completions
so more work can be done, etc.
* For people who do latency-sensitive things, they can just bypass
this entirely and go straight to the hardware queues without going
through this kind of intermediary queue layer.
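
To make the plugin idea above concrete, here is a minimal userspace
sketch of what such an API could look like. Every name in it (struct pkt,
struct qdisc_ops, the fifo_* functions) is hypothetical and only stands in
for the real mbuf and driver structures; it is not taken from Randall's
patch or from any existing kernel interface. The discipline shown is the
simplest one, a bounded FIFO that drops at enqueue time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct mbuf. */
struct pkt {
	struct pkt *next;
};

/* Hypothetical plugin interface; a driver creates one instance per TX path. */
struct qdisc_ops {
	void        (*init)(void *state);
	void        (*flush)(void *state);
	int         (*enqueue)(void *state, struct pkt *p); /* 0, or -1 if dropped */
	struct pkt *(*dequeue)(void *state);
	void        (*tx_done)(void *state, int completed);
};

/* Simplest possible discipline: a bounded FIFO with tail drop. */
struct fifo_state {
	struct pkt *head, *tail;
	int         len, limit, drops;
};

static void
fifo_flush(void *s)
{
	struct fifo_state *f = s;

	f->head = f->tail = NULL;	/* sketch: the caller owns the packets */
	f->len = 0;
}

static void
fifo_init(void *s)
{
	struct fifo_state *f = s;

	fifo_flush(f);
	f->drops = 0;			/* limit is set by the driver beforehand */
}

static int
fifo_enqueue(void *s, struct pkt *p)
{
	struct fifo_state *f = s;

	assert(p != NULL);
	if (f->len >= f->limit) {
		f->drops++;		/* drop at enqueue time */
		return (-1);
	}
	p->next = NULL;
	if (f->tail != NULL)
		f->tail->next = p;
	else
		f->head = p;
	f->tail = p;
	f->len++;
	return (0);
}

static struct pkt *
fifo_dequeue(void *s)
{
	struct fifo_state *f = s;
	struct pkt *p = f->head;

	if (p == NULL)
		return (NULL);
	f->head = p->next;
	if (f->head == NULL)
		f->tail = NULL;
	f->len--;
	return (p);
}

static void
fifo_tx_done(void *s, int completed)
{
	(void)s;
	(void)completed;	/* a FIFO has nothing to reschedule */
}

const struct qdisc_ops fifo_ops = {
	.init = fifo_init, .flush = fifo_flush, .enqueue = fifo_enqueue,
	.dequeue = fifo_dequeue, .tx_done = fifo_tx_done,
};
```

A driver would only ever call through the ops table, so swapping the FIFO
for something that also drops at dequeue time would not touch driver code.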

Randall - I think we can take your work and turn it into a net library
that implements your queue management routines. That way we can start
enabling people to tinker with it and replace it if they need to.

What do you think?

From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 21:03:45 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 98DF956C;
 Tue, 29 Oct 2013 21:03:45 +0000 (UTC)
 (envelope-from nparhar@gmail.com)
Received: from mail-pb0-x22e.google.com (mail-pb0-x22e.google.com
 [IPv6:2607:f8b0:400e:c01::22e])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 6961F2748;
 Tue, 29 Oct 2013 21:03:45 +0000 (UTC)
Received: by mail-pb0-f46.google.com with SMTP id un4so402292pbc.33
 for <multiple recipients>; Tue, 29 Oct 2013 14:03:44 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject
 :references:in-reply-to:content-type:content-transfer-encoding;
 bh=GX8gLD4SxWt6kl9FF0AgLSK5vSzpNxvKaob1H5XX4mA=;
 b=r6PfG3Mwne9Xc2whqETNi5BgbVvnkR2JQt/MRbLhNhQ3Wl5pmTos9mcHRPjgR6dlpW
 QrzKrk8iIp+GKf0P5+RBE9nu4UWouAf7GsfEwlzc/WYGKxpL5XulywYwdpLoXe9NNr5x
 Mc80LtdMKsqrbFjijFfdMgUDwJrBlxtYgP5pHaCgXytkP4CktaVhrBz9VNfy+aCeBwUJ
 Jg2UfGIVCfMXuaMZ+eEQyO7jXcziOG5br6tPPXmpC5p0K2dkBjg6/bnHUEX4hCBzERyj
 RoXYIgFK29JEc899HG8Nt9Gm2ClzgE9sU/KCwFm1nM3Fw90lsEIHtF8IgypUpD9AZLvL
 pIBQ==
X-Received: by 10.66.233.69 with SMTP id tu5mr2467394pac.78.1383080624103;
 Tue, 29 Oct 2013 14:03:44 -0700 (PDT)
Received: from [10.192.166.0] (stargate.chelsio.com. [67.207.112.58])
 by mx.google.com with ESMTPSA id v4sm36857732pbq.31.2013.10.29.14.03.42
 for <multiple recipients>
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Tue, 29 Oct 2013 14:03:43 -0700 (PDT)
Sender: Navdeep Parhar <nparhar@gmail.com>
Message-ID: <527022AC.4030502@FreeBSD.org>
Date: Tue, 29 Oct 2013 14:03:40 -0700
From: Navdeep Parhar <np@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Andre Oppermann <andre@freebsd.org>, Luigi Rizzo <rizzo@iet.unipi.it>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>	<526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org>
In-Reply-To: <52701D8B.8050907@freebsd.org>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Cc: Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 21:03:45 -0000

On 10/29/13 13:41, Andre Oppermann wrote:
> Let me jump in here and explain roughly the ideas/path I'm exploring
> in creating and eventually implementing a big picture for drivers,
> queues, queue management, various QoS and so on:
> 
> Situation: We're still mostly based on the old 4.4BSD IFQ model with
> a couple of work-arounds (sndring, drbr) and the bit-rotten ALTQ we
> have in tree aren't helpful at all.
> 
> Steps:
> 
> 1. take the soft-queuing method out of the ifnet layer and make it
>    a property of the driver, so that the upper stack (or actually
>    protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit)
>    without any queuing at that point.  It then is up to the driver
>    to decide how it multiplexes multi-core access to its queue(s)
>    and how they are configured. 

It would work out much better if the kernel were aware of the number of
tx queues of a multiq driver and explicitly selected one in if_transmit.
The driver has no information on the CPU affinity etc. of the
applications generating the traffic; the kernel does.  In general, the
kernel has a much better "global view" of the system and some of the
stuff currently in the drivers really should move up into the stack.

Regards,
Navdeep


From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 21:25:56 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 897865A0
 for <net@freebsd.org>; Tue, 29 Oct 2013 21:25:56 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id EAD65291A
 for <net@freebsd.org>; Tue, 29 Oct 2013 21:25:55 +0000 (UTC)
Received: (qmail 57950 invoked from network); 29 Oct 2013 21:56:25 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <np@FreeBSD.org>; 29 Oct 2013 21:56:25 -0000
Message-ID: <527027CE.5040806@freebsd.org>
Date: Tue, 29 Oct 2013 22:25:34 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Navdeep Parhar <np@FreeBSD.org>, Luigi Rizzo <rizzo@iet.unipi.it>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>	<526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
In-Reply-To: <527022AC.4030502@FreeBSD.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 21:25:56 -0000

On 29.10.2013 22:03, Navdeep Parhar wrote:
> On 10/29/13 13:41, Andre Oppermann wrote:
>> Let me jump in here and explain roughly the ideas/path I'm exploring
>> in creating and eventually implementing a big picture for drivers,
>> queues, queue management, various QoS and so on:
>>
>> Situation: We're still mostly based on the old 4.4BSD IFQ model with
>> a couple of work-arounds (sndring, drbr) and the bit-rotten ALTQ we
>> have in tree aren't helpful at all.
>>
>> Steps:
>>
>> 1. take the soft-queuing method out of the ifnet layer and make it
>>     a property of the driver, so that the upper stack (or actually
>>     protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit)
>>     without any queuing at that point.  It then is up to the driver
>>     to decide how it multiplexes multi-core access to its queue(s)
>>     and how they are configured.
>
> It would work out much better if the kernel was aware of the number of
> tx queues of a multiq driver and explicitly selected one in if_transmit.
>   The driver has no information on the CPU affinity etc. of the
> applications generating the traffic; the kernel does.  In general, the
> kernel has a much better "global view" of the system and some of the
> stuff currently in the drivers really should move up into the stack.

I've been thinking a lot about this and have come to the preliminary conclusion
that the upper stack should not tell the driver which queue to use.  There
are way too many possible approaches, each performing better or worse
depending on the use case.  Also we have a big problem with cores vs. queues
mismatches either way (more cores than queues or more queues than cores,
though the latter is much less of a problem).

For now I see these primary multi-hardware-queue approaches to be implemented
first:

a) the driver's (*if_transmit) takes the flowid from the mbuf header and
    selects one of the N hardware DMA rings based on it.  Each of the DMA
    rings is protected by a lock.  Here the assumption is that by having
    enough DMA rings the contention on each of them will be relatively low
    and ideally a flow and ring sort of sticks to a core that sends lots
    of packets into that flow.  Of course it is a statistical certainty that
    some bouncing will be going on.

b) the driver assigns the DMA rings to particular cores, which can thereby
    drive them lockless through a critnest++.  The driver's (*if_transmit) will
    look up the core it got called on and push the traffic out on that DMA ring.
    The problem is the upper stack's affinity, which is not guaranteed.
    This has two consequences: there may be reordering of packets of the same
    flow because the protocol's send function happens to be called from a
    different core the second time.  Or the driver's (*if_transmit) has to
    switch to the right core to complete the transmit for this flow if the
    upper stack migrated/bounced around.  It is rather difficult to assure
    full affinity from userspace down through the upper stack and then to
    the driver.

c) non-multi-queue capable hardware uses a kernel provided set of functions
    to manage the contention for the single resource of a DMA ring.

The point here is that the driver is the right place to make these decisions
because the upper stack lacks (and shouldn't care about) the actual available
hardware and its capabilities.  All necessary information is available to the
driver as well through the appropriate mbuf header fields and the core it is
called on.
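
The difference between approaches a) and b) can be sketched in a few
lines of plain C. The function names are made up for illustration, and a
simple modulo stands in for whatever mapping a real driver would use.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Approach a): map the flow ID onto one of nrings lock-protected rings.
 * Packets of one flow always land on the same ring, so they stay in
 * order no matter which core happens to send them.
 */
unsigned
ring_by_flowid(uint32_t flowid, unsigned nrings)
{
	assert(nrings > 0);
	return (flowid % nrings);
}

/*
 * Approach b): use the ring owned by the calling core.  If the upper
 * stack sends the same flow from a different core the next time, the
 * packet ends up on a different ring and may be reordered.  With more
 * cores than rings several cores share a ring, so it is no longer
 * private to one core.
 */
unsigned
ring_by_cpu(unsigned cpu, unsigned nrings)
{
	assert(nrings > 0);
	return (cpu % nrings);
}
```

With four rings, flow 10 always maps to ring 2 under a), while under b)
the same flow sent first from core 0 and then from core 1 uses rings 0
and 1.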

-- 
Andre


From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 21:35:34 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 96C1ABC2;
 Tue, 29 Oct 2013 21:35:34 +0000 (UTC)
 (envelope-from rizzo.unipi@gmail.com)
Received: from mail-la0-x235.google.com (mail-la0-x235.google.com
 [IPv6:2a00:1450:4010:c03::235])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id CDD9629E1;
 Tue, 29 Oct 2013 21:35:33 +0000 (UTC)
Received: by mail-la0-f53.google.com with SMTP id eo20so388122lab.40
 for <multiple recipients>; Tue, 29 Oct 2013 14:35:31 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type;
 bh=UJbPHBJ5U+JO635o1N/VBaC+cD0OeJweC3LtPpfS8XE=;
 b=y4eWumh4Ll8thrCUKpXW3ffd74f2+ccEKrVOL5HXLtiDedEP0A2bEx1LBixV0TUFax
 ZWPb0tbPwCXvbiIiG895zkWkDXe+MGLZdpP8l75JgvObH0Lfyxcecv88CF/3YITj8p8/
 +QFQt3khQEKa8BxelP4cefX3OVTHncV9JowoDaLRhNwuHpSiwFPVX12m5COEHeBvIT+e
 z8H407LrDOz2lhshBO2+7fjjZYZDsnRY3yQ/A6hp4kdijd+PqginDOkOTgkxTaG/c5q6
 j8ioPEPGRp3x54/vQLwpiymAfzl5OR1uAAu1rHkWYSSwEUrdK4UCvZSrRDkh40oCDW0+
 WNoQ==
MIME-Version: 1.0
X-Received: by 10.112.167.99 with SMTP id zn3mr1317789lbb.34.1383082531605;
 Tue, 29 Oct 2013 14:35:31 -0700 (PDT)
Sender: rizzo.unipi@gmail.com
Received: by 10.114.172.105 with HTTP; Tue, 29 Oct 2013 14:35:31 -0700 (PDT)
In-Reply-To: <52701F7E.2060604@freebsd.org>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net>
 <52701F7E.2060604@freebsd.org>
Date: Tue, 29 Oct 2013 14:35:31 -0700
X-Google-Sender-Auth: LWJQwASElfs9xLH9VsySXAdty9I
Message-ID: <CA+hQ2+iLsHXqvk+XvECJz-NfKa=5BSz-YjRGYZ+Bv2Vbtd0Nbw@mail.gmail.com>
Subject: Re: MQ Patch.
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Andre Oppermann <andre@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 21:35:34 -0000

On Tue, Oct 29, 2013 at 1:50 PM, Andre Oppermann <andre@freebsd.org> wrote:

> On 29.10.2013 21:20, Randall Stewart wrote:
>
>> So, to conclude: i fully support any plan to design something that lets us
>>> implement scheduling (and qos, if you want to call it this way)
>>> in a reasonable way, but what is in your patch now does not really
>>> seem to improve the current situation in any way.
>>>
>>
>> It's a step towards fixing that I am allowed to give. I can see
>> why companies get frustrated with trying to give anything to the project.
>>
>
> Well, that we have a problem in that area is known and acknowledged and
> there is active work in this area going on.
>
> It would be very problematic if every vendor were just to throw some
> stuff over the fence and have it integrated as is.  It would quickly
> become very messy.  In many specific purpose geared products a number
> of shortcuts can be taken that may not be appropriate for a general
> purpose OS that does more than routing.
>

that is exactly the issue.
It is not just FreeBSD that has strict policies on what gets accepted.

Several times (though mostly in the past) I myself have
been asked to reconsider submissions that were too intrusive
or lacking from an architectural point of view. And as annoyed
as I may have been, I have to recognise that the criticism
was legitimate and eventually led to better implementations.

Of course one has much more freedom when playing with a standalone component
(say netmap, or a device driver, or SCTP...)
which does not interfere with the rest of the kernel,
and possibly even fills a hole in the OS.
But this is not one of those cases.

cheers
luigi

From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 21:45:02 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id AA544F53
 for <net@freebsd.org>; Tue, 29 Oct 2013 21:45:02 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 0CF542A9C
 for <net@freebsd.org>; Tue, 29 Oct 2013 21:45:01 +0000 (UTC)
Received: (qmail 58032 invoked from network); 29 Oct 2013 22:15:31 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <adrian@freebsd.org>; 29 Oct 2013 22:15:31 -0000
Message-ID: <52702C48.3010706@freebsd.org>
Date: Tue, 29 Oct 2013 22:44:40 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Adrian Chadd <adrian@freebsd.org>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>	<526FFED9.1070704@freebsd.org>	<CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>	<13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net>	<52701F7E.2060604@freebsd.org>
 <CAJ-VmokJaBhZE+3ZDsi0Yybuvtb_d7AH_RThCKs4inUM=UQrAQ@mail.gmail.com>
In-Reply-To: <CAJ-VmokJaBhZE+3ZDsi0Yybuvtb_d7AH_RThCKs4inUM=UQrAQ@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Luigi Rizzo <rizzo@iet.unipi.it>, Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 21:45:02 -0000

On 29.10.2013 22:02, Adrian Chadd wrote:
> [snip everything]
>
> ok, I've reviewed the work.
>
> TL;DR - it's a clearly correct step in the right direction, but I
> think we need to just think it through a tad bit more first.
>
> There have been queue discipline and queue management discussions in
> the past. Randall's work is a good step in that direction.
>
> I think though that we can take a step back up a little further.
>
> * In terms of queuing frames into multiple queues - yes, we absolutely
> should have an if_transmit() path to the driver that obeys "a" QoS
> field in the mbuf and pushes it into the relevant queue - with
> Randall's work, it's in the driver, but it doesn't _have_ to be;

Only the driver can know how much it can do in hardware and how
much has to be emulated in software.  The kernel should provide
a couple of optimized software emulations that drivers can link in.

> * In terms of queue servicing and management - we likely need to have
> a variety of queue plugins that determine which frame from which queue
> gets chosen next to hand to the hardware. The hardware may have
> multiple queues! The hardware may have one queue! The application
> developer may only want to use one queue! That should be flexible and
> easy to plug into things.

We have to get rid of the current (and mostly mental) model of a
software queue.  The software queue only exists a) for historical
reasons as the first interface didn't have any DMA rings at all;
b) to manage concurrent access to a single or limited shared resource.
In reality the DMA ring is deep enough and *all the queue* we need.

> * Then we need to support dropping frames during queue and dropping
> frames during dequeue (ie, on its way to the hardware). That way we
> can implement the currently interesting kinds of queue disciplines (eg
> CODEL, etc.)

DMA rings by definition are tail drop.  If you want to do active QoS
and queue management you trade the DMA ring size for a software queue
size.  However this is only really an issue for routing types of traffic.
With TCP getting an ENOBUFS on a send attempt is perfectly valid and the
send socket buffer works as our queue.  No need to deep buffer yet once
more in software before the DMA ring.  The only thing is that TCP needs
some polish in that area to prevent it from treating this as a loss event.
Maybe Lawrence can audit and adjust the relevant parts of tcp_output()'s
error handling.  It should simply try again a few milliseconds later
without waiting for a retransmit timeout or the ACK clocking again.
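
As an illustration of "the DMA ring is tail drop", here is a small
userspace model of a TX ring that refuses new packets with ENOBUFS once
it is full. The struct and function names are hypothetical, and the
four-slot depth is only there to keep the example short; real rings are
far deeper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define RING_SLOTS 4	/* illustrative depth only */

struct tx_ring {
	void    *slot[RING_SLOTS];
	unsigned prod;	/* next slot the driver fills */
	unsigned cons;	/* next slot the hardware completes */
};

/* Returns 0 on success, ENOBUFS when the ring is full (tail drop). */
int
ring_tx(struct tx_ring *r, void *pkt)
{
	assert(pkt != NULL);
	if (r->prod - r->cons == RING_SLOTS)
		return (ENOBUFS);
	r->slot[r->prod++ % RING_SLOTS] = pkt;
	return (0);
}

/* TX completion: hand back the oldest packet, or NULL if none is pending. */
void *
ring_tx_done(struct tx_ring *r)
{
	if (r->prod == r->cons)
		return (NULL);
	return (r->slot[r->cons++ % RING_SLOTS]);
}
```

A sender that gets ENOBUFS here keeps the data in its own buffer (the
socket buffer in the TCP case) and retries after a completion has freed a
slot, which is the behaviour argued for above.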

> * Should this be done at the driver layer (ie it's a library that each
> driver creates and owns), or as a layer above it, controlling the
> network device (ie, the linux queue discipline method.)

If the hardware actually supports it, then it should be done in the
driver.  Otherwise the qos and queue management would get shimmed in
and hijack the (*if_transmit) function pointer to do the stuff in
software, ticking out the packets through TX complete callbacks
(or alternatively a timer as in dummynet).
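
The shim idea can be sketched as plain function-pointer interposition.
struct ifn below is a hypothetical, heavily reduced stand-in for struct
ifnet, and the shim just counts and passes packets through where a real
one would queue them and release them from TX completions or a timer.

```c
#include <assert.h>
#include <stddef.h>

struct pkt {
	int prio;
};

/* Hypothetical, much reduced stand-in for struct ifnet. */
struct ifn {
	int (*if_transmit)(struct ifn *, struct pkt *);
	int (*hw_transmit)(struct ifn *, struct pkt *);	/* saved original */
	int   sent;	/* packets handed to the "hardware" */
	int   shaped;	/* packets that passed through the shim */
};

/* The driver's own transmit routine. */
int
drv_transmit(struct ifn *ifp, struct pkt *p)
{
	(void)p;
	ifp->sent++;
	return (0);
}

/* Software QoS shim: sees every packet before the driver does. */
int
shim_transmit(struct ifn *ifp, struct pkt *p)
{
	ifp->shaped++;
	/* A real shim would queue here; this sketch passes straight through. */
	return (ifp->hw_transmit(ifp, p));
}

/* Interpose the shim by taking over the if_transmit pointer. */
void
shim_attach(struct ifn *ifp)
{
	assert(ifp->if_transmit != NULL);
	ifp->hw_transmit = ifp->if_transmit;
	ifp->if_transmit = shim_transmit;
}
```

The upper stack keeps calling (*if_transmit) and never needs to know
whether a shim is attached.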

> So, my comments:
>
> * I don't like how it's hard-coding drbr's into the drivers. Yes, the
> underlying state should be a drbr for now. But I'd rather we have a
> queue discipline plugin API that drivers create an instance of.

Full ACK.  That's the plan.

> * It'll have methods to init/flush the rings, queue a frame into a
> ring, dequeue a frame from a ring, be notified of transmit completions
> so more work can be done, etc.

Pretty much.  Drivers will be required to implement certain functionality
to manage the DMA ring depth and to provide a TX completion callback into
the software qos/queue shim but not the upper stack.

> * For people who do latency-sensitive things, they can just bypass
> this entirely and go straight to the hardware queues without going
> through this kind of intermediary queue layer.

IMHO this should be the default anyways with some provision to manage
contention by multiple cores.  For example by having a single packet
slot for each core in case the DMA ring is already locked by another
core.
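
One way to read the "single packet slot for each core" idea, sketched
with C11 atomics in userspace. All names are hypothetical, sent++ stands
in for writing a descriptor to the DMA ring, and atomic_flag is used as a
try-lock in place of a kernel mutex.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NCORES 4

struct txq {
	atomic_flag     ring_busy;	/* set while one core owns the ring */
	_Atomic(void *) parked[NCORES];	/* one parking slot per core */
	int             sent;		/* packets placed on the ring */
};

/*
 * Returns 0 if the packet went out, 1 if it was parked for the current
 * ring owner to pick up, -1 if this core's slot was already occupied.
 */
int
txq_send(struct txq *q, unsigned core, void *pkt)
{
	void *expected = NULL;
	unsigned i;

	assert(core < NCORES && pkt != NULL);
	if (atomic_flag_test_and_set(&q->ring_busy)) {
		/* Another core owns the ring: park instead of spinning. */
		if (!atomic_compare_exchange_strong(&q->parked[core],
		    &expected, pkt))
			return (-1);
		return (1);
	}
	/* We own the ring: push out what others parked, then our packet. */
	for (i = 0; i < NCORES; i++)
		if (atomic_exchange(&q->parked[i], (void *)NULL) != NULL)
			q->sent++;
	q->sent++;
	atomic_flag_clear(&q->ring_busy);
	return (0);
}
```

The sketch leaves one gap open on purpose: a packet parked just after the
owner scanned the slots stays there until the next send or TX completion,
so a real implementation would need a re-check after releasing the ring.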

> Randall - I think we can take your work and turn it into a net library
> that implements your queue management routines. That way we can start
> enabling people to tinker with it and replace it if they need to.

Moving struct ifnet and the drivers into the new model and making ifnet
opaque has already been signed up for by Gleb and me.  When that is in
place in the next weeks any kind of queue model can be implemented at
the driver's discretion, including Randall's.

-- 
Andre


From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 22:03:14 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 3518960A;
 Tue, 29 Oct 2013 22:03:14 +0000 (UTC)
 (envelope-from nparhar@gmail.com)
Received: from mail-pb0-x233.google.com (mail-pb0-x233.google.com
 [IPv6:2607:f8b0:400e:c01::233])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 052802BBC;
 Tue, 29 Oct 2013 22:03:13 +0000 (UTC)
Received: by mail-pb0-f51.google.com with SMTP id wz7so465047pbc.10
 for <multiple recipients>; Tue, 29 Oct 2013 15:03:13 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject
 :references:in-reply-to:content-type:content-transfer-encoding;
 bh=n5pZqujjMbW1gsx4qa8sGOM3f+lVa4umij4bRjcFZKM=;
 b=huuCOQ4dMOx5dzw3I/jgJ4+t8H6GakygCfCdcTknnlGKka26pWxE7uKmJHoG80X3HB
 j1Tzmw74N9PUY7/bLEuto5UXWqsoz+pDeAbUh8W+CR/gJoqUSpi7mvtsFvRn6AEb9Qs9
 XObSNe/F0CfjyouAiPNLj6VRxcNCzzPF20feax+y7TJxfJWZe8Bqom2mVz2O7UUe3YUZ
 nrLQuaLbvpO7RVIhfsiAhXu0mSVucynUXhFvc1mnxL6p2XD0H5WGCuQUgmohVAjvf/wI
 Rzj1/HGUgClmd0oXRKPSMRnciy7XDcjUolFnuajaVTu4GgLGk3wuhkrhJVjhjXw8oj9S
 KG2w==
X-Received: by 10.68.228.138 with SMTP id si10mr838544pbc.13.1383084193322;
 Tue, 29 Oct 2013 15:03:13 -0700 (PDT)
Received: from [10.192.166.0] (stargate.chelsio.com. [67.207.112.58])
 by mx.google.com with ESMTPSA id qp10sm44953730pab.13.2013.10.29.15.03.11
 for <multiple recipients>
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Tue, 29 Oct 2013 15:03:12 -0700 (PDT)
Sender: Navdeep Parhar <nparhar@gmail.com>
Message-ID: <5270309E.5090403@FreeBSD.org>
Date: Tue, 29 Oct 2013 15:03:10 -0700
From: Navdeep Parhar <np@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Andre Oppermann <andre@freebsd.org>, Luigi Rizzo <rizzo@iet.unipi.it>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>	<526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org>
In-Reply-To: <527027CE.5040806@freebsd.org>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Cc: Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 22:03:14 -0000

On 10/29/13 14:25, Andre Oppermann wrote:
> On 29.10.2013 22:03, Navdeep Parhar wrote:
>> On 10/29/13 13:41, Andre Oppermann wrote:
>>> Let me jump in here and explain roughly the ideas/path I'm exploring
>>> in creating and eventually implementing a big picture for drivers,
>>> queues, queue management, various QoS and so on:
>>>
>>> Situation: We're still mostly based on the old 4.4BSD IFQ model with
>>> a couple of work-arounds (sndring, drbr), and the bit-rotten ALTQ we
>>> have in tree isn't helpful at all.
>>>
>>> Steps:
>>>
>>> 1. take the soft-queuing method out of the ifnet layer and make it
>>>     a property of the driver, so that the upper stack (or actually
>>>     protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit)
>>>     without any queuing at that point.  It then is up to the driver
>>>     to decide how it multiplexes multi-core access to its queue(s)
>>>     and how they are configured.
>>
>> It would work out much better if the kernel was aware of the number of
>> tx queues of a multiq driver and explicitly selected one in if_transmit.
>>   The driver has no information on the CPU affinity etc. of the
>> applications generating the traffic; the kernel does.  In general, the
>> kernel has a much better "global view" of the system and some of the
>> stuff currently in the drivers really should move up into the stack.
> 
> I've been thinking a lot about this and come to the preliminary conclusion
> that the upper stack should not tell the driver which queue to use.  There
> are way too many possible approaches that perform better or worse depending
> on the use case.  Also we have a big problem with cores vs. queues
> mismatches either way (more cores than queues or more queues than cores,
> though the latter is much less of a problem).
> 
> For now I see these primary multi-hardware-queue approaches to be
> implemented first:
> 
> a) the driver's (*if_transmit) takes the flowid from the mbuf header and
>    selects one of the N hardware DMA rings based on it.  Each of the DMA
>    rings is protected by a lock.  Here the assumption is that by having
>    enough DMA rings the contention on each of them will be relatively low
>    and ideally a flow and ring sort of sticks to a core that sends lots
>    of packets into that flow.  Of course it is a statistical certainty that
>    some bouncing will be going on.
> 
> b) the driver assigns the DMA rings to particular cores which by that,
>    through a critnest++ can drive them lockless.  The driver's (*if_transmit)
>    will look up the core it got called on and push the traffic out on that
>    DMA ring.  The problem is the actual upper stack's affinity which is not
>    guaranteed.  This has two consequences: there may be reordering of packets
>    of the same flow because the protocol's send function happens to be called
>    from a different core the second time.  Or the driver's (*if_transmit) has
>    to switch to the right core to complete the transmit for this flow if the
>    upper stack migrated/bounced around.  It is rather difficult to assure
>    full affinity from userspace down through the upper stack and then to
>    the driver.
> 
> c) non-multi-queue capable hardware uses a kernel provided set of functions
>    to manage the contention for the single resource of a DMA ring.
> 
> The point here is that the driver is the right place to make these decisions
> because the upper stack lacks (and shouldn't care about) the actual available
> hardware and its capabilities.  All necessary information is available to the
> driver as well through the appropriate mbuf header fields and the core it is
> called on.
> 

I mildly disagree with most of this, specifically with the part that the
driver is the right place to make these decisions.  But you did say this
was a "preliminary conclusion" so there's hope yet ;-)

Let's wait till you have an early implementation and we are all able to
experiment with it.  To be continued...

Regards,
Navdeep

From owner-freebsd-net@FreeBSD.ORG  Tue Oct 29 23:35:36 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 46E73198
 for <net@freebsd.org>; Tue, 29 Oct 2013 23:35:36 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 7875920E7
 for <net@freebsd.org>; Tue, 29 Oct 2013 23:35:35 +0000 (UTC)
Received: (qmail 58447 invoked from network); 30 Oct 2013 00:05:57 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <np@FreeBSD.org>; 30 Oct 2013 00:05:57 -0000
Message-ID: <5270462B.8050305@freebsd.org>
Date: Wed, 30 Oct 2013 00:35:07 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Navdeep Parhar <np@FreeBSD.org>, Luigi Rizzo <rizzo@iet.unipi.it>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>	<526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
In-Reply-To: <5270309E.5090403@FreeBSD.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Oct 2013 23:35:36 -0000

On 29.10.2013 23:03, Navdeep Parhar wrote:
> On 10/29/13 14:25, Andre Oppermann wrote:
>> On 29.10.2013 22:03, Navdeep Parhar wrote:
>>> On 10/29/13 13:41, Andre Oppermann wrote:
>>>> Let me jump in here and explain roughly the ideas/path I'm exploring
>>>> in creating and eventually implementing a big picture for drivers,
>>>> queues, queue management, various QoS and so on:
>>>>
>>>> Situation: We're still mostly based on the old 4.4BSD IFQ model with
>>>> a couple of work-arounds (sndring, drbr), and the bit-rotten ALTQ we
>>>> have in tree isn't helpful at all.
>>>>
>>>> Steps:
>>>>
>>>> 1. take the soft-queuing method out of the ifnet layer and make it
>>>>      a property of the driver, so that the upper stack (or actually
>>>>      protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit)
>>>>      without any queuing at that point.  It then is up to the driver
>>>>      to decide how it multiplexes multi-core access to its queue(s)
>>>>      and how they are configured.
>>>
>>> It would work out much better if the kernel was aware of the number of
>>> tx queues of a multiq driver and explicitly selected one in if_transmit.
>>>    The driver has no information on the CPU affinity etc. of the
>>> applications generating the traffic; the kernel does.  In general, the
>>> kernel has a much better "global view" of the system and some of the
>>> stuff currently in the drivers really should move up into the stack.
>>
>> I've been thinking a lot about this and come to the preliminary conclusion
>> that the upper stack should not tell the driver which queue to use.  There
>> are way too many possible approaches that perform better or worse depending
>> on the use case.  Also we have a big problem with cores vs. queues
>> mismatches either way (more cores than queues or more queues than cores,
>> though the latter is much less of a problem).
>>
>> For now I see these primary multi-hardware-queue approaches to be
>> implemented first:
>>
>> a) the driver's (*if_transmit) takes the flowid from the mbuf header and
>>     selects one of the N hardware DMA rings based on it.  Each of the DMA
>>     rings is protected by a lock.  Here the assumption is that by having
>>     enough DMA rings the contention on each of them will be relatively low
>>     and ideally a flow and ring sort of sticks to a core that sends lots
>>     of packets into that flow.  Of course it is a statistical certainty that
>>     some bouncing will be going on.
>>
>> b) the driver assigns the DMA rings to particular cores which by that,
>>     through a critnest++ can drive them lockless.  The driver's
>>     (*if_transmit) will look up the core it got called on and push the
>>     traffic out on that DMA ring.  The problem is the actual upper stack's
>>     affinity which is not guaranteed.  This has two consequences: there may
>>     be reordering of packets of the same flow because the protocol's send
>>     function happens to be called from a different core the second time.
>>     Or the driver's (*if_transmit) has to switch to the right core to
>>     complete the transmit for this flow if the upper stack migrated/bounced
>>     around.  It is rather difficult to assure full affinity from userspace
>>     down through the upper stack and then to the driver.
>>
>> c) non-multi-queue capable hardware uses a kernel provided set of functions
>>     to manage the contention for the single resource of a DMA ring.
>>
>> The point here is that the driver is the right place to make these
>> decisions because the upper stack lacks (and shouldn't care about) the
>> actual available hardware and its capabilities.  All necessary information
>> is available to the driver as well through the appropriate mbuf header
>> fields and the core it is called on.
>>
>
> I mildly disagree with most of this, specifically with the part that the
> driver is the right place to make these decisions.  But you did say this
> was a "preliminary conclusion" so there's hope yet ;-)

I've mostly arrived at this conclusion as the least evil place to do it
because of the complexity that would otherwise hit the ifnet boundary.
Having to deal with simple cards that have only one DMA ring and high-end
cards that support 64 times 8 QoS WFQ class DMA rings in one place is messy
to properly abstract.  Also supporting API/ABI forward and backward
compatibility would likely be nightmarish.

The driver isn't really making the decision, it is acting upon the mbuf
header information (flowid, qoscos) and using it together with its intimate
knowledge of the hardware capabilities to get a hopefully close to optimal
result.
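
A minimal sketch of what "acting upon the mbuf header information" could
look like, assuming a card whose DMA rings are grouped by QoS class.  The
struct tx_layout type and select_tx_ring() are hypothetical names made up
for illustration; they are not part of ifnet or of any existing driver.
Only the flowid and qoscos fields are taken from the discussion above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a card's transmit rings (not a real ifnet type). */
struct tx_layout {
	uint32_t nclasses;        /* QoS classes offered by the hardware, e.g. 8 */
	uint32_t rings_per_class; /* DMA rings per class, e.g. 64 */
};

/*
 * Map (flowid, qoscos) from the mbuf header to a ring index.
 * All packets of one flow and class map to the same ring, so the
 * ordering within a flow is preserved.
 */
static uint32_t
select_tx_ring(const struct tx_layout *l, uint32_t flowid, uint32_t qoscos)
{
	assert(l->nclasses > 0 && l->rings_per_class > 0);
	/* Assumption: unknown classes are clamped to the last hardware class. */
	if (qoscos >= l->nclasses)
		qoscos = l->nclasses - 1;
	return (qoscos * l->rings_per_class + flowid % l->rings_per_class);
}
```

With { 1, 1 } as the layout every packet ends up on ring 0, which is the
single-ring card; with { 8, 64 } the same function covers the 64 times 8
case without the upper stack knowing about either.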

The holy grail, so to speak, would be to run the entire stack with full
affinity up and down.  That is certainly possible, provided the application
is fully aware of it as well.  In typical mixed load cases this is unlikely
to be the case and the application(s) are floating around.  A full affinity
stack then would have to switch to the right core when the kernel is entered.
This has its own drawbacks again.  However nothing in the new implementations
should prevent us from running the stack in full affinity mode.

> Let's wait till you have an early implementation and we are all able to
> experiment with it.  To be continued...

By all means feel free to bring up your own ideas and experiences from
other implementations as well, either in public or private.  I'm more
than happy to discuss and include other ideas.  In the end the cold hard
numbers and the suitability for a general purpose OS will decide.  My goal
is to be good to very good in > 90% of all common use cases, while providing
all necessary knobs, be it in the form of KLDs with a well defined API,
to push particular workloads to the full 99.9%.

-- 
Andre


From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 01:43:23 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 1381C552;
 Wed, 30 Oct 2013 01:43:23 +0000 (UTC)
 (envelope-from adrian.chadd@gmail.com)
Received: from mail-qc0-x230.google.com (mail-qc0-x230.google.com
 [IPv6:2607:f8b0:400d:c01::230])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id A4D73290E;
 Wed, 30 Oct 2013 01:43:22 +0000 (UTC)
Received: by mail-qc0-f176.google.com with SMTP id s19so440471qcw.21
 for <multiple recipients>; Tue, 29 Oct 2013 18:43:21 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type;
 bh=D0YUnNrTReQQQDH3znp0Xh5/TZKggRchWybc/mEmT8Y=;
 b=zNb9G375f3L6FybkPfedwnwZliOQsx0NqkUwu/HqUOzcGjLgDTopLVmnn7cjWJLpic
 Th21QNzAag3rcve/U3HZgSCnV6PvuCG2rj6Vr5cVELlTUTeJ4eGDTAcFHd8Oh7+zz41S
 fp6Zg86n5MiqrqCCiZnVBDdsq/cs/L+Se3Ebh5jOyJPQxFiKv/Q1YgNRV1cvc2c0G/Re
 vItz6M+AYGOuJRvNJyLiYO17fP+I1/EPtxQqUCh036AMh7N5Ffmf3cZbWkngVck98eI7
 kmI5XUNykJB6BkrES6zSlbxsXHe4Ikn/C0vC16iPL0lsrq+IP/k9i0zYEr/xwPZXfWNM
 qeOA==
MIME-Version: 1.0
X-Received: by 10.224.37.198 with SMTP id y6mr4756827qad.104.1383097401726;
 Tue, 29 Oct 2013 18:43:21 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.224.207.66 with HTTP; Tue, 29 Oct 2013 18:43:21 -0700 (PDT)
In-Reply-To: <5270462B.8050305@freebsd.org>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
 <5270462B.8050305@freebsd.org>
Date: Tue, 29 Oct 2013 18:43:21 -0700
X-Google-Sender-Auth: H-5o5ybupz8gIqhONyo4Y6qTIv4
Message-ID: <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
Subject: Re: MQ Patch.
From: Adrian Chadd <adrian@freebsd.org>
To: Andre Oppermann <andre@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
Cc: "freebsd-net@freebsd.org" <net@freebsd.org>,
 Luigi Rizzo <rizzo@iet.unipi.it>, Navdeep Parhar <np@freebsd.org>,
 Randall Stewart <rrs@lakerest.net>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 01:43:23 -0000

Hi,

We can't assume the hardware has deep queues _and_ we can't just hand
packets to the DMA engine.

Why?

Because once you've pushed it into the transmit ring, you can't
guarantee / impose any ordering on things. You can't guarantee that
you can abort a frame that has been queued because it now breaks the
queue rules.

That's why we don't want to just have a light wrapper around hardware
transmit queues. We give up way too much useful control.

I've seen this both when doing wifi (where I absolutely have to have
per-node, per-TID queues, far before it hits the hardware) and doing
WAN style optimisation, where I want to ensure I only queue a handful
of milliseconds of frames to the hardware so I can ensure I can hit
QoS requirements (eg there being a large amount of bulk data, then I
want to inject some voice traffic that should go out sooner..)
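
A rough sketch of the "only a handful of milliseconds in the hardware"
idea, assuming the driver tracks how many bytes it has handed to the
transmit ring.  struct hw_budget and the three helpers are hypothetical
names for illustration, not an existing kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical byte budget for one hardware transmit ring. */
struct hw_budget {
	uint64_t limit_bytes;    /* at most this much may sit in the ring */
	uint64_t inflight_bytes; /* handed to the ring, not yet completed */
};

/* Bytes the link drains in 'ms' milliseconds at 'link_bps' bits per second. */
static uint64_t
budget_bytes(uint64_t link_bps, uint64_t ms)
{
	return (link_bps * ms / 8000);
}

/* Returns true if the frame may go to the hardware now; otherwise it
 * stays in the software queue where it can still be reordered or dropped. */
static bool
hw_budget_admit(struct hw_budget *b, uint64_t len)
{
	if (b->inflight_bytes + len > b->limit_bytes)
		return (false);
	b->inflight_bytes += len;
	return (true);
}

/* Called from transmit completion to give the bytes back. */
static void
hw_budget_complete(struct hw_budget *b, uint64_t len)
{
	assert(len <= b->inflight_bytes);
	b->inflight_bytes -= len;
}
```

At 1 Gbit/s a 2 ms budget works out to 250000 bytes, so a voice frame
that arrives behind bulk data waits at most about that long in the ring.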

Thanks,


-adrian

From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 03:16:45 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 3287F79B;
 Wed, 30 Oct 2013 03:16:45 +0000 (UTC)
 (envelope-from julian@freebsd.org)
Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 012B12E4D;
 Wed, 30 Oct 2013 03:16:44 +0000 (UTC)
Received: from Julian-MBP3.local
 (ppp121-45-253-246.lns20.per2.internode.on.net [121.45.253.246])
 (authenticated bits=0)
 by vps1.elischer.org (8.14.7/8.14.7) with ESMTP id r9U3Gcrv021556
 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO);
 Tue, 29 Oct 2013 20:16:41 -0700 (PDT)
 (envelope-from julian@freebsd.org)
Message-ID: <52707A10.6040105@freebsd.org>
Date: Wed, 30 Oct 2013 11:16:32 +0800
From: Julian Elischer <julian@freebsd.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Adrian Chadd <adrian@freebsd.org>, Andre Oppermann <andre@freebsd.org>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net>
 <52701F7E.2060604@freebsd.org>
 <CAJ-VmokJaBhZE+3ZDsi0Yybuvtb_d7AH_RThCKs4inUM=UQrAQ@mail.gmail.com>
In-Reply-To: <CAJ-VmokJaBhZE+3ZDsi0Yybuvtb_d7AH_RThCKs4inUM=UQrAQ@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Luigi Rizzo <rizzo@iet.unipi.it>, Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 03:16:45 -0000

On 10/30/13, 5:02 AM, Adrian Chadd wrote:
> [snip everything]
>
>
> Randall - I think we can take your work and turn it into a net library
> that implements your queue management routines. That way we can start
> enabling people to tinker with it and replace it if they need to.

To make a point on Randall's comment on contributing code:
The advantage to you (Adara) is that even if we don't put your code in
directly, we are now on notice that whatever we do must take into account
your requirements, so that in 11, while it may not be a 'coding-free'
upgrade, it should at worst be a 'trivial coding' upgrade.

>
> What do you think?


From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 03:30:52 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 589DDA8A;
 Wed, 30 Oct 2013 03:30:52 +0000 (UTC)
 (envelope-from julian@freebsd.org)
Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 27DAA2F03;
 Wed, 30 Oct 2013 03:30:51 +0000 (UTC)
Received: from Julian-MBP3.local
 (ppp121-45-253-246.lns20.per2.internode.on.net [121.45.253.246])
 (authenticated bits=0)
 by vps1.elischer.org (8.14.7/8.14.7) with ESMTP id r9U3UjXT021610
 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO);
 Tue, 29 Oct 2013 20:30:48 -0700 (PDT)
 (envelope-from julian@freebsd.org)
Message-ID: <52707D60.1070001@freebsd.org>
Date: Wed, 30 Oct 2013 11:30:40 +0800
From: Julian Elischer <julian@freebsd.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Andre Oppermann <andre@freebsd.org>, Navdeep Parhar <np@FreeBSD.org>,
 Luigi Rizzo <rizzo@iet.unipi.it>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>	<526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
 <5270462B.8050305@freebsd.org>
In-Reply-To: <5270462B.8050305@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 03:30:52 -0000

On 10/30/13, 7:35 AM, Andre Oppermann wrote:
>
> The holy grail, so to speak, would be to run the entire stack with full
> affinity up and down.  That is certainly possible, provided the
> application is fully aware of it as well.  In typical mixed load cases
> this is unlikely to be the case and the application(s) are floating around.
With multithreaded apps it's *most likely* that writes will be coming
from several different CPUs.


From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 04:59:21 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 86E1CF4D;
 Wed, 30 Oct 2013 04:59:21 +0000 (UTC)
 (envelope-from luigi@onelab2.iet.unipi.it)
Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238])
 by mx1.freebsd.org (Postfix) with ESMTP id EDC382329;
 Wed, 30 Oct 2013 04:59:16 +0000 (UTC)
Received: by onelab2.iet.unipi.it (Postfix, from userid 275)
 id 1B4267300A; Wed, 30 Oct 2013 06:00:56 +0100 (CET)
Date: Wed, 30 Oct 2013 06:00:56 +0100
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Adrian Chadd <adrian@freebsd.org>, Andre Oppermann <andre@freebsd.org>,
 Navdeep Parhar <np@freebsd.org>, Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
Subject: [long] Network stack -> NIC flow (was Re: MQ Patch.)
Message-ID: <20131030050056.GA84368@onelab2.iet.unipi.it>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
 <5270462B.8050305@freebsd.org>
 <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 04:59:21 -0000

On Tue, Oct 29, 2013 at 06:43:21PM -0700, Adrian Chadd wrote:
> Hi,
> 
> We can't assume the hardware has deep queues _and_ we can't just hand
> packets to the DMA engine.
> [Adrian explains why]

i have the feeling that the various folks who stepped into this
discussion have completely different (and orthogonal) goals,
and as such these goals should be discussed separately.

Below is the architecture i have in mind and how i would implement it
(and it would be extremely simple since we have most of the pieces
in place).

It would be useful if people could discuss what problem they are
addressing before coming up with patches.

---

The architecture i think we should pursue is this (which happens to be
what linux implements, and also what dummynet implements, except
that the output is to a dummynet pipe or to ether_output() or to
ip_output() depending on the configuration):

   1. multiple (one per core) concurrent transmitters t_c

	which use ether_output_frame() to send to

   2. multiple disjoint queues q_j
	(one per traffic group, can be *a lot*, say 10^6)

	which are scheduled with a scheduler S
        (iterate step 2 for hierarchical schedulers)
	and

   3. eventually feed ONE transmit ring R_j on the NIC.
	Once a packet reaches R_j, for all practical purposes it
	is on the wire. We cannot intercept extractions,
	we cannot interfere with the scheduler in the NIC in
	case of multiqueue NICs. The most we can do (and should,
	as in Linux) is notify the owner of the packet once its
	transmission is complete.

Just to set the terminology:
QUEUE MANAGEMENT POLICY refers to decisions that we make when we INSERT
	or EXTRACT packets from a queue WITHOUT LOOKING AT OTHER QUEUES .
	This is what implements DROPTAIL (also improperly called FIFO),
	RED, CODEL. Note that for CODEL you need to intercept extractions
	from the queue, whereas DROPTAIL and RED only act on
	insertions.

SCHEDULER is the entity which decides which queue to serve among
	the many possible ones. It is called on INSERTIONS and
	EXTRACTIONS from a queue, and passes packets to the NIC's queue.

The decision on which queue and ring (Q_i and R_j) to use should be made
by a classifier at the beginning of step 2 (or once per iteration,
if using a hierarchical scheduler). Of course they can be precomputed
(e.g. with annotations in the mbuf coming from the socket).
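
As a rough illustration of the classifier step, the sketch below maps a
packet to a queue q_i and then to a ring R_j. The struct names, the use
of a flow hash and the modulo mapping are all assumptions made for the
example; they are not existing kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-packet annotation (stands in for data carried in the mbuf). */
struct pkt_tag {
	uint32_t flow_hash;      /* e.g. a hash of the 5-tuple */
	int      preclassified;  /* nonzero if queue_id was set upstream */
	uint32_t queue_id;       /* precomputed queue, valid if preclassified */
};

struct classifier {
	uint32_t n_queues;       /* number of traffic groups, can be large */
	uint32_t n_rings;        /* number of NIC transmit rings */
};

/* Pick the queue: reuse the annotation if present, else hash the flow. */
static uint32_t
classify_queue(const struct classifier *c, const struct pkt_tag *t)
{
	assert(c->n_queues > 0);
	if (t->preclassified)
		return (t->queue_id % c->n_queues);
	return (t->flow_hash % c->n_queues);
}

/* Each queue always feeds the same ring, so a flow never changes ring. */
static uint32_t
queue_to_ring(const struct classifier *c, uint32_t queue)
{
	assert(c->n_rings > 0);
	return (queue % c->n_rings);
}
```

With 10^6 queues and 4 rings, a packet with flow hash 1234567 lands in
queue 234567, which feeds ring 3.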

Now when it comes to implementing the above, we have three
cases (or different optimization levels, if you like)

-- 1. THE SIMPLE CASE ---

In the simplest possible case we can let the NIC do everything.
Necessary conditions are:
- queue management policies acting only on insertions
  (e.g. DROPTAIL or RED or similar);
- # of traffic classes <= # of NIC rings
- scheduling policy S equal to the one implemented in the NIC
  (trivial case: one queue, one ring, no scheduler)

All these cases match exactly what the hardware provides, so we can just
use the NIC ring(s) without extra queue(s), and possibly use something
like buf_ring to manage insertions (but note that insertions in
an empty queue will end up requiring a lock; and i think the
same happens even now with the extra drbr queue in front of the ring).
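
A minimal sketch of the simple case, assuming a toy fixed-size ring in
place of a real NIC descriptor ring (tx_ring, RING_SLOTS and the
function names are made up for the example, and locking is left out).
The point is that DROPTAIL needs only the state of this one ring at
insertion time:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8	/* illustrative size, a power of two */

struct tx_ring {
	void    *slot[RING_SLOTS];
	unsigned head;	/* next slot the "NIC" will consume */
	unsigned tail;	/* next slot the sender will fill */
};

static unsigned
ring_used(const struct tx_ring *r)
{
	return (r->tail - r->head);
}

/* DROPTAIL: accept unless the ring is full; no other queue is consulted. */
static bool
ring_enqueue_droptail(struct tx_ring *r, void *pkt)
{
	assert(r != NULL);
	if (ring_used(r) == RING_SLOTS)
		return (false);		/* caller drops the packet */
	r->slot[r->tail % RING_SLOTS] = pkt;
	r->tail++;
	return (true);
}

/* Models one transmit completion: the oldest slot becomes free again. */
static void *
ring_complete_one(struct tx_ring *r)
{
	void *pkt;

	assert(r != NULL);
	if (ring_used(r) == 0)
		return (NULL);
	pkt = r->slot[r->head % RING_SLOTS];
	r->head++;
	return (pkt);
}
```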


-- 2. THE INTERMEDIATE CASE ---

If we do not care about a scheduler but want a more complex QUEUE
MANAGEMENT, such as CODEL, that acts on extractions, we _must_
implement an intermediate queue Q_i before the NIC ring.  This is
our only chance to act on extractions from the queue (which CODEL
requires).  Note that we DO NOT NEED to create multiple queues for
each ring.
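
To show why the intermediate queue is needed, here is a sketch of a
software queue with a hook on extraction. The policy is deliberately NOT
CODEL: it simply drops packets whose sojourn time exceeds a fixed bound,
which is the simplest rule that can only be applied at dequeue time. All
names (swq, max_sojourn_ns, ...) are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SWQ_SLOTS 64

struct swq_pkt {
	uint64_t enq_ns;	/* timestamp taken at insertion */
	int      id;		/* stands in for the packet itself */
};

struct swq {
	struct swq_pkt slot[SWQ_SLOTS];
	unsigned head, tail;
	uint64_t max_sojourn_ns;	/* extraction-time drop threshold */
	unsigned drops;
};

/* Insertion only timestamps the packet (plus a droptail bound). */
static int
swq_enqueue(struct swq *q, int id, uint64_t now_ns)
{
	assert(q != NULL);
	if (q->tail - q->head == SWQ_SLOTS)
		return (-1);
	q->slot[q->tail % SWQ_SLOTS] = (struct swq_pkt){ now_ns, id };
	q->tail++;
	return (0);
}

/*
 * Extraction: skip (drop) packets that waited too long, hand the first
 * acceptable one to the caller, which would then put it on the NIC ring.
 * Returns 1 if *out was filled, 0 if the queue ran empty.
 */
static int
swq_dequeue(struct swq *q, uint64_t now_ns, struct swq_pkt *out)
{
	assert(q != NULL && out != NULL);
	while (q->head != q->tail) {
		struct swq_pkt p = q->slot[q->head % SWQ_SLOTS];

		q->head++;
		if (now_ns - p.enq_ns > q->max_sojourn_ns) {
			q->drops++;
			continue;
		}
		*out = p;
		return (1);
	}
	return (0);
}
```

Once a packet sits in the NIC ring this decision can no longer be made,
which is why the queue has to live in front of the ring.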

-- 3. THE COMPLETE CASE ---

This is when the scheduler we want (DRR, WFQ variants, PRIORITY...)
is not implemented in the NIC, or we have more queues than those
available in the NIC. In this case we need to invoke this extra
block before passing packets to the NIC.

Remember that dummynet implements exactly #3, and it comes with a
set of pretty efficient schedulers (i have made extensive measurements
on them, see links to papers on my research page
http://info.iet.unipi.it/~luigi/research.html ).
They are by no means a performance bottleneck (scheduling takes
50..200ns depending on the circumstances) in the cases where
it matters to have a scheduler (that is, when the sender is
faster than the NIC, which in turn only happens with large packets,
which take 1..30us to get through at the very least).
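
A back-of-the-envelope check of those numbers, as a sketch: the helper
below computes the serialization time of a frame on a link, ignoring
preamble and inter-frame gap (the function name and the example link
speeds are my own choices).

```c
#include <assert.h>
#include <stdint.h>

/* Time to serialize frame_bytes on a link of link_bps bits/s, in ns. */
static uint64_t
wire_time_ns(uint64_t frame_bytes, uint64_t link_bps)
{
	assert(link_bps > 0);
	return (frame_bytes * 8u * 1000000000ull / link_bps);
}
```

A 1500-byte frame takes 1200 ns at 10 Gbit/s and 12000 ns at 1 Gbit/s,
well above a 50..200ns scheduling cost. A 64-byte frame at 10 Gbit/s
takes about 51 ns, the same order as the scheduler itself.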

--- IMPLEMENTATION ---

Apart from ALTQ (which is very slow and has inefficient schedulers
and i don't think anybody wants to maintain), and with the exception
of dummynet which I'll discuss later, at the moment FreeBSD does not
support schedulers in the tx path of the device driver.

So we can only deal with cases 1 and 2, and for them the software
queue + ring suffices to implement any QUEUE MANAGEMENT policy
(but we don't implement anything).

If we want to support the generic case (#3), we should do the following:

1. device drivers export a function to transmit on an individual ring,
  basically the current if_transmit(), and a hook to play with the
  corresponding queue lock (the scheduler needs to run under lock,
  and we can as well use the ring lock for that).
  Note that the ether_output_frame does not always need to
  call the scheduler: if a packet enters a non-empty queue, we are done.
  
2. device drivers also export the number of tx queues, and
  some (advisory) information on queue status

3. ether_output_frame() runs the classifier (if needed), invokes
  the scheduler (if needed) and possibly falls through into if_transmit()
  for the specific ring.

4. on transmit completions (*_txeof(), typically), a callback invokes
  the scheduler to feed the NIC ring with more packets
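
The four steps can be sketched in user-space C as below. This is only a
toy model under stated assumptions: ring_ops, toy_ring, sched_kick(),
output_frame() and txeof() are hypothetical names, the "scheduler" is a
plain FIFO, packets are ints, and the ring lock from step 1 is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define SQ_CAP   16
#define SENT_CAP 32

/* Step 1: the driver exports a per-ring transmit function. */
struct ring_ops {
	int  (*ring_transmit)(void *ring, int pkt);	/* 0 on success */
	void *ring;
};

struct soft_queue {
	int      pkts[SQ_CAP];
	unsigned head, tail;
};

/* Toy NIC ring: 'space' is the advisory status from step 2. */
struct toy_ring {
	int space;
	int sent[SENT_CAP];
	int nsent;
};

static int
toy_ring_transmit(void *arg, int pkt)
{
	struct toy_ring *r = arg;

	if (r->space == 0 || r->nsent == SENT_CAP)
		return (-1);
	r->space--;
	r->sent[r->nsent++] = pkt;
	return (0);
}

/* The scheduler: feed the ring from the queue while there is room. */
static void
sched_kick(struct soft_queue *q, const struct ring_ops *ops)
{
	while (q->head != q->tail) {
		if (ops->ring_transmit(ops->ring, q->pkts[q->head % SQ_CAP]) != 0)
			break;
		q->head++;
	}
}

/* Step 3: enqueue; the scheduler runs only if the queue was empty. */
static int
output_frame(struct soft_queue *q, const struct ring_ops *ops, int pkt)
{
	int was_empty = (q->head == q->tail);

	assert(ops != NULL && ops->ring_transmit != NULL);
	if (q->tail - q->head == SQ_CAP)
		return (-1);
	q->pkts[q->tail % SQ_CAP] = pkt;
	q->tail++;
	if (was_empty)
		sched_kick(q, ops);
	return (0);
}

/* Step 4: on transmit completion, let the scheduler refill the ring. */
static void
txeof(struct toy_ring *r, int completed, struct soft_queue *q,
    const struct ring_ops *ops)
{
	r->space += completed;
	sched_kick(q, ops);
}
```

With a ring of two slots and four packets offered, the first two go
straight to the ring, the other two wait in the software queue and are
pushed out by txeof() as slots complete.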

I mentioned dummynet: it already implements ALL of this,
including the completion callback in #4. There is a hook
in ether_output_frame(), and the hook was called (up to 8.0
i believe) if_tx_rdy(). You can see what it does in
RELENG_4, sys/netinet/ip_dummynet.c :: if_tx_rdy()

http://svnweb.freebsd.org/base/stable/4/sys/netinet/ip_dummynet.c?revision=123994&view=markup

if_tx_rdy() does not exist anymore because almost nobody used it,
but it is trivial to reimplement, and can be called by device drivers
when *_txeof() finds that it is running low on packets _and_ the
specific NIC needs to implement the "complete" scheduling.

The way it worked in dummynet (I think i used it on 'tun' and 'ed')
is also documented in the manpage:
define a pipe whose bandwidth is set as the device name instead
of a number. Then you can attach a scheduler to the pipe, queues
to the scheduler, and you are done.  Example:

    # this is the scheduler's configuration
	ipfw pipe 10 config bw 'em2'
	ipfw sched 10 config type drr           # deficit round robin
	ipfw queue 1 config weight 30 sched 10  # important
	ipfw queue 2 config weight 5 sched 10   # less important
	ipfw queue 3 config weight 1 sched 10   # who cares...

    # and this is the classifier, which you can skip if the
    # packets are already pre-classified.
    # The infrastructure is already there to implement per-interface
    # configurations.
	ipfw add queue 1 src-port 53
	ipfw add queue 2 src-port 22
	ipfw add queue 3 ip from any to any
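
For readers unfamiliar with deficit round robin, here is a minimal
sketch of the algorithm with three queues weighted 30, 5 and 1 as in the
example. It is not dummynet's code: the structures are made up, packets
are represented only by their length, and the quantum is assumed to be
weight times a fixed number of bytes.

```c
#include <assert.h>
#include <string.h>

#define NQ   3
#define QCAP 64

struct drr_queue {
	int      len[QCAP];	/* packet lengths, FIFO order */
	unsigned head, tail;
	int      quantum;	/* bytes credited per round */
	int      deficit;	/* unused credit carried over */
};

struct drr {
	struct drr_queue q[NQ];
	int cur;		/* queue currently being served */
	int credited;		/* quantum already added for this visit? */
};

static void
drr_init(struct drr *s, const int weight[NQ], int bytes_per_weight)
{
	memset(s, 0, sizeof(*s));
	for (int i = 0; i < NQ; i++) {
		assert(weight[i] > 0 && bytes_per_weight > 0);
		s->q[i].quantum = weight[i] * bytes_per_weight;
	}
}

static int
drr_enqueue(struct drr *s, int qi, int len)
{
	struct drr_queue *q = &s->q[qi];

	if (q->tail - q->head == QCAP)
		return (-1);
	q->len[q->tail % QCAP] = len;
	q->tail++;
	return (0);
}

/* Returns the index of the queue served and its packet length, or -1. */
static int
drr_dequeue(struct drr *s, int *len_out)
{
	int empty_seen = 0;

	while (empty_seen < NQ) {
		struct drr_queue *q = &s->q[s->cur];

		if (q->head == q->tail) {	/* idle queues keep no credit */
			q->deficit = 0;
			s->credited = 0;
			s->cur = (s->cur + 1) % NQ;
			empty_seen++;
			continue;
		}
		empty_seen = 0;
		if (!s->credited) {		/* one quantum per visit */
			q->deficit += q->quantum;
			s->credited = 1;
		}
		if (q->len[q->head % QCAP] <= q->deficit) {
			*len_out = q->len[q->head % QCAP];
			q->deficit -= *len_out;
			q->head++;
			return (s->cur);
		}
		s->credited = 0;		/* head does not fit: next queue */
		s->cur = (s->cur + 1) % NQ;
	}
	return (-1);
}
```

With all queues backlogged with 100-byte packets and 100 bytes of
quantum per unit of weight, the first round serves 30, 5 and 1 packets
from the three queues respectively.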

Now, surely we can replace the implementation of packet queues in dummynet
from the current TAILQ to something resembling buf_ring to improve
write parallelism; and a bit of glue code is needed to attach
per-interface ipfw instances to each interface, and some smarts in
the configuration commands are needed to figure out when we can
bypass everything or not.

But this seems to me a much more viable approach to achieve proper QoS
support in our architecture.

cheers
luigi

From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 05:47:52 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 050E7EF7;
 Wed, 30 Oct 2013 05:47:52 +0000 (UTC)
 (envelope-from jfvogel@gmail.com)
Received: from mail-ve0-x231.google.com (mail-ve0-x231.google.com
 [IPv6:2607:f8b0:400c:c01::231])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 8FE9F2569;
 Wed, 30 Oct 2013 05:47:51 +0000 (UTC)
Received: by mail-ve0-f177.google.com with SMTP id oz11so633720veb.22
 for <multiple recipients>; Tue, 29 Oct 2013 22:47:50 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:in-reply-to:references:date:message-id:subject:from:to
 :cc:content-type;
 bh=hxuYzFcbtk8SsafKaASTm9NEz2rfB6I7OVicnZGVURk=;
 b=vfArYMQNx4x+0Nlgms6of/MOqm1r50EOCMWkdBtFmJISGzJjPSnWzuLMnkngq1sO2G
 bz88RwfPKvX626NtLtJMkkO1bS2emGjgIPJDkdthMTtswN5ElWZIp7nefq6LqQfugehy
 lrKh44qt90vB5M30kmvSoHMQu9YaajEwKlpH5r4rS+y/TzumCD//ZXQIIgKyeuWgvfoU
 AQJ8CbkUI7y3cxIeit4Z6edRczGmnZLrme9gKAKDXgHjGEXkzSjdXFgEWZUSsCGIHg3m
 2bccNVbU2/NMSSh1iOxbkjBI8RiTrTBlspj2+YqybAXDISZlXBiajHQq3GiAU2Qxng5r
 +Mwg==
MIME-Version: 1.0
X-Received: by 10.52.119.198 with SMTP id kw6mr97706vdb.47.1383112070199; Tue,
 29 Oct 2013 22:47:50 -0700 (PDT)
Received: by 10.220.155.148 with HTTP; Tue, 29 Oct 2013 22:47:50 -0700 (PDT)
In-Reply-To: <5270309E.5090403@FreeBSD.org>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
Date: Tue, 29 Oct 2013 22:47:50 -0700
Message-ID: <CAFOYbcm44v4yP4v05DiHURePsHH=SYJexdUAt0MsQZtu6RTVMA@mail.gmail.com>
Subject: Re: MQ Patch.
From: Jack Vogel <jfvogel@gmail.com>
To: Navdeep Parhar <np@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: Luigi Rizzo <rizzo@iet.unipi.it>, Andre Oppermann <andre@freebsd.org>,
 Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 05:47:52 -0000

I find myself agreeing with Navdeep. What Windows does might provide a hint
(my god, did I say that :)): the driver provides hints to the kernel, but
it's from "above" that the ultimate decisions are made, based on what the
hardware hints are. So it's not either/or, it's both/and...

Jack


On Tue, Oct 29, 2013 at 3:03 PM, Navdeep Parhar <np@freebsd.org> wrote:

> On 10/29/13 14:25, Andre Oppermann wrote:
> > On 29.10.2013 22:03, Navdeep Parhar wrote:
> >> On 10/29/13 13:41, Andre Oppermann wrote:
> >>> Let me jump in here and explain roughly the ideas/path I'm exploring
> >>> in creating and eventually implementing a big picture for drivers,
> >>> queues, queue management, various QoS and so on:
> >>>
> >>> Situation: We're still mostly based on the old 4.4BSD IFQ model with
> >>> a couple of work-arounds (sndring, drbr) and the bit-rotten ALTQ we
> >>> have in tree aren't helpful at all.
> >>>
> >>> Steps:
> >>>
> >>> 1. take the soft-queuing method out of the ifnet layer and make it
> >>>     a property of the driver, so that the upper stack (or actually
> >>>     protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit)
> >>>     without any queuing at that point.  It then is up to the driver
> >>>     to decide how it multiplexes multi-core access to its queue(s)
> >>>     and how they are configured.
> >>
> >> It would work out much better if the kernel was aware of the number of
> >> tx queues of a multiq driver and explicitly selected one in if_transmit.
> >>   The driver has no information on the CPU affinity etc. of the
> >> applications generating the traffic; the kernel does.  In general, the
> >> kernel has a much better "global view" of the system and some of the
> >> stuff currently in the drivers really should move up into the stack.
> >
> > I've been thinking a lot about this and come to the preliminary conclusion
> > that the upper stack should not tell the driver which queue to use.  There
> > are way too many possible and, depending on the use-case, better or worse
> > performing approaches.  Also we have a big problem with cores vs. queues
> > mismatches either way (more cores than queues or more queues than cores,
> > though the latter is much less of a problem).
> >
> > For now I see these primary multi-hardware-queue approaches to be
> > implemented first:
> >
> > a) the driver's (*if_transmit) takes the flowid from the mbuf header and
> >    selects one of the N hardware DMA rings based on it.  Each of the DMA
> >    rings is protected by a lock.  Here the assumption is that by having
> >    enough DMA rings the contention on each of them will be relatively low
> >    and ideally a flow and ring sort of sticks to a core that sends lots
> >    of packets into that flow.  Of course it is a statistical certainty that
> >    some bouncing will be going on.
> >
> > b) the driver assigns the DMA rings to particular cores which by that,
> >    through a critnest++, can drive them lockless.  The driver's
> >    (*if_transmit) will look up the core it got called on and push the
> >    traffic out on that DMA ring.  The problem is the actual upper stack's
> >    affinity, which is not guaranteed.  This has two consequences: there
> >    may be reordering of packets of the same flow because the protocol's
> >    send function happens to be called from a different core the second
> >    time.  Or the driver's (*if_transmit) has to switch to the right core
> >    to complete the transmit for this flow if the upper stack
> >    migrated/bounced around.  It is rather difficult to assure full
> >    affinity from userspace down through the upper stack and then to
> >    the driver.
> >
> > c) non-multi-queue capable hardware uses a kernel provided set of
> >    functions to manage the contention for the single resource of a DMA
> >    ring.
> >
> > The point here is that the driver is the right place to make these
> > decisions because the upper stack lacks (and shouldn't care about) the
> > actual available hardware and its capabilities.  All necessary
> > information is available to the driver as well through the appropriate
> > mbuf header fields and the core it is called on.
> >
>
> I mildly disagree with most of this, specifically with the part that the
> driver is the right place to make these decisions.  But you did say this
> was a "preliminary conclusion" so there's hope yet ;-)
>
> Let's wait till you have an early implementation and we are all able to
> experiment with it.  To be continued...
>
> Regards,
> Navdeep
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>

From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 06:41:13 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id EE720F6B;
 Wed, 30 Oct 2013 06:41:13 +0000 (UTC)
 (envelope-from jmg@h2.funkthat.com)
Received: from h2.funkthat.com (gate2.funkthat.com [208.87.223.18])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id AB73D27E2;
 Wed, 30 Oct 2013 06:41:13 +0000 (UTC)
Received: from h2.funkthat.com (localhost [127.0.0.1])
 by h2.funkthat.com (8.14.3/8.14.3) with ESMTP id r9U6f6WC024909
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Tue, 29 Oct 2013 23:41:07 -0700 (PDT)
 (envelope-from jmg@h2.funkthat.com)
Received: (from jmg@localhost)
 by h2.funkthat.com (8.14.3/8.14.3/Submit) id r9U6f502024907;
 Tue, 29 Oct 2013 23:41:05 -0700 (PDT) (envelope-from jmg)
Date: Tue, 29 Oct 2013 23:41:05 -0700
From: John-Mark Gurney <jmg@funkthat.com>
To: Andre Oppermann <andre@freebsd.org>
Subject: Re: MQ Patch.
Message-ID: <20131030064105.GV58155@funkthat.com>
Mail-Followup-To: Andre Oppermann <andre@freebsd.org>,
 Navdeep Parhar <np@freebsd.org>, Luigi Rizzo <rizzo@iet.unipi.it>,
 Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <527027CE.5040806@freebsd.org>
User-Agent: Mutt/1.4.2.3i
X-Operating-System: FreeBSD 7.2-RELEASE i386
X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88  9322 9CB1 8F74 6D3F A396
X-Files: The truth is out there
X-URL: http://resnet.uoregon.edu/~gurney_j/
X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html
X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger?
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.2
 (h2.funkthat.com [127.0.0.1]); Tue, 29 Oct 2013 23:41:07 -0700 (PDT)
Cc: "freebsd-net@freebsd.org" <net@freebsd.org>,
 Luigi Rizzo <rizzo@iet.unipi.it>, Navdeep Parhar <np@freebsd.org>,
 Randall Stewart <rrs@lakerest.net>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 06:41:14 -0000

Andre Oppermann wrote this message on Tue, Oct 29, 2013 at 22:25 +0100:
> b) the driver assigns the DMA rings to particular cores which by that,
>    through a critnest++, can drive them lockless.  The driver's
>    (*if_transmit) will look up the core it got called on and push the
>    traffic out on that DMA ring.  The problem is the actual upper stack's
>    affinity, which is not guaranteed.  This has two consequences: there
>    may be reordering of packets of the same flow because the protocol's
>    send function happens to be called from a different core the second
>    time.  Or the driver's (*if_transmit) has to switch to the right core
>    to complete the transmit for this flow if the upper stack
>    migrated/bounced around.  It is rather difficult to assure full
>    affinity from userspace down through the upper stack and then to
>    the driver.

I'll point you to the paper:
http://arxiv.org/abs/1106.0443

Please don't reorder packets.

Binding TX queues to cores does not seem very useful. Sure, you can do a
lockless implementation, but is running the scheduler to change CPUs
really cheaper than paying the cost of migrating the lock?

I'll admit I haven't run benchmarks, but I doubt it.
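
To make the reordering concern concrete, a small sketch comparing the
two ring-selection rules. The function names and the plain modulo
mapping are assumptions for illustration, not driver code.

```c
#include <assert.h>
#include <stdint.h>

/* Approach (a): the ring depends only on the flow, not on the caller. */
static unsigned
ring_by_flowid(uint32_t flowid, unsigned nrings)
{
	assert(nrings > 0);
	return (flowid % nrings);
}

/* Approach (b): the ring depends on the CPU the sender happens to run on. */
static unsigned
ring_by_cpu(unsigned cpu, unsigned nrings)
{
	assert(nrings > 0);
	return (cpu % nrings);
}
```

With four rings, flow 0xdeadbeef always maps to ring 3 under rule (a),
whichever CPU sends it. Under rule (b) the same flow uses ring 0 when
sent from CPU 0 and ring 1 after the thread migrates to CPU 1, so two
of its packets can sit in different rings and leave out of order.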

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."

From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 10:40:53 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id D8A1D2B7
 for <freebsd-net@freebsd.org>; Wed, 30 Oct 2013 10:40:53 +0000 (UTC)
 (envelope-from dyr@smartspb.net)
Received: from quix.smartspb.net (quix.smartspb.net [217.119.16.133])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 967EA25A7
 for <freebsd-net@freebsd.org>; Wed, 30 Oct 2013 10:40:53 +0000 (UTC)
Received: from dyr.smartspb.net ([217.119.16.26] helo=[127.0.0.1])
 by quix.smartspb.net with esmtpsa (TLSv1:AES256-SHA:256)
 (Exim 4.61 (FreeBSD)) (envelope-from <dyr@smartspb.net>)
 id 1VbTCk-000I97-Ub
 for freebsd-net@freebsd.org; Wed, 30 Oct 2013 14:40:51 +0400
Message-ID: <5270E22C.1060408@smartspb.net>
Date: Wed, 30 Oct 2013 14:40:44 +0400
From: Dennis Yusupoff <dyr@smartspb.net>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: freebsd-net@freebsd.org
Subject: [Feature Request] (ng_)netflow additional
X-Enigmail-Version: 1.6
X-Antivirus: avast! (VPS 131029-1, 30.10.2013), Outbound message
X-Antivirus-Status: Clean
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 10:40:53 -0000

Good day everyone.

To be brief:

1. It would be really useful for CGNAT providers to have the ability to
record customers' IPs in traffic before and after NAT, as is already done
in ipt_NETFLOW under Linux or in the Cisco ASA series.

=== begin of cut https://github.com/aabc/ipt-netflow/blob/master/README ===
natevents=1
     - Collect and send NAT translation events as NetFlow Event Logging (NEL)
       for NetFlow v9/IPFIX, or as dummy flows compatible with NetFlow v5.
       Default is 0 (don't send).

       For NetFlow v5 protocol meaning of fields in dummy flows is such:
         Src IP, Src Port is Pre-nat source address.
         Dst IP, Dst Port is Post-nat destination address.
           - These two fields made equal to data flows catched in FORWARD chain.
         Nexthop, Src AS is Post-nat source address for SNAT. Or,
         Nexthop, Dst AS is Pre-nat destination address for DNAT.
         TCP Flags is SYN+SCK for start event, RST+FIN for stop event.
         Pkt/Traffic size is 0 (zero), so it won't interfere with accounting.

=== end of cut ===
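
A sketch of how such a dummy v5 record could be filled for an SNAT
event, following the field mapping quoted above. Assumptions on my side:
the "address" in "Nexthop, Src AS" is read as IP in nexthop and port in
src_as, "SYN+SCK" is read as SYN+ACK, values are kept in host byte order
(conversion for the wire is left out), and the function name is made up.
This is not ng_netflow or ipt_NETFLOW code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* NetFlow v5 flow record layout, 48 bytes. */
struct nf5_record {
	uint32_t srcaddr, dstaddr, nexthop;
	uint16_t input, output;
	uint32_t dpkts, doctets, first, last;
	uint16_t srcport, dstport;
	uint8_t  pad1, tcp_flags, prot, tos;
	uint16_t src_as, dst_as;
	uint8_t  src_mask, dst_mask;
	uint16_t pad2;
};

#define NF_TH_FIN 0x01
#define NF_TH_SYN 0x02
#define NF_TH_RST 0x04
#define NF_TH_ACK 0x10

static void
nf5_fill_snat_event(struct nf5_record *r,
    uint32_t pre_src_ip, uint16_t pre_src_port,
    uint32_t dst_ip, uint16_t dst_port,
    uint32_t post_src_ip, uint16_t post_src_port,
    uint8_t proto, int is_start)
{
	assert(r != NULL);
	memset(r, 0, sizeof(*r));	/* dpkts and doctets stay 0 */
	r->srcaddr = pre_src_ip;	/* pre-NAT source */
	r->srcport = pre_src_port;
	r->dstaddr = dst_ip;		/* post-NAT destination */
	r->dstport = dst_port;
	r->nexthop = post_src_ip;	/* post-NAT source (assumed split) */
	r->src_as  = post_src_port;
	r->prot    = proto;
	r->tcp_flags = is_start ? (NF_TH_SYN | NF_TH_ACK)
				: (NF_TH_RST | NF_TH_FIN);
}
```

Because packet and byte counters stay at zero, a collector that sums
traffic per address is not disturbed by these event records.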

2. Is it possible for the user to specify some field in NetFlow v9, for
example /IF_DESC/ or /APPLICATION DESCRIPTION/, according to
http://www.cisco.com/en/US/technologies/tk648/tk362/technologies_white_paper09186a00800a3db9_ps6601_Products_White_Paper.html?
If not, it would be really nice to have. An example: a customer
requests another IP on an interface where we collect netflow traffic, so
when we have to produce a traffic report we have no *unique* identifier
in the netflow flows that could help. It's a real pity.

Thank you for your consideration!


-- 
Best regards,
Dennis Yusupoff,
network engineer of
Smart-Telecom ISP
Russia, Saint-Petersburg 


From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 11:44:25 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 744B33F3
 for <net@freebsd.org>; Wed, 30 Oct 2013 11:44:25 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id BDF5129AA
 for <net@freebsd.org>; Wed, 30 Oct 2013 11:44:24 +0000 (UTC)
Received: (qmail 61448 invoked from network); 30 Oct 2013 12:14:47 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <adrian@freebsd.org>; 30 Oct 2013 12:14:47 -0000
Message-ID: <5270F101.6020701@freebsd.org>
Date: Wed, 30 Oct 2013 12:44:01 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Adrian Chadd <adrian@freebsd.org>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>	<526FFED9.1070704@freebsd.org>	<CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>	<52701D8B.8050907@freebsd.org>	<527022AC.4030502@FreeBSD.org>	<527027CE.5040806@freebsd.org>	<5270309E.5090403@FreeBSD.org>	<5270462B.8050305@freebsd.org>
 <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
In-Reply-To: <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "freebsd-net@freebsd.org" <net@freebsd.org>,
 Luigi Rizzo <rizzo@iet.unipi.it>, Navdeep Parhar <np@freebsd.org>,
 Randall Stewart <rrs@lakerest.net>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 11:44:25 -0000

On 30.10.2013 02:43, Adrian Chadd wrote:
> Hi,

[Meta: following your replies is often difficult because you're omitting
context and citations]

> We can't assume the hardware has deep queues _and_ we can't just hand
> packets to the DMA engine.
>
> Why?
>
> Because once you've pushed it into the transmit ring, you can't
> guarantee / impose any ordering on things. You can't guarantee that
> you can abort a frame that has been queued because it now breaks the
> queue rules.
>
> That's why we don't want to just have a light wrapper around hardware
> transmit queues. We give up way too much useful control.

The stack can't possibly know about all these differences in current
and future technologies and requirements.  That's why this decision
should be pushed into the L3/L2 mapping/encapsulation and driver layer.

Only those actually know about the requirements and constraints of any
given technology.

For wired ethernet there isn't any control over a packet once it has
been inserted into the DMA ring and the packets are going to be processed
sequentially.  In that case the driver likely will choose a rather light
wrapper to protect concurrent access to the DMA ring.  An optimized
version of such a wrapper will be provided by the kernel for the driver
to link to.

For other kinds of interfaces a very different strategy may be chosen.
In your case with ieee80211 a more elaborate transmit scheme can be
implemented without having to hack the kernel.  In fact that's what
you already mostly do there with the frame fragmentation, priority and
retransmission code if I'm reading it correctly.  The only difference
in future being that the upper stack won't enforce any of the old IFQ,
bufring or drbr handoff on you.  You can choose one of the stock models
or develop your own specially optimized version.

> I've seen this both when doing wifi (where I absolutely have to have
> per-node, per-TID queues, far before it hits the hardware) and doing
> WAN style optimisation, where I want to ensure I only queue a handful
> of milliseconds of frames to the hardware so I can ensure I can hit
> QoS requirements (eg there being a large amount of bulk data, then I
> want to inject some voice traffic that should go out sooner..)

Sure.  The idea is to make it even easier for you to implement that
without having to work around anything above ifnet.

-- 
Andre


From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 11:51:10 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id B27ED5AB
 for <net@freebsd.org>; Wed, 30 Oct 2013 11:51:10 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 2283C2A4C
 for <net@freebsd.org>; Wed, 30 Oct 2013 11:51:09 +0000 (UTC)
Received: (qmail 61485 invoked from network); 30 Oct 2013 12:21:32 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <np@freebsd.org>; 30 Oct 2013 12:21:32 -0000
Message-ID: <5270F297.4090001@freebsd.org>
Date: Wed, 30 Oct 2013 12:50:47 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Navdeep Parhar <np@freebsd.org>, Luigi Rizzo <rizzo@iet.unipi.it>, 
 Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <20131030064105.GV58155@funkthat.com>
In-Reply-To: <20131030064105.GV58155@funkthat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 11:51:10 -0000

On 30.10.2013 07:41, John-Mark Gurney wrote:
> Andre Oppermann wrote this message on Tue, Oct 29, 2013 at 22:25 +0100:
>> b) the driver assigns the DMA rings to particular cores which by that,
>>    through a critnest++, can drive them lockless.  The driver's
>>    (*if_transmit) will look up the core it got called on and push the
>>    traffic out on that DMA ring.  The problem is the actual upper stack's
>>    affinity, which is not guaranteed.  This has two consequences: there
>>    may be reordering of packets of the same flow because the protocol's
>>    send function happens to be called from a different core the second
>>    time.  Or the driver's (*if_transmit) has to switch to the right core
>>    to complete the transmit for this flow if the upper stack
>>    migrated/bounced around.  It is rather difficult to assure full
>>    affinity from userspace down through the upper stack and then to
>>    the driver.
>
> I'll point you to the paper:
> http://arxiv.org/abs/1106.0443
>
> Please don't reorder packets.
>
> Binding TX queues to cores does not seem very useful. Sure, you can do a
> lockless implementation, but is running the scheduler to change CPUs
> really cheaper than paying the cost of migrating the lock?
>
> I'll admit I haven't run benchmarks, but I doubt it.

Don't worry.  My list was about the possible ways of dealing with it
and their constraints/disadvantages.  Packet reordering is one part of it
that pretty much makes approach b) non-viable as you correctly point out.

-- 
Andre


From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 14:14:36 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 4F6B9DC0
 for <freebsd-net@freebsd.org>; Wed, 30 Oct 2013 14:14:36 +0000 (UTC)
 (envelope-from julian@freebsd.org)
Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 08B5D24CD
 for <freebsd-net@freebsd.org>; Wed, 30 Oct 2013 14:14:35 +0000 (UTC)
Received: from jre-mbp.elischer.org
 (ppp121-45-246-96.lns20.per2.internode.on.net [121.45.246.96])
 (authenticated bits=0)
 by vps1.elischer.org (8.14.7/8.14.7) with ESMTP id r9UEEUkS023605
 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO);
 Wed, 30 Oct 2013 07:14:33 -0700 (PDT)
 (envelope-from julian@freebsd.org)
Message-ID: <52711440.5060405@freebsd.org>
Date: Wed, 30 Oct 2013 22:14:24 +0800
From: Julian Elischer <julian@freebsd.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9;
 rv:24.0) Gecko/20100101 Thunderbird/24.1.0
MIME-Version: 1.0
To: Dennis Yusupoff <dyr@smartspb.net>, freebsd-net@freebsd.org
Subject: Re: [Feature Request] (ng_)netflow additional
References: <5270E22C.1060408@smartspb.net>
In-Reply-To: <5270E22C.1060408@smartspb.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 14:14:36 -0000

On 10/30/13, 6:40 PM, Dennis Yusupoff wrote:
> Good day everyone.
>
> To be brief:
>
> 1. It would be really useful for CGNAT providers to have the ability to
> record customers' IPs in traffic before and after NAT, as is already done
> in ipt_NETFLOW under Linux or in the Cisco ASA series.
>
> === begin of cut https://github.com/aabc/ipt-netflow/blob/master/README ===
> natevents=1
>       - Collect and send NAT translation events as NetFlow Event Logging
> (NEL)
>         for NetFlow v9/IPFIX, or as dummy flows compatible with NetFlow v5.
>         Default is 0 (don't send).
>
>         For NetFlow v5 protocol meaning of fields in dummy flows is such:
>           Src IP, Src Port is Pre-nat source address.
>           Dst IP, Dst Port is Post-nat destination address.
>             - These two fields made equal to data flows catched in
> FORWARD chain.
>           Nexthop, Src AS is Post-nat source address for SNAT. Or,
>           Nexthop, Dst AS is Pre-nat destination address for DNAT.
>           TCP Flags is SYN+SCK for start event, RST+FIN for stop event.
>           Pkt/Traffic size is 0 (zero), so it won't interfere with
> accounting.
I think this would be very hard because the netflow module looks at
the packets in one place: either before or after NAT, but not
during, so the information is not available. We would have to add a
netflow source into the NAT code to do this (and then the other
netflow code would need to be turned off if NAT was on). But since
netgraph is like Lego, and no part of it knows about any other part of
it, it would be quite a challenge to work out how this could be done.
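
To make the quoted v5 mapping concrete, here is a rough userland sketch of how one SNAT event could be packed into a NetFlow v5 flow record. It only illustrates the field assignment described in the ipt-netflow README above. The names nf5_record and nf5_fill_snat_event() are made up for this note, nothing like this exists in ng_netflow, the fields are kept in host byte order for brevity (a real exporter would send network order), and the README's "SYN+SCK" is assumed to mean SYN+ACK.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* TCP flag bits as carried in the v5 tcp_flags octet. */
#define NF_TCP_FIN 0x01
#define NF_TCP_SYN 0x02
#define NF_TCP_RST 0x04
#define NF_TCP_ACK 0x10

/* NetFlow v5 flow record layout, 48 bytes. */
struct nf5_record {
	uint32_t src_addr;
	uint32_t dst_addr;
	uint32_t next_hop;
	uint16_t if_in;
	uint16_t if_out;
	uint32_t packets;
	uint32_t octets;
	uint32_t first;
	uint32_t last;
	uint16_t src_port;
	uint16_t dst_port;
	uint8_t  pad1;
	uint8_t  tcp_flags;
	uint8_t  proto;
	uint8_t  tos;
	uint16_t src_as;
	uint16_t dst_as;
	uint8_t  src_mask;
	uint8_t  dst_mask;
	uint16_t pad2;
};

/*
 * Hypothetical helper: fill a zero-traffic "dummy" record for one SNAT
 * event.  The pre-NAT source goes into src_addr/src_port, the post-NAT
 * source is smuggled into next_hop/src_as, and the packet/octet counters
 * stay zero so the record does not disturb accounting.
 */
static void
nf5_fill_snat_event(struct nf5_record *r,
    uint32_t pre_src_ip, uint16_t pre_src_port,
    uint32_t post_src_ip, uint16_t post_src_port,
    uint32_t dst_ip, uint16_t dst_port, uint8_t proto, int start)
{
	assert(r != NULL);
	memset(r, 0, sizeof(*r));
	r->src_addr = pre_src_ip;
	r->src_port = pre_src_port;
	r->dst_addr = dst_ip;
	r->dst_port = dst_port;
	r->next_hop = post_src_ip;	/* post-NAT source address */
	r->src_as = post_src_port;	/* post-NAT source port */
	r->proto = proto;
	/* Assumed reading of the README: SYN+ACK = start, RST+FIN = stop. */
	r->tcp_flags = start ? (NF_TCP_SYN | NF_TCP_ACK) :
	    (NF_TCP_RST | NF_TCP_FIN);
}
```

The hard part described above remains: something inside the NAT code would have to call such a helper, because only the NAT code sees both the pre- and post-translation addresses.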

> === end of cut ===
>
> 2. Is it possible for the user to specify some field in Netflow v9, for
> example /IF_DESC/ or /APPLICATION DESCRIPTION/, according to
> http://www.cisco.com/en/US/technologies/tk648/tk362/technologies_white_paper09186a00800a3db9_ps6601_Products_White_Paper.html?
> If not, it would be really nice to have. An example: a customer
> requests another IP on an interface where we collect netflow traffic, so
> when we have to produce a traffic report we have no *unique* identifier
> in the netflow flows to help. It's a real pity.
I leave this to the people who know more about netflow...

> Thank you for your consideration!
>
>


From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 16:10:03 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@smarthost.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id DCA622BB
 for <freebsd-net@smarthost.ysv.freebsd.org>;
 Wed, 30 Oct 2013 16:10:02 +0000 (UTC)
 (envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id BC94F2D82
 for <freebsd-net@smarthost.ysv.freebsd.org>;
 Wed, 30 Oct 2013 16:10:02 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r9UGA2HP037946
 for <freebsd-net@freefall.freebsd.org>; Wed, 30 Oct 2013 16:10:02 GMT
 (envelope-from gnats@freefall.freebsd.org)
Received: (from gnats@localhost)
 by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r9UGA2EE037945;
 Wed, 30 Oct 2013 16:10:02 GMT (envelope-from gnats)
Date: Wed, 30 Oct 2013 16:10:02 GMT
Message-Id: <201310301610.r9UGA2EE037945@freefall.freebsd.org>
To: freebsd-net@FreeBSD.org
Cc: 
From: dfilter@FreeBSD.ORG (dfilter service)
Subject: Re: kern/134531: commit references a PR
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: dfilter service <dfilter@FreeBSD.ORG>
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 16:10:03 -0000

The following reply was made to PR kern/134531; it has been noted by GNATS.

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/134531: commit references a PR
Date: Wed, 30 Oct 2013 16:08:42 +0000 (UTC)

 Author: melifaro
 Date: Wed Oct 30 16:08:27 2013
 New Revision: 257389
 URL: http://svnweb.freebsd.org/changeset/base/257389
 
 Log:
   MFC r256624:
   
   Fix long-standing issue with incorrect radix mask calculation.
   
   Usual symptoms are messages like
   rn_delete: inconsistent annotation
   rn_addmask: mask impossibly already in tree
   routing daemon constantly deleting IPv6 default route
   or inability to flush/delete particular prefix in ipfw table.
   
   Changes:
   * Assume 32 bytes as maximum radix key length
   * Remove rn_init()
   * Statically allocate rn_ones/rn_zeroes
   * Make separate mask tree for each "normal" tree instead of system
   global one
   * Remove "optimization" on masks reusage and key zeroying
   * Change rn_addmask() arguments to accept tree pointer (no users in base)
   
   MFC changes:
   * keep rn_init()
   * create global mask tree, protected with mutex, for old rn_addmask
   users (currently 0 in base)
   * Add new rn_addmask_r() function (rn_addmask in head) with additional
   argument to accept tree pointer
   
   PR:		kern/182851, kern/169206, kern/135476, kern/134531
   Found by:	Slawa Olhovchenkov <slw@zxy.spb.ru>
   Reviewed by:	glebius (previous versions)
   Sponsored by:	Yandex LLC
 
 Modified:
   stable/9/sys/net/radix.c
   stable/9/sys/net/radix.h
 Directory Properties:
   stable/9/sys/   (props changed)
   stable/9/sys/net/   (props changed)
 
 Modified: stable/9/sys/net/radix.c
 ==============================================================================
 --- stable/9/sys/net/radix.c	Wed Oct 30 15:46:50 2013	(r257388)
 +++ stable/9/sys/net/radix.c	Wed Oct 30 16:08:27 2013	(r257389)
 @@ -66,27 +66,27 @@ static struct radix_node
  	 *rn_search(void *, struct radix_node *),
  	 *rn_search_m(void *, struct radix_node *, void *);
  
 -static int	max_keylen;
 -static struct radix_mask *rn_mkfreelist;
 -static struct radix_node_head *mask_rnhead;
 +static void rn_detachhead_internal(void **head);
 +static int rn_inithead_internal(void **head, int off);
 +
 +#define	RADIX_MAX_KEY_LEN	32
 +
 +static char rn_zeros[RADIX_MAX_KEY_LEN];
 +static char rn_ones[RADIX_MAX_KEY_LEN] = {
 +	-1, -1, -1, -1, -1, -1, -1, -1,
 +	-1, -1, -1, -1, -1, -1, -1, -1,
 +	-1, -1, -1, -1, -1, -1, -1, -1,
 +	-1, -1, -1, -1, -1, -1, -1, -1,
 +};
 +
  /*
 - * Work area -- the following point to 3 buffers of size max_keylen,
 - * allocated in this order in a block of memory malloc'ed by rn_init.
 - * rn_zeros, rn_ones are set in rn_init and used in readonly afterwards.
 - * addmask_key is used in rn_addmask in rw mode and not thread-safe.
 + * XXX: Compat stuff for old rn_addmask() users
   */
 -static char *rn_zeros, *rn_ones, *addmask_key;
 -
 -#define MKGet(m) {						\
 -	if (rn_mkfreelist) {					\
 -		m = rn_mkfreelist;				\
 -		rn_mkfreelist = (m)->rm_mklist;			\
 -	} else							\
 -		R_Malloc(m, struct radix_mask *, sizeof (struct radix_mask)); }
 - 
 -#define MKFree(m) { (m)->rm_mklist = rn_mkfreelist; rn_mkfreelist = (m);}
 +static struct radix_node_head *mask_rnhead_compat;
 +#ifdef	_KERNEL
 +static struct mtx mask_mtx;
 +#endif
  
 -#define rn_masktop (mask_rnhead->rnh_treetop)
  
  static int	rn_lexobetter(void *m_arg, void *n_arg);
  static struct radix_mask *
 @@ -230,7 +230,8 @@ rn_lookup(v_arg, m_arg, head)
  	caddr_t netmask = 0;
  
  	if (m_arg) {
 -		x = rn_addmask(m_arg, 1, head->rnh_treetop->rn_offset);
 +		x = rn_addmask_r(m_arg, head->rnh_masks, 1,
 +		    head->rnh_treetop->rn_offset);
  		if (x == 0)
  			return (0);
  		netmask = x->rn_key;
 @@ -489,53 +490,47 @@ on1:
  }
  
  struct radix_node *
 -rn_addmask(n_arg, search, skip)
 -	int search, skip;
 -	void *n_arg;
 +rn_addmask_r(void *arg, struct radix_node_head *maskhead, int search, int skip)
  {
 -	caddr_t netmask = (caddr_t)n_arg;
 +	caddr_t netmask = (caddr_t)arg;
  	register struct radix_node *x;
  	register caddr_t cp, cplim;
  	register int b = 0, mlen, j;
 -	int maskduplicated, m0, isnormal;
 +	int maskduplicated, isnormal;
  	struct radix_node *saved_x;
 -	static int last_zeroed = 0;
 +	char addmask_key[RADIX_MAX_KEY_LEN];
  
 -	if ((mlen = LEN(netmask)) > max_keylen)
 -		mlen = max_keylen;
 +	if ((mlen = LEN(netmask)) > RADIX_MAX_KEY_LEN)
 +		mlen = RADIX_MAX_KEY_LEN;
  	if (skip == 0)
  		skip = 1;
  	if (mlen <= skip)
 -		return (mask_rnhead->rnh_nodes);
 +		return (maskhead->rnh_nodes);
 +
 +	bzero(addmask_key, RADIX_MAX_KEY_LEN);
  	if (skip > 1)
  		bcopy(rn_ones + 1, addmask_key + 1, skip - 1);
 -	if ((m0 = mlen) > skip)
 -		bcopy(netmask + skip, addmask_key + skip, mlen - skip);
 +	bcopy(netmask + skip, addmask_key + skip, mlen - skip);
  	/*
  	 * Trim trailing zeroes.
  	 */
  	for (cp = addmask_key + mlen; (cp > addmask_key) && cp[-1] == 0;)
  		cp--;
  	mlen = cp - addmask_key;
 -	if (mlen <= skip) {
 -		if (m0 >= last_zeroed)
 -			last_zeroed = mlen;
 -		return (mask_rnhead->rnh_nodes);
 -	}
 -	if (m0 < last_zeroed)
 -		bzero(addmask_key + m0, last_zeroed - m0);
 -	*addmask_key = last_zeroed = mlen;
 -	x = rn_search(addmask_key, rn_masktop);
 +	if (mlen <= skip)
 +		return (maskhead->rnh_nodes);
 +	*addmask_key = mlen;
 +	x = rn_search(addmask_key, maskhead->rnh_treetop);
  	if (bcmp(addmask_key, x->rn_key, mlen) != 0)
  		x = 0;
  	if (x || search)
  		return (x);
 -	R_Zalloc(x, struct radix_node *, max_keylen + 2 * sizeof (*x));
 +	R_Zalloc(x, struct radix_node *, RADIX_MAX_KEY_LEN + 2 * sizeof (*x));
  	if ((saved_x = x) == 0)
  		return (0);
  	netmask = cp = (caddr_t)(x + 2);
  	bcopy(addmask_key, cp, mlen);
 -	x = rn_insert(cp, mask_rnhead, &maskduplicated, x);
 +	x = rn_insert(cp, maskhead, &maskduplicated, x);
  	if (maskduplicated) {
  		log(LOG_ERR, "rn_addmask: mask impossibly already in tree");
  		Free(saved_x);
 @@ -568,6 +563,23 @@ rn_addmask(n_arg, search, skip)
  	return (x);
  }
  
 +struct radix_node *
 +rn_addmask(void *n_arg, int search, int skip)
 +{
 +	struct radix_node *tt;
 +
 +#ifdef _KERNEL
 +	mtx_lock(&mask_mtx);
 +#endif
 +	tt = rn_addmask_r(n_arg, mask_rnhead_compat, search, skip);
 +
 +#ifdef _KERNEL
 +	mtx_unlock(&mask_mtx);
 +#endif
 +
 +	return (tt);
 +}
 +
  static int	/* XXX: arbitrary ordering for non-contiguous masks */
  rn_lexobetter(m_arg, n_arg)
  	void *m_arg, *n_arg;
 @@ -590,12 +602,12 @@ rn_new_radix_mask(tt, next)
  {
  	register struct radix_mask *m;
  
 -	MKGet(m);
 +	R_Malloc(m, struct radix_mask *, sizeof (struct radix_mask));
  	if (m == 0) {
 -		log(LOG_ERR, "Mask for route not entered\n");
 +		log(LOG_ERR, "Failed to allocate route mask\n");
  		return (0);
  	}
 -	bzero(m, sizeof *m);
 +	bzero(m, sizeof(*m));
  	m->rm_bit = tt->rn_bit;
  	m->rm_flags = tt->rn_flags;
  	if (tt->rn_flags & RNF_NORMAL)
 @@ -629,7 +641,8 @@ rn_addroute(v_arg, n_arg, head, treenode
  	 * nodes and possibly save time in calculating indices.
  	 */
  	if (netmask)  {
 -		if ((x = rn_addmask(netmask, 0, top->rn_offset)) == 0)
 +		x = rn_addmask_r(netmask, head->rnh_masks, 0, top->rn_offset);
 +		if (x == NULL)
  			return (0);
  		b_leaf = x->rn_bit;
  		b = -1 - x->rn_bit;
 @@ -808,7 +821,8 @@ rn_delete(v_arg, netmask_arg, head)
  	 * Delete our route from mask lists.
  	 */
  	if (netmask) {
 -		if ((x = rn_addmask(netmask, 1, head_off)) == 0)
 +		x = rn_addmask_r(netmask, head->rnh_masks, 1, head_off);
 +		if (x == NULL)
  			return (0);
  		netmask = x->rn_key;
  		while (tt->rn_mask != netmask)
 @@ -841,7 +855,7 @@ rn_delete(v_arg, netmask_arg, head)
  	for (mp = &x->rn_mklist; (m = *mp); mp = &m->rm_mklist)
  		if (m == saved_m) {
  			*mp = m->rm_mklist;
 -			MKFree(m);
 +			Free(m);
  			break;
  		}
  	if (m == 0) {
 @@ -932,7 +946,7 @@ on1:
  					struct radix_mask *mm = m->rm_mklist;
  					x->rn_mklist = 0;
  					if (--(m->rm_refs) < 0)
 -						MKFree(m);
 +						Free(m);
  					m = mm;
  				}
  			if (m)
 @@ -1128,10 +1142,8 @@ rn_walktree(h, f, w)
   * bits starting at 'off'.
   * Return 1 on success, 0 on error.
   */
 -int
 -rn_inithead(head, off)
 -	void **head;
 -	int off;
 +static int
 +rn_inithead_internal(void **head, int off)
  {
  	register struct radix_node_head *rnh;
  	register struct radix_node *t, *tt, *ttt;
 @@ -1163,8 +1175,8 @@ rn_inithead(head, off)
  	return (1);
  }
  
 -int
 -rn_detachhead(void **head)
 +static void
 +rn_detachhead_internal(void **head)
  {
  	struct radix_node_head *rnh;
  
 @@ -1176,28 +1188,60 @@ rn_detachhead(void **head)
  	Free(rnh);
  
  	*head = NULL;
 +}
 +
 +int
 +rn_inithead(void **head, int off)
 +{
 +	struct radix_node_head *rnh;
 +
 +	if (*head != NULL)
 +		return (1);
 +
 +	if (rn_inithead_internal(head, off) == 0)
 +		return (0);
 +
 +	rnh = (struct radix_node_head *)(*head);
 +
 +	if (rn_inithead_internal((void **)&rnh->rnh_masks, 0) == 0) {
 +		rn_detachhead_internal(head);
 +		return (0);
 +	}
 +
 +	return (1);
 +}
 +
 +int
 +rn_detachhead(void **head)
 +{
 +	struct radix_node_head *rnh;
 +
 +	KASSERT((head != NULL && *head != NULL),
 +	    ("%s: head already freed", __func__));
 +
 +	rnh = *head;
 +
 +	rn_detachhead_internal((void **)&rnh->rnh_masks);
 +	rn_detachhead_internal(head);
  	return (1);
  }
  
  void
  rn_init(int maxk)
  {
 -	char *cp, *cplim;
 -
 -	max_keylen = maxk;
 -	if (max_keylen == 0) {
 +	if ((maxk <= 0) || (maxk > RADIX_MAX_KEY_LEN)) {
  		log(LOG_ERR,
 -		    "rn_init: radix functions require max_keylen be set\n");
 +		    "rn_init: max_keylen must be within 1..%d\n",
 +		    RADIX_MAX_KEY_LEN);
  		return;
  	}
 -	R_Malloc(rn_zeros, char *, 3 * max_keylen);
 -	if (rn_zeros == NULL)
 -		panic("rn_init");
 -	bzero(rn_zeros, 3 * max_keylen);
 -	rn_ones = cp = rn_zeros + max_keylen;
 -	addmask_key = cplim = rn_ones + max_keylen;
 -	while (cp < cplim)
 -		*cp++ = -1;
 -	if (rn_inithead((void **)(void *)&mask_rnhead, 0) == 0)
 +
 +	/*
 +	 * XXX: Compat for old rn_addmask() users
 +	 */
 +	if (rn_inithead((void **)(void *)&mask_rnhead_compat, 0) == 0)
  		panic("rn_init 2");
 +#ifdef _KERNEL
 +	mtx_init(&mask_mtx, "radix_mask", NULL, MTX_DEF);
 +#endif
  }
 
 Modified: stable/9/sys/net/radix.h
 ==============================================================================
 --- stable/9/sys/net/radix.h	Wed Oct 30 15:46:50 2013	(r257388)
 +++ stable/9/sys/net/radix.h	Wed Oct 30 16:08:27 2013	(r257389)
 @@ -136,6 +136,7 @@ struct radix_node_head {
  #ifdef _KERNEL
  	struct	rwlock rnh_lock;		/* locks entire radix tree */
  #endif
 +	struct	radix_node_head *rnh_masks;	/* Storage for our masks */
  };
  
  #ifndef _KERNEL
 @@ -167,6 +168,7 @@ int	 rn_detachhead(void **);
  int	 rn_refines(void *, void *);
  struct radix_node
  	 *rn_addmask(void *, int, int),
 +	 *rn_addmask_r(void *, struct radix_node_head *, int, int),
  	 *rn_addroute (void *, void *, struct radix_node_head *,
  			struct radix_node [2]),
  	 *rn_delete(void *, void *, struct radix_node_head *),
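
 The key-building step at the top of rn_addmask_r() can be tried in
 userland.  The following is a minimal sketch of that step only, assuming
 the same conventions as the diff above (byte 0 of a mask is its length,
 bytes before 'skip' are forced to all-ones, trailing zero bytes are
 trimmed).  normalize_mask() and MAX_KEY_LEN are illustrative names, not
 kernel API.

```c
#include <assert.h>
#include <string.h>

#define MAX_KEY_LEN 32	/* mirrors RADIX_MAX_KEY_LEN in the diff */

/*
 * Build a normalized mask key in 'key' from the raw mask 'mask'.
 * Returns the normalized length, or 0 when nothing but zeroes remains
 * past 'skip' (the kernel returns the tree's default node in that case).
 */
static int
normalize_mask(const unsigned char *mask, int skip,
    unsigned char key[MAX_KEY_LEN])
{
	int mlen;

	assert(mask != NULL && key != NULL);
	mlen = mask[0];
	if (mlen > MAX_KEY_LEN)
		mlen = MAX_KEY_LEN;
	if (skip == 0)
		skip = 1;

	/* A fresh zeroed buffer per call: no state leaks between calls. */
	memset(key, 0, MAX_KEY_LEN);
	if (mlen <= skip)
		return (0);
	if (skip > 1)
		memset(key + 1, 0xff, skip - 1);
	memcpy(key + skip, mask + skip, mlen - skip);

	/* Trim trailing zeroes. */
	while (mlen > 0 && key[mlen - 1] == 0)
		mlen--;
	if (mlen <= skip)
		return (0);
	key[0] = (unsigned char)mlen;
	return (mlen);
}
```

 With skip = 4, a /24 mask supplied with length byte 16 and the same mask
 supplied with length byte 8 both normalize to the same 7-byte key, which
 is the property the mask tree lookup relies on.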
 

From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 17:48:32 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id EB720F4D;
 Wed, 30 Oct 2013 17:48:32 +0000 (UTC)
 (envelope-from adrian.chadd@gmail.com)
Received: from mail-qe0-x232.google.com (mail-qe0-x232.google.com
 [IPv6:2607:f8b0:400d:c02::232])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 8690F259A;
 Wed, 30 Oct 2013 17:48:32 +0000 (UTC)
Received: by mail-qe0-f50.google.com with SMTP id 1so1043614qee.37
 for <multiple recipients>; Wed, 30 Oct 2013 10:48:31 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type;
 bh=zXkuCpJ6VkRxJpBShr9XM63P1+xOmVxiX2y3yJCdXcE=;
 b=zt+iL6mD3emcJmGfqsKFTJK6HTrqfbfGotsKUlBU1yqzuK6u5qk8UF3Mx/Uzsah17c
 vXP2/Cugblb43Ypkl9I80KCJfpPAOf1kYJAqxV/eXpWwpVTk3YaHShVnuTptBcXcjFGL
 w9ztQvHKy04r2D3dOPAPM/48Svxt/Gg8x/52yQoDv3f6+wA1elINachcmgykmHF7PXdi
 PYVyFoD9DzV9GPN0qav8+XlqBguHz5IDPBMSrZreFc90pMeGIkjWFApp+ngjEHUgmRa0
 ON/kFOOryfBjOOJNF/dlTGBA9OARGyp45tsKrD8qUOwgKrk4W/s/GdvHRFwcDMm3dKm2
 35+g==
MIME-Version: 1.0
X-Received: by 10.49.59.115 with SMTP id y19mr8596891qeq.8.1383155311679; Wed,
 30 Oct 2013 10:48:31 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.224.207.66 with HTTP; Wed, 30 Oct 2013 10:48:31 -0700 (PDT)
In-Reply-To: <5270F101.6020701@freebsd.org>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
 <5270462B.8050305@freebsd.org>
 <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
 <5270F101.6020701@freebsd.org>
Date: Wed, 30 Oct 2013 10:48:31 -0700
X-Google-Sender-Auth: ERXLSL7s9c9TbRE1KgK-ujhtSl4
Message-ID: <CAJ-VmonW=LQ32_XNP0GnQ=gehLO0Lf8APPHF5jpT-SjRGSw7MQ@mail.gmail.com>
Subject: Re: MQ Patch.
From: Adrian Chadd <adrian@freebsd.org>
To: Andre Oppermann <andre@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
Cc: "freebsd-net@freebsd.org" <net@freebsd.org>,
 Luigi Rizzo <rizzo@iet.unipi.it>, Navdeep Parhar <np@freebsd.org>,
 Randall Stewart <rrs@lakerest.net>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 17:48:33 -0000

On 30 October 2013 04:44, Andre Oppermann <andre@freebsd.org> wrote:

>> We can't assume the hardware has deep queues _and_ we can't just hand
>> packets to the DMA engine.
>
>>
>>
>> Why?
>>
>> Because once you've pushed it into the transmit ring, you can't
>> guarantee / impose any ordering on things. You can't guarantee that
>> you can abort a frame that has been queued because it now breaks the
>> queue rules.
>>
>> That's why we don't want to just have a light wrapper around hardware
>> transmit queues. We give up way too much useful control.
>
>
> The stack can't possibly know about all these differences in current
> and future technologies and requirements.  That's why this decision
> should be pushed into the L3/L2 mapping/encapsulation and driver layer.

That's why you split it.

You allow the upper layers (things like altq) to track things like
per-IP, per-traffic-class traffic and tag things appropriately.

You then let some software queue implement the queue discipline and
only drain frames to the hardware at a rate that's fast enough to keep
up with the hardware, and no faster.

Why?

Because if new traffic comes along from a new client, it may
be higher priority than the traffic already queued to the hardware, yet
be at the same QoS level as what's currently queued to the hardware, or
map to the same physical queue.

So yes, we do need that split for a lot of cases. There will be
bare-metal cases that need very low latency, but if we implement the
correct queue API here it'll just collapse down to either NULL, or
just the existing software queue in front of the DMA rings to avoid
locking overhead.

Thanks,


-adrian

From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 21:24:31 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id BE5A42F2
 for <net@freebsd.org>; Wed, 30 Oct 2013 21:24:31 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 22BD824CA
 for <net@freebsd.org>; Wed, 30 Oct 2013 21:24:30 +0000 (UTC)
Received: (qmail 64106 invoked from network); 30 Oct 2013 21:54:49 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <adrian@freebsd.org>; 30 Oct 2013 21:54:49 -0000
Message-ID: <527178F7.1070800@freebsd.org>
Date: Wed, 30 Oct 2013 22:24:07 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Adrian Chadd <adrian@freebsd.org>
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>	<526FFED9.1070704@freebsd.org>	<CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>	<52701D8B.8050907@freebsd.org>	<527022AC.4030502@FreeBSD.org>	<527027CE.5040806@freebsd.org>	<5270309E.5090403@FreeBSD.org>	<5270462B.8050305@freebsd.org>	<CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>	<5270F101.6020701@freebsd.org>
 <CAJ-VmonW=LQ32_XNP0GnQ=gehLO0Lf8APPHF5jpT-SjRGSw7MQ@mail.gmail.com>
In-Reply-To: <CAJ-VmonW=LQ32_XNP0GnQ=gehLO0Lf8APPHF5jpT-SjRGSw7MQ@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "freebsd-net@freebsd.org" <net@freebsd.org>,
 Luigi Rizzo <rizzo@iet.unipi.it>, Navdeep Parhar <np@freebsd.org>,
 Randall Stewart <rrs@lakerest.net>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 21:24:31 -0000

On 30.10.2013 18:48, Adrian Chadd wrote:
> On 30 October 2013 04:44, Andre Oppermann <andre@freebsd.org> wrote:
>
>>> We can't assume the hardware has deep queues _and_ we can't just hand
>>> packets to the DMA engine.
>>
>>>
>>>
>>> Why?
>>>
>>> Because once you've pushed it into the transmit ring, you can't
>>> guarantee / impose any ordering on things. You can't guarantee that
>>> you can abort a frame that has been queued because it now breaks the
>>> queue rules.
>>>
>>> That's why we don't want to just have a light wrapper around hardware
>>> transmit queues. We give up way too much useful control.
>>
>>
>> The stack can't possibly know about all these differences in current
>> and future technologies and requirements.  That's why this decision
>> should be pushed into the L3/L2 mapping/encapsulation and driver layer.
>
> That's why you split it.
>
> You allow the upper layers (things like altq) to track things like
> per-IP, per-traffic-class traffic and tag things appropriately.

Any QoS scheme is split into two distinct steps: a) the classifier;
b) the queuing and packet scheduler.

The classification is totally taken out of ifnet/IFQ* and done a) through
a packet filter (ipfw, pf, ipf); b) taken from the PCB if the packet is
locally generated; or c) on ingress, from a packet's VLAN or IP header.  The
last, for example, is typically done in MPLS networks, where classification
only happens at the edges, and is the way all brand-name routers work, with
the option of doing a) as well.

The queuing and scheduling happens after L3/L2 mapping/encapsulation and
before the packets are put onto the DMA ring.  Please note that this is
somewhat independent of additional pre-DMA queuing as in ieee80211 and
comes before it.

> You then let some software queue implement the queue discipline and
> only drain frames to the hardware at a rate that's fast enough to keep
> up with the hardware, and no faster.

For a QoS queue/scheduler to be fully effective the DMA ring should be
as small as reasonable to keep the interface busy, but not more.  All
queuing then happens in software with appropriately sized queues.

> Why?
>
> Because if new traffic comes along from a new client, it may
> be higher priority than the traffic already queued to the hardware, yet
> be at the same QoS level as what's currently queued to the hardware, or
> map to the same physical queue.

When a packet has been handed to the DMA ring there's no stopping it
anymore and the order is fixed.  That's why in a QoS setup the DMA
ring should be as small as it can be to barely keep the interface
busy.  Everything else happens in software and is subject to packet
scheduler decisions.  If a higher priority packet arrives before the
next packet scheduler run it will be dequeued first (subject to WFQ
or other fair scheduling disciplines to prevent total starvation).
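
As a toy illustration of that point, here is a userland sketch of a software scheduler in front of a deliberately tiny DMA ring. It is not driver code: all names (swq, txring, sched_fill_ring() and so on) are invented for this note, packets are just integer ids, and the discipline is plain strict priority rather than WFQ, so it does nothing to prevent starvation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NPRIO      2	/* 0 = high priority, 1 = low priority */
#define QLEN       16	/* software queue depth (power of two) */
#define RING_SLOTS 2	/* deliberately tiny "DMA ring" */

struct swq {		/* software FIFO of packet ids */
	int pkt[QLEN];
	size_t head, tail;
};

struct txring {		/* once in here, the order is fixed */
	int slot[RING_SLOTS];
	size_t used;
};

static void
sched_init(struct swq q[NPRIO], struct txring *r)
{
	memset(q, 0, NPRIO * sizeof(q[0]));
	memset(r, 0, sizeof(*r));
}

static int
swq_put(struct swq *q, int id)
{
	if (q->tail - q->head == QLEN)
		return (-1);		/* queue full, caller drops */
	q->pkt[q->tail++ % QLEN] = id;
	return (0);
}

static int
swq_get(struct swq *q, int *id)
{
	if (q->head == q->tail)
		return (-1);		/* queue empty */
	*id = q->pkt[q->head++ % QLEN];
	return (0);
}

/* Move packets to the ring, highest priority first, until it is full. */
static size_t
sched_fill_ring(struct swq q[NPRIO], struct txring *r)
{
	size_t moved = 0;
	int id, p = 0;

	while (p < NPRIO && r->used < RING_SLOTS) {
		if (swq_get(&q[p], &id) != 0) {
			p++;		/* this class is empty, try the next */
			continue;
		}
		r->slot[r->used++] = id;
		moved++;
	}
	return (moved);
}

/* Pretend the hardware has transmitted everything in the ring. */
static void
ring_complete(struct txring *r)
{
	r->used = 0;
}
```

If three low-priority packets are queued and the ring is filled, two of them are committed. A high-priority packet that arrives afterwards cannot overtake those two, but it does go out ahead of the third on the next scheduler run. The smaller the ring, the smaller the window in which the order is already fixed.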

You may find this presentation I did some time back at SWINOG helpful:
http://www.networx.ch/Understanding%20QoS%20by%20Andre%20Oppermann%20-%2020090402.pdf

When QoS is active there can be only one active DMA ring per interface
unless the hardware supports the necessary scheduling discipline among
the DMA rings.  Most multi DMA ring NICs employ a simple round-robin
algorithm on a per-packet basis.  With TSO these packets can be very
large.  Any such multi DMA ring setup would render any software QoS
attempts futile.  Hence only one DMA ring can be used/active with QoS.

As far as I'm aware the only NIC that officially supports multi DMA
rings including WFQ among them is the Intel ixgbe(4).  Other 10G cards
may support it but their datasheets are not public.

> So yes, we do need that split for a lot of cases. There will be
> bare-metal cases that need very low latency, but if we implement the
> correct queue API here it'll just collapse down to either NULL, or
> just the existing software queue in front of the DMA rings to avoid
> locking overhead.

The L3/L2 mapping/encapsulation step may or may not need any locking
depending on what it has to do.  However its locking requirements may
be totally different from the DMA ring protection.

If there is no QoS enabled/active on an interface the packet after the
L3/L2 step goes straight through to the driver.  If there are multiple
DMA rings the driver looks at the flowid field in the mbuf header and
selects one of the DMA rings.  These DMA rings naturally have to be
protected by a (spin) lock to prevent concurrent access by multiple
cores.  Unless there is contention software queuing doesn't happen and
the DMA rings are sufficiently deep.
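
The ring selection in the non-QoS case can be sketched in a few lines of userland C. This is only an illustration under simple assumptions: the ring is chosen as flowid modulo the number of rings, the names (tx_ring, ring_for_flow(), tx_enqueue()) are invented here, and the per-ring lock is reduced to a comment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NRINGS     4
#define RING_DEPTH 8

struct tx_ring {
	uint32_t pkt_flow[RING_DEPTH];	/* flowid of each queued packet */
	unsigned used;
};

static void
tx_rings_init(struct tx_ring rings[], unsigned nrings)
{
	memset(rings, 0, nrings * sizeof(rings[0]));
}

/* Same flowid always yields the same ring, so a flow is never reordered. */
static unsigned
ring_for_flow(uint32_t flowid, unsigned nrings)
{
	assert(nrings > 0);
	return (flowid % nrings);
}

/* Returns the ring index used, or -1 if that ring is full. */
static int
tx_enqueue(struct tx_ring rings[], unsigned nrings, uint32_t flowid)
{
	unsigned idx = ring_for_flow(flowid, nrings);
	struct tx_ring *r = &rings[idx];

	/* A real driver would take the per-ring (spin) lock here. */
	if (r->used == RING_DEPTH)
		return (-1);
	r->pkt_flow[r->used++] = flowid;
	/* ...and release it here. */
	return ((int)idx);
}
```

Because the choice depends only on the flowid and not on the calling core, packets of one flow stay in order regardless of where the upper stack happens to run.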

-- 
Andre


From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 21:30:35 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 6EC105C8
 for <net@freebsd.org>; Wed, 30 Oct 2013 21:30:35 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id C5AEC253F
 for <net@freebsd.org>; Wed, 30 Oct 2013 21:30:34 +0000 (UTC)
Received: (qmail 64140 invoked from network); 30 Oct 2013 22:00:52 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <rizzo@iet.unipi.it>; 30 Oct 2013 22:00:52 -0000
Message-ID: <52717A62.7040600@freebsd.org>
Date: Wed, 30 Oct 2013 22:30:10 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Luigi Rizzo <rizzo@iet.unipi.it>, Adrian Chadd <adrian@freebsd.org>, 
 Navdeep Parhar <np@freebsd.org>, Randall Stewart <rrs@lakerest.net>, 
 "freebsd-net@freebsd.org" <net@freebsd.org>
Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
 <5270462B.8050305@freebsd.org>
 <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
 <20131030050056.GA84368@onelab2.iet.unipi.it>
In-Reply-To: <20131030050056.GA84368@onelab2.iet.unipi.it>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 21:30:35 -0000

On 30.10.2013 06:00, Luigi Rizzo wrote:
> On Tue, Oct 29, 2013 at 06:43:21PM -0700, Adrian Chadd wrote:
>> Hi,
>>
>> We can't assume the hardware has deep queues _and_ we can't just hand
>> packets to the DMA engine.
>> [Adrian explains why]
>
> i have the feeling that the various folks who stepped into this
> discussion seem to have completely different (and orthogonal) goals
> and as such these goals should be discussed separately.

It looks like it and it is great to have this discussion. :)

> Below is the architecture i have in mind and how i would implement it
> (and it would be extremely simple since we have most of the pieces
> in place).

[Omitted the quote further down of the good and thorough QoS description,
  to be replied to separately]

> It would be useful if people could discuss what problem they are
> addressing before coming up with patches.

Right now Glebius and I are working on the struct ifnet abstraction
which has become severely bloated and blurred over the years.  The goal
is to make it opaque to the drivers for better API/ABI stability in the
first step.

When looking at struct ifnet and its place in the kernel it becomes
evident that its actual purpose is to serve as an abstraction of a
logical layer 3 protocol interface towards the layer 2 mapping and
encapsulation, and eventually and sort of tangentially the real
hardware.

Now ifnet has become very complex and large and should be brought
back to its original purpose of being the logical layer 3 interface
abstraction.  There isn't necessarily a 1:1 mapping from one ifnet
instance to one hardware interface.  In fact there are purely logical
ifnets (gre, tun, ...), direct hardware ifnets (simple network interfaces
like fxp(4)), and multiple logical interfaces on top of a single hardware
interface (vlan, lagg, ...).  Depending on the ifnet's purpose the backend
can be very different.  Thus I want to decouple the current implicit
notion of ifnet==hardware with associated queuing and such.  Instead
it should become a layer 3 abstraction inside the kernel again and
delegate all lower layers to appropriate protocol, layer 2, and
hardware specific implementations.

 From this comes the following *rough* implementation approach to be
tested (ignore naming for now):

/* Function pointers for packets descending into layer 2 */
   (*if_l2map)(ifnet, mbuf, sockaddr, [route]);	/* from upper stack */
   (*if_tx)(ifnet, mbuf);			/* to driver or qos */
   (*if_txframe)(ifnet, mbuf);			/* to driver */
   (*if_txframedone)(ifnet);			/* callback to qos */

/* Function pointers for packets coming up from layer 1 */
   (*if_l2demap)(ifnet, mbuf);			/* l2/l3 unmapping */

When a packet comes down the stack (*if_l2map) gets called to map
and encapsulate a layer 3 packet into an appropriate layer 2 frame.
For IP this would be ether_output() together with ARP and so on.
The result of that step is the ethernet header in front of the IP
packet.  Ether_output() then calls (*if_tx) to have the frame sent
out on the wire(less) which is the driver handoff point for DMA
ring addition.  Normally (*if_tx) and (*if_txframe) are the same
and the job is done.  When software QoS is active (*if_tx) points
into the soft-qos enqueue implementation, which will eventually use
(*if_txframe) to push those packets out onto the wire as it sees fit.
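
To make the pointer arrangement concrete, here is a rough userspace
sketch, not kernel code.  struct ifnet_sketch, struct pkt and all of
the function names are made up for illustration; only the if_tx and
if_txframe hook names come from the list above.  It shows the default
case where both hooks are the same function, and how enabling software
QoS would only redirect if_tx.

```c
#include <assert.h>
#include <stddef.h>

struct pkt { int len; };    /* stand-in for struct mbuf (assumption) */

struct ifnet_sketch {
    int (*if_tx)(struct ifnet_sketch *, struct pkt *);      /* to driver or qos */
    int (*if_txframe)(struct ifnet_sketch *, struct pkt *); /* to driver */
    int ring_used;          /* frames handed to the pretend DMA ring */
    int softq_used;         /* frames parked in the pretend soft queue */
};

/* Driver handoff: pretend to add the frame to the DMA ring. */
static int
drv_txframe(struct ifnet_sketch *ifp, struct pkt *p)
{
    assert(p != NULL);
    ifp->ring_used++;
    return (0);
}

/* Soft-QoS enqueue: park the frame; a scheduler would call if_txframe later. */
static int
qos_enqueue(struct ifnet_sketch *ifp, struct pkt *p)
{
    assert(p != NULL);
    ifp->softq_used++;
    return (0);
}

/* Default setup: if_tx and if_txframe are the same function. */
static void
if_attach_sketch(struct ifnet_sketch *ifp)
{
    ifp->if_tx = ifp->if_txframe = drv_txframe;
    ifp->ring_used = ifp->softq_used = 0;
}

/* Enabling software QoS only redirects if_tx; if_txframe stays put. */
static void
if_qos_enable_sketch(struct ifnet_sketch *ifp)
{
    ifp->if_tx = qos_enqueue;
}

/* What the tail end of the layer 2 mapping step would do. */
static int
l2_output_sketch(struct ifnet_sketch *ifp, struct pkt *p)
{
    return (ifp->if_tx(ifp, p));
}
```

The point of the indirection is that the layer 2 code never needs to
know whether a QoS discipline sits in between.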

In addition the drivers have to expose functions to manage the number
and depth of their DMA rings, or rather the number/size of packets
that can be enqueued onto them.  They also have to invoke the
(*if_txframedone) callback to clock out packets from a soft-queue or
QoS discipline.  When QoS is active it probably wants to make the DMA
rings small and the software queue(s) large to be effective.
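
A small userspace sketch of that clocking, under the assumption of a
plain FIFO soft queue in place of a real QoS discipline.  struct txpath,
the depths and the function names are invented for the example; only
the role of the txframedone callback comes from the description above.

```c
#include <assert.h>

#define RING_DEPTH   2      /* deliberately small DMA ring */
#define SOFTQ_DEPTH  8      /* larger software queue */

struct txpath {
    int ring_used;              /* occupied slots on the pretend DMA ring */
    int softq[SOFTQ_DEPTH];     /* FIFO of packet ids */
    int head, count;            /* FIFO read position and fill level */
    int drops;                  /* packets refused because softq was full */
};

/* Move packets from the soft queue to the ring while the ring has room. */
static void
txpath_kick(struct txpath *t)
{
    while (t->count > 0 && t->ring_used < RING_DEPTH) {
        t->head = (t->head + 1) % SOFTQ_DEPTH;
        t->count--;
        t->ring_used++;
    }
}

/* if_tx with QoS active: enqueue, then try to feed the ring. */
static int
txpath_enqueue(struct txpath *t, int id)
{
    if (t->count == SOFTQ_DEPTH) {
        t->drops++;
        return (-1);
    }
    t->softq[(t->head + t->count) % SOFTQ_DEPTH] = id;
    t->count++;
    txpath_kick(t);
    return (0);
}

/* if_txframedone: the NIC completed n frames, so refill the ring. */
static void
txpath_txframedone(struct txpath *t, int n)
{
    assert(n >= 0 && n <= t->ring_used);
    t->ring_used -= n;
    txpath_kick(t);
}
```

With the ring kept at two slots, almost all waiting packets sit in the
soft queue where a discipline can still reorder or drop them.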

In the default setup, and when running a server, no QoS will be active
or inserted.  No or only very small software queues exist to handle
concurrency (except for ieee80211 to do sophisticated frame management
inside *if_txframe).  Whenever the DMA ring is full there is no point
in queuing up more packets.  Instead the socket buffers act as buffers
and also ensure flow control and backpressure up to userspace to limit
kernel memory usage from double and triple buffering.

How the packets are efficiently pushed out onto the wire is up to the
drivers and depends on the hardware capabilities.  It can be multiple
hardware DMA rings, or just a single ring with an efficient concurrent
access method.

-- 
Andre


From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 21:53:24 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id B686E212;
 Wed, 30 Oct 2013 21:53:24 +0000 (UTC)
 (envelope-from adrian.chadd@gmail.com)
Received: from mail-qa0-x22e.google.com (mail-qa0-x22e.google.com
 [IPv6:2607:f8b0:400d:c00::22e])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 530B026B2;
 Wed, 30 Oct 2013 21:53:24 +0000 (UTC)
Received: by mail-qa0-f46.google.com with SMTP id j15so4105948qaq.12
 for <multiple recipients>; Wed, 30 Oct 2013 14:53:23 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type;
 bh=YbkY7Urr5B8bLFKIzNB0U8SrzTyjLAS2IhuUCMcPyuw=;
 b=fh3YpF9i/4PgegyNcJ/PjigtGunCGuGNKErshEhm3k6JQQdWbrqvZVPZsHgBdax97q
 Dc/4C6CENlZgyZOhx12ETqBKcemGVQZAdQNKC+f4GoAp7JCuvCD8bwFBeKRbLJSY/eiR
 Ii/hZtnzfJtfz2CGQmAwLiHwLNZLMi9aCKkA0LwQ2ojmCUxM8tmbBb3Ork+xPgcWAte4
 UGVGlKRtvj8yJcEfsvmMp6zcQo/uSkvLD7bBHA0KEihjmgZ3ItlhCYXFr8SRu41F+KX5
 CzwX5PBd4pCirRfeRlRgOABBjHIc4obm12qCSUE4kDhOBa5sd2FTm1EymhFGDFLcaKEk
 TG8g==
MIME-Version: 1.0
X-Received: by 10.224.113.199 with SMTP id b7mr980525qaq.4.1383170003477; Wed,
 30 Oct 2013 14:53:23 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.224.207.66 with HTTP; Wed, 30 Oct 2013 14:53:23 -0700 (PDT)
In-Reply-To: <52717A62.7040600@freebsd.org>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
 <5270462B.8050305@freebsd.org>
 <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
 <20131030050056.GA84368@onelab2.iet.unipi.it>
 <52717A62.7040600@freebsd.org>
Date: Wed, 30 Oct 2013 14:53:23 -0700
X-Google-Sender-Auth: E1dx_NzsQWZ0OYjPA511t5MA6GI
Message-ID: <CAJ-VmonUiBw+_auJEz254Gsyu9yq2awoFKyKDM9S4iY5S8BiOA@mail.gmail.com>
Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
From: Adrian Chadd <adrian@freebsd.org>
To: Andre Oppermann <andre@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
Cc: "freebsd-net@freebsd.org" <net@freebsd.org>,
 Luigi Rizzo <rizzo@iet.unipi.it>, Navdeep Parhar <np@freebsd.org>,
 Randall Stewart <rrs@lakerest.net>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 21:53:24 -0000

On 30 October 2013 14:30, Andre Oppermann <andre@freebsd.org> wrote:

> As default setup and when running a server no QoS will be active
> or inserted.  No or only very small software queues exist to handle
> concurrency (except for ieee80211 to do sophisticated frame management
> inside *if_txframe).  Whenever the DMA ring is full there is no point
> in queuing up more packets.  Instead the socket buffers act as buffers
> and also ensure flow control and backpressure up to userspace to limit
> kernel memory usage from double and triple buffering.

.. and what about for LAN<->WAN traffic, where there's no socket buffers?


-adrian

From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 22:02:17 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id B23FD5CF
 for <freebsd-net@freebsd.org>; Wed, 30 Oct 2013 22:02:17 +0000 (UTC)
 (envelope-from garmitage@swin.edu.au)
Received: from gpo3.cc.swin.edu.au (gpo3.cc.swin.edu.au [136.186.1.32])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 470F02749
 for <freebsd-net@freebsd.org>; Wed, 30 Oct 2013 22:02:16 +0000 (UTC)
Received: from [136.186.229.37] (garmitage.caia.swin.edu.au [136.186.229.37])
 by gpo3.cc.swin.edu.au (8.14.3/8.14.3) with ESMTP id r9UM1isa021721
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); 
 Thu, 31 Oct 2013 09:02:04 +1100
Message-ID: <527181C8.3040502@swin.edu.au>
Date: Thu, 31 Oct 2013 09:01:44 +1100
From: grenville armitage <garmitage@swin.edu.au>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:16.0) Gecko/20121107 Thunderbird/16.0.2
MIME-Version: 1.0
To: freebsd-net@freebsd.org
Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
 <5270462B.8050305@freebsd.org>
 <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
 <20131030050056.GA84368@onelab2.iet.unipi.it>
In-Reply-To: <20131030050056.GA84368@onelab2.iet.unipi.it>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 22:02:17 -0000


On 10/30/2013 16:00, Luigi Rizzo wrote:
	[..]
> Just to set the terminology:
> QUEUE MANAGEMENT POLICY refers to decisions that we make when we INSERT
> 	or EXTRACT packets from a queue WITHOUT LOOKING AT OTHER QUEUES .
> 	This is what implements DROPTAIL (also improperly called FIFO),
> 	RED, CODEL. Note that for CODEL you need to intercept extractions
> 	from the queue, whereas DROPTAIL and RED only act on
> 	insertions.
>
> SCHEDULER is the entity which decides which queue to serve among
> 	the many possible ones. It is called on INSERTIONS and
> 	EXTRACTIONS from a queue, and passes packets to the NIC's queue.
>
> The decision on which queue and ring (Q_i and R_j) to use should be made
> by a classifier at the beginning of step 2 (or once per iteration,
> if using a hierarchical scheduler). Of course they can be precomputed
> (e.g. with annotations in the mbuf coming from the socket).

I'd like to give a big +1 to the above. Crucial additional points
about the per-hop processing for QoS:

  - Classification is any decision of the form "to what class does
this frame belong", where the answer is intended to drive the frame
into the appropriate queue. (Which implies the notion of 'class' is
very much context-dependent, and classification is something that may
occur on L3 tuples, MPLS headers, other L2 fields, other local in-kernel
context, etc.)

  - Queuing and scheduling must happen where bottlenecks form, and
are irrelevant at points in the data path where no bottleneck exists.

cheers,
gja


From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 22:17:19 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id D9771D6F
 for <net@freebsd.org>; Wed, 30 Oct 2013 22:17:19 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id EEA9F281D
 for <net@freebsd.org>; Wed, 30 Oct 2013 22:17:18 +0000 (UTC)
Received: (qmail 64356 invoked from network); 30 Oct 2013 22:47:36 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <rizzo@iet.unipi.it>; 30 Oct 2013 22:47:36 -0000
Message-ID: <52718556.9010808@freebsd.org>
Date: Wed, 30 Oct 2013 23:16:54 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Luigi Rizzo <rizzo@iet.unipi.it>, Adrian Chadd <adrian@freebsd.org>, 
 Navdeep Parhar <np@freebsd.org>, Randall Stewart <rrs@lakerest.net>, 
 "freebsd-net@freebsd.org" <net@freebsd.org>
Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
 <5270462B.8050305@freebsd.org>
 <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
 <20131030050056.GA84368@onelab2.iet.unipi.it>
In-Reply-To: <20131030050056.GA84368@onelab2.iet.unipi.it>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 22:17:19 -0000

On 30.10.2013 06:00, Luigi Rizzo wrote:
> On Tue, Oct 29, 2013 at 06:43:21PM -0700, Adrian Chadd wrote:
>> Hi,
>>
>> We can't assume the hardware has deep queues _and_ we can't just hand
>> packets to the DMA engine.
>> [Adrian explains why]

[skipping things replied to in other email]

> The architecture i think we should pursue is this (which happens to be
> what linux implements, and also what dummynet implements, except
> that the output is to a dummynet pipe or to ether_output() or to
> ip_output() depending on the configuration):
>
>     1. multiple (one per core) concurrent transmitters t_c

That's simply the number of cores that in theory could try to send
a packet at the same time?  Or is it supposed to be an actual structure?

> 	which use ether_output_frame() to send to
>
>     2. multiple disjoint queues q_j
> 	(one per traffic group, can be *a lot*, say 10^6)

Whooo, that looks a bit excessive.  So many traffic groups would
effectively be one per flow?

Most of the time traffic is distributed into 4-8 classes with
strict priority for the highest class (VoIP) and some sort of
proportional WFQ for the others.  At least that's the standard
setup for carrier/ISP networks.
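
As a rough illustration of such a setup, the following userspace sketch
serves class 0 with strict priority and the remaining classes by a
simple packet-count weighted round robin.  That is a much cruder
stand-in for real WFQ, and struct sched with all its fields is made up
for the example.

```c
#include <assert.h>

#define NCLASS 4

struct sched {
    int backlog[NCLASS];    /* packets waiting per class */
    int weight[NCLASS];     /* weight[0] unused: class 0 is strict priority */
    int credit[NCLASS];     /* packets a class may still send this round */
    int next;               /* round-robin position, in 1..NCLASS-1 */
};

/* Return the class to serve next, or -1 if nothing is queued. */
static int
sched_dequeue(struct sched *s)
{
    int i, c, pass;

    assert(s->next >= 1 && s->next < NCLASS);

    /* Strict priority: class 0 always goes first. */
    if (s->backlog[0] > 0) {
        s->backlog[0]--;
        return (0);
    }
    for (pass = 0; pass < 2; pass++) {
        for (i = 0; i < NCLASS - 1; i++) {
            c = 1 + (s->next - 1 + i) % (NCLASS - 1);
            if (s->backlog[c] > 0 && s->credit[c] > 0) {
                s->credit[c]--;
                s->backlog[c]--;
                s->next = c;    /* stay on c until its credit runs out */
                return (c);
            }
        }
        /* No backlogged class has credit left: start a new round. */
        for (c = 1; c < NCLASS; c++)
            s->credit[c] = s->weight[c];
    }
    return (-1);
}
```

With weights 2:1 and both classes backlogged, the two classes get
served in a 2:1 packet ratio whenever the priority class is idle.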

> 	which are scheduled with a scheduler S
>          (iterate step 2 for hierarchical schedulers)
> 	and

Makes sense.

>     3. eventually feed ONE transmit ring R_j on the NIC.

Agreed, more than one wouldn't work because otherwise the NIC would
do poor man's RR among the queues.

> 	Once a packet reaches R_j, for all practical purpose
> 	is on the wire. We cannot intercept extractions,
> 	we cannot interfere with the scheduler in the NIC in
> 	case of multiqueue NICs. The most we can do (and should,
> 	as in Linux) is notify the owner of the packet once its
> 	transmission is complete.

Per packet notification probably has a high overhead on high pps
systems.  The coalesced TX complete interrupt should do for QoS
purposes as well to keep the DMA ring fed.  We do not track who
generated the packet and thus can't have the notification bubble
up to the PCB (if any).

> Just to set the terminology:
> QUEUE MANAGEMENT POLICY refers to decisions that we make when we INSERT
> 	or EXTRACT packets from a queue WITHOUT LOOKING AT OTHER QUEUES .
> 	This is what implements DROPTAIL (also improperly called FIFO),
> 	RED, CODEL. Note that for CODEL you need to intercept extractions
> 	from the queue, whereas DROPTAIL and RED only act on
> 	insertions.

Ack.

> SCHEDULER is the entity which decides which queue to serve among
> 	the many possible ones. It is called on INSERTIONS and
> 	EXTRACTIONS from a queue, and passes packets to the NIC's queue.

Ack.

> The decision on which queue and ring (Q_i and R_j) to use should be made
> by a classifier at the beginning of step 2 (or once per iteration,
> if using a hierarchical scheduler). Of course they can be precomputed
> (e.g. with annotations in the mbuf coming from the socket).

IMHO that is the job of a packet filter, or in simple cases can be
transposed into the mbuf header from vlan header cos or IP header
tos fields.
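
For the simple case, transposing the header fields could look roughly
like this.  It is a userspace sketch; struct pkthdr_sketch and its
qos_class field are hypothetical and not the real mbuf packet header.
It takes the 3-bit priority from the 802.1Q tag if one is present and
otherwise the 3 precedence bits of the IPv4 TOS byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pkthdr_sketch {
    int      has_vlan;      /* non-zero if an 802.1Q tag was present */
    uint16_t vlan_tci;      /* 802.1Q TCI: priority (PCP) in bits 15..13 */
    uint8_t  ip_tos;        /* IPv4 TOS byte: precedence in bits 7..5 */
    uint8_t  qos_class;     /* result of classification, 0..7 */
};

/* Copy the priority found in the headers into the packet header. */
static void
classify_simple(struct pkthdr_sketch *h)
{
    assert(h != NULL);
    if (h->has_vlan)
        h->qos_class = (h->vlan_tci >> 13) & 0x7;
    else
        h->qos_class = (h->ip_tos >> 5) & 0x7;
}
```

Anything more elaborate than this (ports, tuples, per-customer rules)
would stay with the packet filter.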

> Now when it comes to implementing the above, we have three
> cases (or different optimization levels, if you like)

-- 0. THE NO QOS CASE ---

No QoS is done and one of multiple DMA rings is selected based on the
flowid to reduce contention while avoiding packet reordering.
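
In its simplest form that selection is just a modulo on the flowid, as
in this sketch (select_tx_ring is a made-up name, and real drivers may
map flows to rings differently):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick a TX ring from the flowid.  All packets of one flow carry the
 * same flowid and thus land on the same ring, which keeps them in order.
 */
static unsigned
select_tx_ring(uint32_t flowid, unsigned nrings)
{
    assert(nrings > 0);
    return (flowid % nrings);
}
```

Different flows spread over the rings, so cores sending unrelated
flows rarely contend for the same ring lock.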

> -- 1. THE SIMPLE CASE ---
>
> In the simplest possible case we can let the NIC do everything.
> Necessary conditions are:
> - queue management policies acting only on insertions
>    (e.g. DROPTAIL or RED or similar);
> - # of traffic classes <= # number of NIC rings
> - scheduling policy S equal to the one implemented in the NIC
>    (trivial case: one queue, one ring, no scheduler)
>
> All these cases match exactly what the hardware provides, so we can just
> use the NIC ring(s) without extra queue(s), and possibly use something
> like buf_ring to manage insertions (but note that insertions in
> an empty queue will end up requiring a lock; and i think the
> same happens even now with the extra drbr queue in front of the ring).

Agreed.  A lock on the DMA ring is always required to protect the ring
structure and NIC doorbell.  Software queuing or buf_ring shouldn't be
necessary at all.  Only some mechanism to make concurrent access/backoff
to the same DMA ring more efficient may be good.  For example having one
packet slot per core instead of spinning.

> -- 2. THE INTERMEDIATE CASE ---
>
> If we do not care about a scheduler but want a more complex QUEUE
> MANAGEMENT, such as CODEL, that acts on extractions, we _must_
> implement an intermediate queue Q_i before the NIC ring.  This is
> our only chance to act on extractions from the queue (which CODEL
> requires).  Note that we DO NOT NEED to create multiple queues for
> each ring.

As long as the NIC doesn't implement fair RR or interleaving among
multiple DMA rings any sort of queue management is futile.  Whenever
queue management is active only one DMA ring may be used and it should
be as small as possible to give maximum decision latitude to the queue
management.

> -- 3. THE COMPLETE CASE ---
>
> This is when the scheduler we want (DRR, WFQ variants, PRIORITY...)
> is not implemented in the NIC, or we have more queues than those
> available in the NIC. In this case we need to invoke this extra
> block before passing packets to the NIC.

Again the same as in 2. applies, just with a more complex soft queue
and scheduler.

> Remember that dummynet implements exactly #3, and it comes with a
> set of pretty efficient schedulers (i have made extensive measurements
> on them, see links to papers on my research page
> http://info.iet.unipi.it/~luigi/research.html ).
> They are by no means a performance bottleneck (scheduling takes
> 50..200ns depending on the circumstances) in the cases where
> it matters to have a scheduler (which is, when the sender is
> faster than the NIC, which in turn only happens with large packets
> which take 1..30us to get through at the very least..

Thanks for the information.

> --- IMPLEMENTATION ---
>
> Apart from ALTQ (which is very slow and has inefficient schedulers
> and i don't think anybody wants to maintain), and with the exception
> of dummynet which I'll discuss later, at the moment FreeBSD does not
> support schedulers in the tx path of the device driver.

I haven't really dug into ALTQ/dummynet yet, however from looking
them over you seem to be very much right.

The basis for a fresh generic QoS implementation should be dummynet
(developed in parallel, to keep dummynet itself intact).

> So we can only deal with cases 1 and 2, and for them the software
> queue + ring suffices to implement any QUEUE MANAGEMENT policy
> (but we don't implement anything).
>
> If we want support the generic case (#3), we should do the following:
>
> 1. device drivers export a function to transmit on an individual ring,
>    basically the current if_transmit(), and a hook to play with the
>    corresponding queue lock (the scheduler needs to run under lock,
>    and we can as well use the ring lock for that).
>    Note that the ether_output_frame does not always need to
>    call the scheduler: if a packet enters a non-empty queue, we are done.

OK.

> 2. device drivers also export the number of tx queues, and
>    some (advisory) information on queue status

OK.

> 3. ether_output_frame() runs the classifier (if needed), invokes
>    the scheduler (if needed) and possibly falls through into if_transmit()
>    for the specific ring.

OK.

> 4. on transmit completions (*_txeof(), typically), a callback invokes
>    the scheduler to feed the NIC ring with more packets

Ack.

> I mentioned dummynet: it already implements ALL of this,
> including the completion callback in #4. There is a hook
> in ether_output_frame(), and the hook was called (up to 8.0
> i believe) if_tx_rdy(). You can see what it does in
> RELENG_4, sys/netinet/ip_dummynet.c :: if_tx_rdy()
>
> http://svnweb.freebsd.org/base/stable/4/sys/netinet/ip_dummynet.c?revision=123994&view=markup
>
> if_tx_rdy() does not exist anymore because almost nobody used it,
> but it is trivial to reimplement, and can be called by device drivers
> when *_txeof() finds that is running low on packets _and_ the
> specific NIC needs to implement the "complete" scheduling.

Yup.

> The way it worked in dummynet (I think i used it in on 'tun' and 'ed')
> is also documented in the manpage:
> define a pipe whose bandwidth is set as the device name instead
> of a number. Then you can attach a scheduler to the pipe, queues
> to the scheduler, and you are done.  Example:
>
>      // this is the scheduler's configuration
> 	ipfw pipe 10 config bw 'em2' sched
> 	ipfw sched 10 config type drr // deficit round robin
> 	ipfw queue 1 config weight 30 sched 10 // important
> 	ipfw queue 2 config weight 5 sched 10 // less important
> 	ipfw queue 3 config weight 1 sched 10 // who cares...
>
>      // and this is the classifier, which you can skip if the
>      // packets are already pre-classified.
>      // The infrastructure is already there to implement per-interface
>      // configurations.
> 	ipfw add queue 1 src-port 53
> 	ipfw add queue 2 src-port 22
> 	ipfw add queue 2 ip from any to any
>
> Now, surely we can replace the implementation of packet queues in dummynet
> from the current TAILQ to something resembling buf_ring to improve
> write parallelism; and a bit of glue code is needed to attach
> per-interface ipfw instances to each interface, and some smarts in
> the configuration commands is needed to figure out when we can
> bypass everything or not.

I'll experiment with variations thereof.

> But this seems to me a much more viable approach to achieve proper QoS
> support in our architecture.

Indeed.  Let me get some code and prototypes going in the next weeks
and then pick up the discussion from there again.

-- 
Andre


From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 22:23:40 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 02707F4A
 for <net@freebsd.org>; Wed, 30 Oct 2013 22:23:40 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 64118288C
 for <net@freebsd.org>; Wed, 30 Oct 2013 22:23:38 +0000 (UTC)
Received: (qmail 64385 invoked from network); 30 Oct 2013 22:53:56 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <adrian@freebsd.org>; 30 Oct 2013 22:53:56 -0000
Message-ID: <527186D3.7090307@freebsd.org>
Date: Wed, 30 Oct 2013 23:23:15 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Adrian Chadd <adrian@freebsd.org>
Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>	<526FFED9.1070704@freebsd.org>	<CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>	<52701D8B.8050907@freebsd.org>	<527022AC.4030502@FreeBSD.org>	<527027CE.5040806@freebsd.org>	<5270309E.5090403@FreeBSD.org>	<5270462B.8050305@freebsd.org>	<CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>	<20131030050056.GA84368@onelab2.iet.unipi.it>	<52717A62.7040600@freebsd.org>
 <CAJ-VmonUiBw+_auJEz254Gsyu9yq2awoFKyKDM9S4iY5S8BiOA@mail.gmail.com>
In-Reply-To: <CAJ-VmonUiBw+_auJEz254Gsyu9yq2awoFKyKDM9S4iY5S8BiOA@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "freebsd-net@freebsd.org" <net@freebsd.org>,
 Luigi Rizzo <rizzo@iet.unipi.it>, Navdeep Parhar <np@freebsd.org>,
 Randall Stewart <rrs@lakerest.net>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 22:23:40 -0000

On 30.10.2013 22:53, Adrian Chadd wrote:
> On 30 October 2013 14:30, Andre Oppermann <andre@freebsd.org> wrote:
>
>> As default setup and when running a server no QoS will be active
>> or inserted.  No or only very small software queues exist to handle
>> concurrency (except for ieee80211 to do sophisticated frame management
>> inside *if_txframe).  Whenever the DMA ring is full there is no point
>> in queuing up more packets.  Instead the socket buffers act as buffers
>> and also ensure flow control and backpressure up to userspace to limit
>> kernel memory usage from double and triple buffering.
>
> .. and what about for LAN<->WAN traffic, where there's no socket buffers?

When the DMA ring is full (in case of a deep ring, or the software queue
for small DMA rings) additional packets get dropped as they are today.
Instead of tail dropping, an active queue management algorithm like RED
may be used.  There is no point in ultra deep buffering that ends up
adding tens or hundreds of milliseconds of delay (see bufferbloat).  If
there is more egress traffic destined for an interface than it can handle
there is no way to avoid packet drops.  It's actually a good thing because
for TCP packet drops are the primary feedback for its sending behavior.
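
For reference, the core of classic RED is small.  The sketch below
shows the averaged queue length and the resulting drop probability in
userspace C.  The function and parameter names are my own, and it
leaves out refinements of the full algorithm such as the count-based
spacing of drops and idle-time handling.

```c
#include <assert.h>

/* Exponentially weighted moving average of the queue length. */
static double
red_avg_update(double avg, double qlen, double w)
{
    assert(w > 0.0 && w <= 1.0);
    return ((1.0 - w) * avg + w * qlen);
}

/*
 * Drop probability for the current average queue length:
 * nothing below min_th, everything at or above max_th, and a
 * linear ramp from 0 up to max_p in between.
 */
static double
red_drop_prob(double avg, double min_th, double max_th, double max_p)
{
    assert(min_th < max_th);
    if (avg < min_th)
        return (0.0);
    if (avg >= max_th)
        return (1.0);
    return (max_p * (avg - min_th) / (max_th - min_th));
}
```

Tail drop is the degenerate case where nothing is dropped until the
queue is completely full.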

-- 
Andre


From owner-freebsd-net@FreeBSD.ORG  Wed Oct 30 22:32:09 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 4AA0316F
 for <freebsd-net@freebsd.org>; Wed, 30 Oct 2013 22:32:09 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 92D152910
 for <freebsd-net@freebsd.org>; Wed, 30 Oct 2013 22:32:08 +0000 (UTC)
Received: (qmail 64443 invoked from network); 30 Oct 2013 23:02:26 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <garmitage@swin.edu.au>; 30 Oct 2013 23:02:26 -0000
Message-ID: <527188D1.2070905@freebsd.org>
Date: Wed, 30 Oct 2013 23:31:45 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: grenville armitage <garmitage@swin.edu.au>, freebsd-net@freebsd.org
Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
 <5270462B.8050305@freebsd.org>
 <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
 <20131030050056.GA84368@onelab2.iet.unipi.it> <527181C8.3040502@swin.edu.au>
In-Reply-To: <527181C8.3040502@swin.edu.au>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Oct 2013 22:32:09 -0000

On 30.10.2013 23:01, grenville armitage wrote:
> On 10/30/2013 16:00, Luigi Rizzo wrote:
>      [..]
>> Just to set the terminology:
>> QUEUE MANAGEMENT POLICY refers to decisions that we make when we INSERT
>>     or EXTRACT packets from a queue WITHOUT LOOKING AT OTHER QUEUES .
>>     This is what implements DROPTAIL (also improperly called FIFO),
>>     RED, CODEL. Note that for CODEL you need to intercept extractions
>>     from the queue, whereas DROPTAIL and RED only act on
>>     insertions.
>>
>> SCHEDULER is the entity which decides which queue to serve among
>>     the many possible ones. It is called on INSERTIONS and
>>     EXTRACTIONS from a queue, and passes packets to the NIC's queue.
>>
>> The decision on which queue and ring (Q_i and R_j) to use should be made
>> by a classifier at the beginning of step 2 (or once per iteration,
>> if using a hierarchical scheduler). Of course they can be precomputed
>> (e.g. with annotations in the mbuf coming from the socket).
>
> I'd like to give a big +1 to the above. Crucial additional points
> about the per-hop processing for QoS:
>
>   - Classification is any decision of the form "to what class does
> this frame belong", where the answer is intended to drive the frame
> into the appropriate queue. (Which implies the notion of 'class' is
> very much context-dependent, and classification is something that may
> occur on L3 tuples, MPLS headers, other L2 fields, other local in-kernel
> context,etc.)

Full ack.  When the class information is present (and trusted) on ingress
packets in the vlan header, IP tos and other such well-defined fields we
can map it directly to the mbuf header qoscos field.  Everything more complex
has to be done in a packet filter that has access to and can parse L3 and
higher layers in the packet.  On egress only the mbuf header is looked at
to determine the class and queue it should be put into.
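
As a rough illustration of the "map it directly" case, the sketch below
pulls the priority bits out of a VLAN TCI and an IP TOS byte. The bit
layouts follow IEEE 802.1Q and the DSCP definition; the function names and
the fallback order are assumptions made for this example, not FreeBSD code,
and nothing here touches the real mbuf header.

```python
# Illustrative only: bit layouts are from IEEE 802.1Q and the DSCP field
# definition; the function names and the fallback order are hypothetical.

def vlan_pcp(tci: int) -> int:
    """Priority Code Point: the top 3 bits of the 16-bit VLAN TCI."""
    return (tci >> 13) & 0x7


def vlan_id(tci: int) -> int:
    """VLAN identifier: the low 12 bits of the TCI."""
    return tci & 0x0FFF


def ip_dscp(tos: int) -> int:
    """DSCP: the top 6 bits of the IPv4 TOS byte (low 2 bits are ECN)."""
    return (tos >> 2) & 0x3F


def class_from_headers(tci=None, tos=None) -> int:
    """Pick a class in 0..7 from trusted header fields.

    Assumption for this sketch: a VLAN priority wins when a tag is
    present, otherwise the top three DSCP bits are used, otherwise 0.
    """
    if tci is not None:
        return vlan_pcp(tci)
    if tos is not None:
        return ip_dscp(tos) >> 3
    return 0


if __name__ == "__main__":
    tci = 0xA034  # priority 5, VLAN 52
    print(vlan_pcp(tci), vlan_id(tci), class_from_headers(tos=0xB8))
```

Anything that needs more than these fixed fields would still go through
the packet filter, as described above.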

>   - Queuing and schedule must happen where bottlenecks form, and
> are irrelevant at points in the data path where no bottleneck exists.

Very well put and *the* one crucial thing to understand to make any kind
of QoS work in practice.

-- 
Andre


From owner-freebsd-net@FreeBSD.ORG  Thu Oct 31 00:32:55 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id CF1DF140;
 Thu, 31 Oct 2013 00:32:55 +0000 (UTC)
 (envelope-from luigi@onelab2.iet.unipi.it)
Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238])
 by mx1.freebsd.org (Postfix) with ESMTP id 6EB092FD0;
 Thu, 31 Oct 2013 00:32:52 +0000 (UTC)
Received: by onelab2.iet.unipi.it (Postfix, from userid 275)
 id 0EC437300A; Thu, 31 Oct 2013 01:34:38 +0100 (CET)
Date: Thu, 31 Oct 2013 01:34:38 +0100
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Andre Oppermann <andre@freebsd.org>
Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
Message-ID: <20131031003438.GA10518@onelab2.iet.unipi.it>
References: <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
 <5270462B.8050305@freebsd.org>
 <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
 <20131030050056.GA84368@onelab2.iet.unipi.it>
 <52718556.9010808@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <52718556.9010808@freebsd.org>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: Adrian Chadd <adrian@freebsd.org>,
 "freebsd-net@freebsd.org" <net@freebsd.org>, Navdeep Parhar <np@freebsd.org>,
 Randall Stewart <rrs@lakerest.net>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Oct 2013 00:32:55 -0000

On Wed, Oct 30, 2013 at 11:16:54PM +0100, Andre Oppermann wrote:
> On 30.10.2013 06:00, Luigi Rizzo wrote:
...
> [skipping things replied to in other email]

Likewise, and let me thank you for the detailed comments.
I am adding a few comments of my own below.

> > The architecture i think we should pursue is this (which happens to be
> > what linux implements, and also what dummynet implements, except
> > that the output is to a dummynet pipe or to ether_output() or to
> > ip_output() depending on the configuration):
> >
> >     1. multiple (one per core) concurrent transmitters t_c
> 
> That's simply the number of cores that in theory could try to send
> a packet at the time?  Or is it supposed to be an actual structure?

it is just the number of cores that could potentially compete at
any time in using one scheduler

> > 	which use ether_output_frame() to send to
> >
> >     2. multiple disjoint queues q_j
> > 	(one per traffic group, can be *a lot*, say 10^6)
> 
> Whooo, that looks a bit excessive.  So many traffic groups would
> effectively be one per flow?

It depends on what you define as "flow", and I explicitly did not
use the term as it is ambiguous. For me a traffic group is whatever
a classifier decides to put together.

The point of aiming for a large number of classes is to avoid making
assumptions that will limit us in the future, e.g. reserving too
small a field to represent the queue id, or statically allocating
queues, and the like.
Most schedulers in dummynet scale as O(1) with the number of classes,
so the only issue is having enough memory; and in any case
the actual max number of classes depends on the output of your classifier.

A lot of dummynet configurations (driving the upstream link for a
leaf network, so right in front of the bottleneck) use a handful of
groups _per host_: say one for voip, one for dns/ssh, one for bulk
traffic, assigning different weights. A QFQ scheduler can easily
end up with a few thousand queues and still efficiently achieve
fair sharing of bandwidth.
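
To make the split between queue management and scheduling concrete, here
is a small sketch. It is not QFQ and not dummynet code: it uses a plain
deficit-round-robin variant because it fits in a few lines, and every
class and method name is made up for the example. Queues are created the
first time a group id shows up, so there is no fixed table size, and
weights are assumed to be positive.

```python
from collections import deque


class DropTailQueue:
    """Queue management policy: decides on insertion only and never
    looks at any other queue."""

    def __init__(self, limit):
        self.limit = limit
        self.pkts = deque()

    def enqueue(self, pkt):
        if len(self.pkts) >= self.limit:
            return False  # queue full: drop the arriving packet
        self.pkts.append(pkt)
        return True


class DrrScheduler:
    """Scheduler: decides which backlogged queue is served next.

    Packets are represented by their length in bytes.
    """

    def __init__(self, quantum=1500, limit=50):
        self.quantum = quantum
        self.limit = limit
        self.queues = {}   # group id -> DropTailQueue, created on demand
        self.weights = {}  # group id -> positive weight (default 1)
        self.deficit = {}  # group id -> byte credit
        self.active = deque()  # backlogged group ids, in service order

    def set_weight(self, group, weight):
        self.weights[group] = weight

    def enqueue(self, group, length):
        q = self.queues.get(group)
        if q is None:
            q = self.queues[group] = DropTailQueue(self.limit)
            self.deficit[group] = 0
        was_empty = not q.pkts
        accepted = q.enqueue(length)
        if accepted and was_empty:
            self.active.append(group)
        return accepted

    def dequeue(self):
        """Return (group, length) for the next packet, or None if idle."""
        while self.active:
            group = self.active[0]
            q = self.queues[group]
            head = q.pkts[0]
            if self.deficit[group] < head:
                # Not enough credit: grant a weighted quantum and move
                # this group to the back of the service order.
                self.deficit[group] += self.quantum * self.weights.get(group, 1)
                self.active.rotate(-1)
                continue
            q.pkts.popleft()
            self.deficit[group] -= head
            if not q.pkts:
                self.active.popleft()
                self.deficit[group] = 0
            return group, head
        return None
```

With two backlogged groups weighted 2 and 1 and equal-sized packets, the
first group is served twice for every packet of the second. A CODEL-style
policy would additionally need a hook on the extraction side of
DropTailQueue, which this sketch does not have.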

> Most of the time traffic is distributed into 4-8 classes with
> strict priority for the highest class (VoIP) and some sort of
> proportional WFQ for the others.  At least that's the standard
> setup for carrier/ISP networks.

This is for two reasons:
- the ISP does not need to care about individual hosts within the
  customer's network, but only (possibly) about the coarse classification
  that the customer has made via TOS/COS bits.
- boxes that only have a handful of queues handled with priority
  cost infinitely less than decent ones, so ISPs have an incentive
  not to separate individual customers (which they should do),
  especially if the SLA is "your upstream bandwidth is 1 Mbit/s,
  but the guaranteed bandwidth is 30 Kbit/s" (typical ADSL in Italy).

But again, it is important that we support large sets of classes,
even if we do not necessarily have to use them.

> > 	Once a packet reaches R_j, for all practical purpose
> > 	is on the wire. We cannot intercept extractions,
> > 	we cannot interfere with the scheduler in the NIC in
> > 	case of multiqueue NICs. The most we can do (and should,
> > 	as in Linux) is notify the owner of the packet once its
> > 	transmission is complete.
> 
> Per packet notification probably has a high overhead on high pps
> systems.  The coalesced TX complete interrupt should do for QoS
> purposes as well to keep the DMA ring fed.  We do not track who
> generated the packet and thus can't have the notification bubble
> up to the PCB (if any).

I know we don't do it now, but Linux does and performance is not
impacted badly.  Notifications can be easily batched and in the end
they only cause a selwakeup(). Anyway, this can be retrofitted if
we have a reference from the mbuf to the owner/socket, and a pointer
to a callback.
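
A sketch of what batching could look like, under the assumption that each
packet carries a reference to its owner. The Packet tuple, the owner field
and the on_tx_done callback are hypothetical names for this example; none
of them exist in the stack today.

```python
from collections import namedtuple

# Hypothetical stand-in for an mbuf that remembers who generated it.
Packet = namedtuple("Packet", "owner length")


def tx_complete_batch(completed, on_tx_done):
    """Handle one coalesced TX-complete event.

    'completed' is the list of packets the NIC has finished sending.
    Bytes are summed per owner so that each owner is notified once per
    batch instead of once per packet. Returns the number of owners
    that were notified.
    """
    freed = {}
    for pkt in completed:
        if pkt.owner is None:  # e.g. forwarded traffic, no local socket
            continue
        freed[pkt.owner] = freed.get(pkt.owner, 0) + pkt.length
    for owner, nbytes in freed.items():
        on_tx_done(owner, nbytes)
    return len(freed)


if __name__ == "__main__":
    batch = [Packet("sock-a", 1500), Packet("sock-b", 64),
             Packet("sock-a", 1500), Packet(None, 60)]
    tx_complete_batch(batch, lambda owner, n: print(owner, n))
```

The cost per interrupt is then proportional to the number of distinct
owners in the batch rather than to the number of packets.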

> > The decision on which queue and ring (Q_i and R_j) to use should be made
> > by a classifier at the beginning of step 2 (or once per iteration,
> > if using a hierarchical scheduler). Of course they can be precomputed
> > (e.g. with annotations in the mbuf coming from the socket).
> 
> IMHO that is the job of a packet filter, or in simple cases can be
> transposed into the mbuf header from vlan header cos or IP header
> tos fields.

We are in sync here, just the terminology differs.
A classifier is the first half of a packet filter (which first
classifies and then applies an action). And yes, the classification
info can come from the headers.
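
The two halves can be sketched as follows. The rule table, the class
names and the port numbers are invented for illustration, and packets are
plain dicts; the only point is that classification returns a class and
the action is looked up separately.

```python
# Hypothetical rule table: (match predicate, class name), first match wins.
RULES = [
    (lambda p: p.get("dport") == 5060, "voip"),
    (lambda p: p.get("dport") in (22, 53), "interactive"),
]

# Hypothetical actions: here each class just selects a queue index.
ACTIONS = {
    "voip": lambda p: ("queue", 0),
    "interactive": lambda p: ("queue", 1),
    "bulk": lambda p: ("queue", 2),
}


def classify(pkt, rules, default="bulk"):
    """First half: return the class of the first matching rule."""
    for match, cls in rules:
        if match(pkt):
            return cls
    return default


def packet_filter(pkt, rules, actions):
    """Second half: apply the action configured for the packet's class."""
    return actions[classify(pkt, rules)](pkt)


if __name__ == "__main__":
    print(packet_filter({"dport": 53}, RULES, ACTIONS))
```

A precomputed annotation in the mbuf would simply replace the call to
classify() with a field lookup.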

cheers
luigi

From owner-freebsd-net@FreeBSD.ORG  Thu Oct 31 02:45:01 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@smarthost.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id BE20060C;
 Thu, 31 Oct 2013 02:45:01 +0000 (UTC)
 (envelope-from linimon@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 9379926F1;
 Thu, 31 Oct 2013 02:45:01 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r9V2j1Yi092704;
 Thu, 31 Oct 2013 02:45:01 GMT
 (envelope-from linimon@freefall.freebsd.org)
Received: (from linimon@localhost)
 by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r9V2j1P1092703;
 Thu, 31 Oct 2013 02:45:01 GMT (envelope-from linimon)
Date: Thu, 31 Oct 2013 02:45:01 GMT
Message-Id: <201310310245.r9V2j1P1092703@freefall.freebsd.org>
To: linimon@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-net@FreeBSD.org
From: linimon@FreeBSD.org
Subject: Re: kern/183390: [ixgbe] 10gigabit networking problems
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Oct 2013 02:45:01 -0000

Old Synopsis: 10gigabit networking problems
New Synopsis: [ixgbe] 10gigabit networking problems

Responsible-Changed-From-To: freebsd-bugs->freebsd-net
Responsible-Changed-By: linimon
Responsible-Changed-When: Thu Oct 31 02:43:11 UTC 2013
Responsible-Changed-Why: 
Over to maintainer(s).

http://www.freebsd.org/cgi/query-pr.cgi?pr=183390

From owner-freebsd-net@FreeBSD.ORG  Thu Oct 31 02:46:10 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@smarthost.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 65E9B6FB;
 Thu, 31 Oct 2013 02:46:10 +0000 (UTC)
 (envelope-from linimon@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 3A7B12709;
 Thu, 31 Oct 2013 02:46:10 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r9V2kAPF092779;
 Thu, 31 Oct 2013 02:46:10 GMT
 (envelope-from linimon@freefall.freebsd.org)
Received: (from linimon@localhost)
 by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r9V2kADn092778;
 Thu, 31 Oct 2013 02:46:10 GMT (envelope-from linimon)
Date: Thu, 31 Oct 2013 02:46:10 GMT
Message-Id: <201310310246.r9V2kADn092778@freefall.freebsd.org>
To: linimon@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-net@FreeBSD.org
From: linimon@FreeBSD.org
Subject: Re: kern/183391: [ixgbe] 10gigabit networking problems with Emulex
 OCE 11102 CNA
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Oct 2013 02:46:10 -0000

Old Synopsis: 10gigabit networking problems with Emulex OCE 11102 CNA
New Synopsis: [ixgbe] 10gigabit networking problems with Emulex OCE 11102 CNA

Responsible-Changed-From-To: freebsd-bugs->freebsd-net
Responsible-Changed-By: linimon
Responsible-Changed-When: Thu Oct 31 02:45:10 UTC 2013
Responsible-Changed-Why: 

Over to maintainer(s).

http://www.freebsd.org/cgi/query-pr.cgi?pr=183391

From owner-freebsd-net@FreeBSD.ORG  Thu Oct 31 07:41:01 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 7756F31B
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 07:41:01 +0000 (UTC)
 (envelope-from s.khanchi@gmail.com)
Received: from mail-wg0-x22b.google.com (mail-wg0-x22b.google.com
 [IPv6:2a00:1450:400c:c00::22b])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id F377B265E
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 07:41:00 +0000 (UTC)
Received: by mail-wg0-f43.google.com with SMTP id b13so2347475wgh.10
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 00:40:59 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:from:date:message-id:subject:to:content-type;
 bh=1lT3QG3BPmFIjAQw9yhGT3wvVMnR4AcC+e/raZ5wPMk=;
 b=QdFYg5yetQl6S3jTc1YYdBNwC0TPZbOOjwFZlqLS+1PLJ+GBVEnmHLVDqX1m0N6U5T
 2DcbQgYUucO9j1XcR/LBsZ93R7NHbgfBCcPu95D85b5nRi+30pWnaBJlI87o99tr/Jwm
 0uZd+7ef/yjOX4Yi34QJxL7tlg0SL4BTAi8oQSKLeu7Mw1CTwU+MWuLgkCZG1puwejgv
 WK75Ffb0Vmq9VNqsVVKJ3VoZ+BaofccQsYRFpXVf6JDbQ52CP/h7EgpEOy1q5XvUKBvF
 XaFNlgGHf141vNwzFUdYY+Z1cLwmXnfY6YrYWC2bbKmMD3LQ4PvAvooQP8KMtbtZJ+QL
 /atw==
X-Received: by 10.194.250.6 with SMTP id yy6mr1392705wjc.13.1383205259515;
 Thu, 31 Oct 2013 00:40:59 -0700 (PDT)
MIME-Version: 1.0
Sender: s.khanchi@gmail.com
Received: by 10.194.119.73 with HTTP; Thu, 31 Oct 2013 00:40:39 -0700 (PDT)
From: h bagade <bagadeh@gmail.com>
Date: Thu, 31 Oct 2013 11:10:39 +0330
X-Google-Sender-Auth: 32HW-scoFN_V6fdWFkTWzY7HZCs
Message-ID: <CAARSjE39dDLMJfEZexZQ=YGHhCNR69vKzPAB6ojYdJSZopysGQ@mail.gmail.com>
Subject: Errors on running kipfw with vale switches
To: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Oct 2013 07:41:01 -0000

Hi all,

I want to run userland ipfw with netmap support (kipfw). When I
follow the example to test kipfw, the following command fails with an
error:

# connect the firewall to two vale switches
./kipfw valeA:f valeB:f &

command output:
root@zharf-bsd-9:/ipfw-user # ./kipfw valeA:f valeB:f &
[1] 2278

[  10.971878] missing.c:callout_startup [356] start
init_children mod_idx value 9
+++ start module 0 ipfw ipfw at 0x61dc60 order 0x1
+++ start module 1 sy_ipfw SYSINIT at 0x0 order 0x2
ipfw2 initialized, divert loadable, nat loadable, rule-based
forwarding disabled, default to accept, logging disabled

+++ start module 2 sy_Vnet_ipfw SYSINIT at 0x0 order 0x3
[  10.971944] missing.c:callout_init [303] c 0x61e380 mpsafe 8
[  10.971949] missing.c:pfil_head_get [86] called
[  10.971952] missing.c:pfil_add_hook [93] called

+++ start module 3 dummynet dummynet at 0x61dca0 order 0x4
DUMMYNET 0x0 with IPv6 initialized (100409)
[  10.971966] missing.c:taskqueue_create [422] start dummynet fn
0x414ba0 ctx 0x61e400
[  10.971970] missing.c:taskqueue_start_threads [430] tqp 0x61e400
count 1 (dummy)

[  10.971973] missing.c:callout_init [303] c 0x61e4a0 mpsafe 8
+++ start module 4 dn_fifo dn_fifo at 0x61dcf0 order 0x5
[  10.971982] ip_dummynet.c:load_dn_sched [2250] dn_sched FIFO loaded
+++ start module 5 dn_wf2qp dn_wf2qp at 0x61ddd0 order 0x6

[  10.971989] ip_dummynet.c:load_dn_sched [2250] dn_sched WF2Q+ loaded
+++ start module 6 dn_rr dn_rr at 0x61deb0 order 0x7
[  10.971995] ip_dummynet.c:load_dn_sched [2250] dn_sched RR loaded
+++ start module 7 dn_qfq dn_qfq at 0x61df90 order 0x8

[  10.972000] ip_dummynet.c:load_dn_sched [2250] dn_sched QFQ loaded
+++ start module 8 dn_prio dn_prio at 0x61e070 order 0x9
[  10.972005] ip_dummynet.c:load_dn_sched [2250] dn_sched PRIO loaded
*** Global Sysctl Table entries = 39, total size = 2052 ***

[  10.972055] session.c:do_server  [531] +++ listening tcp 127.0.0.1:5555
[  10.972065] netmap_io.c:netmap_add_port [272] opening netmap device valeA:f
netmap_open [131] /dev/netmap opened ok

netmap_open [139] cannot get info on valeA:f, errno 6 ver 3
[  10.972098] netmap_io.c:netmap_add_port [283] error opening valeA:f
[  10.972103] netmap_io.c:netmap_add_port [272] opening netmap device valeB:f
netmap_open [131] /dev/netmap opened ok

netmap_open [139] cannot get info on valeB:f, errno 6 ver 3
[  13.019760] netmap_io.c:netmap_add_port [283] error opening valeB:f
[  13.019779] session.c:do_server  [531] +++ listening tcp 127.0.0.1:5556

[  13.021023] missing.c:callout_run [373] running 0x61e4a0 due at 1 now 2049
[  13.021035] missing.c:callout_run [373] running 0x61e380 due at 1000 now 2049

I am running the firewall on FreeBSD 9.2-STABLE.
It seems that there is some problem with VALE, but I don't know what it
is! Is it possible that my netmap module doesn't support VALE?
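
For reference, the numeric errno in the netmap_open lines above can be
decoded with a few lines of Python run on the same machine. This is only
a convenience sketch; describe_errno is a name made up here. On FreeBSD
errno 6 is ENXIO ("Device not configured"), which would be consistent
with the kernel not recognizing the valeA:f port name, though it does not
prove that on its own.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value as
    defined on the platform where this runs."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    name, message = describe_errno(6)
    print(name, "-", message)
```

The message text differs between operating systems, so it should be
checked on the FreeBSD box itself rather than on another host.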

From owner-freebsd-net@FreeBSD.ORG  Thu Oct 31 16:56:00 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 6CAC7356
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 16:56:00 +0000 (UTC)
 (envelope-from CGuadall@nexica.com)
Received: from relay3.mail.nexica.com (relay3.mail.nexica.com [217.13.116.92])
 by mx1.freebsd.org (Postfix) with ESMTP id CD40A2ED7
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 16:55:59 +0000 (UTC)
Received: from relay3.mail.nexica.com (zeus02nex.noc.nexica.com [10.2.0.151])
 by batchmail3.noc.nexica.com (Postfix) with ESMTP id CD370DD3D3
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 17:11:06 +0100 (CET)
Received: from cl3-smtp.mail.nexica.com (zeus02nex.noc.nexica.com [10.2.0.151])
 by relay3.noc.nexica.com (Postfix) with ESMTP id 174A2B4522
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 17:11:00 +0100 (CET)
Received: from vnxbcnex02.bcn.nexica.com (unknown [212.92.38.69])
 (using TLSv1 with cipher RC4-MD5 (128/128 bits))
 (No client certificate requested)
 (Authenticated sender: relay.nexica.com)
 by cl3-smtp.mail.nexica.com (Postfix) with ESMTP id 0D23712B685
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 17:11:00 +0100 (CET)
Received: from vnxbcnex01.bcn.nexica.com (192.168.1.158) by
 vnxbcnex02.bcn.nexica.com (192.168.1.159) with Microsoft SMTP Server (TLS) id
 8.1.436.0; Thu, 31 Oct 2013 17:10:59 +0100
Received: from vnxbcnex01.bcn.nexica.com ([172.16.30.68]) by
 vnxbcnex01.bcn.nexica.com ([172.16.30.68]) with mapi; Thu, 31 Oct 2013
 17:10:59 +0100
From: Carles Guadall <CGuadall@nexica.com>
To: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Date: Thu, 31 Oct 2013 17:10:57 +0100
Subject: LACP+VLAN with 10G NIC not working
Thread-Topic: LACP+VLAN with 10G NIC not working
Thread-Index: Ac7WU4XKuc41LagWQKWzRhhNFTXOLg==
Message-ID: <7A75BE7326F9D34D83FDF03CA8B02155174F56A242@vnxbcnex01.bcn.nexica.com>
Accept-Language: ca-ES, es-ES
Content-Language: ca-ES
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
acceptlanguage: ca-ES, es-ES
X-TM-AS-Product-Ver: SMEX-10.2.0.2087-7.000.1014-20258.003
X-TM-AS-Result: No--3.015000-8.000000-31
X-TM-AS-User-Approved-Sender: No
X-TM-AS-User-Blocked-Sender: No
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Oct 2013 16:56:00 -0000

I configured an LACP aggregate (lagg0) with two Intel 10G NICs. Later I created 2 VLANs over lagg0.

Pinging from the host's vlan interfaces to any other host doesn't work.
I ran tcpdump on each interface, ix[0|1], lagg0 and vlan[52|908].

- On each physical interface I see packets coming from the network. I can see mainly broadcasts, correctly tagged, etc.
- On the lagg0 interface I also see packets coming from the network. When running tcpdump on vlanXX I can only see ARP requests from localhost.

It seems packets don't "flow" between the vlan interfaces and lagg0.

Inbound
( network ) --> [ix0] --> [lagg0] ---X---> [vlan52]
( network ) --> [ix0] --> [lagg0] ---X---> [vlan52]

Outbound

( network ) <-- [ix0] <-- [lagg0] <---X--- [vlan52]
( network ) <-- [ix0] <-- [lagg0] <---X--- [vlan52]

Any idea what's wrong??

System info

# uname -a
FreeBSD XXX-hostname-XXX 9.1-STABLE FreeBSD 9.1-STABLE #0 r+16f6355: Tue Aug 27 00:38:40 PDT 2013     root@build.ixsystems.com:/tank/home/jkh/src/freenas/os-base/amd64/tank/home/jkh/src/freenas/FreeBSD/src/sys/FREENAS.amd64  amd64

# sysctl kern.osreldate
kern.osreldate: 901505

# dmesg |grep -i intel
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.7 - STABLE/9> port 0xbc00-0xbc1f mem 0xf9f80000-0xf9ffffff,0xf9f7c000-0xf9f7ffff irq 16 at device 0.0 on pci1
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.7 - STABLE/9> port 0xb880-0xb89f mem 0xf9e80000-0xf9efffff,0xf9e7c000-0xf9e7ffff irq 17 at device 0.1 on pci1

# pciconf -lv | grep -B3 network
ix0@pci0:1:0:0: class=0x020000 card=0x061115d9 chip=0x10fb8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599EB 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
--
ix1@pci0:1:0:1: class=0x020000 card=0x061115d9 chip=0x10fb8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599EB 10-Gigabit SFI/SFP+ Network Connection'
    class      = network


# ifconfig
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO>
        ether 00:25:90:c3:da:82
        inet6 fe80::225:90ff:fec3:da82%ix0 prefixlen 64 scopeid 0x1
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (autoselect <full-duplex>)
        status: active
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO>
        ether 00:25:90:c3:da:82
        inet6 fe80::225:90ff:fec3:da83%ix1 prefixlen 64 scopeid 0x2
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (autoselect <full-duplex>)
        status: active

lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO>
        ether 00:25:90:c3:da:82
        inet 192.168.100.100 netmask 0xffffff00 broadcast 192.168.100.255
        inet6 fe80::225:90ff:fec3:da82%lagg0 prefixlen 64 scopeid 0x9
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        laggproto lacp lagghash l2,l3,l4
        laggport: ix1 flags=18<COLLECTING,DISTRIBUTING>
        laggport: ix0 flags=18<COLLECTING,DISTRIBUTING>
vlan52: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=303<RXCSUM,TXCSUM,TSO4,TSO6>
        ether 00:25:90:c3:da:82
        inet 10.52.0.9 netmask 0xffffff00 broadcast 10.52.0.255
        inet6 fe80::225:90ff:fec3:da82%vlan52 prefixlen 64 scopeid 0xa
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        vlan: 52 parent interface: lagg0
vlan908: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=303<RXCSUM,TXCSUM,TSO4,TSO6>
        ether 00:25:90:c3:da:82
        inet 10.21.0.9 netmask 0xffffff00 broadcast 10.21.0.255
        inet6 fe80::225:90ff:fec3:da82%vlan908 prefixlen 64 scopeid 0xb
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        vlan: 908 parent interface: lagg0

Thank you

Carles Guadall

From owner-freebsd-net@FreeBSD.ORG  Thu Oct 31 18:07:19 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id DA92AFC1
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 18:07:19 +0000 (UTC)
 (envelope-from luigi@onelab2.iet.unipi.it)
Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238])
 by mx1.freebsd.org (Postfix) with ESMTP id 9D3272500
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 18:07:19 +0000 (UTC)
Received: by onelab2.iet.unipi.it (Postfix, from userid 275)
 id 0AAB57300A; Thu, 31 Oct 2013 19:09:07 +0100 (CET)
Date: Thu, 31 Oct 2013 19:09:07 +0100
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: h bagade <bagadeh@gmail.com>
Subject: Re: Errors on running kipfw with vale switches
Message-ID: <20131031180907.GB62132@onelab2.iet.unipi.it>
References: <CAARSjE39dDLMJfEZexZQ=YGHhCNR69vKzPAB6ojYdJSZopysGQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAARSjE39dDLMJfEZexZQ=YGHhCNR69vKzPAB6ojYdJSZopysGQ@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Oct 2013 18:07:19 -0000

On Thu, Oct 31, 2013 at 11:10:39AM +0330, h bagade wrote:
> Hi all,
> 
> I want to run userland ipfw with netmap support(kipfw). When I try to
> follow the example to test kipfw, it encounters an error on following
> command:

I suspect that stable/9 has an old version of the netmap code,
so the argument to the ioctl fails.
In fact, I don't even remember if the code in stable/9
supports VALE.

Please wait for a few days, we are going to push a newer
version of netmap to both HEAD and stable/9 soon.

cheers
luigi

From owner-freebsd-net@FreeBSD.ORG  Thu Oct 31 18:08:31 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id DA83317B
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 18:08:31 +0000 (UTC)
 (envelope-from raitech@gmail.com)
Received: from mail-pd0-x231.google.com (mail-pd0-x231.google.com
 [IPv6:2607:f8b0:400e:c02::231])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id B6F102527
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 18:08:31 +0000 (UTC)
Received: by mail-pd0-f177.google.com with SMTP id p10so2730704pdj.22
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 11:08:31 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:from:date:message-id:subject:to:content-type;
 bh=A3SBVrP6kwfWgTI7CboON3o+ndvUScOUYBNH8UGYAis=;
 b=NRM9f1TM7xrd1qKEFeZia8UVx3cYUMagFMHSX+APNpjxybVSaBx/w/SYNGfWrqKc9D
 dO7f8uaTT2ZmAnzuEfYi9SdgkkewdieMlKLGpGJnIWiG5cNAFuj6z+5WgV/EqAGQpTAD
 VbLr5Kwb2JVMwSwnedy5Gh5inlEdf58Tt28olszTpPDBWwiZd3lfGNdcpMplC/Ow/no9
 CJbBc96s8uGA5vOw/YSfOOI3ZXlQQmeWkg76bXXEw0Efk7UK+5Sxu+/XQZCDesDPkBtP
 QaEynvjIAtZlHuRDoStxonU1rcT2xNyextvtdXN+fG6wD5BmC7AP4WfGpeINvWpnd4UD
 +vuA==
X-Received: by 10.68.164.165 with SMTP id yr5mr3240711pbb.146.1383242911387;
 Thu, 31 Oct 2013 11:08:31 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.70.101.70 with HTTP; Thu, 31 Oct 2013 11:08:11 -0700 (PDT)
From: Raimundo Santos <raitech@gmail.com>
Date: Thu, 31 Oct 2013 16:08:11 -0200
Message-ID: <CAGQ6iC8MAA3eHywhzvikB4n-Q6igTJ+PTkCCSYMfFLNUKKD6Hg@mail.gmail.com>
Subject: MPD PPTP seting 0 on net.inet.ip.forwarding
To: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Oct 2013 18:08:31 -0000

Hello!

I was experimenting with

set ipcp ranges 0.0.0.0 172.16.1.20

to see if I had understood the concepts in the MPD 5.7 docs correctly, but when I try to
connect to the PPTP server with 0.0.0.0 as the local address,
net.inet.ip.forwarding is set to 0 and the PPP link does not come up.

But changing it to

set ipcp ranges 172.16.1.19 172.16.1.20

net.inet.ip.forwarding still goes to 0, although this time the PPP link
connects.

And when using the ippool example from mpd.conf.sample, changing only the IPs to
match my network, the same strange thing happens.

What strange behaviour. I am using MPD 5.7 on FreeBSD 9.2-RELEASE.

What could be wrong?

Here is my mpd.conf:

startup:
        # configure mpd users
        set user foo bar admin
        set user foo1 bar1
        # configure the console
        set console self 127.0.0.1 5005
        set console open
        # configure the web server
        set web self 0.0.0.0 5006
        set web open

default:
    load pptp_server

pptp_server:

        set ippool add pool1 172.16.1.20 172.16.1.100

        create bundle template B
        set iface enable proxy-arp
        set iface idle 1800
        set iface enable tcpmssfix
        set ipcp yes vjcomp
        set ipcp ranges 172.16.1.19/32 ippool pool1
        #set ipcp dns 192.168.1.3
        #set ipcp nbns 192.168.1.4
        set bundle enable compression
        set ccp yes mppc
        set mppc yes e40
        set mppc yes e128
        set mppc yes stateless

        create link template L pptp
        set link action bundle B
        set link enable multilink
        set link yes acfcomp protocomp
        set link no pap chap eap
        set link enable chap
        set link keep-alive 10 60
        set link mtu 1460
        set pptp self 192.168.0.2
        set link enable incoming
    log +all

And here is my rc.conf:

hostname="rtcprime"
ifconfig_alc0=" inet 192.168.0.2 netmask 255.255.255.0"
defaultrouter="192.168.0.1"
sshd_enable="YES"
ntpd_enable="YES"
powerd_enable="YES"
dumpdev="AUTO"

zfs_enable="YES"
noip_enable="YES"
samba_enable="YES"
mpd_enable="YES"

As you can see, there is no gateway_enable="YES", but there is
net.inet.ip.forwarding=1 in /etc/sysctl.conf

Thank you for your attention.
Raimundo Santos

From owner-freebsd-net@FreeBSD.ORG  Thu Oct 31 18:39:54 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id C617FBA
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 18:39:54 +0000 (UTC)
 (envelope-from raitech@gmail.com)
Received: from mail-pa0-x236.google.com (mail-pa0-x236.google.com
 [IPv6:2607:f8b0:400e:c03::236])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id A08442756
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 18:39:54 +0000 (UTC)
Received: by mail-pa0-f54.google.com with SMTP id fa1so2974791pad.13
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 11:39:54 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:in-reply-to:references:from:date:message-id:subject:to
 :content-type; bh=D/TXGFZfJwut2mbSh3PXedyv8nuHm7UH0uo97f3R6Cg=;
 b=bjLOMp2VqxS+nO/lWCHLEBujjIY9e6vjgdfcSZ5kOWjye1YVU5HznbUD+TxvPQEQNH
 +PqXEGaioiEc3uW8N026W0taZ6Gu91jIB0BNL0j3OMo5nBZANCUl6ZR2CzpfKW32AC7+
 TZ4vCVd6V9c1oMIM/LazqkaFX6s5cqeCKBNOtMe+NtUy2pdP0RvDxGBfQ9fJWLhmDhc6
 rCgzhZr/dsX5kNNQyJr2hLFD0VUPIAqoUAA4aqrzyoifTxcTZhcfdVK5XTrHSFIgRQEC
 QGtHTSoeiBeZEZfmHIXbYO8eA1g7RFKL7l1QVyOG6ksGLv2y2PwVg3yJy9CvO0ZeX83U
 00og==
X-Received: by 10.67.30.100 with SMTP id kd4mr5422876pad.24.1383244794107;
 Thu, 31 Oct 2013 11:39:54 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.70.101.70 with HTTP; Thu, 31 Oct 2013 11:39:33 -0700 (PDT)
In-Reply-To: <CAGQ6iC8MAA3eHywhzvikB4n-Q6igTJ+PTkCCSYMfFLNUKKD6Hg@mail.gmail.com>
References: <CAGQ6iC8MAA3eHywhzvikB4n-Q6igTJ+PTkCCSYMfFLNUKKD6Hg@mail.gmail.com>
From: Raimundo Santos <raitech@gmail.com>
Date: Thu, 31 Oct 2013 16:39:33 -0200
Message-ID: <CAGQ6iC_rK-j1n8eiWdB2AiD4i5+vQd97c0TNGCrSosYsLoq4WQ@mail.gmail.com>
Subject: Re: MPD PPTP seting 0 on net.inet.ip.forwarding
To: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Oct 2013 18:39:54 -0000

OK, I have found something weird:


On 31 October 2013 16:08, Raimundo Santos <raitech@gmail.com> wrote:

>
>
> As you can see, there is no gateway_enable="YES", but there is
> net.inet.ip.forwarding=1 in /etc/sysctl.conf
>
>
MPD does not respect my configuration in sysctl.conf, only the one in
rc.conf. To test:

* put net.inet.ip.forwarding and net.inet6.ip6.forwarding = 1 in sysctl.conf
* put gateway_enable="YES" in rc.conf
* connect to PPTP server

You will see that, after the PPTP connection is established,
net.inet.ip.forwarding remains 1, but net.inet6.ip6.forwarding goes to 0!

Is that behaviour expected?

Am I wrong to set up a router without gateway_enable="YES" in rc.conf
but with net.inet.ip.forwarding=1 in sysctl.conf?

Thank you,
Raimundo Santos

From owner-freebsd-net@FreeBSD.ORG  Thu Oct 31 19:03:14 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 36A42C87
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 19:03:14 +0000 (UTC)
 (envelope-from egrosbein@rdtc.ru)
Received: from eg.sd.rdtc.ru (eg.sd.rdtc.ru [IPv6:2a03:3100:c:13::5])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 7F22F28F2
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 19:03:13 +0000 (UTC)
X-Envelope-From: egrosbein@rdtc.ru
X-Envelope-To: freebsd-net@freebsd.org
Received: from eg.sd.rdtc.ru (eugen@localhost [127.0.0.1])
 by eg.sd.rdtc.ru (8.14.7/8.14.7) with ESMTP id r9VJ347D046501;
 Fri, 1 Nov 2013 02:03:04 +0700 (NOVT)
 (envelope-from egrosbein@rdtc.ru)
Message-ID: <5272A968.2050205@rdtc.ru>
Date: Fri, 01 Nov 2013 02:03:04 +0700
From: Eugene Grosbein <egrosbein@rdtc.ru>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:17.0) Gecko/20130415 Thunderbird/17.0.5
MIME-Version: 1.0
To: Raimundo Santos <raitech@gmail.com>
Subject: Re: MPD PPTP seting 0 on net.inet.ip.forwarding
References: <CAGQ6iC8MAA3eHywhzvikB4n-Q6igTJ+PTkCCSYMfFLNUKKD6Hg@mail.gmail.com>
 <CAGQ6iC_rK-j1n8eiWdB2AiD4i5+vQd97c0TNGCrSosYsLoq4WQ@mail.gmail.com>
In-Reply-To: <CAGQ6iC_rK-j1n8eiWdB2AiD4i5+vQd97c0TNGCrSosYsLoq4WQ@mail.gmail.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00
 autolearn=ham version=3.3.2
X-Spam-Report: * -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP
 * -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1%
 *      [score: 0.0000]
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eg.sd.rdtc.ru
Cc: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Oct 2013 19:03:14 -0000

On 01.11.2013 01:39, Raimundo Santos wrote:
> OK, I have found something weird:
> 
> 
> On 31 October 2013 16:08, Raimundo Santos <raitech@gmail.com> wrote:
> 
>>
>>
>> As you can see, there is no gateway_enable="YES", but there is
>> net.inet.ip.forwarding=1 in /etc/sysctl.conf
>>
>>
> MPD does not respect my configuration in sysctl.conf, only the one in
> rc.conf. To test:
> 
> * put net.inet.ip.forwarding and net.inet6.ip6.forwarding = 1 in sysctl.conf
> * put gateway_enable="YES" in rc.conf
> * connect to PPTP server
> 
> You will see that, after the PPTP connection is established,
> net.inet.ip.forwarding remains 1, but net.inet6.ip6.forwarding goes to 0!
> 
> Is that behaviour expected?
> 
> Am I wrong to set up a router without gateway_enable="YES" in rc.conf
> but with net.inet.ip.forwarding=1 in sysctl.conf?

That's not MPD's fault. That's FreeBSD 9.2's devd starting
/etc/pccard_ether $subsystem start
every time an interface is created. This in turn runs
/etc/rc.d/netif quietstart $ifn

netif does LOTS of things, putting a severe (and, for mpd, unneeded) load on the system
and resetting net.inet.ip.forwarding to 0 if you don't have gateway_enable="YES"
in your /etc/rc.conf.

I don't need devd so I just disabled it in rc.conf with devd_enable="NO".
If you need it, just switch from sysctls to:

gateway_enable="YES"
ipv6_gateway_enable="YES"

To me this looks like a regression from 9.1 behavior for busy mpd-based BRASes,
as performance of the box drops significantly due to the extra work performed
by devd and its scripts.
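
To check whether a box is exposed to the reset described above, one can test
whether the rc.conf knobs are set. The following is a minimal sketch, not part
of the original message: rc_knob_enabled is a hypothetical helper name, it only
inspects the single file it is given (it does not merge /etc/defaults/rc.conf
or rc.conf.local), and the list of accepted "yes" values is an assumption
modelled on what rc.subr's checkyesno is understood to accept.

```sh
#!/bin/sh
# rc_knob_enabled FILE KNOB
# Hypothetical helper: succeed (exit 0) if KNOB is set to a "yes" value in
# FILE, fail with 1 if it is unset or "no", and with 2 if FILE is unreadable.
# rc.conf-style files are plain sh variable assignments, so the file is
# sourced in a subshell to keep the caller's environment untouched.
# Only use this on files you trust: sourcing executes their contents.
rc_knob_enabled() {
    (
        [ -r "$1" ] || exit 2
        . "$1" >/dev/null 2>&1 || exit 2
        # Fetch the value of the variable whose name is in $2, default NO.
        eval "value=\${$2:-NO}"
        case "$value" in
            [Yy][Ee][Ss]|[Tt][Rr][Uu][Ee]|[Oo][Nn]|1) exit 0 ;;
            *) exit 1 ;;
        esac
    )
}
```

For example, "rc_knob_enabled /etc/rc.conf gateway_enable" and
"rc_knob_enabled /etc/rc.conf ipv6_gateway_enable" would both need to succeed
before relying on forwarding staying enabled across interface creation.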


From owner-freebsd-net@FreeBSD.ORG  Thu Oct 31 20:57:43 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 61D1F393
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 20:57:43 +0000 (UTC)
 (envelope-from raitech@gmail.com)
Received: from mail-pa0-x233.google.com (mail-pa0-x233.google.com
 [IPv6:2607:f8b0:400e:c03::233])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 3B3A62200
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 20:57:43 +0000 (UTC)
Received: by mail-pa0-f51.google.com with SMTP id ld10so3063073pab.38
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 13:57:42 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:in-reply-to:references:from:date:message-id:subject:to
 :cc:content-type;
 bh=Lu0wue6EFj0Bm3Qd25OWc50JSdRtClsYa4ikXtjjPoo=;
 b=iR+P1dIr6XhbTdG9oZdg/A60Uq0+W2cil/GmdczyCbrner0Jlf8Dodndug+JNr1A2u
 7bPLZolOFgE+bk3BjRu3KAqRBCBpuXBKr3sePJAfoZh1wgz6R7H3BGSYcNGAA+Go8Hot
 l2koG5CcMx44pCpL8nD4vEbIcrTeq/3xf61d5qr7AEIk/vmoJ8HFz8zTYkqGTsDWq5II
 faoItqKkZSAZtconyVQxF1ZmWyl6d/QUWnb93R3xbR1AVBjrIH1D24uvdu7OBoyO10a0
 AGedAODgdXGYMnJlL3CXKYeatEA7fih1qeE3DoLnrThLACeCuNSG6dZt99yjvSCsFxme
 32Nw==
X-Received: by 10.68.254.231 with SMTP id al7mr3858603pbd.158.1383253062795;
 Thu, 31 Oct 2013 13:57:42 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.70.101.70 with HTTP; Thu, 31 Oct 2013 13:57:22 -0700 (PDT)
In-Reply-To: <5272A968.2050205@rdtc.ru>
References: <CAGQ6iC8MAA3eHywhzvikB4n-Q6igTJ+PTkCCSYMfFLNUKKD6Hg@mail.gmail.com>
 <CAGQ6iC_rK-j1n8eiWdB2AiD4i5+vQd97c0TNGCrSosYsLoq4WQ@mail.gmail.com>
 <5272A968.2050205@rdtc.ru>
From: Raimundo Santos <raitech@gmail.com>
Date: Thu, 31 Oct 2013 18:57:22 -0200
Message-ID: <CAGQ6iC8h_rBDYGNA9cpsx-Lyvf4xnoHBXq6H_eVOvSMGSTeYhA@mail.gmail.com>
Subject: Re: MPD PPTP seting 0 on net.inet.ip.forwarding
To: Eugene Grosbein <egrosbein@rdtc.ru>
Content-Type: text/plain; charset=KOI8-R
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Oct 2013 20:57:43 -0000

On 31 October 2013 17:03, Eugene Grosbein <egrosbein@rdtc.ru> wrote:
>
> That's not MPD's fault. That's FreeBSD 9.2's devd starting
> /etc/pccard_ether $subsystem start
> every time an interface is created. This in turn runs
> /etc/rc.d/netif quietstart $ifn
>
> netif does LOTS of things, putting a severe (and, for mpd, unneeded) load on the system
> and resetting net.inet.ip.forwarding to 0 if you don't have gateway_enable="YES"
> in your /etc/rc.conf.
>

Good to know. It is not a problem for me right now, but I will keep an eye on
it.

> I don't need devd so I just disabled it in rc.conf with devd_enable="NO".
> If you need it, just switch from sysctls to:
>
> gateway_enable="YES"
> ipv6_gateway_enable="YES"
>

Yes, that was the solution that worked. I needed a quick and dirty VPN and
ended up taking down my customer's network! But it's okay now, as I am such a good
sysadmin - heee...

Thank you, Eugene!

> To me this looks like a regression from 9.1 behavior for busy mpd-based BRASes,
> as performance of the box drops significantly due to the extra work performed
> by devd and its scripts.
>
>
>

From owner-freebsd-net@FreeBSD.ORG  Thu Oct 31 21:58:11 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id E9BBE647
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 21:58:11 +0000 (UTC)
 (envelope-from ole.myhre@dataoppdrag.no)
Received: from mail2.dataoppdrag.no (mail2.dataoppdrag.no
 [IPv6:2a02:f58:7:2::2])
 by mx1.freebsd.org (Postfix) with ESMTP id A355E2683
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 21:58:11 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
 by mail2.dataoppdrag.no (Postfix) with ESMTP id DCA234058C
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 22:58:09 +0100 (CET)
Received: from mail2.dataoppdrag.no ([127.0.0.1])
 by localhost (mail2.dataoppdrag.no [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id SsO7wRr9ubbP for <freebsd-net@freebsd.org>;
 Thu, 31 Oct 2013 22:58:09 +0100 (CET)
Received: from EX-MBX02.cust-d1.dataoppdrag.no
 (ex-mbx02.cust-d1.dataoppdrag.no [IPv6:2a02:f58:0:313:b898:7b82:13e0:c3bd])
 by mail2.dataoppdrag.no (Postfix) with ESMTPS id B9DC340442
 for <freebsd-net@freebsd.org>; Thu, 31 Oct 2013 22:58:09 +0100 (CET)
Received: from EX-MBX01.cust-d1.dataoppdrag.no ([fe80::6db0:e393:6a07:457]) by
 EX-MBX02.cust-d1.dataoppdrag.no ([fe80::b898:7b82:13e0:c3bd%11])
 with mapi id 14.02.0342.003; Thu, 31 Oct 2013 22:58:09 +0100
From: Ole Myhre <ole.myhre@dataoppdrag.no>
To: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject: carp on 10.0 and ipv6 network route
Thread-Topic: carp on 10.0 and ipv6 network route
Thread-Index: Ac7WhEWGxJMHfReCSm+wVWEHkUtYuw==
Date: Thu, 31 Oct 2013 21:58:08 +0000
Message-ID: <C5A69C67E0D032469281F68C7CFD6AA009E3F3FB@EX-MBX01.cust-d1.dataoppdrag.no>
Accept-Language: en-US, nb-NO
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [172.20.20.26]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Oct 2013 21:58:12 -0000

Hi,

I'm testing carp on 10.0-BETA2, and there seems to be different
behaviour with the network route between IPv4 and IPv6 when using carp
on interfaces.

IPv4 routes are not present in the routing table when the interface is
in BACKUP state (as expected), but IPv6 routes are present in the
routing table in both BACKUP and MASTER state. This causes some issues
with routing daemons as the network route is announced to other
routers from both machines running carp.

[root@rtr1 ~]# ifconfig em2 vhid 1 192.168.0.1/24

[root@rtr2 ~]# ifconfig em2 vhid 1 192.168.0.1/24

[root@rtr1 ~]# ifconfig em2 | grep carp
        carp: MASTER vhid 1 advbase 1 advskew 0
[root@rtr1 ~]# netstat -rn | grep 192.168.0.0
192.168.0.0/24     link#3             U           0        0    em2
[root@rtr1 ~]#

[root@rtr2 ~]# ifconfig em2 | grep carp
        carp: BACKUP vhid 1 advbase 1 advskew 0
[root@rtr2 ~]# netstat -rn | grep 192.168.0.0
[root@rtr2 ~]#

[root@rtr1 ~]# ifconfig em2 inet6 2001:db8::1/64 vhid 1

[root@rtr2 ~]# ifconfig em2 inet6 2001:db8::1/64 vhid 1

[root@rtr1 ~]# ifconfig em2 | grep carp
        carp: MASTER vhid 1 advbase 1 advskew 0
[root@rtr1 ~]# netstat -rn | grep 2001:db8::/64
2001:db8::/64                     link#3                        U           em2
[root@rtr1 ~]#

[root@rtr2 ~]# ifconfig em2 | grep carp
        carp: BACKUP vhid 1 advbase 1 advskew 0
[root@rtr2 ~]# netstat -rn | grep 2001:db8::/64
2001:db8::/64                     link#3                        U           em2
[root@rtr2 ~]#
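
The comparison above can be scripted so it is easy to repeat on both routers.
This is only a sketch under stated assumptions: has_route is a hypothetical
helper name, and it matches the destination column exactly rather than using
grep, where the dots in 192.168.0.0 would act as regex wildcards.

```sh
#!/bin/sh
# has_route PREFIX
# Hypothetical helper: read `netstat -rn` output on stdin and succeed if a
# route whose destination (first column) equals PREFIX exactly is present.
has_route() {
    awk -v dst="$1" '$1 == dst { found = 1 } END { exit !found }'
}
```

On each router one could then run, for instance,
"netstat -rn -f inet6 | has_route 2001:db8::/64 && echo present" and compare
the result with the carp state reported by ifconfig.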

Thanks,
Ole

From owner-freebsd-net@FreeBSD.ORG  Fri Nov  1 12:47:23 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 8CC43909;
 Fri,  1 Nov 2013 12:47:23 +0000 (UTC)
 (envelope-from glebius@FreeBSD.org)
Received: from cell.glebius.int.ru (glebius.int.ru [81.19.69.10])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id CFBFF293F;
 Fri,  1 Nov 2013 12:47:22 +0000 (UTC)
Received: from cell.glebius.int.ru (localhost [127.0.0.1])
 by cell.glebius.int.ru (8.14.7/8.14.7) with ESMTP id rA1ClKJ5065572;
 Fri, 1 Nov 2013 16:47:20 +0400 (MSK)
 (envelope-from glebius@FreeBSD.org)
Received: (from glebius@localhost)
 by cell.glebius.int.ru (8.14.7/8.14.7/Submit) id rA1ClKg5065571;
 Fri, 1 Nov 2013 16:47:20 +0400 (MSK)
 (envelope-from glebius@FreeBSD.org)
X-Authentication-Warning: cell.glebius.int.ru: glebius set sender to
 glebius@FreeBSD.org using -f
Date: Fri, 1 Nov 2013 16:47:20 +0400
From: Gleb Smirnoff <glebius@FreeBSD.org>
To: net@FreeBSD.org, current@FreeBSD.org
Subject: [CFT & review] new in_control()
Message-ID: <20131101124720.GF52889@FreeBSD.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="SvF6CGw9fzJC4Rcx"
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Nov 2013 12:47:23 -0000


--SvF6CGw9fzJC4Rcx
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

  Hi!

  I've got a patch that cleans up the way we configure
and delete IPv4 on interfaces. What it does:

1) separate function for SIOCAIFADDR, with clear code
   flow from beginning to the end.
2) separate function for SIOCDIFADDR, with clear code
   flow from beginning to the end.
3) given 1) and 2), in_control() itself becomes very thin
   and clear.

The above wasn't just a cut&paste job; every
step taken was evaluated. I've cut quite a lot of strange
code, added extra sanity checking and provided comments
on the strange code that remains.

4) An sx(9) lock covers the entire SIOCAIFADDR/SIOCDIFADDR
   operation, so we close the races of ifconfig vs ifconfig,
   or ifconfig vs mpd.
   On interface detach SIOCDIFADDR is called w/o the sx(9) lock,
   but its operation is covered by IF_ADDR_LOCK().

Also, apart from the redesign of SIOCAIFADDR/SIOCDIFADDR,
the following two related changes leaked into the patch.
It is possible to separate them out, but it won't be easy.

5) Removed the useloopback conditional. Rationale:
   - the option has always been on since pre-FreeBSD times
   - the sysctl knob lives in the wrong (ethernet) namespace,
     and is documented in the wrong place (arp(8)).
   - since new-ARP, the knob was consulted on route
     addition, but was ignored on delete.
   - operation of the network stack with useloopback=0 is
     strange

   The only reason to run with useloopback=0 could be
   a router that doesn't want to pollute a large network
   with its /32 announcements. However, this can be achieved
   with filtering in routing daemons.

6) Correctly implemented the code from r201282, which tried
   to keep the localhost route in the table when multiple P2P
   interfaces with the same local address are created and
   deleted.

Checking in this code can cause problems. I could have made
mistakes, and some program that relied on the strange behavior
may pop up. Thus, early testing is appreciated.

So far I have tested simple address assignment, CARP,
and mpd5 as L2TP access concentrator.

My advice to reviewers is not to look at the diff, but to look at the
patched in.c instead.

-- 
Totus tuus, Glebius.

--SvF6CGw9fzJC4Rcx
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="in_control.diff"

Index: sys/net/if.c
===================================================================
--- sys/net/if.c	(revision 257503)
+++ sys/net/if.c	(working copy)
@@ -1525,6 +1525,25 @@ ifa_del_loopback_route(struct ifaddr *ifa, struct
 	return (error);
 }
 
+int
+ifa_switch_loopback_route(struct ifaddr *ifa, struct sockaddr *sa)
+{
+	struct rtentry *rt;
+
+	rt = rtalloc1_fib(sa, 0, 0, 0);
+	if (rt == NULL) {
+		log(LOG_DEBUG, "%s: fail", __func__);
+		return (EHOSTUNREACH);
+	}
+	((struct sockaddr_dl *)rt->rt_gateway)->sdl_type =
+	    ifa->ifa_ifp->if_type;
+	((struct sockaddr_dl *)rt->rt_gateway)->sdl_index =
+	    ifa->ifa_ifp->if_index;
+	RTFREE_LOCKED(rt);
+
+	return (0);
+}
+
 /*
  * XXX: Because sockaddr_dl has deeper structure than the sockaddr
  * structs used to represent other address families, it is necessary
Index: sys/net/if_var.h
===================================================================
--- sys/net/if_var.h	(revision 257503)
+++ sys/net/if_var.h	(working copy)
@@ -491,6 +491,7 @@ struct	ifnet *ifunit_ref(const char *);
 
 int	ifa_add_loopback_route(struct ifaddr *, struct sockaddr *);
 int	ifa_del_loopback_route(struct ifaddr *, struct sockaddr *);
+int	ifa_switch_loopback_route(struct ifaddr *, struct sockaddr *);
 
 struct	ifaddr *ifa_ifwithaddr(struct sockaddr *);
 int		ifa_ifwithaddr_check(struct sockaddr *);
Index: sys/netinet/if_ether.c
===================================================================
--- sys/netinet/if_ether.c	(revision 257503)
+++ sys/netinet/if_ether.c	(working copy)
@@ -85,8 +85,6 @@ static SYSCTL_NODE(_net_link_ether, PF_ARP, arp, C
 static VNET_DEFINE(int, arpt_keep) = (20*60);	/* once resolved, good for 20
 						 * minutes */
 static VNET_DEFINE(int, arp_maxtries) = 5;
-VNET_DEFINE(int, useloopback) = 1;	/* use loopback interface for
-					 * local traffic */
 static VNET_DEFINE(int, arp_proxyall) = 0;
 static VNET_DEFINE(int, arpt_down) = 20;	/* keep incomplete entries for
 						 * 20 seconds */
@@ -111,9 +109,6 @@ SYSCTL_VNET_INT(_net_link_ether_inet, OID_AUTO, ma
 SYSCTL_VNET_INT(_net_link_ether_inet, OID_AUTO, maxtries, CTLFLAG_RW,
 	&VNET_NAME(arp_maxtries), 0,
 	"ARP resolution attempts before returning error");
-SYSCTL_VNET_INT(_net_link_ether_inet, OID_AUTO, useloopback, CTLFLAG_RW,
-	&VNET_NAME(useloopback), 0,
-	"Use the loopback interface for local traffic");
 SYSCTL_VNET_INT(_net_link_ether_inet, OID_AUTO, proxyall, CTLFLAG_RW,
 	&VNET_NAME(arp_proxyall), 0,
 	"Enable proxy ARP for all suitable requests");
Index: sys/netinet/in.c
===================================================================
--- sys/netinet/in.c	(revision 257503)
+++ sys/netinet/in.c	(working copy)
@@ -47,6 +47,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/proc.h>
 #include <sys/sysctl.h>
 #include <sys/syslog.h>
+#include <sys/sx.h>
 
 #include <net/if.h>
 #include <net/if_var.h>
@@ -71,10 +72,10 @@ static int in_mask2len(struct in_addr *);
 static void in_len2mask(struct in_addr *, int);
 static int in_lifaddr_ioctl(struct socket *, u_long, caddr_t,
 	struct ifnet *, struct thread *);
+static int in_aifaddr_ioctl(caddr_t, struct ifnet *, struct thread *);
+static int in_difaddr_ioctl(caddr_t, struct ifnet *, struct thread *);
 
 static void	in_socktrim(struct sockaddr_in *);
-static int	in_ifinit(struct ifnet *, struct in_ifaddr *,
-		    struct sockaddr_in *, int, int);
 static void	in_purgemaddrs(struct ifnet *);
 
 static VNET_DEFINE(int, nosameprefix);
@@ -86,6 +87,9 @@ SYSCTL_VNET_INT(_net_inet_ip, OID_AUTO, no_same_pr
 VNET_DECLARE(struct inpcbinfo, ripcbinfo);
 #define	V_ripcbinfo			VNET(ripcbinfo)
 
+static struct sx in_control_sx;
+SX_SYSINIT(in_control_sx, &in_control_sx, "in_control");
+
 /*
  * Return 1 if an internet address is for a ``local'' host
  * (one to which we have a connection).
@@ -128,6 +132,28 @@ in_localip(struct in_addr in)
 }
 
 /*
+ * Return an address equal to the supplied one, but not the same.
+ */
+static struct in_ifaddr *
+more_localip(struct in_ifaddr *ia)
+{
+	in_addr_t in = IA_SIN(ia)->sin_addr.s_addr;
+	struct in_ifaddr *it;
+
+	IN_IFADDR_RLOCK();
+	LIST_FOREACH(it, INADDR_HASH(in), ia_hash) {
+		if (it != ia && IA_SIN(it)->sin_addr.s_addr == in) {
+			ifa_ref(&it->ia_ifa);
+			IN_IFADDR_RUNLOCK();
+			return (it);
+		}
+	}
+	IN_IFADDR_RUNLOCK();
+
+	return (NULL);
+}
+
+/*
  * Determine whether an IP address is in a reserved set of addresses
  * that may not be forwarded, or whether datagrams to that destination
  * may be forwarded.
@@ -203,40 +229,22 @@ in_len2mask(struct in_addr *mask, int len)
 
 /*
  * Generic internet control operations (ioctl's).
- *
- * ifp is NULL if not an interface-specific ioctl.
  */
-/* ARGSUSED */
 int
 in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp,
     struct thread *td)
 {
-	register struct ifreq *ifr = (struct ifreq *)data;
-	register struct in_ifaddr *ia, *iap;
-	register struct ifaddr *ifa;
-	struct in_addr allhosts_addr;
-	struct in_addr dst;
-	struct in_ifinfo *ii;
-	struct in_aliasreq *ifra = (struct in_aliasreq *)data;
-	int error, hostIsNew, iaIsNew, maskIsNew;
-	int iaIsFirst;
-	u_long ocmd = cmd;
+	struct ifreq *ifr = (struct ifreq *)data;
+	struct sockaddr_in *addr = (struct sockaddr_in *)&ifr->ifr_addr;
+	struct in_ifaddr *ia;
+	int error;
 
-	/*
-	 * Pre-10.x compat: OSIOCAIFADDR passes a shorter
-	 * struct in_aliasreq, without ifra_vhid.
-	 */
-	if (cmd == OSIOCAIFADDR)
-		cmd = SIOCAIFADDR;
+	if (ifp == NULL)
+		return (EADDRNOTAVAIL);
 
-	ia = NULL;
-	iaIsFirst = 0;
-	iaIsNew = 0;
-	allhosts_addr.s_addr = htonl(INADDR_ALLHOSTS_GROUP);
-
 	/*
-	 * Filter out ioctls we implement directly; forward the rest on to
-	 * in_lifaddr_ioctl() and ifp->if_ioctl().
+	 * Filter out 4 ioctls we implement directly.  Forward the rest
+	 * to specific functions and ifp->if_ioctl().
 	 */
 	switch (cmd) {
 	case SIOCGIFADDR:
@@ -243,34 +251,21 @@ in_control(struct socket *so, u_long cmd, caddr_t
 	case SIOCGIFBRDADDR:
 	case SIOCGIFDSTADDR:
 	case SIOCGIFNETMASK:
+		break;
 	case SIOCDIFADDR:
-		break;
+		sx_xlock(&in_control_sx);
+		error = in_difaddr_ioctl(data, ifp, td);
+		sx_xunlock(&in_control_sx);
+		return (error);
 	case SIOCAIFADDR:
-		/*
-		 * ifra_addr must be present and be of INET family.
-		 * ifra_broadaddr and ifra_mask are optional.
-		 */
-		if (ifra->ifra_addr.sin_len != sizeof(struct sockaddr_in) ||
-		    ifra->ifra_addr.sin_family != AF_INET)
-			return (EINVAL);
-		if (ifra->ifra_broadaddr.sin_len != 0 &&
-		    (ifra->ifra_broadaddr.sin_len !=
-		    sizeof(struct sockaddr_in) ||
-		    ifra->ifra_broadaddr.sin_family != AF_INET))
-			return (EINVAL);
-#if 0
-		/*
-		 * ifconfig(8) in pre-10.x doesn't set sin_family for the
-		 * mask. The code is disabled for the 10.x timeline, to
-		 * make SIOCAIFADDR compatible with 9.x ifconfig(8).
-		 * The code should be enabled in 11.x
-		 */
-		if (ifra->ifra_mask.sin_len != 0 &&
-		    (ifra->ifra_mask.sin_len != sizeof(struct sockaddr_in) ||
-		    ifra->ifra_mask.sin_family != AF_INET))
-			return (EINVAL);
-#endif
-		break;
+		sx_xlock(&in_control_sx);
+		error = in_aifaddr_ioctl(data, ifp, td);
+		sx_xunlock(&in_control_sx);
+		return (error);
+	case SIOCALIFADDR:
+	case SIOCDLIFADDR:
+	case SIOCGLIFADDR:
+		return (in_lifaddr_ioctl(so, cmd, data, ifp, td));
 	case SIOCSIFADDR:
 	case SIOCSIFBRDADDR:
 	case SIOCSIFDSTADDR:
@@ -277,306 +272,353 @@ in_control(struct socket *so, u_long cmd, caddr_t
 	case SIOCSIFNETMASK:
 		/* We no longer support that old commands. */
 		return (EINVAL);
-
-	case SIOCALIFADDR:
-		if (td != NULL) {
-			error = priv_check(td, PRIV_NET_ADDIFADDR);
-			if (error)
-				return (error);
-		}
-		if (ifp == NULL)
-			return (EINVAL);
-		return in_lifaddr_ioctl(so, cmd, data, ifp, td);
-
-	case SIOCDLIFADDR:
-		if (td != NULL) {
-			error = priv_check(td, PRIV_NET_DELIFADDR);
-			if (error)
-				return (error);
-		}
-		if (ifp == NULL)
-			return (EINVAL);
-		return in_lifaddr_ioctl(so, cmd, data, ifp, td);
-
-	case SIOCGLIFADDR:
-		if (ifp == NULL)
-			return (EINVAL);
-		return in_lifaddr_ioctl(so, cmd, data, ifp, td);
-
 	default:
-		if (ifp == NULL || ifp->if_ioctl == NULL)
+		if (ifp->if_ioctl == NULL)
 			return (EOPNOTSUPP);
 		return ((*ifp->if_ioctl)(ifp, cmd, data));
 	}
 
-	if (ifp == NULL)
-		return (EADDRNOTAVAIL);
-
 	/*
-	 * Security checks before we get involved in any work.
-	 */
-	switch (cmd) {
-	case SIOCAIFADDR:
-		if (td != NULL) {
-			error = priv_check(td, PRIV_NET_ADDIFADDR);
-			if (error)
-				return (error);
-		}
-		break;
-
-	case SIOCDIFADDR:
-		if (td != NULL) {
-			error = priv_check(td, PRIV_NET_DELIFADDR);
-			if (error)
-				return (error);
-		}
-		break;
-	}
-
-	/*
 	 * Find address for this interface, if it exists.
-	 *
-	 * If an alias address was specified, find that one instead of the
-	 * first one on the interface, if possible.
 	 */
-	dst = ((struct sockaddr_in *)&ifr->ifr_addr)->sin_addr;
 	IN_IFADDR_RLOCK();
-	LIST_FOREACH(iap, INADDR_HASH(dst.s_addr), ia_hash) {
-		if (iap->ia_ifp == ifp &&
-		    iap->ia_addr.sin_addr.s_addr == dst.s_addr) {
-			if (td == NULL || prison_check_ip4(td->td_ucred,
-			    &dst) == 0)
-				ia = iap;
+	LIST_FOREACH(ia, INADDR_HASH(addr->sin_addr.s_addr), ia_hash) {
+		if (ia->ia_ifp == ifp &&
+		    ia->ia_addr.sin_addr.s_addr == addr->sin_addr.s_addr &&
+		    prison_check_ip4(td->td_ucred, &addr->sin_addr) == 0)
 			break;
-		}
 	}
-	if (ia != NULL)
-		ifa_ref(&ia->ia_ifa);
-	IN_IFADDR_RUNLOCK();
+
 	if (ia == NULL) {
-		IF_ADDR_RLOCK(ifp);
-		TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) {
-			iap = ifatoia(ifa);
-			if (iap->ia_addr.sin_family == AF_INET) {
-				if (td != NULL &&
-				    prison_check_ip4(td->td_ucred,
-				    &iap->ia_addr.sin_addr) != 0)
-					continue;
-				ia = iap;
-				break;
-			}
-		}
-		if (ia != NULL)
-			ifa_ref(&ia->ia_ifa);
-		IF_ADDR_RUNLOCK(ifp);
+		IN_IFADDR_RUNLOCK();
+		return (EADDRNOTAVAIL);
 	}
-	if (ia == NULL)
-		iaIsFirst = 1;
 
 	error = 0;
 	switch (cmd) {
-	case SIOCAIFADDR:
-	case SIOCDIFADDR:
-		if (ifra->ifra_addr.sin_family == AF_INET) {
-			struct in_ifaddr *oia;
+	case SIOCGIFADDR:
+		*addr = ia->ia_addr;
+		break;
 
-			IN_IFADDR_RLOCK();
-			for (oia = ia; ia; ia = TAILQ_NEXT(ia, ia_link)) {
-				if (ia->ia_ifp == ifp  &&
-				    ia->ia_addr.sin_addr.s_addr ==
-				    ifra->ifra_addr.sin_addr.s_addr)
-					break;
-			}
-			if (ia != NULL && ia != oia)
-				ifa_ref(&ia->ia_ifa);
-			if (oia != NULL && ia != oia)
-				ifa_free(&oia->ia_ifa);
-			IN_IFADDR_RUNLOCK();
-			if ((ifp->if_flags & IFF_POINTOPOINT)
-			    && (cmd == SIOCAIFADDR)
-			    && (ifra->ifra_dstaddr.sin_addr.s_addr
-				== INADDR_ANY)) {
-				error = EDESTADDRREQ;
-				goto out;
-			}
+	case SIOCGIFBRDADDR:
+		if ((ifp->if_flags & IFF_BROADCAST) == 0) {
+			error = EINVAL;
+			break;
 		}
-		if (cmd == SIOCDIFADDR && ia == NULL) {
-			error = EADDRNOTAVAIL;
-			goto out;
-		}
-		if (ia == NULL) {
-			ifa = ifa_alloc(sizeof(struct in_ifaddr), M_WAITOK);
-			ia = (struct in_ifaddr *)ifa;
-			ifa->ifa_addr = (struct sockaddr *)&ia->ia_addr;
-			ifa->ifa_dstaddr = (struct sockaddr *)&ia->ia_dstaddr;
-			ifa->ifa_netmask = (struct sockaddr *)&ia->ia_sockmask;
+		*addr = ia->ia_broadaddr;
+		break;
 
-			ia->ia_sockmask.sin_len = 8;
-			ia->ia_sockmask.sin_family = AF_INET;
-			if (ifp->if_flags & IFF_BROADCAST) {
-				ia->ia_broadaddr.sin_len = sizeof(ia->ia_addr);
-				ia->ia_broadaddr.sin_family = AF_INET;
-			}
-			ia->ia_ifp = ifp;
-
-			ifa_ref(ifa);			/* if_addrhead */
-			IF_ADDR_WLOCK(ifp);
-			TAILQ_INSERT_TAIL(&ifp->if_addrhead, ifa, ifa_link);
-			IF_ADDR_WUNLOCK(ifp);
-			ifa_ref(ifa);			/* in_ifaddrhead */
-			IN_IFADDR_WLOCK();
-			TAILQ_INSERT_TAIL(&V_in_ifaddrhead, ia, ia_link);
-			IN_IFADDR_WUNLOCK();
-			iaIsNew = 1;
+	case SIOCGIFDSTADDR:
+		if ((ifp->if_flags & IFF_POINTOPOINT) == 0) {
+			error = EINVAL;
+			break;
 		}
+		*addr = ia->ia_dstaddr;
 		break;
 
-	case SIOCGIFADDR:
 	case SIOCGIFNETMASK:
-	case SIOCGIFDSTADDR:
-	case SIOCGIFBRDADDR:
-		if (ia == NULL) {
-			error = EADDRNOTAVAIL;
-			goto out;
-		}
+		*addr = ia->ia_sockmask;
 		break;
 	}
 
+	IN_IFADDR_RUNLOCK();
+
+	return (error);
+}
+
+static int
+in_aifaddr_ioctl(caddr_t data, struct ifnet *ifp, struct thread *td)
+{
+	const struct in_aliasreq *ifra = (struct in_aliasreq *)data;
+	const struct sockaddr_in *addr = &ifra->ifra_addr;
+	const struct sockaddr_in *broadaddr = &ifra->ifra_broadaddr;
+	const struct sockaddr_in *mask = &ifra->ifra_mask;
+	const struct sockaddr_in *dstaddr = &ifra->ifra_dstaddr;
+	const int vhid = ifra->ifra_vhid;
+	struct ifaddr *ifa;
+	struct in_ifaddr *ia;
+	bool iaIsFirst;
+	int error = 0;
+
+	error = priv_check(td, PRIV_NET_ADDIFADDR);
+	if (error)
+		return (error);
+
 	/*
-	 * Most paths in this switch return directly or via out.  Only paths
-	 * that remove the address break in order to hit common removal code.
+	 * ifra_addr must be present and be of INET family.
+	 * ifra_broadaddr/ifra_dstaddr and ifra_mask are optional.
 	 */
-	switch (cmd) {
-	case SIOCGIFADDR:
-		*((struct sockaddr_in *)&ifr->ifr_addr) = ia->ia_addr;
-		goto out;
+	if (addr->sin_len != sizeof(struct sockaddr_in) ||
+	    addr->sin_family != AF_INET)
+		return (EINVAL);
+	if (broadaddr->sin_len != 0 &&
+	    (broadaddr->sin_len != sizeof(struct sockaddr_in) ||
+	    broadaddr->sin_family != AF_INET))
+		return (EINVAL);
+	if (mask->sin_len != 0 &&
+	    (mask->sin_len != sizeof(struct sockaddr_in) ||
+	    mask->sin_family != AF_INET))
+		return (EINVAL);
+	if ((ifp->if_flags & IFF_POINTOPOINT) &&
+	    (dstaddr->sin_len != sizeof(struct sockaddr_in) ||
+	     dstaddr->sin_addr.s_addr == INADDR_ANY))
+		return (EDESTADDRREQ);
+	if (vhid > 0 && carp_attach_p == NULL)
+		return (EPROTONOSUPPORT);
 
-	case SIOCGIFBRDADDR:
-		if ((ifp->if_flags & IFF_BROADCAST) == 0) {
-			error = EINVAL;
-			goto out;
-		}
-		*((struct sockaddr_in *)&ifr->ifr_dstaddr) = ia->ia_broadaddr;
-		goto out;
+	/*
+	 * See whether the address already exists.
+	 */
+	iaIsFirst = true;
+	ia = NULL;
+	IF_ADDR_RLOCK(ifp);
+	TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) {
+		struct in_ifaddr *it = ifatoia(ifa);
 
-	case SIOCGIFDSTADDR:
-		if ((ifp->if_flags & IFF_POINTOPOINT) == 0) {
-			error = EINVAL;
-			goto out;
-		}
-		*((struct sockaddr_in *)&ifr->ifr_dstaddr) = ia->ia_dstaddr;
-		goto out;
+		if (it->ia_addr.sin_family != AF_INET)
+			continue;
 
-	case SIOCGIFNETMASK:
-		*((struct sockaddr_in *)&ifr->ifr_addr) = ia->ia_sockmask;
-		goto out;
+		iaIsFirst = false;
+		if (it->ia_addr.sin_addr.s_addr == addr->sin_addr.s_addr &&
+		    prison_check_ip4(td->td_ucred, &addr->sin_addr) == 0)
+			ia = it;
+	}
+	IF_ADDR_RUNLOCK(ifp);
 
-	case SIOCAIFADDR:
-		maskIsNew = 0;
-		hostIsNew = 1;
-		error = 0;
-		if (ifra->ifra_addr.sin_addr.s_addr ==
-			    ia->ia_addr.sin_addr.s_addr)
-			hostIsNew = 0;
-		if (ifra->ifra_mask.sin_len) {
-			/*
-			 * QL: XXX
-			 * Need to scrub the prefix here in case
-			 * the issued command is SIOCAIFADDR with
-			 * the same address, but with a different
-			 * prefix length. And if the prefix length
-			 * is the same as before, then the call is
-			 * un-necessarily executed here.
-			 */
-			in_scrubprefix(ia, LLE_STATIC);
-			ia->ia_sockmask = ifra->ifra_mask;
-			ia->ia_sockmask.sin_family = AF_INET;
-			ia->ia_subnetmask =
-			    ntohl(ia->ia_sockmask.sin_addr.s_addr);
-			maskIsNew = 1;
-		}
-		if ((ifp->if_flags & IFF_POINTOPOINT) &&
-		    (ifra->ifra_dstaddr.sin_family == AF_INET)) {
-			in_scrubprefix(ia, LLE_STATIC);
-			ia->ia_dstaddr = ifra->ifra_dstaddr;
-			maskIsNew  = 1; /* We lie; but the effect's the same */
-		}
-		if (hostIsNew || maskIsNew)
-			error = in_ifinit(ifp, ia, &ifra->ifra_addr, maskIsNew,
-			    (ocmd == cmd ? ifra->ifra_vhid : 0));
-		if (error != 0 && iaIsNew)
-			break;
+	if (ia != NULL)
+		(void )in_difaddr_ioctl(data, ifp, td);
 
-		if ((ifp->if_flags & IFF_BROADCAST) &&
-		    ifra->ifra_broadaddr.sin_len)
-			ia->ia_broadaddr = ifra->ifra_broadaddr;
-		if (error == 0) {
-			ii = ((struct in_ifinfo *)ifp->if_afdata[AF_INET]);
-			if (iaIsFirst &&
-			    (ifp->if_flags & IFF_MULTICAST) != 0) {
-				error = in_joingroup(ifp, &allhosts_addr,
-				    NULL, &ii->ii_allhosts);
-			}
-			EVENTHANDLER_INVOKE(ifaddr_event, ifp);
-		}
-		goto out;
+	ifa = ifa_alloc(sizeof(struct in_ifaddr), M_WAITOK);
+	ia = (struct in_ifaddr *)ifa;
+	ifa->ifa_addr = (struct sockaddr *)&ia->ia_addr;
+	ifa->ifa_dstaddr = (struct sockaddr *)&ia->ia_dstaddr;
+	ifa->ifa_netmask = (struct sockaddr *)&ia->ia_sockmask;
 
-	case SIOCDIFADDR:
-		/*
-		 * in_scrubprefix() kills the interface route.
-		 */
-		in_scrubprefix(ia, LLE_STATIC);
+	ia->ia_ifp = ifp;
+	ia->ia_ifa.ifa_metric = ifp->if_metric;
+	ia->ia_addr = *addr;
+	if (mask->sin_len != 0) {
+		ia->ia_sockmask = *mask;
+		ia->ia_subnetmask = ntohl(ia->ia_sockmask.sin_addr.s_addr);
+	} else {
+		in_addr_t i = ntohl(addr->sin_addr.s_addr);
 
 		/*
-		 * in_ifadown gets rid of all the rest of
-		 * the routes.  This is not quite the right
-		 * thing to do, but at least if we are running
-		 * a routing process they will come back.
-		 */
-		in_ifadown(&ia->ia_ifa, 1);
-		EVENTHANDLER_INVOKE(ifaddr_event, ifp);
-		error = 0;
-		break;
+		 * Be compatible with network classes, if netmask isn't
+		 * supplied, guess it based on classes.
+		 */
+		if (IN_CLASSA(i))
+			ia->ia_subnetmask = IN_CLASSA_NET;
+		else if (IN_CLASSB(i))
+			ia->ia_subnetmask = IN_CLASSB_NET;
+		else
+			ia->ia_subnetmask = IN_CLASSC_NET;
+		ia->ia_sockmask.sin_addr.s_addr = htonl(ia->ia_subnetmask);
+	}
+	ia->ia_subnet = ntohl(addr->sin_addr.s_addr) & ia->ia_subnetmask;
+	in_socktrim(&ia->ia_sockmask);
 
-	default:
-		panic("in_control: unsupported ioctl");
+	if (ifp->if_flags & IFF_BROADCAST) {
+		if (broadaddr->sin_len != 0) {
+			ia->ia_broadaddr = *broadaddr;
+		} else if (ia->ia_subnetmask == IN_RFC3021_MASK) {
+			ia->ia_broadaddr.sin_addr.s_addr = INADDR_BROADCAST;
+			ia->ia_broadaddr.sin_len = sizeof(struct sockaddr_in);
+			ia->ia_broadaddr.sin_family = AF_INET;
+		} else {
+			ia->ia_broadaddr.sin_addr.s_addr =
+			    htonl(ia->ia_subnet | ~ia->ia_subnetmask);
+			ia->ia_broadaddr.sin_len = sizeof(struct sockaddr_in);
+			ia->ia_broadaddr.sin_family = AF_INET;
+		}
 	}
 
+	if (ifp->if_flags & IFF_POINTOPOINT)
+		ia->ia_dstaddr = *dstaddr;
+
+	/* XXXGL: rtinit() needs this strange assignment. */
+	if (ifp->if_flags & IFF_LOOPBACK)
+		ia->ia_dstaddr = ia->ia_addr;
+
+	ifa_ref(ifa);			/* if_addrhead */
+	IF_ADDR_WLOCK(ifp);
+	TAILQ_INSERT_TAIL(&ifp->if_addrhead, ifa, ifa_link);
+	IF_ADDR_WUNLOCK(ifp);
+
+	ifa_ref(ifa);			/* in_ifaddrhead */
+	IN_IFADDR_WLOCK();
+	TAILQ_INSERT_TAIL(&V_in_ifaddrhead, ia, ia_link);
+	LIST_INSERT_HEAD(INADDR_HASH(ia->ia_addr.sin_addr.s_addr), ia, ia_hash);
+	IN_IFADDR_WUNLOCK();
+
+	if (vhid != 0)
+		error = (*carp_attach_p)(&ia->ia_ifa, vhid);
+	if (error)
+		goto fail1;
+
+	/*
+	 * Give the interface a chance to initialize
+	 * if this is its first address,
+	 * and to validate the address if necessary.
+	 */
+	if (ifp->if_ioctl != NULL)
+		error = (*ifp->if_ioctl)(ifp, SIOCSIFADDR, (caddr_t)ia);
+	if (error)
+		goto fail2;
+
+	/*
+	 * Add route for the network.
+	 */
+	if (vhid == 0) {
+		int flags = RTF_UP;
+
+		if (ifp->if_flags & (IFF_LOOPBACK|IFF_POINTOPOINT))
+			flags |= RTF_HOST;
+
+		error = in_addprefix(ia, flags);
+		if (error)
+			goto fail2;
+	}
+
+	/*
+	 * Add a loopback route to self.
+	 */
+	if (vhid == 0 && (ifp->if_flags & IFF_LOOPBACK) == 0 &&
+	    ia->ia_addr.sin_addr.s_addr != INADDR_ANY) {
+		struct in_ifaddr *eia;
+
+		eia = more_localip(ia);
+
+		if (eia == NULL) {
+			error = ifa_add_loopback_route((struct ifaddr *)ia,
+			    (struct sockaddr *)&ia->ia_addr);
+			if (error)
+				goto fail3;
+		} else
+			ifa_free(&eia->ia_ifa);
+	}
+
+	if (iaIsFirst && (ifp->if_flags & IFF_MULTICAST)) {
+		struct in_addr allhosts_addr;
+		struct in_ifinfo *ii;
+
+		ii = ((struct in_ifinfo *)ifp->if_afdata[AF_INET]);
+		allhosts_addr.s_addr = htonl(INADDR_ALLHOSTS_GROUP);
+
+		error = in_joingroup(ifp, &allhosts_addr, NULL,
+			&ii->ii_allhosts);
+	}
+
+	EVENTHANDLER_INVOKE(ifaddr_event, ifp);
+
+	return (error);
+
+fail3:
+	if (vhid == 0)
+		(void )in_scrubprefix(ia, LLE_STATIC);
+
+fail2:
 	if (ia->ia_ifa.ifa_carp)
 		(*carp_detach_p)(&ia->ia_ifa);
 
+fail1:
 	IF_ADDR_WLOCK(ifp);
-	/* Re-check that ia is still part of the list. */
+	TAILQ_REMOVE(&ifp->if_addrhead, &ia->ia_ifa, ifa_link);
+	IF_ADDR_WUNLOCK(ifp);
+	ifa_free(&ia->ia_ifa);
+
+	IN_IFADDR_WLOCK();
+	TAILQ_REMOVE(&V_in_ifaddrhead, ia, ia_link);
+	LIST_REMOVE(ia, ia_hash);
+	IN_IFADDR_WUNLOCK();
+	ifa_free(&ia->ia_ifa);
+
+	return (error);
+}
+
+static int
+in_difaddr_ioctl(caddr_t data, struct ifnet *ifp, struct thread *td)
+{
+	const struct ifreq *ifr = (struct ifreq *)data;
+	const struct sockaddr_in *addr = (struct sockaddr_in *)&ifr->ifr_addr;
+	struct ifaddr *ifa;
+	struct in_ifaddr *ia;
+	bool deleteAny, iaIsLast;
+	int error;
+
+	if (td != NULL) {
+		error = priv_check(td, PRIV_NET_DELIFADDR);
+		if (error)
+			return (error);
+	}
+
+	if (addr->sin_len != sizeof(struct sockaddr_in) ||
+	    addr->sin_family != AF_INET)
+		deleteAny = true;
+	else
+		deleteAny = false;
+
+	iaIsLast = true;
+	ia = NULL;
+	IF_ADDR_WLOCK(ifp);
 	TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) {
-		if (ifa == &ia->ia_ifa)
-			break;
+		struct in_ifaddr *it = ifatoia(ifa);
+
+		if (it->ia_addr.sin_family != AF_INET)
+			continue;
+
+		if (deleteAny && ia == NULL && (td == NULL ||
+		    prison_check_ip4(td->td_ucred, &it->ia_addr.sin_addr) == 0))
+			ia = it;
+
+		if (it->ia_addr.sin_addr.s_addr == addr->sin_addr.s_addr &&
+		    (td == NULL || prison_check_ip4(td->td_ucred,
+		    &addr->sin_addr) == 0))
+			ia = it;
+
+		if (it != ia)
+			iaIsLast = false;
 	}
-	if (ifa == NULL) {
-		/*
-		 * If we lost the race with another thread, there is no need to
-		 * try it again for the next loop as there is no other exit
-		 * path between here and out.
-		 */
+
+	if (ia == NULL) {
 		IF_ADDR_WUNLOCK(ifp);
-		error = EADDRNOTAVAIL;
-		goto out;
+		return (EADDRNOTAVAIL);
 	}
+
 	TAILQ_REMOVE(&ifp->if_addrhead, &ia->ia_ifa, ifa_link);
 	IF_ADDR_WUNLOCK(ifp);
-	ifa_free(&ia->ia_ifa);		      /* if_addrhead */
+	ifa_free(&ia->ia_ifa);		/* if_addrhead */
 
 	IN_IFADDR_WLOCK();
 	TAILQ_REMOVE(&V_in_ifaddrhead, ia, ia_link);
-
 	LIST_REMOVE(ia, ia_hash);
 	IN_IFADDR_WUNLOCK();
+	ifa_free(&ia->ia_ifa);		/* in_ifaddrhead */
+
 	/*
+	 * in_scrubprefix() kills the interface route.
+	 */
+	in_scrubprefix(ia, LLE_STATIC);
+
+	/*
+	 * in_ifadown gets rid of all the rest of
+	 * the routes.  This is not quite the right
+	 * thing to do, but at least if we are running
+	 * a routing process they will come back.
+	 */
+	in_ifadown(&ia->ia_ifa, 1);
+
+	if (ia->ia_ifa.ifa_carp)
+		(*carp_detach_p)(&ia->ia_ifa);
+
+	/*
 	 * If this is the last IPv4 address configured on this
 	 * interface, leave the all-hosts group.
 	 * No state-change report need be transmitted.
 	 */
-	IFP_TO_IA(ifp, iap);
-	if (iap == NULL) {
+	if (iaIsLast && (ifp->if_flags & IFF_MULTICAST)) {
+		struct in_ifinfo *ii;
+
 		ii = ((struct in_ifinfo *)ifp->if_afdata[AF_INET]);
 		IN_MULTI_LOCK();
 		if (ii->ii_allhosts) {
@@ -584,14 +626,11 @@ in_control(struct socket *so, u_long cmd, caddr_t
 			ii->ii_allhosts = NULL;
 		}
 		IN_MULTI_UNLOCK();
-	} else
-		ifa_free(&iap->ia_ifa);
+	}
 
-	ifa_free(&ia->ia_ifa);				/* in_ifaddrhead */
-out:
-	if (ia != NULL)
-		ifa_free(&ia->ia_ifa);
-	return (error);
+	EVENTHANDLER_INVOKE(ifaddr_event, ifp);
+
+	return (0);
 }
 
 /*
@@ -616,11 +655,23 @@ in_lifaddr_ioctl(struct socket *so, u_long cmd, ca
 {
 	struct if_laddrreq *iflr = (struct if_laddrreq *)data;
 	struct ifaddr *ifa;
+	int error;
 
-	/* sanity checks */
-	if (data == NULL || ifp == NULL) {
-		panic("invalid argument to in_lifaddr_ioctl");
-		/*NOTRECHED*/
+	switch (cmd) {
+	case SIOCALIFADDR:
+		if (td != NULL) {
+			error = priv_check(td, PRIV_NET_ADDIFADDR);
+			if (error)
+				return (error);
+		}
+		break;
+	case SIOCDLIFADDR:
+		if (td != NULL) {
+			error = priv_check(td, PRIV_NET_DELIFADDR);
+			if (error)
+				return (error);
+		}
+		break;
 	}
 
 	switch (cmd) {
@@ -770,115 +821,6 @@ in_lifaddr_ioctl(struct socket *so, u_long cmd, ca
 	return (EOPNOTSUPP);	/*just for safety*/
 }
 
-/*
- * Initialize an interface's internet address
- * and routing table entry.
- */
-static int
-in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct sockaddr_in *sin,
-    int masksupplied, int vhid)
-{
-	register u_long i = ntohl(sin->sin_addr.s_addr);
-	int flags, error = 0;
-
-	IN_IFADDR_WLOCK();
-	if (ia->ia_addr.sin_family == AF_INET)
-		LIST_REMOVE(ia, ia_hash);
-	ia->ia_addr = *sin;
-	LIST_INSERT_HEAD(INADDR_HASH(ia->ia_addr.sin_addr.s_addr),
-	    ia, ia_hash);
-	IN_IFADDR_WUNLOCK();
-
-	if (vhid > 0) {
-		if (carp_attach_p != NULL)
-			error = (*carp_attach_p)(&ia->ia_ifa, vhid);
-		else
-			error = EPROTONOSUPPORT;
-	}
-	if (error)
-		return (error);
-
-	/*
-	 * Give the interface a chance to initialize
-	 * if this is its first address,
-	 * and to validate the address if necessary.
-	 */
-	if (ifp->if_ioctl != NULL &&
-	    (error = (*ifp->if_ioctl)(ifp, SIOCSIFADDR, (caddr_t)ia)) != 0)
-			/* LIST_REMOVE(ia, ia_hash) is done in in_control */
-			return (error);
-
-	/*
-	 * Be compatible with network classes, if netmask isn't supplied,
-	 * guess it based on classes.
-	 */
-	if (!masksupplied) {
-		if (IN_CLASSA(i))
-			ia->ia_subnetmask = IN_CLASSA_NET;
-		else if (IN_CLASSB(i))
-			ia->ia_subnetmask = IN_CLASSB_NET;
-		else
-			ia->ia_subnetmask = IN_CLASSC_NET;
-		ia->ia_sockmask.sin_addr.s_addr = htonl(ia->ia_subnetmask);
-	}
-	ia->ia_subnet = i & ia->ia_subnetmask;
-	in_socktrim(&ia->ia_sockmask);
-
-	/*
-	 * Add route for the network.
-	 */
-	flags = RTF_UP;
-	ia->ia_ifa.ifa_metric = ifp->if_metric;
-	if (ifp->if_flags & IFF_BROADCAST) {
-		if (ia->ia_subnetmask == IN_RFC3021_MASK)
-			ia->ia_broadaddr.sin_addr.s_addr = INADDR_BROADCAST;
-		else
-			ia->ia_broadaddr.sin_addr.s_addr =
-			    htonl(ia->ia_subnet | ~ia->ia_subnetmask);
-	} else if (ifp->if_flags & IFF_LOOPBACK) {
-		ia->ia_dstaddr = ia->ia_addr;
-		flags |= RTF_HOST;
-	} else if (ifp->if_flags & IFF_POINTOPOINT) {
-		if (ia->ia_dstaddr.sin_family != AF_INET)
-			return (0);
-		flags |= RTF_HOST;
-	}
-	if (!vhid && (error = in_addprefix(ia, flags)) != 0)
-		return (error);
-
-	if (ia->ia_addr.sin_addr.s_addr == INADDR_ANY)
-		return (0);
-
-	if (ifp->if_flags & IFF_POINTOPOINT &&
-	    ia->ia_dstaddr.sin_addr.s_addr == ia->ia_addr.sin_addr.s_addr)
-			return (0);
-
-	/*
-	 * add a loopback route to self
-	 */
-	if (V_useloopback && !vhid && !(ifp->if_flags & IFF_LOOPBACK)) {
-		struct route ia_ro;
-
-		bzero(&ia_ro, sizeof(ia_ro));
-		*((struct sockaddr_in *)(&ia_ro.ro_dst)) = ia->ia_addr;
-		rtalloc_ign_fib(&ia_ro, 0, RT_DEFAULT_FIB);
-		if ((ia_ro.ro_rt != NULL) && (ia_ro.ro_rt->rt_ifp != NULL) &&
-		    (ia_ro.ro_rt->rt_ifp == V_loif)) {
-			RT_LOCK(ia_ro.ro_rt);
-			RT_ADDREF(ia_ro.ro_rt);
-			RTFREE_LOCKED(ia_ro.ro_rt);
-		} else
-			error = ifa_add_loopback_route((struct ifaddr *)ia,
-			    (struct sockaddr *)&ia->ia_addr);
-		if (error == 0)
-			ia->ia_flags |= IFA_RTSELF;
-		if (ia_ro.ro_rt != NULL)
-			RTFREE(ia_ro.ro_rt);
-	}
-
-	return (error);
-}
-
 #define rtinitflags(x) \
 	((((x)->ia_ifp->if_flags & (IFF_LOOPBACK | IFF_POINTOPOINT)) != 0) \
 	    ? RTF_HOST : 0)
@@ -1007,44 +949,27 @@ in_scrubprefix(struct in_ifaddr *target, u_int fla
 
 	/*
 	 * Remove the loopback route to the interface address.
-	 * The "useloopback" setting is not consulted because if the
-	 * user configures an interface address, turns off this
-	 * setting, and then tries to delete that interface address,
-	 * checking the current setting of "useloopback" would leave
-	 * that interface address loopback route untouched, which
-	 * would be wrong. Therefore the interface address loopback route
-	 * deletion is unconditional.
 	 */
 	if ((target->ia_addr.sin_addr.s_addr != INADDR_ANY) &&
 	    !(target->ia_ifp->if_flags & IFF_LOOPBACK) &&
-	    (target->ia_flags & IFA_RTSELF)) {
-		struct route ia_ro;
-		int freeit = 0;
+	    (flags & LLE_STATIC)) {
+		struct in_ifaddr *eia;
 
-		bzero(&ia_ro, sizeof(ia_ro));
-		*((struct sockaddr_in *)(&ia_ro.ro_dst)) = target->ia_addr;
-		rtalloc_ign_fib(&ia_ro, 0, 0);
-		if ((ia_ro.ro_rt != NULL) && (ia_ro.ro_rt->rt_ifp != NULL) &&
-		    (ia_ro.ro_rt->rt_ifp == V_loif)) {
-			RT_LOCK(ia_ro.ro_rt);
-			if (ia_ro.ro_rt->rt_refcnt <= 1)
-				freeit = 1;
-			else if (flags & LLE_STATIC) {
-				RT_REMREF(ia_ro.ro_rt);
-				target->ia_flags &= ~IFA_RTSELF;
-			}
-			RTFREE_LOCKED(ia_ro.ro_rt);
-		}
-		if (freeit && (flags & LLE_STATIC)) {
+		eia = more_localip(target);
+
+		if (eia != NULL) {
+			error = ifa_switch_loopback_route((struct ifaddr *)eia,
+			    (struct sockaddr *)&target->ia_addr);
+			ifa_free(&eia->ia_ifa);
+		} else {
 			error = ifa_del_loopback_route((struct ifaddr *)target,
 			    (struct sockaddr *)&target->ia_addr);
-			if (error == 0)
-				target->ia_flags &= ~IFA_RTSELF;
 		}
-		if ((flags & LLE_STATIC) &&
-			!(target->ia_ifp->if_flags & IFF_NOARP))
+
+		if (!(target->ia_ifp->if_flags & IFF_NOARP))
 			/* remove arp cache */
-			arp_ifscrub(target->ia_ifp, IA_SIN(target)->sin_addr.s_addr);
+			arp_ifscrub(target->ia_ifp,
+			    IA_SIN(target)->sin_addr.s_addr);
 	}
 
 	if (rtinitflags(target)) {
Index: sys/netinet/raw_ip.c
===================================================================
--- sys/netinet/raw_ip.c	(revision 257503)
+++ sys/netinet/raw_ip.c	(working copy)
@@ -774,8 +774,6 @@ rip_ctlinput(int cmd, struct sockaddr *sa, void *v
 			flags |= RTF_HOST;
 
 		err = ifa_del_loopback_route((struct ifaddr *)ia, sa);
-		if (err == 0)
-			ia->ia_flags &= ~IFA_RTSELF;
 
 		err = rtinit(&ia->ia_ifa, RTM_ADD, flags);
 		if (err == 0)
@@ -782,8 +780,6 @@ rip_ctlinput(int cmd, struct sockaddr *sa, void *v
 			ia->ia_flags |= IFA_ROUTE;
 
 		err = ifa_add_loopback_route((struct ifaddr *)ia, sa);
-		if (err == 0)
-			ia->ia_flags |= IFA_RTSELF;
 
 		ifa_free(&ia->ia_ifa);
 		break;

--SvF6CGw9fzJC4Rcx--
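
For readers following the in_aifaddr_ioctl() hunk above: when no netmask
is supplied, the patch falls back to a classful guess, and on broadcast
interfaces it derives the broadcast address from the subnet and mask, with
the /31 case (IN_RFC3021_MASK) treated specially. Below is a minimal
userland sketch of just that arithmetic, in host byte order. The sk_ names
and SK_ constants are made up for illustration and are not part of the
patch or the kernel; their values are assumed to match the IN_CLASS*_NET
and IN_RFC3021_MASK definitions in netinet/in.h.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the kernel constants; all values in host byte order. */
#define SK_CLASSA_NET   0xff000000u
#define SK_CLASSB_NET   0xffff0000u
#define SK_CLASSC_NET   0xffffff00u
#define SK_RFC3021_MASK 0xfffffffeu

/*
 * Guess a netmask from the address class, as the patch does when
 * ifra_mask is not supplied.  Anything that is not class A or B
 * falls through to the class C mask, as in the patch.
 */
uint32_t
sk_guess_mask(uint32_t addr)
{
	if ((addr & 0x80000000u) == 0)			/* class A: 0xxx */
		return (SK_CLASSA_NET);
	if ((addr & 0xc0000000u) == 0x80000000u)	/* class B: 10xx */
		return (SK_CLASSB_NET);
	return (SK_CLASSC_NET);
}

/*
 * Derive the broadcast address when ifra_broadaddr is not supplied:
 * all-ones for a /31, otherwise subnet | ~mask.
 */
uint32_t
sk_broadcast(uint32_t addr, uint32_t mask)
{
	if (mask == SK_RFC3021_MASK)
		return (0xffffffffu);
	return ((addr & mask) | ~mask);
}
```

For example, sk_guess_mask(0xc0a80105) (192.168.1.5) gives 0xffffff00, and
sk_broadcast(0xc0a80105, 0xffffff00) gives 0xc0a801ff (192.168.1.255).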

From owner-freebsd-net@FreeBSD.ORG  Fri Nov  1 16:33:52 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id C196D3B3
 for <freebsd-net@freebsd.org>; Fri,  1 Nov 2013 16:33:52 +0000 (UTC)
 (envelope-from s.khanchi@gmail.com)
Received: from mail-wi0-x231.google.com (mail-wi0-x231.google.com
 [IPv6:2a00:1450:400c:c05::231])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 5D50D28D8
 for <freebsd-net@freebsd.org>; Fri,  1 Nov 2013 16:33:52 +0000 (UTC)
Received: by mail-wi0-f177.google.com with SMTP id f4so1321752wiw.4
 for <freebsd-net@freebsd.org>; Fri, 01 Nov 2013 09:33:50 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:from:date:message-id
 :subject:to:cc:content-type;
 bh=vswjD5pt+Ip9CC/rcoL0LIQ3egOuHF5H89slgjqp5TA=;
 b=PQhxmugASjaGWTZ4J0aUpyVH/3InPniQNLgOO9vdfcMujRTKgg+e/l4zLDlp+TD95K
 WaVxENF+xgDlMyJ6+yCFmATHsLaZeEvuoPQWKe0KB4BQM1rNTP6xgOueQc6iT0+XFC0l
 4uUTrn2pSu66idX0uLYZOvNB5QwbaBO9NxojoLaZR2BuKouSZs25mf2wwdgr1zaeypZd
 hzp3x1Qf6c/kk1NeML6+JJDqJ1OjT0IYr7bGPYbh7UXg6Xma9eEua0fv8s/8VfS7XE5T
 FEj2lauy/4Zu7IN8UQXY/ktD8erqruOJJWhG2sRuYqdTcJeqaSaILGUtBVE17hWbnDcN
 KJTg==
X-Received: by 10.180.105.194 with SMTP id go2mr3069156wib.39.1383323630849;
 Fri, 01 Nov 2013 09:33:50 -0700 (PDT)
MIME-Version: 1.0
Sender: s.khanchi@gmail.com
Received: by 10.194.122.230 with HTTP; Fri, 1 Nov 2013 09:33:30 -0700 (PDT)
In-Reply-To: <20131031180907.GB62132@onelab2.iet.unipi.it>
References: <CAARSjE39dDLMJfEZexZQ=YGHhCNR69vKzPAB6ojYdJSZopysGQ@mail.gmail.com>
 <20131031180907.GB62132@onelab2.iet.unipi.it>
From: h bagade <bagadeh@gmail.com>
Date: Fri, 1 Nov 2013 20:03:30 +0330
X-Google-Sender-Auth: YQiKg7MJnikHJ6kl3tbJubrydd8
Message-ID: <CAARSjE0jOUS9XyQ6=SC_5GdQnpQH8VPf5uZMTVqfQi_29rjo+w@mail.gmail.com>
Subject: Re: Errors on running kipfw with vale switches
To: Luigi Rizzo <rizzo@iet.unipi.it>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Nov 2013 16:33:52 -0000

On Thu, Oct 31, 2013 at 9:39 PM, Luigi Rizzo <rizzo@iet.unipi.it> wrote:

> On Thu, Oct 31, 2013 at 11:10:39AM +0330, h bagade wrote:
> > Hi all,
> >
> > I want to run userland ipfw with netmap support (kipfw). When I try to
> > follow the example to test kipfw, it encounters an error on the following
> > command:
>
> I suspect that stable/9 has an old version of the netmap code,
> so the argument to the ioctl fails.
> In fact, I don't even remember if the code in stable/9
> supports VALE.
>
> Please wait for a few days; we are going to push a newer
> version of netmap to both HEAD and stable/9 soon.
>
> cheers
> luigi
>

Thanks for your great support. I'll wait for your changes :)

From owner-freebsd-net@FreeBSD.ORG  Sat Nov  2 12:20:02 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@smarthost.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 4D057B38
 for <freebsd-net@smarthost.ysv.freebsd.org>;
 Sat,  2 Nov 2013 12:20:02 +0000 (UTC)
 (envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 21D072265
 for <freebsd-net@smarthost.ysv.freebsd.org>;
 Sat,  2 Nov 2013 12:20:02 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id rA2CK0dm081543
 for <freebsd-net@freefall.freebsd.org>; Sat, 2 Nov 2013 12:20:00 GMT
 (envelope-from gnats@freefall.freebsd.org)
Received: (from gnats@localhost)
 by freefall.freebsd.org (8.14.7/8.14.7/Submit) id rA2CK0s0081542;
 Sat, 2 Nov 2013 12:20:00 GMT (envelope-from gnats)
Date: Sat, 2 Nov 2013 12:20:00 GMT
Message-Id: <201311021220.rA2CK0s0081542@freefall.freebsd.org>
To: freebsd-net@FreeBSD.org
Cc: 
From: "Pataki Antal (Granaglia Kft.)" <pataki.antal@granaglia.com>
Subject: Re: kern/183391: [ixgbe] 10gigabit networking problems with Emulex
 OCE 11102 CNA
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: "Pataki Antal \(Granaglia Kft.\)" <pataki.antal@granaglia.com>
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Nov 2013 12:20:02 -0000

The following reply was made to PR kern/183391; it has been noted by GNATS.

From: "Pataki Antal (Granaglia Kft.)" <pataki.antal@granaglia.com>
To: bug-followup@FreeBSD.org,
 Pataki Antal <pataki.antal@gmail.com>
Cc:  
Subject: Re: kern/183391: [ixgbe] 10gigabit networking problems with Emulex OCE 11102 CNA
Date: Sat, 2 Nov 2013 13:11:49 +0100

 --Apple-Mail=_6FE46762-5B52-4C3F-8C0F-4A4AEB8D919B
 Content-Transfer-Encoding: 7bit
 Content-Type: text/plain;
 	charset=us-ascii
 
 I would like to correct this line:
 
 Synopsis:	[ixgbe] 10gigabit networking problems with Emulex OCE 11102 CNA
 
 
 This problem is not related to ixgbe; it is related to oce.
 
 
 Thanks,
 
 Antal Pataki
 --Apple-Mail=_6FE46762-5B52-4C3F-8C0F-4A4AEB8D919B--

From owner-freebsd-net@FreeBSD.ORG  Sat Nov  2 19:50:54 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id B195C335;
 Sat,  2 Nov 2013 19:50:54 +0000 (UTC)
 (envelope-from julian@freebsd.org)
Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 7E70B257E;
 Sat,  2 Nov 2013 19:50:54 +0000 (UTC)
Received: from Julian-MBP3.local ([12.157.112.67]) (authenticated bits=0)
 by vps1.elischer.org (8.14.7/8.14.7) with ESMTP id rA2Johgn037781
 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO);
 Sat, 2 Nov 2013 12:50:44 -0700 (PDT)
 (envelope-from julian@freebsd.org)
Message-ID: <5275578E.40000@freebsd.org>
Date: Sat, 02 Nov 2013 12:50:38 -0700
From: Julian Elischer <julian@freebsd.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9;
 rv:24.0) Gecko/20100101 Thunderbird/24.1.0
MIME-Version: 1.0
To: Andre Oppermann <andre@freebsd.org>, Luigi Rizzo <rizzo@iet.unipi.it>,
 Adrian Chadd <adrian@freebsd.org>, Navdeep Parhar <np@freebsd.org>,
 Randall Stewart <rrs@lakerest.net>,
 "freebsd-net@freebsd.org" <net@freebsd.org>
Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net>
 <526FFED9.1070704@freebsd.org>
 <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com>
 <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org>
 <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org>
 <5270462B.8050305@freebsd.org>
 <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
 <20131030050056.GA84368@onelab2.iet.unipi.it> <52717A62.7040600@freebsd.org>
In-Reply-To: <52717A62.7040600@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Nov 2013 19:50:54 -0000

On 10/30/13, 2:30 PM, Andre Oppermann wrote:
>
> Now ifnet has become very complex and large and should be brought
> back to its original purpose of being the logical layer 3 interface
> abstraction.  There isn't necessarily a 1:1 mapping from one ifnet
> instance to one hardware interface.  In fact there are pure logical
> ifnets (gre, tun, ...), direct hardware ifnets (simple network
> interfaces like fxp(4)), and multiple logical interfaces on top of a
> single hardware interface (vlan, lagg, ...).  Depending on the ifnet's
> purpose the backend can be very different.  Thus I want to decouple
> the current implicit notion of ifnet==hardware with associated queuing
> and such.  Instead it should become a layer 3 abstraction inside the
> kernel again and delegate all lower layers to appropriate protocol,
> layer 2, and hardware specific implementations.

I have thought for a long time that the 'if' should be split in two.

The top half really is common to everything; it is basically what
tun is (or ng_iface, for that matter).
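
To make the split concrete, here is a rough userland sketch in C. It is
only an illustration of the idea, not FreeBSD's struct ifnet and not
anything proposed in this thread: every name in it (l3if, l3if_backend,
hw_softc, stacked_softc, and so on) is hypothetical. The top half holds
the common layer 3 view and counters; the bottom half is a backend that
is either direct "hardware" or a logical interface stacked on a parent.

```c
#include <assert.h>
#include <stddef.h>

struct l3if;

/* Bottom half (hypothetical): whatever actually moves the packet. */
struct l3if_backend {
	const char *be_name;
	int (*be_output)(struct l3if *ifp, const void *pkt, size_t len);
};

/* Top half (hypothetical): the logical layer 3 interface. */
struct l3if {
	const char *if_name;
	const struct l3if_backend *if_be;	/* lower-layer implementation */
	void *if_besc;				/* backend private state */
	unsigned long if_opackets;
	unsigned long if_obytes;
};

/* Common output path: delegate to the backend, then account on success. */
int
l3if_output(struct l3if *ifp, const void *pkt, size_t len)
{
	int error;

	assert(ifp != NULL && ifp->if_be != NULL);
	error = ifp->if_be->be_output(ifp, pkt, len);
	if (error == 0) {
		ifp->if_opackets++;
		ifp->if_obytes += len;
	}
	return (error);
}

/* Backend 1: stand-in for a direct hardware interface; only counts frames. */
struct hw_softc {
	unsigned long tx_frames;
};

static int
hw_output(struct l3if *ifp, const void *pkt, size_t len)
{
	struct hw_softc *sc = ifp->if_besc;

	(void)pkt;
	(void)len;
	sc->tx_frames++;
	return (0);
}

const struct l3if_backend hw_backend = { "hw", hw_output };

/* Backend 2: stand-in for a vlan-like interface stacked on a parent. */
struct stacked_softc {
	struct l3if *parent;
};

static int
stacked_output(struct l3if *ifp, const void *pkt, size_t len)
{
	struct stacked_softc *sc = ifp->if_besc;

	/* A real vlan would add a tag here; this sketch just passes down. */
	return (l3if_output(sc->parent, pkt, len));
}

const struct l3if_backend stacked_backend = { "stacked", stacked_output };
```

With this shape, a pure logical interface such as tun would be one more
backend with no parent and no hardware behind it, while the top half
stays identical for all of them.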


From owner-freebsd-net@FreeBSD.ORG  Sat Nov  2 22:40:02 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@smarthost.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 87B26333
 for <freebsd-net@smarthost.ysv.freebsd.org>;
 Sat,  2 Nov 2013 22:40:02 +0000 (UTC)
 (envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 5BC462C86
 for <freebsd-net@smarthost.ysv.freebsd.org>;
 Sat,  2 Nov 2013 22:40:02 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id rA2Me2R8018040
 for <freebsd-net@freefall.freebsd.org>; Sat, 2 Nov 2013 22:40:02 GMT
 (envelope-from gnats@freefall.freebsd.org)
Received: (from gnats@localhost)
 by freefall.freebsd.org (8.14.7/8.14.7/Submit) id rA2Me28R018039;
 Sat, 2 Nov 2013 22:40:02 GMT (envelope-from gnats)
Date: Sat, 2 Nov 2013 22:40:02 GMT
Message-Id: <201311022240.rA2Me28R018039@freefall.freebsd.org>
To: freebsd-net@FreeBSD.org
Cc: 
From: Mohamad Aghakhani <www.mohamad607@gmail.com>
Subject: Re: kern/172683: [ip6] Duplicate IPv6 Link Local Addresses
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: Mohamad Aghakhani <www.mohamad607@gmail.com>
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Nov 2013 22:40:02 -0000

The following reply was made to PR kern/172683; it has been noted by GNATS.

From: Mohamad Aghakhani <www.mohamad607@gmail.com>
To: bug-followup@FreeBSD.org, doug@lafn.org
Cc:  
Subject: Re: kern/172683: [ip6] Duplicate IPv6 Link Local Addresses
Date: Sun, 3 Nov 2013 02:02:34 +0330

 --089e0139ffb864788604ea394257
 Content-Type: text/plain; charset=ISO-8859-1
 
 
 --089e0139ffb864788604ea394257
 Content-Type: text/html; charset=ISO-8859-1
 
 
 --089e0139ffb864788604ea394257--