From owner-freebsd-net@FreeBSD.ORG Sun Oct 27 12:13:47 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 3F3CCF0 for ; Sun, 27 Oct 2013 12:13:47 +0000 (UTC) (envelope-from eocallaghan@alterapraxis.com) Received: from smtp.alterapraxis.com (unknown [101.164.33.212]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 8502A2563 for ; Sun, 27 Oct 2013 12:13:46 +0000 (UTC) Received: from smtp.alterapraxis.com (tony [127.0.0.1]) by smtp.alterapraxis.com (Postfix) with ESMTP id A7948634852 for ; Sun, 27 Oct 2013 23:11:19 +1100 (EST) Received: from tinkerbell.alterapraxis.com (unknown [101.164.33.212]) (using SSLv3 with cipher ECDHE-RSA-AES128-SHA (128/128 bits)) (No client certificate requested) (Authenticated sender: eocallaghan@alterapraxis.com) by smtp.alterapraxis.com (Postfix) with ESMTPSA id 6212E63484A for ; Sun, 27 Oct 2013 23:11:18 +1100 (EST) Date: Sun, 27 Oct 2013 23:13:25 +1100 From: Edward O'Callaghan To: freebsd-net@freebsd.org Subject: re(4) resync. Adds preliminary support for 8168G, 8168EP, 8168GU, 8411B and 8106EUS. 
Message-ID: <20131027231325.2719b3c9.eocallaghan@alterapraxis.com> Organization: Altera Praxis Pty Ltd X-Mailer: Claws Mail 3.9.2 (GTK+ 2.24.22; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: multipart/signed; micalg=PGP-SHA512; boundary="Sig_/+_L+578aFh1LcEL6OaLqt2L"; protocol="application/pgp-signature" X-Virus-Scanned: ClamAV using ClamSMTP X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Oct 2013 12:13:47 -0000 --Sig_/+_L+578aFh1LcEL6OaLqt2L Content-Type: multipart/mixed; boundary="MP_/0wFFVR_dOtdshJmu5wgtM3." --MP_/0wFFVR_dOtdshJmu5wgtM3. Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Hi, This is a follow-up. I have tested most of these NICs now and this patch _should_ be fine to commit to HEAD. Could someone please help me mediate this? This also fixes kern/183167. Please disregard the patches in the PR. Kind Regards, Edward. --MP_/0wFFVR_dOtdshJmu5wgtM3. Content-Type: text/x-patch Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename=0001-re-4-resync.-Adds-preliminary-support-for-8168G-8168.patch =46rom 5357870e5d9129a3f098e48d47e34f1c40924485 Mon Sep 17 00:00:00 2001 From: Edward O'Callaghan Date: Sun, 27 Oct 2013 23:03:53 +1100 Subject: [PATCH] re(4) resync. Adds preliminary support for 8168G, 8168EP, 8168GU, 8411B and 8106EUS. Organization: Altera Praxis Pty Ltd.
Signed-off-by: Edward O'Callaghan
---
 sys/dev/re/if_re.c | 8 ++++++++
 sys/pci/if_rlreg.h | 6 +++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/sys/dev/re/if_re.c b/sys/dev/re/if_re.c
index 381fa87..0de569f 100644
--- a/sys/dev/re/if_re.c
+++ b/sys/dev/re/if_re.c
@@ -234,7 +234,11 @@ static const struct rl_hwrev re_hwrevs[] =3D {
 	{ RL_HWREV_8168E, RL_8169, "8168E/8111E", RL_JUMBO_MTU_9K},
 	{ RL_HWREV_8168E_VL, RL_8169, "8168E/8111E-VL", RL_JUMBO_MTU_6K},
 	{ RL_HWREV_8168F, RL_8169, "8168F/8111F", RL_JUMBO_MTU_9K},
+	{ RL_HWREV_8168G, RL_8169, "8168G/8111G", RL_JUMBO_MTU_9K},
+	{ RL_HWREV_8168EP, RL_8169, "8168EP/8111EP", RL_JUMBO_MTU_9K},
+	{ RL_HWREV_8168GU, RL_8169, "8168GU/8111GU", RL_JUMBO_MTU_9K},
 	{ RL_HWREV_8411, RL_8169, "8411", RL_JUMBO_MTU_9K},
+	{ RL_HWREV_8411B, RL_8169, "8411B", RL_JUMBO_MTU_9K},
 	{ 0, 0, NULL, 0 }
 };
=20
@@ -1451,6 +1455,7 @@ re_attach(device_t dev)
 		    RL_FLAG_DESCV2 | RL_FLAG_MACSTAT | RL_FLAG_AUTOPAD |
 		    RL_FLAG_JUMBOV2 | RL_FLAG_WAIT_TXPOLL | RL_FLAG_WOL_MANLINK;
 		break;
+	case RL_HWREV_8168GU:
 	case RL_HWREV_8168E:
 		sc->rl_flags |=3D RL_FLAG_PHYWAKE | RL_FLAG_PHYWAKE_PM |
 		    RL_FLAG_PAR | RL_FLAG_DESCV2 | RL_FLAG_MACSTAT |
@@ -1458,8 +1463,11 @@ re_attach(device_t dev)
 		    RL_FLAG_WOL_MANLINK;
 		break;
 	case RL_HWREV_8168E_VL:
+	case RL_HWREV_8168EP:
 	case RL_HWREV_8168F:
+	case RL_HWREV_8168G:
 	case RL_HWREV_8411:
+	case RL_HWREV_8411B:
 		sc->rl_flags |=3D RL_FLAG_PHYWAKE | RL_FLAG_PAR |
 		    RL_FLAG_DESCV2 | RL_FLAG_MACSTAT | RL_FLAG_CMDSTOP |
 		    RL_FLAG_AUTOPAD | RL_FLAG_JUMBOV2 |
diff --git a/sys/pci/if_rlreg.h b/sys/pci/if_rlreg.h
index 142fe48..89440e3 100644
--- a/sys/pci/if_rlreg.h
+++ b/sys/pci/if_rlreg.h
@@ -174,7 +174,7 @@
 #define RL_HWREV_8102EL_SPIN1	0x24C00000
 #define RL_HWREV_8168D	0x28000000
 #define RL_HWREV_8168DP	0x28800000
-#define RL_HWREV_8168E	0x2C000000
+#define RL_HWREV_8168E	0x2C000000 /* 8105E */
 #define RL_HWREV_8168E_VL	0x2C800000
 #define RL_HWREV_8168B_SPIN1	0x30000000
 #define RL_HWREV_8100E	0x30800000
@@ -192,6 +192,10 @@
 #define RL_HWREV_8106E	0x44800000
 #define RL_HWREV_8168F	0x48000000
 #define RL_HWREV_8411	0x48800000
+#define RL_HWREV_8411B	0x5C800000
+#define RL_HWREV_8168G	0x4C000000
+#define RL_HWREV_8168EP	0x50000000
+#define RL_HWREV_8168GU	0x50800000 /* 8106EUS */
 #define RL_HWREV_8139	0x60000000
 #define RL_HWREV_8139A	0x70000000
 #define RL_HWREV_8139AG	0x70800000
--=20
1.8.4.1
--MP_/0wFFVR_dOtdshJmu5wgtM3.-- --Sig_/+_L+578aFh1LcEL6OaLqt2L Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAEBCgAGBQJSbQNqAAoJENeyf/ug44dtLmUP/3yHCmLJEt2EH26TtxHj+Ozq GpIWSQu7y2kHGzq4co5EFzD0pY7R/6sUjesO9R8Cudfyp99/dud+wvpMzGI+uXpq v90uNk2gwZp7OWuToK7d5h4zs171eshdnWZyBGmTtR7RjfZIWtNBVeOba8Bm+RNG rg+NiSjSQdIhro3PMFToSLqoPZMGavB7G3Wd5oRCbtHaVNOC4bLBNE9ShB8IzShX RWocmGcQIvvBO3rI27npmwQB0nwo1liLdxhsrL1dt0Px7WLPlZy4+Z12pVxQ/9VT F9M7RpsBcSIfGKxzwZuQRL8NeUMGxHJIk5z3WNCyEJpjy4N2xF/b4rxZETpgXGyy cUoAs4QBKvwA+g0OPhwQXVR8gkRUQ3dZWPbc30aRlZQyRONvKU7CQhENB6jlsyIS mq+W5NAdh6kOeB3oUkcwOlwDfdR2BsJBneklIwww68VcZhfTTT91ifDUUpyEXlVo 3aozV1zdBQFlWg7mdFW42SzgEUTD+yyRwtzXx/F8F+zhrG7pM9fDrF2oWfWptKsC Bnh+mqb8+wKpRUsFo44S7wNBNJS6LXWSEvvsZ4liUmo6GfSKKxintWiVRhQkv00T ucNrIatOeADR4nuXEiJMnBqTFxf4prfobK3+D/KYx/dhXGOR60/RYxdQ45KHoroD aCQcdq0/uaP+pI8V7Rtp =FBep -----END PGP SIGNATURE----- --Sig_/+_L+578aFh1LcEL6OaLqt2L-- From owner-freebsd-net@FreeBSD.ORG Mon Oct 28 02:27:28 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 5DDEC673 for ; Mon, 28 Oct 2013 02:27:28 +0000 (UTC) (envelope-from pyunyh@gmail.com) Received: from mail-pb0-x231.google.com (mail-pb0-x231.google.com [IPv6:2607:f8b0:400e:c01::231]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate
requested) by mx1.freebsd.org (Postfix) with ESMTPS id 346A12AF8 for ; Mon, 28 Oct 2013 02:27:28 +0000 (UTC) Received: by mail-pb0-f49.google.com with SMTP id xb4so2986996pbc.22 for ; Sun, 27 Oct 2013 19:27:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:date:to:cc:subject:message-id:reply-to:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=R+FgeKFNk546gakYuAVJarxmq84daGBQv/PrvWJrpvc=; b=OPiXhWDiNm80zEWd4/98XWJqO4aFQ9fgqF1VijZGznW3DZaZJHrKVbYGqpNMaBGU94 4zkp1b8mTnSMnKbczx31bjMp1U01cTx9k9q3ypCnqMkeUf8PULzVa17KetrqtrSzm+LQ y/9gY5KNzNuJl+uJdXZ+699N+xWDEq1Ltm1p0NrmCftvZGThpUGQU8s8UmFcQufQS9n8 ytTGxakxdTIIM/XMw/k+aUxKfV3f+j0I1NGsTF4Pt0DBGPgwqg8Gp/HaOm4e8J03ypla +b1lpmi0PTS5fQtT9octmZ4HWBFg7cxFnkPvLdD3QaLqIrMLHw/WVzejwj+1kFhzn6HH OTEg== X-Received: by 10.66.163.164 with SMTP id yj4mr23419537pab.91.1382927246133; Sun, 27 Oct 2013 19:27:26 -0700 (PDT) Received: from pyunyh@gmail.com (lpe4.p59-icn.cdngp.net. [114.111.62.249]) by mx.google.com with ESMTPSA id yh1sm24865208pbc.21.2013.10.27.19.27.23 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Sun, 27 Oct 2013 19:27:25 -0700 (PDT) Received: by pyunyh@gmail.com (sSMTP sendmail emulation); Mon, 28 Oct 2013 11:27:23 +0900 From: Yonghyeon PYUN Date: Mon, 28 Oct 2013 11:27:23 +0900 To: Edward O'Callaghan Subject: Re: re(4) resync. Adds preliminary support for 8168G, 8168EP, 8168GU, 8411B and 8106EUS. 
Message-ID: <20131028022723.GA4367@michelle.cdnetworks.com> References: <20131027231325.2719b3c9.eocallaghan@alterapraxis.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131027231325.2719b3c9.eocallaghan@alterapraxis.com> User-Agent: Mutt/1.4.2.3i Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: pyunyh@gmail.com List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Oct 2013 02:27:28 -0000 On Sun, Oct 27, 2013 at 11:13:25PM +1100, Edward O'Callaghan wrote: > Hi, > > This is a follow-up. I have tested most of these NICs now and this > patch _should_ be fine to commit to HEAD. Could someone please help me > mediate this? This also fixes kern/183167. Please disregard the > patches in the PR. > I can handle this. Actually I had been working on supporting these newer controllers for a while. It seems just adding the 8168GU ID does not work. Did you test the patch on an 8168GU controller? If yes, please let me know the OUI ID and model number of the PHY.
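Note for readers of this thread (not part of the original message): the patch works by adding rows to the re_hwrevs[] table, which the driver appears to search by hardware revision ID. The sketch below illustrates that kind of lookup using only the revision IDs quoted in the patch. It is an illustration under stated assumptions, not the driver's code: HWREV_MASK, struct hwrev_entry and hwrev_lookup() are hypothetical names, and the mask value is assumed rather than copied from if_rlreg.h.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: bits of the TX config value that carry the hardware revision. */
#define HWREV_MASK	0x7CC00000u

/* Hypothetical, simplified stand-in for the driver's revision table rows. */
struct hwrev_entry {
	uint32_t	 rev;	/* revision ID, as listed in the patch */
	const char	*desc;	/* chip name used for display only */
};

/* Revision IDs taken from the if_rlreg.h hunk of the patch. */
static const struct hwrev_entry hwrev_table[] = {
	{ 0x48000000u, "8168F/8111F" },
	{ 0x48800000u, "8411" },
	{ 0x4C000000u, "8168G/8111G" },
	{ 0x50000000u, "8168EP" },
	{ 0x50800000u, "8168GU" },
	{ 0x5C800000u, "8411B" },
	{ 0, NULL }			/* terminator, like the real table */
};

/*
 * Mask off the non-revision bits, then walk the table until the
 * terminator. Returns the chip name, or NULL for an unknown revision.
 */
static const char *
hwrev_lookup(uint32_t txcfg)
{
	const struct hwrev_entry *e;
	uint32_t rev = txcfg & HWREV_MASK;

	for (e = hwrev_table; e->desc != NULL; e++) {
		if (e->rev == rev)
			return (e->desc);
	}
	return (NULL);
}
```

With this sketch, hwrev_lookup(0x4C000000) finds the 8168G row, and a revision that has no row returns NULL. A table hit only identifies the MAC revision; as the reply above points out, that alone may not be enough for a controller to work, since the PHY is handled separately.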
From owner-freebsd-net@FreeBSD.ORG Mon Oct 28 05:48:49 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id BB481B22 for ; Mon, 28 Oct 2013 05:48:49 +0000 (UTC) (envelope-from eocallaghan@alterapraxis.com) Received: from smtp.alterapraxis.com (unknown [101.164.33.212]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 73ABB238E for ; Mon, 28 Oct 2013 05:48:49 +0000 (UTC) Received: from smtp.alterapraxis.com (tony [127.0.0.1]) by smtp.alterapraxis.com (Postfix) with ESMTP id 1AF67634852; Mon, 28 Oct 2013 16:46:27 +1100 (EST) Received: from tinkerbell.alterapraxis.com (unknown [101.164.33.212]) (using SSLv3 with cipher ECDHE-RSA-AES128-SHA (128/128 bits)) (No client certificate requested) (Authenticated sender: eocallaghan@alterapraxis.com) by smtp.alterapraxis.com (Postfix) with ESMTPSA id 1F51463484A; Mon, 28 Oct 2013 16:46:25 +1100 (EST) Date: Mon, 28 Oct 2013 16:48:35 +1100 From: Edward O'Callaghan To: pyunyh@gmail.com Subject: Re: re(4) resync. Adds preliminary support for 8168G, 8168EP, 8168GU, 8411B and 8106EUS. 
Message-ID: <20131028164835.298646d5.eocallaghan@alterapraxis.com> In-Reply-To: <20131028022723.GA4367@michelle.cdnetworks.com> References: <20131027231325.2719b3c9.eocallaghan@alterapraxis.com> <20131028022723.GA4367@michelle.cdnetworks.com> Organization: Altera Praxis Pty Ltd X-Mailer: Claws Mail 3.9.2 (GTK+ 2.24.22; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: multipart/signed; micalg=PGP-SHA512; boundary="Sig_/0tWS0zraPu/45De8hAEzUAu"; protocol="application/pgp-signature" X-Virus-Scanned: ClamAV using ClamSMTP Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Oct 2013 05:48:49 -0000 --Sig_/0tWS0zraPu/45De8hAEzUAu Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable On Mon, 28 Oct 2013 11:27:23 +0900 Yonghyeon PYUN wrote: > On Sun, Oct 27, 2013 at 11:13:25PM +1100, Edward O'Callaghan wrote: > > Hi, > >=20 > > This is a follow-up. I have tested most of these NICs now and this > > patch _should_ be fine to commit to HEAD. Could someone please help > > me mediate this? This also fixes kern/183167. Please disregard the > > patches in the PR. > >=20 >=20 > I can handle this. Actually I had been working on supporting these > newer controllers for a while. It seems just adding the 8168GU ID does > not work. Did you test the patch on an 8168GU controller? > If yes, please let me know the OUI ID and model number of the PHY. Hi Yonghyeon, Many thanks! Not the 8168GU, however I did find out that it's the same as an 8106EUS. I don't know if this may shed some light if you have the hw to test it. What exactly did not work about the 8168GU, what is it doing? My main concern is to get a board here working that has an 8168G onboard. Kind Regards, Edward.
--Sig_/0tWS0zraPu/45De8hAEzUAu Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAEBCgAGBQJSbfq3AAoJENeyf/ug44dtJnUP/2dax3g2HyZf5EBa82EjEPRK JsWqJGCXvBnfcwRouOKr12hdZdNwPI7kmmjDxHIc5BH66hdPbSrvtvdh0aa4daSm BTyv2Ycdj36I7znZcWsGkeZ5NHL+iwk0o7tnRpOqp8g111/fVDLwuzMij/wULU6c e77G1Z5V31g6t/DENh0UOBbayJ/3NJ0twgdLwoQewdbA2UYk6IhJeA6gOFGSwJC1 7TMuLO/CLnY6wUU8x7rLtGJb7HOftIjUqmYlR6rmUdSJyrmiHTBaZ5R+JxTgcJJn S4GVvFJC96e9eV8sbsq1SjV0ExkDO43tnLh8q5b/OFTMmSMcoUANVExml3JvWPuk vLmgCr82YTJCNHNnyjD0jmTuzZW4eqRw/WrdQ+z/spu1vmxus6HqHZpy+dBFmjoX INniKCYmqsJenpPTPNxdxpTOyj9woR74UAzb19fXlmJ7IuobtPD181lw64rrb8+K jdsk0h/yLk9KkDpWdb0LXS0XAfIq0Ky1jYSX6VTVxUFqKPso4pHvImtzQXrn6MCM ma9pYbMcISgi7pAiVnXzK8AY0Xk/txwPCbHP+dnptU2QDMe1N2qkwikAigs01Ybq cQfEtEIomHzCHqmaAk+ijHi1FmV5XssNxjyRrlFhDnokUxOf5gA4uU+96yH8ghJt 3x1J+4ndt2be3vkb95Ya =+ZPK -----END PGP SIGNATURE----- --Sig_/0tWS0zraPu/45De8hAEzUAu-- From owner-freebsd-net@FreeBSD.ORG Mon Oct 28 06:11:08 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 17D5F1D4 for ; Mon, 28 Oct 2013 06:11:08 +0000 (UTC) (envelope-from pyunyh@gmail.com) Received: from mail-pd0-x230.google.com (mail-pd0-x230.google.com [IPv6:2607:f8b0:400e:c02::230]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id E33F424B5 for ; Mon, 28 Oct 2013 06:11:07 +0000 (UTC) Received: by mail-pd0-f176.google.com with SMTP id g10so6587459pdj.35 for ; Sun, 27 Oct 2013 23:11:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:date:to:cc:subject:message-id:reply-to:references:mime-version 
:content-type:content-disposition:in-reply-to:user-agent; bh=8/aUAwTJavH4cys1CiDzmNREE/RXpzy/akL3rOs7IEA=; b=vsXHlpD56uGlt4WiKLALZXfLx/IPJp7PuPWEauIjxO/9Df2gzmXqkY8NtW5BhIuw8O nXutdU7xWnRNn/wLz7GmdW/jlxLmgEI2mCGuLowfskAeXJ7dwf7NG1dr7ojxnvOO0gGI C6pINwEgMyqFND4b9M9q/5AUvHrtI+99b67n19HdHRavxSTnMS/c2uLOb58pgA11gONU FzTh1L7PshUYhKjn1AV0TIEv/UFynyfCDbk5vJ5tjGKlM2+19oYB6xxHzB5QarOMaxYM EslOfispjZCf9n9MlHS3tcj0vs4Pk/fTDGxRswg+dVIwwr9XJkke1EqEsUuAWPeBM25+ Ud8g== X-Received: by 10.66.149.231 with SMTP id ud7mr24221077pab.8.1382940666472; Sun, 27 Oct 2013 23:11:06 -0700 (PDT) Received: from pyunyh@gmail.com (lpe4.p59-icn.cdngp.net. [114.111.62.249]) by mx.google.com with ESMTPSA id lm2sm32345952pab.2.2013.10.27.23.11.03 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Sun, 27 Oct 2013 23:11:05 -0700 (PDT) Received: by pyunyh@gmail.com (sSMTP sendmail emulation); Mon, 28 Oct 2013 15:11:00 +0900 From: Yonghyeon PYUN Date: Mon, 28 Oct 2013 15:11:00 +0900 To: Edward O'Callaghan Subject: Re: re(4) resync. Adds preliminary support for 8168G, 8168EP, 8168GU, 8411B and 8106EUS. 
Message-ID: <20131028061100.GC1350@michelle.cdnetworks.com> References: <20131027231325.2719b3c9.eocallaghan@alterapraxis.com> <20131028022723.GA4367@michelle.cdnetworks.com> <20131028164835.298646d5.eocallaghan@alterapraxis.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131028164835.298646d5.eocallaghan@alterapraxis.com> User-Agent: Mutt/1.4.2.3i Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: pyunyh@gmail.com List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Oct 2013 06:11:08 -0000 On Mon, Oct 28, 2013 at 04:48:35PM +1100, Edward O'Callaghan wrote: > On Mon, 28 Oct 2013 11:27:23 +0900 > Yonghyeon PYUN wrote: > > > On Sun, Oct 27, 2013 at 11:13:25PM +1100, Edward O'Callaghan wrote: > > > Hi, > > > > > > This is a follow-up. I have tested most of these NICs now and this > > > patch _should_ be fine to commit to HEAD. Could someone please help > > > me mediate this? This also fixes kern/183167. Please disregard the > > > patches in the PR. > > > > > > > I can handle this. Actually I had been working on supporting these > > newer controllers for a while. It seems just adding the 8168GU ID does > > not work. Did you test the patch on an 8168GU controller? > > If yes, please let me know the OUI ID and model number of the PHY. > > Hi Yonghyeon, > > Many thanks! Not the 8168GU, however I did find out that it's the same > as an 8106EUS. I don't know if this may shed some light if you have the > hw to test it. What exactly did not work about the 8168GU, what is it > doing? Intermittent packet drops and a slightly high number of RX interrupts. > > My main concern is to get a board here working that has an 8168G onboard. > Just adding the RTL8168G ID would use ukphy(4). Probably rgephy(4) should be taught to pick up the PHY, but I don't have a copy of the data sheet.
I'm testing patched rgephy(4) at this moment so give me some time. > Kind Regards, > Edward. From owner-freebsd-net@FreeBSD.ORG Mon Oct 28 11:06:52 2013 Return-Path: Delivered-To: freebsd-net@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id BA650AEC for ; Mon, 28 Oct 2013 11:06:52 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id A17A62475 for ; Mon, 28 Oct 2013 11:06:52 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r9SB6qDj055167 for ; Mon, 28 Oct 2013 11:06:52 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r9SB6qPn055165 for freebsd-net@FreeBSD.org; Mon, 28 Oct 2013 11:06:52 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 28 Oct 2013 11:06:52 GMT Message-Id: <201310281106.r9SB6qPn055165@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-net@FreeBSD.org Subject: Current problem reports assigned to freebsd-net@FreeBSD.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Oct 2013 11:06:52 -0000 Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. 
These represent problem reports covering all versions including experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/182847  net        [netinet6] [patch] Remove dead code
o kern/182665  net        [wlan] Kernel panic when creating second wlandev.
o kern/182382  net        [tcp] sysctl to set TCP CC method on BIG ENDIAN system
o kern/182297  net        [cm] ArcNet driver fails to detect the link address -
o kern/182212  net        [patch] [ng_mppc] ng_mppc(4) blocks on network errors
o kern/181970  net        [re] LAN Realtek® 8111G is not supported by re driver
o kern/181931  net        [vlan] [lagg] vlan over lagg over mlxen crashes the ke
o kern/181823  net        [ip6] [patch] make ipv6 mroute return same errror code
o kern/181741  net        [kernel] [patch] Packet loss when 'control' messages a
o kern/181703  net        [re] [patch] Fix Realtek 8111G Ethernet controller not
o kern/181657  net        [bpf] [patch] BPF_COP/BPF_COPX instruction reservation
o kern/181257  net        [bge] bge link status change
o kern/181236  net        [igb] igb driver unstable work
o kern/181225  net        [infiniband] [patch] unloading ipoib crashes the kerne
o kern/181135  net        [netmap] [patch] sys/dev/netmap patch for Linux compat
o kern/181131  net        [netmap] [patch] sys/dev/netmap memory allocation impr
o kern/181006  net        [run] [patch] mbuf leak in run(4) driver
o kern/180893  net        [if_ethersubr] [patch] Packets received with own LLADD
o kern/180844  net        [panic] [re] Intermittent panic (re driver?)
o kern/180775  net        [bxe] if_bxe driver broken with Broadcom BCM57711 card
o kern/180722  net        [bluetooth] bluetooth takes 30-50 attempts to pair to
s kern/180468  net        [request] LOCAL_PEERCRED support for PF_INET
o kern/180065  net        [netinet6] [patch] Multicast loopback to own host brok
o kern/179926  net        [lacp] [patch] active aggregator selection bug
o kern/179824  net        [ixgbe] System (9.1-p4) hangs on heavy ixgbe network t
o kern/179733  net        [lagg] [patch] interface loses capabilities when proto
o kern/179429  net        [tap] STP enabled tap bridge
o kern/179299  net        [igb] Intel X540-T2 - unstable driver
a kern/179264  net        [vimage] [pf] Core dump with Packet filter and VIMAGE
o kern/178947  net        [arp] arp rejecting not working
o kern/178782  net        [ixgbe] 82599EB SFP does not work with passthrough und
o kern/178612  net        [run] kernel panic due the problems with run driver
o kern/178472  net        [ip6] [patch] make return code consistent with IPv4 co
o kern/178079  net        [tcp] Switching TCP CC algorithm panics on sparc64 wit
s kern/178071  net        FreeBSD unable to recongize Kontron (Industrial Comput
o kern/177905  net        [xl] [panic] ifmedia_set when pluging CardBus LAN card
o kern/177618  net        [bridge] Problem with bridge firewall with trunk ports
o kern/177417  net        [ip6] Invalid protocol value in ipsec6_common_input_cb
o kern/177402  net        [igb] [pf] problem with ethernet driver igb + pf / alt
o kern/177400  net        [jme] JMC25x 1000baseT establishment issues
o kern/177366  net        [ieee80211] negative malloc(9) statistics for 80211nod
f kern/177362  net        [netinet] [patch] Wrong control used to return TOS
o kern/177194  net        [netgraph] Unnamed netgraph nodes for vlan interfaces
o kern/177184  net        [bge] [patch] enable wake on lan
o kern/177139  net        [igb] igb drops ethernet ports 2 and 3
o kern/176884  net        [re] re0 flapping up/down
o kern/176671  net        [epair] MAC address for epair device not unique
o kern/176484  net        [ipsec] [enc] [patch] panic: IPsec + enc(4); device na
o kern/176446  net        [netinet] [patch] Concurrency in ixgbe driving out-of-
o kern/176420  net        [kernel] [patch] incorrect errno for LOCAL_PEERCRED
o kern/176419  net        [kernel] [patch] socketpair support for LOCAL_PEERCRED
o kern/176401  net        [netgraph] page fault in netgraph
o kern/176167  net        [ipsec][lagg] using lagg and ipsec causes immediate pa
o kern/176027  net        [em] [patch] flow control systcl consistency for em dr
o kern/176026  net        [tcp] [patch] TCP wrappers caused quite a lot of warni
o kern/175864  net        [re] Intel MB D510MO, onboard ethernet not working aft
o kern/175852  net        [amd64] [patch] in_cksum_hdr() behaves differently on
o kern/175734  net        no ethernet detected on system with EG20T PCH chipset
o kern/175267  net        [pf] [tap] pf + tap keep state problem
o kern/175236  net        [epair] [gif] epair and gif Devices On Bridge
o kern/175182  net        [panic] kernel panic on RADIX_MPATH when deleting rout
o kern/175153  net        [tcp] will there miss a FIN when do TSO?
o kern/174959  net        [net] [patch] rnh_walktree_from visits spurious nodes
o kern/174958  net        [net] [patch] rnh_walktree_from makes unreasonable ass
o kern/174897  net        [route] Interface routes are broken
o kern/174851  net        [bxe] [patch] UDP checksum offload is wrong in bxe dri
o kern/174850  net        [bxe] [patch] bxe driver does not receive multicasts
o kern/174849  net        [bxe] [patch] bxe driver can hang kernel when reset
o kern/174822  net        [tcp] Page fault in tcp_discardcb under high traffic
o kern/174602  net        [gif] [ipsec] traceroute issue on gif tunnel with ipse
o kern/174535  net        [tcp] TCP fast retransmit feature works strange
o kern/173871  net        [gif] process of 'ifconfig gif0 create hangs' when if_
o kern/173475  net        [tun] tun(4) stays opened by PID after process is term
o kern/173201  net        [ixgbe] [patch] Missing / broken ixgbe sysctl's and tu
o kern/173137  net        [em] em(4) unable to run at gigabit with 9.1-RC2
o kern/173002  net        [patch] data type size problem in if_spppsubr.c
o kern/172895  net        [ixgb] [ixgbe] do not properly determine link-state
o kern/172683  net        [ip6] Duplicate IPv6 Link Local Addresses
o kern/172675  net        [netinet] [patch] sysctl_tcp_hc_list (net.inet.tcp.hos
p kern/172113  net        [panic] [e1000] [patch] 9.1-RC1/amd64 panices in igb(4
o kern/171840  net        [ip6] IPv6 packets transmitting only on queue 0
o kern/171739  net        [bce] [panic] bce related kernel panic
o kern/171711  net        [dummynet] [panic] Kernel panic in dummynet
o kern/171532  net        [ndis] ndis(4) driver includes 'pccard'-specific code,
o kern/171531  net        [ndis] undocumented dependency for ndis(4)
o kern/171524  net        [ipmi] ipmi driver crashes kernel by reboot or shutdow
s kern/171508  net        [epair] [request] Add the ability to name epair device
o kern/171228  net        [re] [patch] if_re - eeprom write issues
o kern/170701  net        [ppp] killl ppp or reboot with active ppp connection c
o kern/170267  net        [ixgbe] IXGBE_LE32_TO_CPUS is probably an unintentiona
o kern/170081  net        [fxp] pf/nat/jails not working if checksum offloading
o kern/169898  net        ifconfig(8) fails to set MTU on multiple interfaces.
o kern/169676  net        [bge] [hang] system hangs, fully or partially after re
o kern/169620  net        [ng] [pf] ng_l2tp incoming packet bypass pf firewall
o kern/169459  net        [ppp] umodem/ppp/3g stopped working after update from
o kern/169438  net        [ipsec] ipv4-in-ipv6 tunnel mode IPsec does not work
p kern/168294  net        [ixgbe] [patch] ixgbe driver compiled in kernel has no
o kern/168246  net        [em] Multiple em(4) not working with qemu
o kern/168245  net        [arp] [regression] Permanent ARP entry not deleted on
o kern/168244  net        [arp] [regression] Unable to manually remove permanent
o kern/168183  net        [bce] bce driver hang system
o kern/167603  net        [ip] IP fragment reassembly's broken: file transfer ov
o kern/167500  net        [em] [panic] Kernel panics in em driver
o kern/167325  net        [netinet] [patch] sosend sometimes return EINVAL with
o kern/167202  net        [igmp]: Sending multiple IGMP packets crashes kernel
o kern/166462  net        [gre] gre(4) when using a tunnel source address from c
o kern/166285  net        [arp] FreeBSD v8.1 REL p8 arp: unknown hardware addres
o kern/166255  net        [net] [patch] It should be possible to disable "promis
p kern/165903  net        mbuf leak
o kern/165622  net        [ndis][panic][patch] Unregistered use of FPU in kernel
s kern/165562  net        [request] add support for Intel i350 in FreeBSD 7.4
o kern/165526  net        [bxe] UDP packets checksum calculation whithin if_bxe
o kern/165488  net        [ppp] [panic] Fatal trap 12 jails and ppp , kernel wit
o kern/165305  net        [ip6] [request] Feature parity between IP_TOS and IPV6
o kern/165296  net        [vlan] [patch] Fix EVL_APPLY_VLID, update EVL_APPLY_PR
o kern/165181  net        [igb] igb freezes after about 2 weeks of uptime
o kern/165174  net        [patch] [tap] allow tap(4) to keep its address on clos
o kern/165152  net        [ip6] Does not work through the issue of ipv6 addresse
o kern/164495  net        [igb] connect double head igb to switch cause system t
o kern/164490  net        [pfil] Incorrect IP checksum on pfil pass from ip_outp
o kern/164475  net        [gre] gre misses RUNNING flag after a reboot
o kern/164265  net        [netinet] [patch] tcp_lro_rx computes wrong checksum i
o kern/163903  net        [igb] "igb0:tx(0)","bpf interface lock" v2.2.5 9-STABL
o kern/163481  net        freebsd do not add itself to ping route packet
o kern/162927  net        [tun] Modem-PPP error ppp[1538]: tun0: Phase: Clearing
o kern/162558  net        [dummynet] [panic] seldom dummynet panics
o kern/162153  net        [em] intel em driver 7.2.4 don't compile
o kern/162110  net        [igb] [panic] RELENG_9 panics on boot in IGB driver -
o kern/162028  net        [ixgbe] [patch] misplaced #endif in ixgbe.c
o kern/161277  net        [em] [patch] BMC cannot receive IPMI traffic after loa
o kern/160873  net        [igb] igb(4) from HEAD fails to build on 7-STABLE
o kern/160750  net        Intel PRO/1000 connection breaks under load until rebo
o kern/160693  net        [gif] [em] Multicast packet are not passed from GIF0 t
o kern/160293  net        [ieee80211] ppanic] kernel panic during network setup
o kern/160206  net        [gif] gifX stops working after a while (IPv6 tunnel)
o kern/159817  net        [udp] write UDPv4: No buffer space available (code=55)
o kern/159629  net        [ipsec] [panic] kernel panic with IPsec in transport m
o kern/159621  net        [tcp] [panic] panic: soabort: so_count
o kern/159603  net        [netinet] [patch] in_ifscrubprefix() - network route c
o kern/159601  net        [netinet] [patch] in_scrubprefix() - loopback route re
o kern/159294  net        [em] em watchdog timeouts
o kern/159203  net        [wpi] Intel 3945ABG Wireless LAN not support IBSS
o kern/158930  net        [bpf] BPF element leak in ifp->bpf_if->bif_dlist
o kern/158726  net        [ip6] [patch] ICMPv6 Router Announcement flooding limi
o kern/158694  net        [ix] [lagg] ix0 is not working within lagg(4)
o kern/158665  net        [ip6] [panic] kernel pagefault in in6_setscope()
o kern/158635  net        [em] TSO breaks BPF packet captures with em driver
f kern/157802  net        [dummynet] [panic] kernel panic in dummynet
o kern/157785  net        amd64 + jail + ipfw + natd = very slow outbound traffi
o kern/157418  net        [em] em driver lockup during boot on Supermicro X9SCM-
o kern/157410  net        [ip6] IPv6 Router Advertisements Cause Excessive CPU U
o kern/157287  net        [re] [panic] INVARIANTS panic (Memory modified after f
o kern/157200  net        [network.subr] [patch] stf(4) can not communicate betw
o kern/157182  net        [lagg] lagg interface not working together with epair
o kern/156877  net        [dummynet] [panic] dummynet move_pkt() null ptr derefe
o kern/156667  net        [em] em0 fails to init on CURRENT after March 17
o kern/156408  net        [vlan] Routing failure when using VLANs vs. Physical e
o kern/156328  net        [icmp]: host can ping other subnet but no have IP from
o kern/156317  net        [ip6] Wrong order of IPv6 NS DAD/MLD Report
o kern/156283  net        [ip6] [patch] nd6_ns_input - rtalloc_mpath does not re
o kern/156279  net        [if_bridge][divert][ipfw] unable to correctly re-injec
o kern/156226  net        [lagg]: failover does not announce the failover to swi
o kern/156030  net        [ip6] [panic] Crash in nd6_dad_start() due to null ptr
o kern/155680  net        [multicast] problems with multicast
s kern/155642  net        [new driver] [request] Add driver for Realtek RTL8191S
o kern/155597  net        [panic] Kernel panics with "sbdrop" message
o kern/155420  net        [vlan] adding vlan break existent vlan
o kern/155177  net        [route] [panic] Panic when inject routes in kernel
o kern/155010  net        [msk] ntfs-3g via iscsi using msk driver cause kernel
o kern/154943  net        [gif] ifconfig gifX create on existing gifX clears IP
s kern/154851  net        [new driver] [request]: Port brcm80211 driver from Lin
o kern/154850  net        [netgraph] [patch] ng_ether fails to name nodes when t
o kern/154679  net        [em] Fatal trap 12: "em1 taskq" only at startup (8.1-R
o kern/154600  net        [tcp] [panic] Random kernel panics on tcp_output
o kern/154557  net        [tcp] Freeze tcp-session of the clients, if in the gat
o kern/154443  net        [if_bridge] Kernel module bridgestp.ko missing after u
o kern/154286  net        [netgraph] [panic] 8.2-PRERELEASE panic in netgraph
o kern/154255  net        [nfs] NFS not responding
o kern/154214  net        [stf] [panic] Panic when creating stf interface
o kern/154185  net        race condition in mb_dupcl
p kern/154169  net        [multicast] [ip6] Node Information Query multicast add
o kern/154134  net        [ip6] stuck kernel state in LISTEN on ipv6 daemon whic
o kern/154091  net        [netgraph] [panic] netgraph, unaligned mbuf?
o conf/154062  net        [vlan] [patch] change to way of auto-generatation of v
o kern/153937  net        [ral] ralink panics the system (amd64 freeBSDD 8.X) wh
o kern/153936  net        [ixgbe] [patch] MPRC workaround incorrectly applied to
o kern/153816  net        [ixgbe] ixgbe doesn't work properly with the Intel 10g
o kern/153772  net        [ixgbe] [patch] sysctls reference wrong XON/XOFF varia
o kern/153497  net        [netgraph] netgraph panic due to race conditions
o kern/153454  net        [patch] [wlan] [urtw] Support ad-hoc and hostap modes
o kern/153308  net        [em] em interface use 100% cpu
o kern/153244  net        [em] em(4) fails to send UDP to port 0xffff
o kern/152893  net        [netgraph] [panic] 8.2-PRERELEASE panic in netgraph
o kern/152853  net        [em] tftpd (and likely other udp traffic) fails over e
o kern/152828  net        [em] poor performance on 8.1, 8.2-PRE
o kern/152569  net        [net]: Multiple ppp connections and routing table prob
o kern/152235  net        [arp] Permanent local ARP entries are not properly upd
o kern/152141  net        [vlan] [patch] encapsulate vlan in ng_ether before out
o kern/152036  net        [libc] getifaddrs(3) returns truncated sockaddrs for n
o kern/151690  net        [ep] network connectivity won't work until dhclient is
o kern/151681  net        [nfs] NFS mount via IPv6 leads to hang on client with
o kern/151593  net        [igb] [panic] Kernel panic when bringing up igb networ
o kern/150920  net        [ixgbe][igb] Panic when packets are dropped with heade
o kern/150557  net        [igb] igb0: Watchdog timeout -- resetting
o kern/150251  net        [patch] [ixgbe] Late cable insertion broken
o kern/150249  net        [ixgbe] Media type detection broken
o bin/150224   net        ppp(8) does not reassign static IP after kill -KILL co
f kern/149969  net        [wlan] [ral] ralink rt2661 fails to maintain connectio
o kern/149643  net        [rum] device not sending proper beacon frames in ap mo
o kern/149609  net        [panic] reboot after adding second default route
o kern/149117  net        [inet] [patch] in_pcbbind: redundant test
o kern/149086  net        [multicast] Generic multicast join failure in 8.1
o kern/148018
net [flowtable] flowtable crashes on ia64 o kern/147912 net [boot] FreeBSD 8 Beta won't boot on Thinkpad i1300 11 o kern/147894 net [ipsec] IPv6-in-IPv4 does not work inside an ESP-only o kern/147155 net [ip6] setfb not work with ipv6 o kern/146845 net [libc] close(2) returns error 54 (connection reset by f kern/146792 net [flowtable] flowcleaner 100% cpu's core load o kern/146719 net [pf] [panic] PF or dumynet kernel panic o kern/146534 net [icmp6] wrong source address in echo reply o kern/146427 net [mwl] Additional virtual access points don't work on m f kern/146394 net [vlan] IP source address for outgoing connections o bin/146377 net [ppp] [tun] Interface doesn't clear addresses when PPP o kern/146358 net [vlan] wrong destination MAC address o kern/146165 net [wlan] [panic] Setting bssid in adhoc mode causes pani o kern/146082 net [ng_l2tp] a false invaliant check was performed in ng_ o kern/146037 net [panic] mpd + CoA = kernel panic o kern/145825 net [panic] panic: soabort: so_count o kern/145728 net [lagg] Stops working lagg between two servers. 
p kern/145600 net TCP/ECN behaves different to CE/CWR than ns2 reference f kern/144917 net [flowtable] [panic] flowtable crashes system [regressi o kern/144882 net MacBookPro =>4.1 does not connect to BSD in hostap wit o kern/144874 net [if_bridge] [patch] if_bridge frees mbuf after pfil ho o conf/144700 net [rc.d] async dhclient breaks stuff for too many people o kern/144616 net [nat] [panic] ip_nat panic FreeBSD 7.2 f kern/144315 net [ipfw] [panic] freebsd 8-stable reboot after add ipfw o kern/144231 net bind/connect/sendto too strict about sockaddr length o kern/143846 net [gif] bringing gif3 tunnel down causes gif0 tunnel to s kern/143673 net [stf] [request] there should be a way to support multi s kern/143666 net [ip6] [request] PMTU black hole detection not implemen o kern/143622 net [pfil] [patch] unlock pfil lock while calling firewall o kern/143593 net [ipsec] When using IPSec, tcpdump doesn't show outgoin o kern/143591 net [ral] RT2561C-based DLink card (DWL-510) fails to work o kern/143208 net [ipsec] [gif] IPSec over gif interface not working o kern/143034 net [panic] system reboots itself in tcp code [regression] o kern/142877 net [hang] network-related repeatable 8.0-STABLE hard hang o kern/142774 net Problem with outgoing connections on interface with mu o kern/142772 net [libc] lla_lookup: new lle malloc failed f kern/142518 net [em] [lagg] Problem on 8.0-STABLE with em and lagg o kern/142018 net [iwi] [patch] Possibly wrong interpretation of beacon- o kern/141861 net [wi] data garbled with WEP and wi(4) with Prism 2.5 f kern/141741 net Etherlink III NIC won't work after upgrade to FBSD 8, o kern/140742 net rum(4) Two asus-WL167G adapters cannot talk to each ot o kern/140682 net [netgraph] [panic] random panic in netgraph f kern/140634 net [vlan] destroying if_lagg interface with if_vlan membe o kern/140619 net [ifnet] [patch] refine obsolete if_var.h comments desc o kern/140346 net [wlan] High bandwidth use causes loss of wlan connecti o 
kern/140142 net [ip6] [panic] FreeBSD 7.2-amd64 panic w/IPv6 o kern/140066 net [bwi] install report for 8.0 RC 2 (multiple problems) o kern/139387 net [ipsec] Wrong lenth of PF_KEY messages in promiscuous o bin/139346 net [patch] arp(8) add option to remove static entries lis o kern/139268 net [if_bridge] [patch] allow if_bridge to forward just VL p kern/139204 net [arp] DHCP server replies rejected, ARP entry lost bef o kern/139117 net [lagg] + wlan boot timing (EBUSY) o kern/138850 net [dummynet] dummynet doesn't work correctly on a bridge o kern/138782 net [panic] sbflush_internal: cc 0 || mb 0xffffff004127b00 o kern/138688 net [rum] possibly broken on 8 Beta 4 amd64: able to wpa a o kern/138678 net [lo] FreeBSD does not assign linklocal address to loop o kern/138407 net [gre] gre(4) interface does not come up after reboot o kern/138332 net [tun] [lor] ifconfig tun0 destroy causes LOR if_adata/ o kern/138266 net [panic] kernel panic when udp benchmark test used as r f kern/138029 net [bpf] [panic] periodically kernel panic and reboot o kern/137881 net [netgraph] [panic] ng_pppoe fatal trap 12 p bin/137841 net [patch] wpa_supplicant(8) cannot verify SHA256 signed p kern/137776 net [rum] panic in rum(4) driver on 8.0-BETA2 o bin/137641 net ifconfig(8): various problems with "vlan_device.vlan_i o kern/137392 net [ip] [panic] crash in ip_nat.c line 2577 o kern/137372 net [ral] FreeBSD doesn't support wireless interface from o kern/137089 net [lagg] lagg falsely triggers IPv6 duplicate address de o kern/136911 net [netgraph] [panic] system panic on kldload ng_bpf.ko t o kern/136618 net [pf][stf] panic on cloning interface without unit numb o kern/135502 net [periodic] Warning message raised by rtfree function i o kern/134583 net [hang] Machine with jail freezes after random amount o o kern/134531 net [route] [panic] kernel crash related to routes/zebra o kern/134157 net [dummynet] dummynet loads cpu for 100% and make a syst o kern/133969 net [dummynet] [panic] Fatal 
trap 12: page fault while in o kern/133968 net [dummynet] [panic] dummynet kernel panic o kern/133736 net [udp] ip_id not protected ... o kern/133595 net [panic] Kernel Panic at pcpu.h:195 o kern/133572 net [ppp] [hang] incoming PPTP connection hangs the system o kern/133490 net [bpf] [panic] 'kmem_map too small' panic on Dell r900 o kern/133235 net [netinet] [patch] Process SIOCDLIFADDR command incorre f kern/133213 net arp and sshd errors on 7.1-PRERELEASE o kern/133060 net [ipsec] [pfsync] [panic] Kernel panic with ipsec + pfs o kern/132889 net [ndis] [panic] NDIS kernel crash on load BCM4321 AGN d o conf/132851 net [patch] rc.conf(5): allow to setfib(1) for service run o kern/132734 net [ifmib] [panic] panic in net/if_mib.c o kern/132705 net [libwrap] [patch] libwrap - infinite loop if hosts.all o kern/132672 net [ndis] [panic] ndis with rt2860.sys causes kernel pani o kern/132354 net [nat] Getting some packages to ipnat(8) causes crash o kern/132277 net [crypto] [ipsec] poor performance using cryptodevice f o kern/131781 net [ndis] ndis keeps dropping the link o kern/131776 net [wi] driver fails to init o kern/131753 net [altq] [panic] kernel panic in hfsc_dequeue o bin/131365 net route(8): route add changes interpretation of network f kern/130820 net [ndis] wpa_supplicant(8) returns 'no space on device' o kern/130628 net [nfs] NFS / rpc.lockd deadlock on 7.1-R o kern/130525 net [ndis] [panic] 64 bit ar5008 ndisgen-erated driver cau o kern/130311 net [wlan_xauth] [panic] hostapd restart causing kernel pa o kern/130109 net [ipfw] Can not set fib for packets originated from loc f kern/130059 net [panic] Leaking 50k mbufs/hour f kern/129719 net [nfs] [panic] Panic during shutdown, tcp_ctloutput: in o kern/129517 net [ipsec] [panic] double fault / stack overflow f kern/129508 net [carp] [panic] Kernel panic with EtherIP (may be relat o kern/129219 net [ppp] Kernel panic when using kernel mode ppp o kern/129197 net [panic] 7.0 IP stack related panic o bin/128954 
net ifconfig(8) deletes valid routes o bin/128602 net [an] wpa_supplicant(8) crashes with an(4) o kern/128448 net [nfs] 6.4-RC1 Boot Fails if NFS Hostname cannot be res o bin/128295 net [patch] ifconfig(8) does not print TOE4 or TOE6 capabi o bin/128001 net wpa_supplicant(8), wlan(4), and wi(4) issues o kern/127826 net [iwi] iwi0 driver has reduced performance and connecti o kern/127815 net [gif] [patch] if_gif does not set vlan attributes from o kern/127724 net [rtalloc] rtfree: 0xc5a8f870 has 1 refs f bin/127719 net [arp] arp: Segmentation fault (core dumped) f kern/127528 net [icmp]: icmp socket receives icmp replies not owned by p kern/127360 net [socket] TOE socket options missing from sosetopt() o bin/127192 net routed(8) removes the secondary alias IP of interface f kern/127145 net [wi]: prism (wi) driver crash at bigger traffic o kern/126895 net [patch] [ral] Add antenna selection (marked as TBD) o kern/126874 net [vlan]: Zebra problem if ifconfig vlanX destroy o kern/126695 net rtfree messages and network disruption upon use of if_ o kern/126339 net [ipw] ipw driver drops the connection o kern/126075 net [inet] [patch] internet control accesses beyond end of o bin/125922 net [patch] Deadlock in arp(8) o kern/125920 net [arp] Kernel Routing Table loses Ethernet Link status o kern/125845 net [netinet] [patch] tcp_lro_rx() should make use of hard o kern/125258 net [socket] socket's SO_REUSEADDR option does not work o kern/125239 net [gre] kernel crash when using gre o kern/124341 net [ral] promiscuous mode for wireless device ral0 looses o kern/124225 net [ndis] [patch] ndis network driver sometimes loses net o kern/124160 net [libc] connect(2) function loops indefinitely o kern/124021 net [ip6] [panic] page fault in nd6_output() o kern/123968 net [rum] [panic] rum driver causes kernel panic with WPA. 
o kern/123892 net [tap] [patch] No buffer space available o kern/123890 net [ppp] [panic] crash & reboot on work with PPP low-spee o kern/123858 net [stf] [patch] stf not usable behind a NAT o kern/123758 net [panic] panic while restarting net/freenet6 o bin/123633 net ifconfig(8) doesn't set inet and ether address in one o kern/123559 net [iwi] iwi periodically disassociates/associates [regre o bin/123465 net [ip6] route(8): route add -inet6 -interfac o kern/123463 net [ipsec] [panic] repeatable crash related to ipsec-tool o conf/123330 net [nsswitch.conf] Enabling samba wins in nsswitch.conf c o kern/123160 net [ip] Panic and reboot at sysctl kern.polling.enable=0 o kern/122989 net [swi] [panic] 6.3 kernel panic in swi1: net o kern/122954 net [lagg] IPv6 EUI64 incorrectly chosen for lagg devices f kern/122780 net [lagg] tcpdump on lagg interface during high pps wedge o kern/122685 net It is not visible passing packets in tcpdump(1) o kern/122319 net [wi] imposible to enable ad-hoc demo mode with Orinoco o kern/122290 net [netgraph] [panic] Netgraph related "kmem_map too smal o kern/122252 net [ipmi] [bge] IPMI problem with BCM5704 (does not work o kern/122033 net [ral] [lor] Lock order reversal in ral0 at bootup ieee o bin/121895 net [patch] rtsol(8)/rtsold(8) doesn't handle managed netw s kern/121774 net [swi] [panic] 6.3 kernel panic in swi1: net o kern/121555 net [panic] Fatal trap 12: current process = 12 (swi1: net o kern/121534 net [ipl] [nat] FreeBSD Release 6.3 Kernel Trap 12: o kern/121443 net [gif] [lor] icmp6_input/nd6_lookup o kern/121437 net [vlan] Routing to layer-2 address does not work on VLA o bin/121359 net [patch] [security] ppp(8): fix local stack overflow in o kern/121257 net [tcp] TSO + natd -> slow outgoing tcp traffic o kern/121181 net [panic] Fatal trap 3: breakpoint instruction fault whi o kern/120966 net [rum] kernel panic with if_rum and WPA encryption o kern/120566 net [request]: ifconfig(8) make order of arguments more fr o 
kern/120304 net [netgraph] [patch] netgraph source assumes 32-bit time o kern/120266 net [udp] [panic] gnugk causes kernel panic when closing U o bin/120060 net routed(8) deletes link-level routes in the presence of o kern/119945 net [rum] [panic] rum device in hostap mode, cause kernel o kern/119791 net [nfs] UDP NFS mount of aliased IP addresses from a Sol o kern/119617 net [nfs] nfs error on wpa network when reseting/shutdown f kern/119516 net [ip6] [panic] _mtx_lock_sleep: recursed on non-recursi o kern/119432 net [arp] route add -host -iface causes arp e o kern/119225 net [wi] 7.0-RC1 no carrier with Prism 2.5 wifi card [regr o kern/118727 net [netgraph] [patch] [request] add new ng_pf module o kern/117423 net [vlan] Duplicate IP on different interfaces o bin/117339 net [patch] route(8): loading routing management commands o bin/116643 net [patch] [request] fstat(1): add INET/INET6 socket deta o kern/116185 net [iwi] if_iwi driver leads system to reboot o kern/115239 net [ipnat] panic with 'kmem_map too small' using ipnat o kern/115019 net [netgraph] ng_ether upper hook packet flow stops on ad o kern/115002 net [wi] if_wi timeout. failed allocation (busy bit). 
ifco o kern/114915 net [patch] [pcn] pcn (sys/pci/if_pcn.c) ethernet driver f o kern/113432 net [ucom] WARNING: attempt to net_add_domain(netgraph) af o kern/112722 net [ipsec] [udp] IP v4 udp fragmented packet reject o kern/112686 net [patm] patm driver freezes System (FreeBSD 6.2-p4) i38 o bin/112557 net [patch] ppp(8) lock file should not use symlink name o kern/112528 net [nfs] NFS over TCP under load hangs with "impossible p o kern/111537 net [inet6] [patch] ip6_input() treats mbuf cluster wrong o kern/111457 net [ral] ral(4) freeze o kern/110284 net [if_ethersubr] Invalid Assumption in SIOCSIFADDR in et o kern/110249 net [kernel] [regression] [patch] setsockopt() error regre o kern/109470 net [wi] Orinoco Classic Gold PC Card Can't Channel Hop o bin/108895 net pppd(8): PPPoE dead connections on 6.2 [regression] f kern/108197 net [panic] [gif] [ip6] if_delmulti reference counting pan o kern/107944 net [wi] [patch] Forget to unlock mutex-locks o conf/107035 net [patch] bridge(8): bridge interface given in rc.conf n o kern/106444 net [netgraph] [panic] Kernel Panic on Binding to an ip to o kern/106316 net [dummynet] dummynet with multipass ipfw drops packets o kern/105945 net Address can disappear from network interface s kern/105943 net Network stack may modify read-only mbuf chain copies o bin/105925 net problems with ifconfig(8) and vlan(4) [regression] o kern/104851 net [inet6] [patch] On link routes not configured when usi o kern/104751 net [netgraph] kernel panic, when getting info about my tr o kern/104738 net [inet] [patch] Reentrant problem with inet_ntoa in the o kern/103191 net Unpredictable reboot o kern/103135 net [ipsec] ipsec with ipfw divert (not NAT) encodes a pac o kern/102540 net [netgraph] [patch] supporting vlan(4) by ng_fec(4) o conf/102502 net [netgraph] [patch] ifconfig name does't rename netgrap o kern/102035 net [plip] plip networking disables parallel port printing o kern/100709 net [libc] getaddrinfo(3) should return TTL info o 
kern/100519 net [netisr] suggestion to fix suboptimal network polling o kern/98597 net [inet6] Bug in FreeBSD 6.1 IPv6 link-local DAD procedu o bin/98218 net wpa_supplicant(8) blacklist not working o kern/97306 net [netgraph] NG_L2TP locks after connection with failed o conf/97014 net [gif] gifconfig_gif? in rc.conf does not recognize IPv f kern/96268 net [socket] TCP socket performance drops by 3000% if pack o kern/95519 net [ral] ral0 could not map mbuf o kern/95288 net [pppd] [tty] [panic] if_ppp panic in sys/kern/tty_subr o kern/95277 net [netinet] [patch] IP Encapsulation mask_match() return o kern/95267 net packet drops periodically appear f kern/93378 net [tcp] Slow data transfer in Postfix and Cyrus IMAP (wo o kern/93019 net [ppp] ppp and tunX problems: no traffic after restarti o kern/92880 net [libc] [patch] almost rewritten inet_network(3) functi s kern/92279 net [dc] Core faults everytime I reboot, possible NIC issu o kern/91859 net [ndis] if_ndis does not work with Asus WL-138 o kern/91364 net [ral] [wep] WF-511 RT2500 Card PCI and WEP o kern/91311 net [aue] aue interface hanging o kern/87421 net [netgraph] [panic]: ng_ether + ng_eiface + if_bridge o kern/86871 net [tcp] [patch] allocation logic for PCBs in TIME_WAIT s o kern/86427 net [lor] Deadlock with FASTIPSEC and nat o kern/85780 net 'panic: bogus refcnt 0' in routing/ipv6 o bin/85445 net ifconfig(8): deprecated keyword to ifconfig inoperativ o bin/82975 net route change does not parse classfull network as given o kern/82881 net [netgraph] [panic] ng_fec(4) causes kernel panic after o kern/82468 net Using 64MB tcp send/recv buffers, trafficflow stops, i o bin/82185 net [patch] ndp(8) can delete the incorrect entry o kern/81095 net IPsec connection stops working if associated network i o kern/78968 net FreeBSD freezes on mbufs exhaustion (network interface o kern/78090 net [ipf] ipf filtering on bridged packets doesn't work if o kern/77341 net [ip6] problems with IPV6 implementation o kern/75873 
net Usability problem with non-RFC-compliant IP spoof prot s kern/75407 net [an] an(4): no carrier after short time a kern/71474 net [route] route lookup does not skip interfaces marked d o kern/71469 net default route to internet magically disappears with mu o kern/68889 net [panic] m_copym, length > size of mbuf chain o kern/66225 net [netgraph] [patch] extend ng_eiface(4) control message o kern/65616 net IPSEC can't detunnel GRE packets after real ESP encryp s kern/60293 net [patch] FreeBSD arp poison patch a kern/56233 net IPsec tunnel (ESP) over IPv6: MTU computation is wrong s bin/41647 net ifconfig(8) doesn't accept lladdr along with inet addr o kern/39937 net ipstealth issue a kern/38554 net [patch] changing interface ipaddress doesn't seem to w o kern/31940 net ip queue length too short for >500kpps o kern/31647 net [libc] socket calls can return undocumented EINVAL o kern/30186 net [libc] getaddrinfo(3) does not handle incorrect servna f kern/24959 net [patch] proper TCP_NOPUSH/TCP_CORK compatibility o conf/23063 net [arp] [patch] for static ARP tables in rc.network o kern/21998 net [socket] [patch] ident only for outgoing connections o kern/5877 net [socket] sb_cc counts control data as well as data dat 468 problems total. 
From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 06:03:49 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 4D295FF7 for ; Tue, 29 Oct 2013 06:03:49 +0000 (UTC) (envelope-from pyunyh@gmail.com) Received: from mail-oa0-x234.google.com (mail-oa0-x234.google.com [IPv6:2607:f8b0:4003:c02::234]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 127732B9B for ; Tue, 29 Oct 2013 06:03:49 +0000 (UTC) Received: by mail-oa0-f52.google.com with SMTP id j1so1203171oag.39 for ; Mon, 28 Oct 2013 23:03:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:date:to:cc:subject:message-id:reply-to:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=sNGIKSsL8Iil63MEJ62EtFhTZwPvvVBMYYir8pib/xo=; b=Ut5DhMJQf+K7SSErHk5kMQC2orjbTMAqIMZyyGFpoTOxrA5+V8acZrcFKnU3Hrm+i3 Ja9csrFwAQN0ivHg1f3RCCOSvI2Zx1RvC2OHbgPDTzWQy2C4cEmMJg+fUqwohbQZjROw 7leqkzsCkqrA3Z3i+rXQl5QV5tqM1Afmyul4vINQXjMV2ZKR4TvBU6JOUh1xoEiK0Hq3 60f2nyp75RK1Ti78AgYZHa7w4UTFuXLeywbG1nz8iRvn8ZUABb+3WZPtDTYCMJXmW5BM QIqHc4kdLqQ2YhbrXsiDtJYZP1E6CZYhcbxmbgl7B1PgTvjldHwPmBrzeYHVZxeKIAYr rrMQ== X-Received: by 10.182.66.164 with SMTP id g4mr4807666obt.47.1383026627524; Mon, 28 Oct 2013 23:03:47 -0700 (PDT) Received: from pyunyh@gmail.com (lpe4.p59-icn.cdngp.net. [114.111.62.249]) by mx.google.com with ESMTPSA id xx9sm32857193obc.6.2013.10.28.23.03.44 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Mon, 28 Oct 2013 23:03:46 -0700 (PDT) Received: by pyunyh@gmail.com (sSMTP sendmail emulation); Tue, 29 Oct 2013 15:03:40 +0900 From: Yonghyeon PYUN Date: Tue, 29 Oct 2013 15:03:40 +0900 To: Edward O'Callaghan Subject: Re: re(4) resync. 
Adds preliminary support for 8168G, 8168EP, 8168GU, 8411B and 8106EUS. Message-ID: <20131029060340.GA1390@michelle.cdnetworks.com> References: <20131027231325.2719b3c9.eocallaghan@alterapraxis.com> <20131028022723.GA4367@michelle.cdnetworks.com> <20131028164835.298646d5.eocallaghan@alterapraxis.com> <20131028061100.GC1350@michelle.cdnetworks.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131028061100.GC1350@michelle.cdnetworks.com> User-Agent: Mutt/1.4.2.3i Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: pyunyh@gmail.com List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 06:03:49 -0000 On Mon, Oct 28, 2013 at 03:11:00PM +0900, Yonghyeon PYUN wrote: > On Mon, Oct 28, 2013 at 04:48:35PM +1100, Edward O'Callaghan wrote: > > On Mon, 28 Oct 2013 11:27:23 +0900 > > Yonghyeon PYUN wrote: > > > > > On Sun, Oct 27, 2013 at 11:13:25PM +1100, Edward O'Callaghan wrote: > > > > Hi, > > > > > > > > This is a follow up. I have tested most of these NIC's now and this > > > > patch _should_ be fine to commit to HEAD. Could someone please help > > > > me mediate this? This also fixes kern/183167. Please disregards the > > > > patches in the PR. > > > > > > > > > > I can handle this. Actually I had been working on supporting these > > > newer controllers for a while. It seems just adding 8168GU id does > > > not work. Did you test the patch on 8168GU controller? > > > If yes, please let me know the OUI id and model number of the PHY. > > > > Hi Yonghyeon, > > > > Many thanks! Not the 8168GU, however I did find out that its the same > > as a 8106EUS. I don't know if this may shed some light if you have the > > hw to test it.. What exactly did not work about the 8168GU, what is it > > doing? 
> > Intermittent packet drops and slightly high number of RX > interrupts. > > > > > My main concern is to get a board here working that has a 8168G onboard. > > > > Just adding RTL8168G id would use ukpky(4). Probably rgephy(4) > should be taught to pick up the PHY but I don't have copy of data > sheet. I'm testing patched rgephy(4) at this moment so give me some > time. > FYI: Committed in r257304-257306. These commits do not address high number of RX interrupts though. From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 10:51:20 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id BF1A174B for ; Tue, 29 Oct 2013 10:51:20 +0000 (UTC) (envelope-from rrs@lakerest.net) Received: from lakerest.net (lakerest.net [162.235.35.161]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id A20A32B92 for ; Tue, 29 Oct 2013 10:51:19 +0000 (UTC) Received: from [10.1.1.103] (bsd4.lakerest.net [162.235.35.162]) (authenticated bits=0) by lakerest.net (8.14.4/8.14.3) with ESMTP id r9TAouW2068631 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NOT) for ; Tue, 29 Oct 2013 06:50:56 -0400 (EDT) (envelope-from rrs@lakerest.net) From: Randall Stewart Content-Type: multipart/mixed; boundary="Apple-Mail=_49C5FDEE-E4BA-44F6-8F6A-342853C67ED4" Subject: MQ Patch. 
Date: Tue, 29 Oct 2013 06:50:56 -0400 Message-Id: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> To: net@freebsd.org Mime-Version: 1.0 (Apple Message framework v1283) X-Mailer: Apple Mail (2.1283) X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 10:51:20 -0000 --Apple-Mail=_49C5FDEE-E4BA-44F6-8F6A-342853C67ED4 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=windows-1252 Hi: As discussed at vBSDcon with andre/emaste and gnn, I am sending this patch out to all of you ;-) I have previously sent it to gnn, andre, jhb, rwatson, and several others of the usual suspects (as gnn put it) and received dead silence. What does this patch do? Well, it adds the ability to do multi-queue at the driver level. Basically any driver that uses the new interface gets under it N queues (default is 8) for each physical transmit ring it has. The driver picks up its queue 0 first, then queue 1 .. up to the max. This allows you to prioritize packets. Also in here is the start of some work I will be doing for AQM.. think either Pi or Codel ;-) Right now that's pretty simple and just (in a few drivers) has the ability to limit the amount of data on the ring... which can help reduce buffer bloat. That needs to be refined into a lot more. This work is donated by Adara Networks and has been discussed in several of the past vendor summits. I plan on committing this before the IETF unless I hear major objections.
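The queue ordering described above (N queues per transmit ring, queue 0 drained first, then queue 1, and so on) can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the drbr code from the attached patch: the names mq_ring, mq_fifo, mq_init, mq_enqueue, mq_peek and mq_advance, and the MQ_QLEN capacity, are hypothetical. Only the default of 8 queues and the "queue 0 first" ordering come from the message. The peek-then-advance split loosely mirrors how the patched bxe transmit loop calls drbr_peek() and drbr_advance(); locking, mbufs and the putback path are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MQ_NQUEUES 8	/* "default is 8" per the message above */
#define MQ_QLEN    16	/* per-queue capacity; arbitrary for this sketch */

/* One fixed-size FIFO of opaque packet pointers (hypothetical type). */
struct mq_fifo {
	void    *slot[MQ_QLEN];
	unsigned head;		/* index of the oldest entry */
	unsigned count;		/* number of entries currently queued */
};

/* The software queues sitting above one physical transmit ring. */
struct mq_ring {
	struct mq_fifo q[MQ_NQUEUES];
	uint64_t       drops;	/* packets rejected because a queue was full */
};

static void
mq_init(struct mq_ring *r)
{
	r->drops = 0;
	for (unsigned i = 0; i < MQ_NQUEUES; i++) {
		r->q[i].head = 0;
		r->q[i].count = 0;
	}
}

/* Queue pkt on queue qidx (0 = highest priority). Returns 0, or -1 on drop. */
static int
mq_enqueue(struct mq_ring *r, unsigned qidx, void *pkt)
{
	struct mq_fifo *f;

	assert(pkt != NULL);
	if (qidx >= MQ_NQUEUES)
		qidx = MQ_NQUEUES - 1;	/* clamp unknown classes to lowest priority */
	f = &r->q[qidx];
	if (f->count == MQ_QLEN) {
		r->drops++;
		return (-1);
	}
	f->slot[(f->head + f->count) % MQ_QLEN] = pkt;
	f->count++;
	return (0);
}

/*
 * Return the next packet to transmit without removing it, scanning queue 0
 * first, then queue 1, up to the last queue. *qused reports which queue it
 * came from so the caller can advance that same queue afterwards.
 */
static void *
mq_peek(struct mq_ring *r, uint8_t *qused)
{
	for (unsigned i = 0; i < MQ_NQUEUES; i++) {
		if (r->q[i].count != 0) {
			*qused = (uint8_t)i;
			return (r->q[i].slot[r->q[i].head]);
		}
	}
	return (NULL);
}

/* Consume the head of queue qused once the packet is on the hardware ring. */
static void
mq_advance(struct mq_ring *r, uint8_t qused)
{
	struct mq_fifo *f = &r->q[qused];

	assert(f->count != 0);
	f->head = (f->head + 1) % MQ_QLEN;
	f->count--;
}
```

With this shape, a packet queued on queue 0 is always handed out before anything waiting on queues 1 and up, which is the prioritization the message describes. A strict ordering like this can starve the higher-numbered queues under sustained load on queue 0; whether the actual patch guards against that is not stated here.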
Please have a look ;-) Best wishes R --Apple-Mail=_49C5FDEE-E4BA-44F6-8F6A-342853C67ED4 Content-Disposition: attachment; filename=patch_mq.txt Content-Type: text/plain; x-unix-mode=0644; name="patch_mq.txt" Content-Transfer-Encoding: quoted-printable Index: sys/conf/files =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/conf/files (revision 257322) +++ sys/conf/files (working copy) @@ -3062,6 +3062,7 @@ net/bridgestp.c optional bridge = | if_bridge net/flowtable.c optional flowtable inet | = flowtable inet6 net/ieee8023ad_lacp.c optional lagg net/if.c standard +net/drbr.c standard net/if_arcsubr.c optional arcnet net/if_atmsubr.c optional atm net/if_bridge.c optional bridge inet | if_bridge = inet Index: sys/dev/bxe/bxe.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/bxe/bxe.c (revision 257322) +++ sys/dev/bxe/bxe.c (working copy) @@ -5935,10 +5935,11 @@ bxe_tx_mq_start_locked(struct bxe_softc *sc, struct bxe_fastpath *fp, struct mbuf *m) { - struct buf_ring *tx_br =3D fp->tx_br; + struct drbr_ring *tx_br =3D fp->tx_br; struct mbuf *next; int depth, rc, tx_count; uint16_t tx_bd_avail; + uint8_t qused; =20 rc =3D tx_count =3D 0; =20 @@ -5955,25 +5956,16 @@ bxe_tx_mq_start_locked(struct bxe_softc *sc, =20 BXE_FP_TX_LOCK_ASSERT(fp); =20 - if (m =3D=3D NULL) { - /* no new work, check for pending frames */ - next =3D drbr_dequeue(ifp, tx_br); - } else if (drbr_needs_enqueue(ifp, tx_br)) { - /* have both new and pending work, maintain packet order */ - rc =3D drbr_enqueue(ifp, tx_br, m); - if (rc !=3D 0) { - fp->eth_q_stats.tx_soft_errors++; - goto bxe_tx_mq_start_locked_exit; - } - next =3D drbr_dequeue(ifp, tx_br); - } else { - /* new 
work only and nothing pending */ - next =3D m; + if (m !=3D NULL) { + rc =3D drbr_enqueue(ifp, tx_br, m); + if (rc !=3D 0) { + fp->eth_q_stats.tx_soft_errors++; + goto bxe_tx_mq_start_locked_exit; + } } =20 /* keep adding entries while there are frames to send */ - while (next !=3D NULL) { - + while ((next =3D drbr_peek(ifp, fp->tx_br, &qused)) !=3D NULL) { /* the mbuf now belongs to us */ fp->eth_q_stats.mbuf_alloc_tx++; =20 @@ -5985,19 +5977,22 @@ bxe_tx_mq_start_locked(struct bxe_softc *sc, rc =3D bxe_tx_encap(fp, &next); if (__predict_false(rc !=3D 0)) { fp->eth_q_stats.tx_encap_failures++; - if (next !=3D NULL) { - /* mark the TX queue as full and save the frame */ - ifp->if_drv_flags |=3D IFF_DRV_OACTIVE; - /* XXX this may reorder the frame */ - rc =3D drbr_enqueue(ifp, tx_br, next); - fp->eth_q_stats.mbuf_alloc_tx--; - fp->eth_q_stats.tx_frames_deferred++; - } - + if (next =3D=3D NULL) { + drbr_advance(ifp, fp->tx_br, qused); + } else { + drbr_putback(ifp, fp->tx_br, next, qused); + /* + * Mark the TX queue as full and save + * the frame. 
+ */ + ifp->if_drv_flags |=3D IFF_DRV_OACTIVE; + fp->eth_q_stats.mbuf_alloc_tx--; + fp->eth_q_stats.tx_frames_deferred++; + } /* stop looking for more work */ break; } - + drbr_advance(ifp, fp->tx_br, qused); /* the transmit frame was enqueued successfully */ tx_count++; =20 @@ -6078,7 +6073,6 @@ bxe_mq_flush(struct ifnet *ifp) { struct bxe_softc *sc =3D ifp->if_softc; struct bxe_fastpath *fp; - struct mbuf *m; int i; =20 for (i =3D 0; i < sc->num_queues; i++) { @@ -6093,9 +6087,7 @@ bxe_mq_flush(struct ifnet *ifp) if (fp->tx_br !=3D NULL) { BLOGD(sc, DBG_LOAD, "Clearing fp[%02d] buf_ring\n", = fp->index); BXE_FP_TX_LOCK(fp); - while ((m =3D buf_ring_dequeue_sc(fp->tx_br)) !=3D NULL) { - m_freem(m); - } + drbr_flush(ifp, fp->tx_br); BXE_FP_TX_UNLOCK(fp); } } @@ -6496,12 +6488,9 @@ bxe_free_fp_buffers(struct bxe_softc *sc) =20 #if __FreeBSD_version >=3D 800000 if (fp->tx_br !=3D NULL) { - struct mbuf *m; /* just in case bxe_mq_flush() wasn't called */ - while ((m =3D buf_ring_dequeue_sc(fp->tx_br)) !=3D NULL) { - m_freem(m); - } - buf_ring_free(fp->tx_br, M_DEVBUF); + drbr_flush(sc->ifnet, fp->tx_br); + drbr_free(fp->tx_br, M_DEVBUF); fp->tx_br =3D NULL; } #endif @@ -6762,8 +6751,7 @@ bxe_alloc_fp_buffers(struct bxe_softc *sc) fp =3D &sc->fp[i]; =20 #if __FreeBSD_version >=3D 800000 - fp->tx_br =3D buf_ring_alloc(BXE_BR_SIZE, M_DEVBUF, - M_DONTWAIT, &fp->tx_mtx); + fp->tx_br =3D drbr_alloc(M_DEVBUF, M_DONTWAIT, &fp->tx_mtx); if (fp->tx_br =3D=3D NULL) { BLOGE(sc, "buf_ring alloc fail for fp[%02d]\n", i); goto bxe_alloc_fp_buffers_error; Index: sys/dev/bxe/bxe.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/bxe/bxe.h (revision 257322) +++ sys/dev/bxe/bxe.h (working copy) @@ -69,6 +69,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include =20 #include #include @@ -734,7 +735,7 @@ struct 
bxe_fastpath { =20 #if __FreeBSD_version >=3D 800000 #define BXE_BR_SIZE 4096 - struct buf_ring *tx_br; + struct drbr_ring *tx_br; #endif }; /* struct bxe_fastpath */ =20 Index: sys/dev/cesa/cesa.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/cesa/cesa.c (revision 257322) +++ sys/dev/cesa/cesa.c (working copy) @@ -995,11 +995,17 @@ cesa_attach(device_t dev) sc->sc_dev =3D dev; =20 /* Check if CESA peripheral device has power turned on */ +#if defined(SOC_MV_KIRKWOOD) + if (soc_power_ctrl_get(CPU_PM_CTRL_CRYPTO) =3D=3D = CPU_PM_CTRL_CRYPTO) { + device_printf(dev, "not powered on\n"); + return (ENXIO); + } +#else if (soc_power_ctrl_get(CPU_PM_CTRL_CRYPTO) !=3D = CPU_PM_CTRL_CRYPTO) { device_printf(dev, "not powered on\n"); return (ENXIO); } - +#endif soc_id(&d, &r); =20 switch (d) { Index: sys/dev/cxgb/cxgb_adapter.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/cxgb/cxgb_adapter.h (revision 257322) +++ sys/dev/cxgb/cxgb_adapter.h (working copy) @@ -252,7 +252,7 @@ struct sge_txq { bus_dma_tag_t entry_tag; struct mbuf_head sendq; =20 - struct buf_ring *txq_mr; + struct drbr_ring *txq_mr; struct ifaltq *txq_ifq; struct callout txq_timer; struct callout txq_watchdog; Index: sys/dev/cxgb/cxgb_main.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/cxgb/cxgb_main.c (revision 257322) +++ sys/dev/cxgb/cxgb_main.c (working copy) @@ -66,6 +66,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include =20 #include #include @@ -2361,7 +2362,7 @@ 
cxgb_tick_handler(void *arg, int count) =20 drops =3D 0; for (j =3D pi->first_qset; j < pi->first_qset + = pi->nqsets; j++) - drops +=3D = sc->sge.qs[j].txq[TXQ_ETH].txq_mr->br_drops; + drops +=3D = drbr_get_dropcnt(sc->sge.qs[j].txq[TXQ_ETH].txq_mr); ifp->if_snd.ifq_drops =3D drops; =20 ifp->if_oerrors =3D Index: sys/dev/cxgb/cxgb_sge.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/cxgb/cxgb_sge.c (revision 257322) +++ sys/dev/cxgb/cxgb_sge.c (working copy) @@ -61,6 +61,7 @@ __FBSDID("$FreeBSD$"); #include =09 #include #include +#include =20 #include #include @@ -1684,7 +1685,7 @@ cxgb_transmit_locked(struct ifnet *ifp, struct sge { struct port_info *pi =3D qs->port; struct sge_txq *txq =3D &qs->txq[TXQ_ETH]; - struct buf_ring *br =3D txq->txq_mr; + struct drbr_ring *br =3D txq->txq_mr; int error, avail; =20 avail =3D txq->size - txq->in_use; @@ -1980,7 +1981,7 @@ t3_free_qset(adapter_t *sc, struct sge_qset *q) =09 reclaim_completed_tx(q, 0, TXQ_ETH); if (q->txq[TXQ_ETH].txq_mr !=3D NULL)=20 - buf_ring_free(q->txq[TXQ_ETH].txq_mr, M_DEVBUF); + drbr_free(q->txq[TXQ_ETH].txq_mr, M_DEVBUF); if (q->txq[TXQ_ETH].txq_ifq !=3D NULL) { ifq_delete(q->txq[TXQ_ETH].txq_ifq); free(q->txq[TXQ_ETH].txq_ifq, M_DEVBUF); @@ -2430,8 +2431,8 @@ t3_sge_alloc_qset(adapter_t *sc, u_int id, int npo q->port =3D pi; q->adap =3D sc; =20 - if ((q->txq[TXQ_ETH].txq_mr =3D = buf_ring_alloc(cxgb_txq_buf_ring_size, - M_DEVBUF, M_WAITOK, &q->lock)) =3D=3D NULL) { + if ((q->txq[TXQ_ETH].txq_mr =3D drbr_alloc(M_DEVBUF, M_WAITOK,=20= + &q->lock)) =3D=3D NULL) { device_printf(sc->dev, "failed to allocate mbuf = ring\n"); goto err; } @@ -3523,9 +3524,9 @@ t3_add_configured_sysctls(adapter_t *sc) CTLTYPE_STRING | CTLFLAG_RD, &qs->rspq, 0, t3_dump_rspq, "A", "dump of the response = queue"); =20 - SYSCTL_ADD_UQUAD(ctx, txqpoidlist, 
OID_AUTO, = "dropped", +/* RRS FIXME SYSCTL_ADD_UQUAD(ctx, txqpoidlist, OID_AUTO, = "dropped", CTLFLAG_RD, = &qs->txq[TXQ_ETH].txq_mr->br_drops, - "#tunneled packets dropped"); + "#tunneled packets dropped");*/ SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, = "sendqlen", CTLFLAG_RD, &qs->txq[TXQ_ETH].sendq.qlen, 0, "#tunneled packets waiting to be sent"); Index: sys/dev/cxgbe/adapter.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/cxgbe/adapter.h (revision 257322) +++ sys/dev/cxgbe/adapter.h (working copy) @@ -419,7 +419,7 @@ struct sge_txq { =20 struct ifnet *ifp; /* the interface this txq belongs to */ bus_dma_tag_t tx_tag; /* tag for transmit buffers */ - struct buf_ring *br; /* tx buffer ring */ + struct drbr_ring *br; /* tx buffer ring */ struct tx_sdesc *sdesc; /* KVA of software descriptor ring */ struct mbuf *m; /* held up due to temporary resource = shortage */ =20 Index: sys/dev/cxgbe/t4_main.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/cxgbe/t4_main.c (revision 257322) +++ sys/dev/cxgbe/t4_main.c (working copy) @@ -54,6 +54,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #include #if defined(__i386__) || defined(__amd64__) #include @@ -1254,7 +1255,7 @@ cxgbe_transmit(struct ifnet *ifp, struct mbuf *m) struct port_info *pi =3D ifp->if_softc; struct adapter *sc =3D pi->adapter; struct sge_txq *txq =3D &sc->sge.txq[pi->first_txq]; - struct buf_ring *br; + struct drbr_ring *br; int rc; =20 M_ASSERTPKTHDR(m); @@ -1295,7 +1296,7 @@ cxgbe_transmit(struct ifnet *ifp, struct mbuf *m) */ =20 TXQ_LOCK_ASSERT_OWNED(txq); - if (drbr_needs_enqueue(ifp, br) || txq->m) { + if (txq->m) { =20 /* Queued for 
transmission. */ =20 @@ -1321,7 +1322,6 @@ cxgbe_qflush(struct ifnet *ifp) struct port_info *pi =3D ifp->if_softc; struct sge_txq *txq; int i; - struct mbuf *m; =20 /* queues do not exist if !PORT_INIT_DONE. */ if (pi->flags & PORT_INIT_DONE) { @@ -1329,8 +1329,7 @@ cxgbe_qflush(struct ifnet *ifp) TXQ_LOCK(txq); m_freem(txq->m); txq->m =3D NULL; - while ((m =3D buf_ring_dequeue_sc(txq->br)) !=3D = NULL) - m_freem(m); + drbr_flush(ifp, txq->br); TXQ_UNLOCK(txq); } } @@ -4042,7 +4041,7 @@ cxgbe_tick(void *arg) =20 drops =3D s->tx_drop; for_each_txq(pi, i, txq) - drops +=3D txq->br->br_drops; + drops +=3D drbr_get_dropcnt(txq->br); ifp->if_snd.ifq_drops =3D drops; =20 ifp->if_oerrors =3D s->tx_error_frames; @@ -6493,7 +6492,7 @@ sysctl_wcwr_stats(SYSCTL_HANDLER_ARGS) static inline void txq_start(struct ifnet *ifp, struct sge_txq *txq) { - struct buf_ring *br; + struct drbr_ring *br; struct mbuf *m; =20 TXQ_LOCK_ASSERT_OWNED(txq); @@ -7509,7 +7508,6 @@ t4_ioctl(struct cdev *dev, unsigned long cmd, cadd txq->txpkt_wrs =3D 0; txq->txpkts_wrs =3D 0; txq->txpkts_pkts =3D 0; - txq->br->br_drops =3D 0; txq->no_dmamap =3D 0; txq->no_desc =3D 0; } Index: sys/dev/cxgbe/t4_sge.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/cxgbe/t4_sge.c (revision 257322) +++ sys/dev/cxgbe/t4_sge.c (working copy) @@ -47,6 +47,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #include #include #include @@ -1844,9 +1845,10 @@ t4_eth_tx(struct ifnet *ifp, struct sge_txq *txq, struct port_info *pi =3D (void *)ifp->if_softc; struct adapter *sc =3D pi->adapter; struct sge_eq *eq =3D &txq->eq; - struct buf_ring *br =3D txq->br; + struct drbr_ring *br =3D txq->br; struct mbuf *next; int rc, coalescing, can_reclaim; + uint8_t qused; struct txpkts txpkts; struct sgl sgl; =20 @@ -1873,8 +1875,7 @@ t4_eth_tx(struct ifnet 
*ifp, struct sge_txq *txq, =20 if (__predict_false(eq->flags & EQ_DOOMED)) { m_freem(m); - while ((m =3D buf_ring_dequeue_sc(txq->br)) !=3D NULL) - m_freem(m); + drbr_flush(ifp, br); return (ENETDOWN); } =20 @@ -1889,7 +1890,7 @@ t4_eth_tx(struct ifnet *ifp, struct sge_txq *txq, next =3D m->m_nextpkt; m->m_nextpkt =3D NULL; =20 - if (next || buf_ring_peek(br)) + if (next || drbr_peek(ifp, br, &qused)) coalescing =3D 1; =20 rc =3D get_pkt_sgl(txq, &m, &sgl, coalescing); @@ -2936,7 +2937,7 @@ alloc_txq(struct port_info *pi, struct sge_txq *tx =20 txq->sdesc =3D malloc(eq->cap * sizeof(struct tx_sdesc), = M_CXGBE, M_ZERO | M_WAITOK); - txq->br =3D buf_ring_alloc(eq->qsize, M_CXGBE, M_WAITOK, = &eq->eq_lock); + txq->br =3D drbr_alloc(M_CXGBE, M_WAITOK, &eq->eq_lock); =20 rc =3D bus_dma_tag_create(sc->dmat, 1, 0, BUS_SPACE_MAXADDR, BUS_SPACE_MAXADDR, NULL, NULL, 64 * 1024, TX_SGL_SEGS, @@ -2991,8 +2992,8 @@ alloc_txq(struct port_info *pi, struct sge_txq *tx SYSCTL_ADD_UQUAD(&pi->ctx, children, OID_AUTO, "txpkts_pkts", = CTLFLAG_RD, &txq->txpkts_pkts, "# of frames tx'd using txpkts work = requests"); =20 - SYSCTL_ADD_UQUAD(&pi->ctx, children, OID_AUTO, "br_drops", = CTLFLAG_RD, - &txq->br->br_drops, "# of drops in the buf_ring for this = queue"); +/* SYSCTL_ADD_UQUAD(&pi->ctx, children, OID_AUTO, "br_drops", = CTLFLAG_RD, + &txq->br->br_drops, "# of drops in the buf_ring for this = queue");*/ SYSCTL_ADD_UINT(&pi->ctx, children, OID_AUTO, "no_dmamap", = CTLFLAG_RD, &txq->no_dmamap, 0, "# of times txq ran out of DMA maps"); SYSCTL_ADD_UINT(&pi->ctx, children, OID_AUTO, "no_desc", = CTLFLAG_RD, @@ -3021,7 +3022,7 @@ free_txq(struct port_info *pi, struct sge_txq *txq if (txq->txmaps.maps) t4_free_tx_maps(&txq->txmaps, txq->tx_tag); =20 - buf_ring_free(txq->br, M_CXGBE); + drbr_free(txq->br, M_CXGBE); =20 if (txq->tx_tag) bus_dma_tag_destroy(txq->tx_tag); Index: sys/dev/e1000/if_em.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= 
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/e1000/if_em.c (revision 257322) +++ sys/dev/e1000/if_em.c (working copy) @@ -67,6 +67,7 @@ #include #include #include +#include =20 #include #include @@ -273,6 +274,9 @@ static int em_is_valid_ether_addr(u8 *); static int em_sysctl_int_delay(SYSCTL_HANDLER_ARGS); static void em_add_int_delay_sysctl(struct adapter *, const char *, const char *, struct em_int_delay_info *, int, int); +static void em_max_bytes(struct ifnet *, uint64_t max); +static struct drbr_ring *em_get_ring(struct ifnet *ifp, int num); +static int em_ring_query(struct ifnet *ifp, struct mbuf *); /* Management and WOL Support */ static void em_init_manageability(struct adapter *); static void em_release_manageability(struct adapter *); @@ -897,7 +901,38 @@ em_resume(device_t dev) return bus_generic_resume(dev); } =20 +void +em_max_bytes(struct ifnet *ifp, uint64_t max) +{ + struct adapter *adapter =3D ifp->if_softc; + adapter->ring_bytes_max =3D max; +} =20 +struct drbr_ring * +em_get_ring(struct ifnet *ifp, int num) +{ + struct adapter *adapter =3D ifp->if_softc; + struct tx_ring *txr; + if (num >=3D adapter->num_queues) { + return (NULL); + } + if (adapter->tx_rings) { + txr =3D &adapter->tx_rings[num]; + return (txr->br); + } else { + return (NULL); + } +} +=20 +int +em_ring_query(struct ifnet *ifp, struct mbuf *m) +{ + struct adapter *adapter =3D ifp->if_softc; + struct tx_ring *txr; + txr =3D &adapter->tx_rings[0]; + return(drbr_is_on_ring(txr->br, m)); +} + #ifdef EM_MULTIQUEUE /********************************************************************* * Multiqueue Transmit routines=20 @@ -913,6 +948,7 @@ em_mq_start_locked(struct ifnet *ifp, struct tx_ri struct adapter *adapter =3D txr->adapter; struct mbuf *next; int err =3D 0, enq =3D 0; + uint8_t qused; =20 if ((ifp->if_drv_flags & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) !=3D= IFF_DRV_RUNNING || 
adapter->link_active =3D=3D 0) { @@ -929,20 +965,26 @@ em_mq_start_locked(struct ifnet *ifp, struct tx_ri }=20 =20 /* Process the queue */ - while ((next =3D drbr_peek(ifp, txr->br)) !=3D NULL) { + while ((next =3D drbr_peek(ifp, txr->br, &qused)) !=3D NULL) { if ((err =3D em_xmit(txr, &next)) !=3D 0) { if (next =3D=3D NULL) - drbr_advance(ifp, txr->br); + drbr_advance(ifp, txr->br, qused); else=20 - drbr_putback(ifp, txr->br, next); + drbr_putback(ifp, txr->br, next, qused); break; } - drbr_advance(ifp, txr->br); + drbr_advance(ifp, txr->br, qused); + atomic_add_long(&txr->bytes_on_ring,=20 + (uint64_t)next->m_pkthdr.len); enq++; ifp->if_obytes +=3D next->m_pkthdr.len; if (next->m_flags & M_MCAST) ifp->if_omcasts++; ETHER_BPF_MTAP(ifp, next); + if (adapter->ring_bytes_max &&=20 + (txr->bytes_on_ring >=3D adapter->ring_bytes_max)) { + break; + } if ((ifp->if_drv_flags & IFF_DRV_RUNNING) =3D=3D 0) break; } @@ -991,8 +1033,7 @@ em_qflush(struct ifnet *ifp) =20 for (int i =3D 0; i < adapter->num_queues; i++, txr++) { EM_TX_LOCK(txr); - while ((m =3D buf_ring_dequeue_sc(txr->br)) !=3D NULL) - m_freem(m); + drbr_flush(ifp, txr->br); EM_TX_UNLOCK(txr); } if_qflush(ifp); @@ -2984,6 +3025,9 @@ em_setup_interface(device_t dev, struct adapter *a ifp->if_softc =3D adapter; ifp->if_flags =3D IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST; ifp->if_ioctl =3D em_ioctl; + ifp->if_maxbytes =3D em_max_bytes; + ifp->if_getdrbr_ring =3D em_get_ring; + ifp->if_mbuf_on_ring =3D em_ring_query; #ifdef EM_MULTIQUEUE /* Multiqueue stack interface */ ifp->if_transmit =3D em_mq_start; @@ -3222,7 +3266,7 @@ em_allocate_queues(struct adapter *adapter) } #if __FreeBSD_version >=3D 800000 /* Allocate a buf ring */ - txr->br =3D buf_ring_alloc(4096, M_DEVBUF, + txr->br =3D drbr_alloc(M_DEVBUF, M_WAITOK, &txr->tx_mtx); #endif } @@ -3272,7 +3316,7 @@ err_tx_desc: free(adapter->rx_rings, M_DEVBUF); rx_fail: #if __FreeBSD_version >=3D 800000 - buf_ring_free(txr->br, M_DEVBUF); + drbr_free(txr->br, 
M_DEVBUF); #endif free(adapter->tx_rings, M_DEVBUF); fail: @@ -3396,6 +3440,7 @@ em_setup_transmit_ring(struct tx_ring *txr) =20 /* Set number of descriptors available */ txr->tx_avail =3D adapter->num_tx_desc; + txr->bytes_on_ring =3D 0; txr->queue_status =3D EM_QUEUE_IDLE; =20 /* Clear checksum offload context. */ @@ -3579,7 +3624,7 @@ em_free_transmit_buffers(struct tx_ring *txr) } #if __FreeBSD_version >=3D 800000 if (txr->br !=3D NULL) - buf_ring_free(txr->br, M_DEVBUF); + drbr_free(txr->br, M_DEVBUF); #endif if (txr->tx_buffers !=3D NULL) { free(txr->tx_buffers, M_DEVBUF); @@ -3877,6 +3922,8 @@ em_txeof(struct tx_ring *txr) ++processed; =20 if (tx_buffer->m_head) { + = atomic_subtract_long(&txr->bytes_on_ring, + = (u_long)tx_buffer->m_head->m_pkthdr.len); bus_dmamap_sync(txr->txtag, tx_buffer->map, BUS_DMASYNC_POSTWRITE); @@ -5329,7 +5376,7 @@ em_add_hw_stats(struct adapter *adapter) queue_node =3D SYSCTL_ADD_NODE(ctx, child, OID_AUTO, = namebuf, CTLFLAG_RD, NULL, "Queue = Name"); queue_list =3D SYSCTL_CHILDREN(queue_node); - + drbr_add_sysctl_stats(dev, queue_list, txr->br); SYSCTL_ADD_PROC(ctx, queue_list, OID_AUTO, "txd_head",=20= CTLTYPE_UINT | CTLFLAG_RD, adapter, E1000_TDH(txr->me), Index: sys/dev/e1000/if_em.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/e1000/if_em.h (revision 257322) +++ sys/dev/e1000/if_em.h (working copy) @@ -298,8 +298,9 @@ struct tx_ring { u8 last_hw_tucso; u8 last_hw_tucss; #if __FreeBSD_version >=3D 800000 - struct buf_ring *br; + struct drbr_ring *br; #endif + volatile u_long bytes_on_ring; /* Interrupt resources */ bus_dma_tag_t txtag; void *tag; @@ -346,6 +347,7 @@ struct rx_ring { /* Our adapter structure */ struct adapter { struct ifnet *ifp; + uint64_t ring_bytes_max; struct e1000_hw hw; =20 /* FreeBSD operating-system-specific structures. 
*/ Index: sys/dev/e1000/if_igb.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/e1000/if_igb.c (revision 257322) +++ sys/dev/e1000/if_igb.c (working copy) @@ -72,6 +72,7 @@ #include #include #include +#include =20 #include #include @@ -216,6 +217,9 @@ static void igb_reset(struct adapter *); static int igb_setup_interface(device_t, struct adapter *); static int igb_allocate_queues(struct adapter *); static void igb_configure_queues(struct adapter *); +static void igb_max_bytes(struct ifnet *, uint64_t max); +static struct drbr_ring *igb_get_ring(struct ifnet *ifp, int num); +static int igb_ring_query(struct ifnet *ifp, struct mbuf *m); =20 static int igb_allocate_transmit_buffers(struct tx_ring *); static void igb_setup_transmit_structures(struct adapter *); @@ -883,7 +887,43 @@ igb_resume(device_t dev) return bus_generic_resume(dev); } =20 +void +igb_max_bytes(struct ifnet *ifp, uint64_t max) +{ + struct adapter *adapter =3D ifp->if_softc; + adapter->ring_bytes_max =3D max; =20 +} + +struct drbr_ring * +igb_get_ring(struct ifnet *ifp, int num) +{ + struct adapter *adapter =3D ifp->if_softc; + struct tx_ring *txr; + + if (num >=3D adapter->num_queues) { + return (NULL); + } + if (adapter->tx_rings) { + txr =3D &adapter->tx_rings[num]; + return (txr->br); + } else { + return (NULL); + } +} + +int +igb_ring_query(struct ifnet *ifp, struct mbuf *m) +{ + struct adapter *adapter =3D ifp->if_softc; + struct tx_ring *txr; + /* For this hack, we only use 0, since adara stuff + * sends out on queue 0 always. 
+ */ + txr =3D &adapter->tx_rings[0]; + return(drbr_is_on_ring(txr->br, m)); +} + #ifdef IGB_LEGACY_TX =20 /********************************************************************* @@ -1003,6 +1043,7 @@ igb_mq_start_locked(struct ifnet *ifp, struct tx_r struct adapter *adapter =3D txr->adapter; struct mbuf *next; int err =3D 0, enq =3D 0; + uint8_t qused; =20 IGB_TX_LOCK_ASSERT(txr); =20 @@ -1012,11 +1053,11 @@ igb_mq_start_locked(struct ifnet *ifp, struct = tx_r =20 =20 /* Process the queue */ - while ((next =3D drbr_peek(ifp, txr->br)) !=3D NULL) { + while ((next =3D drbr_peek(ifp, txr->br, &qused)) !=3D NULL) { if ((err =3D igb_xmit(txr, &next)) !=3D 0) { if (next =3D=3D NULL) { /* It was freed, move forward */ - drbr_advance(ifp, txr->br); + drbr_advance(ifp, txr->br, qused); } else { /*=20 * Still have one left, it may not be @@ -1023,11 +1064,13 @@ igb_mq_start_locked(struct ifnet *ifp, struct = tx_r * the same since the transmit function * may have changed it. */ - drbr_putback(ifp, txr->br, next); + drbr_putback(ifp, txr->br, next, qused); } break; } - drbr_advance(ifp, txr->br); + drbr_advance(ifp, txr->br, qused); + atomic_add_long(&txr->bytes_on_ring,=20 + (u_long)next->m_pkthdr.len); enq++; ifp->if_obytes +=3D next->m_pkthdr.len; if (next->m_flags & M_MCAST) @@ -1035,6 +1078,11 @@ igb_mq_start_locked(struct ifnet *ifp, struct = tx_r ETHER_BPF_MTAP(ifp, next); if ((ifp->if_drv_flags & IFF_DRV_RUNNING) =3D=3D 0) break; + if (adapter->ring_bytes_max &&=20 + (txr->bytes_on_ring >=3D adapter->ring_bytes_max)) { + break; + } + } if (enq > 0) { /* Set the watchdog */ @@ -1072,12 +1120,10 @@ igb_qflush(struct ifnet *ifp) { struct adapter *adapter =3D ifp->if_softc; struct tx_ring *txr =3D adapter->tx_rings; - struct mbuf *m; =20 for (int i =3D 0; i < adapter->num_queues; i++, txr++) { IGB_TX_LOCK(txr); - while ((m =3D buf_ring_dequeue_sc(txr->br)) !=3D NULL) - m_freem(m); + drbr_flush(ifp, txr->br); IGB_TX_UNLOCK(txr); } if_qflush(ifp); @@ -3117,6 +3163,9 @@ 
igb_setup_interface(device_t dev, struct adapter * #ifndef IGB_LEGACY_TX ifp->if_transmit =3D igb_mq_start; ifp->if_qflush =3D igb_qflush; + ifp->if_maxbytes =3D igb_max_bytes; + ifp->if_getdrbr_ring =3D igb_get_ring; + ifp->if_mbuf_on_ring =3D igb_ring_query; #else ifp->if_start =3D igb_start; IFQ_SET_MAXLEN(&ifp->if_snd, adapter->num_tx_desc - 1); @@ -3361,7 +3410,7 @@ igb_allocate_queues(struct adapter *adapter) } #ifndef IGB_LEGACY_TX /* Allocate a buf ring */ - txr->br =3D buf_ring_alloc(igb_buf_ring_size, M_DEVBUF, + txr->br =3D drbr_alloc(M_DEVBUF, M_WAITOK, &txr->tx_mtx); #endif } @@ -3421,7 +3470,7 @@ err_tx_desc: free(adapter->rx_rings, M_DEVBUF); rx_fail: #ifndef IGB_LEGACY_TX - buf_ring_free(txr->br, M_DEVBUF); + drbr_free(txr->br, M_DEVBUF); #endif free(adapter->tx_rings, M_DEVBUF); tx_fail: @@ -3539,6 +3588,7 @@ igb_setup_transmit_ring(struct tx_ring *txr) =20 /* Set number of descriptors available */ txr->tx_avail =3D adapter->num_tx_desc; + txr->bytes_on_ring =3D 0; =20 bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); @@ -3680,7 +3730,7 @@ igb_free_transmit_buffers(struct tx_ring *txr) } #ifndef IGB_LEGACY_TX if (txr->br !=3D NULL) - buf_ring_free(txr->br, M_DEVBUF); + drbr_free(txr->br, M_DEVBUF); #endif if (txr->tx_buffers !=3D NULL) { free(txr->tx_buffers, M_DEVBUF); @@ -4016,6 +4066,8 @@ igb_txeof(struct tx_ring *txr) if (buf->m_head) { txr->bytes +=3D buf->m_head->m_pkthdr.len; + = atomic_subtract_long(&txr->bytes_on_ring, + = (uint64_t)buf->m_head->m_pkthdr.len); bus_dmamap_sync(txr->txtag, buf->map, BUS_DMASYNC_POSTWRITE); @@ -5636,7 +5688,7 @@ igb_add_hw_stats(struct adapter *adapter) queue_node =3D SYSCTL_ADD_NODE(ctx, child, OID_AUTO, = namebuf, CTLFLAG_RD, NULL, "Queue = Name"); queue_list =3D SYSCTL_CHILDREN(queue_node); - + drbr_add_sysctl_stats(dev, queue_list, txr->br); SYSCTL_ADD_PROC(ctx, queue_list, OID_AUTO, = "interrupt_rate",=20 CTLFLAG_RD, &adapter->queues[i], 
sizeof(&adapter->queues[i]), Index: sys/dev/e1000/if_igb.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/e1000/if_igb.h (revision 257322) +++ sys/dev/e1000/if_igb.h (working copy) @@ -309,12 +309,14 @@ struct tx_ring { IGB_QUEUE_DEPLETED =3D 8, } queue_status; u32 txd_cmd; - bus_dma_tag_t txtag; char mtx_name[16]; #ifndef IGB_LEGACY_TX - struct buf_ring *br; + struct drbr_ring *br; struct task txq_task; #endif + bus_dma_tag_t txtag; + volatile u_long bytes_on_ring; + u32 bytes; /* used for AIM */ u32 packets; /* Soft Stats */ @@ -371,17 +373,17 @@ struct adapter { struct device *dev; struct cdev *led_dev; =20 - struct resource *pci_mem; - struct resource *msix_mem; - int memrid; - + struct resource *pci_mem; + struct resource *msix_mem; + uint64_t ring_bytes_max; + int memrid; /* * Interrupt resources: this set is * either used for legacy, or for Link * when doing MSIX */ - void *tag; - struct resource *res; + void *tag; + struct resource *res; =20 struct ifmedia media; struct callout timer; Index: sys/dev/fdt/fdt_common.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/fdt/fdt_common.c (revision 257322) +++ sys/dev/fdt/fdt_common.c (working copy) @@ -183,7 +183,6 @@ fdt_is_compatible(phandle_t node, const char *comp compat +=3D l; len -=3D l; } - return (rv); } =20 @@ -585,15 +584,18 @@ fdt_get_phyaddr(phandle_t node, device_t dev, int if (OF_getencprop(node, "phy-handle", (void *)&phy_handle, sizeof(phy_handle)) <=3D 0) return (ENXIO); - phy_node =3D OF_xref_phandle(phy_handle); + device_printf(dev, "phy-handle:0x%x phy_ihandle:0x%x = phy_node:0x%x\n",=20 + (uint32_t)phy_handle, (uint32_t)phy_ihandle, + (uint32_t)phy_node); =20 
if (OF_getprop(phy_node, "reg", (void *)&phy_reg, sizeof(phy_reg)) <=3D 0) return (ENXIO); =20 + device_printf(dev, "reg:0x%x\n", (uint32_t)phy_reg); *phy_addr =3D fdt32_to_cpu(phy_reg); - + device_printf(dev, "tran to reg:0x%x\n", (uint32_t)*phy_addr); /* * Search for softc used to communicate with phy. */ Index: sys/dev/fdt/simplebus.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/fdt/simplebus.c (revision 257322) +++ sys/dev/fdt/simplebus.c (working copy) @@ -154,6 +154,8 @@ simplebus_probe(device_t dev) return (BUS_PROBE_GENERIC); } =20 +extern uint32_t simp_bus_debug; + static int simplebus_attach(device_t dev) { @@ -161,6 +163,7 @@ simplebus_attach(device_t dev) struct simplebus_devinfo *di; struct simplebus_softc *sc; phandle_t dt_node, dt_child; + int ret; =20 sc =3D device_get_softc(dev); =20 @@ -215,13 +218,15 @@ simplebus_attach(device_t dev) free(di, M_SIMPLEBUS); continue; } -#ifdef DEBUG +/*#ifdef DEBUG*/ device_printf(dev, "added child: %s\n\n", = di->di_ofw.obd_name); -#endif +/*#endif*/ device_set_ivars(dev_child, di); } - - return (bus_generic_attach(dev)); + simp_bus_debug =3D 1; + ret =3D bus_generic_attach(dev); + simp_bus_debug =3D 0; + return (ret); } =20 static int Index: sys/dev/ixgbe/ixgbe.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/ixgbe/ixgbe.c (revision 257322) +++ sys/dev/ixgbe/ixgbe.c (working copy) @@ -845,7 +845,8 @@ ixgbe_mq_start_locked(struct ifnet *ifp, struct tx struct adapter *adapter =3D txr->adapter; struct mbuf *next; int enqueued =3D 0, err =3D 0; - + uint8_t qused; +=09 if (((ifp->if_drv_flags & IFF_DRV_RUNNING) =3D=3D 0) || adapter->link_active =3D=3D 0) return (ENETDOWN); @@ 
-858,18 +859,18 @@ ixgbe_mq_start_locked(struct ifnet *ifp, struct tx if (next !=3D NULL) err =3D drbr_enqueue(ifp, txr->br, = next); #else - while ((next =3D drbr_peek(ifp, txr->br)) !=3D NULL) { + while ((next =3D drbr_peek(ifp, txr->br, &qused)) !=3D NULL) { if ((err =3D ixgbe_xmit(txr, &next)) !=3D 0) { if (next =3D=3D NULL) { - drbr_advance(ifp, txr->br); + drbr_advance(ifp, txr->br, qused); } else { - drbr_putback(ifp, txr->br, next); + drbr_putback(ifp, txr->br, next, qused); } #endif break; } #if __FreeBSD_version >=3D 901504 - drbr_advance(ifp, txr->br); + drbr_advance(ifp, txr->br, qused); #endif enqueued++; /* Send a copy of the frame to the BPF listener */ @@ -917,12 +918,10 @@ ixgbe_qflush(struct ifnet *ifp) { struct adapter *adapter =3D ifp->if_softc; struct tx_ring *txr =3D adapter->tx_rings; - struct mbuf *m; =20 for (int i =3D 0; i < adapter->num_queues; i++, txr++) { IXGBE_TX_LOCK(txr); - while ((m =3D buf_ring_dequeue_sc(txr->br)) !=3D NULL) - m_freem(m); + drbr_flush(ifp, txr->br); IXGBE_TX_UNLOCK(txr); } if_qflush(ifp); @@ -2891,7 +2890,7 @@ ixgbe_allocate_queues(struct adapter *adapter) } #ifndef IXGBE_LEGACY_TX /* Allocate a buf ring */ - txr->br =3D buf_ring_alloc(IXGBE_BR_SIZE, M_DEVBUF, + txr->br =3D drbr_alloc(M_DEVBUF, M_WAITOK, &txr->tx_mtx); if (txr->br =3D=3D NULL) { device_printf(dev, @@ -3253,7 +3252,7 @@ ixgbe_free_transmit_buffers(struct tx_ring *txr) } #ifdef IXGBE_LEGACY_TX if (txr->br !=3D NULL) - buf_ring_free(txr->br, M_DEVBUF); + drbr_free(txr->br, M_DEVBUF); #endif if (txr->tx_buffers !=3D NULL) { free(txr->tx_buffers, M_DEVBUF); Index: sys/dev/ixgbe/ixgbe.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/ixgbe/ixgbe.h (revision 257322) +++ sys/dev/ixgbe/ixgbe.h (working copy) @@ -58,6 +58,7 @@ #include #include #include +#include =20 #include #include @@ 
-313,7 +314,7 @@ struct tx_ring { bus_dma_tag_t txtag; char mtx_name[16]; #ifndef IXGBE_LEGACY_TX - struct buf_ring *br; + struct drbr_ring *br; struct task txq_task; #endif #ifdef IXGBE_FDIR Index: sys/dev/ixgbe/ixv.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/ixgbe/ixv.c (revision 257322) +++ sys/dev/ixgbe/ixv.c (working copy) @@ -603,6 +603,7 @@ ixv_mq_start_locked(struct ifnet *ifp, struct tx_r struct adapter *adapter =3D txr->adapter; struct mbuf *next; int enqueued, err =3D 0; + uint8_t qused; =20 if ((ifp->if_drv_flags & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) !=3D= IFF_DRV_RUNNING || adapter->link_active =3D=3D 0) { @@ -623,16 +624,16 @@ ixv_mq_start_locked(struct ifnet *ifp, struct tx_r } } /* Process the queue */ - while ((next =3D drbr_peek(ifp, txr->br)) !=3D NULL) { + while ((next =3D drbr_peek(ifp, txr->br, &qused)) !=3D NULL) { if ((err =3D ixv_xmit(txr, &next)) !=3D 0) { if (next =3D=3D NULL) { - drbr_advance(ifp, txr->br); + drbr_advance(ifp, txr->br, qused); } else { - drbr_putback(ifp, txr->br, next); + drbr_putback(ifp, txr->br, next, qused); } break; } - drbr_advance(ifp, txr->br); + drbr_advance(ifp, txr->br, qused); enqueued++; ifp->if_obytes +=3D next->m_pkthdr.len; if (next->m_flags & M_MCAST) @@ -664,12 +665,10 @@ ixv_qflush(struct ifnet *ifp) { struct adapter *adapter =3D ifp->if_softc; struct tx_ring *txr =3D adapter->tx_rings; - struct mbuf *m; =20 for (int i =3D 0; i < adapter->num_queues; i++, txr++) { IXV_TX_LOCK(txr); - while ((m =3D buf_ring_dequeue_sc(txr->br)) !=3D NULL) - m_freem(m); + drbr_flush(ifp, txr->br); IXV_TX_UNLOCK(txr); } if_qflush(ifp); @@ -2053,8 +2052,7 @@ ixv_allocate_queues(struct adapter *adapter) } #if __FreeBSD_version >=3D 800000 /* Allocate a buf ring */ - txr->br =3D buf_ring_alloc(IXV_BR_SIZE, M_DEVBUF, - M_WAITOK, &txr->tx_mtx); + 
txr->br =3D drbr_alloc(M_DEVBUF, M_WAITOK, = &txr->tx_mtx); if (txr->br =3D=3D NULL) { device_printf(dev, "Critical Failure setting up buf ring\n"); @@ -2355,7 +2353,7 @@ ixv_free_transmit_buffers(struct tx_ring *txr) } #if __FreeBSD_version >=3D 800000 if (txr->br !=3D NULL) - buf_ring_free(txr->br, M_DEVBUF); + drbr_free(txr->br, M_DEVBUF); #endif if (txr->tx_buffers !=3D NULL) { free(txr->tx_buffers, M_DEVBUF); Index: sys/dev/ixgbe/ixv.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/ixgbe/ixv.h (revision 257322) +++ sys/dev/ixgbe/ixv.h (working copy) @@ -61,6 +61,7 @@ #include #include #include +#include =20 #include #include @@ -267,7 +268,7 @@ struct tx_ring { u32 txd_cmd; bus_dma_tag_t txtag; char mtx_name[16]; - struct buf_ring *br; + struct drbr_ring *br; /* Soft Stats */ u32 bytes; u32 packets; Index: sys/dev/mxge/if_mxge.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/mxge/if_mxge.c (revision 257322) +++ sys/dev/mxge/if_mxge.c (working copy) @@ -59,6 +59,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include =20 #include #include @@ -2243,14 +2244,12 @@ mxge_qflush(struct ifnet *ifp) { mxge_softc_t *sc =3D ifp->if_softc; mxge_tx_ring_t *tx; - struct mbuf *m; int slice; =20 for (slice =3D 0; slice < sc->num_slices; slice++) { tx =3D &sc->ss[slice].tx; mtx_lock(&tx->mtx); - while ((m =3D buf_ring_dequeue_sc(tx->br)) !=3D NULL) - m_freem(m); + drbr_flush(ifp, tx->br); mtx_unlock(&tx->mtx); } if_qflush(ifp); @@ -4060,7 +4059,7 @@ mxge_update_stats(mxge_softc_t *sc) #ifdef IFNET_BUF_RING obytes +=3D ss->obytes; omcasts +=3D ss->omcasts; - odrops +=3D ss->tx.br->br_drops; + odrops +=3D drbr_get_dropcnt(ss->tx.br); 
#endif oerrors +=3D ss->oerrors; } @@ -4436,7 +4435,7 @@ mxge_alloc_slices(mxge_softc_t *sc) "%s:tx(%d)", device_get_nameunit(sc->dev), i); mtx_init(&ss->tx.mtx, ss->tx.mtx_name, NULL, MTX_DEF); #ifdef IFNET_BUF_RING - ss->tx.br =3D buf_ring_alloc(2048, M_DEVBUF, M_WAITOK, + ss->tx.br =3D drbr_alloc(M_DEVBUF, M_WAITOK, &ss->tx.mtx); #endif } Index: sys/dev/mxge/if_mxge_var.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/mxge/if_mxge_var.h (revision 257322) +++ sys/dev/mxge/if_mxge_var.h (working copy) @@ -167,7 +167,7 @@ typedef struct { struct mtx mtx; #ifdef IFNET_BUF_RING - struct buf_ring *br; + struct drbr_ring *br; #endif volatile mcp_kreq_ether_send_t *lanai; /* lanai ptr for sendq = */ volatile uint32_t *send_go; /* doorbell for sendq */ Index: sys/dev/oce/oce_hw.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/oce/oce_hw.c (revision 257322) +++ sys/dev/oce/oce_hw.c (working copy) @@ -360,8 +360,8 @@ oce_hw_shutdown(POCE_SOFTC sc) /* release PCI resources */ oce_hw_pci_free(sc); /* free mbox specific resources */ - LOCK_DESTROY(&sc->bmbx_lock); - LOCK_DESTROY(&sc->dev_lock); + LOCK_DESTROY_OCE(&sc->bmbx_lock); + LOCK_DESTROY_OCE(&sc->dev_lock); =20 oce_dma_free(sc, &sc->bsmbx); } Index: sys/dev/oce/oce_if.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/oce/oce_if.c (revision 257322) +++ sys/dev/oce/oce_if.c (working copy) @@ -296,8 +296,8 @@ oce_attach(device_t dev) sc->flow_control =3D OCE_DEFAULT_FLOW_CONTROL; sc->promisc =3D OCE_DEFAULT_PROMISCUOUS; 
=20 - LOCK_CREATE(&sc->bmbx_lock, "Mailbox_lock"); - LOCK_CREATE(&sc->dev_lock, "Device_lock"); + LOCK_CREATE_OCE(&sc->bmbx_lock, "Mailbox_lock"); + LOCK_CREATE_OCE(&sc->dev_lock, "Device_lock"); =20 /* initialise the hardware */ rc =3D oce_hw_init(sc); @@ -372,8 +372,8 @@ mbox_free: oce_dma_free(sc, &sc->bsmbx); pci_res_free: oce_hw_pci_free(sc); - LOCK_DESTROY(&sc->dev_lock); - LOCK_DESTROY(&sc->bmbx_lock); + LOCK_DESTROY_OCE(&sc->dev_lock); + LOCK_DESTROY_OCE(&sc->bmbx_lock); return rc; =20 } @@ -384,9 +384,9 @@ oce_detach(device_t dev) { POCE_SOFTC sc =3D device_get_softc(dev); =20 - LOCK(&sc->dev_lock); + LOCK_OCE(&sc->dev_lock); oce_if_deactivate(sc); - UNLOCK(&sc->dev_lock); + UNLOCK_OCE(&sc->dev_lock); =20 callout_drain(&sc->timer); =09 @@ -447,13 +447,13 @@ oce_ioctl(struct ifnet *ifp, u_long command, caddr } device_printf(sc->dev, "Interface Up\n");=09 } else { - LOCK(&sc->dev_lock); + LOCK_OCE(&sc->dev_lock); =20 sc->ifp->if_drv_flags &=3D ~(IFF_DRV_RUNNING | IFF_DRV_OACTIVE); oce_if_deactivate(sc); =20 - UNLOCK(&sc->dev_lock); + UNLOCK_OCE(&sc->dev_lock); =20 device_printf(sc->dev, "Interface Down\n"); } @@ -543,7 +543,7 @@ oce_init(void *arg) { POCE_SOFTC sc =3D arg; =09 - LOCK(&sc->dev_lock); + LOCK_OCE(&sc->dev_lock); =20 if (sc->ifp->if_flags & IFF_UP) { oce_if_deactivate(sc); @@ -550,7 +550,7 @@ oce_init(void *arg) oce_if_activate(sc); } =09 - UNLOCK(&sc->dev_lock); + UNLOCK_OCE(&sc->dev_lock); =20 } =20 @@ -571,9 +571,9 @@ oce_multiq_start(struct ifnet *ifp, struct mbuf *m =20 wq =3D sc->wq[queue_index]; =20 - LOCK(&wq->tx_lock); + LOCK_OCE(&wq->tx_lock); status =3D oce_multiq_transmit(ifp, m, wq); - UNLOCK(&wq->tx_lock); + UNLOCK_OCE(&wq->tx_lock); =20 return status; =20 @@ -584,12 +584,10 @@ static void oce_multiq_flush(struct ifnet *ifp) { POCE_SOFTC sc =3D ifp->if_softc; - struct mbuf *m; int i =3D 0; =20 for (i =3D 0; i < sc->nwqs; i++) { - while ((m =3D buf_ring_dequeue_sc(sc->wq[i]->br)) !=3D = NULL) - m_freem(m); + drbr_flush(ifp, 
sc->wq[i]->br); } if_qflush(ifp); } @@ -1136,13 +1134,13 @@ oce_tx_task(void *arg, int npending) int rc =3D 0; =20 #if __FreeBSD_version >=3D 800000 - LOCK(&wq->tx_lock); + LOCK_OCE(&wq->tx_lock); rc =3D oce_multiq_transmit(ifp, NULL, wq); if (rc) { device_printf(sc->dev, "TX[%d] restart failed\n", = wq->queue_index); } - UNLOCK(&wq->tx_lock); + UNLOCK_OCE(&wq->tx_lock); #else oce_start(ifp); #endif @@ -1170,9 +1168,9 @@ oce_start(struct ifnet *ifp) if (m =3D=3D NULL) break; =20 - LOCK(&sc->wq[def_q]->tx_lock); + LOCK_OCE(&sc->wq[def_q]->tx_lock); rc =3D oce_tx(sc, &m, def_q); - UNLOCK(&sc->wq[def_q]->tx_lock); + UNLOCK_OCE(&sc->wq[def_q]->tx_lock); if (rc) { if (m !=3D NULL) { sc->wq[def_q]->tx_stats.tx_stops ++; @@ -1247,7 +1245,8 @@ oce_multiq_transmit(struct ifnet *ifp, struct mbuf POCE_SOFTC sc =3D ifp->if_softc; int status =3D 0, queue_index =3D 0; struct mbuf *next =3D NULL; - struct buf_ring *br =3D NULL; + struct drbr_ring *br =3D NULL; + uint8_t qused; =20 br =3D wq->br; queue_index =3D wq->queue_index; @@ -1263,12 +1262,12 @@ oce_multiq_transmit(struct ifnet *ifp, struct = mbuf if ((status =3D drbr_enqueue(ifp, br, m)) !=3D 0) return status; }=20 - while ((next =3D drbr_peek(ifp, br)) !=3D NULL) { + while ((next =3D drbr_peek(ifp, br, &qused)) !=3D NULL) { if (oce_tx(sc, &next, queue_index)) { if (next =3D=3D NULL) { - drbr_advance(ifp, br); + drbr_advance(ifp, br, qused); } else { - drbr_putback(ifp, br, next); + drbr_putback(ifp, br, next, qused); wq->tx_stats.tx_stops ++; ifp->if_drv_flags |=3D IFF_DRV_OACTIVE; status =3D drbr_enqueue(ifp, br, next); @@ -1275,7 +1274,7 @@ oce_multiq_transmit(struct ifnet *ifp, struct mbuf } =20 break; } - drbr_advance(ifp, br); + drbr_advance(ifp, br, qused); ifp->if_obytes +=3D next->m_pkthdr.len; if (next->m_flags & M_MCAST) ifp->if_omcasts++; @@ -2078,13 +2077,13 @@ oce_if_deactivate(POCE_SOFTC sc) any other lock. So unlock device lock and require after completing taskqueue_drain. 
*/ - UNLOCK(&sc->dev_lock); + UNLOCK_OCE(&sc->dev_lock); for (i =3D 0; i < sc->intr_count; i++) { if (sc->intrs[i].tq !=3D NULL) { taskqueue_drain(sc->intrs[i].tq, = &sc->intrs[i].task); } } - LOCK(&sc->dev_lock); + LOCK_OCE(&sc->dev_lock); =20 /* Delete RX queue in card with flush param */ oce_stop_rx(sc); Index: sys/dev/oce/oce_if.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/oce/oce_if.h (revision 257322) +++ sys/dev/oce/oce_if.h (working copy) @@ -70,6 +70,7 @@ #include #include #include +#include =20 #include #include @@ -528,18 +529,18 @@ struct oce_lock { }; #define OCE_LOCK struct oce_lock =20 -#define LOCK_CREATE(lock, desc) { \ +#define LOCK_CREATE_OCE(lock, desc) { \ strncpy((lock)->name, (desc), MAX_LOCK_DESC_LEN); \ (lock)->name[MAX_LOCK_DESC_LEN] =3D '\0'; \ mtx_init(&(lock)->mutex, (lock)->name, NULL, MTX_DEF); \ } -#define LOCK_DESTROY(lock) \ +#define LOCK_DESTROY_OCE(lock) \ if (mtx_initialized(&(lock)->mutex))\ mtx_destroy(&(lock)->mutex) -#define TRY_LOCK(lock) = mtx_trylock(&(lock)->mutex) -#define LOCK(lock) mtx_lock(&(lock)->mutex) -#define LOCKED(lock) = mtx_owned(&(lock)->mutex) -#define UNLOCK(lock) = mtx_unlock(&(lock)->mutex) +#define TRY_LOCK_OCE(lock) = mtx_trylock(&(lock)->mutex) +#define LOCK_OCE(lock) mtx_lock(&(lock)->mutex) +#define LOCKED_OCE(lock) = mtx_owned(&(lock)->mutex) +#define UNLOCK_OCE(lock) = mtx_unlock(&(lock)->mutex) =20 #define DEFAULT_MQ_MBOX_TIMEOUT (5 * 1000 * = 1000) #define MBX_READY_TIMEOUT (1 * 1000 * = 1000) @@ -702,7 +703,7 @@ struct oce_wq { struct wq_config cfg; int queue_index; struct oce_tx_queue_stats tx_stats; - struct buf_ring *br; + struct drbr_ring *br; struct task txtask; uint32_t db_offset; }; Index: sys/dev/oce/oce_mbox.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= 
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/oce/oce_mbox.c (revision 257322) +++ sys/dev/oce/oce_mbox.c (working copy) @@ -345,7 +345,7 @@ oce_mbox_post(POCE_SOFTC sc, struct oce_mbx *mbx, uint32_t cstatus =3D 0; uint32_t xstatus =3D 0; =20 - LOCK(&sc->bmbx_lock); + LOCK_OCE(&sc->bmbx_lock); =20 mb =3D OCE_DMAPTR(&sc->bsmbx, struct oce_bmbx); mb_mbx =3D &mb->mbx; @@ -387,7 +387,7 @@ oce_mbox_post(POCE_SOFTC sc, struct oce_mbx *mbx, } } =20 - UNLOCK(&sc->bmbx_lock); + UNLOCK_OCE(&sc->bmbx_lock); =20 return rc; } Index: sys/dev/oce/oce_queue.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/oce/oce_queue.c (revision 257322) +++ sys/dev/oce/oce_queue.c (working copy) @@ -253,12 +253,11 @@ oce_wq *oce_wq_init(POCE_SOFTC sc, uint32_t q_len, goto free_wq; =20 =20 - LOCK_CREATE(&wq->tx_lock, "TX_lock"); + LOCK_CREATE_OCE(&wq->tx_lock, "TX_lock"); =09 #if __FreeBSD_version >=3D 800000 /* Allocate buf ring for multiqueue*/ - wq->br =3D buf_ring_alloc(4096, M_DEVBUF, - M_WAITOK, &wq->tx_lock.mutex); + wq->br =3D drbr_alloc(M_DEVBUF, M_WAITOK, &wq->tx_lock.mutex); if (!wq->br) goto free_wq; #endif @@ -301,9 +300,9 @@ oce_wq_free(struct oce_wq *wq) if (wq->tag !=3D NULL) bus_dma_tag_destroy(wq->tag); if (wq->br !=3D NULL) - buf_ring_free(wq->br, M_DEVBUF); + drbr_free(wq->br, M_DEVBUF); =20 - LOCK_DESTROY(&wq->tx_lock); + LOCK_DESTROY_OCE(&wq->tx_lock); free(wq, M_DEVBUF); } =20 @@ -451,7 +450,7 @@ oce_rq *oce_rq_init(POCE_SOFTC sc, if (!rq->ring) goto free_rq; =20 - LOCK_CREATE(&rq->rx_lock, "RX_lock"); + LOCK_CREATE_OCE(&rq->rx_lock, "RX_lock"); =20 return rq; =20 @@ -493,7 +492,7 @@ oce_rq_free(struct oce_rq *rq) if (rq->tag !=3D NULL) bus_dma_tag_destroy(rq->tag); =20 - LOCK_DESTROY(&rq->rx_lock); + 
LOCK_DESTROY_OCE(&rq->rx_lock); free(rq, M_DEVBUF); } =20 Index: sys/dev/virtio/network/if_vtnet.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/virtio/network/if_vtnet.c (revision 257322) +++ sys/dev/virtio/network/if_vtnet.c (working copy) @@ -57,6 +57,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include =20 #include =20 @@ -685,7 +686,7 @@ vtnet_init_txq(struct vtnet_softc *sc, int id) txq->vtntx_id =3D id; =20 #ifndef VTNET_LEGACY_TX - txq->vtntx_br =3D buf_ring_alloc(VTNET_DEFAULT_BUFRING_SIZE, = M_DEVBUF, + txq->vtntx_br =3D drbr_alloc(M_DEVBUF, M_NOWAIT, &txq->vtntx_mtx); if (txq->vtntx_br =3D=3D NULL) return (ENOMEM); @@ -749,7 +750,7 @@ vtnet_destroy_txq(struct vtnet_txq *txq) =20 #ifndef VTNET_LEGACY_TX if (txq->vtntx_br !=3D NULL) { - buf_ring_free(txq->vtntx_br, M_DEVBUF); + drbr_free(txq->vtntx_br, M_DEVBUF); txq->vtntx_br =3D NULL; } #endif @@ -2211,9 +2212,10 @@ vtnet_txq_mq_start_locked(struct vtnet_txq *txq, = s { struct vtnet_softc *sc; struct virtqueue *vq; - struct buf_ring *br; + struct drbr_ring *br; struct ifnet *ifp; int enq, error; + uint8_t qnum; =20 sc =3D txq->vtntx_sc; vq =3D txq->vtntx_vq; @@ -2239,16 +2241,16 @@ vtnet_txq_mq_start_locked(struct vtnet_txq *txq, = s =20 vtnet_txq_eof(txq); =20 - while ((m =3D drbr_peek(ifp, br)) !=3D NULL) { + while ((m =3D drbr_peek(ifp, br, &qnum)) !=3D NULL) { error =3D vtnet_txq_encap(txq, &m); if (error) { if (m !=3D NULL) - drbr_putback(ifp, br, m); + drbr_putback(ifp, br, m, qnum); else - drbr_advance(ifp, br); + drbr_advance(ifp, br, qnum); break; } - drbr_advance(ifp, br); + drbr_advance(ifp, br, qnum); =20 enq++; ETHER_BPF_MTAP(ifp, m); @@ -2458,7 +2460,6 @@ vtnet_qflush(struct ifnet *ifp) { struct vtnet_softc *sc; struct vtnet_txq *txq; - struct mbuf *m; int i; =20 sc =3D ifp->if_softc; @@ -2467,8 +2468,7 @@ 
vtnet_qflush(struct ifnet *ifp) txq =3D &sc->vtnet_txqs[i]; =20 VTNET_TXQ_LOCK(txq); - while ((m =3D buf_ring_dequeue_sc(txq->vtntx_br)) !=3D = NULL) - m_freem(m); + drbr_flush(ifp, txq->vtntx_br); VTNET_TXQ_UNLOCK(txq); } =20 Index: sys/dev/virtio/network/if_vtnetvar.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/virtio/network/if_vtnetvar.h (revision 257322) +++ sys/dev/virtio/network/if_vtnetvar.h (working copy) @@ -100,7 +100,7 @@ struct vtnet_txq { struct vtnet_softc *vtntx_sc; struct virtqueue *vtntx_vq; #ifndef VTNET_LEGACY_TX - struct buf_ring *vtntx_br; + struct drbr_ring *vtntx_br; #endif int vtntx_id; int vtntx_watchdog; Index: sys/dev/vxge/vxge.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/vxge/vxge.c (revision 257322) +++ sys/dev/vxge/vxge.c (working copy) @@ -31,6 +31,7 @@ /*$FreeBSD$*/ =20 #include +#include =20 static int vxge_pci_bd_no =3D -1; static u32 vxge_drv_copyright =3D 0; @@ -729,7 +730,6 @@ void vxge_mq_qflush(ifnet_t ifp) { int i; - mbuf_t m_head; vxge_vpath_t *vpath; =20 vxge_dev_t *vdev =3D (vxge_dev_t *) ifp->if_softc; @@ -740,9 +740,7 @@ vxge_mq_qflush(ifnet_t ifp) continue; =20 VXGE_TX_LOCK(vpath); - while ((m_head =3D buf_ring_dequeue_sc(vpath->br)) !=3D = NULL) - vxge_free_packet(m_head); - + drbr_flush(ifp, vpath->br); VXGE_TX_UNLOCK(vpath); } if_qflush(ifp); @@ -2294,7 +2292,7 @@ vxge_vpath_open(vxge_dev_t *vdev) break; } #if __FreeBSD_version >=3D 800000 - vpath->br =3D buf_ring_alloc(VXGE_DEFAULT_BR_SIZE, = M_DEVBUF, + vpath->br =3D drbr_alloc(M_DEVBUF, M_WAITOK, &vpath->mtx_tx); if (vpath->br =3D=3D NULL) { err =3D ENOMEM; @@ -2433,7 +2431,7 @@ vxge_vpath_close(vxge_dev_t *vdev) =20 #if 
__FreeBSD_version >=3D 800000 if (vpath->br !=3D NULL) - buf_ring_free(vpath->br, M_DEVBUF); + drbr_free(vpath->br, M_DEVBUF); #endif /* Free LRO memory */ if (vpath->lro_enable) Index: sys/dev/vxge/vxge.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/dev/vxge/vxge.h (revision 257322) +++ sys/dev/vxge/vxge.h (working copy) @@ -337,7 +337,7 @@ typedef struct _vxge_vpath_t { struct lro_ctrl lro; =20 #if __FreeBSD_version >=3D 800000 - struct buf_ring *br; + struct drbr_ring *br; #endif =20 } vxge_vpath_t; Index: sys/kern/kern_mbuf.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/kern/kern_mbuf.c (revision 257322) +++ sys/kern/kern_mbuf.c (working copy) @@ -653,7 +653,6 @@ m_pkthdr_init(struct mbuf *m, int how) m->m_pkthdr.flowid =3D 0; m->m_pkthdr.csum_flags =3D 0; m->m_pkthdr.fibnum =3D 0; - m->m_pkthdr.cosqos =3D 0; m->m_pkthdr.rsstype =3D 0; m->m_pkthdr.l2hlen =3D 0; m->m_pkthdr.l3hlen =3D 0; @@ -661,6 +660,7 @@ m_pkthdr_init(struct mbuf *m, int how) m->m_pkthdr.l5hlen =3D 0; m->m_pkthdr.PH_per.sixtyfour[0] =3D 0; m->m_pkthdr.PH_loc.sixtyfour[0] =3D 0; + m->m_pkthdr.cosqos =3D 0xff; /*drbr_maxq-1;*/ #ifdef MAC /* If the label init fails, fail the alloc */ error =3D mac_mbuf_init(m, how); Index: sys/kern/subr_bufring.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/kern/subr_bufring.c (revision 257322) +++ sys/kern/subr_bufring.c (working copy) @@ -34,6 +34,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include =20 =20 struct buf_ring * @@ -63,3 +64,317 @@ 
buf_ring_free(struct buf_ring *br, struct malloc_t
 {
 	free(br, type);
 }
+
+/*
+ * multi-producer safe lock-free ring buffer enqueue
+ *
+ */
+extern uint32_t panic_on_dup_buf;
+
+int
+buf_ring_mbufon(struct buf_ring *br, void *buf)
+{
+	int i;
+	/* We don't count what the driver is peeking at */
+	for (i = br->br_cons_head; i != br->br_prod_head;
+	     i = ((i + 1) & br->br_cons_mask)) {
+		if(br->br_ring[i] == buf) {
+			return(1);
+		}
+	}
+	return(0);
+}
+
+__attribute__((noinline))
+int
+buf_ring_enqueue(struct buf_ring *br, void *buf)
+{
+	uint32_t prod_head, prod_next;
+	uint32_t cons_tail;
+#ifdef DEBUG_BUFRING
+	int i;
+	critical_enter();
+	mb();
+	for (i = br->br_cons_head; i != br->br_prod_head;
+	     i = ((i + 1) & br->br_cons_mask))
+		if(br->br_ring[i] == buf) {
+			if (panic_on_dup_buf)
+				panic("help br:%p buf:%p", br, buf);
+			critical_exit();
+			return(0);
+		}
+#else
+	critical_enter();
+#endif
+	do {
+		prod_head = br->br_prod_head;
+		cons_tail = br->br_cons_tail;
+
+		prod_next = (prod_head + 1) & br->br_prod_mask;
+
+		if (prod_next == cons_tail) {
+			br->br_drops++;
+			critical_exit();
+			return (ENOBUFS);
+		}
+	} while (!atomic_cmpset_int(&br->br_prod_head, prod_head, prod_next));
+#ifdef DEBUG_BUFRING
+	if (br->br_ring[prod_head] != NULL) {
+		printf("Dangling value in enqueue %d br:%p\n",
+		    prod_head, br);
+	}
+#endif
+	br->br_ring[prod_head] = buf;
+
+	/*
+	 * The full memory barrier also avoids that br_prod_tail store
+	 * is reordered before the br_ring[prod_head] is full setup.
+	 */
+	mb();
+
+	/*
+	 * If there are other enqueues in progress
+	 * that preceded us, we need to wait for them
+	 * to complete
+	 */
+	while (br->br_prod_tail != prod_head)
+		cpu_spinwait();
+	br->br_prod_tail = prod_next;
+	critical_exit();
+	return (0);
+}
+
+/*
+ * multi-consumer safe dequeue
+ *
+ */
+void *
+buf_ring_dequeue_mc(struct buf_ring *br)
+{
+	uint32_t cons_head, cons_next;
+	uint32_t prod_tail;
+	void *buf;
+	int success;
+
+	critical_enter();
+	do {
+		cons_head = br->br_cons_head;
+		prod_tail = br->br_prod_tail;
+
+		cons_next = (cons_head + 1) & br->br_cons_mask;
+
+		if (cons_head == prod_tail) {
+			critical_exit();
+			return (NULL);
+		}
+
+		success = atomic_cmpset_int(&br->br_cons_head, cons_head,
+		    cons_next);
+	} while (success == 0);
+
+	buf = br->br_ring[cons_head];
+#ifdef DEBUG_BUFRING
+	br->br_ring[cons_head] = NULL;
+#endif
+	/*
+	 * The full memory barrier also avoids that br_ring[cons_read]
+	 * load is reordered after br_cons_tail is set.
+	 */
+	mb();
+
+	/*
+	 * If there are other dequeues in progress
+	 * that preceded us, we need to wait for them
+	 * to complete
+	 */
+	while (br->br_cons_tail != cons_head)
+		cpu_spinwait();
+
+	br->br_cons_tail = cons_next;
+	critical_exit();
+
+	return (buf);
+}
+
+/*
+ * single-consumer dequeue
+ * use where dequeue is protected by a lock
+ * e.g.
a network driver's tx queue lock
+ */
+void *
+buf_ring_dequeue_sc(struct buf_ring *br)
+{
+	uint32_t cons_head, cons_next, cons_next_next;
+	uint32_t prod_tail;
+	void *buf;
+
+	cons_head = br->br_cons_head;
+	prod_tail = br->br_prod_tail;
+
+	cons_next = (cons_head + 1) & br->br_cons_mask;
+	cons_next_next = (cons_head + 2) & br->br_cons_mask;
+
+	if (cons_head == prod_tail)
+		return (NULL);
+
+#ifdef PREFETCH_DEFINED
+	if (cons_next != prod_tail) {
+		prefetch(br->br_ring[cons_next]);
+		if (cons_next_next != prod_tail)
+			prefetch(br->br_ring[cons_next_next]);
+	}
+#endif
+	br->br_cons_head = cons_next;
+	buf = br->br_ring[cons_head];
+
+#ifdef DEBUG_BUFRING
+	br->br_ring[cons_head] = NULL;
+#endif
+	br->br_cons_tail = cons_next;
+	return (buf);
+}
+
+/*
+ * single-consumer advance after a peek
+ * use where it is protected by a lock
+ * e.g. a network driver's tx queue lock
+ */
+void
+buf_ring_advance_sc(struct buf_ring *br)
+{
+	uint32_t cons_head, cons_next;
+	uint32_t prod_tail;
+
+	cons_head = br->br_cons_head;
+	prod_tail = br->br_prod_tail;
+
+	cons_next = (cons_head + 1) & br->br_cons_mask;
+	if (cons_head == prod_tail)
+		return;
+	br->br_cons_head = cons_next;
+	br->br_cons_tail = cons_next;
+}
+
+void
+buf_ring_advance_mc(struct buf_ring *br)
+{
+	uint32_t cons_head, cons_next;
+	uint32_t prod_tail;
+	int success;
+
+	critical_enter();
+	do {
+		cons_head = br->br_cons_head;
+		prod_tail = br->br_prod_tail;
+
+		cons_next = (cons_head + 1) & br->br_cons_mask;
+
+		if (cons_head == prod_tail) {
+			critical_exit();
+			return;
+		}
+
+		success = atomic_cmpset_int(&br->br_cons_head, cons_head,
+		    cons_next);
+	} while (success == 0);
+	/*
+	 * The full memory barrier also avoids that br_ring[cons_read]
+	 * load is reordered after br_cons_tail is set.
+	 */
+	mb();
+
+	/*
+	 * If there are other dequeues in progress
+	 * that preceded us, we need to wait for them
+	 * to complete
+	 */
+	while (br->br_cons_tail != cons_head)
+		cpu_spinwait();
+
+	br->br_cons_tail = cons_next;
+	critical_exit();
+}
+
+
+/*
+ * Used to return a buffer (most likely already there)
+ * to the top of the ring. The caller should *not*
+ * have used any dequeue to pull it out of the ring
+ * but instead should have used the peek() function.
+ * This is normally used where the transmit queue
+ * of a driver is full, and an mbuf must be returned.
+ * Most likely what's in the ring-buffer is what
+ * is being put back (since it was not removed), but
+ * sometimes the lower transmit function may have
+ * done a pullup or other function that will have
+ * changed it. As an optimization we always put it
+ * back (since jhb says the store is probably cheaper),
+ * if we have to do a multi-queue version we will need
+ * the compare and an atomic.
+ */
+void
+buf_ring_putback_mc(struct buf_ring *br, void *new)
+{
+	KASSERT(br->br_cons_head != br->br_prod_tail,
+	    ("Buf-Ring has none in putback")) ;
+	critical_enter();
+	br->br_ring[br->br_cons_head] = new;
+	mb();
+	critical_exit();
+}
+
+void
+buf_ring_putback_sc(struct buf_ring *br, void *new)
+{
+	KASSERT(br->br_cons_head != br->br_prod_tail,
+	    ("Buf-Ring has none in putback")) ;
+	br->br_ring[br->br_cons_head] = new;
+}
+
+/*
+ * return a pointer to the first entry in the ring
+ * without modifying it, or NULL if the ring is empty
+ * race-prone if not protected by a lock
+ */
+void *
+buf_ring_peek(struct buf_ring *br)
+{
+	struct mbuf *m;
+#ifdef DEBUG_BUFRING
+	if ((br->br_lock != NULL) && !mtx_owned(br->br_lock)) {
+		printf("br:%p lock not held on single consumer dequeue\n",
+		    br);
+	}
+
+#endif
+	if (br->br_cons_head == br->br_prod_tail)
+		return (NULL);
+	m = br->br_ring[br->br_cons_head];
+#ifdef DEBUG_BUFRING
+	br->br_ring[br->br_cons_head] = NULL;
+	mb();
+#endif
+	return (m);
+}
+
+int
+buf_ring_full(struct buf_ring *br)
+{
+
+	return (((br->br_prod_head + 1) & br->br_prod_mask) == br->br_cons_tail);
+}
+
+int
+buf_ring_empty(struct buf_ring *br)
+{
+
+	return (br->br_cons_head == br->br_prod_tail);
+}
+
+int
+buf_ring_count(struct buf_ring *br)
+{
+
+	return ((br->br_prod_size + br->br_prod_tail - br->br_cons_tail)
+	    & br->br_prod_mask);
+}
Index: sys/kern/subr_bus.c
===================================================================
--- sys/kern/subr_bus.c	(revision 257322)
+++ sys/kern/subr_bus.c	(working copy)
@@ -2722,7 +2722,7 @@ device_probe(device_t dev)
 	}
 	return (0);
 }
-
+uint32_t simp_bus_debug=0;
 /**
  * @brief Probe a device and attach a driver if possible
  *
@@ -2742,6 +2742,11 @@ device_probe_and_attach(device_t dev)
 		return (error);
 
 	CURVNET_SET_QUIET(vnet0);
+	if (simp_bus_debug) {
+		printf("%s:Attach for device 0x%x\n",
+		    __FUNCTION__,
+		    (uint32_t)dev);
+	}
 	error = device_attach(dev);
 	CURVNET_RESTORE();
 	return error;
@@ -2778,12 +2783,20 @@ device_attach(device_t dev)
 			device_printf(dev, "disabled via hints entry\n");
 		return (ENXIO);
 	}
-
+	if (simp_bus_debug) {
+		device_printf(dev, "init its sysctl info\n");
+	}
 	device_sysctl_init(dev);
 	if (!device_is_quiet(dev))
 		device_print_child(dev->parent, dev);
 	attachtime = get_cyclecount();
 	dev->state = DS_ATTACHING;
+	if (simp_bus_debug) {
+		device_printf(dev, "Calling attach\n");
+	}
+	if (simp_bus_debug) {
+		device_printf(dev, "call the attach\n");
+	}
 	if ((error = DEVICE_ATTACH(dev)) != 0) {
 		printf("device_attach: %s%d attach returned %d\n",
 		    dev->driver->name, dev->unit, error);
@@ -2812,6 +2825,9 @@ device_attach(device_t dev)
 	else
 		dev->state = DS_ATTACHED;
 	dev->flags &= ~DF_DONENOMATCH;
+	if (simp_bus_debug) {
+		device_printf(dev, "finish out...\n");
+	}
 	devadded(dev);
 	return (0);
} Index: sys/net/drbr.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/net/drbr.c (revision 0) +++ sys/net/drbr.c (working copy) @@ -0,0 +1,507 @@ +#include + +SYSCTL_DECL(_net_link); +uint32_t drbr_maxq=3DDRBR_MAXQ_DEFAULT; + +TUNABLE_INT("net.link.drbr_maxq", &drbr_maxq); +SYSCTL_NODE(_net, OID_AUTO, drbr, CTLFLAG_RD, 0, "DRBR Parameters"); +SYSCTL_INT(_net_drbr, OID_AUTO, drbr_maxq, CTLFLAG_RDTUN, + &drbr_maxq, 0, "max number of priority queues per interface"); + +uint8_t set_up_drbr_depth=3D0; +uint32_t drbr_max_priority=3DDRBR_MAXQ_DEFAULT-1; +uint32_t drbr_queue_depth=3DDRBR_MIN_DEPTH; +uint32_t panic_on_dup_buf =3D 0; +uint32_t use_drbr_lock =3D 0; + +SYSCTL_INT(_net_drbr, OID_AUTO, drbr_queue_depth, CTLFLAG_RD, + &drbr_queue_depth, 0, "Queue length configed via ifqmaxlen"); + +SYSCTL_INT(_net_drbr, OID_AUTO, drbr_max_priority, CTLFLAG_RD, + &drbr_max_priority, 0, "Queue length configed via ifqmaxlen"); + +SYSCTL_INT(_net_drbr, OID_AUTO, drbr_panicdup, CTLFLAG_RW, + &panic_on_dup_buf, 0, "Panic on dup buf into br ring"); + +SYSCTL_INT(_net_drbr, OID_AUTO, drbr_usemtx, CTLFLAG_RW, + &use_drbr_lock, 0, "Use drbr mtx"); + +struct drbr_ring * +drbr_alloc(struct malloc_type *type, int flags, struct mtx *tmtx) +{ + struct drbr_ring *rng; + int i; + if (set_up_drbr_depth =3D=3D 0) { + drbr_max_priority =3D drbr_maxq-1; + set_up_drbr_depth =3D 1; + drbr_queue_depth =3D 1 << ((fls(ifqmaxlen)-1)); + if (drbr_queue_depth < DRBR_MIN_DEPTH) { + drbr_queue_depth =3D DRBR_MIN_DEPTH; + } + } + rng =3D (struct drbr_ring *)malloc(sizeof(struct drbr_ring), = type, flags); + if (rng =3D=3D NULL) { + return(NULL); + } + memset(rng, 0, sizeof(struct drbr_ring)); + DRBR_LOCK_INIT(rng); + rng->re =3D (struct drbr_ring_entry *)malloc((sizeof(struct = drbr_ring_entry)*drbr_maxq),=20 + type, flags); + if (rng->re =3D=3D 
NULL) { + free(rng, type); + return(NULL); + } + memset(rng->re, 0, (sizeof(struct drbr_ring_entry) * = drbr_maxq)); + /* Ok get the queues */ + for (i=3D0; ire[i].re_qs =3D buf_ring_alloc(drbr_queue_depth, = type, flags, tmtx); + if (rng->re[i].re_qs =3D=3D NULL) { + goto out_err; + } + } + rng->lowq_with_data =3D 0xffffffff; + return(rng); +out_err: + for(i=3D0; ire[i].re_qs) { + free(rng->re[i].re_qs, type); + } + } + free(rng->re, type); + free(rng, type); + return (NULL); +} + +#define PRIO_NAME_LEN 32 +void=20 +drbr_add_sysctl_stats(device_t dev, struct sysctl_oid_list *queue_list,=20= + struct drbr_ring *rng) +{ + int i; + struct sysctl_ctx_list *ctx =3D device_get_sysctl_ctx(dev); + struct sysctl_oid *prio_node; + struct sysctl_oid_list *prio_list; + char namebuf[PRIO_NAME_LEN]; + + if (rng =3D=3D NULL) + /* TSNH */ + return; + for (i=3D0; ire[i].re_cnt_sent, + "Packets Enqueued"); + SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, "bytes_sent", + CTLFLAG_RD, &rng->re[i].re_bytecnt_sent, + "Bytes Enqueued"); + SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, = "dropped_packets", + CTLFLAG_RD, &rng->re[i].re_drop_cnt, + "Packets Dropped"); + SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, = "dropped_bytes", + CTLFLAG_RD, &rng->re[i].re_bytedrop_cnt, + "Bytes Dropped"); + SYSCTL_ADD_UINT(ctx, prio_list, OID_AUTO, = "on_queue_now", + CTLFLAG_RD, &rng->re[i].re_cnt, 0, + "Current Queue Size"); + + } + +} + +u_long +drbr_get_dropcnt(struct drbr_ring *rng) +{ + u_long total; + int i; + + total =3D 0; + for (i=3D0; ire[i].re_drop_cnt; + } + return (total); +} + +void=20 +drbr_add_sysctl_stats_nodev(struct sysctl_oid_list *queue_list,=20 + struct sysctl_ctx_list *ctx, + struct drbr_ring *rng) +{ + int i; + struct sysctl_oid *prio_node; + struct sysctl_oid_list *prio_list; + char namebuf[PRIO_NAME_LEN]; + + if (rng =3D=3D NULL) + return; + for (i=3D0; ire[i].re_cnt_sent, + "Packets Enqueued"); + SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, "bytes_sent", + CTLFLAG_RD, 
&rng->re[i].re_bytecnt_sent, + "Bytes Enqueued"); + SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, = "dropped_packets", + CTLFLAG_RD, &rng->re[i].re_drop_cnt, + "Packets Dropped"); + SYSCTL_ADD_QUAD(ctx, prio_list, OID_AUTO, = "dropped_bytes", + CTLFLAG_RD, &rng->re[i].re_bytedrop_cnt, + "Bytes Dropped"); + SYSCTL_ADD_UINT(ctx, prio_list, OID_AUTO, = "on_queue_now", + CTLFLAG_RD, &rng->re[i].re_cnt, 0, + "Current Queue Size"); + } +} + +int +drbr_enqueue(struct ifnet *ifp, struct drbr_ring *rng, struct mbuf *m) +{=09 + int error =3D 0; + uint8_t qused; + uint64_t bytecnt; + int locked =3D 0; + +#ifdef ALTQ + if ((ifp !=3D NULL) &&=20 + (ALTQ_IS_ENABLED(&ifp->if_snd))) { + IFQ_ENQUEUE(&ifp->if_snd, m, error); + return (error); + } +#endif + if (use_drbr_lock) { + DRBR_LOCK(rng); + locked =3D 1; + } + if (m->m_pkthdr.cosqos >=3D drbr_maxq) { + /* Lowest priority queue */ + qused =3D drbr_maxq - 1; + } else { + qused =3D m->m_pkthdr.cosqos; + } + bytecnt =3D m->m_pkthdr.len; + error =3D buf_ring_enqueue(rng->re[qused].re_qs, m); + if (error) { + m_freem(m); + atomic_add_long(&rng->re[qused].re_drop_cnt, 1); + atomic_add_long(&rng->re[qused].re_bytedrop_cnt, = bytecnt); + } else { + if (qused < rng->lowq_with_data) { + atomic_clear_int(&rng->lowq_with_data, = 0xffffffff); + atomic_set_int(&rng->lowq_with_data, qused); + } + atomic_add_int(&rng->count_on_queues, 1); + atomic_add_int(&rng->re[qused].re_cnt, 1); + atomic_add_long(&rng->re[qused].re_cnt_sent, 1); + atomic_add_long(&rng->re[qused].re_bytecnt_sent, = bytecnt); + } + if (locked) { + DRBR_UNLOCK(rng); + } + return (error); +} + +int +drbr_is_on_ring(struct drbr_ring *rng, struct mbuf *m) +{ + int locked =3D 0; + int answer =3D 0; /* No its not by default */ + int i; + if (use_drbr_lock) { + DRBR_LOCK(rng); + locked =3D 1; + } + for(i=3D0; ire[i].re_qs)) + continue; + if (buf_ring_mbufon(rng->re[i].re_qs, m)) { + answer =3D 1; + break; + } + }=09 + if (locked) { + DRBR_UNLOCK(rng); + } + return(answer); +} + +void 
+drbr_putback(struct ifnet *ifp, struct drbr_ring *rng, struct mbuf = *new, uint8_t qused) +{ + /* + * The top of the list needs to be swapped=20 + * for this one. + */ + int locked =3D 0; + if (use_drbr_lock) { + DRBR_LOCK(rng); + locked =3D 1; + } + buf_ring_putback_mc(rng->re[qused].re_qs, new); + if (locked) { + DRBR_UNLOCK(rng); + } +} + +struct mbuf * +drbr_peek(struct ifnet *ifp, struct drbr_ring *rng, uint8_t *qused) +{ + int i; + int locked =3D 0; + struct mbuf *m; + if (use_drbr_lock) { + DRBR_LOCK(rng); + locked =3D 1; + } + if (rng->count_on_queues =3D=3D 0) { + /* All done now */ + if (locked) { + DRBR_UNLOCK(rng); + } + return (NULL); + } + if (rng->lowq_with_data =3D=3D 0xffffffff) { + rng->lowq_with_data =3D 0; + } + for(i=3Drng->lowq_with_data; ire[i].re_qs)) + continue; + rng->lowq_with_data =3D i; + break; + } + if (i >=3D drbr_maxq) { + /* Huh? */ + rng->lowq_with_data =3D 0; + for (i=3Drng->lowq_with_data; ire[i].re_qs)) + continue; + rng->lowq_with_data =3D i; + break; + } + if (i >=3D drbr_maxq) { + /* Really huh? 
*/ + rng->count_on_queues =3D 0; + if (locked) { + DRBR_UNLOCK(rng); + } + return (NULL); + } + } + *qused =3D i; + m =3D buf_ring_peek(rng->re[i].re_qs); + if (locked) { + DRBR_UNLOCK(rng); + } + return(m); +} + +static void +drbr_flush_locked(struct ifnet *ifp, struct drbr_ring *rng) +{ + int i; + struct mbuf *m; + int locked =3D 0; + if (use_drbr_lock) { + DRBR_LOCK(rng); + locked =3D 1; + } + if (rng =3D=3D NULL) { + return; + } + for(i=3D0; ire[i].re_qs)) !=3D = NULL) { + atomic_subtract_long(&rng->re[i].re_cnt_sent, = 1); + if (ifp) { + ifp->if_oerrors++; + } + m_freem(m); + } + rng->re[i].re_cnt =3D 0; + } + rng->lowq_with_data =3D 0xffffffff; + rng->count_on_queues =3D 0; + if (locked) { + DRBR_UNLOCK(rng); + } +} + +void +drbr_flush(struct ifnet *ifp, struct drbr_ring *rng) +{ + drbr_flush_locked(ifp, rng); +} + +void +drbr_free(struct drbr_ring *rng, struct malloc_type *type) +{ + int i; + int locked =3D 0; + if (rng =3D=3D NULL) { + return; + } + drbr_flush_locked(NULL, rng); + + if (use_drbr_lock) { + DRBR_LOCK(rng); + locked =3D 1; + } + for(i=3D0; ire[i].re_qs) { + buf_ring_free(rng->re[i].re_qs, type); + } + } + DRBR_LOCK_DESTROY(rng); + free(rng->re, type); + free(rng, type); +} + +struct mbuf * +drbr_dequeue(struct ifnet *ifp, struct drbr_ring *rng) +{ + int i; + struct mbuf *m; + int locked =3D 0; + if (use_drbr_lock) { + DRBR_LOCK(rng); + locked =3D 1; + } + if (rng->count_on_queues =3D=3D 0) { + if (locked) { + DRBR_UNLOCK(rng); + } + return (NULL); + } + if (rng->lowq_with_data =3D=3D 0xffffffff) { + rng->lowq_with_data =3D 0; + } + for(i=3Drng->lowq_with_data; ire[i].re_qs)) + continue; + rng->lowq_with_data =3D i; + break; + } +#ifdef INVARIANT + if (i >=3D drbr_maxq) { + /* Nothing on ring from marker up? */ + rng->lowq_with_data =3D 0; + for (i=3Drng->lowq_with_data; ire[i].re_qs)) + continue; + rng->lowq_with_data =3D i; + break; + } + if (i >=3D drbr_maxq) { + /* Count was off? 
*/ + rng->count_on_queues =3D 0; + if (locked) { + DRBR_UNLOCK(rng); + } + return (NULL); + } + } +#else + if (i >=3D drbr_maxq) { + /* Huh */ + i =3D 0; + } +#endif + m =3D buf_ring_dequeue_mc(rng->re[i].re_qs); + if (m) { + atomic_subtract_int(&rng->re[i].re_cnt, 1); + atomic_subtract_int(&rng->count_on_queues, 1); + if (rng->count_on_queues =3D=3D 0) { + atomic_set_int(&rng->lowq_with_data, = 0xffffffff); + } + } else { + /* TSNH */ + rng->re[i].re_cnt =3D 0; + } + if (locked) { + DRBR_UNLOCK(rng); + } + return(m); +} + +void +drbr_advance(struct ifnet *ifp, struct drbr_ring *rng, uint8_t qused) +{ + int locked =3D 0; + if (use_drbr_lock) { + DRBR_LOCK(rng); + locked =3D 1; + } + if (rng->count_on_queues =3D=3D 0) { + /* Huh? */ + if (locked) { + DRBR_UNLOCK(rng); + } + return; + } + atomic_subtract_int(&rng->count_on_queues, 1); + if (rng->count_on_queues =3D=3D 0) { + atomic_set_int(&rng->lowq_with_data, 0xffffffff); + } + buf_ring_advance_mc(rng->re[qused].re_qs); + atomic_subtract_int(&rng->re[qused].re_cnt, 1); + if (locked) { + DRBR_UNLOCK(rng); + } +} + +struct mbuf * +drbr_dequeue_cond(struct ifnet *ifp, struct drbr_ring *rng, + int (*func) (struct mbuf *, void *), void *arg)=20 +{ + uint8_t qused; + struct mbuf *m; + int locked =3D 0; + if (use_drbr_lock) { + DRBR_LOCK(rng); + locked =3D 1; + } + qused =3D 0; + m =3D drbr_peek(ifp, rng, &qused); + if (locked) { + DRBR_UNLOCK(rng); + } + if (m =3D=3D NULL || func(m, arg) =3D=3D 0) { + return (NULL); + } + if (use_drbr_lock) { + DRBR_LOCK(rng); + locked =3D 1; + } else { + locked =3D 0; + } + atomic_subtract_int(&rng->re[qused].re_cnt, 1); + atomic_subtract_int(&rng->count_on_queues, 1); + m =3D buf_ring_dequeue_mc(rng->re[qused].re_qs); + if (locked) { + DRBR_UNLOCK(rng); + } + return (m); +} + +int +drbr_empty(struct ifnet *ifp, struct drbr_ring *rng) +{ + return (!rng->count_on_queues); +} + +int +drbr_needs_enqueue(struct ifnet *ifp, struct drbr_ring *rng) +{ + return (!(rng->count_on_queues =3D=3D 
0));
+}
+
+int
+drbr_inuse(struct ifnet *ifp, struct drbr_ring *rng)
+{
+	return (rng->count_on_queues);
+}

Property changes on: sys/net/drbr.c
___________________________________________________________________
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Index: sys/net/drbr.h
===================================================================
--- sys/net/drbr.h	(revision 0)
+++ sys/net/drbr.h	(working copy)
@@ -0,0 +1,89 @@
+#ifndef __drbr_h__
+#define __drbr_h__
+#include
+#ifdef _KERNEL
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#endif
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define DRBR_MAXQ_DEFAULT 8
+#define DRBR_MIN_DEPTH 64 /* Must be power of 2 */
+
+#define USE_LOCK
+
+#ifdef _KERNEL
+extern uint32_t drbr_maxq;
+#endif
+
+struct drbr_ring_entry {
+	struct buf_ring *re_qs;		/* Ring itself */
+	u_long re_drop_cnt;		/* Drop count in pkts */
+	u_long re_bytedrop_cnt;/* Drop count in bytes */
+	u_long re_cnt_sent;		/* Total sent in pkts */
+	u_long re_bytecnt_sent;/* Total sent in bytes */
+	uint32_t re_cnt;		/* Count on ring */
+};
+
+#define DRBR_LOCK_INIT(rng) mtx_init(&(rng)->rng_mtx, "drbr_lock", "drbr", MTX_DEF | MTX_DUPOK)
+#define DRBR_LOCK_DESTROY(rng) mtx_destroy(&(rng)->rng_mtx)
+#define DRBR_LOCK(rng) mtx_lock(&(rng)->rng_mtx)
+#define DRBR_UNLOCK(rng) mtx_unlock(&(rng)->rng_mtx)
+#define DRBR_LOCK_OWNED(rng) mtx_owned(&(rng)->rng_mtx)
+
+struct drbr_ring {
+#ifdef _KERNEL
+	struct mtx rng_mtx;
+#endif
+	struct drbr_ring_entry *re;
+	uint32_t count_on_queues;
+	uint32_t lowq_with_data;
+};
+
+#ifdef _KERNEL
+struct
drbr_ring * +drbr_alloc(struct malloc_type *type, int flags, struct mtx *tmtx); +int drbr_enqueue(struct ifnet *ifp, struct drbr_ring *rng, struct mbuf = *m); +void drbr_putback(struct ifnet *ifp, struct drbr_ring *rng, struct mbuf = *new,=20 + uint8_t qused); +struct mbuf *drbr_peek(struct ifnet *ifp, struct drbr_ring *rng, + uint8_t *qused); +void drbr_flush(struct ifnet *ifp, struct drbr_ring *rng); +void drbr_free(struct drbr_ring *rng, struct malloc_type *type); +struct mbuf *drbr_dequeue(struct ifnet *ifp, struct drbr_ring *rng); +void drbr_advance(struct ifnet *ifp, struct drbr_ring *rng, uint8_t = qused); +struct mbuf * +drbr_dequeue_cond(struct ifnet *ifp, struct drbr_ring *rng, + int (*func) (struct mbuf *, void *), void *arg) ; +int drbr_empty(struct ifnet *ifp, struct drbr_ring *rng); +int drbr_needs_enqueue(struct ifnet *ifp, struct drbr_ring *rng); +int drbr_inuse(struct ifnet *ifp, struct drbr_ring *rng); +void drbr_add_sysctl_stats(device_t dev, struct sysctl_oid_list = *queue_list,=20 + struct drbr_ring *rng); +void=20 +drbr_add_sysctl_stats_nodev(struct sysctl_oid_list *queue_list,=20 + struct sysctl_ctx_list *ctx, + struct drbr_ring *rng); + +int drbr_is_on_ring(struct drbr_ring *rng, struct mbuf *m); +u_long drbr_get_dropcnt(struct drbr_ring *rng); + +#endif + +#endif Property changes on: sys/net/drbr.h ___________________________________________________________________ Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=3D%H \ No newline at end of property Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Index: sys/net/if_var.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/net/if_var.h (revision 257322) +++ sys/net/if_var.h (working copy) @@ -205,7 +205,14 @@ struct ifnet { */ char 
if_cspare[3]; int if_ispare[4]; - void *if_pspare[8]; /* 1 netmap, 7 TDB */ + /* Set max bytes on ring - buffer bloat managment */ + void (*if_maxbytes)(struct ifnet *, uint64_t maxbytes); + /* Get a drbr ring to peak at */ + struct drbr_ring * (*if_getdrbr_ring)(struct ifnet *, int = queuenum); + /* Is this mbuf on one of your rings? */ + int (*if_mbuf_on_ring)(struct ifnet *, struct mbuf *); + + void *if_pspare[5]; /* 1 netmap, 4 TDB */ }; =20 /* @@ -599,165 +606,7 @@ if_initbaudrate(struct ifnet *ifp, uintmax_t baud) ifp->if_baudrate =3D baud; } =20 -static __inline int -drbr_enqueue(struct ifnet *ifp, struct buf_ring *br, struct mbuf *m) -{=09 - int error =3D 0; - -#ifdef ALTQ - if (ALTQ_IS_ENABLED(&ifp->if_snd)) { - IFQ_ENQUEUE(&ifp->if_snd, m, error); - return (error); - } #endif - error =3D buf_ring_enqueue(br, m); - if (error) - m_freem(m); - - return (error); -} - -static __inline void -drbr_putback(struct ifnet *ifp, struct buf_ring *br, struct mbuf *new) -{ - /* - * The top of the list needs to be swapped=20 - * for this one. - */ -#ifdef ALTQ - if (ifp !=3D NULL && ALTQ_IS_ENABLED(&ifp->if_snd)) { - /*=20 - * Peek in altq case dequeued it - * so put it back. - */ - IFQ_DRV_PREPEND(&ifp->if_snd, new); - return; - } -#endif - buf_ring_putback_sc(br, new); -} - -static __inline struct mbuf * -drbr_peek(struct ifnet *ifp, struct buf_ring *br) -{ -#ifdef ALTQ - struct mbuf *m; - if (ifp !=3D NULL && ALTQ_IS_ENABLED(&ifp->if_snd)) { - /*=20 - * Pull it off like a dequeue - * since drbr_advance() does nothing - * for altq and drbr_putback() will - * use the old prepend function. 
- */ - IFQ_DEQUEUE(&ifp->if_snd, m); - return (m); - } -#endif - return(buf_ring_peek(br)); -} - -static __inline void -drbr_flush(struct ifnet *ifp, struct buf_ring *br) -{ - struct mbuf *m; - -#ifdef ALTQ - if (ifp !=3D NULL && ALTQ_IS_ENABLED(&ifp->if_snd)) - IFQ_PURGE(&ifp->if_snd); -#endif=09 - while ((m =3D buf_ring_dequeue_sc(br)) !=3D NULL) - m_freem(m); -} - -static __inline void -drbr_free(struct buf_ring *br, struct malloc_type *type) -{ - - drbr_flush(NULL, br); - buf_ring_free(br, type); -} - -static __inline struct mbuf * -drbr_dequeue(struct ifnet *ifp, struct buf_ring *br) -{ -#ifdef ALTQ - struct mbuf *m; - - if (ifp !=3D NULL && ALTQ_IS_ENABLED(&ifp->if_snd)) {=09 - IFQ_DEQUEUE(&ifp->if_snd, m); - return (m); - } -#endif - return (buf_ring_dequeue_sc(br)); -} - -static __inline void -drbr_advance(struct ifnet *ifp, struct buf_ring *br) -{ -#ifdef ALTQ - /* Nothing to do here since peek dequeues in altq case */ - if (ifp !=3D NULL && ALTQ_IS_ENABLED(&ifp->if_snd)) - return; -#endif - return (buf_ring_advance_sc(br)); -} - - -static __inline struct mbuf * -drbr_dequeue_cond(struct ifnet *ifp, struct buf_ring *br, - int (*func) (struct mbuf *, void *), void *arg)=20 -{ - struct mbuf *m; -#ifdef ALTQ - if (ALTQ_IS_ENABLED(&ifp->if_snd)) { - IFQ_LOCK(&ifp->if_snd); - IFQ_POLL_NOLOCK(&ifp->if_snd, m); - if (m !=3D NULL && func(m, arg) =3D=3D 0) { - IFQ_UNLOCK(&ifp->if_snd); - return (NULL); - } - IFQ_DEQUEUE_NOLOCK(&ifp->if_snd, m); - IFQ_UNLOCK(&ifp->if_snd); - return (m); - } -#endif - m =3D buf_ring_peek(br); - if (m =3D=3D NULL || func(m, arg) =3D=3D 0) - return (NULL); - - return (buf_ring_dequeue_sc(br)); -} - -static __inline int -drbr_empty(struct ifnet *ifp, struct buf_ring *br) -{ -#ifdef ALTQ - if (ALTQ_IS_ENABLED(&ifp->if_snd)) - return (IFQ_IS_EMPTY(&ifp->if_snd)); -#endif - return (buf_ring_empty(br)); -} - -static __inline int -drbr_needs_enqueue(struct ifnet *ifp, struct buf_ring *br) -{ -#ifdef ALTQ - if (ALTQ_IS_ENABLED(&ifp->if_snd)) 
- return (1); -#endif - return (!buf_ring_empty(br)); -} - -static __inline int -drbr_inuse(struct ifnet *ifp, struct buf_ring *br) -{ -#ifdef ALTQ - if (ALTQ_IS_ENABLED(&ifp->if_snd)) - return (ifp->if_snd.ifq_len); -#endif - return (buf_ring_count(br)); -} -#endif /* * 72 was chosen below because it is the size of a TCP/IP * header (40) + the minimum mss (32). Index: sys/netinet/if_ether.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/netinet/if_ether.c (revision 257322) +++ sys/netinet/if_ether.c (working copy) @@ -283,6 +283,7 @@ arprequest(struct ifnet *ifp, const struct in_addr sa.sa_len =3D 2; m->m_flags |=3D M_BCAST; m_clrprotoflags(m); /* Avoid confusing lower layers. */ + m->m_pkthdr.cosqos =3D 0; /* Highest Priority */ (*ifp->if_output)(ifp, m, &sa, NULL); ARPSTAT_INC(txrequests); } Index: sys/ofed/drivers/net/mlx4/en_tx.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/ofed/drivers/net/mlx4/en_tx.c (revision 257322) +++ sys/ofed/drivers/net/mlx4/en_tx.c (working copy) @@ -39,6 +39,7 @@ =20 #include #include +#include #include =20 #include @@ -78,7 +79,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *pr mtx_init(&ring->comp_lock.m, "mlx4 comp", NULL, MTX_DEF); =20 /* Allocate the buf ring */ - ring->br =3D buf_ring_alloc(MLX4_EN_DEF_TX_QUEUE_SIZE, M_DEVBUF, + ring->br =3D drbr_alloc(M_DEVBUF, M_WAITOK, &ring->tx_lock.m); if (ring->br =3D=3D NULL) { en_err(priv, "Failed allocating tx_info ring\n"); @@ -155,7 +156,7 @@ err_bounce: kfree(ring->bounce_buf); ring->bounce_buf =3D NULL; err_tx: - buf_ring_free(ring->br, M_DEVBUF); + drbr_free(ring->br, M_DEVBUF); kfree(ring->tx_info); ring->tx_info =3D NULL; return err; @@ -167,7 
+168,7 @@ void mlx4_en_destroy_tx_ring(struct mlx4_en_priv * struct mlx4_en_dev *mdev =3D priv->mdev; en_dbg(DRV, priv, "Destroying tx ring, qpn: %d\n", ring->qpn); =20 - buf_ring_free(ring->br, M_DEVBUF); + drbr_free(ring->br, M_DEVBUF); if (ring->bf_enabled) mlx4_bf_free(mdev->dev, &ring->bf); mlx4_qp_remove(mdev->dev, &ring->qp); @@ -925,6 +926,7 @@ mlx4_en_transmit_locked(struct ifnet *dev, int tx_ struct mlx4_en_tx_ring *ring; struct mbuf *next; int enqueued, err =3D 0; + uint8_t queue; =20 ring =3D &priv->tx_ring[tx_ind]; if ((dev->if_drv_flags & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) !=3D= @@ -940,16 +942,16 @@ mlx4_en_transmit_locked(struct ifnet *dev, int tx_ return (err); } /* Process the queue */ - while ((next =3D drbr_peek(dev, ring->br)) !=3D NULL) { + while ((next =3D drbr_peek(dev, ring->br, &queue)) !=3D NULL) { if ((err =3D mlx4_en_xmit(dev, tx_ind, &next)) !=3D 0) { if (next =3D=3D NULL) { - drbr_advance(dev, ring->br); + drbr_advance(dev, ring->br, queue); } else { - drbr_putback(dev, ring->br, next); + drbr_putback(dev, ring->br, next, = queue); } break; } - drbr_advance(dev, ring->br); + drbr_advance(dev, ring->br, queue); enqueued++; dev->if_obytes +=3D next->m_pkthdr.len; if (next->m_flags & M_MCAST) @@ -1027,12 +1029,10 @@ mlx4_en_qflush(struct ifnet *dev) { struct mlx4_en_priv *priv =3D netdev_priv(dev); struct mlx4_en_tx_ring *ring =3D priv->tx_ring; - struct mbuf *m; =20 for (int i =3D 0; i < priv->tx_ring_num; i++, ring++) { spin_lock(&ring->tx_lock); - while ((m =3D buf_ring_dequeue_sc(ring->br)) !=3D NULL) - m_freem(m); + drbr_flush(dev, ring->br); spin_unlock(&ring->tx_lock); } if_qflush(dev); Index: sys/ofed/drivers/net/mlx4/mlx4_en.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/ofed/drivers/net/mlx4/mlx4_en.h (revision 257322) +++ sys/ofed/drivers/net/mlx4/mlx4_en.h 
(working copy) @@ -285,7 +285,7 @@ struct mlx4_en_tx_ring { void *buf; u16 poll_cnt; int blocked; - struct buf_ring *br; + struct drbr_ring *br; struct mlx4_en_tx_info *tx_info; u8 *bounce_buf; u32 last_nr_txbb; Index: sys/sys/buf_ring.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/sys/buf_ring.h (revision 257322) +++ sys/sys/buf_ring.h (working copy) @@ -61,176 +61,25 @@ struct buf_ring { * multi-producer safe lock-free ring buffer enqueue * */ -static __inline int -buf_ring_enqueue(struct buf_ring *br, void *buf) -{ - uint32_t prod_head, prod_next; - uint32_t cons_tail; -#ifdef DEBUG_BUFRING - int i; - for (i =3D br->br_cons_head; i !=3D br->br_prod_head; - i =3D ((i + 1) & br->br_cons_mask)) - if(br->br_ring[i] =3D=3D buf) - panic("buf=3D%p already enqueue at %d prod=3D%d = cons=3D%d", - buf, i, br->br_prod_tail, br->br_cons_tail); -#endif=09 - critical_enter(); - do { - prod_head =3D br->br_prod_head; - cons_tail =3D br->br_cons_tail; - - prod_next =3D (prod_head + 1) & br->br_prod_mask; - =09 - if (prod_next =3D=3D cons_tail) { - br->br_drops++; - critical_exit(); - return (ENOBUFS); - } - } while (!atomic_cmpset_int(&br->br_prod_head, prod_head, = prod_next)); -#ifdef DEBUG_BUFRING - if (br->br_ring[prod_head] !=3D NULL) - panic("dangling value in enqueue"); -#endif=09 - br->br_ring[prod_head] =3D buf; - - /* - * The full memory barrier also avoids that br_prod_tail store - * is reordered before the br_ring[prod_head] is full setup. 
- */ - mb(); - - /* - * If there are other enqueues in progress - * that preceeded us, we need to wait for them - * to complete=20 - */ =20 - while (br->br_prod_tail !=3D prod_head) - cpu_spinwait(); - br->br_prod_tail =3D prod_next; - critical_exit(); - return (0); -} - +int buf_ring_enqueue(struct buf_ring *br, void *buf); /* * multi-consumer safe dequeue=20 * */ -static __inline void * -buf_ring_dequeue_mc(struct buf_ring *br) -{ - uint32_t cons_head, cons_next; - uint32_t prod_tail; - void *buf; - int success; - - critical_enter(); - do { - cons_head =3D br->br_cons_head; - prod_tail =3D br->br_prod_tail; - - cons_next =3D (cons_head + 1) & br->br_cons_mask; - =09 - if (cons_head =3D=3D prod_tail) { - critical_exit(); - return (NULL); - } - =09 - success =3D atomic_cmpset_int(&br->br_cons_head, = cons_head, - cons_next); - } while (success =3D=3D 0); =09 - - buf =3D br->br_ring[cons_head]; -#ifdef DEBUG_BUFRING - br->br_ring[cons_head] =3D NULL; -#endif - - /* - * The full memory barrier also avoids that br_ring[cons_read] - * load is reordered after br_cons_tail is set. - */ - mb(); -=09 - /* - * If there are other dequeues in progress - * that preceeded us, we need to wait for them - * to complete=20 - */ =20 - while (br->br_cons_tail !=3D cons_head) - cpu_spinwait(); - - br->br_cons_tail =3D cons_next; - critical_exit(); - - return (buf); -} - +void *buf_ring_dequeue_mc(struct buf_ring *br); /* * single-consumer dequeue=20 * use where dequeue is protected by a lock * e.g. 
a network driver's tx queue lock */ -static __inline void * -buf_ring_dequeue_sc(struct buf_ring *br) -{ - uint32_t cons_head, cons_next, cons_next_next; - uint32_t prod_tail; - void *buf; -=09 - cons_head =3D br->br_cons_head; - prod_tail =3D br->br_prod_tail; -=09 - cons_next =3D (cons_head + 1) & br->br_cons_mask; - cons_next_next =3D (cons_head + 2) & br->br_cons_mask; -=09 - if (cons_head =3D=3D prod_tail)=20 - return (NULL); - -#ifdef PREFETCH_DEFINED=09 - if (cons_next !=3D prod_tail) { =09 - prefetch(br->br_ring[cons_next]); - if (cons_next_next !=3D prod_tail)=20 - prefetch(br->br_ring[cons_next_next]); - } -#endif - br->br_cons_head =3D cons_next; - buf =3D br->br_ring[cons_head]; - -#ifdef DEBUG_BUFRING - br->br_ring[cons_head] =3D NULL; - if (!mtx_owned(br->br_lock)) - panic("lock not held on single consumer dequeue"); - if (br->br_cons_tail !=3D cons_head) - panic("inconsistent list cons_tail=3D%d cons_head=3D%d", - br->br_cons_tail, cons_head); -#endif - br->br_cons_tail =3D cons_next; - return (buf); -} - +void *buf_ring_dequeue_sc(struct buf_ring *br); /* * single-consumer advance after a peek * use where it is protected by a lock * e.g. a network driver's tx queue lock */ -static __inline void -buf_ring_advance_sc(struct buf_ring *br) -{ - uint32_t cons_head, cons_next; - uint32_t prod_tail; -=09 - cons_head =3D br->br_cons_head; - prod_tail =3D br->br_prod_tail; -=09 - cons_next =3D (cons_head + 1) & br->br_cons_mask; - if (cons_head =3D=3D prod_tail)=20 - return; - br->br_cons_head =3D cons_next; -#ifdef DEBUG_BUFRING - br->br_ring[cons_head] =3D NULL; -#endif - br->br_cons_tail =3D cons_next; -} - +void buf_ring_advance_sc(struct buf_ring *br); +void buf_ring_advance_mc(struct buf_ring *br); /* * Used to return a buffer (most likely already there) * to the top od the ring. The caller should *not* @@ -247,65 +96,27 @@ struct buf_ring { * if we have to do a multi-queue version we will need * the compare and an atomic. 
*/ -static __inline void -buf_ring_putback_sc(struct buf_ring *br, void *new) -{ - KASSERT(br->br_cons_head !=3D br->br_prod_tail,=20 - ("Buf-Ring has none in putback")) ; - br->br_ring[br->br_cons_head] =3D new; -} - +void buf_ring_putback_mc(struct buf_ring *br, void *new); +void buf_ring_putback_sc(struct buf_ring *br, void *new); /* * return a pointer to the first entry in the ring * without modifying it, or NULL if the ring is empty * race-prone if not protected by a lock */ -static __inline void * -buf_ring_peek(struct buf_ring *br) -{ +void *buf_ring_peek(struct buf_ring *br); =20 -#ifdef DEBUG_BUFRING - if ((br->br_lock !=3D NULL) && !mtx_owned(br->br_lock)) - panic("lock not held on single consumer dequeue"); -#endif=09 - /* - * I believe it is safe to not have a memory barrier - * here because we control cons and tail is worst case - * a lagging indicator so we worst case we might - * return NULL immediately after a buffer has been enqueued - */ - if (br->br_cons_head =3D=3D br->br_prod_tail) - return (NULL); -=09 - return (br->br_ring[br->br_cons_head]); -} +int buf_ring_full(struct buf_ring *br); =20 -static __inline int -buf_ring_full(struct buf_ring *br) -{ +int buf_ring_empty(struct buf_ring *br); =20 - return (((br->br_prod_head + 1) & br->br_prod_mask) =3D=3D = br->br_cons_tail); -} +int buf_ring_count(struct buf_ring *br); =20 -static __inline int -buf_ring_empty(struct buf_ring *br) -{ - - return (br->br_cons_head =3D=3D br->br_prod_tail); -} - -static __inline int -buf_ring_count(struct buf_ring *br) -{ - - return ((br->br_prod_size + br->br_prod_tail - br->br_cons_tail) - & br->br_prod_mask); -} - struct buf_ring *buf_ring_alloc(int count, struct malloc_type *type, = int flags, struct mtx *); + void buf_ring_free(struct buf_ring *br, struct malloc_type *type); =20 +int buf_ring_mbufon(struct buf_ring *br, void *buf); =20 =20 #endif Index: usr.sbin/ofwdump/ofwdump.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= 
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- usr.sbin/ofwdump/ofwdump.c (revision 257322) +++ usr.sbin/ofwdump/ofwdump.c (working copy) @@ -63,6 +63,8 @@ usage(void) exit(EX_USAGE); } =20 +static int query_mode =3D 0; + int main(int argc, char *argv[]) { @@ -72,10 +74,13 @@ main(int argc, char *argv[]) =20 aflag =3D pflag =3D rflag =3D Rflag =3D Sflag =3D 0; Parg =3D NULL; - while ((opt =3D getopt(argc, argv, "-aprP:RS")) !=3D -1) { + while ((opt =3D getopt(argc, argv, "-aqprP:RS")) !=3D -1) { if (opt =3D=3D '-') break; switch (opt) { + case 'q': + query_mode =3D 1; + break; case 'a': aflag =3D 1; rflag =3D 1; @@ -209,6 +214,7 @@ ofw_dump_node(int fd, phandle_t n, int level, int static int nblen =3D 0; int plen; phandle_t c; + int my_prop =3D 0; =20 if (!(raw || str)) { ofw_indent(level * LVLINDENT); @@ -218,9 +224,26 @@ ofw_dump_node(int fd, phandle_t n, int level, int printf(": %.*s\n", (int)plen, (char *)nbuf); else putchar('\n'); + if (query_mode) { + char input[100]; + fprintf(stdout, "Dump properties (y or n)?"); + fflush(stdout); + input[0] =3D 0; + fgets(input, sizeof(input), stdin); + if (input[0] =3D=3D 'y') { + my_prop =3D 1; + } + } + =09 } if (prop) ofw_dump_properties(fd, n, level, pmatch, raw, str); + if (my_prop) { + ofw_dump_properties(fd, n, level, pmatch, raw, str); + printf("Exiting\n"); + exit(0); + } + if (rec) { for (c =3D ofw_child(fd, n); c !=3D 0; c =3D = ofw_peer(fd, c)) { ofw_dump_node(fd, c, level + 1, rec, prop, = pmatch, --Apple-Mail=_49C5FDEE-E4BA-44F6-8F6A-342853C67ED4 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii ------------------------------ Randall Stewart 803-317-4952 (cell) --Apple-Mail=_49C5FDEE-E4BA-44F6-8F6A-342853C67ED4-- From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 11:04:31 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) 
(using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 47F4F8EF for ; Tue, 29 Oct 2013 11:04:31 +0000 (UTC) (envelope-from rrs@lakerest.net) Received: from lakerest.net (lakerest.net [162.235.35.161]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id CFEC82C42 for ; Tue, 29 Oct 2013 11:04:30 +0000 (UTC) Received: from [10.1.1.103] (bsd4.lakerest.net [162.235.35.162]) (authenticated bits=0) by lakerest.net (8.14.4/8.14.3) with ESMTP id r9TB4OPY068744 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NOT) for ; Tue, 29 Oct 2013 07:04:24 -0400 (EDT) (envelope-from rrs@lakerest.net) Content-Type: text/plain; charset=windows-1252 Mime-Version: 1.0 (Apple Message framework v1283) Subject: Re: MQ Patch. From: Randall Stewart In-Reply-To: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> Date: Tue, 29 Oct 2013 07:04:24 -0400 Content-Transfer-Encoding: quoted-printable Message-Id: <06B5EC19-8F81-4726-9DF1-96286B0967A5@lakerest.net> References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> To: net@freebsd.org X-Mailer: Apple Mail (2.1283) X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 11:04:31 -0000

A quick follow-up note.

I will have an update to this.. it looks like in my build-universe I see
if_var.h changed (includes and such), so I will have to touch up drbr.h
(nothing like trying to hit a moving target :-D)

I will send out an update after my build-universe completes (hopefully today)..
but take a look at this one anyway (understand that a couple of includes and such may change) :-)

R

On Oct 29, 2013, at 6:50 AM, Randall Stewart wrote:

> Hi:
>
> As discussed at vBSDcon with andre/emaste and gnn, I am sending
> this patch out to all of you ;-)
>
> I have previously sent it to gnn, andre, jhb, rwatson, and several other
> of the usual suspects (as gnn put it) and received dead silence.
>
> What does this patch do?
>
> Well, it adds the ability to do multi-queue at the driver level. Basically
> any driver that uses the new interface gets under it N queues (default
> is 8) for each physical transmit ring it has. The driver picks up
> its queue 0 first, then queue 1 .. up to the max.
>
> This allows you to prioritize packets. Also in here is the start of some
> work I will be doing for AQM.. think either Pi or Codel ;-)
>
> Right now that's pretty simple and just (in a few drivers) has the ability
> to limit the amount of data on the ring... which can help reduce buffer
> bloat. That needs to be refined into a lot more.
>
> This work is donated by Adara Networks and has been discussed in several
> of the past vendor summits.
>
> I plan on committing this before the IETF unless I hear major objections.
>
> Please have a look ;-)
>
> Best wishes
>
> R
>
>
> ------------------------------
> Randall Stewart
> 803-317-4952 (cell)
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

------------------------------
Randall Stewart
803-317-4952 (cell)

From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 13:00:02 2013 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 9BD81843 for ; Tue, 29 Oct 2013 13:00:02 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 7A3DA2504 for ; Tue, 29 Oct 2013 13:00:02 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r9TD02ag040343 for ; Tue, 29 Oct 2013 13:00:02 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r9TD022d040342; Tue, 29 Oct 2013 13:00:02 GMT (envelope-from gnats) Date: Tue, 29 Oct 2013 13:00:02 GMT Message-Id: <201310291300.r9TD022d040342@freefall.freebsd.org> To: freebsd-net@FreeBSD.org Cc: From: dfilter@FreeBSD.ORG (dfilter service) Subject: Re: kern/134531: commit references a PR X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: dfilter service List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013
13:00:02 -0000 The following reply was made to PR kern/134531; it has been noted by GNATS. From: dfilter@FreeBSD.ORG (dfilter service) To: bug-followup@FreeBSD.org Cc: Subject: Re: kern/134531: commit references a PR Date: Tue, 29 Oct 2013 12:53:33 +0000 (UTC) Author: melifaro Date: Tue Oct 29 12:53:23 2013 New Revision: 257330 URL: http://svnweb.freebsd.org/changeset/base/257330 Log: MFC r256624: Fix long-standing issue with incorrect radix mask calculation. Usual symptoms are messages like rn_delete: inconsistent annotation rn_addmask: mask impossibly already in tree routing daemon constantly deleting IPv6 default route or inability to flush/delete particular prefix in ipfw table. Changes: * Assume 32 bytes as maximum radix key length * Remove rn_init() * Statically allocate rn_ones/rn_zeroes * Make separate mask tree for each "normal" tree instead of system global one * Remove "optimization" on masks reusage and key zeroying * Change rn_addmask() arguments to accept tree pointer (no users in base) MFC changes: * keep rn_init() * create global mask tree, protected with mutex, for old rn_addmask users (currently 0 in base) * Add new rn_addmask_r() function (rn_addmask in head) with additional argument to accept tree pointer PR: kern/182851, kern/169206, kern/135476, kern/134531 Found by: Slawa Olhovchenkov Reviewed by: glebius (previous versions) Sponsored by: Yandex LLC Approved by: re (glebius) Modified: stable/10/sys/net/radix.c stable/10/sys/net/radix.h Modified: stable/10/sys/net/radix.c ============================================================================== --- stable/10/sys/net/radix.c Tue Oct 29 12:34:11 2013 (r257329) +++ stable/10/sys/net/radix.c Tue Oct 29 12:53:23 2013 (r257330) @@ -66,27 +66,27 @@ static struct radix_node *rn_search(void *, struct radix_node *), *rn_search_m(void *, struct radix_node *, void *); -static int max_keylen; -static struct radix_mask *rn_mkfreelist; -static struct radix_node_head *mask_rnhead; +static void 
rn_detachhead_internal(void **head); +static int rn_inithead_internal(void **head, int off); + +#define RADIX_MAX_KEY_LEN 32 + +static char rn_zeros[RADIX_MAX_KEY_LEN]; +static char rn_ones[RADIX_MAX_KEY_LEN] = { + -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, +}; + /* - * Work area -- the following point to 3 buffers of size max_keylen, - * allocated in this order in a block of memory malloc'ed by rn_init. - * rn_zeros, rn_ones are set in rn_init and used in readonly afterwards. - * addmask_key is used in rn_addmask in rw mode and not thread-safe. + * XXX: Compat stuff for old rn_addmask() users */ -static char *rn_zeros, *rn_ones, *addmask_key; - -#define MKGet(m) { \ - if (rn_mkfreelist) { \ - m = rn_mkfreelist; \ - rn_mkfreelist = (m)->rm_mklist; \ - } else \ - R_Malloc(m, struct radix_mask *, sizeof (struct radix_mask)); } - -#define MKFree(m) { (m)->rm_mklist = rn_mkfreelist; rn_mkfreelist = (m);} +static struct radix_node_head *mask_rnhead_compat; +#ifdef _KERNEL +static struct mtx mask_mtx; +#endif -#define rn_masktop (mask_rnhead->rnh_treetop) static int rn_lexobetter(void *m_arg, void *n_arg); static struct radix_mask * @@ -230,7 +230,8 @@ rn_lookup(v_arg, m_arg, head) caddr_t netmask = 0; if (m_arg) { - x = rn_addmask(m_arg, 1, head->rnh_treetop->rn_offset); + x = rn_addmask_r(m_arg, head->rnh_masks, 1, + head->rnh_treetop->rn_offset); if (x == 0) return (0); netmask = x->rn_key; @@ -489,53 +490,47 @@ on1: } struct radix_node * -rn_addmask(n_arg, search, skip) - int search, skip; - void *n_arg; +rn_addmask_r(void *arg, struct radix_node_head *maskhead, int search, int skip) { - caddr_t netmask = (caddr_t)n_arg; + caddr_t netmask = (caddr_t)arg; register struct radix_node *x; register caddr_t cp, cplim; register int b = 0, mlen, j; - int maskduplicated, m0, isnormal; + int maskduplicated, isnormal; struct radix_node *saved_x; - static int last_zeroed = 0; + char 
addmask_key[RADIX_MAX_KEY_LEN]; - if ((mlen = LEN(netmask)) > max_keylen) - mlen = max_keylen; + if ((mlen = LEN(netmask)) > RADIX_MAX_KEY_LEN) + mlen = RADIX_MAX_KEY_LEN; if (skip == 0) skip = 1; if (mlen <= skip) - return (mask_rnhead->rnh_nodes); + return (maskhead->rnh_nodes); + + bzero(addmask_key, RADIX_MAX_KEY_LEN); if (skip > 1) bcopy(rn_ones + 1, addmask_key + 1, skip - 1); - if ((m0 = mlen) > skip) - bcopy(netmask + skip, addmask_key + skip, mlen - skip); + bcopy(netmask + skip, addmask_key + skip, mlen - skip); /* * Trim trailing zeroes. */ for (cp = addmask_key + mlen; (cp > addmask_key) && cp[-1] == 0;) cp--; mlen = cp - addmask_key; - if (mlen <= skip) { - if (m0 >= last_zeroed) - last_zeroed = mlen; - return (mask_rnhead->rnh_nodes); - } - if (m0 < last_zeroed) - bzero(addmask_key + m0, last_zeroed - m0); - *addmask_key = last_zeroed = mlen; - x = rn_search(addmask_key, rn_masktop); + if (mlen <= skip) + return (maskhead->rnh_nodes); + *addmask_key = mlen; + x = rn_search(addmask_key, maskhead->rnh_treetop); if (bcmp(addmask_key, x->rn_key, mlen) != 0) x = 0; if (x || search) return (x); - R_Zalloc(x, struct radix_node *, max_keylen + 2 * sizeof (*x)); + R_Zalloc(x, struct radix_node *, RADIX_MAX_KEY_LEN + 2 * sizeof (*x)); if ((saved_x = x) == 0) return (0); netmask = cp = (caddr_t)(x + 2); bcopy(addmask_key, cp, mlen); - x = rn_insert(cp, mask_rnhead, &maskduplicated, x); + x = rn_insert(cp, maskhead, &maskduplicated, x); if (maskduplicated) { log(LOG_ERR, "rn_addmask: mask impossibly already in tree"); Free(saved_x); @@ -568,6 +563,23 @@ rn_addmask(n_arg, search, skip) return (x); } +struct radix_node * +rn_addmask(void *n_arg, int search, int skip) +{ + struct radix_node *tt; + +#ifdef _KERNEL + mtx_lock(&mask_mtx); +#endif + tt = rn_addmask_r(&mask_rnhead_compat, n_arg, search, skip); + +#ifdef _KERNEL + mtx_unlock(&mask_mtx); +#endif + + return (tt); +} + static int /* XXX: arbitrary ordering for non-contiguous masks */ rn_lexobetter(m_arg, 
n_arg) void *m_arg, *n_arg; @@ -590,12 +602,12 @@ rn_new_radix_mask(tt, next) { register struct radix_mask *m; - MKGet(m); + R_Malloc(m, struct radix_mask *, sizeof (struct radix_mask)); if (m == 0) { - log(LOG_ERR, "Mask for route not entered\n"); + log(LOG_ERR, "Failed to allocate route mask\n"); return (0); } - bzero(m, sizeof *m); + bzero(m, sizeof(*m)); m->rm_bit = tt->rn_bit; m->rm_flags = tt->rn_flags; if (tt->rn_flags & RNF_NORMAL) @@ -629,7 +641,8 @@ rn_addroute(v_arg, n_arg, head, treenode * nodes and possibly save time in calculating indices. */ if (netmask) { - if ((x = rn_addmask(netmask, 0, top->rn_offset)) == 0) + x = rn_addmask_r(netmask, head->rnh_masks, 0, top->rn_offset); + if (x == NULL) return (0); b_leaf = x->rn_bit; b = -1 - x->rn_bit; @@ -808,7 +821,8 @@ rn_delete(v_arg, netmask_arg, head) * Delete our route from mask lists. */ if (netmask) { - if ((x = rn_addmask(netmask, 1, head_off)) == 0) + x = rn_addmask_r(netmask, head->rnh_masks, 1, head_off); + if (x == NULL) return (0); netmask = x->rn_key; while (tt->rn_mask != netmask) @@ -841,7 +855,7 @@ rn_delete(v_arg, netmask_arg, head) for (mp = &x->rn_mklist; (m = *mp); mp = &m->rm_mklist) if (m == saved_m) { *mp = m->rm_mklist; - MKFree(m); + Free(m); break; } if (m == 0) { @@ -932,7 +946,7 @@ on1: struct radix_mask *mm = m->rm_mklist; x->rn_mklist = 0; if (--(m->rm_refs) < 0) - MKFree(m); + Free(m); m = mm; } if (m) @@ -1128,10 +1142,8 @@ rn_walktree(h, f, w) * bits starting at 'off'. * Return 1 on success, 0 on error. 
*/ -int -rn_inithead(head, off) - void **head; - int off; +static int +rn_inithead_internal(void **head, int off) { register struct radix_node_head *rnh; register struct radix_node *t, *tt, *ttt; @@ -1163,8 +1175,8 @@ rn_inithead(head, off) return (1); } -int -rn_detachhead(void **head) +static void +rn_detachhead_internal(void **head) { struct radix_node_head *rnh; @@ -1176,28 +1188,60 @@ rn_detachhead(void **head) Free(rnh); *head = NULL; +} + +int +rn_inithead(void **head, int off) +{ + struct radix_node_head *rnh; + + if (*head != NULL) + return (1); + + if (rn_inithead_internal(head, off) == 0) + return (0); + + rnh = (struct radix_node_head *)(*head); + + if (rn_inithead_internal((void **)&rnh->rnh_masks, 0) == 0) { + rn_detachhead_internal(head); + return (0); + } + + return (1); +} + +int +rn_detachhead(void **head) +{ + struct radix_node_head *rnh; + + KASSERT((head != NULL && *head != NULL), + ("%s: head already freed", __func__)); + + rnh = *head; + + rn_detachhead_internal((void **)&rnh->rnh_masks); + rn_detachhead_internal(head); return (1); } void rn_init(int maxk) { - char *cp, *cplim; - - max_keylen = maxk; - if (max_keylen == 0) { + if ((maxk <= 0) || (maxk > RADIX_MAX_KEY_LEN)) { log(LOG_ERR, - "rn_init: radix functions require max_keylen be set\n"); + "rn_init: max_keylen must be within 1..%d\n", + RADIX_MAX_KEY_LEN); return; } - R_Malloc(rn_zeros, char *, 3 * max_keylen); - if (rn_zeros == NULL) - panic("rn_init"); - bzero(rn_zeros, 3 * max_keylen); - rn_ones = cp = rn_zeros + max_keylen; - addmask_key = cplim = rn_ones + max_keylen; - while (cp < cplim) - *cp++ = -1; - if (rn_inithead((void **)(void *)&mask_rnhead, 0) == 0) + + /* + * XXX: Compat for old rn_addmask() users + */ + if (rn_inithead((void **)(void *)&mask_rnhead_compat, 0) == 0) panic("rn_init 2"); +#ifdef _KERNEL + mtx_init(&mask_mtx, "radix_mask", NULL, MTX_DEF); +#endif } Modified: stable/10/sys/net/radix.h 
==============================================================================
--- stable/10/sys/net/radix.h Tue Oct 29 12:34:11 2013 (r257329)
+++ stable/10/sys/net/radix.h Tue Oct 29 12:53:23 2013 (r257330)
@@ -136,6 +136,7 @@ struct radix_node_head {
 #ifdef _KERNEL
 	struct rwlock rnh_lock; /* locks entire radix tree */
 #endif
+	struct radix_node_head *rnh_masks; /* Storage for our masks */
 };
 
 #ifndef _KERNEL
@@ -167,6 +168,7 @@ int rn_detachhead(void **);
 int rn_refines(void *, void *);
 struct radix_node
 	*rn_addmask(void *, int, int),
+	*rn_addmask_r(void *, struct radix_node_head *, int, int),
 	*rn_addroute (void *, void *, struct radix_node_head *,
 		struct radix_node [2]),
 	*rn_delete(void *, void *, struct radix_node_head *),
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"

From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 15:25:53 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 4FC0CC3E; Tue, 29 Oct 2013 15:25:53 +0000 (UTC) (envelope-from VenkatKumar.Duvvuru@Emulex.Com) Received: from CMEXEDGE1.ext.emulex.com (cmexedge1.ext.emulex.com [138.239.224.99]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 2D2F02E67; Tue, 29 Oct 2013 15:25:52 +0000 (UTC) Received: from CMEXHTCAS1.ad.emulex.com (138.239.115.217) by CMEXEDGE1.ext.emulex.com (138.239.224.99) with Microsoft SMTP Server (TLS) id 14.3.146.0; Tue, 29 Oct 2013 08:11:01 -0700 Received: from CMEXMB1.ad.emulex.com ([169.254.1.123]) by CMEXHTCAS1.ad.emulex.com ([2002:8aef:71b7::8aef:71b7]) with mapi id 14.03.0146.002; Tue, 29 Oct 2013 08:10:44 -0700 From: Venkata Duvvuru To:
"freebsd-net@freebsd.org" , "freebsd-current@freebsd.org" Subject: taskqueue_enqueue_fast in freebsd 10.0-current Thread-Topic: taskqueue_enqueue_fast in freebsd 10.0-current Thread-Index: Ac7UuEkxdur08036S5KLJpYxkoqGeQ== Date: Tue, 29 Oct 2013 15:10:44 +0000 Message-ID: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [138.239.141.147] MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 15:25:53 -0000 Hi, In FreeBSD 10.0-current with Emulex's OCE driver, I observe that the bottom half is hogging all the CPU, which is leading to system sluggishness. I used the same hardware to check the behavior on 9.1-RELEASE: everything is fine, and the bottom half does not take more than 10% of the CPU even at line-rate speed. But with half the line-rate throughput in FreeBSD 10.0-current all the CPUs peak, and "top -aSCHIP" shows that it is the bottom half that is eating the CPU. Did anything change in FreeBSD 10.0-current that I should be careful about? Please clarify. Thanks, Venkat.
From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 15:38:00 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id DD1777C5; Tue, 29 Oct 2013 15:38:00 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-qc0-x232.google.com (mail-qc0-x232.google.com [IPv6:2607:f8b0:400d:c01::232]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 8CF8E2F76; Tue, 29 Oct 2013 15:38:00 +0000 (UTC) Received: by mail-qc0-f178.google.com with SMTP id x19so9239qcw.37 for ; Tue, 29 Oct 2013 08:37:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type:content-transfer-encoding; bh=pAnEiUut7GbyPU+0rRmkwbmWBKTL0AKETdvyn83Mrwo=; b=wXMb6d0VE4dN5Izq3XE8ooVC5TQ6X6DRTf1KAJln74Nhs0mmCNUIKnhj3Q+sDHfBwf NvdDIJE8bAwwPwYneUfa0okMT2e8L7JfXJSqlrOJ8x2jyZR9WJVNfbwti78xvkDc3q0D rLs0QlLdELA9Bi5d9Yg/p3591y8/JtoFqR1rgw7GTHr4Nqrhy5VwF6HMDCq10IXKVQGE CTr0w/qs7zNHG9C3+Lggr7yfYcPh4au8YPx/groUhZNAW6EFQKW8xGtiDZxTd3OCehgN SOHNtRHy9ugW/y1OJzu9V/h9PBzlyNomNcBQI9oeEPP+TDO2bpJVIadSJ5W+RBAe5oMC Mgag== MIME-Version: 1.0 X-Received: by 10.49.62.3 with SMTP id u3mr458231qer.6.1383061079812; Tue, 29 Oct 2013 08:37:59 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.224.207.66 with HTTP; Tue, 29 Oct 2013 08:37:59 -0700 (PDT) In-Reply-To: References: Date: Tue, 29 Oct 2013 08:37:59 -0700 X-Google-Sender-Auth: Y5sXaxVSz_GBYamW0Scor6GhLzY Message-ID: Subject: Re: taskqueue_enqueue_fast in freebsd 10.0-current From: Adrian Chadd To: Venkata Duvvuru Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: "freebsd-net@freebsd.org" ,
"freebsd-current@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 15:38:00 -0000 Hi, On 29 October 2013 08:10, Venkata Duvvuru wrote: > Hi, > In Freebsd 10.0-current with Emulex's OCE driver, I observe that the bottom half is hogging all the CPU which is leading to system sluggishness. I used the same hardware to check the behavior on 9.1-RELEASE, everything is fine, bottom half is not taking more than 10% of the CPU even at the line rate speed. But with half the throughput of line rate in Freebsd-10.0-current all the CPUs peak and "top -aSCHIP" shows that it's all bottom half who is eating the CPU. Did anything changed in Freebsd-10.0-current that I should be careful about? Please clarify. spin up hwpmc and see what the story is. Which CPU is it? -a From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 18:31:12 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 438C6D5B for ; Tue, 29 Oct 2013 18:31:12 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id AB3412BFE for ; Tue, 29 Oct 2013 18:31:11 +0000 (UTC) Received: (qmail 57064 invoked from network); 29 Oct 2013 19:01:41 -0000 Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 29 Oct 2013 19:01:41 -0000 Message-ID: <526FFED9.1070704@freebsd.org> Date: Tue, 29 Oct 2013 19:30:49 +0100 From: Andre Oppermann User-Agent:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Randall Stewart , net@freebsd.org Subject: Re: MQ Patch. References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> In-Reply-To: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 8bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 18:31:12 -0000 On 29.10.2013 11:50, Randall Stewart wrote: > Hi: > > As discussed at vBSDcon with andre/emaste and gnn, I am sending > this patch out to all of you ;-) I wasn't at vBSDcon but it's good that you're sending it (again). ;) > I have previously sent it to gnn, andre, jhb, rwatson, and several other > of the usual suspects (as gnn put it) and received dead silence. Sorry 'bout that. Too many things going on recently. > What does this patch do? > > Well it add the ability to do multi-queue at the driver level. Basically > any driver that uses the new interface gets under it N queues (default > is 8) for each physical transmit ring it has. The driver picks up > its queue 0 first, then queue 1 .. up to the max. To make sure I understand this correctly: there are 8 soft-queues for each real transmit ring, correct? And the driver will dequeue the lowest numbered queue for as long as there are packets in it. Like a hierarchical strict queuing discipline. This is prone to head-of-line blocking and starvation by higher priority queues. May become a big problem under adverse traffic patterns. > This allows you to prioritize packets. Also in here is the start of some > work I will be doing for AQM..
think either Pi or Codel ;-) > > Right now thats pretty simple and just (in a few drivers) as the ability > to limit the amount of data on the ring… which can help reduce buffer > bloat. That needs to be refined into a lot more. We actually have two queues, the soft-queue and the hardware ring, which can both be rather large, leading to various issues as you mention. I've started work on an FF contract to rethink the whole IFQ* model and to propose and benchmark different approaches. After that, the plan is to convert all drivers in the tree to the chosen model(s) and get rid of the legacy. In general the choice of model will be made in the driver and no longer by the ifnet layer. One or (most likely) more optimized models will be provided by the kernel for drivers to choose from. The idea is that most, if not all, drivers use these standard kernel-provided models to avoid code duplication. However, as the pace of new features is quite high, we give the driver full discretion to choose and experiment with its own ways of dealing with it. This is under the assumption that once a new model has been found it is later moved to the kernel side and subsequently used by other drivers as well. > This work is donated by Adara Networks and has been discussed in several > of the past vendor summits. > > I plan on committing this before the IETF unless I hear major objections. There seem to be a couple of whitespace issues where first there is a tab and then actual whitespace for the second one, and others all over the place. There seem to be a number of unrelated changes in sys/dev/cesa/cesa.c, sys/dev/fdt/fdt_common.c, sys/dev/fdt/simplebus.c, sys/kern/subr_bus.c, usr.sbin/ofwdump/ofwdump.c. It would be good to separate out the soft multi-queue changes from the ring depth changes and do each in at least one commit. There are two separate changes to sys/dev/oce/: one is the renaming of the lock macros and the other the change to drbr.
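
For illustration only, and not taken from the patch: the strict-priority behaviour discussed earlier in this message (queue 0 is drained first, then queue 1, and so on) can be sketched roughly as below. Every name here (struct softq, struct txring, txring_enqueue, txring_dequeue, NQUEUES, QDEPTH) is hypothetical, plain ints stand in for mbufs, and there is no locking; the actual drbr/buf_ring interfaces may look quite different.

```c
#include <stddef.h>

#define NQUEUES	8	/* hypothetical: soft queues per transmit ring */
#define QDEPTH	64	/* hypothetical: slots per soft queue */

struct softq {
	int	pkts[QDEPTH];	/* packet ids stand in for mbufs */
	size_t	head;		/* index of the next packet to dequeue */
	size_t	count;		/* packets currently queued */
};

struct txring {
	struct softq q[NQUEUES];	/* q[0] has the highest priority */
};

/* Enqueue on soft queue 'prio'; 0 on success, -1 if invalid or full. */
static int
txring_enqueue(struct txring *r, unsigned prio, int pkt)
{
	struct softq *q;

	if (prio >= NQUEUES)
		return (-1);
	q = &r->q[prio];
	if (q->count == QDEPTH)
		return (-1);
	q->pkts[(q->head + q->count) % QDEPTH] = pkt;
	q->count++;
	return (0);
}

/*
 * Strict priority: always serve the lowest-numbered non-empty queue.
 * Returns 1 and stores the packet in *pkt, or 0 if every queue is empty.
 * A queue with a higher index is only reached once all queues before it
 * are empty, which is where starvation can come from.
 */
static int
txring_dequeue(struct txring *r, int *pkt)
{
	unsigned i;

	for (i = 0; i < NQUEUES; i++) {
		struct softq *q = &r->q[i];

		if (q->count == 0)
			continue;
		*pkt = q->pkts[q->head];
		q->head = (q->head + 1) % QDEPTH;
		q->count--;
		return (1);
	}
	return (0);
}
```

In this sketch a packet waiting on queue 7 is not sent for as long as new packets keep arriving on queue 0, which is the starvation concern raised above; a weighted or bounded scheme would have to replace the simple scan in txring_dequeue.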
The changes to sys/kern/subr_bufring.c are not style compliant and we normally don't use Linux "wb()" barriers in FreeBSD native code. The atomics_* should be used instead. Why would we need a multi-consumer dequeue? The new bufring functions on a first glance do seem to be safe on architectures with a more relaxed memory ordering / cache coherency model than x86. The atomic dance in a number of drbr_* functions doesn't seem to make much sense and a single spin-lock may result in atomic operations and bus lock cycles. There is a huge amount of includes pollution in sys/net/drbr.h which we are currently trying to get rid of and to avoid for the future. I like the general conceptual approach but the implementation feels bumpy and not (yet) ready for prime time. In any case I'd like to take forward conceptual parts for the FF sponsored IFQ* rework. -- Andre From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 19:36:03 2013 Return-Path: Delivered-To: net@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 17B3672F; Tue, 29 Oct 2013 19:36:03 +0000 (UTC) (envelope-from rrs@lakerest.net) Received: from lakerest.net (lakerest.net [162.235.35.161]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 978112FFB; Tue, 29 Oct 2013 19:36:02 +0000 (UTC) Received: from [10.1.1.103] (bsd4.lakerest.net [162.235.35.162]) (authenticated bits=0) by lakerest.net (8.14.4/8.14.3) with ESMTP id r9TJZeCj074918 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NOT); Tue, 29 Oct 2013 15:35:40 -0400 (EDT) (envelope-from rrs@lakerest.net) Subject: Re: MQ Patch. 
Mime-Version: 1.0 (Apple Message framework v1283) Content-Type: text/plain; charset=windows-1252 From: Randall Stewart In-Reply-To: <526FFED9.1070704@freebsd.org> Date: Tue, 29 Oct 2013 15:35:40 -0400 Content-Transfer-Encoding: quoted-printable Message-Id: References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> To: Andre Oppermann X-Mailer: Apple Mail (2.1283) Cc: net@FreeBSD.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 19:36:03 -0000 On Oct 29, 2013, at 2:30 PM, Andre Oppermann wrote: > On 29.10.2013 11:50, Randall Stewart wrote: >> Hi: >> >> As discussed at vBSDcon with andre/emaste and gnn, I am sending >> this patch out to all of you ;-) > > I wasn't at vBSDcon but it's good that you're sending it (again). ;) > >> I have previously sent it to gnn, andre, jhb, rwatson, and several other >> of the usual suspects (as gnn put it) and received dead silence. > > Sorry 'bout that. Too many things going on recently. > >> What does this patch do? >> >> Well it add the ability to do multi-queue at the driver level. Basically >> any driver that uses the new interface gets under it N queues (default >> is 8) for each physical transmit ring it has. The driver picks up >> its queue 0 first, then queue 1 .. up to the max. > > To make I understand this correctly there are 8 soft-queues for each real > transmit ring, correct? And the driver will dequeue the lowest numbered > queue for as long as there are packets in it. Like a hierarchical strict > queuing discipline. > > This is prone to head of line blocking and starvation by higher priority > queues. May become a big problem under adverse traffic patterns. That's the whole idea of QOS.. you take and prioritize your traffic if you don't have enough b/w.
The guys at the bottom get none.. If you don't want it, you can either turn QOS off.. i.e. let everything fall to the bottom bucket. Or even set the number of queues to 1, and then nothing changes: 1:1 queues to transmit-ring > >> This allows you to prioritize packets. Also in here is the start of some >> work I will be doing for AQM.. think either Pi or Codel ;-) >> >> Right now thats pretty simple and just (in a few drivers) as the ability >> to limit the amount of data on the ring… which can help reduce buffer >> bloat. That needs to be refined into a lot more. > > We actually have two queues, the soft-queue and the hardware ring which > both can be rather large leading to various issues as you mention. Which is why I first of all set the soft-queue default at 64.. That in some ways is still big. In order to get rid of the hard-queue you really just have to limit how much you put in. I have some hooks in for igb here (and em) that do this but it's just a first step. The right thing (long term) is to go to an AQM like Codel or Pi. Pi would give you coverage of both queues at ingress to the first one (thinking of a single queue model) Codel can only handle the soft -> hard queue transition. But Pi has the standard Cisco patent so it will probably have to be a loadable module… sigh.. > > I've started work on an FF contract to rethink the whole IFQ* model and What is an FF contract? > to propose and benchmark different approaches. After that to convert all > drivers in the tree to the chosen model(s) and get rid of the legacy. In > general the choice of model will be done in the driver and no longer by > the ifnet layer. One or (most likely) more optimized models will be > provided by the kernel for drivers to chose from. The idea that most, > if not all drivers use these standard kernel provided models to avoid > code duplication.
However as the pace of new features is quite high > we provide the full discretion for the driver to choose and experiment > with their own ways of dealing with it. This is under the assumption > that once a now model has been found it is later moved to the kernel > side and subsequently used by other drivers as well. > >> This work is donated by Adara Networks and has been discussed in several >> of the past vendor summits. >> >> I plan on committing this before the IETF unless I hear major objections. > > There seems to be a couple of white space issues where first there is a tab > and then actual whitespace for the second one and others all over the place. > > There seem to be a number of unrelated changes in sys/dev/cesa/cesa.c, > sys/dev/fdt/fdt_common.c, sys/dev/fdt/simplebus.c, sys/kern/subr_bus.c, > usr.sbin/ofwdump/ofwdump.c. > Yeah Fabien Thomas and I have already talked on that. I had some holdover cruft that I had thought I got out. The cesa.c changes I committed this AM and the debug stuff was all reverted out. Plus a couple of other little tweaks. I will resend an updated (cleaned up) patch once my build-universe completes :-) > It would be good to separate out the soft multi-queue changes from the ring > depth changes and do each in at least one commit. I am not sure what you are suggesting here. > > There are two separate changes to sys/dev/oce/, one is renaming of the lock > macros and the other the change to drbr. Yeah I hit that because the LOCK name unfortunately conflicted with another so on one of my build-universe runs LINT would blow up ;-( That could definitely be done separately.. > > The changes to sys/kern/subr_bufring.c are not style compliant and we normally > don't use Linux "wb()" barriers in FreeBSD native code. The atomics_* should > be used instead. > Those are taken *directly* from the original code put in by Kip.. I just moved them over when I was refactoring things.
> Why would we need a multi-consumer dequeue? I can think of one reason.. it's called lagg R > > The new bufring functions on a first glance do seem to be safe on architectures > with a more relaxed memory ordering / cache coherency model than x86. > > The atomic dance in a number of drbr_* functions doesn't seem to make much sense > and a single spin-lock may result in atomic operations and bus lock cycles. > > There is a huge amount of includes pollution in sys/net/drbr.h which we are > currently trying to get rid of and to avoid for the future. > > > I like the general conceptual approach but the implementation feels bumpy and > not (yet) ready for prime time. In any case I'd like to take forward conceptual > parts for the FF sponsored IFQ* rework. > > -- > Andre > > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > ------------------------------ Randall Stewart 803-317-4952 (cell) From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 19:39:29 2013 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 15AD898D; Tue, 29 Oct 2013 19:39:29 +0000 (UTC) (envelope-from linimon@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id DC475206F; Tue, 29 Oct 2013 19:39:28 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r9TJdSrc046700; Tue, 29 Oct 2013 19:39:28 GMT (envelope-from
linimon@freefall.freebsd.org) Received: (from linimon@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r9TJdSQV046699; Tue, 29 Oct 2013 19:39:28 GMT (envelope-from linimon) Date: Tue, 29 Oct 2013 19:39:28 GMT Message-Id: <201310291939.r9TJdSQV046699@freefall.freebsd.org> To: linimon@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-net@FreeBSD.org From: linimon@FreeBSD.org Subject: Re: conf/183407: [rc.d] [patch] Routing restart returns non-zero exitcode in case of no extra routing parameter or missing atm/ipx X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 19:39:29 -0000 Old Synopsis: Routing restart returns non-zero exitcode in case of no extra routing parameter or missing atm/ipx New Synopsis: [rc.d] [patch] Routing restart returns non-zero exitcode in case of no extra routing parameter or missing atm/ipx Responsible-Changed-From-To: freebsd-bugs->freebsd-net Responsible-Changed-By: linimon Responsible-Changed-When: Tue Oct 29 19:38:37 UTC 2013 Responsible-Changed-Why: Over to maintainer(s). 
http://www.freebsd.org/cgi/query-pr.cgi?pr=183407 From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 19:58:49 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 2420A4BB; Tue, 29 Oct 2013 19:58:49 +0000 (UTC) (envelope-from rizzo.unipi@gmail.com) Received: from mail-la0-x22b.google.com (mail-la0-x22b.google.com [IPv6:2a00:1450:4010:c03::22b]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 312F221D4; Tue, 29 Oct 2013 19:58:48 +0000 (UTC) Received: by mail-la0-f43.google.com with SMTP id el20so311891lab.30 for ; Tue, 29 Oct 2013 12:58:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=F5XCkNO1hZrR3ouPXbNqpGHUu7N7YnzpDaE132j70SY=; b=mId824GIe9gPmPicnQkgsTWczZlFWQeNHFJDjbuHRlMIbD2QDbbfoJn+TCvk4xvIs9 XPJWCDgbbwC2FWD9CABGQ18fyXCbIqzEPOyZcn6pxZ6Yk/yVLpkF+bluDMvSgFnRLYc5 uf7Kh4ayIP8vlzhHq6mB78gaclIA8zpTIy0ycQKyrlBo6NH7eqpmVuXeQXYYD35KtGpa fkU2r/53JTVyJHjL7Oh99pfLPLccZzNfCYM/aziLwHuyJfXsTYOZyiTRxV750N3Ewoi6 OggtCj8vJwkltbYKB8NPDs3BzmZcQvtlPvyO9Yc2Lwd1XfmjBJYqJIKLHuyjEdk1yp5F NPIg== MIME-Version: 1.0 X-Received: by 10.112.235.3 with SMTP id ui3mr1087178lbc.44.1383076726178; Tue, 29 Oct 2013 12:58:46 -0700 (PDT) Sender: rizzo.unipi@gmail.com Received: by 10.114.172.105 with HTTP; Tue, 29 Oct 2013 12:58:46 -0700 (PDT) In-Reply-To: <526FFED9.1070704@freebsd.org> References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> Date: Tue, 29 Oct 2013 12:58:46 -0700 X-Google-Sender-Auth: ASpkNZvaZKzNaCv6n1oyTYCmwMA Message-ID: Subject: Re: MQ Patch. 
From: Luigi Rizzo To: Andre Oppermann Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: Randall Stewart , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 19:58:49 -0000 my short, top-post comment is that I'd rather see some more coordination with Andre, and especially some high level README or other form of documentation explaining the architecture you have in mind before this goes in. To expand my point of view (and please do not read me as negative, i am trying to be constructive and avoid future troubles and volunteer to help with the design and implementation): (i'll omit issues re. style and unrelated patches in the diff because they are premature) 1. Having multiple separate software queues attached to a physical queue makes sense only if we have a clear and documented plan for scheduling traffic from these queues into the hw one. Otherwise it ends up being just another confusing hack that makes it difficult to reason about device drivers. We already have something similar now (with the drbr queue on top used in some cases when the hw ring overflows), the ALTQ hooks, and without documentation this does not seem to improve the current situation. 2. QoS is not just priority scheduling or AQM a-la RED/CODEL/PI, but a coherent framework where you can classify/partition traffic into separate queues, apply one of several queue management (taildrop/RED/CODEL/whatever) and scheduling (which queue to serve next) policies in an efficient way. Linux mostly gets this right (they even support hierarchical schedulers). 
Dummynet has a reasonable architecture although not hierarchical and it operates at the IP level (or possibly at layer 2), which is probably too high (but not necessarily). We can also recycle the components, i.e. the classifier in ipfw and the scheduling algorithms. I am happy to help on this. ALTQ is too old and complex and inefficient and unmaintained to be considered. And i cannot comment on your code because you don't really explain what you want to do and how. Codel/PI are only queue management, not qos; and strict priority is just one (and probably the worst) policy one can have. One comment i can make, however, is on the fact that 256 queues are way too few for a proper system. You need the number to be dynamic and much larger (e.g. using flowid as a key). So, to conclude: i fully support any plan to design something that lets us implement scheduling (and qos, if you want to call it this way) in a reasonable way, but what is in your patch now does not really seem to improve the current situation in any way. cheers luigi On Tue, Oct 29, 2013 at 11:30 AM, Andre Oppermann wrote: > On 29.10.2013 11:50, Randall Stewart wrote: > >> Hi: >> >> As discussed at vBSDcon with andre/emaste and gnn, I am sending >> this patch out to all of you ;-) >> > > I wasn't at vBSDcon but it's good that you're sending it (again). ;) > > > I have previously sent it to gnn, andre, jhb, rwatson, and several other >> of the usual suspects (as gnn put it) and received dead silence. >> > > Sorry 'bout that. Too many things going on recently. > > > What does this patch do? >> >> Well it add the ability to do multi-queue at the driver level. Basically >> any driver that uses the new interface gets under it N queues (default >> is 8) for each physical transmit ring it has. The driver picks up >> its queue 0 first, then queue 1 .. up to the max. >> > > To make I understand this correctly there are 8 soft-queues for each real > transmit ring, correct?
And the driver will dequeue the lowest numbered > queue for as long as there are packets in it. Like a hierarchical strict > queuing discipline. > > This is prone to head of line blocking and starvation by higher priority > queues. May become a big problem under adverse traffic patterns. > > > This allows you to prioritize packets. Also in here is the start of some >> work I will be doing for AQM.. think either Pi or Codel ;-) >> >> Right now thats pretty simple and just (in a few drivers) as the ability >> to limit the amount of data on the ring… which can help reduce buffer >> bloat. That needs to be refined into a lot more. >> > > We actually have two queues, the soft-queue and the hardware ring which > both can be rather large leading to various issues as you mention. > > I've started work on an FF contract to rethink the whole IFQ* model and > to propose and benchmark different approaches. After that to convert all > drivers in the tree to the chosen model(s) and get rid of the legacy. In > general the choice of model will be done in the driver and no longer by > the ifnet layer. One or (most likely) more optimized models will be > provided by the kernel for drivers to chose from. The idea that most, > if not all drivers use these standard kernel provided models to avoid > code duplication. However as the pace of new features is quite high > we provide the full discretion for the driver to choose and experiment > with their own ways of dealing with it. This is under the assumption > that once a now model has been found it is later moved to the kernel > side and subsequently used by other drivers as well. > > > This work is donated by Adara Networks and has been discussed in several >> of the past vendor summits. >> >> I plan on committing this before the IETF unless I hear major objections. >> > > There seems to be a couple of white space issues where first there is a tab > and then actual whitespace for the second one and others all over the > place.
> > There seem to be a number of unrelated changes in sys/dev/cesa/cesa.c, > sys/dev/fdt/fdt_common.c, sys/dev/fdt/simplebus.c, sys/kern/subr_bus.c, > usr.sbin/ofwdump/ofwdump.c. > > It would be good to separate out the soft multi-queue changes from the ring > depth changes and do each in at least one commit. > > There are two separate changes to sys/dev/oce/, one is renaming of the lock > macros and the other the change to drbr. > > The changes to sys/kern/subr_bufring.c are not style compliant and we > normally > don't use Linux "wb()" barriers in FreeBSD native code. The atomics_* > should > be used instead. > > Why would we need a multi-consumer dequeue? > > The new bufring functions on a first glance do seem to be safe on > architectures > with a more relaxed memory ordering / cache coherency model than x86. > > The atomic dance in a number of drbr_* functions doesn't seem to make much > sense > and a single spin-lock may result in atomic operations and bus lock cycles. > > There is a huge amount of includes pollution in sys/net/drbr.h which we are > currently trying to get rid of and to avoid for the future. > > > I like the general conceptual approach but the implementation feels bumpy > and > not (yet) ready for prime time. In any case I'd like to take forward > conceptual > parts for the FF sponsored IFQ* rework. > > -- > Andre > > > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > -- -----------------------------------------+------------------------------- Prof. Luigi RIZZO, rizzo@iet.unipi.it . Dip. di Ing. dell'Informazione http://www.iet.unipi.it/~luigi/ . Universita` di Pisa TEL +39-050-2211611 . via Diotisalvi 2 Mobile +39-338-6809875 .
56122 PISA (Italy) -----------------------------------------+------------------------------- From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 20:03:55 2013 Return-Path: Delivered-To: net@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 2330389D for ; Tue, 29 Oct 2013 20:03:55 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 5403B225C for ; Tue, 29 Oct 2013 20:03:54 +0000 (UTC) Received: (qmail 57437 invoked from network); 29 Oct 2013 20:34:24 -0000 Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 29 Oct 2013 20:34:24 -0000 Message-ID: <52701494.6050404@freebsd.org> Date: Tue, 29 Oct 2013 21:03:32 +0100 From: Andre Oppermann User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Randall Stewart Subject: Re: MQ Patch. 
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org>
In-Reply-To:
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Cc: net@FreeBSD.org
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 29 Oct 2013 20:03:55 -0000

On 29.10.2013 20:35, Randall Stewart wrote:
>
> On Oct 29, 2013, at 2:30 PM, Andre Oppermann wrote:
>
>> On 29.10.2013 11:50, Randall Stewart wrote:
>>> Hi:
>>>
>>> As discussed at vBSDcon with andre/emaste and gnn, I am sending
>>> this patch out to all of you ;-)
>>
>> I wasn't at vBSDcon but it's good that you're sending it (again). ;)
>>
>>> I have previously sent it to gnn, andre, jhb, rwatson, and several other
>>> of the usual suspects (as gnn put it) and received dead silence.
>>
>> Sorry 'bout that. Too many things going on recently.
>>
>>> What does this patch do?
>>>
>>> Well it add the ability to do multi-queue at the driver level. Basically
>>> any driver that uses the new interface gets under it N queues (default
>>> is 8) for each physical transmit ring it has. The driver picks up
>>> its queue 0 first, then queue 1 .. up to the max.
>>
>> To make I understand this correctly there are 8 soft-queues for each real
>> transmit ring, correct? And the driver will dequeue the lowest numbered
>> queue for as long as there are packets in it. Like a hierarchical strict
>> queuing discipline.
>>
>> This is prone to head of line blocking and starvation by higher priority
>> queues. May become a big problem under adverse traffic patterns.
>
> Thats the whole idea of QOS.. you take and prioritize your traffic
> if you don't have enough b/w.

That is understood.
In most cases it's done on a WFQ basis though, and strict priority
is limited to realtime (VoIP) traffic and also bounded overall so as not to
monopolize the entire link if something goes wrong. Almost all
documentation from C and J recommends against unbounded strict priority
scheduling for that reason.

> The guys at the bottom get none..

I wonder how useful an 8 level strict priority actually can be under load
for everything below level 1. Normally strategic packet loss, as in RED
or its more efficient variants, together with some WFQ scheme signals the
senders not to increase pace, or actually to slow down a bit if the link
is at capacity.

In practice I've never seen a case where full starvation of lower classes
made any sense. You'd want at least some packets to go through every now
and then, even in scavenger class.

> If you don't want it, you can either turn QOS off.. i.e. let
> everything fall to the bottom bucket. Or even set the number
> of queues to 1, and then nothing changes 1:1 queues to transmit-ring

The default setting probably should be the lowest priority available, and
then only the more important stuff should get a higher level, rather
than the other way around.

>>> This allows you to prioritize packets. Also in here is the start of some
>>> work I will be doing for AQM.. think either Pi or Codel ;-)
>>>
>>> Right now thats pretty simple and just (in a few drivers) as the ability
>>> to limit the amount of data on the ring… which can help reduce buffer
>>> bloat. That needs to be refined into a lot more.
>>
>> We actually have two queues, the soft-queue and the hardware ring which
>> both can be rather large leading to various issues as you mention.
>
>
> Which is why I first of all set the soft-queue default at 64.. That in
> some ways is still big.

If it's MTU sized packets it should be manageable. If it's TSO chains
though...

> In order to get rid of the hard-queue you really just have to limit
> how much you put in.
> I have some hooks in for igb here (and em) that
> do this but its just a first step. The right thing (long term) is
> to go to a AQM like Codel or Pi.

I actually wonder if there is any benefit in soft-queuing at all, except
for the multiple-writer concurrency situation. The DMA rings are deep
enough already. If they are full, just drop the packet rather than tacking
another soft-queue onto the back of it.

> Pi would give you coverage of both queue's at ingress to the first one (thinking
> of a single queue model)
>
> Codel can only handle the soft-> hard queue transition.

Yup.

> But Pi has the standard Cisco patent so it will probably have to be
> a loadable module… sigh..

Haven't looked at Pi yet. Do you have a pointer to a sufficiently detailed
paper on it?

>> I've started work on an FF contract to rethink the whole IFQ* model and
>
> What is an FF contract?

FreeBSD Foundation.

>> to propose and benchmark different approaches. After that to convert all
>> drivers in the tree to the chosen model(s) and get rid of the legacy. In
>> general the choice of model will be done in the driver and no longer by
>> the ifnet layer. One or (most likely) more optimized models will be
>> provided by the kernel for drivers to chose from. The idea that most,
>> if not all drivers use these standard kernel provided models to avoid
>> code duplication. However as the pace of new features is quite high
>> we provide the full discretion for the driver to choose and experiment
>> with their own ways of dealing with it. This is under the assumption
>> that once a now model has been found it is later moved to the kernel
>> side and subsequently used by other drivers as well.
>>
>>> This work is donated by Adara Networks and has been discussed in several
>>> of the past vendor summits.
>>>
>>> I plan on committing this before the IETF unless I hear major objections.
>>
>> There seems to be a couple of white space issues where first there is a tab
>> and then actual whitespace for the second one and others all over the place.
>>
>> There seem to be a number of unrelated changes in sys/dev/cesa/cesa.c,
>> sys/dev/fdt/fdt_common.c, sys/dev/fdt/simplebus.c, sys/kern/subr_bus.c,
>> usr.sbin/ofwdump/ofwdump.c.
>>
>
> Yeah Fabien Thomas and I have already talked on that.
>
> I had some hold over cruft that I had thought I got out.
>
> The cesa.c changes I committed this AM and the debug stuff was
> all reverted out.
>
> Plus a couple of other little tweaks.
>
> I will resend an updated (cleaned up patch) once my build-universe completes :-)

OK.

>> It would be good to separate out the soft multi-queue changes from the ring
>> depth changes and do each in at least one commit.
>
> I am not sure what you are suggesting here.

The multi-queue and the ring-depth changes in igb(4) et al. should be
separate commits because they are distinct features. The driver maintainer
should sign off on them too before committing.

>> There are two separate changes to sys/dev/oce/, one is renaming of the lock
>> macros and the other the change to drbr.
> Yeah I hit that because the LOCK name unfortunately conflicted with another so
> on one of my build-universe runs LINT would blow up ;-(
>
> That could definitely be done separately..

Please do so. All separate functional units should be done as individual
commits, both to track them better and to be able to back them out if
there's a problem with one of them.

>> The changes to sys/kern/subr_bufring.c are not style compliant and we normally
>> don't use Linux "wb()" barriers in FreeBSD native code. The atomics_* should
>> be used instead.
>>
>
> Those are taken *directly* the original code put in by Kip.. I just moved
> them over when I was refactoring things.

Ugh...

>> Why would we need a multi-consumer dequeue?
>
> I can think of one reason..
> its called lagg

Lagg should be hash-based, so it could process packets straight down to
the real interface instead of doing such a dance, which only reorders the
packets of the same stream.

--
Andre

> R
>
>
>>
>> The new bufring functions on a first glance do seem to be safe on architectures
>> with a more relaxed memory ordering / cache coherency model than x86.
>>
>> The atomic dance in a number of drbr_* functions doesn't seem to make much sense
>> and a single spin-lock may result in atomic operations and bus lock cycles.
>>
>> There is a huge amount of includes pollution in sys/net/drbr.h which we are
>> currently trying to get rid of and to avoid for the future.
>>
>>
>> I like the general conceptual approach but the implementation feels bumpy and
>> not (yet) ready for prime time. In any case I'd like to take forward conceptual
>> parts for the FF sponsored IFQ* rework.
>
>>
>> --
>> Andre
>>
>> _______________________________________________
>> freebsd-net@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>>
>
> ------------------------------
> Randall Stewart
> 803-317-4952 (cell)
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>
>

From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 20:20:36 2013
Return-Path:
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 94E81245; Tue, 29 Oct 2013 20:20:36 +0000 (UTC) (envelope-from rrs@lakerest.net)
Received: from lakerest.net (lakerest.net [162.235.35.161]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by
mx1.freebsd.org (Postfix) with ESMTPS id 02ABD23C6; Tue, 29 Oct 2013 20:20:35 +0000 (UTC)
Received: from [10.1.1.103] (bsd4.lakerest.net [162.235.35.162]) (authenticated bits=0) by lakerest.net (8.14.4/8.14.3) with ESMTP id r9TKK8eU075478 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NOT); Tue, 29 Oct 2013 16:20:19 -0400 (EDT) (envelope-from rrs@lakerest.net)
Subject: Re: MQ Patch.
Mime-Version: 1.0 (Apple Message framework v1283)
Content-Type: text/plain; charset=windows-1252
From: Randall Stewart
In-Reply-To:
Date: Tue, 29 Oct 2013 16:20:08 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org>
To: Luigi Rizzo
X-Mailer: Apple Mail (2.1283)
Cc: Andre Oppermann , "freebsd-net@freebsd.org"
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 29 Oct 2013 20:20:36 -0000

Luigi:

comments inline..

On Oct 29, 2013, at 3:58 PM, Luigi Rizzo wrote:

> my short, top-post comment is that I'd rather see some more
> coordination with Andre, and especially some high level README
> or other form of documentation explaining the architecture
> you have in mind before this goes in.
>
> To expand my point of view (and please do not read me as negative,
> i am trying to be constructive and avoid future troubles and
> volunteer to help with the design and implementation):
>
> (i'll omit issues re. style and unrelated patches in the diff
> because they are premature)
>
> 1. Having multiple separate software queues attached to a physical queue
> makes sense only if we have a clear and documented plan
> for scheduling traffic from these queues into the hw one.
> Otherwise it ends up being just another confusing hack
> that makes it difficult to reason about device drivers.
>
> We already have something similar now (with the drbr queue on top
> used in some cases when the hw ring overflows), the ALTQ hooks,
> and without documentation this does not seem to improve the
> current situation.
>

Well I can't get Adara to give up how it uses these in its product.. I was
lucky to get them to give back the low level work.

The problem with ALTQ is that it is really broken if you want any sort of
decent performance with queueing. However with a small bit of work (aka
throw away the altq queues themselves and set ALTQ to place the ac_qos number in
here and queue the packet) you could have ALTQ able to transmit at line-rate and
have proper QOS.

> 2. QoS is not just priority scheduling or AQM a-la RED/CODEL/PI,
> but a coherent framework where you can classify/partition traffic
> into separate queues, apply one of several queue management
> (taildrop/RED/CODEL/whatever) and scheduling (which queue to serve next)
> policies in an efficient way.
>
> Linux mostly gets this right (they even support hierarchical schedulers).

Which is what ALTQ attempts to do as well. Again I can't get Adara
to give their top-level code.. but someone *could* (hint hint) hook altq up
to this and be able to have a reasonable performance model with altq...

>
> Dummynet has a reasonable architecture although not hierarchical
> and it operates at the IP level (or possibly at layer 2),
> which is probably too high (but not necessarily).
> We can also recycle the components, i.e. the classifier in ipfw
> and the scheduling algorithms. I am happy to help on this.
>
> ALTQ is too old and complex and inefficient and unmaintained to be considered.

Exactly..

>
> And i cannot comment on your code because you don't really explain
> what you want to do and how.
> Codel/PI are only queue management,
> not qos; and strict priority is just one (and probably the worse) policy
> one can have.

Of course, but you need them if you want to prevent buffer-bloat.

>
> One comment i can make, however, on the fact that 256 queues are
> way too few for a proper system. You need the number to be
> dynamic and much larger (e.g. using flowid as a key).
>
> So, to conclude: i fully support any plan to design something that lets us
> implement scheduling (and qos, if you want to call it this way)
> in a reasonable way, but what is in your patch now does not really
> seem to improve the current situation in any way.
>

It's a step towards fixing that, and one that I am allowed to give. I can see
why companies get frustrated with trying to give anything to the project.

R

> cheers
> luigi

------------------------------
Randall Stewart
803-317-4952 (cell)

From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 20:42:10 2013
Return-Path:
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 213012F4 for ; Tue, 29 Oct 2013 20:42:10 +0000 (UTC) (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 3DC992533 for ; Tue, 29 Oct 2013 20:42:08 +0000 (UTC)
Received: (qmail 57695 invoked from network); 29 Oct 2013 21:12:39 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 29 Oct 2013 21:12:39 -0000
Message-ID: <52701D8B.8050907@freebsd.org>
Date: Tue, 29 Oct 2013 21:41:47 +0100
From: Andre Oppermann
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Luigi Rizzo
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org>
In-Reply-To:
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Cc: Randall Stewart , "freebsd-net@freebsd.org"
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 29 Oct 2013 20:42:10 -0000

Let me jump in here and explain roughly the ideas/path I'm exploring
in creating and eventually implementing a big picture for drivers,
queues, queue management, various QoS and so on:

Situation: We're still mostly based on the old 4.4BSD IFQ model with
a couple of work-arounds (sndring, drbr), and the bit-rotten ALTQ we
have in tree isn't helpful at all.

Steps:

1. take the soft-queuing method out of the ifnet layer and make it
   a property of the driver, so that the upper stack (or actually the
   protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit)
   without any queuing at that point. It then is up to the driver
   to decide how it multiplexes multi-core access to its queue(s)
   and how they are configured. Some hardware supports multiple
   queues and some even supports WFQ models among these queues in
   hardware. In that case any soft-queue layer would be omitted.
   For the other cases the kernel will provide one or two proven and
   optimized soft-queue and multi-writer access implementations to
   be used by the drivers. Drivers should avoid having their own
   soft-queue implementations but they can if they really want to.

2. make flowids (or hashes) an integral part of the network stack.
   The mbuf header fully supports it. If the hardware provides a
   flowid (Toeplitz for example) use it, otherwise compute a hash a
   bit up the stack for incoming packets. Outgoing packets get their
   hash based on the inpcb or whatever. In- and outbound directions
   are totally separate and don't have to use the same hash; it only
   has to be constant within a flow. In theory it could be randomly
   chosen at flow setup time (e.g. TCP connect). This way the load
   can be distributed among multiple hw queues, or among interfaces
   in the case of lagg(4), with a single mbuf header lookup. When we
   can make sure that every packet has a flowid many things become
   possible and even easy. Again drivers should not invent their own
   software implementations and should rely on the kernel to provide
   it.

3. make QoS/CoS an integral part of the network stack. The first
   step is done with the qoscos field in the mbuf header. It is
   eight bits wide and its use/semantics haven't been fully
   established yet. However the idea is to have a classifier tag the
   packet when it enters the network stack, either by coming in on
   an interface or by being generated within the stack. The qoscos
   tag can be taken from layer 2 information (vlan header) or chosen
   based on more complex rules through a packet filter such as ipfw,
   pf or ipf. There won't be any separate classifier as in ALTQ
   anymore. This is also the path OpenBSD has taken. Depending on
   the ingress/egress encapsulation the range of qos/cos information
   may be more limited than the 8 bits we have in the mbuf header.
   In that case the larger range has to be mapped into the smaller
   range by putting neighboring bins together. This is how it is
   done in all routers and routing switches by various vendors. The
   administrator decides how the mapping is done and where it is
   taken from.

4. adjust the stack and drivers to do all of the above and to
   optimally make use of the hardware capabilities. If the hardware
   supports multi-queue and SP/WFQ at once (e.g. ixgbe(4)) then there
   is no need for any soft-queuing. Otherwise the various queuing
   and queue management disciplines will hook into (*if_transmit)
   and do their magic before the packet reaches the DMA ring. To
   reach this level a bit of infrastructure work has to be done
   first, for example the DMA ring depth needs to be adjustable
   through a generic mechanism for all drivers, and the new-ALTQ
   should be able to hook into the driver's TX completion interrupt
   to clock out the packets.

This should give a rough outline of the path(s) to be explored in the
next weeks.

--
Andre

From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 20:50:28 2013
Return-Path:
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 77851A24 for ; Tue, 29 Oct 2013 20:50:28 +0000 (UTC) (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id DB88625EE for ; Tue, 29 Oct 2013 20:50:27 +0000 (UTC)
Received: (qmail 57757 invoked from network); 29 Oct 2013 21:20:57 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 29 Oct 2013 21:20:57 -0000
Message-ID: <52701F7E.2060604@freebsd.org>
Date: Tue, 29 Oct 2013 21:50:06 +0100
From: Andre Oppermann
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Randall Stewart , Luigi Rizzo
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net>
In-Reply-To: <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "freebsd-net@freebsd.org"
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 29 Oct 2013 20:50:28 -0000

On 29.10.2013 21:20, Randall Stewart wrote:
>> So, to conclude: i fully support any plan to design something that lets us
>> implement scheduling (and qos, if you want to call it this way)
>> in a reasonable way, but what is in your patch now does not really
>> seem to improve the current situation in any way.
>
> Its a step towards fixing that I am allowed to give. I can see
> why Company's get frustrated with trying to give anything to the project.

Well, that we have a problem in that area is known and acknowledged, and there is active work going on in this area.

It would be very problematic if every vendor were just to throw some stuff over the fence and have it integrated as is. It would quickly become very messy. In many special-purpose products a number of shortcuts can be taken that may not be appropriate for a general purpose OS that does more than routing.

I believe we value the contribution by Adara and you but at the same time want to integrate it into a bigger picture for the entire kernel. When you pull up your product to FreeBSD 11 in the future it should be easy to stack your functionality again on the new base infrastructure without many/any modifications.
-- Andre From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 21:02:40 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id B67EF430; Tue, 29 Oct 2013 21:02:40 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-qc0-x231.google.com (mail-qc0-x231.google.com [IPv6:2607:f8b0:400d:c01::231]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 64751272B; Tue, 29 Oct 2013 21:02:40 +0000 (UTC) Received: by mail-qc0-f177.google.com with SMTP id u18so280575qcx.36 for ; Tue, 29 Oct 2013 14:02:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=869yhwA5JFgzW7eN27uMblMpWHeA+FnAVb49v/fEe20=; b=eSUvg/Rsio9unPDX7nyGqYBsAgeRR65Z6u8Yoa+VfqWR1sB27iWfAx3dHfGMoHnhC/ N/9Jjcp9orVYbAslXS3Zw5EilGz0NpNS17xG3+XWxZ1YQygcTDms39pdrav+s/Wm1bBY J92gWeZCQ8kW3+ap8NMXC5oHA5S+yC/nMJmLblix1C49rxThmuqCf2SLms6Uw0w/zj3e vhC8f0y+JQ7ty5umirJj2aVUPT/5lZLg5xUXupEYt5EUu8Mth3wngLVWfUxqp+CBln89 ucyW7rv8cyB0e3+nfHWxEywTSIiKznd2OzUiCW/wyR27cKR+lnhoeLKog2VzI8wpnPTX iQLg== MIME-Version: 1.0 X-Received: by 10.49.12.14 with SMTP id u14mr2335894qeb.74.1383080559519; Tue, 29 Oct 2013 14:02:39 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.224.207.66 with HTTP; Tue, 29 Oct 2013 14:02:39 -0700 (PDT) In-Reply-To: <52701F7E.2060604@freebsd.org> References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net> <52701F7E.2060604@freebsd.org> Date: Tue, 29 Oct 2013 14:02:39 -0700 X-Google-Sender-Auth: CM5CFZliHd3rd1Ywv_53dQLgg3I Message-ID: Subject: Re: MQ Patch. 
From: Adrian Chadd To: Andre Oppermann Content-Type: text/plain; charset=ISO-8859-1 Cc: Luigi Rizzo , Randall Stewart , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 21:02:40 -0000 [snip everything] ok, I've reviewed the work. TL;DR - it's a clearly correct step in the right direction, but I think we need to just think it through a tad bit more first. There have been queue discipline and queue management discussions in the past. Randall's work is a good step in that direction. I think though that we can take a step back up a little further. * In terms of queuing frames into multiple queues - yes, we absolutely should have an if_transmit() path to the driver that obeys "a" QoS field in the mbuf and pushes it into the relevant queue - with randalls work, it's in the driver, but it doesn't _have_ to be; * In terms of queue servicing and management - we likely need to have a variety of queue plugins that determine which frame from which queue gets chosen next to hand to the hardware. The hardware may have multiple queues! The hardware may have one queue! The application developer may only want to use one queue! That should be flexible and easy to plug into things. * Then we need to support dropping frames during queue and dropping frames during dequeue (ie, on its way to the hardware). That way we can implement the currently interesting kinds of queue disciplines (eg CODEL, etc.) * Should this be done at the driver layer (ie it's a library that each driver creates and owns), or as a layer above it, controlling the network device (ie, the linux queue discipline method.) So, my comments: * I don't like how it's hard-coding drbr's into the drivers. Yes, the underlying state should be a drbr for now. 
But I'd rather we have a queue discipline plugin API that drivers create an instance of. * It'll have methods to init/flush the rings, queue a frame into a ring, dequeue a frame from a ring, be notified of transmit completions so more work can be done, etc. * For people who do latency-sensitive things, they can just bypass this entirely and go straight to the hardware queues without going through this kind of intermediary queue layer. Randall - I think we can take your work and turn it into a net library that implements your queue management routines. That way we can start enabling people to tinker with it and replace it if they need to. What do you think? From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 21:03:45 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 98DF956C; Tue, 29 Oct 2013 21:03:45 +0000 (UTC) (envelope-from nparhar@gmail.com) Received: from mail-pb0-x22e.google.com (mail-pb0-x22e.google.com [IPv6:2607:f8b0:400e:c01::22e]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 6961F2748; Tue, 29 Oct 2013 21:03:45 +0000 (UTC) Received: by mail-pb0-f46.google.com with SMTP id un4so402292pbc.33 for ; Tue, 29 Oct 2013 14:03:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=GX8gLD4SxWt6kl9FF0AgLSK5vSzpNxvKaob1H5XX4mA=; b=r6PfG3Mwne9Xc2whqETNi5BgbVvnkR2JQt/MRbLhNhQ3Wl5pmTos9mcHRPjgR6dlpW QrzKrk8iIp+GKf0P5+RBE9nu4UWouAf7GsfEwlzc/WYGKxpL5XulywYwdpLoXe9NNr5x Mc80LtdMKsqrbFjijFfdMgUDwJrBlxtYgP5pHaCgXytkP4CktaVhrBz9VNfy+aCeBwUJ Jg2UfGIVCfMXuaMZ+eEQyO7jXcziOG5br6tPPXmpC5p0K2dkBjg6/bnHUEX4hCBzERyj 
RoXYIgFK29JEc899HG8Nt9Gm2ClzgE9sU/KCwFm1nM3Fw90lsEIHtF8IgypUpD9AZLvL pIBQ== X-Received: by 10.66.233.69 with SMTP id tu5mr2467394pac.78.1383080624103; Tue, 29 Oct 2013 14:03:44 -0700 (PDT) Received: from [10.192.166.0] (stargate.chelsio.com. [67.207.112.58]) by mx.google.com with ESMTPSA id v4sm36857732pbq.31.2013.10.29.14.03.42 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 29 Oct 2013 14:03:43 -0700 (PDT) Sender: Navdeep Parhar Message-ID: <527022AC.4030502@FreeBSD.org> Date: Tue, 29 Oct 2013 14:03:40 -0700 From: Navdeep Parhar User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Andre Oppermann , Luigi Rizzo Subject: Re: MQ Patch. References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> In-Reply-To: <52701D8B.8050907@freebsd.org> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Cc: Randall Stewart , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 21:03:45 -0000 On 10/29/13 13:41, Andre Oppermann wrote: > Let me jump in here and explain roughly the ideas/path I'm exploring > in creating and eventually implementing a big picture for drivers, > queues, queue management, various QoS and so on: > > Situation: We're still mostly based on the old 4.4BSD IFQ model with > a couple of work-arounds (sndring, drbr) and the bit-rotten ALTQ we > have in tree aren't helpful at all. > > Steps: > > 1. take the soft-queuing method out of the ifnet layer and make it > a property of the driver, so that the upper stack (or actually > protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit) > without any queuing at that point. 
It then is up to the driver > to decide how it multiplexes multi-core access to its queue(s) > and how they are configured. It would work out much better if the kernel was aware of the number of tx queues of a multiq driver and explicitly selected one in if_transmit. The driver has no information on the CPU affinity etc. of the applications generating the traffic; the kernel does. In general, the kernel has a much better "global view" of the system and some of the stuff currently in the drivers really should move up into the stack. Regards, Navdeep From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 21:25:56 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 897865A0 for ; Tue, 29 Oct 2013 21:25:56 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id EAD65291A for ; Tue, 29 Oct 2013 21:25:55 +0000 (UTC) Received: (qmail 57950 invoked from network); 29 Oct 2013 21:56:25 -0000 Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 29 Oct 2013 21:56:25 -0000 Message-ID: <527027CE.5040806@freebsd.org> Date: Tue, 29 Oct 2013 22:25:34 +0100 From: Andre Oppermann User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Navdeep Parhar , Luigi Rizzo Subject: Re: MQ Patch. 
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> In-Reply-To: <527022AC.4030502@FreeBSD.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Cc: Randall Stewart , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 21:25:56 -0000 On 29.10.2013 22:03, Navdeep Parhar wrote: > On 10/29/13 13:41, Andre Oppermann wrote: >> Let me jump in here and explain roughly the ideas/path I'm exploring >> in creating and eventually implementing a big picture for drivers, >> queues, queue management, various QoS and so on: >> >> Situation: We're still mostly based on the old 4.4BSD IFQ model with >> a couple of work-arounds (sndring, drbr) and the bit-rotten ALTQ we >> have in tree aren't helpful at all. >> >> Steps: >> >> 1. take the soft-queuing method out of the ifnet layer and make it >> a property of the driver, so that the upper stack (or actually >> protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit) >> without any queuing at that point. It then is up to the driver >> to decide how it multiplexes multi-core access to its queue(s) >> and how they are configured. > > It would work out much better if the kernel was aware of the number of > tx queues of a multiq driver and explicitly selected one in if_transmit. > The driver has no information on the CPU affinity etc. of the > applications generating the traffic; the kernel does. In general, the > kernel has a much better "global view" of the system and some of the > stuff currently in the drivers really should move up into the stack. 
I've been thinking a lot about this and have come to the preliminary conclusion that the upper stack should not tell the driver which queue to use. There are way too many possible approaches, and they perform better or worse depending on the use case. Also we have a big problem with cores vs. queues mismatches either way (more cores than queues or more queues than cores, though the latter is much less of a problem).

For now I see these primary multi-hardware-queue approaches to be implemented first:

a) the driver's (*if_transmit) takes the flowid from the mbuf header and selects one of the N hardware DMA rings based on it. Each of the DMA rings is protected by a lock. Here the assumption is that by having enough DMA rings the contention on each of them will be relatively low and ideally a flow and ring sort of sticks to a core that sends lots of packets into that flow. Of course it is a statistical certainty that some bouncing will be going on.

b) the driver assigns the DMA rings to particular cores which can then, through a critnest++, drive them locklessly. The driver's (*if_transmit) will look up the core it got called on and push the traffic out on that DMA ring. The problem is the upper stack's actual affinity, which is not guaranteed. This has two consequences: either there may be reordering of packets of the same flow because the protocol's send function happens to be called from a different core the second time, or the driver's (*if_transmit) has to switch to the right core to complete the transmit for this flow if the upper stack migrated/bounced around. It is rather difficult to assure full affinity from userspace down through the upper stack and then to the driver.

c) non-multi-queue capable hardware uses a kernel provided set of functions to manage the contention for the single resource of a DMA ring.
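To make approaches (a) and (b) a bit more concrete, here is a minimal userspace sketch in C. It is not driver code and not taken from any patch in this thread: the names (softc, tx_ring, ring_by_flowid, ring_by_cpu, tx_enqueue), the ring count and the ring depth are all hypothetical, a C11 atomic_flag spin lock stands in for a kernel mutex, and a plain array stands in for DMA descriptors. It only shows the ring-selection arithmetic and the per-ring locking idea.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define NRINGS     4   /* hypothetical number of hardware TX rings */
#define RING_SLOTS 8   /* hypothetical ring depth */

/* Stand-in for one hardware DMA ring (assumption: no real descriptors). */
struct tx_ring {
	atomic_flag  lock;              /* approach (a): one lock per ring */
	uint32_t     slots[RING_SLOTS]; /* packet ids instead of descriptors */
	unsigned int used;
};

/* Stand-in for a driver softc holding all rings. */
struct softc {
	struct tx_ring rings[NRINGS];
};

static void
softc_init(struct softc *sc)
{
	for (unsigned int i = 0; i < NRINGS; i++) {
		atomic_flag_clear(&sc->rings[i].lock);
		sc->rings[i].used = 0;
	}
}

/* Approach (a): the flowid from the packet header selects the ring. */
static unsigned int
ring_by_flowid(uint32_t flowid)
{
	return (flowid % NRINGS);
}

/* Approach (b): the core the transmit was called on selects the ring. */
static unsigned int
ring_by_cpu(unsigned int cpu)
{
	return (cpu % NRINGS);
}

static void
ring_lock(struct tx_ring *r)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
		;
}

static void
ring_unlock(struct tx_ring *r)
{
	atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/*
 * Transmit path for approach (a): pick the ring by flowid, take that
 * ring's lock, and tail-drop with ENOBUFS when the ring is full.
 */
static int
tx_enqueue(struct softc *sc, uint32_t flowid, uint32_t pkt_id)
{
	struct tx_ring *r = &sc->rings[ring_by_flowid(flowid)];
	int error = 0;

	ring_lock(r);
	if (r->used == RING_SLOTS)
		error = ENOBUFS;
	else
		r->slots[r->used++] = pkt_id;
	ring_unlock(r);
	return (error);
}
```

With these assumptions, packets of one flow always land on the same ring regardless of the calling core, which is what avoids the reordering problem described for (b); the price is the lock on each ring.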
The point here is that the driver is the right place to make these decisions because the upper stack lacks (and shouldn't care about) the actual available hardware and its capabilities. All necessary information is available to the driver as well through the appropriate mbuf header fields and the core it is called on. -- Andre From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 21:35:34 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 96C1ABC2; Tue, 29 Oct 2013 21:35:34 +0000 (UTC) (envelope-from rizzo.unipi@gmail.com) Received: from mail-la0-x235.google.com (mail-la0-x235.google.com [IPv6:2a00:1450:4010:c03::235]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id CDD9629E1; Tue, 29 Oct 2013 21:35:33 +0000 (UTC) Received: by mail-la0-f53.google.com with SMTP id eo20so388122lab.40 for ; Tue, 29 Oct 2013 14:35:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=UJbPHBJ5U+JO635o1N/VBaC+cD0OeJweC3LtPpfS8XE=; b=y4eWumh4Ll8thrCUKpXW3ffd74f2+ccEKrVOL5HXLtiDedEP0A2bEx1LBixV0TUFax ZWPb0tbPwCXvbiIiG895zkWkDXe+MGLZdpP8l75JgvObH0Lfyxcecv88CF/3YITj8p8/ +QFQt3khQEKa8BxelP4cefX3OVTHncV9JowoDaLRhNwuHpSiwFPVX12m5COEHeBvIT+e z8H407LrDOz2lhshBO2+7fjjZYZDsnRY3yQ/A6hp4kdijd+PqginDOkOTgkxTaG/c5q6 j8ioPEPGRp3x54/vQLwpiymAfzl5OR1uAAu1rHkWYSSwEUrdK4UCvZSrRDkh40oCDW0+ WNoQ== MIME-Version: 1.0 X-Received: by 10.112.167.99 with SMTP id zn3mr1317789lbb.34.1383082531605; Tue, 29 Oct 2013 14:35:31 -0700 (PDT) Sender: rizzo.unipi@gmail.com Received: by 10.114.172.105 with HTTP; Tue, 29 Oct 2013 14:35:31 -0700 (PDT) In-Reply-To: <52701F7E.2060604@freebsd.org> References: 
<40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net> <52701F7E.2060604@freebsd.org> Date: Tue, 29 Oct 2013 14:35:31 -0700 X-Google-Sender-Auth: LWJQwASElfs9xLH9VsySXAdty9I Message-ID: Subject: Re: MQ Patch. From: Luigi Rizzo To: Andre Oppermann Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: Randall Stewart , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 21:35:34 -0000 On Tue, Oct 29, 2013 at 1:50 PM, Andre Oppermann wrote: > On 29.10.2013 21:20, Randall Stewart wrote: > >> So, to conclude: i fully support any plan to design something that lets us >>> implement scheduling (and qos, if you want to call it this way) >>> in a reasonable way, but what is in your patch now does not really >>> seem to improve the current situation in any way. >>> >> >> Its a step towards fixing that I am allowed to give. I can see >> why Company's get frustrated with trying to give anything to the project. >> > > Well, that we have a problem in that area is known and acknowledged and > there is active work in this area going on. > > It would be very problematic if every vendor were just to through some > stuff over the fence and have it integrated as is. It would quickly > become very messy. In many specific purpose geared products a number > of shortcuts can be taken that may not be appropriate for a general > purpose OS that does more than routing. > that is exactly the issue. It is not just FreeBSD that has strict policies on what gets accepted. Several times (though mostly in the past) I myself have been suggested to reconsider submissions that were too intrusive or lacking from an architectural point of view. 
And as much as I could have been annoyed, I have to recognise that the criticism was legitimate and eventually led to better implementations.

Of course one has much more freedom when playing with a standalone component (say netmap, or a device driver, or SCTP...) which does not interfere with the rest of the kernel, and possibly even fills a hole in the OS. But this is not one of those cases.

cheers
luigi

From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 21:45:02 2013
Return-Path:
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id AA544F53 for ; Tue, 29 Oct 2013 21:45:02 +0000 (UTC) (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 0CF542A9C for ; Tue, 29 Oct 2013 21:45:01 +0000 (UTC)
Received: (qmail 58032 invoked from network); 29 Oct 2013 22:15:31 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 29 Oct 2013 22:15:31 -0000
Message-ID: <52702C48.3010706@freebsd.org>
Date: Tue, 29 Oct 2013 22:44:40 +0100
From: Andre Oppermann
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Adrian Chadd
Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net> <52701F7E.2060604@freebsd.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Luigi Rizzo , Randall Stewart , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 21:45:02 -0000 On 29.10.2013 22:02, Adrian Chadd wrote: > [snip everything] > > ok, I've reviewed the work. > > TL;DR - it's a clearly correct step in the right direction, but I > think we need to just think it through a tad bit more first. > > There have been queue discipline and queue management discussions in > the past. Randall's work is a good step in that direction. > > I think though that we can take a step back up a little further. > > * In terms of queuing frames into multiple queues - yes, we absolutely > should have an if_transmit() path to the driver that obeys "a" QoS > field in the mbuf and pushes it into the relevant queue - with > randalls work, it's in the driver, but it doesn't _have_ to be; Only the driver can know how much it can do in hardware and how much has to be emulated in software. The kernel should provide a couple of optimized software emulation to driver should link into. > * In terms of queue servicing and management - we likely need to have > a variety of queue plugins that determine which frame from which queue > gets chosen next to hand to the hardware. The hardware may have > multiple queues! The hardware may have one queue! The application > developer may only want to use one queue! That should be flexible and > easy to plug into things. We have to get rid of the current (and mostly mental) model of a software queue. 
The software queue only exists a) for historical reasons, as the first interfaces didn't have any DMA rings at all; b) to manage concurrent access to a single or limited shared resource. In reality the DMA ring is deep enough and *all the queue* we need.

> * Then we need to support dropping frames during queue and dropping
> frames during dequeue (ie, on its way to the hardware). That way we
> can implement the currently interesting kinds of queue disciplines (eg
> CODEL, etc.)

DMA rings by definition are tail drop. If you want to do active QoS and queue management you trade the DMA ring size for a software queue size. However this is only really an issue for routing types of traffic.

With TCP, getting an ENOBUFS on a send attempt is perfectly valid and the send socket buffer works as our queue. No need to deep buffer yet once more in software before the DMA ring. The only thing is that TCP needs some polish in that area to prevent it from treating this as a loss event. Maybe Lawrence can audit and adjust the relevant parts of tcp_output()'s error handling. It should simply try again a few milliseconds later without waiting for a retransmit timeout or the ACK clocking again.

> * Should this be done at the driver layer (ie it's a library that each
> driver creates and owns), or as a layer above it, controlling the
> network device (ie, the linux queue discipline method.)

If the hardware actually supports it, then it should be done in the driver. Otherwise the QoS and queue management would get shimmed in and hijack the (*if_transmit) function pointer to do the stuff in software and tick out the packets through TX complete callbacks (or alternatively a timer as in dummynet).

> So, my comments:
>
> * I don't like how it's hard-coding drbr's into the drivers. Yes, the
> underlying state should be a drbr for now. But I'd rather we have a
> queue discipline plugin API that drivers create an instance of.

Full ACK. That's the plan.
> * It'll have methods to init/flush the rings, queue a frame into a
> ring, dequeue a frame from a ring, be notified of transmit completions
> so more work can be done, etc.

Pretty much. Drivers will be required to implement certain functionality to manage the DMA ring depth and to provide a TX completion callback into the software qos/queue shim but not the upper stack.

> * For people who do latency-sensitive things, they can just bypass
> this entirely and go straight to the hardware queues without going
> through this kind of intermediary queue layer.

IMHO this should be the default anyways with some provision to manage contention by multiple cores. For example by having a single packet slot for each core in case the DMA ring is already locked by another core.

> Randall - I think we can take your work and turn it into a net library
> that implements your queue management routines. That way we can start
> enabling people to tinker with it and replace it if they need to.

Moving struct ifnet and the drivers into the new model and making ifnet opaque has already been signed up for by Gleb and me. When that is in place in the next weeks any kind of queue model can be implemented at the driver's discretion, including Randall's.
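The plugin idea discussed above (methods to init/flush, enqueue, dequeue, and be notified of TX completions) could look roughly like the following userspace C sketch. Everything here is an assumption made for illustration: the struct txq_ops table, the fifo_* functions, struct pkt as a stand-in for an mbuf, and the queue depth are hypothetical names, not an existing or planned FreeBSD API. The one discipline shown is a plain tail-drop FIFO that returns ENOBUFS when full, matching the tail-drop behaviour described for DMA rings.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for a packet; a real kernel would pass struct mbuf pointers. */
struct pkt {
	int id;
};

/* Hypothetical method table a driver would instantiate per TX queue. */
struct txq_ops {
	void        (*init)(void *q);
	int         (*enqueue)(void *q, struct pkt *p);
	struct pkt *(*dequeue)(void *q);
	void        (*flush)(void *q);
	void        (*tx_done)(void *q, unsigned int completed);
};

#define FIFO_DEPTH 4   /* hypothetical software queue depth */

/* State for the simplest possible discipline: a tail-drop FIFO. */
struct fifo_q {
	struct pkt  *slot[FIFO_DEPTH];
	unsigned int head;      /* index of the oldest queued packet */
	unsigned int count;     /* packets waiting in the queue */
	unsigned int inflight;  /* packets handed out, not yet completed */
};

static void
fifo_init(void *q)
{
	struct fifo_q *f = q;

	f->head = f->count = f->inflight = 0;
}

/* Tail drop: refuse the new packet with ENOBUFS when the queue is full. */
static int
fifo_enqueue(void *q, struct pkt *p)
{
	struct fifo_q *f = q;

	if (f->count == FIFO_DEPTH)
		return (ENOBUFS);
	f->slot[(f->head + f->count) % FIFO_DEPTH] = p;
	f->count++;
	return (0);
}

/* Hand the oldest packet to the caller, or NULL when the queue is empty. */
static struct pkt *
fifo_dequeue(void *q)
{
	struct fifo_q *f = q;
	struct pkt *p;

	if (f->count == 0)
		return (NULL);
	p = f->slot[f->head];
	f->head = (f->head + 1) % FIFO_DEPTH;
	f->count--;
	f->inflight++;
	return (p);
}

/* Discard everything still queued; in-flight packets are left alone. */
static void
fifo_flush(void *q)
{
	struct fifo_q *f = q;

	f->head = f->count = 0;
}

/* TX completion callback: the hardware finished `completed` packets. */
static void
fifo_tx_done(void *q, unsigned int completed)
{
	struct fifo_q *f = q;

	f->inflight -= (completed < f->inflight) ? completed : f->inflight;
}

static const struct txq_ops fifo_ops = {
	.init    = fifo_init,
	.enqueue = fifo_enqueue,
	.dequeue = fifo_dequeue,
	.flush   = fifo_flush,
	.tx_done = fifo_tx_done,
};
```

A different discipline (say, one that drops at dequeue time) would supply its own table with the same five entry points, so the driver code calling through the table would not need to change.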
-- Andre From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 22:03:14 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 3518960A; Tue, 29 Oct 2013 22:03:14 +0000 (UTC) (envelope-from nparhar@gmail.com) Received: from mail-pb0-x233.google.com (mail-pb0-x233.google.com [IPv6:2607:f8b0:400e:c01::233]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 052802BBC; Tue, 29 Oct 2013 22:03:13 +0000 (UTC) Received: by mail-pb0-f51.google.com with SMTP id wz7so465047pbc.10 for ; Tue, 29 Oct 2013 15:03:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=n5pZqujjMbW1gsx4qa8sGOM3f+lVa4umij4bRjcFZKM=; b=huuCOQ4dMOx5dzw3I/jgJ4+t8H6GakygCfCdcTknnlGKka26pWxE7uKmJHoG80X3HB j1Tzmw74N9PUY7/bLEuto5UXWqsoz+pDeAbUh8W+CR/gJoqUSpi7mvtsFvRn6AEb9Qs9 XObSNe/F0CfjyouAiPNLj6VRxcNCzzPF20feax+y7TJxfJWZe8Bqom2mVz2O7UUe3YUZ nrLQuaLbvpO7RVIhfsiAhXu0mSVucynUXhFvc1mnxL6p2XD0H5WGCuQUgmohVAjvf/wI Rzj1/HGUgClmd0oXRKPSMRnciy7XDcjUolFnuajaVTu4GgLGk3wuhkrhJVjhjXw8oj9S KG2w== X-Received: by 10.68.228.138 with SMTP id si10mr838544pbc.13.1383084193322; Tue, 29 Oct 2013 15:03:13 -0700 (PDT) Received: from [10.192.166.0] (stargate.chelsio.com. 
[67.207.112.58]) by mx.google.com with ESMTPSA id qp10sm44953730pab.13.2013.10.29.15.03.11 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 29 Oct 2013 15:03:12 -0700 (PDT) Sender: Navdeep Parhar Message-ID: <5270309E.5090403@FreeBSD.org> Date: Tue, 29 Oct 2013 15:03:10 -0700 From: Navdeep Parhar User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Andre Oppermann , Luigi Rizzo Subject: Re: MQ Patch. References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> In-Reply-To: <527027CE.5040806@freebsd.org> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Cc: Randall Stewart , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 22:03:14 -0000 On 10/29/13 14:25, Andre Oppermann wrote: > On 29.10.2013 22:03, Navdeep Parhar wrote: >> On 10/29/13 13:41, Andre Oppermann wrote: >>> Let me jump in here and explain roughly the ideas/path I'm exploring >>> in creating and eventually implementing a big picture for drivers, >>> queues, queue management, various QoS and so on: >>> >>> Situation: We're still mostly based on the old 4.4BSD IFQ model with >>> a couple of work-arounds (sndring, drbr) and the bit-rotten ALTQ we >>> have in tree aren't helpful at all. >>> >>> Steps: >>> >>> 1. take the soft-queuing method out of the ifnet layer and make it >>> a property of the driver, so that the upper stack (or actually >>> protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit) >>> without any queuing at that point. It then is up to the driver >>> to decide how it multiplexes multi-core access to its queue(s) >>> and how they are configured. 
>> >> It would work out much better if the kernel was aware of the number of >> tx queues of a multiq driver and explicitly selected one in if_transmit. >> The driver has no information on the CPU affinity etc. of the >> applications generating the traffic; the kernel does. In general, the >> kernel has a much better "global view" of the system and some of the >> stuff currently in the drivers really should move up into the stack. > > I've been thinking a lot about this and come to the preliminary conclusion > that the upper stack should not tell the driver which queue to use. There > are way to many possible and depending on the use-case, better or worse > performing approaches. Also we have a big problem with cores vs. queues > mismatches either way (more cores than queues or more queues than cores, > though the latter is much less of problem). > > For now I see these primary multi-hardware-queue approaches to be > implemented > first: > > a) the drivers (*if_transmit) takes the flowid from the mbuf header and > selects one of the N hardware DMA rings based on it. Each of the DMA > rings is protected by a lock. Here the assumption is that by having > enough DMA rings the contention on each of them will be relatively low > and ideally a flow and ring sort of sticks to a core that sends lots > of packets into that flow. Of course it is a statistical certainty that > some bouncing will be going on. > > b) the driver assigns the DMA rings to particular cores which by that, > through > a critnest++ can drive them lockless. The drivers (*if_transmit) > will look > up the core it got called on and push the traffic out on that DMA ring. > The problem is the actual upper stacks affinity which is not guaranteed. > This has to consequences: there may be reordering of packets of the same > flow because the protocols send function happens to be called from a > different core the second time. 
Or the drivers (*if_transmit) has to > switch to the right core to complete the transmit for this flow if the > upper stack migrated/bounced around. It is rather difficult to assure > full affinity from userspace down through the upper stack and then to > the driver. > > c) non-multi-queue capable hardware uses a kernel provided set of functions > to manage the contention for the single resource of a DMA ring. > > The point here is that the driver is the right place to make these > decisions > because the upper stack lacks (and shouldn't care about) the actual > available > hardware and its capabilities. All necessary information is available > to the > driver as well through the appropriate mbuf header fields and the core > it is > called on. > I mildly disagree with most of this, specifically with the part that the driver is the right place to make these decisions. But you did say this was a "preliminary conclusion" so there's hope yet ;-) Let's wait till you have an early implementation and we are all able to experiment with it. To be continued... 
Regards, Navdeep From owner-freebsd-net@FreeBSD.ORG Tue Oct 29 23:35:36 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 46E73198 for ; Tue, 29 Oct 2013 23:35:36 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 7875920E7 for ; Tue, 29 Oct 2013 23:35:35 +0000 (UTC) Received: (qmail 58447 invoked from network); 30 Oct 2013 00:05:57 -0000 Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 30 Oct 2013 00:05:57 -0000 Message-ID: <5270462B.8050305@freebsd.org> Date: Wed, 30 Oct 2013 00:35:07 +0100 From: Andre Oppermann User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Navdeep Parhar , Luigi Rizzo Subject: Re: MQ Patch. 
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> In-Reply-To: <5270309E.5090403@FreeBSD.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Cc: Randall Stewart , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Oct 2013 23:35:36 -0000 On 29.10.2013 23:03, Navdeep Parhar wrote: > On 10/29/13 14:25, Andre Oppermann wrote: >> On 29.10.2013 22:03, Navdeep Parhar wrote: >>> On 10/29/13 13:41, Andre Oppermann wrote: >>>> Let me jump in here and explain roughly the ideas/path I'm exploring >>>> in creating and eventually implementing a big picture for drivers, >>>> queues, queue management, various QoS and so on: >>>> >>>> Situation: We're still mostly based on the old 4.4BSD IFQ model with >>>> a couple of work-arounds (sndring, drbr) and the bit-rotten ALTQ we >>>> have in tree aren't helpful at all. >>>> >>>> Steps: >>>> >>>> 1. take the soft-queuing method out of the ifnet layer and make it >>>> a property of the driver, so that the upper stack (or actually >>>> protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit) >>>> without any queuing at that point. It then is up to the driver >>>> to decide how it multiplexes multi-core access to its queue(s) >>>> and how they are configured. >>> >>> It would work out much better if the kernel was aware of the number of >>> tx queues of a multiq driver and explicitly selected one in if_transmit. >>> The driver has no information on the CPU affinity etc. of the >>> applications generating the traffic; the kernel does. 
In general, the >>> kernel has a much better "global view" of the system and some of the >>> stuff currently in the drivers really should move up into the stack. >> >> I've been thinking a lot about this and come to the preliminary conclusion >> that the upper stack should not tell the driver which queue to use. There >> are way to many possible and depending on the use-case, better or worse >> performing approaches. Also we have a big problem with cores vs. queues >> mismatches either way (more cores than queues or more queues than cores, >> though the latter is much less of problem). >> >> For now I see these primary multi-hardware-queue approaches to be >> implemented >> first: >> >> a) the drivers (*if_transmit) takes the flowid from the mbuf header and >> selects one of the N hardware DMA rings based on it. Each of the DMA >> rings is protected by a lock. Here the assumption is that by having >> enough DMA rings the contention on each of them will be relatively low >> and ideally a flow and ring sort of sticks to a core that sends lots >> of packets into that flow. Of course it is a statistical certainty that >> some bouncing will be going on. >> >> b) the driver assigns the DMA rings to particular cores which by that, >> through >> a critnest++ can drive them lockless. The drivers (*if_transmit) >> will look >> up the core it got called on and push the traffic out on that DMA ring. >> The problem is the actual upper stacks affinity which is not guaranteed. >> This has to consequences: there may be reordering of packets of the same >> flow because the protocols send function happens to be called from a >> different core the second time. Or the drivers (*if_transmit) has to >> switch to the right core to complete the transmit for this flow if the >> upper stack migrated/bounced around. It is rather difficult to assure >> full affinity from userspace down through the upper stack and then to >> the driver. 
>> >> c) non-multi-queue capable hardware uses a kernel provided set of functions >> to manage the contention for the single resource of a DMA ring. >> >> The point here is that the driver is the right place to make these >> decisions >> because the upper stack lacks (and shouldn't care about) the actual >> available >> hardware and its capabilities. All necessary information is available >> to the >> driver as well through the appropriate mbuf header fields and the core >> it is >> called on. >> > > I mildly disagree with most of this, specifically with the part that the > driver is the right place to make these decisions. But you did say this > was a "preliminary conclusion" so there's hope yet ;-) I've mostly arrived at this conclusion as the least evil place to do it because of the complexity that would otherwise hit the ifnet boundary. Having to deal with simple one DMA ring only cards and high end cards that support 64 times 8 QoS WFQ classes DMA rings in one place is messy to properly abstract. Also supporting API/ABI forward and backwards compatibility would likely be nightmarish. The driver isn't really making the decision, it is acting upon the mbuf header information (flowid, qoscos) and using it together with its intimate knowledge of the hardware capabilities to get a hopefully close to optimal result. The holy grail so to say would be to run the entire stack with full affinity up and down. That is certainly possible, provided the application is fully aware of it as well. In typical mixed load cases this is unlikely the case and the application(s) are floating around. A full affinity stack then would have to switch to the right core when the kernel is entered. This has its own drawbacks again. However nothing in the new implementations should prevent us from running the stack in full affinity mode. > Let's wait till you have an early implementation and we are all able to > experiment with it. To be continued... 
By all means feel free to bring up your own ideas and experiences from other implementations as well, either in public or private. I'm more than happy to discuss and include other ideas. In the end the cold hard numbers and the suitability for a general purpose OS are what count. My goal is to be good to very good in > 90% of all common use cases, while providing all necessary knobs, be it in the form of KLDs with a well defined API, to push particular workloads to the full 99.9%. -- Andre From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 01:24:24 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 3BC9CB9D for ; Wed, 30 Oct 2013 01:24:24 +0000 (UTC) (envelope-from www-data@modersmal.skolverket.se) Received: from modersmal.skolverket.se (dns.skolverket.se [62.13.78.2]) by mx1.freebsd.org (Postfix) with ESMTP id 02E342659 for ; Wed, 30 Oct 2013 01:24:23 +0000 (UTC) Received: by modersmal.skolverket.se (Postfix, from userid 33) id 11125BA82B; Wed, 30 Oct 2013 02:10:22 +0100 (CET) To: freebsd-net@freebsd.org Subject: Re: Assalam X-PHP-Originating-Script: 33:247@abu.php From: Mohamad Hassan MIME-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 8bit Message-Id: <20131030011510.11125BA82B@modersmal.skolverket.se> Date: Wed, 30 Oct 2013 02:10:22 +0100 (CET) X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: mohamad_hassan@rediffmail.com List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 01:24:24 -0000 Assalamalaikum Wr Wb I hope in the name of ALLAH that I have the right person who will assist me. I got your contact through a web directory. 
I want to transfer my family's money into your country/ business for investment purposes and to secure the future of my 3 children because we are uncertain of the future of this country; as such I would like to make contact with you residing in that country for assistance. Note these funds are already in a security company which has branches around the world for safe keeping. I would have done this myself but my present health condition will not warrant me to do so. Kindly help with this because I cannot travel out of libya at the moment due to some certain conditions and great difficulties added to the fact that am disabled on a wheel chair due to a bombing that occurred in Benghazi I will explain more to you when I am certain that I can trust you. The fall of Muammar Gaddafi came with a lot of destruction / Hell to our great country Libya and everything is practically difficult now and opportunities are closing up, the new government is trying to frustrate our life. Please if you accept this offer of assistance you are required to give me your Name, age, occupation, address also enclosing your telephone fax numbers. What I now need from you are as follows: 1. You will help me receive and secure the funds from the security company on my family's behalf and open a Bank account for my children in your country with the credentials i will give you. 2. You will be entitled to 30% of the total sum involved for your assistance. 3. As soon as you confirm to me by e-mail your readiness to assist with this, I will give you more details as regards claiming the funds from the security company. 4. Please note that this project is 100% risk free but you must keep it very secret and confidential with strong assurance that you will not let me down at all. 
Regards, Mohamad Hassan al-Rida From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 01:43:23 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 1381C552; Wed, 30 Oct 2013 01:43:23 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-qc0-x230.google.com (mail-qc0-x230.google.com [IPv6:2607:f8b0:400d:c01::230]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id A4D73290E; Wed, 30 Oct 2013 01:43:22 +0000 (UTC) Received: by mail-qc0-f176.google.com with SMTP id s19so440471qcw.21 for ; Tue, 29 Oct 2013 18:43:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=D0YUnNrTReQQQDH3znp0Xh5/TZKggRchWybc/mEmT8Y=; b=zNb9G375f3L6FybkPfedwnwZliOQsx0NqkUwu/HqUOzcGjLgDTopLVmnn7cjWJLpic Th21QNzAag3rcve/U3HZgSCnV6PvuCG2rj6Vr5cVELlTUTeJ4eGDTAcFHd8Oh7+zz41S fp6Zg86n5MiqrqCCiZnVBDdsq/cs/L+Se3Ebh5jOyJPQxFiKv/Q1YgNRV1cvc2c0G/Re vItz6M+AYGOuJRvNJyLiYO17fP+I1/EPtxQqUCh036AMh7N5Ffmf3cZbWkngVck98eI7 kmI5XUNykJB6BkrES6zSlbxsXHe4Ikn/C0vC16iPL0lsrq+IP/k9i0zYEr/xwPZXfWNM qeOA== MIME-Version: 1.0 X-Received: by 10.224.37.198 with SMTP id y6mr4756827qad.104.1383097401726; Tue, 29 Oct 2013 18:43:21 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.224.207.66 with HTTP; Tue, 29 Oct 2013 18:43:21 -0700 (PDT) In-Reply-To: <5270462B.8050305@freebsd.org> References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> Date: Tue, 29 Oct 2013 18:43:21 -0700 X-Google-Sender-Auth: 
H-5o5ybupz8gIqhONyo4Y6qTIv4 Message-ID: Subject: Re: MQ Patch. From: Adrian Chadd To: Andre Oppermann Content-Type: text/plain; charset=ISO-8859-1 Cc: "freebsd-net@freebsd.org" , Luigi Rizzo , Navdeep Parhar , Randall Stewart X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 01:43:23 -0000 Hi, We can't assume the hardware has deep queues _and_ we can't just hand packets to the DMA engine. Why? Because once you've pushed it into the transmit ring, you can't guarantee / impose any ordering on things. You can't guarantee that you can abort a frame that has been queued because it now breaks the queue rules. That's why we don't want to just have a light wrapper around hardware transmit queues. We give up way too much useful control. I've seen this both when doing wifi (where I absolutely have to have per-node, per-TID queues, far before it hits the hardware) and doing WAN style optimisation, where I want to ensure I only queue a handful of milliseconds of frames to the hardware so I can ensure I can hit QoS requirements (eg there being a large amount of bulk data, then I want to inject some voice traffic that should go out sooner..) 
Thanks, -adrian From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 03:16:45 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 3287F79B; Wed, 30 Oct 2013 03:16:45 +0000 (UTC) (envelope-from julian@freebsd.org) Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 012B12E4D; Wed, 30 Oct 2013 03:16:44 +0000 (UTC) Received: from Julian-MBP3.local (ppp121-45-253-246.lns20.per2.internode.on.net [121.45.253.246]) (authenticated bits=0) by vps1.elischer.org (8.14.7/8.14.7) with ESMTP id r9U3Gcrv021556 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 29 Oct 2013 20:16:41 -0700 (PDT) (envelope-from julian@freebsd.org) Message-ID: <52707A10.6040105@freebsd.org> Date: Wed, 30 Oct 2013 11:16:32 +0800 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Adrian Chadd , Andre Oppermann Subject: Re: MQ Patch. 
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <13BF1F55-EC13-482B-AF7D-59AE039F877D@lakerest.net> <52701F7E.2060604@freebsd.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Luigi Rizzo , Randall Stewart , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 03:16:45 -0000 On 10/30/13, 5:02 AM, Adrian Chadd wrote: > [snip everything] > > > Randall - I think we can take your work and turn it into a net library > that implements your queue management routines. That way we can start > enabling people to tinker with it and replace it if they need to. to make a point on Randall's comment on contributing code.. The advantage to you (adara) is that even if we don't put your code in directly we now are on notice that whatever we do must take into account your requirements so that in 11 while it may not be a 'coding-free' upgrade.. it should at worst be a 'trivial coding' upgrade. > > What do you think? 
> _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 03:30:52 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 589DDA8A; Wed, 30 Oct 2013 03:30:52 +0000 (UTC) (envelope-from julian@freebsd.org) Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 27DAA2F03; Wed, 30 Oct 2013 03:30:51 +0000 (UTC) Received: from Julian-MBP3.local (ppp121-45-253-246.lns20.per2.internode.on.net [121.45.253.246]) (authenticated bits=0) by vps1.elischer.org (8.14.7/8.14.7) with ESMTP id r9U3UjXT021610 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 29 Oct 2013 20:30:48 -0700 (PDT) (envelope-from julian@freebsd.org) Message-ID: <52707D60.1070001@freebsd.org> Date: Wed, 30 Oct 2013 11:30:40 +0800 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Andre Oppermann , Navdeep Parhar , Luigi Rizzo Subject: Re: MQ Patch. 
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> In-Reply-To: <5270462B.8050305@freebsd.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Randall Stewart , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 03:30:52 -0000 On 10/30/13, 7:35 AM, Andre Oppermann wrote: > > The holy grail so to say would be to run the entire stack with full > affinity up and down. That is certainly possible, provided the > application > is fully aware of it as well. In typical mixed load cases this is > unlikely > the case and the application(s) are floating around. with multithreaded apps it's *most likely* that writes will be coming from several different CPUs.. 
From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 04:59:21 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 86E1CF4D; Wed, 30 Oct 2013 04:59:21 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id EDC382329; Wed, 30 Oct 2013 04:59:16 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 1B4267300A; Wed, 30 Oct 2013 06:00:56 +0100 (CET) Date: Wed, 30 Oct 2013 06:00:56 +0100 From: Luigi Rizzo To: Adrian Chadd , Andre Oppermann , Navdeep Parhar , Randall Stewart , "freebsd-net@freebsd.org" Subject: [long] Network stack -> NIC flow (was Re: MQ Patch.) Message-ID: <20131030050056.GA84368@onelab2.iet.unipi.it> References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 04:59:21 -0000 On Tue, Oct 29, 2013 at 06:43:21PM -0700, Adrian Chadd wrote: > Hi, > > We can't assume the hardware has deep queues _and_ we can't just hand > packets to the DMA engine. > [Adrian explains why] i have the feeling that the various folks who stepped into this discussion have completely different (and orthogonal) goals, and as such these goals should be discussed separately. 
Below is the architecture i have in mind and how i would implement it (and it would be extremely simple since we have most of the pieces in place). It would be useful if people could discuss what problem they are addressing before coming up with patches.

---

The architecture i think we should pursue is this (which happens to be what Linux implements, and also what dummynet implements, except that the output is to a dummynet pipe or to ether_output() or to ip_output() depending on the configuration):

1. multiple (one per core) concurrent transmitters t_c which use ether_output_frame() to send to
2. multiple disjoint queues Q_i (one per traffic group, can be *a lot*, say 10^6) which are scheduled with a scheduler S (iterate step 2 for hierarchical schedulers) and
3. eventually feed ONE transmit ring R_j on the NIC.

Once a packet reaches R_j, for all practical purposes it is on the wire. We cannot intercept extractions, and we cannot interfere with the scheduler in the NIC in case of multiqueue NICs. The most we can do (and should, as in Linux) is notify the owner of the packet once its transmission is complete.

Just to set the terminology:

QUEUE MANAGEMENT POLICY refers to decisions that we make when we INSERT or EXTRACT packets from a queue WITHOUT LOOKING AT OTHER QUEUES. This is what implements DROPTAIL (also improperly called FIFO), RED, CODEL. Note that for CODEL you need to intercept extractions from the queue, whereas DROPTAIL and RED only act on insertions.

SCHEDULER is the entity which decides which queue to serve among the many possible ones. It is called on INSERTIONS and EXTRACTIONS from a queue, and passes packets to the NIC's queue.

The decision on which queue and ring (Q_i and R_j) to use should be made by a classifier at the beginning of step 2 (or once per iteration, if using a hierarchical scheduler). Of course these decisions can be precomputed (e.g. with annotations in the mbuf coming from the socket).
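To make the terminology concrete, here is a minimal sketch in C of the difference between a queue management policy that acts only on insertion (DROPTAIL) and one that must also act on extraction. It is not taken from dummynet or any FreeBSD source: all names (pktq, q_insert_droptail, q_extract_sojourn, QLEN) are hypothetical, and the extraction-time rule is a deliberately simplified stand-in for CODEL, not the real algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QLEN 4                  /* hypothetical queue capacity, in packets */

struct pkt {
    uint64_t enq_time_us;       /* timestamp recorded at insertion */
    uint32_t len;               /* packet length in bytes */
};

struct pktq {
    struct pkt slots[QLEN];
    size_t head;                /* next slot to extract */
    size_t count;               /* packets currently queued */
};

/* DROPTAIL: the only decision is taken at insertion time. */
static int
q_insert_droptail(struct pktq *q, struct pkt p, uint64_t now_us)
{
    assert(q != NULL);
    if (q->count == QLEN)
        return (0);             /* queue full: drop the new arrival */
    p.enq_time_us = now_us;
    q->slots[(q->head + q->count) % QLEN] = p;
    q->count++;
    return (1);
}

/*
 * Extraction-time policy: discard packets that waited longer than
 * max_sojourn_us. This is NOT CODEL, only an illustration of why such
 * policies need a hook on extraction.
 */
static int
q_extract_sojourn(struct pktq *q, uint64_t now_us, uint64_t max_sojourn_us,
    struct pkt *out)
{
    assert(q != NULL && out != NULL);
    while (q->count > 0) {
        struct pkt p = q->slots[q->head];

        q->head = (q->head + 1) % QLEN;
        q->count--;
        if (now_us - p.enq_time_us <= max_sojourn_us) {
            *out = p;
            return (1);         /* fresh enough: hand it to the ring */
        }
        /* stale packet: dropped here, try the next one */
    }
    return (0);                 /* nothing left to send */
}
```

The point of the sketch is only that the second function cannot be implemented once the packet already sits in the NIC ring, because the stack never sees the extraction.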
Now when it comes to implementing the above, we have three cases (or different optimization levels, if you like)

--- 1. THE SIMPLE CASE ---

In the simplest possible case we can let the NIC do everything. Necessary conditions are:
- queue management policies acting only on insertions (e.g. DROPTAIL or RED or similar);
- # of traffic classes <= # of NIC rings
- scheduling policy S equal to the one implemented in the NIC (trivial case: one queue, one ring, no scheduler)

All these cases match exactly what the hardware provides, so we can just use the NIC ring(s) without extra queue(s), and possibly use something like buf_ring to manage insertions (but note that insertions in an empty queue will end up requiring a lock; and i think the same happens even now with the extra drbr queue in front of the ring).

--- 2. THE INTERMEDIATE CASE ---

If we do not care about a scheduler but want a more complex QUEUE MANAGEMENT, such as CODEL, that acts on extractions, we _must_ implement an intermediate queue Q_i before the NIC ring. This is our only chance to act on extractions from the queue (which CODEL requires). Note that we DO NOT NEED to create multiple queues for each ring.

--- 3. THE COMPLETE CASE ---

This is when the scheduler we want (DRR, WFQ variants, PRIORITY...) is not implemented in the NIC, or we have more queues than those available in the NIC. In this case we need to invoke this extra block before passing packets to the NIC.

Remember that dummynet implements exactly #3, and it comes with a set of pretty efficient schedulers (i have made extensive measurements on them, see links to papers on my research page http://info.iet.unipi.it/~luigi/research.html ). They are by no means a performance bottleneck (scheduling takes 50..200ns depending on the circumstances) in the cases where it matters to have a scheduler (that is, when the sender is faster than the NIC, which in turn only happens with large packets, which take 1..30us to get through at the very least).
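As a rough sanity check on the orders of magnitude above, the time a frame occupies the wire can be computed from its length and the link speed. The helper below is a hypothetical illustration (the name wire_time_ns is not an existing API), and it ignores preamble, inter-frame gap and any other per-frame overhead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Time in nanoseconds needed to serialize len_bytes on a link running at
 * link_bps bits per second. Preamble and inter-frame gap are ignored,
 * so real numbers are slightly higher.
 */
static uint64_t
wire_time_ns(uint64_t len_bytes, uint64_t link_bps)
{
    assert(link_bps > 0);
    return ((len_bytes * 8ULL * 1000000000ULL) / link_bps);
}
```

Under these assumptions a 1500-byte frame takes about 1.2us at 10 Gbit/s and about 12us at 1 Gbit/s, so a scheduling cost of 50..200ns is a fraction of the per-packet wire time for large packets.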
--- IMPLEMENTATION ---

Apart from ALTQ (which is very slow and has inefficient schedulers and i don't think anybody wants to maintain), and with the exception of dummynet which I'll discuss later, at the moment FreeBSD does not support schedulers in the tx path of the device driver. So we can only deal with cases 1 and 2, and for them the software queue + ring suffices to implement any QUEUE MANAGEMENT policy (but we don't implement anything).

If we want to support the generic case (#3), we should do the following:

1. device drivers export a function to transmit on an individual ring, basically the current if_transmit(), and a hook to play with the corresponding queue lock (the scheduler needs to run under lock, and we can as well use the ring lock for that). Note that ether_output_frame() does not always need to call the scheduler: if a packet enters a non-empty queue, we are done.

2. device drivers also export the number of tx queues, and some (advisory) information on queue status

3. ether_output_frame() runs the classifier (if needed), invokes the scheduler (if needed) and possibly falls through into if_transmit() for the specific ring.

4. on transmit completions (*_txeof(), typically), a callback invokes the scheduler to feed the NIC ring with more packets

I mentioned dummynet: it already implements ALL of this, including the completion callback in #4. There is a hook in ether_output_frame(), and the hook was called (up to 8.0 i believe) if_tx_rdy(). You can see what it does in RELENG_4, sys/netinet/ip_dummynet.c :: if_tx_rdy() http://svnweb.freebsd.org/base/stable/4/sys/netinet/ip_dummynet.c?revision=123994&view=markup

if_tx_rdy() does not exist anymore because almost nobody used it, but it is trivial to reimplement, and can be called by device drivers when *_txeof() finds that it is running low on packets _and_ the specific NIC needs to implement the "complete" scheduling.
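The four steps above can be sketched as a small userspace model in C. This is only an illustration of the control flow, assuming a single ring and a plain FIFO in place of a real scheduler; it is not the dummynet code, and every name in it (txpath, ring_transmit, output_frame, tx_complete, RING_SLOTS, SWQ_SLOTS) is hypothetical. Locking is omitted.

```c
#include <assert.h>

#define RING_SLOTS 2            /* hypothetical NIC ring size */
#define SWQ_SLOTS  8            /* hypothetical software queue size */

struct txpath {
    int ring_used;              /* packets currently in the NIC ring */
    int swq[SWQ_SLOTS];         /* software queue of packet ids (FIFO) */
    int swq_head, swq_len;
};

/* Step 1: per-ring transmit function exported by the driver. */
static int
ring_transmit(struct txpath *t, int pkt_id)
{
    (void)pkt_id;               /* a real driver would post a descriptor */
    if (t->ring_used == RING_SLOTS)
        return (0);             /* ring full */
    t->ring_used++;
    return (1);
}

/*
 * Step 3: entry point playing the role of ether_output_frame().
 * If the software queue is non-empty the packet is just appended and
 * the ring is not touched, which preserves ordering.
 */
static int
output_frame(struct txpath *t, int pkt_id)
{
    if (t->swq_len == 0 && ring_transmit(t, pkt_id))
        return (1);
    if (t->swq_len == SWQ_SLOTS)
        return (0);             /* software queue full: drop */
    t->swq[(t->swq_head + t->swq_len) % SWQ_SLOTS] = pkt_id;
    t->swq_len++;
    return (1);
}

/*
 * Step 4: completion callback, as a *_txeof() routine might invoke it.
 * ndone ring slots are released, then the "scheduler" (here a FIFO)
 * refills the ring. Returns how many packets were moved to the ring.
 */
static int
tx_complete(struct txpath *t, int ndone)
{
    int fed = 0;

    assert(ndone >= 0 && ndone <= t->ring_used);
    t->ring_used -= ndone;
    while (t->swq_len > 0 && ring_transmit(t, t->swq[t->swq_head])) {
        t->swq_head = (t->swq_head + 1) % SWQ_SLOTS;
        t->swq_len--;
        fed++;
    }
    return (fed);
}
```

Step 2 (exporting the number of tx queues and advisory status) is left out; in this model it would correspond to exposing RING_SLOTS and ring_used to the caller.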
The way it worked in dummynet (I think i used it on 'tun' and 'ed') is also documented in the manpage: define a pipe whose bandwidth is set as the device name instead of a number. Then you can attach a scheduler to the pipe, queues to the scheduler, and you are done. Example:

// this is the scheduler's configuration
ipfw pipe 10 config bw 'em2' sched
ipfw sched 10 config type drr // deficit round robin
ipfw queue 1 config weight 30 sched 10 // important
ipfw queue 2 config weight 5 sched 10 // less important
ipfw queue 3 config weight 1 sched 10 // who cares...

// and this is the classifier, which you can skip if the
// packets are already pre-classified.
// The infrastructure is already there to implement per-interface
// configurations.
ipfw add queue 1 src-port 53
ipfw add queue 2 src-port 22
ipfw add queue 3 ip from any to any

Now, surely we can replace the implementation of packet queues in dummynet from the current TAILQ to something resembling buf_ring to improve write parallelism; and a bit of glue code is needed to attach per-interface ipfw instances to each interface, and some smarts in the configuration commands are needed to figure out when we can bypass everything or not. But this seems to me a much more viable approach to achieve proper QoS support in our architecture.
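For readers unfamiliar with deficit round robin, the following is a minimal sketch in C of how weights such as 30, 5 and 1 from the example translate into service order. It is an assumption-laden illustration, not dummynet's DRR implementation: all names (drr, drr_queue, drr_init, drr_dequeue, NQ) are hypothetical, every packet is assumed to have the same length, and each queue earns weight * quantum bytes of credit per round.

```c
#include <assert.h>
#include <stddef.h>

#define NQ 3                    /* number of queues, as in the example */

struct drr_queue {
    int weight;                 /* relative share of the link */
    int deficit;                /* bytes this queue may still send */
    int backlog;                /* queued packets, all pkt_len bytes long */
};

struct drr {
    struct drr_queue q[NQ];
    int cur;                    /* queue currently being served */
    int quantum;                /* bytes of credit per unit of weight */
    int pkt_len;                /* fixed packet size (simplification) */
};

static void
drr_init(struct drr *s, const int weights[NQ], int quantum, int pkt_len)
{
    assert(s != NULL && quantum > 0 && pkt_len > 0);
    for (int i = 0; i < NQ; i++) {
        s->q[i].weight = weights[i];
        s->q[i].deficit = 0;
        s->q[i].backlog = 0;
    }
    s->quantum = quantum;
    s->pkt_len = pkt_len;
    s->cur = 0;
    s->q[0].deficit = weights[0] * quantum;  /* credit for the first visit */
}

/* Returns the index of the queue served next, or -1 if all are empty. */
static int
drr_dequeue(struct drr *s)
{
    int total = 0;

    for (int i = 0; i < NQ; i++)
        total += s->q[i].backlog;
    if (total == 0)
        return (-1);

    for (;;) {
        struct drr_queue *q = &s->q[s->cur];

        if (q->backlog > 0 && q->deficit >= s->pkt_len) {
            q->deficit -= s->pkt_len;
            q->backlog--;
            return (s->cur);
        }
        if (q->backlog == 0)
            q->deficit = 0;     /* idle queues do not accumulate credit */
        /* move to the next queue and grant its credit for this round */
        s->cur = (s->cur + 1) % NQ;
        s->q[s->cur].deficit += s->q[s->cur].weight * s->quantum;
    }
}
```

With all three queues backlogged, a quantum of 100 bytes and 100-byte packets, one full round of this sketch serves 30 packets from the first queue, 5 from the second and 1 from the third, which is the 30:5:1 split the weights ask for.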
cheers luigi From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 05:47:52 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 050E7EF7; Wed, 30 Oct 2013 05:47:52 +0000 (UTC) (envelope-from jfvogel@gmail.com) Received: from mail-ve0-x231.google.com (mail-ve0-x231.google.com [IPv6:2607:f8b0:400c:c01::231]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 8FE9F2569; Wed, 30 Oct 2013 05:47:51 +0000 (UTC) Received: by mail-ve0-f177.google.com with SMTP id oz11so633720veb.22 for ; Tue, 29 Oct 2013 22:47:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=hxuYzFcbtk8SsafKaASTm9NEz2rfB6I7OVicnZGVURk=; b=vfArYMQNx4x+0Nlgms6of/MOqm1r50EOCMWkdBtFmJISGzJjPSnWzuLMnkngq1sO2G bz88RwfPKvX626NtLtJMkkO1bS2emGjgIPJDkdthMTtswN5ElWZIp7nefq6LqQfugehy lrKh44qt90vB5M30kmvSoHMQu9YaajEwKlpH5r4rS+y/TzumCD//ZXQIIgKyeuWgvfoU AQJ8CbkUI7y3cxIeit4Z6edRczGmnZLrme9gKAKDXgHjGEXkzSjdXFgEWZUSsCGIHg3m 2bccNVbU2/NMSSh1iOxbkjBI8RiTrTBlspj2+YqybAXDISZlXBiajHQq3GiAU2Qxng5r +Mwg== MIME-Version: 1.0 X-Received: by 10.52.119.198 with SMTP id kw6mr97706vdb.47.1383112070199; Tue, 29 Oct 2013 22:47:50 -0700 (PDT) Received: by 10.220.155.148 with HTTP; Tue, 29 Oct 2013 22:47:50 -0700 (PDT) In-Reply-To: <5270309E.5090403@FreeBSD.org> References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> Date: Tue, 29 Oct 2013 22:47:50 -0700 Message-ID: Subject: Re: MQ Patch. 
From: Jack Vogel To: Navdeep Parhar Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: Luigi Rizzo , Andre Oppermann , Randall Stewart , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 05:47:52 -0000 I find myself agreeing with Navdeep. What Windows does might provide a hint (my god did I say that :)): the driver provides hints to the kernel, but it's from "above" that the ultimate decisions are made, based on what the hardware hints are. So, it's not either/or, it's both/and.... Jack On Tue, Oct 29, 2013 at 3:03 PM, Navdeep Parhar wrote: > On 10/29/13 14:25, Andre Oppermann wrote: > > On 29.10.2013 22:03, Navdeep Parhar wrote: > >> On 10/29/13 13:41, Andre Oppermann wrote: > >>> Let me jump in here and explain roughly the ideas/path I'm exploring > >>> in creating and eventually implementing a big picture for drivers, > >>> queues, queue management, various QoS and so on: > >>> > >>> Situation: We're still mostly based on the old 4.4BSD IFQ model with > >>> a couple of work-arounds (sndring, drbr) and the bit-rotten ALTQ we > >>> have in tree aren't helpful at all. > >>> > >>> Steps: > >>> > >>> 1. take the soft-queuing method out of the ifnet layer and make it > >>> a property of the driver, so that the upper stack (or actually > >>> protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit) > >>> without any queuing at that point. It then is up to the driver > >>> to decide how it multiplexes multi-core access to its queue(s) > >>> and how they are configured. > >> > >> It would work out much better if the kernel was aware of the number of > >> tx queues of a multiq driver and explicitly selected one in if_transmit. > >> The driver has no information on the CPU affinity etc. 
of the > >> applications generating the traffic; the kernel does. In general, the > >> kernel has a much better "global view" of the system and some of the > >> stuff currently in the drivers really should move up into the stack. > > > > I've been thinking a lot about this and come to the preliminary > conclusion > > that the upper stack should not tell the driver which queue to use. > There > > are way to many possible and depending on the use-case, better or worse > > performing approaches. Also we have a big problem with cores vs. queues > > mismatches either way (more cores than queues or more queues than cores, > > though the latter is much less of problem). > > > > For now I see these primary multi-hardware-queue approaches to be > > implemented > > first: > > > > a) the drivers (*if_transmit) takes the flowid from the mbuf header and > > selects one of the N hardware DMA rings based on it. Each of the DMA > > rings is protected by a lock. Here the assumption is that by having > > enough DMA rings the contention on each of them will be relatively low > > and ideally a flow and ring sort of sticks to a core that sends lots > > of packets into that flow. Of course it is a statistical certainty > that > > some bouncing will be going on. > > > > b) the driver assigns the DMA rings to particular cores which by that, > > through > > a critnest++ can drive them lockless. The drivers (*if_transmit) > > will look > > up the core it got called on and push the traffic out on that DMA > ring. > > The problem is the actual upper stacks affinity which is not > guaranteed. > > This has to consequences: there may be reordering of packets of the > same > > flow because the protocols send function happens to be called from a > > different core the second time. Or the drivers (*if_transmit) has to > > switch to the right core to complete the transmit for this flow if the > > upper stack migrated/bounced around. 
It is rather difficult to assure > > full affinity from userspace down through the upper stack and then to > > the driver. > > > > c) non-multi-queue capable hardware uses a kernel provided set of > functions > > to manage the contention for the single resource of a DMA ring. > > > > The point here is that the driver is the right place to make these > > decisions > > because the upper stack lacks (and shouldn't care about) the actual > > available > > hardware and its capabilities. All necessary information is available > > to the > > driver as well through the appropriate mbuf header fields and the core > > it is > > called on. > > > > I mildly disagree with most of this, specifically with the part that the > driver is the right place to make these decisions. But you did say this > was a "preliminary conclusion" so there's hope yet ;-) > > Let's wait till you have an early implementation and we are all able to > experiment with it. To be continued... > > Regards, > Navdeep > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 06:41:13 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id EE720F6B; Wed, 30 Oct 2013 06:41:13 +0000 (UTC) (envelope-from jmg@h2.funkthat.com) Received: from h2.funkthat.com (gate2.funkthat.com [208.87.223.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id AB73D27E2; Wed, 30 Oct 2013 06:41:13 +0000 (UTC) Received: from h2.funkthat.com (localhost [127.0.0.1]) by h2.funkthat.com (8.14.3/8.14.3) with ESMTP id r9U6f6WC024909 (version=TLSv1/SSLv3 
cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 29 Oct 2013 23:41:07 -0700 (PDT) (envelope-from jmg@h2.funkthat.com) Received: (from jmg@localhost) by h2.funkthat.com (8.14.3/8.14.3/Submit) id r9U6f502024907; Tue, 29 Oct 2013 23:41:05 -0700 (PDT) (envelope-from jmg) Date: Tue, 29 Oct 2013 23:41:05 -0700 From: John-Mark Gurney To: Andre Oppermann Subject: Re: MQ Patch. Message-ID: <20131030064105.GV58155@funkthat.com> Mail-Followup-To: Andre Oppermann , Navdeep Parhar , Luigi Rizzo , Randall Stewart , "freebsd-net@freebsd.org" References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <527027CE.5040806@freebsd.org> User-Agent: Mutt/1.4.2.3i X-Operating-System: FreeBSD 7.2-RELEASE i386 X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger? X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.2 (h2.funkthat.com [127.0.0.1]); Tue, 29 Oct 2013 23:41:07 -0700 (PDT) Cc: "freebsd-net@freebsd.org" , Luigi Rizzo , Navdeep Parhar , Randall Stewart X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 06:41:14 -0000 Andre Oppermann wrote this message on Tue, Oct 29, 2013 at 22:25 +0100: > b) the driver assigns the DMA rings to particular cores which by that, > through > a critnest++ can drive them lockless. The drivers (*if_transmit) will > look > up the core it got called on and push the traffic out on that DMA ring. 
> The problem is the actual upper stacks affinity which is not guaranteed. > This has to consequences: there may be reordering of packets of the same > flow because the protocols send function happens to be called from a > different core the second time. Or the drivers (*if_transmit) has to > switch to the right core to complete the transmit for this flow if the > upper stack migrated/bounced around. It is rather difficult to assure > full affinity from userspace down through the upper stack and then to > the driver. I'll point you to the paper: http://arxiv.org/abs/1106.0443 Please don't reorder packets. Binding TX queues to cores seems not very useful, sure you can do a lockless implementation, but is running the scheduler to change cpu's really cheaper than paying the cost of migrating the lock? I'll admit I haven't run benchmarks, but I doubt it. -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 10:40:53 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id D8A1D2B7 for ; Wed, 30 Oct 2013 10:40:53 +0000 (UTC) (envelope-from dyr@smartspb.net) Received: from quix.smartspb.net (quix.smartspb.net [217.119.16.133]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 967EA25A7 for ; Wed, 30 Oct 2013 10:40:53 +0000 (UTC) Received: from dyr.smartspb.net ([217.119.16.26] helo=[127.0.0.1]) by quix.smartspb.net with esmtpsa (TLSv1:AES256-SHA:256) (Exim 4.61 (FreeBSD)) (envelope-from ) id 1VbTCk-000I97-Ub for freebsd-net@freebsd.org; Wed, 30 Oct 2013 14:40:51 +0400 Message-ID: <5270E22C.1060408@smartspb.net> Date: Wed, 30 Oct 2013 14:40:44 +0400 From: Dennis Yusupoff 
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: freebsd-net@freebsd.org Subject: [Feature Request] (ng_)netflow additional X-Enigmail-Version: 1.6 X-Antivirus: avast! (VPS 131029-1, 30.10.2013), Outbound message X-Antivirus-Status: Clean Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 10:40:53 -0000

Good day everyone.

To be brief:

1. It would be really useful for CGNAT providers to have the ability to record customers' IPs in traffic before and after NAT, as is already done in ipt_NETFLOW under Linux or in the Cisco ASA series.

=== begin of cut https://github.com/aabc/ipt-netflow/blob/master/README ===
natevents=1
 - Collect and send NAT translation events as NetFlow Event Logging (NEL)
   for NetFlow v9/IPFIX, or as dummy flows compatible with NetFlow v5.
   Default is 0 (don't send).

   For NetFlow v5 protocol meaning of fields in dummy flows is such:
     Src IP, Src Port is Pre-nat source address.
     Dst IP, Dst Port is Post-nat destination address.
       - These two fields made equal to data flows catched in FORWARD chain.
     Nexthop, Src AS is Post-nat source address for SNAT. Or,
     Nexthop, Dst AS is Pre-nat destination address for DNAT.
     TCP Flags is SYN+SCK for start event, RST+FIN for stop event.
     Pkt/Traffic size is 0 (zero), so it won't interfere with accounting.
=== end of cut ===

2. Is it possible for the user to specify some field in NetFlow v9, for example /IF_DESC/ or /APPLICATION DESCRIPTION/, as described in http://www.cisco.com/en/US/technologies/tk648/tk362/technologies_white_paper09186a00800a3db9_ps6601_Products_White_Paper.html? If not, it would be really nice to have.
Usage example: a customer requested another IP on an interface where we collect NetFlow traffic, so when we have to produce a traffic report we have no *unique* identifier in the NetFlow flows that could help. It's a real pity.

Thank you for your consideration!

--
Best regards,
Dennis Yusupoff, network engineer of Smart-Telecom ISP
Russia, Saint-Petersburg

From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 11:44:25 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 744B33F3 for ; Wed, 30 Oct 2013 11:44:25 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id BDF5129AA for ; Wed, 30 Oct 2013 11:44:24 +0000 (UTC) Received: (qmail 61448 invoked from network); 30 Oct 2013 12:14:47 -0000 Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 30 Oct 2013 12:14:47 -0000 Message-ID: <5270F101.6020701@freebsd.org> Date: Wed, 30 Oct 2013 12:44:01 +0100 From: Andre Oppermann User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Adrian Chadd Subject: Re: MQ Patch.
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "freebsd-net@freebsd.org" , Luigi Rizzo , Navdeep Parhar , Randall Stewart X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 11:44:25 -0000

On 30.10.2013 02:43, Adrian Chadd wrote:
> Hi,

[Meta: following your replies is often difficult because you're omitting context and citations]

> We can't assume the hardware has deep queues _and_ we can't just hand
> packets to the DMA engine.
>
> Why?
>
> Because once you've pushed it into the transmit ring, you can't
> guarantee / impose any ordering on things. You can't guarantee that
> you can abort a frame that has been queued because it now breaks the
> queue rules.
>
> That's why we don't want to just have a light wrapper around hardware
> transmit queues. We give up way too much useful control.

The stack can't possibly know about all these differences in current and future technologies and requirements. That's why this decision should be pushed into the L3/L2 mapping/encapsulation and driver layer. Only those actually know about the requirements and constraints of any given technology.

For wired ethernet there isn't any control over a packet once it has been inserted into the DMA ring and the packets are going to be processed sequentially. In that case the driver will likely choose a rather light wrapper to protect concurrent access to the DMA ring. An optimized version of such a wrapper will be provided by the kernel for the driver to link to.
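
To make the "light wrapper" idea concrete, here is a rough userspace sketch only; it is not the kernel-provided wrapper, which does not exist yet. It combines the flowid-based ring selection from approach a) earlier in the thread with a per-ring lock. Every name in it (tx_ring, tx_ring_select, tx_ring_enqueue, tx_ring_complete, TXR_SLOTS) is hypothetical, and a C11 atomic_flag spin lock stands in for whatever lock a real driver would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TXR_SLOTS 8u	/* hypothetical ring size; must be a power of two */

struct tx_ring {
	atomic_flag lock;	/* stand-in for the per-ring driver lock */
	void *slot[TXR_SLOTS];	/* packets handed to the "hardware" */
	unsigned int head;	/* total packets enqueued so far */
	unsigned int tail;	/* total packets completed so far */
};

void
tx_ring_init(struct tx_ring *r)
{
	atomic_flag_clear(&r->lock);
	r->head = r->tail = 0;
}

/* Approach a): the same flowid always maps to the same ring. */
unsigned int
tx_ring_select(uint32_t flowid, unsigned int nrings)
{
	assert(nrings > 0);
	return (flowid % nrings);
}

static void
tx_ring_lock(struct tx_ring *r)
{
	while (atomic_flag_test_and_set_explicit(&r->lock,
	    memory_order_acquire))
		;	/* spin until the current holder releases */
}

static void
tx_ring_unlock(struct tx_ring *r)
{
	atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/* Append one packet in arrival order; 0 on success, -1 if the ring is full. */
int
tx_ring_enqueue(struct tx_ring *r, void *pkt)
{
	int rc = -1;

	tx_ring_lock(r);
	if (r->head - r->tail < TXR_SLOTS) {
		r->slot[r->head % TXR_SLOTS] = pkt;
		r->head++;
		rc = 0;
	}
	tx_ring_unlock(r);
	return (rc);
}

/* Retire the oldest packet, as a TX-done interrupt would; -1 if empty. */
int
tx_ring_complete(struct tx_ring *r)
{
	int rc = -1;

	tx_ring_lock(r);
	if (r->head != r->tail) {
		r->tail++;
		rc = 0;
	}
	tx_ring_unlock(r);
	return (rc);
}
```

Note that the wrapper only serialises producers and preserves arrival order within a ring; once a packet is in a slot there is no way to reorder or abort it, which is exactly the property under discussion.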

For other kinds of interfaces a very different strategy may be chosen. In your case with ieee80211 a more elaborate transmit scheme can be implemented without having to hack the kernel. In fact that's what you already mostly do there with the frame fragmentation, priority and retransmission code, if I'm reading it correctly. The only difference in the future is that the upper stack won't enforce any of the old IFQ, bufring or drbr handoff on you. You can choose one of the stock models or develop your own specially optimized version.

> I've seen this both when doing wifi (where I absolutely have to have
> per-node, per-TID queues, far before it hits the hardware) and doing
> WAN style optimisation, where I want to ensure I only queue a handful
> of milliseconds of frames to the hardware so I can ensure I can hit
> QoS requirements (eg there being a large amount of bulk data, then I
> want to inject some voice traffic that should go out sooner..)

Sure. The idea is to make it even easier for you to implement that without having to work around anything above ifnet.
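
As a purely illustrative sketch of the "only queue a handful of milliseconds of frames to the hardware" requirement quoted above, a driver-private scheme could convert a latency target into a byte budget and hold back frames in its own software queues once that budget is in flight. Nothing below is an existing FreeBSD interface; hwq_limit, txq_byte_budget and the other names are hypothetical, and the link rate is assumed to be known and constant.

```c
#include <stdint.h>

/* Bytes that a link running at rate_bps drains in budget_us microseconds. */
uint64_t
txq_byte_budget(uint64_t rate_bps, uint64_t budget_us)
{
	return (rate_bps / 8 * budget_us / 1000000);
}

struct hwq_limit {
	uint64_t budget_bytes;		/* most bytes allowed in the DMA ring */
	uint64_t inflight_bytes;	/* bytes handed over but not yet sent */
};

/*
 * Nonzero if a frame of frame_len bytes may go to the hardware now.
 * An idle queue always accepts one frame so that a frame larger than
 * the budget cannot stall transmission forever.
 */
int
hwq_may_submit(const struct hwq_limit *q, uint32_t frame_len)
{
	return (q->inflight_bytes == 0 ||
	    q->inflight_bytes + frame_len <= q->budget_bytes);
}

/* Account for a frame that was just written into the DMA ring. */
void
hwq_submit(struct hwq_limit *q, uint32_t frame_len)
{
	q->inflight_bytes += frame_len;
}

/* Account for a frame the hardware reported as transmitted. */
void
hwq_complete(struct hwq_limit *q, uint32_t frame_len)
{
	q->inflight_bytes -= frame_len;
}
```

With a 100 Mbit/s link and a 2 ms target the budget works out to 25000 bytes, so a late voice frame waits behind at most about 2 ms of bulk data already in the ring; everything else stays in software queues where the driver can still reorder it.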
-- Andre From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 11:51:10 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id B27ED5AB for ; Wed, 30 Oct 2013 11:51:10 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 2283C2A4C for ; Wed, 30 Oct 2013 11:51:09 +0000 (UTC) Received: (qmail 61485 invoked from network); 30 Oct 2013 12:21:32 -0000 Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 30 Oct 2013 12:21:32 -0000 Message-ID: <5270F297.4090001@freebsd.org> Date: Wed, 30 Oct 2013 12:50:47 +0100 From: Andre Oppermann User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Navdeep Parhar , Luigi Rizzo , Randall Stewart , "freebsd-net@freebsd.org" Subject: Re: MQ Patch. 
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <20131030064105.GV58155@funkthat.com> In-Reply-To: <20131030064105.GV58155@funkthat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 11:51:10 -0000

On 30.10.2013 07:41, John-Mark Gurney wrote:
> Andre Oppermann wrote this message on Tue, Oct 29, 2013 at 22:25 +0100:
>> b) the driver assigns the DMA rings to particular cores which by that, through
>> a critnest++ can drive them lockless. The drivers (*if_transmit) will look
>> up the core it got called on and push the traffic out on that DMA ring.
>> The problem is the actual upper stacks affinity which is not guaranteed.
>> This has to consequences: there may be reordering of packets of the same
>> flow because the protocols send function happens to be called from a
>> different core the second time. Or the drivers (*if_transmit) has to
>> switch to the right core to complete the transmit for this flow if the
>> upper stack migrated/bounced around. It is rather difficult to assure
>> full affinity from userspace down through the upper stack and then to
>> the driver.
>
> I'll point you to the paper:
> http://arxiv.org/abs/1106.0443
>
> Please don't reorder packets.
>
> Binding TX queues to cores seems not very useful, sure you can do a
> lockless implementation, but is running the scheduler to change cpu's
> really cheaper than paying the cost of migrating the lock?
>
> I'll admit I haven't run benchmarks, but I doubt it.

Don't worry. My list was about the possible ways of dealing with it and their constraints/disadvantages.
Packet reordering is one part of it that pretty much makes approach b) non-viable as you correctly point out. -- Andre From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 14:14:36 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 4F6B9DC0 for ; Wed, 30 Oct 2013 14:14:36 +0000 (UTC) (envelope-from julian@freebsd.org) Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 08B5D24CD for ; Wed, 30 Oct 2013 14:14:35 +0000 (UTC) Received: from jre-mbp.elischer.org (ppp121-45-246-96.lns20.per2.internode.on.net [121.45.246.96]) (authenticated bits=0) by vps1.elischer.org (8.14.7/8.14.7) with ESMTP id r9UEEUkS023605 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Wed, 30 Oct 2013 07:14:33 -0700 (PDT) (envelope-from julian@freebsd.org) Message-ID: <52711440.5060405@freebsd.org> Date: Wed, 30 Oct 2013 22:14:24 +0800 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:24.0) Gecko/20100101 Thunderbird/24.1.0 MIME-Version: 1.0 To: Dennis Yusupoff , freebsd-net@freebsd.org Subject: Re: [Feature Request] (ng_)netflow additional References: <5270E22C.1060408@smartspb.net> In-Reply-To: <5270E22C.1060408@smartspb.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 14:14:36 -0000 On 10/30/13, 6:40 PM, Dennis Yusupoff wrote: > Good day everyone. > > To be brief: > > 1. 
It would be really usefull for CGNAT providers have ability to record
> customers IPs in traffic before and after NAT, as it already has done in
> ipt_NETFLOW under Linux or in the Cisco ASA series.
>
> === begin of cut https://github.com/aabc/ipt-netflow/blob/master/README ===
> natevents=1
> - Collect and send NAT translation events as NetFlow Event Logging (NEL)
> for NetFlow v9/IPFIX, or as dummy flows compatible with NetFlow v5.
> Default is 0 (don't send).
>
> For NetFlow v5 protocol meaning of fields in dummy flows is such:
> Src IP, Src Port is Pre-nat source address.
> Dst IP, Dst Port is Post-nat destination address.
> - These two fields made equal to data flows catched in FORWARD chain.
> Nexthop, Src AS is Post-nat source address for SNAT. Or,
> Nexthop, Dst AS is Pre-nat destination address for DNAT.
> TCP Flags is SYN+SCK for start event, RST+FIN for stop event.
> Pkt/Traffic size is 0 (zero), so it won't interfere with accounting.

I think this would be very hard, because the netflow module looks at packets in one place: either before or after NAT, but not during, so the information is not available. We would have to add a netflow source to the NAT code to do this (and then the other netflow code would need to be turned off if NAT was on). But since netgraph is like Lego, and no part of it knows about any other part, working out how this could be done would be quite a challenge.

> === end of cut ===
>
> 2. Is it possible to specify by user some field in Netflow v9, for
> example /IF_DESC/ or /APPLICATION DESCRIPTION/, according to
> http://www.cisco.com/en/US/technologies/tk648/tk362/technologies_white_paper09186a00800a3db9_ps6601_Products_White_Paper.html?
> If no, it would be really nice to see. Using example: customers
> requested other ip on a interface, where we collect netflow traffic so
> when we should to give traffic report we haven't any *unique* identifier
> in netflow flows, which can be helpful. It's a real pity.

I leave this to the people who know more about netflow... > Thank you for your consideration! > > From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 16:10:03 2013 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id DCA622BB for ; Wed, 30 Oct 2013 16:10:02 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id BC94F2D82 for ; Wed, 30 Oct 2013 16:10:02 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r9UGA2HP037946 for ; Wed, 30 Oct 2013 16:10:02 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r9UGA2EE037945; Wed, 30 Oct 2013 16:10:02 GMT (envelope-from gnats) Date: Wed, 30 Oct 2013 16:10:02 GMT Message-Id: <201310301610.r9UGA2EE037945@freefall.freebsd.org> To: freebsd-net@FreeBSD.org Cc: From: dfilter@FreeBSD.ORG (dfilter service) Subject: Re: kern/134531: commit references a PR X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: dfilter service List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 16:10:03 -0000 The following reply was made to PR kern/134531; it has been noted by GNATS. 
From: dfilter@FreeBSD.ORG (dfilter service) To: bug-followup@FreeBSD.org Cc: Subject: Re: kern/134531: commit references a PR Date: Wed, 30 Oct 2013 16:08:42 +0000 (UTC) Author: melifaro Date: Wed Oct 30 16:08:27 2013 New Revision: 257389 URL: http://svnweb.freebsd.org/changeset/base/257389 Log: MFC r256624: Fix long-standing issue with incorrect radix mask calculation. Usual symptoms are messages like rn_delete: inconsistent annotation rn_addmask: mask impossibly already in tree routing daemon constantly deleting IPv6 default route or inability to flush/delete particular prefix in ipfw table. Changes: * Assume 32 bytes as maximum radix key length * Remove rn_init() * Statically allocate rn_ones/rn_zeroes * Make separate mask tree for each "normal" tree instead of system global one * Remove "optimization" on masks reusage and key zeroying * Change rn_addmask() arguments to accept tree pointer (no users in base) MFC changes: * keep rn_init() * create global mask tree, protected with mutex, for old rn_addmask users (currently 0 in base) * Add new rn_addmask_r() function (rn_addmask in head) with additional argument to accept tree pointer PR: kern/182851, kern/169206, kern/135476, kern/134531 Found by: Slawa Olhovchenkov Reviewed by: glebius (previous versions) Sponsored by: Yandex LLC Modified: stable/9/sys/net/radix.c stable/9/sys/net/radix.h Directory Properties: stable/9/sys/ (props changed) stable/9/sys/net/ (props changed) Modified: stable/9/sys/net/radix.c ============================================================================== --- stable/9/sys/net/radix.c Wed Oct 30 15:46:50 2013 (r257388) +++ stable/9/sys/net/radix.c Wed Oct 30 16:08:27 2013 (r257389) @@ -66,27 +66,27 @@ static struct radix_node *rn_search(void *, struct radix_node *), *rn_search_m(void *, struct radix_node *, void *); -static int max_keylen; -static struct radix_mask *rn_mkfreelist; -static struct radix_node_head *mask_rnhead; +static void rn_detachhead_internal(void **head); +static 
int rn_inithead_internal(void **head, int off); + +#define RADIX_MAX_KEY_LEN 32 + +static char rn_zeros[RADIX_MAX_KEY_LEN]; +static char rn_ones[RADIX_MAX_KEY_LEN] = { + -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, +}; + /* - * Work area -- the following point to 3 buffers of size max_keylen, - * allocated in this order in a block of memory malloc'ed by rn_init. - * rn_zeros, rn_ones are set in rn_init and used in readonly afterwards. - * addmask_key is used in rn_addmask in rw mode and not thread-safe. + * XXX: Compat stuff for old rn_addmask() users */ -static char *rn_zeros, *rn_ones, *addmask_key; - -#define MKGet(m) { \ - if (rn_mkfreelist) { \ - m = rn_mkfreelist; \ - rn_mkfreelist = (m)->rm_mklist; \ - } else \ - R_Malloc(m, struct radix_mask *, sizeof (struct radix_mask)); } - -#define MKFree(m) { (m)->rm_mklist = rn_mkfreelist; rn_mkfreelist = (m);} +static struct radix_node_head *mask_rnhead_compat; +#ifdef _KERNEL +static struct mtx mask_mtx; +#endif -#define rn_masktop (mask_rnhead->rnh_treetop) static int rn_lexobetter(void *m_arg, void *n_arg); static struct radix_mask * @@ -230,7 +230,8 @@ rn_lookup(v_arg, m_arg, head) caddr_t netmask = 0; if (m_arg) { - x = rn_addmask(m_arg, 1, head->rnh_treetop->rn_offset); + x = rn_addmask_r(m_arg, head->rnh_masks, 1, + head->rnh_treetop->rn_offset); if (x == 0) return (0); netmask = x->rn_key; @@ -489,53 +490,47 @@ on1: } struct radix_node * -rn_addmask(n_arg, search, skip) - int search, skip; - void *n_arg; +rn_addmask_r(void *arg, struct radix_node_head *maskhead, int search, int skip) { - caddr_t netmask = (caddr_t)n_arg; + caddr_t netmask = (caddr_t)arg; register struct radix_node *x; register caddr_t cp, cplim; register int b = 0, mlen, j; - int maskduplicated, m0, isnormal; + int maskduplicated, isnormal; struct radix_node *saved_x; - static int last_zeroed = 0; + char addmask_key[RADIX_MAX_KEY_LEN]; - if ((mlen = 
LEN(netmask)) > max_keylen) - mlen = max_keylen; + if ((mlen = LEN(netmask)) > RADIX_MAX_KEY_LEN) + mlen = RADIX_MAX_KEY_LEN; if (skip == 0) skip = 1; if (mlen <= skip) - return (mask_rnhead->rnh_nodes); + return (maskhead->rnh_nodes); + + bzero(addmask_key, RADIX_MAX_KEY_LEN); if (skip > 1) bcopy(rn_ones + 1, addmask_key + 1, skip - 1); - if ((m0 = mlen) > skip) - bcopy(netmask + skip, addmask_key + skip, mlen - skip); + bcopy(netmask + skip, addmask_key + skip, mlen - skip); /* * Trim trailing zeroes. */ for (cp = addmask_key + mlen; (cp > addmask_key) && cp[-1] == 0;) cp--; mlen = cp - addmask_key; - if (mlen <= skip) { - if (m0 >= last_zeroed) - last_zeroed = mlen; - return (mask_rnhead->rnh_nodes); - } - if (m0 < last_zeroed) - bzero(addmask_key + m0, last_zeroed - m0); - *addmask_key = last_zeroed = mlen; - x = rn_search(addmask_key, rn_masktop); + if (mlen <= skip) + return (maskhead->rnh_nodes); + *addmask_key = mlen; + x = rn_search(addmask_key, maskhead->rnh_treetop); if (bcmp(addmask_key, x->rn_key, mlen) != 0) x = 0; if (x || search) return (x); - R_Zalloc(x, struct radix_node *, max_keylen + 2 * sizeof (*x)); + R_Zalloc(x, struct radix_node *, RADIX_MAX_KEY_LEN + 2 * sizeof (*x)); if ((saved_x = x) == 0) return (0); netmask = cp = (caddr_t)(x + 2); bcopy(addmask_key, cp, mlen); - x = rn_insert(cp, mask_rnhead, &maskduplicated, x); + x = rn_insert(cp, maskhead, &maskduplicated, x); if (maskduplicated) { log(LOG_ERR, "rn_addmask: mask impossibly already in tree"); Free(saved_x); @@ -568,6 +563,23 @@ rn_addmask(n_arg, search, skip) return (x); } +struct radix_node * +rn_addmask(void *n_arg, int search, int skip) +{ + struct radix_node *tt; + +#ifdef _KERNEL + mtx_lock(&mask_mtx); +#endif + tt = rn_addmask_r(&mask_rnhead_compat, n_arg, search, skip); + +#ifdef _KERNEL + mtx_unlock(&mask_mtx); +#endif + + return (tt); +} + static int /* XXX: arbitrary ordering for non-contiguous masks */ rn_lexobetter(m_arg, n_arg) void *m_arg, *n_arg; @@ -590,12 +602,12 @@ 
rn_new_radix_mask(tt, next) { register struct radix_mask *m; - MKGet(m); + R_Malloc(m, struct radix_mask *, sizeof (struct radix_mask)); if (m == 0) { - log(LOG_ERR, "Mask for route not entered\n"); + log(LOG_ERR, "Failed to allocate route mask\n"); return (0); } - bzero(m, sizeof *m); + bzero(m, sizeof(*m)); m->rm_bit = tt->rn_bit; m->rm_flags = tt->rn_flags; if (tt->rn_flags & RNF_NORMAL) @@ -629,7 +641,8 @@ rn_addroute(v_arg, n_arg, head, treenode * nodes and possibly save time in calculating indices. */ if (netmask) { - if ((x = rn_addmask(netmask, 0, top->rn_offset)) == 0) + x = rn_addmask_r(netmask, head->rnh_masks, 0, top->rn_offset); + if (x == NULL) return (0); b_leaf = x->rn_bit; b = -1 - x->rn_bit; @@ -808,7 +821,8 @@ rn_delete(v_arg, netmask_arg, head) * Delete our route from mask lists. */ if (netmask) { - if ((x = rn_addmask(netmask, 1, head_off)) == 0) + x = rn_addmask_r(netmask, head->rnh_masks, 1, head_off); + if (x == NULL) return (0); netmask = x->rn_key; while (tt->rn_mask != netmask) @@ -841,7 +855,7 @@ rn_delete(v_arg, netmask_arg, head) for (mp = &x->rn_mklist; (m = *mp); mp = &m->rm_mklist) if (m == saved_m) { *mp = m->rm_mklist; - MKFree(m); + Free(m); break; } if (m == 0) { @@ -932,7 +946,7 @@ on1: struct radix_mask *mm = m->rm_mklist; x->rn_mklist = 0; if (--(m->rm_refs) < 0) - MKFree(m); + Free(m); m = mm; } if (m) @@ -1128,10 +1142,8 @@ rn_walktree(h, f, w) * bits starting at 'off'. * Return 1 on success, 0 on error. 
*/ -int -rn_inithead(head, off) - void **head; - int off; +static int +rn_inithead_internal(void **head, int off) { register struct radix_node_head *rnh; register struct radix_node *t, *tt, *ttt; @@ -1163,8 +1175,8 @@ rn_inithead(head, off) return (1); } -int -rn_detachhead(void **head) +static void +rn_detachhead_internal(void **head) { struct radix_node_head *rnh; @@ -1176,28 +1188,60 @@ rn_detachhead(void **head) Free(rnh); *head = NULL; +} + +int +rn_inithead(void **head, int off) +{ + struct radix_node_head *rnh; + + if (*head != NULL) + return (1); + + if (rn_inithead_internal(head, off) == 0) + return (0); + + rnh = (struct radix_node_head *)(*head); + + if (rn_inithead_internal((void **)&rnh->rnh_masks, 0) == 0) { + rn_detachhead_internal(head); + return (0); + } + + return (1); +} + +int +rn_detachhead(void **head) +{ + struct radix_node_head *rnh; + + KASSERT((head != NULL && *head != NULL), + ("%s: head already freed", __func__)); + + rnh = *head; + + rn_detachhead_internal((void **)&rnh->rnh_masks); + rn_detachhead_internal(head); return (1); } void rn_init(int maxk) { - char *cp, *cplim; - - max_keylen = maxk; - if (max_keylen == 0) { + if ((maxk <= 0) || (maxk > RADIX_MAX_KEY_LEN)) { log(LOG_ERR, - "rn_init: radix functions require max_keylen be set\n"); + "rn_init: max_keylen must be within 1..%d\n", + RADIX_MAX_KEY_LEN); return; } - R_Malloc(rn_zeros, char *, 3 * max_keylen); - if (rn_zeros == NULL) - panic("rn_init"); - bzero(rn_zeros, 3 * max_keylen); - rn_ones = cp = rn_zeros + max_keylen; - addmask_key = cplim = rn_ones + max_keylen; - while (cp < cplim) - *cp++ = -1; - if (rn_inithead((void **)(void *)&mask_rnhead, 0) == 0) + + /* + * XXX: Compat for old rn_addmask() users + */ + if (rn_inithead((void **)(void *)&mask_rnhead_compat, 0) == 0) panic("rn_init 2"); +#ifdef _KERNEL + mtx_init(&mask_mtx, "radix_mask", NULL, MTX_DEF); +#endif } Modified: stable/9/sys/net/radix.h 
============================================================================== --- stable/9/sys/net/radix.h Wed Oct 30 15:46:50 2013 (r257388) +++ stable/9/sys/net/radix.h Wed Oct 30 16:08:27 2013 (r257389) @@ -136,6 +136,7 @@ struct radix_node_head { #ifdef _KERNEL struct rwlock rnh_lock; /* locks entire radix tree */ #endif + struct radix_node_head *rnh_masks; /* Storage for our masks */ }; #ifndef _KERNEL @@ -167,6 +168,7 @@ int rn_detachhead(void **); int rn_refines(void *, void *); struct radix_node *rn_addmask(void *, int, int), + *rn_addmask_r(void *, struct radix_node_head *, int, int), *rn_addroute (void *, void *, struct radix_node_head *, struct radix_node [2]), *rn_delete(void *, void *, struct radix_node_head *), _______________________________________________ svn-src-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-all To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org" From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 17:48:32 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id EB720F4D; Wed, 30 Oct 2013 17:48:32 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-qe0-x232.google.com (mail-qe0-x232.google.com [IPv6:2607:f8b0:400d:c02::232]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 8690F259A; Wed, 30 Oct 2013 17:48:32 +0000 (UTC) Received: by mail-qe0-f50.google.com with SMTP id 1so1043614qee.37 for ; Wed, 30 Oct 2013 10:48:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=zXkuCpJ6VkRxJpBShr9XM63P1+xOmVxiX2y3yJCdXcE=; 
b=zt+iL6mD3emcJmGfqsKFTJK6HTrqfbfGotsKUlBU1yqzuK6u5qk8UF3Mx/Uzsah17c vXP2/Cugblb43Ypkl9I80KCJfpPAOf1kYJAqxV/eXpWwpVTk3YaHShVnuTptBcXcjFGL w9ztQvHKy04r2D3dOPAPM/48Svxt/Gg8x/52yQoDv3f6+wA1elINachcmgykmHF7PXdi PYVyFoD9DzV9GPN0qav8+XlqBguHz5IDPBMSrZreFc90pMeGIkjWFApp+ngjEHUgmRa0 ON/kFOOryfBjOOJNF/dlTGBA9OARGyp45tsKrD8qUOwgKrk4W/s/GdvHRFwcDMm3dKm2 35+g== MIME-Version: 1.0 X-Received: by 10.49.59.115 with SMTP id y19mr8596891qeq.8.1383155311679; Wed, 30 Oct 2013 10:48:31 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.224.207.66 with HTTP; Wed, 30 Oct 2013 10:48:31 -0700 (PDT) In-Reply-To: <5270F101.6020701@freebsd.org> References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <5270F101.6020701@freebsd.org> Date: Wed, 30 Oct 2013 10:48:31 -0700 X-Google-Sender-Auth: ERXLSL7s9c9TbRE1KgK-ujhtSl4 Message-ID: Subject: Re: MQ Patch. From: Adrian Chadd To: Andre Oppermann Content-Type: text/plain; charset=ISO-8859-1 Cc: "freebsd-net@freebsd.org" , Luigi Rizzo , Navdeep Parhar , Randall Stewart X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 17:48:33 -0000 On 30 October 2013 04:44, Andre Oppermann wrote: >> We can't assume the hardware has deep queues _and_ we can't just hand >> packets to the DMA engine. > >> >> >> Why? >> >> Because once you've pushed it into the transmit ring, you can't >> guarantee / impose any ordering on things. You can't guarantee that >> you can abort a frame that has been queued because it now breaks the >> queue rules. >> >> That's why we don't want to just have a light wrapper around hardware >> transmit queues. We give up way too much useful control. 
>
>
> The stack can't possibly know about all these differences in current
> and future technologies and requirements. That's why this decision
> should be pushed into the L3/L2 mapping/encapsulation and driver layer.

That's why you split it.

You allow the upper layers (things like altq) to track things like
per-IP, per-traffic-class traffic and tag things appropriately.

You then let some software queue implement the queue discipline and
only drain frames to the hardware at a rate that's fast enough to keep
up with the hardware, and no faster.

Why?

Because if you have new traffic come along from a new client, it may
be higher priority than the traffic queued to the hardware. But it's
at the same QoS level as what's currently queued to the hardware, or
maps to the same physical queue.

So yes, we do need that split for a lot of cases. There will be
bare-metal cases for very low latency but if we implement the
correct queue API here it'll just collapse down to either NULL, or
just the existing software queue in front of the DMA rings to avoid
locking overhead.
Thanks, -adrian From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 21:24:31 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id BE5A42F2 for ; Wed, 30 Oct 2013 21:24:31 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 22BD824CA for ; Wed, 30 Oct 2013 21:24:30 +0000 (UTC) Received: (qmail 64106 invoked from network); 30 Oct 2013 21:54:49 -0000 Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 30 Oct 2013 21:54:49 -0000 Message-ID: <527178F7.1070800@freebsd.org> Date: Wed, 30 Oct 2013 22:24:07 +0100 From: Andre Oppermann User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Adrian Chadd Subject: Re: MQ Patch. 
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <5270F101.6020701@freebsd.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "freebsd-net@freebsd.org" , Luigi Rizzo , Navdeep Parhar , Randall Stewart X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 21:24:31 -0000 On 30.10.2013 18:48, Adrian Chadd wrote: > On 30 October 2013 04:44, Andre Oppermann wrote: > >>> We can't assume the hardware has deep queues _and_ we can't just hand >>> packets to the DMA engine. >> >>> >>> >>> Why? >>> >>> Because once you've pushed it into the transmit ring, you can't >>> guarantee / impose any ordering on things. You can't guarantee that >>> you can abort a frame that has been queued because it now breaks the >>> queue rules. >>> >>> That's why we don't want to just have a light wrapper around hardware >>> transmit queues. We give up way too much useful control. >> >> >> The stack can't possibly know about all these differences in current >> and future technologies and requirements. That's why this decision >> should be pushed into the L3/L2 mapping/encapsulation and driver layer. > > That's why you split it. > > You allow the upper layers (things like altq) to track things like > per-IP, per-traffic-class traffic and tag things appropriate. Any QoS scheme is split into two distinct steps: a) the classifier; b) the queuing and packet scheduler. 
The classification is totally taken out of ifnet/IFQ* and done a) through
a packet filter, ipfw, pf, ipf; b) taken from the PCB if the packet is
locally generated; c) on ingress, from the packet's vlan or IP header.
The last for example is typically done in MPLS networks where
classification only happens at the edges, and it is the way all brand name
routers work, with the option of doing a) as well.

The queuing and scheduling happens after L3/L2 mapping/encapsulation and
before the packets are put onto the DMA ring. Please note that this is
somewhat independent from additional pre-DMA queuing as in ieee80211 and
comes before it.

> You then let some software queue implement the queue discipline and
> only drain frames to the hardware at a rate that's fast enough to keep
> up with the hardware, and no faster.

For a QoS queue/scheduler to be fully effective the DMA ring should be as
small as reasonable to keep the interface busy, but not more. All queuing
then happens in software with appropriately sized queues.

> Why?
>
> Because if you have new traffic come along from a new client, it may
> be higher priority than the traffic queued to the hardware. But it's
> at the same QoS level as what's currently queued to the hardware, or
> map to the same physical queue.

When a packet has been handed to the DMA ring there's no stopping it
anymore and the order is fixed. That's why in a QoS setup the DMA ring
should be as small as it can be to barely keep the interface busy.
Everything else happens in software and is subject to packet scheduler
decisions. If a higher priority packet arrives before the next packet
scheduler run it will be dequeued first (subject to WFQ or other fair
scheduling disciplines to prevent total starvation).
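
A minimal sketch of the dequeue order described above, assuming a plain
strict-priority scheduler over a few per-class soft queues. It deliberately
omits the WFQ-style fairness mentioned in the text, so a busy high class can
starve the others. All names (prio_sched, prio_enqueue, prio_dequeue) and the
sizes are hypothetical and integer packet ids stand in for mbufs; nothing
here is an existing kernel API.

```c
#include <assert.h>
#include <stddef.h>

#define NCLASS 4   /* hypothetical: class 0 is the highest priority */
#define QLEN   8   /* hypothetical soft queue depth per class */

/* One FIFO per traffic class; packet ids stand in for mbufs. */
struct softq {
    int    pkts[QLEN];
    size_t head;
    size_t count;
};

struct prio_sched {
    struct softq q[NCLASS];
};

/* Enqueue into the class queue; 0 on success, -1 on tail drop. */
static int
prio_enqueue(struct prio_sched *s, int cls, int pkt)
{
    struct softq *q;

    assert(cls >= 0 && cls < NCLASS);
    q = &s->q[cls];
    if (q->count == QLEN)
        return (-1);
    q->pkts[(q->head + q->count) % QLEN] = pkt;
    q->count++;
    return (0);
}

/* Dequeue from the highest-priority non-empty class; -1 if all are empty. */
static int
prio_dequeue(struct prio_sched *s)
{
    for (int c = 0; c < NCLASS; c++) {
        struct softq *q = &s->q[c];

        if (q->count > 0) {
            int pkt = q->pkts[q->head];

            q->head = (q->head + 1) % QLEN;
            q->count--;
            return (pkt);
        }
    }
    return (-1);
}
```

The point of the sketch is only the ordering: a packet that arrives late in
class 0 still leaves before older packets in class 2, which is possible only
while those packets sit in software rather than in the DMA ring.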
You may find this presentation I did some time back at SWINOG helpful: http://www.networx.ch/Understanding%20QoS%20by%20Andre%20Oppermann%20-%2020090402.pdf When QoS is active there can be only one active DMA ring per interface unless the hardware supports the necessary scheduling discipline among the DMA rings. Most multi DMA ring NICs employ a simple round-robin algorithm on a per-packet basis. With TSO these packets can be very large. Any such multi DMA ring setup would render any software QoS attempts futile. Hence only one DMA ring can be used/active with QoS. As far as I'm aware the only NIC that officially supports multi DMA rings including WFQ among them is the Intel ixgbe(4). Other 10G cards may support it but their datasheets are not public. > So yes, we do need that split for a lot of cases. There will be > bare-metal cases for highly low latency but if we implement the > correct queue API here it'll just collapse down to either NULL, or > just the existing software queue in front of the DMA rings to avoid > locking overhead. The L3/L2 mapping/encapsulation step may or may not need any locking depending on what it has to do. However its locking requirements may be totally different from the DMA ring protection. If there is no QoS enabled/active on an interface the packet after the L3/L2 step goes straight through to the driver. If there are multiple DMA rings the driver looks at the flowid field in the mbuf header and selects one of the DMA rings. These DMA rings naturally have to be protected by a (spin) lock to prevent concurrent access by multiple cores. Unless there is contention software queuing doesn't happen and the DMA rings are sufficiently deep. 
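
As a rough illustration of the no-QoS path described above: assuming the
driver simply takes the mbuf's flow id modulo the number of TX rings, ring
selection could look like the sketch below. The function name txring_select
is hypothetical, and a real driver may well use an indirection table or a
per-CPU mapping instead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick a TX ring from a flow id. Packets of one flow always map to the
 * same ring, which avoids reordering within the flow.
 */
static unsigned
txring_select(uint32_t flowid, unsigned nrings)
{
    assert(nrings > 0);
    return (flowid % nrings);
}
```

Each selected ring would still need its own lock, as noted above; the
mapping only spreads contention across rings.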
-- Andre From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 21:30:35 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 6EC105C8 for ; Wed, 30 Oct 2013 21:30:35 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id C5AEC253F for ; Wed, 30 Oct 2013 21:30:34 +0000 (UTC) Received: (qmail 64140 invoked from network); 30 Oct 2013 22:00:52 -0000 Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 30 Oct 2013 22:00:52 -0000 Message-ID: <52717A62.7040600@freebsd.org> Date: Wed, 30 Oct 2013 22:30:10 +0100 From: Andre Oppermann User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Luigi Rizzo , Adrian Chadd , Navdeep Parhar , Randall Stewart , "freebsd-net@freebsd.org" Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.) 
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <20131030050056.GA84368@onelab2.iet.unipi.it> In-Reply-To: <20131030050056.GA84368@onelab2.iet.unipi.it> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 21:30:35 -0000

On 30.10.2013 06:00, Luigi Rizzo wrote:
> On Tue, Oct 29, 2013 at 06:43:21PM -0700, Adrian Chadd wrote:
>> Hi,
>>
>> We can't assume the hardware has deep queues _and_ we can't just hand
>> packets to the DMA engine.
>> [Adrian explains why]
>
> i have the feeling that the various folks who stepped into this
> discussion seem to have completely different (and orthogonal) goals
> and as such these goals should be discussed separately.

It looks like it and it is great to have this discussion. :)

> Below is the architecture i have in mind and how i would implement it
> (and it would be extremely simple since we have most of the pieces
> in place).

[Omitted citation further down of good and thorough qos description,
to be replied to separately]

> It would be useful if people could discuss what problem they are
> addressing before coming up with patches.

Right now Glebius and I are working on the struct ifnet abstraction
which has become severely bloated and blurred over the years. The goal is
to make it opaque to the drivers for better API/ABI stability in the
first step.
When looking at struct ifnet and its place in the kernel it becomes
evident that its actual purpose is to serve as an abstraction of a logical
layer 3 protocol interface towards the layer 2 mapping and encapsulation,
and eventually and sort of tangentially the real hardware.

Now ifnet has become very complex and large and should be brought back
to its original purpose of being the logical layer 3 interface abstraction.

There isn't necessarily a 1:1 mapping from one ifnet instance to one
hardware interface. In fact there are pure logical ifnets (gre, tun, ...),
direct hardware ifnets (simple network interfaces like fxp(4)), and
multiple logical interfaces on top of a single hardware interface
(vlan, lagg, ...).

Depending on the ifnet's purpose the backend can be very different. Thus
I want to decouple the current implicit notion of ifnet==hardware with
associated queuing and such. Instead it should become a layer 3
abstraction inside the kernel again and delegate all lower layers to
appropriate protocol, layer 2, and hardware specific implementations.

From this comes the following *rough* implementation approach to be
tested (ignore naming for now):

  /* Function pointers for packets descending into layer 2 */
  (*if_l2map)(ifnet, mbuf, sockaddr, [route]);  /* from upper stack */
  (*if_tx)(ifnet, mbuf);                        /* to driver or qos */
  (*if_txframe)(ifnet, mbuf);                   /* to driver */
  (*if_txframedone)(ifnet);                     /* callback to qos */

  /* Function pointers for packets coming up from layer 1 */
  (*if_l2demap)(ifnet, mbuf);                   /* l2/l3 unmapping */

When a packet comes down the stack (*if_l2map) gets called to map and
encapsulate a layer 3 packet into an appropriate layer 2 frame. For IP
this would be ether_output() together with ARP and so on. The result of
that step is the ethernet header in front of the IP packet. Ether_output()
then calls (*if_tx) to have the frame sent out on the wire(less) which is
the driver handoff point for DMA ring addition.
Normally (*if_tx) and (*if_txframe) are the same and the job is done. When software QoS is active (*if_tx) points into the soft-qos enqueue implementation and will eventually use (*if_txframe) to push out those packets onto the wire it sees fit. In addition the drivers have to expose functions to manage the number and depth of their DMA rings, or rather the number/size of packets that can be enqueued onto them. And the (*if_txframedone) callback to clock out packets from a soft-queue or QoS discipline. When QoS is active it probably wants to make the DMA rings small and the software queue(s) large to be effective. As default setup and when running a server no QoS will be active or inserted. No or only very small software queues exist to handle concurrency (except for ieee80211 to do sophisticated frame management inside *if_txframe). Whenever the DMA ring is full there is no point in queuing up more packets. Instead the socket buffers act as buffers and also ensure flow control and backpressure up to userspace to limit kernel memory usage from double and triple buffering. How the packets are efficiently pushed out onto the wire is up to the drivers and depends on the hardware capabilities. It can be multiple hardware DMA rings, or just a single ring with an efficient concurrent access method. 
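
The handoff described above can be sketched in user space as follows. This
is only an illustration under stated assumptions: struct xifnet, drv_txframe,
qos_tx, txframedone and xif_init are hypothetical names, packet ids stand in
for mbufs, the soft queue is a plain FIFO rather than a real QoS discipline,
and the ring size of 2 is chosen just to make the behaviour visible.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 2    /* hypothetical: small DMA ring when QoS is active */
#define SOFTQ_LEN  16   /* hypothetical soft queue depth */
#define SENT_MAX   64   /* size of the record of frames given to the ring */

struct xifnet;
typedef int (*tx_fn)(struct xifnet *, int pkt);

struct xifnet {
    tx_fn  if_tx;       /* entry from L2 mapping: driver or soft-qos */
    tx_fn  if_txframe;  /* always the driver */
    int    ring_used;   /* frames currently sitting in the DMA ring */
    int    sent[SENT_MAX];
    size_t sent_n;      /* frames handed to the ring so far, in order */
    int    softq[SOFTQ_LEN];
    size_t sq_head, sq_count;
};

/* Driver handoff: fails with -1 when the ring is full. */
static int
drv_txframe(struct xifnet *ifp, int pkt)
{
    if (ifp->ring_used == RING_SLOTS || ifp->sent_n == SENT_MAX)
        return (-1);
    ifp->sent[ifp->sent_n++] = pkt;
    ifp->ring_used++;
    return (0);
}

/* Soft-qos enqueue: pass through if possible, otherwise queue in software. */
static int
qos_tx(struct xifnet *ifp, int pkt)
{
    if (ifp->sq_count == 0 && ifp->if_txframe(ifp, pkt) == 0)
        return (0);
    if (ifp->sq_count == SOFTQ_LEN)
        return (-1);    /* tail drop */
    ifp->softq[(ifp->sq_head + ifp->sq_count) % SOFTQ_LEN] = pkt;
    ifp->sq_count++;
    return (0);
}

/* TX-done callback: one ring slot freed, clock out queued frames. */
static void
txframedone(struct xifnet *ifp)
{
    assert(ifp->ring_used > 0);
    ifp->ring_used--;
    while (ifp->sq_count > 0 &&
        ifp->if_txframe(ifp, ifp->softq[ifp->sq_head]) == 0) {
        ifp->sq_head = (ifp->sq_head + 1) % SOFTQ_LEN;
        ifp->sq_count--;
    }
}

/* Without QoS, if_tx and if_txframe are the same function. */
static void
xif_init(struct xifnet *ifp, int qos)
{
    *ifp = (struct xifnet){ 0 };
    ifp->if_txframe = drv_txframe;
    ifp->if_tx = qos ? qos_tx : drv_txframe;
}
```

With QoS off the caller sees the ring-full error directly, which is where
the socket buffer backpressure described above would take over; with QoS on
the third frame waits in software until a completion frees a ring slot.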
-- Andre From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 21:53:24 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id B686E212; Wed, 30 Oct 2013 21:53:24 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-qa0-x22e.google.com (mail-qa0-x22e.google.com [IPv6:2607:f8b0:400d:c00::22e]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 530B026B2; Wed, 30 Oct 2013 21:53:24 +0000 (UTC) Received: by mail-qa0-f46.google.com with SMTP id j15so4105948qaq.12 for ; Wed, 30 Oct 2013 14:53:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=YbkY7Urr5B8bLFKIzNB0U8SrzTyjLAS2IhuUCMcPyuw=; b=fh3YpF9i/4PgegyNcJ/PjigtGunCGuGNKErshEhm3k6JQQdWbrqvZVPZsHgBdax97q Dc/4C6CENlZgyZOhx12ETqBKcemGVQZAdQNKC+f4GoAp7JCuvCD8bwFBeKRbLJSY/eiR Ii/hZtnzfJtfz2CGQmAwLiHwLNZLMi9aCKkA0LwQ2ojmCUxM8tmbBb3Ork+xPgcWAte4 UGVGlKRtvj8yJcEfsvmMp6zcQo/uSkvLD7bBHA0KEihjmgZ3ItlhCYXFr8SRu41F+KX5 CzwX5PBd4pCirRfeRlRgOABBjHIc4obm12qCSUE4kDhOBa5sd2FTm1EymhFGDFLcaKEk TG8g== MIME-Version: 1.0 X-Received: by 10.224.113.199 with SMTP id b7mr980525qaq.4.1383170003477; Wed, 30 Oct 2013 14:53:23 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.224.207.66 with HTTP; Wed, 30 Oct 2013 14:53:23 -0700 (PDT) In-Reply-To: <52717A62.7040600@freebsd.org> References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <20131030050056.GA84368@onelab2.iet.unipi.it> <52717A62.7040600@freebsd.org> Date: Wed, 30 Oct 2013 
14:53:23 -0700 X-Google-Sender-Auth: E1dx_NzsQWZ0OYjPA511t5MA6GI Message-ID: Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.) From: Adrian Chadd To: Andre Oppermann Content-Type: text/plain; charset=ISO-8859-1 Cc: "freebsd-net@freebsd.org" , Luigi Rizzo , Navdeep Parhar , Randall Stewart X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 21:53:24 -0000 On 30 October 2013 14:30, Andre Oppermann wrote: > As default setup and when running a server no QoS will be active > or inserted. No or only very small software queues exist to handle > concurrency (except for ieee80211 to do sophisticated frame management > inside *if_txframe). Whenever the DMA ring is full there is no point > in queuing up more packets. Instead the socket buffers act as buffers > and also ensure flow control and backpressure up to userspace to limit > kernel memory usage from double and triple buffering. .. and what about for LAN<->WAN traffic, where there's no socket buffers? 
-adrian From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 22:02:17 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id B23FD5CF for ; Wed, 30 Oct 2013 22:02:17 +0000 (UTC) (envelope-from garmitage@swin.edu.au) Received: from gpo3.cc.swin.edu.au (gpo3.cc.swin.edu.au [136.186.1.32]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 470F02749 for ; Wed, 30 Oct 2013 22:02:16 +0000 (UTC) Received: from [136.186.229.37] (garmitage.caia.swin.edu.au [136.186.229.37]) by gpo3.cc.swin.edu.au (8.14.3/8.14.3) with ESMTP id r9UM1isa021721 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 31 Oct 2013 09:02:04 +1100 Message-ID: <527181C8.3040502@swin.edu.au> Date: Thu, 31 Oct 2013 09:01:44 +1100 From: grenville armitage User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:16.0) Gecko/20121107 Thunderbird/16.0.2 MIME-Version: 1.0 To: freebsd-net@freebsd.org Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.) References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <20131030050056.GA84368@onelab2.iet.unipi.it> In-Reply-To: <20131030050056.GA84368@onelab2.iet.unipi.it> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 22:02:17 -0000 On 10/30/2013 16:00, Luigi Rizzo wrote: [..] 
> Just to set the terminology:
> QUEUE MANAGEMENT POLICY refers to decisions that we make when we INSERT
> or EXTRACT packets from a queue WITHOUT LOOKING AT OTHER QUEUES .
> This is what implements DROPTAIL (also improperly called FIFO),
> RED, CODEL. Note that for CODEL you need to intercept extractions
> from the queue, whereas DROPTAIL and RED only act on
> insertions.
>
> SCHEDULER is the entity which decides which queue to serve among
> the many possible ones. It is called on INSERTIONS and
> EXTRACTIONS from a queue, and passes packets to the NIC's queue.
>
> The decision on which queue and ring (Q_i and R_j) to use should be made
> by a classifier at the beginning of step 2 (or once per iteration,
> if using a hierarchical scheduler). Of course they can be precomputed
> (e.g. with annotations in the mbuf coming from the socket).

I'd like to give a big +1 to the above.

Crucial additional points about the per-hop processing for QoS:

 - Classification is any decision of the form "to what class does this
   frame belong", where the answer is intended to drive the frame into the
   appropriate queue. (Which implies the notion of 'class' is very much
   context-dependent, and classification is something that may occur on
   L3 tuples, MPLS headers, other L2 fields, other local in-kernel
   context, etc.)

 - Queuing and scheduling must happen where bottlenecks form, and are
   irrelevant at points in the data path where no bottleneck exists.
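
To make the classification point concrete, here is a small sketch of a
classifier that answers "to what class does this frame belong" from L2/L3
fields. The struct, the three classes and the mapping policy are all
hypothetical; the only fixed facts assumed are that the 802.1Q priority code
point is a 3-bit value and that DSCP 46 is Expedited Forwarding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical traffic classes; 0 is served first. */
enum tclass { TC_VOICE = 0, TC_BUSINESS = 1, TC_BEST_EFFORT = 2 };

/* Fields a classifier might look at; filled in by the caller. */
struct frame_meta {
    int     has_vlan;  /* non-zero if an 802.1Q tag is present */
    uint8_t pcp;       /* 802.1Q priority code point, 0..7 */
    uint8_t dscp;      /* IP DSCP, 0..63 */
};

#define DSCP_EF 46     /* Expedited Forwarding */

/* Map a frame to a class; the result only selects a queue. */
static enum tclass
classify(const struct frame_meta *m)
{
    assert(m != NULL);
    if (m->dscp == DSCP_EF)
        return (TC_VOICE);
    if (m->has_vlan && m->pcp >= 4)   /* hypothetical policy */
        return (TC_BUSINESS);
    return (TC_BEST_EFFORT);
}
```

The same interface could equally be driven by MPLS headers or in-kernel
context; the classifier's output is just the index of the queue to use.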
cheers, gja From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 22:17:19 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id D9771D6F for ; Wed, 30 Oct 2013 22:17:19 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id EEA9F281D for ; Wed, 30 Oct 2013 22:17:18 +0000 (UTC) Received: (qmail 64356 invoked from network); 30 Oct 2013 22:47:36 -0000 Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 30 Oct 2013 22:47:36 -0000 Message-ID: <52718556.9010808@freebsd.org> Date: Wed, 30 Oct 2013 23:16:54 +0100 From: Andre Oppermann User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Luigi Rizzo , Adrian Chadd , Navdeep Parhar , Randall Stewart , "freebsd-net@freebsd.org" Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.) 
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <20131030050056.GA84368@onelab2.iet.unipi.it> In-Reply-To: <20131030050056.GA84368@onelab2.iet.unipi.it> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 22:17:19 -0000 On 30.10.2013 06:00, Luigi Rizzo wrote: > On Tue, Oct 29, 2013 at 06:43:21PM -0700, Adrian Chadd wrote: >> Hi, >> >> We can't assume the hardware has deep queues _and_ we can't just hand >> packets to the DMA engine. >> [Adrian explains why] [skipping things replied to in other email] > The architecture i think we should pursue is this (which happens to be > what linux implements, and also what dummynet implements, except > that the output is to a dummynet pipe or to ether_output() or to > ip_output() depending on the configuration): > > 1. multiple (one per core) concurrent transmitters t_c That's simply the number of cores that in theory could try to send a packet at the time? Or is it supposed to be an actual structure? > which use ether_output_frame() to send to > > 2. multiple disjoint queues q_j > (one per traffic group, can be *a lot*, say 10^6) Whooo, that looks a bit excessive. So many traffic groups would effectively be one per flow? Most of the time traffic is distributed into 4-8 classes with strict priority for the highest class (VoIP) and some sort of proportional WFQ for the others. At least that's the standard setup for carrier/ISP networks. > which are scheduled with a scheduler S > (iterate step 2 for hierarchical schedulers) > and Makes sense. > 3. 
eventually feed ONE transmit ring R_j on the NIC. Agreed, more than one wouldn't work because otherwise the NIC would do poor man's RR among the queues. > Once a packet reaches R_j, for all practical purpose > is on the wire. We cannot intercept extractions, > we cannot interfere with the scheduler in the NIC in > case of multiqueue NICs. The most we can do (and should, > as in Linux) is notify the owner of the packet once its > transmission is complete. Per packet notification probably has a high overhead on high pps systems. The coalesced TX complete interrupt should do for QoS purposes as well to keep the DMA ring fed. We do not track who generated the packet and thus can't have the notification bubble up to the PCB (if any). > Just to set the terminology: > QUEUE MANAGEMENT POLICY refers to decisions that we make when we INSERT > or EXTRACT packets from a queue WITHOUT LOOKING AT OTHER QUEUES . > This is what implements DROPTAIL (also improperly called FIFO), > RED, CODEL. Note that for CODEL you need to intercept extractions > from the queue, whereas DROPTAIL and RED only act on > insertions. Ack. > SCHEDULER is the entity which decides which queue to serve among > the many possible ones. It is called on INSERTIONS and > EXTRACTIONS from a queue, and passes packets to the NIC's queue. Ack. > The decision on which queue and ring (Q_i and R_j) to use should be made > by a classifier at the beginning of step 2 (or once per iteration, > if using a hierarchical scheduler). Of course they can be precomputed > (e.g. with annotations in the mbuf coming from the socket). IMHO that is the job of a packet filter, or in simple cases can be transposed into the mbuf header from vlan header cos or IP header tos fields. > Now when it comes to implementing the above, we have three > cases (or different optimization levels, if you like) -- 0. 
THE NO QOS CASE --- No qos is done and multi DMA rings are selected based on the flowid to reduce contention while avoiding packet reordering. > -- 1. THE SIMPLE CASE --- > > In the simplest possible case we have can let the NIC do everything. > Necessary conditions are: > - queue management policies acting only on insertions > (e.g. DROPTAIL or RED or similar); > - # of traffic classes <= # number of NIC rings > - scheduling policy S equal to the one implemented in the NIC > (trivial case: one queue, one ring, no scheduler) > > All these cases match exactly what the hardware provides, so we can just > use the NIC ring(s) without extra queue(s), and possibly use something > like buf_ring to manage insertions (but note that insertions in > an empty queue will end up requiring a lock; and i think the > same happens even now with the extra drbr queue in front of the ring). Agreed. A lock on the DMA ring is always required to protect the ring structure and NIC doorbell. Software queuing or buf_ring shouldn't be necessary at all. Only some mechanism to make concurrent access/backoff to the same DMA ring more efficient may be good. For example having one packet slot per core instead of spinning. > -- 2. THE INTERMEDIATE CASE --- > > If we do not care about a scheduler but want a more complex QUEUE > MANAGEMENT, such as CODEL, that acts on extractions, we _must_ > implement an intermediate queue Q_i before the NIC ring. This is > our only chance to act on extractions from the queue (which CODEL > requires). Note that we DO NOT NEED to create multiple queues for > each ring. As long as the NIC doesn't implement fair RR or interleaving among multiple DMA rings any sort of queue management is futile. Whenever queue management is active only one DMA ring may be used and it should be as small as possible to give maximum decision latitude to the queue management. > -- 3. THE COMPLETE CASE --- > > This is when the scheduler we want (DRR, WFQ variants, PRIORITY...) 
> is not implemented in the NIC, or we have more queues than those
> available in the NIC. In this case we need to invoke this extra
> block before passing packets to the NIC.

Again the same as in 2. applies, just with a more complex soft queue and
scheduler.

> Remember that dummynet implements exactly #3, and it comes with a
> set of pretty efficient schedulers (i have made extensive measurements
> on them, see links to papers on my research page
> http://info.iet.unipi.it/~luigi/research.html ).
> They are by no means a performance bottleneck (scheduling takes
> 50..200ns depending on the circumstances) in the cases where
> it matters to have a scheduler (which is, when the sender is
> faster than the NIC, which in turn only happens with large packets
> which take 1..30us to get through at the very least..

Thanks for the information.

> --- IMPLEMENTATION ---
>
> Apart from ALTQ (which is very slow and has inefficient schedulers
> and i don't think anybody wants to maintain), and with the exception
> of dummynet which I'll discuss later, at the moment FreeBSD do not
> support schedulers in the tx path of the device driver.

I haven't really dug into ALTQ/dummynet yet; however, from looking it over
you seem to be very much right. The basis for a fresh generic QoS
implementation should be dummynet (in parallel to keep it intact).

> So we can only deal with cases 1 and 2, and for them the software
> queue + ring suffices to implement any QUEUE MANAGEMENT policy
> (but we don't implement anything).
>
> If we want support the generic case (#3), we should do the following:
>
> 1. device drivers export a function to transmit on an individual ring,
> basically the current if_transmit(), and a hook to play with the
> corresponding queue lock (the scheduler needs to run under lock,
> and we can as well use the ring lock for that).
> Note that the ether_output_frame does not always need to
> call the scheduler: if a packet enters a non-empty queue, we are done.

OK.

> 2.
device drivers also export the number of tx queues, and > some (advisory) information on queue status OK. > 3. ether_output_frame() runs the classifier (if needed), invokes > the scheduler (if needed) and possibly falls through into if_transmit() > for the specific ring. OK. > 4. on transmit completions (*_txeof(), typically), a callback invokes > the scheduler to feed the NIC ring with more packets Ack. > I mentioned dummynet: it already implements ALL of this, > including the completion callback in #4. There is a hook > in ether_output_frame(), and the hook was called (up to 8.0 > i believe) if_tx_rdy(). You can see what it does in > RELENG_4, sys/netinet/ip_dummynet.c :: if_tx_rdy() > > http://svnweb.freebsd.org/base/stable/4/sys/netinet/ip_dummynet.c?revision=123994&view=markup > > if_tx_rdy() does not exist anymore because almost nobody used it, > but it is trivial to reimplement, and can be called by device drivers > when *_txeof() finds that it is running low on packets _and_ the > specific NIC needs to implement the "complete" scheduling. Yup. > The way it worked in dummynet (I think i used it on 'tun' and 'ed') > is also documented in the manpage: > define a pipe whose bandwidth is set as the device name instead > of a number. Then you can attach a scheduler to the pipe, queues > to the scheduler, and you are done. Example: > > // this is the scheduler's configuration > ipfw pipe 10 config bw 'em2' sched > ipfw sched 10 config type drr // deficit round robin > ipfw queue 1 config weight 30 sched 10 // important > ipfw queue 2 config weight 5 sched 10 // less important > ipfw queue 3 config weight 1 sched 10 // who cares... > > // and this is the classifier, which you can skip if the > // packets are already pre-classified. > // The infrastructure is already there to implement per-interface > // configurations. 
> ipfw add queue 1 src-port 53 > ipfw add queue 2 src-port 22 > ipfw add queue 2 ip from any to any > > Now, surely we can replace the implementation of packet queues in dummynet > from the current TAILQ to something resembling buf_ring to improve > write parallelism; and a bit of glue code is needed to attach > per-interface ipfw instances to each interface, and some smarts in > the configuration commands are needed to figure out when we can > bypass everything or not. I'll experiment with variations thereof. > But this seems to me a much more viable approach to achieve proper QoS > support in our architecture. Indeed. Let me get some code and prototypes going in the next weeks and then pick up the discussion from there again. -- Andre From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 22:23:40 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 02707F4A for ; Wed, 30 Oct 2013 22:23:40 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 64118288C for ; Wed, 30 Oct 2013 22:23:38 +0000 (UTC) Received: (qmail 64385 invoked from network); 30 Oct 2013 22:53:56 -0000 Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 30 Oct 2013 22:53:56 -0000 Message-ID: <527186D3.7090307@freebsd.org> Date: Wed, 30 Oct 2013 23:23:15 +0100 From: Andre Oppermann User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Adrian Chadd Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.) 
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <20131030050056.GA84368@onelab2.iet.unipi.it> <52717A62.7040600@freebsd.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "freebsd-net@freebsd.org" , Luigi Rizzo , Navdeep Parhar , Randall Stewart X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 22:23:40 -0000 On 30.10.2013 22:53, Adrian Chadd wrote: > On 30 October 2013 14:30, Andre Oppermann wrote: > >> As default setup and when running a server no QoS will be active >> or inserted. No or only very small software queues exist to handle >> concurrency (except for ieee80211 to do sophisticated frame management >> inside *if_txframe). Whenever the DMA ring is full there is no point >> in queuing up more packets. Instead the socket buffers act as buffers >> and also ensure flow control and backpressure up to userspace to limit >> kernel memory usage from double and triple buffering. > > .. and what about for LAN<->WAN traffic, where there's no socket buffers? When the DMA ring is full (in case of a deep ring, or the software queue for small DMA rings) additional packets get dropped as it is today. Instead of tail dropping an active queue management algorithm like RED may be used. There is no point in ultra deep buffering ending up in tens or hundreds of milliseconds (see bufferbloat). If there is more egress traffic destined for an interface than it can handle there is no way to avoid packet drops. It's actually a good thing because for TCP packet drops are the primary feedback for its sending behavior. 
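To make the drop decision a bit more concrete, here is a rough sketch of tail drop next to a simplified RED. It is a toy model in Python, not driver code. All names and parameters (`tail_drop`, `ewma`, `red_drop_probability`, `min_th`, `max_th`, `max_p`, `weight`) are made up for illustration, and the sketch leaves out RED's count-based probability correction and its idle-time handling.

```python
def tail_drop(queue_len, limit):
    """Tail drop: refuse the arriving packet only when the queue is full."""
    return queue_len >= limit


def ewma(avg, sample, weight=0.002):
    """Exponentially weighted moving average of the queue length.

    RED acts on this smoothed value rather than on the instantaneous
    queue length, so short bursts are tolerated.
    """
    return (1.0 - weight) * avg + weight * sample


def red_drop_probability(avg, min_th, max_th, max_p):
    """Simplified RED: probability of dropping an arriving packet.

    Below min_th nothing is dropped, at or above max_th everything is
    dropped, and in between the probability rises linearly up to max_p.
    """
    if avg < min_th:
        return 0.0
    if avg >= max_th:
        return 1.0
    return max_p * (avg - min_th) / (max_th - min_th)
```

The point of the comparison: tail drop only reacts once the buffer is already full, while RED starts signalling the senders earlier, which is what keeps the standing queue (and thus the added latency) short.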
-- Andre From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 22:32:09 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 4AA0316F for ; Wed, 30 Oct 2013 22:32:09 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 92D152910 for ; Wed, 30 Oct 2013 22:32:08 +0000 (UTC) Received: (qmail 64443 invoked from network); 30 Oct 2013 23:02:26 -0000 Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 30 Oct 2013 23:02:26 -0000 Message-ID: <527188D1.2070905@freebsd.org> Date: Wed, 30 Oct 2013 23:31:45 +0100 From: Andre Oppermann User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: grenville armitage , freebsd-net@freebsd.org Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.) 
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <20131030050056.GA84368@onelab2.iet.unipi.it> <527181C8.3040502@swin.edu.au> In-Reply-To: <527181C8.3040502@swin.edu.au> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Oct 2013 22:32:09 -0000 On 30.10.2013 23:01, grenville armitage wrote: > On 10/30/2013 16:00, Luigi Rizzo wrote: > [..] >> Just to set the terminology: >> QUEUE MANAGEMENT POLICY refers to decisions that we make when we INSERT >> or EXTRACT packets from a queue WITHOUT LOOKING AT OTHER QUEUES . >> This is what implements DROPTAIL (also improperly called FIFO), >> RED, CODEL. Note that for CODEL you need to intercept extractions >> from the queue, whereas DROPTAIL and RED only act on >> insertions. >> >> SCHEDULER is the entity which decides which queue to serve among >> the many possible ones. It is called on INSERTIONS and >> EXTRACTIONS from a queue, and passes packets to the NIC's queue. >> >> The decision on which queue and ring (Q_i and R_j) to use should be made >> by a classifier at the beginning of step 2 (or once per iteration, >> if using a hierarchical scheduler). Of course they can be precomputed >> (e.g. with annotations in the mbuf coming from the socket). > > I'd like to give a big +1 to the above. Crucial additional points > about the per-hop processing for QoS: > > - Classification is any decision of the form "to what class does > this frame belong", where the answer is intended to drive the frame > into the appropriate queue. 
(Which implies the notion of 'class' is > very much context-dependent, and classification is something that may > occur on L3 tuples, MPLS headers, other L2 fields, other local in-kernel > context,etc.) Full ack. When the class information is present (and trusted) on ingress packets in the vlan header, IP tos and other such well-defined fields we can map it directly to the mbuf header qoscos field. Everything more complex has to be done in a packet filter that has access to and can parse L3 and higher layers in the packet. On egress only the mbuf header is looked at to determine the class and queue it should be put into. > - Queuing and schedule must happen where bottlenecks form, and > are irrelevant at points in the data path where no bottleneck exists. Very well put and *the* one crucial thing to understand to make any kind of QoS work in practice. -- Andre From owner-freebsd-net@FreeBSD.ORG Thu Oct 31 00:32:55 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id CF1DF140; Thu, 31 Oct 2013 00:32:55 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 6EB092FD0; Thu, 31 Oct 2013 00:32:52 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 0EC437300A; Thu, 31 Oct 2013 01:34:38 +0100 (CET) Date: Thu, 31 Oct 2013 01:34:38 +0100 From: Luigi Rizzo To: Andre Oppermann Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.) 
Message-ID: <20131031003438.GA10518@onelab2.iet.unipi.it> References: <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <20131030050056.GA84368@onelab2.iet.unipi.it> <52718556.9010808@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <52718556.9010808@freebsd.org> User-Agent: Mutt/1.5.20 (2009-06-14) Cc: Adrian Chadd , "freebsd-net@freebsd.org" , Navdeep Parhar , Randall Stewart X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Oct 2013 00:32:55 -0000 On Wed, Oct 30, 2013 at 11:16:54PM +0100, Andre Oppermann wrote: > On 30.10.2013 06:00, Luigi Rizzo wrote: ... > [skipping things replied to in other email] likewise, and let me thank you for the detailed comments. I am adding a few comments myself below > > The architecture i think we should pursue is this (which happens to be > > what linux implements, and also what dummynet implements, except > > that the output is to a dummynet pipe or to ether_output() or to > > ip_output() depending on the configuration): > > > > 1. multiple (one per core) concurrent transmitters t_c > > That's simply the number of cores that in theory could try to send > a packet at the time? Or is it supposed to be an actual structure? it is just the number of cores that could potentially compete at any time in using one scheduler > > which use ether_output_frame() to send to > > > > 2. multiple disjoint queues q_j > > (one per traffic group, can be *a lot*, say 10^6) > > Whooo, that looks a bit excessive. So many traffic groups would > effectively be one per flow? It depends on what you define as "flow", and i explicitly did not use the term as it is ambiguous. 
For me a traffic group is whatever a classifier decides to put together. The point of aiming for a large number of classes is to avoid making assumptions that will limit us in the future, e.g. reserving a too small field to represent the queue id, or statically allocating queues, and the like. Most schedulers in dummynet scale as O(1) with the number of classes, so the only issue is having enough memory; and in any case the actual max number of classes depends on the output of your classifier. A lot of dummynet configurations (driving the upstream link for a leaf network, so right in front of the bottleneck) use a handful of groups _per host_: say one for voip, one for dns/ssh, one for bulk traffic, assigning different weights. A QFQ scheduler can easily end up with a few thousand queues and still efficiently achieve fair sharing of bandwidth. > Most of the time traffic is distributed into 4-8 classes with > strict priority for the highest class (VoIP) and some sort of > proportional WFQ for the others. At least that's the standard > setup for carrier/ISP networks. This is for two reasons: - the ISP does not need to care about individual hosts within the customer's network, but only (possibly) about the coarse classification that the customer has made via TOS/COS bits. - boxes that only have a handful of queues handled with priority cost infinitely less than decent ones, so ISPs have an incentive in not separating individual customers (which they should do) especially if the SLA is "your upstream bandwidth is 1 Mbit/s, but the guaranteed bandwidth is 30 Kbit/s" (typical ADSL in Italy). But again, it is important that we support large sets of classes; we do not necessarily have to use them. > > Once a packet reaches R_j, for all practical purposes > > it is on the wire. We cannot intercept extractions, > > we cannot interfere with the scheduler in the NIC in > > case of multiqueue NICs. 
The most we can do (and should, > > as in Linux) is notify the owner of the packet once its > > transmission is complete. > > Per packet notification probably has a high overhead on high pps > systems. The coalesced TX complete interrupt should do for QoS > purposes as well to keep the DMA ring fed. We do not track who > generated the packet and thus can't have the notification bubble > up to the PCB (if any). I know we don't do it now, but Linux does and performance is not impacted badly. Notifications can be easily batched and in the end they only cause a selwakeup() . Anyways this can be retrofitted if we have a reference from the mbuf to the owner/socket, and a pointer to a callback. > > The decision on which queue and ring (Q_i and R_j) to use should be made > > by a classifier at the beginning of step 2 (or once per iteration, > > if using a hierarchical scheduler). Of course they can be precomputed > > (e.g. with annotations in the mbuf coming from the socket). > > IMHO that is the job of a packet filter, or in simple cases can be > transposed into the mbuf header from vlan header cos or IP header > tos fields. we are in sync here, just terminology differs. A classifier is the first half of a packet filter (which first classifies and then applies an action). And yes the classification info can come from the headers. 
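As a small illustration of the "classification info can come from the headers" case, here is a sketch in Python. It is a model only: the function name `classify`, the preference for the VLAN priority over the TOS byte, and the crude clamping to `num_classes` are all assumptions of this example, not anything that exists in the tree.

```python
def classify(vlan_pcp=None, ip_tos=None, num_classes=8):
    """Map trusted header fields to a traffic class index (0 = best effort).

    vlan_pcp: 802.1Q priority code point (0..7), or None if untagged.
    ip_tos:   IPv4 TOS byte (0..255), or None if not an IP packet.
    """
    if vlan_pcp is not None:
        cls = vlan_pcp & 0x7        # 3-bit priority from the vlan header
    elif ip_tos is not None:
        cls = (ip_tos >> 5) & 0x7   # top 3 bits of the TOS byte (IP precedence)
    else:
        cls = 0                     # nothing to go on: default class
    # Hypothetical policy: fold everything above the last class into it.
    return min(cls, num_classes - 1)
```

Anything that needs to look deeper than these fixed fields (ports, L3 tuples, and so on) is the job of the packet filter half, as discussed above; the result would then simply be stored alongside the packet so the egress side only has to read it.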
cheers luigi From owner-freebsd-net@FreeBSD.ORG Thu Oct 31 02:45:01 2013 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id BE20060C; Thu, 31 Oct 2013 02:45:01 +0000 (UTC) (envelope-from linimon@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 9379926F1; Thu, 31 Oct 2013 02:45:01 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r9V2j1Yi092704; Thu, 31 Oct 2013 02:45:01 GMT (envelope-from linimon@freefall.freebsd.org) Received: (from linimon@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r9V2j1P1092703; Thu, 31 Oct 2013 02:45:01 GMT (envelope-from linimon) Date: Thu, 31 Oct 2013 02:45:01 GMT Message-Id: <201310310245.r9V2j1P1092703@freefall.freebsd.org> To: linimon@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-net@FreeBSD.org From: linimon@FreeBSD.org Subject: Re: kern/183390: [ixgbe] 10gigabit networking problems X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Oct 2013 02:45:01 -0000 Old Synopsis: 10gigabit networking problems New Synopsis: [ixgbe] 10gigabit networking problems Responsible-Changed-From-To: freebsd-bugs->freebsd-net Responsible-Changed-By: linimon Responsible-Changed-When: Thu Oct 31 02:43:11 UTC 2013 Responsible-Changed-Why: Over to maintainer(s). 
http://www.freebsd.org/cgi/query-pr.cgi?pr=183390 From owner-freebsd-net@FreeBSD.ORG Thu Oct 31 02:46:10 2013 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 65E9B6FB; Thu, 31 Oct 2013 02:46:10 +0000 (UTC) (envelope-from linimon@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 3A7B12709; Thu, 31 Oct 2013 02:46:10 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r9V2kAPF092779; Thu, 31 Oct 2013 02:46:10 GMT (envelope-from linimon@freefall.freebsd.org) Received: (from linimon@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r9V2kADn092778; Thu, 31 Oct 2013 02:46:10 GMT (envelope-from linimon) Date: Thu, 31 Oct 2013 02:46:10 GMT Message-Id: <201310310246.r9V2kADn092778@freefall.freebsd.org> To: linimon@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-net@FreeBSD.org From: linimon@FreeBSD.org Subject: Re: kern/183391: [ixgbe] 10gigabit networking problems with Emulex OCE 11102 CNA X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Oct 2013 02:46:10 -0000 Old Synopsis: 10gigabit networking problems with Emulex OCE 11102 CNA New Synopsis: [ixgbe] 10gigabit networking problems with Emulex OCE 11102 CNA Responsible-Changed-From-To: freebsd-bugs->freebsd-net Responsible-Changed-By: linimon Responsible-Changed-When: Thu Oct 31 02:45:10 UTC 2013 Responsible-Changed-Why: Over to maintainer(s). 
http://www.freebsd.org/cgi/query-pr.cgi?pr=183391 From owner-freebsd-net@FreeBSD.ORG Thu Oct 31 07:41:01 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 7756F31B for ; Thu, 31 Oct 2013 07:41:01 +0000 (UTC) (envelope-from s.khanchi@gmail.com) Received: from mail-wg0-x22b.google.com (mail-wg0-x22b.google.com [IPv6:2a00:1450:400c:c00::22b]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id F377B265E for ; Thu, 31 Oct 2013 07:41:00 +0000 (UTC) Received: by mail-wg0-f43.google.com with SMTP id b13so2347475wgh.10 for ; Thu, 31 Oct 2013 00:40:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:from:date:message-id:subject:to:content-type; bh=1lT3QG3BPmFIjAQw9yhGT3wvVMnR4AcC+e/raZ5wPMk=; b=QdFYg5yetQl6S3jTc1YYdBNwC0TPZbOOjwFZlqLS+1PLJ+GBVEnmHLVDqX1m0N6U5T 2DcbQgYUucO9j1XcR/LBsZ93R7NHbgfBCcPu95D85b5nRi+30pWnaBJlI87o99tr/Jwm 0uZd+7ef/yjOX4Yi34QJxL7tlg0SL4BTAi8oQSKLeu7Mw1CTwU+MWuLgkCZG1puwejgv WK75Ffb0Vmq9VNqsVVKJ3VoZ+BaofccQsYRFpXVf6JDbQ52CP/h7EgpEOy1q5XvUKBvF XaFNlgGHf141vNwzFUdYY+Z1cLwmXnfY6YrYWC2bbKmMD3LQ4PvAvooQP8KMtbtZJ+QL /atw== X-Received: by 10.194.250.6 with SMTP id yy6mr1392705wjc.13.1383205259515; Thu, 31 Oct 2013 00:40:59 -0700 (PDT) MIME-Version: 1.0 Sender: s.khanchi@gmail.com Received: by 10.194.119.73 with HTTP; Thu, 31 Oct 2013 00:40:39 -0700 (PDT) From: h bagade Date: Thu, 31 Oct 2013 11:10:39 +0330 X-Google-Sender-Auth: 32HW-scoFN_V6fdWFkTWzY7HZCs Message-ID: Subject: Errors on running kipfw with vale switches To: "freebsd-net@freebsd.org" Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list 
List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Oct 2013 07:41:01 -0000 Hi all, I want to run userland ipfw with netmap support(kipfw). When I try to follow the example to test kipfw, it encounters an error on following command: # connect the firewall to two vale switches ./kipfw valeA:f valeB:f & command output: root@zharf-bsd-9:/ipfw-user # ./kipfw valeA:f valeB:f & [1] 2278 [ 10.971878] missing.c:callout_startup [356] start init_children mod_idx value 9 +++ start module 0 ipfw ipfw at 0x61dc60 order 0x1 +++ start module 1 sy_ipfw SYSINIT at 0x0 order 0x2 ipfw2 initialized, divert loadable, nat loadable, rule-based forwarding disabled, default to accept, logging disabled +++ start module 2 sy_Vnet_ipfw SYSINIT at 0x0 order 0x3 [ 10.971944] missing.c:callout_init [303] c 0x61e380 mpsafe 8 [ 10.971949] missing.c:pfil_head_get [86] called [ 10.971952] missing.c:pfil_add_hook [93] called +++ start module 3 dummynet dummynet at 0x61dca0 order 0x4 DUMMYNET 0x0 with IPv6 initialized (100409) [ 10.971966] missing.c:taskqueue_create [422] start dummynet fn 0x414ba0 ctx 0x61e400 [ 10.971970] missing.c:taskqueue_start_threads [430] tqp 0x61e400 count 1 (dummy) [ 10.971973] missing.c:callout_init [303] c 0x61e4a0 mpsafe 8 +++ start module 4 dn_fifo dn_fifo at 0x61dcf0 order 0x5 [ 10.971982] ip_dummynet.c:load_dn_sched [2250] dn_sched FIFO loaded +++ start module 5 dn_wf2qp dn_wf2qp at 0x61ddd0 order 0x6 [ 10.971989] ip_dummynet.c:load_dn_sched [2250] dn_sched WF2Q+ loaded +++ start module 6 dn_rr dn_rr at 0x61deb0 order 0x7 [ 10.971995] ip_dummynet.c:load_dn_sched [2250] dn_sched RR loaded +++ start module 7 dn_qfq dn_qfq at 0x61df90 order 0x8 [ 10.972000] ip_dummynet.c:load_dn_sched [2250] dn_sched QFQ loaded +++ start module 8 dn_prio dn_prio at 0x61e070 order 0x9 [ 10.972005] ip_dummynet.c:load_dn_sched [2250] dn_sched PRIO loaded *** Global Sysctl Table entries = 39, total 
size = 2052 *** [ 10.972055] session.c:do_server [531] +++ listening tcp 127.0.0.1:5555 [ 10.972065] netmap_io.c:netmap_add_port [272] opening netmap device valeA:f netmap_open [131] /dev/netmap opened ok netmap_open [139] cannot get info on valeA:f, errno 6 ver 3 [ 10.972098] netmap_io.c:netmap_add_port [283] error opening valeA:f [ 10.972103] netmap_io.c:netmap_add_port [272] opening netmap device valeB:f netmap_open [131] /dev/netmap opened ok netmap_open [139] cannot get info on valeB:f, errno 6 ver 3 [ 13.019760] netmap_io.c:netmap_add_port [283] error opening valeB:f [ 13.019779] session.c:do_server [531] +++ listening tcp 127.0.0.1:5556 [ 13.021023] missing.c:callout_run [373] running 0x61e4a0 due at 1 now 2049 [ 13.021035] missing.c:callout_run [373] running 0x61e380 due at 1000 now 2049 I am running firewall on FreeBSD 9.2-stable. It seems that there is some problem with vale but I don't know what it is! Is it possible that my netmap module doesn't support vale? From owner-freebsd-net@FreeBSD.ORG Thu Oct 31 16:56:00 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 6CAC7356 for ; Thu, 31 Oct 2013 16:56:00 +0000 (UTC) (envelope-from CGuadall@nexica.com) Received: from relay3.mail.nexica.com (relay3.mail.nexica.com [217.13.116.92]) by mx1.freebsd.org (Postfix) with ESMTP id CD40A2ED7 for ; Thu, 31 Oct 2013 16:55:59 +0000 (UTC) Received: from relay3.mail.nexica.com (zeus02nex.noc.nexica.com [10.2.0.151]) by batchmail3.noc.nexica.com (Postfix) with ESMTP id CD370DD3D3 for ; Thu, 31 Oct 2013 17:11:06 +0100 (CET) Received: from cl3-smtp.mail.nexica.com (zeus02nex.noc.nexica.com [10.2.0.151]) by relay3.noc.nexica.com (Postfix) with ESMTP id 174A2B4522 for ; Thu, 31 Oct 2013 17:11:00 +0100 (CET) Received: from vnxbcnex02.bcn.nexica.com (unknown 
[212.92.38.69]) (using TLSv1 with cipher RC4-MD5 (128/128 bits)) (No client certificate requested) (Authenticated sender: relay.nexica.com) by cl3-smtp.mail.nexica.com (Postfix) with ESMTP id 0D23712B685 for ; Thu, 31 Oct 2013 17:11:00 +0100 (CET) Received: from vnxbcnex01.bcn.nexica.com (192.168.1.158) by vnxbcnex02.bcn.nexica.com (192.168.1.159) with Microsoft SMTP Server (TLS) id 8.1.436.0; Thu, 31 Oct 2013 17:10:59 +0100 Received: from vnxbcnex01.bcn.nexica.com ([172.16.30.68]) by vnxbcnex01.bcn.nexica.com ([172.16.30.68]) with mapi; Thu, 31 Oct 2013 17:10:59 +0100 From: Carles Guadall To: "freebsd-net@freebsd.org" Date: Thu, 31 Oct 2013 17:10:57 +0100 Subject: LACP+VLAN with 10G NIC not working Thread-Topic: LACP+VLAN with 10G NIC not working Thread-Index: Ac7WU4XKuc41LagWQKWzRhhNFTXOLg== Message-ID: <7A75BE7326F9D34D83FDF03CA8B02155174F56A242@vnxbcnex01.bcn.nexica.com> Accept-Language: ca-ES, es-ES Content-Language: ca-ES X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: ca-ES, es-ES X-TM-AS-Product-Ver: SMEX-10.2.0.2087-7.000.1014-20258.003 X-TM-AS-Result: No--3.015000-8.000000-31 X-TM-AS-User-Approved-Sender: No X-TM-AS-User-Blocked-Sender: No Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Oct 2013 16:56:00 -0000 I configured a lacp (lagg0) with two 10G-Intel NICs. Later I created 2 VLANs over lagg0. Trying to ping from the host vlan to any other host doesn't work. I ran tcpdump on each interface, ix[0|1], lagg0 and vlan[52|908]. - On each physical interface I see packets coming from network. I can see mainly broadcasts, correctly tagged, etc. - On lagg0 interface I also see packets coming from network. 
When running tcpdump on vlanXX I can only see ARP requests from localhost. It seems packets don't "flow" from/to vnic and lagg. Inbound ( network ) --> [ix0] --> [lagg0] ---X---> [vlan52] ( network ) --> [ix0] --> [lagg0] ---X---> [vlan52] Outbound ( network ) <-- [ix0] <-- [lagg0] <---X--- [vlan52] ( network ) <-- [ix0] <-- [lagg0] <---X--- [vlan52] Any idea what's wrong?? System info # uname -a FreeBSD XXX-hostname-XXX 9.1-STABLE FreeBSD 9.1-STABLE #0 r+16f6355: Tue Aug 27 00:38:40 PDT 2013 root@build.ixsystems.com:/tank/home/jkh/src/freenas/os-base/amd64/tank/home/jkh/src/freenas/FreeBSD/src/sys/FREENAS.amd64 amd64 # sysctl kern.osreldate kern.osreldate: 901505 # dmesg |grep -i intel ix0: port 0xbc00-0xbc1f mem 0xf9f80000-0xf9ffffff,0xf9f7c000-0xf9f7ffff irq 16 at device 0.0 on pci1 ix1: port 0xb880-0xb89f mem 0xf9e80000-0xf9efffff,0xf9e7c000-0xf9e7ffff irq 17 at device 0.1 on pci1 # pciconf -lv | grep -B3 network ix0@pci0:1:0:0: class=0x020000 card=0x061115d9 chip=0x10fb8086 rev=0x01 hdr=0x00 vendor = 'Intel Corporation' device = '82599EB 10-Gigabit SFI/SFP+ Network Connection' class = network -- ix1@pci0:1:0:1: class=0x020000 card=0x061115d9 chip=0x10fb8086 rev=0x01 hdr=0x00 vendor = 'Intel Corporation' device = '82599EB 10-Gigabit SFI/SFP+ Network Connection' class = network # ifconfig ix0: flags=8843 metric 0 mtu 1500 options=407bb ether 00:25:90:c3:da:82 inet6 fe80::225:90ff:fec3:da82%ix0 prefixlen 64 scopeid 0x1 nd6 options=29 media: Ethernet autoselect (autoselect ) status: active ix1: flags=8843 metric 0 mtu 1500 options=407bb ether 00:25:90:c3:da:82 inet6 fe80::225:90ff:fec3:da83%ix1 prefixlen 64 scopeid 0x2 nd6 options=29 media: Ethernet autoselect (autoselect ) status: active lagg0: flags=8843 metric 0 mtu 1500 options=407bb ether 00:25:90:c3:da:82 inet 192.168.100.100 netmask 0xffffff00 broadcast 192.168.100.255 inet6 fe80::225:90ff:fec3:da82%lagg0 prefixlen 64 scopeid 0x9 
nd6 options=29 media: Ethernet autoselect status: active laggproto lacp lagghash l2,l3,l4 laggport: ix1 flags=18 laggport: ix0 flags=18 vlan52: flags=8843 metric 0 mtu 1500 options=303 ether 00:25:90:c3:da:82 inet 10.52.0.9 netmask 0xffffff00 broadcast 10.52.0.255 inet6 fe80::225:90ff:fec3:da82%vlan52 prefixlen 64 scopeid 0xa nd6 options=29 media: Ethernet autoselect status: active vlan: 52 parent interface: lagg0 vlan908: flags=8843 metric 0 mtu 1500 options=303 ether 00:25:90:c3:da:82 inet 10.21.0.9 netmask 0xffffff00 broadcast 10.21.0.255 inet6 fe80::225:90ff:fec3:da82%vlan908 prefixlen 64 scopeid 0xb nd6 options=29 media: Ethernet autoselect status: active vlan: 908 parent interface: lagg0 Thank you Carles Guadall From owner-freebsd-net@FreeBSD.ORG Thu Oct 31 18:07:19 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id DA92AFC1 for ; Thu, 31 Oct 2013 18:07:19 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 9D3272500 for ; Thu, 31 Oct 2013 18:07:19 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 0AAB57300A; Thu, 31 Oct 2013 19:09:07 +0100 (CET) Date: Thu, 31 Oct 2013 19:09:07 +0100 From: Luigi Rizzo To: h bagade Subject: Re: Errors on running kipfw with vale switches Message-ID: <20131031180907.GB62132@onelab2.iet.unipi.it> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) Cc: "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: 
, X-List-Received-Date: Thu, 31 Oct 2013 18:07:19 -0000 On Thu, Oct 31, 2013 at 11:10:39AM +0330, h bagade wrote: > Hi all, > > I want to run userland ipfw with netmap support(kipfw). When I try to > follow the example to test kipfw, it encounters an error on following > command: i suspect that stable/9 has an old version of the netmap code so the argument to the ioctl fails. In fact, I don't even remember if the code in stable/9 supports VALE. Please wait for a few days, we are going to push a newer version of netmap to both HEAD and stable/9 soon cheers luigi From owner-freebsd-net@FreeBSD.ORG Thu Oct 31 18:08:31 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id DA83317B for ; Thu, 31 Oct 2013 18:08:31 +0000 (UTC) (envelope-from raitech@gmail.com) Received: from mail-pd0-x231.google.com (mail-pd0-x231.google.com [IPv6:2607:f8b0:400e:c02::231]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id B6F102527 for ; Thu, 31 Oct 2013 18:08:31 +0000 (UTC) Received: by mail-pd0-f177.google.com with SMTP id p10so2730704pdj.22 for ; Thu, 31 Oct 2013 11:08:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:from:date:message-id:subject:to:content-type; bh=A3SBVrP6kwfWgTI7CboON3o+ndvUScOUYBNH8UGYAis=; b=NRM9f1TM7xrd1qKEFeZia8UVx3cYUMagFMHSX+APNpjxybVSaBx/w/SYNGfWrqKc9D dO7f8uaTT2ZmAnzuEfYi9SdgkkewdieMlKLGpGJnIWiG5cNAFuj6z+5WgV/EqAGQpTAD VbLr5Kwb2JVMwSwnedy5Gh5inlEdf58Tt28olszTpPDBWwiZd3lfGNdcpMplC/Ow/no9 CJbBc96s8uGA5vOw/YSfOOI3ZXlQQmeWkg76bXXEw0Efk7UK+5Sxu+/XQZCDesDPkBtP QaEynvjIAtZlHuRDoStxonU1rcT2xNyextvtdXN+fG6wD5BmC7AP4WfGpeINvWpnd4UD +vuA== X-Received: by 10.68.164.165 with SMTP id yr5mr3240711pbb.146.1383242911387; Thu, 31 Oct 2013 
11:08:31 -0700 (PDT) MIME-Version: 1.0 Received: by 10.70.101.70 with HTTP; Thu, 31 Oct 2013 11:08:11 -0700 (PDT) From: Raimundo Santos Date: Thu, 31 Oct 2013 16:08:11 -0200 Message-ID: Subject: MPD PPTP seting 0 on net.inet.ip.forwarding To: "freebsd-net@freebsd.org" Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Oct 2013 18:08:31 -0000

Hello!

I was experimenting with

set ipcp ranges 0.0.0.0 172.16.1.20

to see whether I had understood the concepts in the MPD 5.7 docs, but when I try to connect to the PPTP server with 0.0.0.0 as the local address, net.inet.ip.forwarding gets set to 0 and the PPP link does not connect.

After changing it to

set ipcp ranges 172.16.1.19 172.16.1.20

net.inet.ip.forwarding still goes to 0, but the PPP link connects.

Using the ippool example from mpd.conf.sample, with only the IPs changed to match my network, I see the same strange thing.

What strange behavior. I am using MPD 5.7 and FreeBSD 9.2-RELEASE. What could be wrong?

Here is my mpd.conf:

startup:
	# configure mpd users
	set user foo bar admin
	set user foo1 bar1
	# configure the console
	set console self 127.0.0.1 5005
	set console open
	# configure the web server
	set web self 0.0.0.0 5006
	set web open

default:
	load pptp_server

pptp_server:
	set ippool add pool1 172.16.1.20 172.16.1.100
	create bundle template B
	set iface enable proxy-arp
	set iface idle 1800
	set iface enable tcpmssfix
	set ipcp yes vjcomp
	set ipcp ranges 172.16.1.19/32 ippool pool1
	#set ipcp dns 192.168.1.3
	#set ipcp nbns 192.168.1.4
	set bundle enable compression
	set ccp yes mppc
	set mppc yes e40
	set mppc yes e128
	set mppc yes stateless
	create link template L pptp
	set link action bundle B
	set link enable multilink
	set link yes acfcomp protocomp
	set link no pap chap eap
	set link enable chap
	set link keep-alive 10 60
	set link mtu 1460
	set pptp self 192.168.0.2
	set link enable incoming
	log +all

And here is my rc.conf:

hostname="rtcprime"
ifconfig_alc0=" inet 192.168.0.2 netmask 255.255.255.0"
defaultrouter="192.168.0.1"
sshd_enable="YES"
ntpd_enable="YES"
powerd_enable="YES"
dumpdev="AUTO"
zfs_enable="YES"
noip_enable="YES"
samba_enable="YES"
mpd_enable="YES"

As you can see, there is no gateway_enable="YES", but there is net.inet.ip.forwarding=1 in /etc/sysctl.conf

Thank you for your attention.
Raimundo Santos From owner-freebsd-net@FreeBSD.ORG Thu Oct 31 18:39:54 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id C617FBA for ; Thu, 31 Oct 2013 18:39:54 +0000 (UTC) (envelope-from raitech@gmail.com) Received: from mail-pa0-x236.google.com (mail-pa0-x236.google.com [IPv6:2607:f8b0:400e:c03::236]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id A08442756 for ; Thu, 31 Oct 2013 18:39:54 +0000 (UTC) Received: by mail-pa0-f54.google.com with SMTP id fa1so2974791pad.13 for ; Thu, 31 Oct 2013 11:39:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=D/TXGFZfJwut2mbSh3PXedyv8nuHm7UH0uo97f3R6Cg=; b=bjLOMp2VqxS+nO/lWCHLEBujjIY9e6vjgdfcSZ5kOWjye1YVU5HznbUD+TxvPQEQNH +PqXEGaioiEc3uW8N026W0taZ6Gu91jIB0BNL0j3OMo5nBZANCUl6ZR2CzpfKW32AC7+ TZ4vCVd6V9c1oMIM/LazqkaFX6s5cqeCKBNOtMe+NtUy2pdP0RvDxGBfQ9fJWLhmDhc6 rCgzhZr/dsX5kNNQyJr2hLFD0VUPIAqoUAA4aqrzyoifTxcTZhcfdVK5XTrHSFIgRQEC QGtHTSoeiBeZEZfmHIXbYO8eA1g7RFKL7l1QVyOG6ksGLv2y2PwVg3yJy9CvO0ZeX83U 00og== X-Received: by 10.67.30.100 with SMTP id kd4mr5422876pad.24.1383244794107; Thu, 31 Oct 2013 11:39:54 -0700 (PDT) MIME-Version: 1.0 Received: by 10.70.101.70 with HTTP; Thu, 31 Oct 2013 11:39:33 -0700 (PDT) In-Reply-To: References: From: Raimundo Santos Date: Thu, 31 Oct 2013 16:39:33 -0200 Message-ID: Subject: Re: MPD PPTP seting 0 on net.inet.ip.forwarding To: "freebsd-net@freebsd.org" Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: 
, List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Oct 2013 18:39:54 -0000

OK, I have found something weird:

On 31 October 2013 16:08, Raimundo Santos wrote:

> As you can see, there is no gateway_enable="YES", but there is
> net.inet.ip.forwarding=1 in /etc/sysctl.conf

MPD does not respect my configuration in sysctl.conf, only the one in rc.conf. To test:

* put net.inet.ip.forwarding and net.inet6.ip6.forwarding = 1 in sysctl.conf
* put gateway_enable="YES" in rc.conf
* connect to PPTP server

You will see that net.inet.ip.forwarding, after the PPTP connection is established, remains 1, but net.inet6.ip6.forwarding goes to 0!

Is that behaviour expected?

Am I wrong to set up a router without gateway_enable="YES" in rc.conf but with net.inet.ip.forwarding=1 in sysctl.conf?

Thank you,
Raimundo Santos

From owner-freebsd-net@FreeBSD.ORG Thu Oct 31 19:03:14 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 36A42C87 for ; Thu, 31 Oct 2013 19:03:14 +0000 (UTC) (envelope-from egrosbein@rdtc.ru) Received: from eg.sd.rdtc.ru (eg.sd.rdtc.ru [IPv6:2a03:3100:c:13::5]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 7F22F28F2 for ; Thu, 31 Oct 2013 19:03:13 +0000 (UTC) X-Envelope-From: egrosbein@rdtc.ru X-Envelope-To: freebsd-net@freebsd.org Received: from eg.sd.rdtc.ru (eugen@localhost [127.0.0.1]) by eg.sd.rdtc.ru (8.14.7/8.14.7) with ESMTP id r9VJ347D046501; Fri, 1 Nov 2013 02:03:04 +0700 (NOVT) (envelope-from egrosbein@rdtc.ru) Message-ID: <5272A968.2050205@rdtc.ru> Date: Fri, 01 Nov 2013 02:03:04 +0700 From: Eugene Grosbein User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130415 Thunderbird/17.0.5 MIME-Version: 1.0 To:
Raimundo Santos Subject: Re: MPD PPTP seting 0 on net.inet.ip.forwarding References: In-Reply-To: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.3.2 X-Spam-Report: * -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP * -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% * [score: 0.0000] X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eg.sd.rdtc.ru Cc: "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Oct 2013 19:03:14 -0000 On 01.11.2013 01:39, Raimundo Santos wrote: > Ok, I have found some weird thing: > > > On 31 October 2013 16:08, Raimundo Santos wrote: > >> >> >> As you can see, there is no gateway_enable="YES", but there is >> net.inet.ip.forwarding=1 in /etc/sysctl.conf >> >> > MPD do not respect my configuration in sysctl.conf, only the one in > rc.conf. To test: > > * put net.inet.ip.forwarding and net.inet6.ip6.forwarding = 1 in sysctl.conf > * put gateway_enable="YES" in rc.conf > * connect to PPTP server > > You will see that net.inet.ip.forwarding, after PPTP connection are > stablished, remains 1, but net.inet6.ip6.forwarding goes to 0! > > Is that behaviour expected? > > Am I worng when setting a router without gateway_enable="YES" in rc.conf > but with net.inet.ip.forwarding=1 in sysctl.conf? That's not MPD's fault. That's FreeBSD 9.2's devd starting /etc/pccard_ether $subsystem start every time an interface is created. 
This leads to the start of

/etc/rc.d/netif quietstart $ifn

netif does LOTS of things, putting severe (and, for mpd, unneeded) load on the system and resetting net.inet.ip.forwarding to 0 if you don't have gateway_enable="YES" in your /etc/rc.conf.

I don't need devd so I just disabled it in rc.conf with devd_enable="NO". If you need it, just switch from sysctls to:

gateway_enable="YES"
ipv6_gateway_enable="YES"

To me this looks like a regression from 9.1 behavior for busy mpd-based BRASes, as performance of the box drops significantly due to the extra work performed by devd and its scripts.

From owner-freebsd-net@FreeBSD.ORG Thu Oct 31 20:57:43 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 61D1F393 for ; Thu, 31 Oct 2013 20:57:43 +0000 (UTC) (envelope-from raitech@gmail.com) Received: from mail-pa0-x233.google.com (mail-pa0-x233.google.com [IPv6:2607:f8b0:400e:c03::233]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 3B3A62200 for ; Thu, 31 Oct 2013 20:57:43 +0000 (UTC) Received: by mail-pa0-f51.google.com with SMTP id ld10so3063073pab.38 for ; Thu, 31 Oct 2013 13:57:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=Lu0wue6EFj0Bm3Qd25OWc50JSdRtClsYa4ikXtjjPoo=; b=iR+P1dIr6XhbTdG9oZdg/A60Uq0+W2cil/GmdczyCbrner0Jlf8Dodndug+JNr1A2u 7bPLZolOFgE+bk3BjRu3KAqRBCBpuXBKr3sePJAfoZh1wgz6R7H3BGSYcNGAA+Go8Hot l2koG5CcMx44pCpL8nD4vEbIcrTeq/3xf61d5qr7AEIk/vmoJ8HFz8zTYkqGTsDWq5II faoItqKkZSAZtconyVQxF1ZmWyl6d/QUWnb93R3xbR1AVBjrIH1D24uvdu7OBoyO10a0 AGedAODgdXGYMnJlL3CXKYeatEA7fih1qeE3DoLnrThLACeCuNSG6dZt99yjvSCsFxme 32Nw== X-Received: by 10.68.254.231 with SMTP id
al7mr3858603pbd.158.1383253062795; Thu, 31 Oct 2013 13:57:42 -0700 (PDT) MIME-Version: 1.0 Received: by 10.70.101.70 with HTTP; Thu, 31 Oct 2013 13:57:22 -0700 (PDT) In-Reply-To: <5272A968.2050205@rdtc.ru> References: <5272A968.2050205@rdtc.ru> From: Raimundo Santos Date: Thu, 31 Oct 2013 18:57:22 -0200 Message-ID: Subject: Re: MPD PPTP seting 0 on net.inet.ip.forwarding To: Eugene Grosbein Content-Type: text/plain; charset=KOI8-R X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Oct 2013 20:57:43 -0000

On 31 October 2013 17:03, Eugene Grosbein wrote:

> That's not MPD's fault. That's FreeBSD 9.2's devd starting
> /etc/pccard_ether $subsystem start
> every time an interface is created. This leads to start of
> /etc/rc.d/netif quietstart $ifn
>
> netif does LOTS of things making severe (and unneeded for mpd) load on the system
> and resetting net.inet.ip.forwarding to 0 if you don't have gateway_enable="YES"
> in your /etc/rc.conf

Good to know. Not a problem for me right now, but I will keep an eye on it.

> I don't need devd so I just disabled it in rc.conf with devd_enable="NO".
> If you need it, just switch from sysctls to:
>
> gateway_enable="YES"
> ipv6_gateway_enable="YES"

Yes, that was the solution that worked. I needed a quick and dirty VPN and ended up stopping my customer's network! But it's okay now, as I am such a good sysadm - heee...

Thank you, Eugene!

> This seems as regression from 9.1 behavior for me for busy mpd-based BRAS'es
> as performance of the box drops significantly due to extra work performed
> by devd and its scripts.
> > > From owner-freebsd-net@FreeBSD.ORG Thu Oct 31 21:58:11 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id E9BBE647 for ; Thu, 31 Oct 2013 21:58:11 +0000 (UTC) (envelope-from ole.myhre@dataoppdrag.no) Received: from mail2.dataoppdrag.no (mail2.dataoppdrag.no [IPv6:2a02:f58:7:2::2]) by mx1.freebsd.org (Postfix) with ESMTP id A355E2683 for ; Thu, 31 Oct 2013 21:58:11 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by mail2.dataoppdrag.no (Postfix) with ESMTP id DCA234058C for ; Thu, 31 Oct 2013 22:58:09 +0100 (CET) Received: from mail2.dataoppdrag.no ([127.0.0.1]) by localhost (mail2.dataoppdrag.no [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id SsO7wRr9ubbP for ; Thu, 31 Oct 2013 22:58:09 +0100 (CET) Received: from EX-MBX02.cust-d1.dataoppdrag.no (ex-mbx02.cust-d1.dataoppdrag.no [IPv6:2a02:f58:0:313:b898:7b82:13e0:c3bd]) by mail2.dataoppdrag.no (Postfix) with ESMTPS id B9DC340442 for ; Thu, 31 Oct 2013 22:58:09 +0100 (CET) Received: from EX-MBX01.cust-d1.dataoppdrag.no ([fe80::6db0:e393:6a07:457]) by EX-MBX02.cust-d1.dataoppdrag.no ([fe80::b898:7b82:13e0:c3bd%11]) with mapi id 14.02.0342.003; Thu, 31 Oct 2013 22:58:09 +0100 From: Ole Myhre To: "freebsd-net@freebsd.org" Subject: carp on 10.0 and ipv6 network route Thread-Topic: carp on 10.0 and ipv6 network route Thread-Index: Ac7WhEWGxJMHfReCSm+wVWEHkUtYuw== Date: Thu, 31 Oct 2013 21:58:08 +0000 Message-ID: Accept-Language: en-US, nb-NO Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [172.20.20.26] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Oct 2013 21:58:12 -0000

Hi,

I'm testing carp on 10.0-BETA2, and there seems to be different behaviour with the network route between IPv4 and IPv6 when using carp on interfaces. IPv4 routes are not present in the routing table when the interface is in BACKUP state (as expected), but IPv6 routes are present in the routing table in both BACKUP and MASTER state. This causes some issues with routing daemons as the network route is announced to other routers from both machines running carp.

[root@rtr1 ~]# ifconfig em2 vhid 1 192.168.0.1/24
[root@rtr2 ~]# ifconfig em2 vhid 1 192.168.0.1/24

[root@rtr1 ~]# ifconfig em2 | grep carp
carp: MASTER vhid 1 advbase 1 advskew 0
[root@rtr1 ~]# netstat -rn | grep 192.168.0.0
192.168.0.0/24 link#3 U 0 0 em2
[root@rtr1 ~]#

[root@rtr2 ~]# ifconfig em2 | grep carp
carp: BACKUP vhid 1 advbase 1 advskew 0
[root@rtr2 ~]# netstat -rn | grep 192.168.0.0
[root@rtr2 ~]#

[root@rtr1 ~]# ifconfig em2 inet6 2001:db8::1/64 vhid 1
[root@rtr2 ~]# ifconfig em2 inet6 2001:db8::1/64 vhid 1

[root@rtr1 ~]# ifconfig em2 | grep carp
carp: MASTER vhid 1 advbase 1 advskew 0
[root@rtr1 ~]# netstat -rn | grep 2001:db8::/64
2001:db8::/64 link#3 U em2
[root@rtr1 ~]#

[root@rtr2 ~]# ifconfig em2 | grep carp
carp: BACKUP vhid 1 advbase 1 advskew 0
[root@rtr2 ~]# netstat -rn | grep 2001:db8::/64
2001:db8::/64 link#3 U em2
[root@rtr2 ~]#

Thanks,
Ole

From owner-freebsd-net@FreeBSD.ORG Fri Nov 1 12:47:23 2013 Return-Path: Delivered-To: net@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 8CC43909; Fri, 1 Nov 2013 12:47:23 +0000 (UTC) (envelope-from glebius@FreeBSD.org) Received: from cell.glebius.int.ru (glebius.int.ru [81.19.69.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No
client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id CFBFF293F; Fri, 1 Nov 2013 12:47:22 +0000 (UTC) Received: from cell.glebius.int.ru (localhost [127.0.0.1]) by cell.glebius.int.ru (8.14.7/8.14.7) with ESMTP id rA1ClKJ5065572; Fri, 1 Nov 2013 16:47:20 +0400 (MSK) (envelope-from glebius@FreeBSD.org) Received: (from glebius@localhost) by cell.glebius.int.ru (8.14.7/8.14.7/Submit) id rA1ClKg5065571; Fri, 1 Nov 2013 16:47:20 +0400 (MSK) (envelope-from glebius@FreeBSD.org) X-Authentication-Warning: cell.glebius.int.ru: glebius set sender to glebius@FreeBSD.org using -f Date: Fri, 1 Nov 2013 16:47:20 +0400 From: Gleb Smirnoff To: net@FreeBSD.org, current@FreeBSD.org Subject: [CFT & review] new in_control() Message-ID: <20131101124720.GF52889@FreeBSD.org> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="SvF6CGw9fzJC4Rcx" Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Nov 2013 12:47:23 -0000

--SvF6CGw9fzJC4Rcx
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi!

I've got a patch that cleans up the way we configure and delete IPv4 on interfaces. What it does:

1) separate function for SIOCAIFADDR, with clear code flow from beginning to the end.

2) separate function for SIOCDIFADDR, with clear code flow from beginning to the end.

3) provided 1) and 2) the in_control() got very thin and clear.

The above wasn't just a cut&paste job, instead every step taken was evaluated. I've cut quite a lot of strange code, added extra sanity checking and provided comments on the strange code that remains.

4) sx(9) lock covers entire SIOCAIFADDR/SIOCDIFADDR operation, so we close races ifconfig vs ifconfig, or ifconfig vs mpd.
On interface detach SIOCDIFADDR is called w/o sx(9), but its operation is covered by IF_ADDR_LOCK().

Also, apart from the redesign of SIOCAIFADDR/SIOCDIFADDR, the following two related changes leaked into the patch. It is possible to separate them out, but it won't be easy.

5) Removed the useloopback conditional. Rationale:
- the option was always on since pre-FreeBSD times
- the sysctl knob lives in an invalid (ethernet) namespace, and is documented in the wrong (arp(8)) place.
- since new-ARP, the knob was consulted on route addition, but was ignored on delete.
- operation of the network stack with useloopback=0 is strange

The only reason for running with useloopback=0 could be a router that doesn't want to pollute a large network with its /32 announces. However, this can be achieved with filtering in routing daemons.

6) Correctly implemented the code from r201282, which tried to keep the localhost route in the table when multiple P2P interfaces with the same local address are created and deleted.

Checking in this code can cause problems. I could have made mistakes, and some program that relied on the strange behavior can pop up. Thus, early testing is appreciated. So far I have tested simple address assignment, CARP, and mpd5 as an L2TP access concentrator.

Advice for reviewers: do not look at the diff, look at the patched in.c instead.

--
Totus tuus, Glebius.
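
For anyone who wants to check the address defaults without booting a patched kernel, here is a minimal userland sketch of what the attached diff appears to do in in_aifaddr_ioctl() when SIOCAIFADDR arrives without a netmask or broadcast address. The helper names classful_mask() and default_broadcast() are hypothetical and exist only for this illustration. All values are assumed to be in host byte order, and the literal constants are assumed to match what the kernel's IN_CLASSA/IN_CLASSB macros and IN_RFC3021_MASK expand to.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not part of the patch.
 * All addresses and masks are in host byte order.
 */

/* Guess a netmask from the historical address class when none is given. */
static uint32_t
classful_mask(uint32_t addr)
{
	if ((addr & 0x80000000u) == 0)			/* class A: 0xxxxxxx */
		return (0xff000000u);
	if ((addr & 0xc0000000u) == 0x80000000u)	/* class B: 10xxxxxx */
		return (0xffff0000u);
	return (0xffffff00u);				/* anything else: /24 */
}

/* Derive the broadcast address when the caller did not supply one. */
static uint32_t
default_broadcast(uint32_t addr, uint32_t mask)
{
	if (mask == 0xfffffffeu)	/* /31 link (RFC 3021): limited broadcast */
		return (0xffffffffu);
	return ((addr & mask) | ~mask);	/* subnet with all host bits set */
}
```

With these helpers, 172.16.1.20 (0xac100114) falls into class B and gets a /16 mask, which is a good reason to always pass an explicit netmask to ifconfig rather than rely on the fallback.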
--SvF6CGw9fzJC4Rcx Content-Type: text/x-diff; charset=us-ascii Content-Disposition: attachment; filename="in_control.diff" Index: sys/net/if.c =================================================================== --- sys/net/if.c (revision 257503) +++ sys/net/if.c (working copy) @@ -1525,6 +1525,25 @@ ifa_del_loopback_route(struct ifaddr *ifa, struct return (error); } +int +ifa_switch_loopback_route(struct ifaddr *ifa, struct sockaddr *sa) +{ + struct rtentry *rt; + + rt = rtalloc1_fib(sa, 0, 0, 0); + if (rt == NULL) { + log(LOG_DEBUG, "%s: fail", __func__); + return (EHOSTUNREACH); + } + ((struct sockaddr_dl *)rt->rt_gateway)->sdl_type = + ifa->ifa_ifp->if_type; + ((struct sockaddr_dl *)rt->rt_gateway)->sdl_index = + ifa->ifa_ifp->if_index; + RTFREE_LOCKED(rt); + + return (0); +} + /* * XXX: Because sockaddr_dl has deeper structure than the sockaddr * structs used to represent other address families, it is necessary Index: sys/net/if_var.h =================================================================== --- sys/net/if_var.h (revision 257503) +++ sys/net/if_var.h (working copy) @@ -491,6 +491,7 @@ struct ifnet *ifunit_ref(const char *); int ifa_add_loopback_route(struct ifaddr *, struct sockaddr *); int ifa_del_loopback_route(struct ifaddr *, struct sockaddr *); +int ifa_switch_loopback_route(struct ifaddr *, struct sockaddr *); struct ifaddr *ifa_ifwithaddr(struct sockaddr *); int ifa_ifwithaddr_check(struct sockaddr *); Index: sys/netinet/if_ether.c =================================================================== --- sys/netinet/if_ether.c (revision 257503) +++ sys/netinet/if_ether.c (working copy) @@ -85,8 +85,6 @@ static SYSCTL_NODE(_net_link_ether, PF_ARP, arp, C static VNET_DEFINE(int, arpt_keep) = (20*60); /* once resolved, good for 20 * minutes */ static VNET_DEFINE(int, arp_maxtries) = 5; -VNET_DEFINE(int, useloopback) = 1; /* use loopback interface for - * local traffic */ static VNET_DEFINE(int, arp_proxyall) = 0; static VNET_DEFINE(int, arpt_down) = 
20; /* keep incomplete entries for * 20 seconds */ @@ -111,9 +109,6 @@ SYSCTL_VNET_INT(_net_link_ether_inet, OID_AUTO, ma SYSCTL_VNET_INT(_net_link_ether_inet, OID_AUTO, maxtries, CTLFLAG_RW, &VNET_NAME(arp_maxtries), 0, "ARP resolution attempts before returning error"); -SYSCTL_VNET_INT(_net_link_ether_inet, OID_AUTO, useloopback, CTLFLAG_RW, - &VNET_NAME(useloopback), 0, - "Use the loopback interface for local traffic"); SYSCTL_VNET_INT(_net_link_ether_inet, OID_AUTO, proxyall, CTLFLAG_RW, &VNET_NAME(arp_proxyall), 0, "Enable proxy ARP for all suitable requests"); Index: sys/netinet/in.c =================================================================== --- sys/netinet/in.c (revision 257503) +++ sys/netinet/in.c (working copy) @@ -47,6 +47,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #include #include @@ -71,10 +72,10 @@ static int in_mask2len(struct in_addr *); static void in_len2mask(struct in_addr *, int); static int in_lifaddr_ioctl(struct socket *, u_long, caddr_t, struct ifnet *, struct thread *); +static int in_aifaddr_ioctl(caddr_t, struct ifnet *, struct thread *); +static int in_difaddr_ioctl(caddr_t, struct ifnet *, struct thread *); static void in_socktrim(struct sockaddr_in *); -static int in_ifinit(struct ifnet *, struct in_ifaddr *, - struct sockaddr_in *, int, int); static void in_purgemaddrs(struct ifnet *); static VNET_DEFINE(int, nosameprefix); @@ -86,6 +87,9 @@ SYSCTL_VNET_INT(_net_inet_ip, OID_AUTO, no_same_pr VNET_DECLARE(struct inpcbinfo, ripcbinfo); #define V_ripcbinfo VNET(ripcbinfo) +static struct sx in_control_sx; +SX_SYSINIT(in_control_sx, &in_control_sx, "in_control"); + /* * Return 1 if an internet address is for a ``local'' host * (one to which we have a connection). @@ -128,6 +132,28 @@ in_localip(struct in_addr in) } /* + * Return an address equal to the supplied one, but not the same. 
+ */ +static struct in_ifaddr * +more_localip(struct in_ifaddr *ia) +{ + in_addr_t in = IA_SIN(ia)->sin_addr.s_addr; + struct in_ifaddr *it; + + IN_IFADDR_RLOCK(); + LIST_FOREACH(it, INADDR_HASH(in), ia_hash) { + if (it != ia && IA_SIN(it)->sin_addr.s_addr == in) { + ifa_ref(&it->ia_ifa); + IN_IFADDR_RUNLOCK(); + return (it); + } + } + IN_IFADDR_RUNLOCK(); + + return (NULL); +} + +/* * Determine whether an IP address is in a reserved set of addresses * that may not be forwarded, or whether datagrams to that destination * may be forwarded. @@ -203,40 +229,22 @@ in_len2mask(struct in_addr *mask, int len) /* * Generic internet control operations (ioctl's). - * - * ifp is NULL if not an interface-specific ioctl. */ -/* ARGSUSED */ int in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, struct thread *td) { - register struct ifreq *ifr = (struct ifreq *)data; - register struct in_ifaddr *ia, *iap; - register struct ifaddr *ifa; - struct in_addr allhosts_addr; - struct in_addr dst; - struct in_ifinfo *ii; - struct in_aliasreq *ifra = (struct in_aliasreq *)data; - int error, hostIsNew, iaIsNew, maskIsNew; - int iaIsFirst; - u_long ocmd = cmd; + struct ifreq *ifr = (struct ifreq *)data; + struct sockaddr_in *addr = (struct sockaddr_in *)&ifr->ifr_addr; + struct in_ifaddr *ia; + int error; - /* - * Pre-10.x compat: OSIOCAIFADDR passes a shorter - * struct in_aliasreq, without ifra_vhid. - */ - if (cmd == OSIOCAIFADDR) - cmd = SIOCAIFADDR; + if (ifp == NULL) + return (EADDRNOTAVAIL); - ia = NULL; - iaIsFirst = 0; - iaIsNew = 0; - allhosts_addr.s_addr = htonl(INADDR_ALLHOSTS_GROUP); - /* - * Filter out ioctls we implement directly; forward the rest on to - * in_lifaddr_ioctl() and ifp->if_ioctl(). + * Filter out 4 ioctls we implement directly. Forward the rest + * to specific functions and ifp->if_ioctl(). 
*/ switch (cmd) { case SIOCGIFADDR: @@ -243,34 +251,21 @@ in_control(struct socket *so, u_long cmd, caddr_t case SIOCGIFBRDADDR: case SIOCGIFDSTADDR: case SIOCGIFNETMASK: + break; case SIOCDIFADDR: - break; + sx_xlock(&in_control_sx); + error = in_difaddr_ioctl(data, ifp, td); + sx_xunlock(&in_control_sx); + return (error); case SIOCAIFADDR: - /* - * ifra_addr must be present and be of INET family. - * ifra_broadaddr and ifra_mask are optional. - */ - if (ifra->ifra_addr.sin_len != sizeof(struct sockaddr_in) || - ifra->ifra_addr.sin_family != AF_INET) - return (EINVAL); - if (ifra->ifra_broadaddr.sin_len != 0 && - (ifra->ifra_broadaddr.sin_len != - sizeof(struct sockaddr_in) || - ifra->ifra_broadaddr.sin_family != AF_INET)) - return (EINVAL); -#if 0 - /* - * ifconfig(8) in pre-10.x doesn't set sin_family for the - * mask. The code is disabled for the 10.x timeline, to - * make SIOCAIFADDR compatible with 9.x ifconfig(8). - * The code should be enabled in 11.x - */ - if (ifra->ifra_mask.sin_len != 0 && - (ifra->ifra_mask.sin_len != sizeof(struct sockaddr_in) || - ifra->ifra_mask.sin_family != AF_INET)) - return (EINVAL); -#endif - break; + sx_xlock(&in_control_sx); + error = in_aifaddr_ioctl(data, ifp, td); + sx_xunlock(&in_control_sx); + return (error); + case SIOCALIFADDR: + case SIOCDLIFADDR: + case SIOCGLIFADDR: + return (in_lifaddr_ioctl(so, cmd, data, ifp, td)); case SIOCSIFADDR: case SIOCSIFBRDADDR: case SIOCSIFDSTADDR: @@ -277,306 +272,353 @@ in_control(struct socket *so, u_long cmd, caddr_t case SIOCSIFNETMASK: /* We no longer support that old commands. 
*/ return (EINVAL); - - case SIOCALIFADDR: - if (td != NULL) { - error = priv_check(td, PRIV_NET_ADDIFADDR); - if (error) - return (error); - } - if (ifp == NULL) - return (EINVAL); - return in_lifaddr_ioctl(so, cmd, data, ifp, td); - - case SIOCDLIFADDR: - if (td != NULL) { - error = priv_check(td, PRIV_NET_DELIFADDR); - if (error) - return (error); - } - if (ifp == NULL) - return (EINVAL); - return in_lifaddr_ioctl(so, cmd, data, ifp, td); - - case SIOCGLIFADDR: - if (ifp == NULL) - return (EINVAL); - return in_lifaddr_ioctl(so, cmd, data, ifp, td); - default: - if (ifp == NULL || ifp->if_ioctl == NULL) + if (ifp->if_ioctl == NULL) return (EOPNOTSUPP); return ((*ifp->if_ioctl)(ifp, cmd, data)); } - if (ifp == NULL) - return (EADDRNOTAVAIL); - /* - * Security checks before we get involved in any work. - */ - switch (cmd) { - case SIOCAIFADDR: - if (td != NULL) { - error = priv_check(td, PRIV_NET_ADDIFADDR); - if (error) - return (error); - } - break; - - case SIOCDIFADDR: - if (td != NULL) { - error = priv_check(td, PRIV_NET_DELIFADDR); - if (error) - return (error); - } - break; - } - - /* * Find address for this interface, if it exists. - * - * If an alias address was specified, find that one instead of the - * first one on the interface, if possible. 
*/ - dst = ((struct sockaddr_in *)&ifr->ifr_addr)->sin_addr; IN_IFADDR_RLOCK(); - LIST_FOREACH(iap, INADDR_HASH(dst.s_addr), ia_hash) { - if (iap->ia_ifp == ifp && - iap->ia_addr.sin_addr.s_addr == dst.s_addr) { - if (td == NULL || prison_check_ip4(td->td_ucred, - &dst) == 0) - ia = iap; + LIST_FOREACH(ia, INADDR_HASH(addr->sin_addr.s_addr), ia_hash) { + if (ia->ia_ifp == ifp && + ia->ia_addr.sin_addr.s_addr == addr->sin_addr.s_addr && + prison_check_ip4(td->td_ucred, &addr->sin_addr) == 0) break; - } } - if (ia != NULL) - ifa_ref(&ia->ia_ifa); - IN_IFADDR_RUNLOCK(); + if (ia == NULL) { - IF_ADDR_RLOCK(ifp); - TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { - iap = ifatoia(ifa); - if (iap->ia_addr.sin_family == AF_INET) { - if (td != NULL && - prison_check_ip4(td->td_ucred, - &iap->ia_addr.sin_addr) != 0) - continue; - ia = iap; - break; - } - } - if (ia != NULL) - ifa_ref(&ia->ia_ifa); - IF_ADDR_RUNLOCK(ifp); + IN_IFADDR_RUNLOCK(); + return (EADDRNOTAVAIL); } - if (ia == NULL) - iaIsFirst = 1; error = 0; switch (cmd) { - case SIOCAIFADDR: - case SIOCDIFADDR: - if (ifra->ifra_addr.sin_family == AF_INET) { - struct in_ifaddr *oia; + case SIOCGIFADDR: + *addr = ia->ia_addr; + break; - IN_IFADDR_RLOCK(); - for (oia = ia; ia; ia = TAILQ_NEXT(ia, ia_link)) { - if (ia->ia_ifp == ifp && - ia->ia_addr.sin_addr.s_addr == - ifra->ifra_addr.sin_addr.s_addr) - break; - } - if (ia != NULL && ia != oia) - ifa_ref(&ia->ia_ifa); - if (oia != NULL && ia != oia) - ifa_free(&oia->ia_ifa); - IN_IFADDR_RUNLOCK(); - if ((ifp->if_flags & IFF_POINTOPOINT) - && (cmd == SIOCAIFADDR) - && (ifra->ifra_dstaddr.sin_addr.s_addr - == INADDR_ANY)) { - error = EDESTADDRREQ; - goto out; - } + case SIOCGIFBRDADDR: + if ((ifp->if_flags & IFF_BROADCAST) == 0) { + error = EINVAL; + break; } - if (cmd == SIOCDIFADDR && ia == NULL) { - error = EADDRNOTAVAIL; - goto out; - } - if (ia == NULL) { - ifa = ifa_alloc(sizeof(struct in_ifaddr), M_WAITOK); - ia = (struct in_ifaddr *)ifa; - ifa->ifa_addr = 
(struct sockaddr *)&ia->ia_addr; - ifa->ifa_dstaddr = (struct sockaddr *)&ia->ia_dstaddr; - ifa->ifa_netmask = (struct sockaddr *)&ia->ia_sockmask; + *addr = ia->ia_broadaddr; + break; - ia->ia_sockmask.sin_len = 8; - ia->ia_sockmask.sin_family = AF_INET; - if (ifp->if_flags & IFF_BROADCAST) { - ia->ia_broadaddr.sin_len = sizeof(ia->ia_addr); - ia->ia_broadaddr.sin_family = AF_INET; - } - ia->ia_ifp = ifp; - - ifa_ref(ifa); /* if_addrhead */ - IF_ADDR_WLOCK(ifp); - TAILQ_INSERT_TAIL(&ifp->if_addrhead, ifa, ifa_link); - IF_ADDR_WUNLOCK(ifp); - ifa_ref(ifa); /* in_ifaddrhead */ - IN_IFADDR_WLOCK(); - TAILQ_INSERT_TAIL(&V_in_ifaddrhead, ia, ia_link); - IN_IFADDR_WUNLOCK(); - iaIsNew = 1; + case SIOCGIFDSTADDR: + if ((ifp->if_flags & IFF_POINTOPOINT) == 0) { + error = EINVAL; + break; } + *addr = ia->ia_dstaddr; break; - case SIOCGIFADDR: case SIOCGIFNETMASK: - case SIOCGIFDSTADDR: - case SIOCGIFBRDADDR: - if (ia == NULL) { - error = EADDRNOTAVAIL; - goto out; - } + *addr = ia->ia_sockmask; break; } + IN_IFADDR_RUNLOCK(); + + return (error); +} + +static int +in_aifaddr_ioctl(caddr_t data, struct ifnet *ifp, struct thread *td) +{ + const struct in_aliasreq *ifra = (struct in_aliasreq *)data; + const struct sockaddr_in *addr = &ifra->ifra_addr; + const struct sockaddr_in *broadaddr = &ifra->ifra_broadaddr; + const struct sockaddr_in *mask = &ifra->ifra_mask; + const struct sockaddr_in *dstaddr = &ifra->ifra_dstaddr; + const int vhid = ifra->ifra_vhid; + struct ifaddr *ifa; + struct in_ifaddr *ia; + bool iaIsFirst; + int error = 0; + + error = priv_check(td, PRIV_NET_ADDIFADDR); + if (error) + return (error); + /* - * Most paths in this switch return directly or via out. Only paths - * that remove the address break in order to hit common removal code. + * ifra_addr must be present and be of INET family. + * ifra_broadaddr/ifra_dstaddr and ifra_mask are optional. 
*/ - switch (cmd) { - case SIOCGIFADDR: - *((struct sockaddr_in *)&ifr->ifr_addr) = ia->ia_addr; - goto out; + if (addr->sin_len != sizeof(struct sockaddr_in) || + addr->sin_family != AF_INET) + return (EINVAL); + if (broadaddr->sin_len != 0 && + (broadaddr->sin_len != sizeof(struct sockaddr_in) || + broadaddr->sin_family != AF_INET)) + return (EINVAL); + if (mask->sin_len != 0 && + (mask->sin_len != sizeof(struct sockaddr_in) || + mask->sin_family != AF_INET)) + return (EINVAL); + if ((ifp->if_flags & IFF_POINTOPOINT) && + (dstaddr->sin_len != sizeof(struct sockaddr_in) || + dstaddr->sin_addr.s_addr == INADDR_ANY)) + return (EDESTADDRREQ); + if (vhid > 0 && carp_attach_p == NULL) + return (EPROTONOSUPPORT); - case SIOCGIFBRDADDR: - if ((ifp->if_flags & IFF_BROADCAST) == 0) { - error = EINVAL; - goto out; - } - *((struct sockaddr_in *)&ifr->ifr_dstaddr) = ia->ia_broadaddr; - goto out; + /* + * See whether address already exist. + */ + iaIsFirst = true; + ia = NULL; + IF_ADDR_RLOCK(ifp); + TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { + struct in_ifaddr *it = ifatoia(ifa); - case SIOCGIFDSTADDR: - if ((ifp->if_flags & IFF_POINTOPOINT) == 0) { - error = EINVAL; - goto out; - } - *((struct sockaddr_in *)&ifr->ifr_dstaddr) = ia->ia_dstaddr; - goto out; + if (it->ia_addr.sin_family != AF_INET) + continue; - case SIOCGIFNETMASK: - *((struct sockaddr_in *)&ifr->ifr_addr) = ia->ia_sockmask; - goto out; + iaIsFirst = false; + if (it->ia_addr.sin_addr.s_addr == addr->sin_addr.s_addr && + prison_check_ip4(td->td_ucred, &addr->sin_addr) == 0) + ia = it; + } + IF_ADDR_RUNLOCK(ifp); - case SIOCAIFADDR: - maskIsNew = 0; - hostIsNew = 1; - error = 0; - if (ifra->ifra_addr.sin_addr.s_addr == - ia->ia_addr.sin_addr.s_addr) - hostIsNew = 0; - if (ifra->ifra_mask.sin_len) { - /* - * QL: XXX - * Need to scrub the prefix here in case - * the issued command is SIOCAIFADDR with - * the same address, but with a different - * prefix length. 
And if the prefix length - * is the same as before, then the call is - * un-necessarily executed here. - */ - in_scrubprefix(ia, LLE_STATIC); - ia->ia_sockmask = ifra->ifra_mask; - ia->ia_sockmask.sin_family = AF_INET; - ia->ia_subnetmask = - ntohl(ia->ia_sockmask.sin_addr.s_addr); - maskIsNew = 1; - } - if ((ifp->if_flags & IFF_POINTOPOINT) && - (ifra->ifra_dstaddr.sin_family == AF_INET)) { - in_scrubprefix(ia, LLE_STATIC); - ia->ia_dstaddr = ifra->ifra_dstaddr; - maskIsNew = 1; /* We lie; but the effect's the same */ - } - if (hostIsNew || maskIsNew) - error = in_ifinit(ifp, ia, &ifra->ifra_addr, maskIsNew, - (ocmd == cmd ? ifra->ifra_vhid : 0)); - if (error != 0 && iaIsNew) - break; + if (ia != NULL) + (void )in_difaddr_ioctl(data, ifp, td); - if ((ifp->if_flags & IFF_BROADCAST) && - ifra->ifra_broadaddr.sin_len) - ia->ia_broadaddr = ifra->ifra_broadaddr; - if (error == 0) { - ii = ((struct in_ifinfo *)ifp->if_afdata[AF_INET]); - if (iaIsFirst && - (ifp->if_flags & IFF_MULTICAST) != 0) { - error = in_joingroup(ifp, &allhosts_addr, - NULL, &ii->ii_allhosts); - } - EVENTHANDLER_INVOKE(ifaddr_event, ifp); - } - goto out; + ifa = ifa_alloc(sizeof(struct in_ifaddr), M_WAITOK); + ia = (struct in_ifaddr *)ifa; + ifa->ifa_addr = (struct sockaddr *)&ia->ia_addr; + ifa->ifa_dstaddr = (struct sockaddr *)&ia->ia_dstaddr; + ifa->ifa_netmask = (struct sockaddr *)&ia->ia_sockmask; - case SIOCDIFADDR: - /* - * in_scrubprefix() kills the interface route. - */ - in_scrubprefix(ia, LLE_STATIC); + ia->ia_ifp = ifp; + ia->ia_ifa.ifa_metric = ifp->if_metric; + ia->ia_addr = *addr; + if (mask->sin_len != 0) { + ia->ia_sockmask = *mask; + ia->ia_subnetmask = ntohl(ia->ia_sockmask.sin_addr.s_addr); + } else { + in_addr_t i = ntohl(addr->sin_addr.s_addr); /* - * in_ifadown gets rid of all the rest of - * the routes. This is not quite the right - * thing to do, but at least if we are running - * a routing process they will come back. 
- */ - in_ifadown(&ia->ia_ifa, 1); - EVENTHANDLER_INVOKE(ifaddr_event, ifp); - error = 0; - break; + * Be compatible with network classes, if netmask isn't + * supplied, guess it based on classes. + */ + if (IN_CLASSA(i)) + ia->ia_subnetmask = IN_CLASSA_NET; + else if (IN_CLASSB(i)) + ia->ia_subnetmask = IN_CLASSB_NET; + else + ia->ia_subnetmask = IN_CLASSC_NET; + ia->ia_sockmask.sin_addr.s_addr = htonl(ia->ia_subnetmask); + } + ia->ia_subnet = ntohl(addr->sin_addr.s_addr) & ia->ia_subnetmask; + in_socktrim(&ia->ia_sockmask); - default: - panic("in_control: unsupported ioctl"); + if (ifp->if_flags & IFF_BROADCAST) { + if (broadaddr->sin_len != 0) { + ia->ia_broadaddr = *broadaddr; + } else if (ia->ia_subnetmask == IN_RFC3021_MASK) { + ia->ia_broadaddr.sin_addr.s_addr = INADDR_BROADCAST; + ia->ia_broadaddr.sin_len = sizeof(struct sockaddr_in); + ia->ia_broadaddr.sin_family = AF_INET; + } else { + ia->ia_broadaddr.sin_addr.s_addr = + htonl(ia->ia_subnet | ~ia->ia_subnetmask); + ia->ia_broadaddr.sin_len = sizeof(struct sockaddr_in); + ia->ia_broadaddr.sin_family = AF_INET; + } } + if (ifp->if_flags & IFF_POINTOPOINT) + ia->ia_dstaddr = *dstaddr; + + /* XXXGL: rtinit() needs this strange assignment. */ + if (ifp->if_flags & IFF_LOOPBACK) + ia->ia_dstaddr = ia->ia_addr; + + ifa_ref(ifa); /* if_addrhead */ + IF_ADDR_WLOCK(ifp); + TAILQ_INSERT_TAIL(&ifp->if_addrhead, ifa, ifa_link); + IF_ADDR_WUNLOCK(ifp); + + ifa_ref(ifa); /* in_ifaddrhead */ + IN_IFADDR_WLOCK(); + TAILQ_INSERT_TAIL(&V_in_ifaddrhead, ia, ia_link); + LIST_INSERT_HEAD(INADDR_HASH(ia->ia_addr.sin_addr.s_addr), ia, ia_hash); + IN_IFADDR_WUNLOCK(); + + if (vhid != 0) + error = (*carp_attach_p)(&ia->ia_ifa, vhid); + if (error) + goto fail1; + + /* + * Give the interface a chance to initialize + * if this is its first address, + * and to validate the address if necessary. 
+ */ + if (ifp->if_ioctl != NULL) + error = (*ifp->if_ioctl)(ifp, SIOCSIFADDR, (caddr_t)ia); + if (error) + goto fail2; + + /* + * Add route for the network. + */ + if (vhid == 0) { + int flags = RTF_UP; + + if (ifp->if_flags & (IFF_LOOPBACK|IFF_POINTOPOINT)) + flags |= RTF_HOST; + + error = in_addprefix(ia, flags); + if (error) + goto fail2; + } + + /* + * Add a loopback route to self. + */ + if (vhid == 0 && (ifp->if_flags & IFF_LOOPBACK) == 0 && + ia->ia_addr.sin_addr.s_addr != INADDR_ANY) { + struct in_ifaddr *eia; + + eia = more_localip(ia); + + if (eia == NULL) { + error = ifa_add_loopback_route((struct ifaddr *)ia, + (struct sockaddr *)&ia->ia_addr); + if (error) + goto fail3; + } else + ifa_free(&eia->ia_ifa); + } + + if (iaIsFirst && (ifp->if_flags & IFF_MULTICAST)) { + struct in_addr allhosts_addr; + struct in_ifinfo *ii; + + ii = ((struct in_ifinfo *)ifp->if_afdata[AF_INET]); + allhosts_addr.s_addr = htonl(INADDR_ALLHOSTS_GROUP); + + error = in_joingroup(ifp, &allhosts_addr, NULL, + &ii->ii_allhosts); + } + + EVENTHANDLER_INVOKE(ifaddr_event, ifp); + + return (error); + +fail3: + if (vhid == 0) + (void )in_scrubprefix(ia, LLE_STATIC); + +fail2: if (ia->ia_ifa.ifa_carp) (*carp_detach_p)(&ia->ia_ifa); +fail1: IF_ADDR_WLOCK(ifp); - /* Re-check that ia is still part of the list. 
*/ + TAILQ_REMOVE(&ifp->if_addrhead, &ia->ia_ifa, ifa_link); + IF_ADDR_WUNLOCK(ifp); + ifa_free(&ia->ia_ifa); + + IN_IFADDR_WLOCK(); + TAILQ_REMOVE(&V_in_ifaddrhead, ia, ia_link); + LIST_REMOVE(ia, ia_hash); + IN_IFADDR_WUNLOCK(); + ifa_free(&ia->ia_ifa); + + return (error); +} + +static int +in_difaddr_ioctl(caddr_t data, struct ifnet *ifp, struct thread *td) +{ + const struct ifreq *ifr = (struct ifreq *)data; + const struct sockaddr_in *addr = (struct sockaddr_in *)&ifr->ifr_addr; + struct ifaddr *ifa; + struct in_ifaddr *ia; + bool deleteAny, iaIsLast; + int error; + + if (td != NULL) { + error = priv_check(td, PRIV_NET_DELIFADDR); + if (error) + return (error); + } + + if (addr->sin_len != sizeof(struct sockaddr_in) || + addr->sin_family != AF_INET) + deleteAny = true; + else + deleteAny = false; + + iaIsLast = true; + ia = NULL; + IF_ADDR_WLOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { - if (ifa == &ia->ia_ifa) - break; + struct in_ifaddr *it = ifatoia(ifa); + + if (it->ia_addr.sin_family != AF_INET) + continue; + + if (deleteAny && ia == NULL && (td == NULL || + prison_check_ip4(td->td_ucred, &it->ia_addr.sin_addr) == 0)) + ia = it; + + if (it->ia_addr.sin_addr.s_addr == addr->sin_addr.s_addr && + (td == NULL || prison_check_ip4(td->td_ucred, + &addr->sin_addr) == 0)) + ia = it; + + if (it != ia) + iaIsLast = false; } - if (ifa == NULL) { - /* - * If we lost the race with another thread, there is no need to - * try it again for the next loop as there is no other exit - * path between here and out. 
- */ + + if (ia == NULL) { IF_ADDR_WUNLOCK(ifp); - error = EADDRNOTAVAIL; - goto out; + return (EADDRNOTAVAIL); } + TAILQ_REMOVE(&ifp->if_addrhead, &ia->ia_ifa, ifa_link); IF_ADDR_WUNLOCK(ifp); - ifa_free(&ia->ia_ifa); /* if_addrhead */ + ifa_free(&ia->ia_ifa); /* if_addrhead */ IN_IFADDR_WLOCK(); TAILQ_REMOVE(&V_in_ifaddrhead, ia, ia_link); - LIST_REMOVE(ia, ia_hash); IN_IFADDR_WUNLOCK(); + ifa_free(&ia->ia_ifa); /* in_ifaddrhead */ + /* + * in_scrubprefix() kills the interface route. + */ + in_scrubprefix(ia, LLE_STATIC); + + /* + * in_ifadown gets rid of all the rest of + * the routes. This is not quite the right + * thing to do, but at least if we are running + * a routing process they will come back. + */ + in_ifadown(&ia->ia_ifa, 1); + + if (ia->ia_ifa.ifa_carp) + (*carp_detach_p)(&ia->ia_ifa); + + /* * If this is the last IPv4 address configured on this * interface, leave the all-hosts group. * No state-change report need be transmitted. */ - IFP_TO_IA(ifp, iap); - if (iap == NULL) { + if (iaIsLast && (ifp->if_flags & IFF_MULTICAST)) { + struct in_ifinfo *ii; + ii = ((struct in_ifinfo *)ifp->if_afdata[AF_INET]); IN_MULTI_LOCK(); if (ii->ii_allhosts) { @@ -584,14 +626,11 @@ in_control(struct socket *so, u_long cmd, caddr_t ii->ii_allhosts = NULL; } IN_MULTI_UNLOCK(); - } else - ifa_free(&iap->ia_ifa); + } - ifa_free(&ia->ia_ifa); /* in_ifaddrhead */ -out: - if (ia != NULL) - ifa_free(&ia->ia_ifa); - return (error); + EVENTHANDLER_INVOKE(ifaddr_event, ifp); + + return (0); } /* @@ -616,11 +655,23 @@ in_lifaddr_ioctl(struct socket *so, u_long cmd, ca { struct if_laddrreq *iflr = (struct if_laddrreq *)data; struct ifaddr *ifa; + int error; - /* sanity checks */ - if (data == NULL || ifp == NULL) { - panic("invalid argument to in_lifaddr_ioctl"); - /*NOTRECHED*/ + switch (cmd) { + case SIOCALIFADDR: + if (td != NULL) { + error = priv_check(td, PRIV_NET_ADDIFADDR); + if (error) + return (error); + } + break; + case SIOCDLIFADDR: + if (td != NULL) { + error = 
priv_check(td, PRIV_NET_DELIFADDR); + if (error) + return (error); + } + break; } switch (cmd) { @@ -770,115 +821,6 @@ in_lifaddr_ioctl(struct socket *so, u_long cmd, ca return (EOPNOTSUPP); /*just for safety*/ } -/* - * Initialize an interface's internet address - * and routing table entry. - */ -static int -in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct sockaddr_in *sin, - int masksupplied, int vhid) -{ - register u_long i = ntohl(sin->sin_addr.s_addr); - int flags, error = 0; - - IN_IFADDR_WLOCK(); - if (ia->ia_addr.sin_family == AF_INET) - LIST_REMOVE(ia, ia_hash); - ia->ia_addr = *sin; - LIST_INSERT_HEAD(INADDR_HASH(ia->ia_addr.sin_addr.s_addr), - ia, ia_hash); - IN_IFADDR_WUNLOCK(); - - if (vhid > 0) { - if (carp_attach_p != NULL) - error = (*carp_attach_p)(&ia->ia_ifa, vhid); - else - error = EPROTONOSUPPORT; - } - if (error) - return (error); - - /* - * Give the interface a chance to initialize - * if this is its first address, - * and to validate the address if necessary. - */ - if (ifp->if_ioctl != NULL && - (error = (*ifp->if_ioctl)(ifp, SIOCSIFADDR, (caddr_t)ia)) != 0) - /* LIST_REMOVE(ia, ia_hash) is done in in_control */ - return (error); - - /* - * Be compatible with network classes, if netmask isn't supplied, - * guess it based on classes. - */ - if (!masksupplied) { - if (IN_CLASSA(i)) - ia->ia_subnetmask = IN_CLASSA_NET; - else if (IN_CLASSB(i)) - ia->ia_subnetmask = IN_CLASSB_NET; - else - ia->ia_subnetmask = IN_CLASSC_NET; - ia->ia_sockmask.sin_addr.s_addr = htonl(ia->ia_subnetmask); - } - ia->ia_subnet = i & ia->ia_subnetmask; - in_socktrim(&ia->ia_sockmask); - - /* - * Add route for the network. 
- */ - flags = RTF_UP; - ia->ia_ifa.ifa_metric = ifp->if_metric; - if (ifp->if_flags & IFF_BROADCAST) { - if (ia->ia_subnetmask == IN_RFC3021_MASK) - ia->ia_broadaddr.sin_addr.s_addr = INADDR_BROADCAST; - else - ia->ia_broadaddr.sin_addr.s_addr = - htonl(ia->ia_subnet | ~ia->ia_subnetmask); - } else if (ifp->if_flags & IFF_LOOPBACK) { - ia->ia_dstaddr = ia->ia_addr; - flags |= RTF_HOST; - } else if (ifp->if_flags & IFF_POINTOPOINT) { - if (ia->ia_dstaddr.sin_family != AF_INET) - return (0); - flags |= RTF_HOST; - } - if (!vhid && (error = in_addprefix(ia, flags)) != 0) - return (error); - - if (ia->ia_addr.sin_addr.s_addr == INADDR_ANY) - return (0); - - if (ifp->if_flags & IFF_POINTOPOINT && - ia->ia_dstaddr.sin_addr.s_addr == ia->ia_addr.sin_addr.s_addr) - return (0); - - /* - * add a loopback route to self - */ - if (V_useloopback && !vhid && !(ifp->if_flags & IFF_LOOPBACK)) { - struct route ia_ro; - - bzero(&ia_ro, sizeof(ia_ro)); - *((struct sockaddr_in *)(&ia_ro.ro_dst)) = ia->ia_addr; - rtalloc_ign_fib(&ia_ro, 0, RT_DEFAULT_FIB); - if ((ia_ro.ro_rt != NULL) && (ia_ro.ro_rt->rt_ifp != NULL) && - (ia_ro.ro_rt->rt_ifp == V_loif)) { - RT_LOCK(ia_ro.ro_rt); - RT_ADDREF(ia_ro.ro_rt); - RTFREE_LOCKED(ia_ro.ro_rt); - } else - error = ifa_add_loopback_route((struct ifaddr *)ia, - (struct sockaddr *)&ia->ia_addr); - if (error == 0) - ia->ia_flags |= IFA_RTSELF; - if (ia_ro.ro_rt != NULL) - RTFREE(ia_ro.ro_rt); - } - - return (error); -} - #define rtinitflags(x) \ ((((x)->ia_ifp->if_flags & (IFF_LOOPBACK | IFF_POINTOPOINT)) != 0) \ ? RTF_HOST : 0) @@ -1007,44 +949,27 @@ in_scrubprefix(struct in_ifaddr *target, u_int fla /* * Remove the loopback route to the interface address. 
- * The "useloopback" setting is not consulted because if the - * user configures an interface address, turns off this - * setting, and then tries to delete that interface address, - * checking the current setting of "useloopback" would leave - * that interface address loopback route untouched, which - * would be wrong. Therefore the interface address loopback route - * deletion is unconditional. */ if ((target->ia_addr.sin_addr.s_addr != INADDR_ANY) && !(target->ia_ifp->if_flags & IFF_LOOPBACK) && - (target->ia_flags & IFA_RTSELF)) { - struct route ia_ro; - int freeit = 0; + (flags & LLE_STATIC)) { + struct in_ifaddr *eia; - bzero(&ia_ro, sizeof(ia_ro)); - *((struct sockaddr_in *)(&ia_ro.ro_dst)) = target->ia_addr; - rtalloc_ign_fib(&ia_ro, 0, 0); - if ((ia_ro.ro_rt != NULL) && (ia_ro.ro_rt->rt_ifp != NULL) && - (ia_ro.ro_rt->rt_ifp == V_loif)) { - RT_LOCK(ia_ro.ro_rt); - if (ia_ro.ro_rt->rt_refcnt <= 1) - freeit = 1; - else if (flags & LLE_STATIC) { - RT_REMREF(ia_ro.ro_rt); - target->ia_flags &= ~IFA_RTSELF; - } - RTFREE_LOCKED(ia_ro.ro_rt); - } - if (freeit && (flags & LLE_STATIC)) { + eia = more_localip(target); + + if (eia != NULL) { + error = ifa_switch_loopback_route((struct ifaddr *)eia, + (struct sockaddr *)&target->ia_addr); + ifa_free(&eia->ia_ifa); + } else { error = ifa_del_loopback_route((struct ifaddr *)target, (struct sockaddr *)&target->ia_addr); - if (error == 0) - target->ia_flags &= ~IFA_RTSELF; } - if ((flags & LLE_STATIC) && - !(target->ia_ifp->if_flags & IFF_NOARP)) + + if (!(target->ia_ifp->if_flags & IFF_NOARP)) /* remove arp cache */ - arp_ifscrub(target->ia_ifp, IA_SIN(target)->sin_addr.s_addr); + arp_ifscrub(target->ia_ifp, + IA_SIN(target)->sin_addr.s_addr); } if (rtinitflags(target)) { Index: sys/netinet/raw_ip.c =================================================================== --- sys/netinet/raw_ip.c (revision 257503) +++ sys/netinet/raw_ip.c (working copy) @@ -774,8 +774,6 @@ rip_ctlinput(int cmd, struct sockaddr *sa, void *v 
flags |= RTF_HOST; err = ifa_del_loopback_route((struct ifaddr *)ia, sa); - if (err == 0) - ia->ia_flags &= ~IFA_RTSELF; err = rtinit(&ia->ia_ifa, RTM_ADD, flags); if (err == 0) @@ -782,8 +780,6 @@ rip_ctlinput(int cmd, struct sockaddr *sa, void *v ia->ia_flags |= IFA_ROUTE; err = ifa_add_loopback_route((struct ifaddr *)ia, sa); - if (err == 0) - ia->ia_flags |= IFA_RTSELF; ifa_free(&ia->ia_ifa); break; --SvF6CGw9fzJC4Rcx-- From owner-freebsd-net@FreeBSD.ORG Fri Nov 1 16:33:52 2013 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id C196D3B3 for ; Fri, 1 Nov 2013 16:33:52 +0000 (UTC) (envelope-from s.khanchi@gmail.com) Received: from mail-wi0-x231.google.com (mail-wi0-x231.google.com [IPv6:2a00:1450:400c:c05::231]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 5D50D28D8 for ; Fri, 1 Nov 2013 16:33:52 +0000 (UTC) Received: by mail-wi0-f177.google.com with SMTP id f4so1321752wiw.4 for ; Fri, 01 Nov 2013 09:33:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=vswjD5pt+Ip9CC/rcoL0LIQ3egOuHF5H89slgjqp5TA=; b=PQhxmugASjaGWTZ4J0aUpyVH/3InPniQNLgOO9vdfcMujRTKgg+e/l4zLDlp+TD95K WaVxENF+xgDlMyJ6+yCFmATHsLaZeEvuoPQWKe0KB4BQM1rNTP6xgOueQc6iT0+XFC0l 4uUTrn2pSu66idX0uLYZOvNB5QwbaBO9NxojoLaZR2BuKouSZs25mf2wwdgr1zaeypZd hzp3x1Qf6c/kk1NeML6+JJDqJ1OjT0IYr7bGPYbh7UXg6Xma9eEua0fv8s/8VfS7XE5T FEj2lauy/4Zu7IN8UQXY/ktD8erqruOJJWhG2sRuYqdTcJeqaSaILGUtBVE17hWbnDcN KJTg== X-Received: by 10.180.105.194 with SMTP id go2mr3069156wib.39.1383323630849; Fri, 01 Nov 2013 09:33:50 -0700 (PDT) MIME-Version: 1.0 Sender: s.khanchi@gmail.com Received: by 
10.194.122.230 with HTTP; Fri, 1 Nov 2013 09:33:30 -0700 (PDT) In-Reply-To: <20131031180907.GB62132@onelab2.iet.unipi.it> References: <20131031180907.GB62132@onelab2.iet.unipi.it> From: h bagade Date: Fri, 1 Nov 2013 20:03:30 +0330 X-Google-Sender-Auth: YQiKg7MJnikHJ6kl3tbJubrydd8 Message-ID: Subject: Re: Errors on running kipfw with vale switches To: Luigi Rizzo Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Nov 2013 16:33:52 -0000

On Thu, Oct 31, 2013 at 9:39 PM, Luigi Rizzo wrote:
> On Thu, Oct 31, 2013 at 11:10:39AM +0330, h bagade wrote:
> > Hi all,
> >
> > I want to run userland ipfw with netmap support (kipfw). When I try to
> > follow the example to test kipfw, it encounters an error on the following
> > command:
>
> I suspect that stable/9 has an old version of the netmap code,
> so the argument to the ioctl fails.
> In fact, I don't even remember if the code in stable/9
> supports VALE.
>
> Please wait for a few days, we are going to push a newer
> version of netmap to both HEAD and stable/9 soon.
>
> cheers
> luigi
>

Thanks for your great support.
I'll wait for your changes :) From owner-freebsd-net@FreeBSD.ORG Sat Nov 2 12:20:02 2013 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 4D057B38 for ; Sat, 2 Nov 2013 12:20:02 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 21D072265 for ; Sat, 2 Nov 2013 12:20:02 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id rA2CK0dm081543 for ; Sat, 2 Nov 2013 12:20:00 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id rA2CK0s0081542; Sat, 2 Nov 2013 12:20:00 GMT (envelope-from gnats) Date: Sat, 2 Nov 2013 12:20:00 GMT Message-Id: <201311021220.rA2CK0s0081542@freefall.freebsd.org> To: freebsd-net@FreeBSD.org Cc: From: "Pataki Antal (Granaglia Kft.)" Subject: Re: kern/183391: [ixgbe] 10gigabit networking problems with Emulex OCE 11102 CNA X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: "Pataki Antal \(Granaglia Kft.\)" List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Nov 2013 12:20:02 -0000 The following reply was made to PR kern/183391; it has been noted by GNATS. 
From: "Pataki Antal (Granaglia Kft.)" To: bug-followup@FreeBSD.org, Pataki Antal Cc: Subject: Re: kern/183391: [ixgbe] 10gigabit networking problems with Emulex OCE 11102 CNA Date: Sat, 2 Nov 2013 13:11:49 +0100 --Apple-Mail=_6FE46762-5B52-4C3F-8C0F-4A4AEB8D919B Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii

I would like to correct this line:

Synopsis: [ixgbe] 10gigabit networking problems with Emulex OCE 11102 CNA

This problem is not related to the ixgbe, but related to the oce.

Thanks,

Antal Pataki

--Apple-Mail=_6FE46762-5B52-4C3F-8C0F-4A4AEB8D919B-- From owner-freebsd-net@FreeBSD.ORG Sat Nov 2 19:50:54 2013 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id B195C335; Sat, 2 Nov 2013 19:50:54 +0000 (UTC) (envelope-from julian@freebsd.org) Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 7E70B257E; Sat, 2 Nov 2013 19:50:54 +0000 (UTC) Received: from Julian-MBP3.local ([12.157.112.67]) (authenticated bits=0) by vps1.elischer.org (8.14.7/8.14.7) with ESMTP id rA2Johgn037781 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Sat, 2 Nov 2013 12:50:44 -0700 (PDT) (envelope-from julian@freebsd.org) Message-ID: <5275578E.40000@freebsd.org> Date: Sat, 02 Nov 2013 12:50:38 -0700 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:24.0) Gecko/20100101 Thunderbird/24.1.0 MIME-Version: 1.0 To: Andre Oppermann , Luigi Rizzo , Adrian Chadd , Navdeep Parhar , Randall Stewart , "freebsd-net@freebsd.org" Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <20131030050056.GA84368@onelab2.iet.unipi.it> <52717A62.7040600@freebsd.org> In-Reply-To: <52717A62.7040600@freebsd.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Nov 2013 19:50:54 -0000

On 10/30/13, 2:30 PM, Andre Oppermann wrote:
>
> Now ifnet has become very complex and large and should be brought
> back to its original purpose of being the logical layer 3 interface
> abstraction. There isn't necessarily a 1:1 mapping from one ifnet
> instance to one hardware interface. In fact there are pure logical
> ifnets (gre, tun, ...), direct hardware ifnets (simple network
> interfaces like fxp(4)), and multiple logical interfaces on top of a
> single hardware (vlan, lagg, ...). Depending on the ifnet's purpose
> the backend can be very different. Thus I want to decouple the current
> implicit notion of ifnet==hardware with associated queuing and such.
> Instead it should become a layer 3 abstraction inside the kernel again
> and delegate all lower layers to appropriate protocol, layer 2, and
> hardware specific implementations.

I have thought for a long time that the 'if' should be split in two.. the top half really is just common for everything.. it is basically what tun is..
(or ng_iface for that matter) From owner-freebsd-net@FreeBSD.ORG Sat Nov 2 22:40:02 2013 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 87B26333 for ; Sat, 2 Nov 2013 22:40:02 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 5BC462C86 for ; Sat, 2 Nov 2013 22:40:02 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id rA2Me2R8018040 for ; Sat, 2 Nov 2013 22:40:02 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id rA2Me28R018039; Sat, 2 Nov 2013 22:40:02 GMT (envelope-from gnats) Date: Sat, 2 Nov 2013 22:40:02 GMT Message-Id: <201311022240.rA2Me28R018039@freefall.freebsd.org> To: freebsd-net@FreeBSD.org Cc: From: Mohamad Aghakhani Subject: Re: kern/172683: [ip6] Duplicate IPv6 Link Local Addresses X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: Mohamad Aghakhani List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Nov 2013 22:40:02 -0000 The following reply was made to PR kern/172683; it has been noted by GNATS. 
From: Mohamad Aghakhani To: bug-followup@FreeBSD.org, doug@lafn.org Cc: Subject: Re: kern/172683: [ip6] Duplicate IPv6 Link Local Addresses Date: Sun, 3 Nov 2013 02:02:34 +0330 --089e0139ffb864788604ea394257 Content-Type: text/plain; charset=ISO-8859-1 --089e0139ffb864788604ea394257 Content-Type: text/html; charset=ISO-8859-1 --089e0139ffb864788604ea394257--