From owner-freebsd-net@FreeBSD.ORG Sun Aug 3 00:48:32 2014
Date: Sat, 2 Aug 2014 17:48:31 -0700
Subject: [rfc] UDP RSS awareness; handling IPv4 fragments
From: Adrian Chadd
To: FreeBSD Net, "freebsd-arch@freebsd.org"

Hi!

http://people.freebsd.org/~adrian/rss/20140802-rss-udp-ip-fragments-1.diff

This implements the RSS awareness bits for UDPv4, UDPv6 and IP fragment
handling. It should work for UDP transmit and receive.

There's a phabricator review:

https://phabric.freebsd.org/D527

I'll finish off IPv6 RSS after this and then look at the various ways
that packets make their way in and out of the stack. There's a whole
lot of silliness with multicast and IP tunneling / IPSEC decapsulation
that requires a recompute of the hash.

I appreciate any/all feedback and testing! Especially testing!

Thanks!

-a

From owner-freebsd-net@FreeBSD.ORG Mon Aug 4 08:00:12 2014
Message-Id: <201408040800.s7480BkV093743@kenobi.freebsd.org>
From: bugzilla-noreply@freebsd.org
To: freebsd-net@FreeBSD.org
Subject: [Bugzilla] Commit Needs MFC
Date: Mon, 04 Aug 2014 08:00:11 +0000

Hi,

You have a bug in the "Needs MFC" state which has not been touched in 7 or
more days. This email serves as a reminder that you may want to MFC this bug
or mark it as completed.

In the event you have a longer MFC timeout you may update this bug with a
comment and I won't remind you again for 7 days.

This reminder is only sent on Mondays. Please file a bug about concerns you
may have.

This search was scheduled by eadler@FreeBSD.org.

(1 bugs)

Bug 183659:
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=183659
  Severity: Affects Only Me
  Priority: Normal
  Hardware: Any
  Assignee: freebsd-net@FreeBSD.org
  Status: Needs MFC
  Resolution:
  Summary: [tcp] TCP stack lock contention with short-lived connections

From owner-freebsd-net@FreeBSD.ORG Mon Aug 4 09:45:26 2014
Message-ID: <53DF55FA.8010303@FreeBSD.org>
Date: Mon, 04 Aug 2014 13:44:26 +0400
From: "Alexander V. Chernikov"
To: Luigi Rizzo
Subject: Re: ipfw named objejcts, table values and syntax change
References: <53DC01DE.3000000@FreeBSD.org> <53DCA25C.1000108@FreeBSD.org>
In-Reply-To: <53DCA25C.1000108@FreeBSD.org>
Cc: freebsd-ipfw, "freebsd-net@freebsd.org"

On 02.08.2014 12:33, Alexander V. Chernikov wrote:
> On 02.08.2014 10:33, Luigi Rizzo wrote:
>>
>>
>> On Fri, Aug 1, 2014 at 11:08 PM, Alexander V. Chernikov wrote:
>>
>> Hello all.
>>
>> I'm currently working on to enhance ipfw in some areas.
>> The most notable (and user-visible) change is named table support.
>> The other one is support for different lookup algorithms for different
>> key types.
>>
>> For example, new ipfw permits writing this:
>>
>> ipfw table tb1 create type cidr
>> ipfw add allow ip from table(tb1) to any
>> ipfw add allow ip from any lookup dst-ip tb1
>>
>> ipfw table if1 create type iface
>> ipfw add skipto tablearg ip from any to any via table(if1)
>>
>> or even this:
>> ipfw table fl1 create type flow:src-ip,proto,dst-ip,dst-port
>> ipfw table fl1 add 10.0.0.5,tcp,10.0.0.6,80 4444
>> ipfw add allow ip from any to any flow table(fl1)
>>
>> all these changes fully preserve backward compatibility.
>> (actually tables needs now to be created before use and their type needs
>> to match with opcode used, but new ipfw(8) performs auto-creation
>> for cidr tables).
>>
>> There is another thing I'm going to change and I'm not sure I can keep
>> the same compatibility level.
>>
>> Table values, from one point of view, can be classified to the following
>> types:
>>
>> - skipto argument
>> - fwd argument (*)
>> - link to another object (nat, pipe, queue)
>> - plain u32 (not bound to any object)
>>   (divert/tee,netgraph,tag/utag,limit)
>>
>> There are the following reasons why I think it is necessary to implement
>> explicit table values typing (like tables):
>> - Implementing fwd tablearg for IPv6 hosts requires indirection table
>> - Converting nat/pipe instance ids to names renders values unusable
>> - retiring old hack with storing saved pointer of found object/rule
>>   inside rule w/o proper locking
>> - making faster skipto
>>
>>
>> i don't buy the idea that you need typed arguments
>> for all the cases above. Maybe the case that
>> may make sense is the fwd argument (and in the future
>> something else).
>> We already discussed, i think, the fact that now it
>> is legal to have references to non existing things
>> (skipto, pipes etc.) implemented as u32.
>> Removing that would break configurations.
> It depends on actual implementation. This can be preserved by
> auto-creating necessary objects in kernel and/or in userspace, so
> we can (and should) avoid breaking in this particular way.

Can you please explain your vision on values another time?
As far as I understand, you're not against it in general, but the
details matter:
* IP address can be one of the types (it won't break much, and we can
  simply skip that one for MFC)
* what about typing for nat/pipes ? we're not going to convert their ids
  to names? (or maybe you can suggest other non-disruptive way?)
* everything else is type "u32"

>> Efficiency is not affected, even for skipto,
> It depends on workload. While binary search is fast in terms of cpu, it
> may not be so fast in terms of memory (since each rule is
> allocated by a separate malloc() (and that is another thing which is worth
> discussing)).
>
>> and while i agree that unprotected writes to the pointers
>> in rules should not happen, these pointers are changed
>> infrequently so a global read-mostly lock should be
>> sufficient to protect all changes to the rules.
>>
>> cheers
>> luigi
>>
>>
>> So, as the result, table will have lookup key type (already done),
>> value type ('skipto', 'nexthop', 'nat', 'pipe', 'number', ..) and some
>> additional restrictions (like inability to add non-existing nat instance
>> id).
>>
>> This change will break (at least) scenarios where people are
>> using one table for both nat/pipe instances (and keep nat ids in sync
>> with pipe ones). For example:
>>
>> ipfw table 1 add 10.0.10.0/24 110
>> ipfw table 1 add 10.0.20.0/24 120
>>
>> ipfw add 100 nat tablearg from table(1) to any via vlanX in
>> ..
>> ipfw add 500 pipe tablearg from table(1) to any via ix0 out
>>
>> It looks like it is not so easy to bind values for given table to
>> different objects (or different tasks) (and lack of compatibility kills
>> hope for MFC).
>>
>> Ideas?
>>
>> --
>> -----------------------------------------+-------------------------------
>> Prof. Luigi RIZZO, rizzo@iet.unipi.it  . Dip. di Ing. dell'Informazione
>> http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
>> TEL      +39-050-2211611               . via Diotisalvi 2
>> Mobile   +39-338-6809875               . 56122 PISA (Italy)
>> -----------------------------------------+-------------------------------

From owner-freebsd-net@FreeBSD.ORG Mon Aug 4 09:59:55 2014
Date: Mon, 4 Aug 2014 11:55:28 +0200
From: Luigi Rizzo
To: FreeBSD Net
Subject: tutorial on Netmap in Mountain View - Aug.28
Message-ID: <20140804095528.GA12625@onelab2.iet.unipi.it>

In case someone (especially those in the bay area) is interested:

I will give a half day tutorial on netmap at Hot Interconnects,
in Mountain View on August 28, 2014

http://www.hoti.org/hoti22/tutorials/#tut4

This tutorial targets hardware vendors, network engineers, and
researchers looking for solutions to: OS support for high speed
NICs; efficient software packet processing techniques for SDN
products; high speed networking in VMs.
We will show how to achieve these results using netmap.

cheers
luigi

(P.S. I have no financial interest in the event.
I am posting the info because I think it might be useful to people
on this list, and of course having a larger audience at the tutorial
will generate more interesting feedback from participants)

-----------------------------------------+-------------------------------
Prof. Luigi RIZZO, rizzo@iet.unipi.it  . Dip. di Ing. dell'Informazione
http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
TEL      +39-050-2211611               . via Diotisalvi 2
Mobile   +39-338-6809875               . 56122 PISA (Italy)
-----------------------------------------+-------------------------------

From owner-freebsd-net@FreeBSD.ORG Mon Aug 4 11:54:38 2014
Date: Mon, 4 Aug 2014 13:58:17 +0200
From: Luigi Rizzo
To: "Alexander V. Chernikov"
Subject: Re: ipfw named objejcts, table values and syntax change
Message-ID: <20140804115817.GA13814@onelab2.iet.unipi.it>
References: <53DC01DE.3000000@FreeBSD.org> <53DCA25C.1000108@FreeBSD.org> <53DF55FA.8010303@FreeBSD.org>
In-Reply-To: <53DF55FA.8010303@FreeBSD.org>
Cc: freebsd-ipfw, "freebsd-net@freebsd.org"

On Mon, Aug 04, 2014 at 01:44:26PM +0400, Alexander V. Chernikov wrote:
> On 02.08.2014 12:33, Alexander V. Chernikov wrote:
> > On 02.08.2014 10:33, Luigi Rizzo wrote:
> >>
> >>
> >> On Fri, Aug 1, 2014 at 11:08 PM, Alexander V. Chernikov wrote:
> >>
> >> Hello all.
> >>
> >> I'm currently working on to enhance ipfw in some areas.
> >> The most notable (and user-visible) change is named table support.
> >> The other one is support for different lookup algorithms for different
> >> key types.
> >>
> >> For example, new ipfw permits writing this:
> >>
> >> ipfw table tb1 create type cidr
> >> ipfw add allow ip from table(tb1) to any
> >> ipfw add allow ip from any lookup dst-ip tb1
> >>
> >> ipfw table if1 create type iface
> >> ipfw add skipto tablearg ip from any to any via table(if1)
> >>
> >> or even this:
> >> ipfw table fl1 create type flow:src-ip,proto,dst-ip,dst-port
> >> ipfw table fl1 add 10.0.0.5,tcp,10.0.0.6,80 4444
> >> ipfw add allow ip from any to any flow table(fl1)
> >>
> >> all these changes fully preserve backward compatibility.
> >> (actually tables needs now to be created before use and their type needs
> >> to match with opcode used, but new ipfw(8) performs auto-creation
> >> for cidr tables).
> >>
> >> There is another thing I'm going to change and I'm not sure I can keep
> >> the same compatibility level.
> >>
> >> Table values, from one point of view, can be classified to the following
> >> types:
> >>
> >> - skipto argument
> >> - fwd argument (*)
> >> - link to another object (nat, pipe, queue)
> >> - plain u32 (not bound to any object)
> >>   (divert/tee,netgraph,tag/utag,limit)
> >>
> >> There are the following reasons why I think it is necessary to implement
> >> explicit table values typing (like tables):
> >> - Implementing fwd tablearg for IPv6 hosts requires indirection table
> >> - Converting nat/pipe instance ids to names renders values unusable
> >> - retiring old hack with storing saved pointer of found object/rule
> >>   inside rule w/o proper locking
> >> - making faster skipto
> >>
> >>
> >> i don't buy the idea that you need typed arguments
> >> for all the cases above. Maybe the case that
> >> may make sense is the fwd argument (and in the future
> >> something else).
> >> We already discussed, i think, the fact that now it
> >> is legal to have references to non existing things
> >> (skipto, pipes etc.) implemented as u32.
> >> Removing that would break configurations.
> > It depends on actual implementation. This can be preserved by
> > auto-creating necessary objects in kernel and/or in userspace, so
> > we can (and should) avoid breaking in this particular way.
> Can you please explain your vision on values another time?
> As far as I understand, you're not against it in general, but the
> details matter:
> * IP address can be one of the types (it won't break much, and we can
>   simply skip that one for MFC)
> * what about typing for nat/pipes ? we're not going to convert their ids
>   to names? (or maybe you can suggest other non-disruptive way?)
> * everything else is type "u32"

Correct, I am mostly concerned about the details, not on the general concept.

To summarize the discussion Alexander and I had about converting
identifiers from numbers to arbitrary strings (this is partly related
to the values stored in tables, but I think we should have a coherent
behaviour)

1. CURRENTLY ipfw uses numeric identifiers in a small range (16 bits or less)
   for rules, pipes, queues, tables, probably nat instances.

2. CURRENTLY, in all the above contexts, it is legal to reference a
   non existing object (rule, pipe, table names, etc.),
   and the kernel will do something reasonable, namely jump to the
   next rule, drop traffic for non existing pipes, and so on.

3. of course we want to preserve backward compatibility both for
   the ioctl interface, and for user configurations.

4. The in-kernel representation of identifiers is not visible to users,
   so we can use a numeric representation in the kernel for identifiers.
   Strings like "12345" are converted with atoi() or the like,
   whereas for other identifiers or numbers outside of the 2^16 range
   the kernel manages a translation table, allocating new numeric
   identifiers if a new string appears.
   This permits backward compatibility for old rulesets, and does not
   impact performance because the translation table is only
   used during rules additions or deletion.

With this in mind, i think we should follow a similar approach for
objects stored in tables, hence

    if an u32 value was available in the past, it must be
    available also in the new implementation.

The issue with tables is that some convoluted configuration could
use the same table to reference pipes _and_ rules _and_ perhaps
other things represented as numbers (the former is not too strange,
if i have a large configuration i might place sections at rules
12000, 13000, 14000... and associate pipes with the same numeric
identifier to each block of rules).

Typed table values would clearly disturb backward compatibility
in the above configurations.
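
As an illustration of point 4 above, here is a minimal sketch of such a translation table. It is written in Python for brevity rather than kernel C, and it is not the ipfw implementation: the class name `IdTranslator`, the constant `LEGACY_MAX`, and the choice to allocate new identifiers just above the 16-bit range are all assumptions made for the example.

```python
from typing import Dict

# Assumption: identifiers below 2^16 keep their historical numeric meaning.
LEGACY_MAX = 1 << 16


class IdTranslator:
    """Hypothetical map from user-visible names to in-kernel numeric ids."""

    def __init__(self) -> None:
        self._by_name: Dict[str, int] = {}
        # Assumption: allocated ids start above the legacy range so they
        # can never collide with a number taken from an old ruleset.
        self._next_id = LEGACY_MAX

    def lookup(self, name: str) -> int:
        # Plain decimal strings inside the legacy range map to themselves
        # (the atoi() case), so old rulesets never touch the table.
        if name.isascii() and name.isdigit() and int(name) < LEGACY_MAX:
            return int(name)
        # Any other string, or a number outside the legacy range, gets a
        # stable id that is allocated the first time the string appears.
        if name not in self._by_name:
            self._by_name[name] = self._next_id
            self._next_id += 1
        return self._by_name[name]
```

In this sketch the table is consulted only when a rule is added or removed; packet processing would see the resulting integers and nothing else, which is the performance argument made in point 4.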
However it should not be difficult
to accept arbitrary strings as the values stored in tables, and
then store multiple representations as appropriate, including:

- the string representation, unconditionally
- for names that can be resolved by DNS, the ipv6 and ipv4 address(es)
  associated with them. ipfw already translates hostnames in rules
  so this is POLA
- for other strings, a u32 from the translation table as previously
  indicated
- and for numeric values, the u32 representation (truncated if needed,
  according to whatever is the existing behaviour)
- If we cannot generate an u32 we will put some value (e.g. 0)
  that hopefully will not cause confusion.

If we do it this way, we should be able to preserve backward
compatibility _and_ add features that people may need.

cheers
luigi

From owner-freebsd-net@FreeBSD.ORG Mon Aug 4 19:51:38 2014
Message-ID: <53DFE438.5050209@FreeBSD.org>
Date: Mon, 04 Aug 2014 23:51:20 +0400
From: "Alexander V. Chernikov"
To: Luigi Rizzo
Subject: Re: ipfw named objejcts, table values and syntax change
References: <53DC01DE.3000000@FreeBSD.org> <53DCA25C.1000108@FreeBSD.org> <53DF55FA.8010303@FreeBSD.org> <20140804115817.GA13814@onelab2.iet.unipi.it>
In-Reply-To: <20140804115817.GA13814@onelab2.iet.unipi.it>
Cc: freebsd-ipfw, "freebsd-net@freebsd.org"

On 04.08.2014 15:58, Luigi Rizzo wrote:
> On Mon, Aug 04, 2014 at 01:44:26PM +0400, Alexander V. Chernikov wrote:
>> On 02.08.2014 12:33, Alexander V. Chernikov wrote:
>>> On 02.08.2014 10:33, Luigi Rizzo wrote:
>>>>
>>>>
>>>> On Fri, Aug 1, 2014 at 11:08 PM, Alexander V. Chernikov wrote:
>>>>
>>>> Hello all.
>>>>
>>>> I'm currently working on to enhance ipfw in some areas.
>>>> The most notable (and user-visible) change is named table support.
>>>> The other one is support for different lookup algorithms for different
>>>> key types.
>>>>
>>>> For example, new ipfw permits writing this:
>>>>
>>>> ipfw table tb1 create type cidr
>>>> ipfw add allow ip from table(tb1) to any
>>>> ipfw add allow ip from any lookup dst-ip tb1
>>>>
>>>> ipfw table if1 create type iface
>>>> ipfw add skipto tablearg ip from any to any via table(if1)
>>>>
>>>> or even this:
>>>> ipfw table fl1 create type flow:src-ip,proto,dst-ip,dst-port
>>>> ipfw table fl1 add 10.0.0.5,tcp,10.0.0.6,80 4444
>>>> ipfw add allow ip from any to any flow table(fl1)
>>>>
>>>> all these changes fully preserve backward compatibility.
>>>> (actually tables needs now to be created before use and their type needs
>>>> to match with opcode used, but new ipfw(8) performs auto-creation
>>>> for cidr tables).
>>>>
>>>> There is another thing I'm going to change and I'm not sure I can keep
>>>> the same compatibility level.
>>>>
>>>> Table values, from one point of view, can be classified to the following
>>>> types:
>>>>
>>>> - skipto argument
>>>> - fwd argument (*)
>>>> - link to another object (nat, pipe, queue)
>>>> - plain u32 (not bound to any object)
>>>>   (divert/tee,netgraph,tag/utag,limit)
>>>>
>>>> There are the following reasons why I think it is necessary to implement
>>>> explicit table values typing (like tables):
>>>> - Implementing fwd tablearg for IPv6 hosts requires indirection table
>>>> - Converting nat/pipe instance ids to names renders values unusable
>>>> - retiring old hack with storing saved pointer of found object/rule
>>>>   inside rule w/o proper locking
>>>> - making faster skipto
>>>>
>>>>
>>>> i don't buy the idea that you need typed arguments
>>>> for all the cases above. Maybe the case that
>>>> may make sense is the fwd argument (and in the future
>>>> something else).
>>>> We already discussed, i think, the fact that now it
>>>> is legal to have references to non existing things
>>>> (skipto, pipes etc.) implemented as u32.
>>>> Removing that would break configurations.
>>> It depends on actual implementation. This can be preserved by
>>> auto-creating necessary objects in kernel and/or in userspace, so
>>> we can (and should) avoid breaking in this particular way.
>> Can you please explain your vision on values another time?
>> As far as I understand, you're not against it in general, but the
>> details matter:
>> * IP address can be one of the types (it won't break much, and we can
>>   simply skip that one for MFC)
>> * what about typing for nat/pipes ? we're not going to convert their ids
>>   to names? (or maybe you can suggest other non-disruptive way?)
>> * everything else is type "u32"
>
> Correct, I am mostly concerned about the details, not on the general concept.
>
> To summarize the discussion Alexander and I had about converting
> identifiers from numbers to arbitrary strings (this is partly related
> to the values stored in tables, but I think we should have a coherent
> behaviour)
>
> 1. CURRENTLY ipfw uses numeric identifiers in a small range (16 bits or less)
>    for rules, pipes, queues, tables, probably nat instances.
>
> 2. CURRENTLY, in all the above contexts, it is legal to reference a
>    non existing object (rule, pipe, table names, etc.),
>    and the kernel will do something reasonable, namely jump to the
>    next rule, drop traffic for non existing pipes, and so on.
>
> 3. of course we want to preserve backward compatibility both for
>    the ioctl interface, and for user configurations.
>
> 4. The in-kernel representation of identifiers is not visible to users,
>    so we can use a numeric representation in the kernel for identifiers.
>    Strings like "12345" are converted with atoi() or the like,
>    whereas for other identifiers or numbers outside of the 2^16 range
>    the kernel manages a translation table, allocating new numeric
>    identifiers if a new string appears.
>    This permits backward compatibility for old rulesets, and does not
>    impact performance because the translation table is only
>    used during rules additions or deletion.

Yes. However this requires holding either (1) 2 pointers (old&new
arrays), or (2) 65k+ index array, or (3) chained hash table.
(1) would require additional pointers for each subsystem (and some
additional management), (2) will definitely upset embedded guys and
(3) is worse in terms of performance

> With this in mind, i think we should follow a similar approach for
> objects stored in tables, hence
>
>     if an u32 value was available in the past, it must be
>     available also in the new implementation.
>
> The issue with tables is that some convoluted configuration could
> use the same table to reference pipes _and_ rules _and_ perhaps
> other things represented as numbers (the former is not too strange,
> if i have a large configuration i might place sections at rules
> 12000, 13000, 14000... and associate pipes with the same numeric
> identifier to each block of rules).
>
> Typed table values would clearly disturb backward compatibility
> in the above configurations. However it should not be difficult
> to accept arbitrary strings as the values stored in tables, and
> then store multiple representations as appropriate, including:

Well, I've thought about that one. It may be an option, but the details
are not so promising (below)

> - the string representation, unconditionally
> - for names that can be resolved by DNS, the ipv6 and ipv4 address(es)
>   associated with them. ipfw already translates hostnames in rules
>   so this is POLA

I'm not happy with what ipfw(8) is doing instead of translation. The
proper way would be not simply using first AF_INET answer but saving ALL
IPv4+IPv6 records inside rule (and some more tracking should be done
afterwards, but that's totally different story).
Additionally, I'm unsure if we really need next-hop value expressed as
hostname (how can we deal with multiple addresses and different AFs?).
We may store strings (and I think we should do it) but I'm unsure about
this particular option of interpreting them.

> - for other strings, a u32 from the translation table as previously
>   indicated
> - and for numeric values, the u32 representation (truncated if needed,
>   according to whatever is the existing behaviour)
> - If we cannot generate an u32 we will put some value (e.g. 0)
>   that hopefully will not cause confusion.

As far as I understand, we accept some string "s" as table value inside
the kernel, then we have some logic that says: oh, dummynet pipe has
the same name "s", oh, nat entity with name "s" has just been created,
let's save indices.

That would require additional indirection table like:

index | [ skipto idx | nat idx | pipe idx | queue idx | fwd index ]
( so we will have 2-level indirection table for fwd if we do IPv6)

We can optimize this if we use "same name -> same kidx" approach
regardless of kernel object we're referring to. That might require some
more memory, but that's OK from my point of view.
So we end up with

int [ skipto idx | fwd idx | obj idx ]

idx "0" is special value which means the same as in point 2 ("CURRENTLY")

That looks better, but still way too complex.

I do care about compatibility, but it's hard to improve things without
changing. I'd like to propose the following:

* Split values into 3 types ("ip|nexthop", "number", "object")
* Do not insist on object existence, use value "0" to mimic the
  behavior in point 2 ("CURRENTLY").
* Retain full compatibility by introducing special value type "legacy"
  which matches any type and is backed by given indirection table.
* Issue warning in ipfw(8) binary on all auto-created tables that
  auto-creation is legacy and this behavior will be dropped in next
  major release (e.g. 11.0)
* Save this behavior in MFC but drop "legacy" tables in head a month
  after the actual MFC.
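
The proposal above can be sketched roughly as follows. This is a Python illustration only, not kernel code and not the eventual ipfw design: the names `ValueType`, `Table`, `type_matches` and `resolve_object` are hypothetical, and the only behaviour taken from the proposal is that a "legacy" table matches any value type and that a missing object resolves to index 0.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class ValueType(Enum):
    NEXTHOP = "nexthop"   # the "ip|nexthop" type
    NUMBER = "number"
    OBJECT = "object"     # nat, pipe, queue and similar named objects
    LEGACY = "legacy"     # old-style table, usable with any value type


@dataclass
class Table:
    name: str
    value_type: ValueType


def type_matches(table: Table, wanted: ValueType) -> bool:
    """A rule may use the table if the types agree or the table is legacy."""
    return table.value_type is ValueType.LEGACY or table.value_type is wanted


def resolve_object(objects: Dict[str, int], name: str) -> int:
    """Return the object's kernel index, or 0 if it does not exist (yet).

    Index 0 stands for "no such object", so a dangling reference stays
    legal, as in today's behaviour.
    """
    return objects.get(name, 0)
```

For example, a table created by an old ruleset would carry `ValueType.LEGACY` and could still feed both `nat tablearg` and `pipe tablearg` rules, while a table created as `ValueType.OBJECT` would be rejected by a rule that expects a next hop.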

What do you think?

> If we do it this way, we should be able to preserve backward
> compatibility _and_ add features that people may need.
>
> cheers
> luigi
>

From owner-freebsd-net@FreeBSD.ORG Wed Aug 6 21:25:40 2014
:content-type; bh=u7vpJuOYSKkEjU1L1sT0G2CjPUdY83YnO747s7WO534=; b=HtTtjCx66Ow8xGSfggq94SA7vsgJTD0LNBlSjTz+OXAiOwg7TAuwXn+ULF7P+GBQfq BghsqeRaLNYqG+b/TgN1RQeXc9wkvj57yQnYyWNAkZ+wjt2Mq+UTsIZeARRCt4o0qYFK G82Fp5BatoiGpYjwgFSc1Zq2OV8eg803pkZLc7L2AVHBHk6eJqPEbtzpTiv4LYJGEzMD R/UfO5pWgO5Z+Uppre1diiGIBYgpzrfra2h+0ocVJ/+QRPeDJkLWuXVKSoSaEUYgsLYM WH+SbfYPhybFqxQXm9X+jIr/HTDaiomwcM6HjzX5IhFT5R9iPRo2sIEoRmkJJwtevyae 7koQ== X-Gm-Message-State: ALoCoQnswda1ymym5dbdlqRALymTBROBHD8dEu0lLe7E3FrKggoeutt/IKtOm5xQXOkgkY4jwsLL MIME-Version: 1.0 X-Received: by 10.60.46.167 with SMTP id w7mr18708456oem.50.1407360338681; Wed, 06 Aug 2014 14:25:38 -0700 (PDT) Received: by 10.76.93.209 with HTTP; Wed, 6 Aug 2014 14:25:38 -0700 (PDT) Date: Wed, 6 Aug 2014 17:25:38 -0400 Message-ID: Subject: zero window and persist timer not set From: Jeremiah Lott To: "freebsd-net@freebsd.org" Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.18 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Aug 2014 21:25:40 -0000 Hello, We've been seeing a problem where a tcp connection is stuck in a zero window condition and even though the client has opened more window space, our FreeBSD box never sends any more. After some analysis it appears that the FreeBSD box is not sending zero window probes, because the persist timer did not get set (we can see in kgdb that the tcpcb shows 0 window, there is data in the socket buffer, but the persist timer is not active). After looking over the code for a while, I think I see the problem. When tcp_output chooses to send a packet, it never arms the persist timer. This causes a problem in the following scenario: 1. A --> B: packet containing enough data to fill the window 2. B --> A: ACK for #1 + new data (0 window advertisement) 3. 
A --> B: ACK for #2, 0 len packet

In this case, A will not activate the persist timer, because it chose to send a packet. Unless tcp_output is called for some other reason (delayed ack timer, another input packet from B, socket syscall), A will not send zero window probes.

I was finally able to recreate this condition by setting a very small window and running programs that send very specific sequences of packets without calling recv (purposefully forcing a zero window condition). Here is a packet capture that shows the sequence:

A == 10.2.15.69 == FreeBSD 9.2
B == 10.2.14.61 == FreeBSD 8.2

16:19:49.664790 IP 10.2.14.61.23133 > 10.2.15.69.12345: Flags [S], seq 2362665163, win 4300, options [mss 1460,nop,wscale 6,sackOK,TS val 88804503 ecr 0], length 0
16:19:49.664821 IP 10.2.15.69.12345 > 10.2.14.61.23133: Flags [S.], seq 3306387947, ack 2362665164, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1605043666 ecr 88804503], length 0
16:19:49.664859 IP 10.2.14.61.23133 > 10.2.15.69.12345: Flags [.], ack 1, win 67, options [nop,nop,TS val 88804503 ecr 1605043666], length 0
16:19:49.664921 IP 10.2.14.61.23133 > 10.2.15.69.12345: Flags [P.], seq 1:101, ack 1, win 67, options [nop,nop,TS val 88804503 ecr 1605043666], length 100
16:19:49.665137 IP 10.2.15.69.12345 > 10.2.14.61.23133: Flags [P.], seq 1:3001, ack 101, win 2046, options [nop,nop,TS val 1605043666 ecr 88804503], length 3000
16:19:49.665208 IP 10.2.14.61.23133 > 10.2.15.69.12345: Flags [P.], seq 101:1321, ack 1449, win 45, options [nop,nop,TS val 88804503 ecr 1605043666], length 1220
16:19:49.666195 IP 10.2.14.61.23133 > 10.2.15.69.12345: Flags [.], seq 1321:2769, ack 3001, win 21, options [nop,nop,TS val 88804504 ecr 1605043666], length 1448
16:19:49.666205 IP 10.2.15.69.12345 > 10.2.14.61.23133: Flags [.], ack 2769, win 2004, options [nop,nop,TS val 1605043667 ecr 88804503], length 0
16:19:49.666207 IP 10.2.14.61.23133 > 10.2.15.69.12345: Flags [P.], seq 2769:2771, ack 3001, win 21, options
[nop,nop,TS val 88804504 ecr 1605043666], length 2
16:19:49.667183 IP 10.2.14.61.23133 > 10.2.15.69.12345: Flags [.], seq 2771:4219, ack 3001, win 21, options [nop,nop,TS val 88804505 ecr 1605043667], length 1448
16:19:49.667190 IP 10.2.15.69.12345 > 10.2.14.61.23133: Flags [.], seq 3001:4345, ack 4219, win 1982, options [nop,nop,TS val 1605043668 ecr 88804504], length 1344
16:19:49.667193 IP 10.2.14.61.23133 > 10.2.15.69.12345: Flags [P.], seq 4219:4221, ack 3001, win 21, options [nop,nop,TS val 88804505 ecr 1605043667], length 2
16:19:49.766487 IP 10.2.14.61.23133 > 10.2.15.69.12345: Flags [P.], seq 4221:4321, ack 4345, win 0, options [nop,nop,TS val 88804605 ecr 1605043668], length 100
16:19:49.766499 IP 10.2.15.69.12345 > 10.2.14.61.23133: Flags [.], ack 4321, win 1980, options [nop,nop,TS val 1605043768 ecr 88804505], length 0

The important packets are the last four:

1. A --> B: length 1344, fills the remaining window
2. B --> A: length 2, does not ack additional data, delayed ack timer is set
3. B --> A: length 100, acks #1, immediate ack (delayed ack timer cancelled, tcp_output called with ACKNOW)
4. A --> B: length 0, acks #2 and #3; because a packet is sent, tcp_output does not activate the persist timer.

I would normally expect A to begin sending zero-window probes, but (since it didn't activate the persist timer) it does not. Using kgdb, I can see that the persist timer is not set; only the keep timer is set.
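
For readers checking the numbers in the capture: tcpdump prints the raw window field, and B's SYN negotiated "wscale 6", so B's "win 21" stands for 21 << 6 = 1344 bytes, which is exactly the length of the segment in step 1. The following is only an illustrative sketch of that arithmetic; the function names are made up for this note and are not taken from the FreeBSD stack.

```c
#include <stdint.h>

/* Bytes represented by a raw 16-bit window field once the window
 * scale shift from the SYN exchange is applied. */
static uint32_t scaled_window(uint16_t raw_win, unsigned shift)
{
    return (uint32_t)raw_win << shift;
}

/* Bytes the sender may still transmit: advertised window minus the
 * bytes already in flight (snd_nxt - snd_una). Unsigned subtraction
 * keeps this correct across sequence number wraparound. */
static uint32_t usable_window(uint32_t snd_una, uint32_t snd_nxt,
                              uint32_t snd_wnd)
{
    uint32_t in_flight = snd_nxt - snd_una;

    return in_flight >= snd_wnd ? 0 : snd_wnd - in_flight;
}
```

With the relative sequence numbers from the capture, usable_window(3001, 3001, scaled_window(21, 6)) is 1344 before step 1 and usable_window(3001, 4345, 1344) is 0 after it, which matches the "win 0" that B advertises in step 3.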
This is kgdb on "A":

(kgdb) print ((struct tcpcb*)(0xfffffe02ae289b70))->snd_nxt
$5 = 3306392292
(kgdb) print ((struct tcpcb*)(0xfffffe02ae289b70))->snd_max
$6 = 3306392292
(kgdb) print ((struct tcpcb*)(0xfffffe02ae289b70))->snd_una
$7 = 3306392292
(kgdb) print ((struct tcpcb*)(0xfffffe02ae289b70))->snd_wnd
$8 = 0
(kgdb) print ((struct tcpcb*)(0xfffffe02ae289b70))->snd_cwnd
$9 = 4380
(kgdb) print ((struct tcpcb*)(0xfffffe02ae289b70))->t_timers->tt_rexmt->c_flags
$11 = 16
(kgdb) print ((struct tcpcb*)(0xfffffe02ae289b70))->t_timers->tt_persist->c_flags
$12 = 16
(kgdb) print ((struct tcpcb*)(0xfffffe02ae289b70))->t_timers->tt_keep->c_flags
$13 = 22
(kgdb) print ((struct tcpcb*)(0xfffffe02ae289b70))->t_timers->tt_2msl->c_flags
$14 = 16
(kgdb) print ((struct tcpcb*)(0xfffffe02ae289b70))->t_timers->tt_delack->c_flags
$15 = 16
(kgdb) print ((struct tcpcb*)(0xfffffe02ae289b70))->t_inpcb->inp_socket.so_snd.sb_cc
$16 = 1656

There is zero window, data in the socket buffer, and the persist timer is not set.

My proposed fix follows. If a 0-length packet is sent while there is data in the socket buffer, and neither the rexmt nor the persist timer is already set, then activate the persist timer.

--- sys/netinet/tcp_output.c (revision 269644)
+++ sys/netinet/tcp_output.c (working copy)
@@ -1290,7 +1290,12 @@
 					tp->t_rxtshift = 0;
 				}
 				tcp_timer_activate(tp, TT_REXMT, tp->t_rxtcur);
-			}
+			} else if (len == 0 && so->so_snd.sb_cc &&
+			    !tcp_timer_active(tp, TT_REXMT) &&
+			    !tcp_timer_active(tp, TT_PERSIST)) {
+				tp->t_rxtshift = 0;
+				tcp_setpersist(tp);
+			}
 		} else {
 			/*
 			 * Persist case, update snd_max but since we are in

Let me know any comments.
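
The condition that the patch adds can also be read in isolation. The sketch below is a standalone model of that test and is not kernel code: struct send_state and should_arm_persist() are hypothetical names invented here, and the two boolean fields stand in for the results of tcp_timer_active(tp, TT_REXMT) and tcp_timer_active(tp, TT_PERSIST).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the state the patch looks at right after
 * tcp_output has sent a segment. */
struct send_state {
    size_t len;            /* payload bytes in the segment just sent */
    size_t sb_cc;          /* bytes still queued in the send buffer  */
    bool   rexmt_active;   /* retransmit timer already running?      */
    bool   persist_active; /* persist timer already running?         */
};

/* Mirrors the new "else if" branch: a pure ACK went out, data is still
 * waiting, and no timer would otherwise bring us back to tcp_output. */
static bool should_arm_persist(const struct send_state *s)
{
    return s->len == 0 && s->sb_cc > 0 &&
        !s->rexmt_active && !s->persist_active;
}
```

Fed with the values from the kgdb session (a 0-length segment, sb_cc of 1656, neither timer active), should_arm_persist() returns true, which is the case the patch is meant to catch.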
Thanks, Jeremiah Lott Avere Systems From owner-freebsd-net@FreeBSD.ORG Wed Aug 6 23:09:18 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 94CB9481 for ; Wed, 6 Aug 2014 23:09:18 +0000 (UTC) Received: from web01.jbserver.net (web01.jbserver.net [IPv6:2a00:8240:6:a::1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 5E69B2784 for ; Wed, 6 Aug 2014 23:09:17 +0000 (UTC) Received: from cl-1071.udi-01.br.sixxs.net ([2001:1291:200:42e::2]) by web01.jbserver.net with esmtpsa (TLSv1.2:DHE-RSA-AES128-SHA:128) (Exim 4.83) (envelope-from ) id 1XFAKY-0001D5-5P; Thu, 07 Aug 2014 01:09:14 +0200 Message-ID: <53E2B586.3080700@gont.com.ar> Date: Wed, 06 Aug 2014 19:08:54 -0400 From: Fernando Gont User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0 MIME-Version: 1.0 To: FreeBSD Net Subject: Routing IPv6 packets towards oneself with routing sockets? Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Aug 2014 23:09:18 -0000 Folks, I've found a "tricky" scenario when consulting the IPv6 routing table with routing sockets. Short version of the question: I'm currently consulting the IPv6 routing table with raw sockets. My own host is assigned the address fc00:1::1, and it is directly connected to fc00:1::/64 with em0. 
The corresponding entries from its routing table are:

fc00:1::/64 link#1 U em0
fc00:1::1 link#1 UHS lo0

Essentially, packets sent to fc00:1::1 don't go through em0 but rather go through the loopback interface (if you ping6 fc00:1::1, you'll see the packets on lo0 rather than em0).

However, whenever I look up an entry for fc00:1::1 with routing sockets, the only entry I obtain is fc00:1::/64 (a network route) rather than fc00:1::1/128 (a host route). As a result, I kind of have to figure out that since fc00:1::1 is my own address, I must override whatever I learned via routing sockets, and just send my packets to loopback.

I would assume that I must be doing something wrong, since I would expect the host-specific route (i.e. the longest-matching route) to be the route learned via routing sockets, and that I shouldn't be implementing this "is the dst address my own address?" hack.

Any thoughts?

P.S.: I can provide a code snippet if that'd be of any help.

Thanks!

Best regards,
-- Fernando Gont e-mail: fernando@gont.com.ar || fgont@si6networks.com PGP Fingerprint: 7809 84F5 322E 45C7 F1C9 3945 96EE A9EF D076 FFF1
From owner-freebsd-net@FreeBSD.ORG Thu Aug 7 10:24:36 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4399997F for ; Thu, 7 Aug 2014 10:24:36 +0000 (UTC) Received: from mail.allbsd.org (gatekeeper.allbsd.org [IPv6:2001:2f0:104:e001::32]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "*.allbsd.org", Issuer "RapidSSL CA" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id B6A9522A5 for ; Thu, 7 Aug 2014 10:24:35 +0000 (UTC) Received: from alph.d.allbsd.org ([IPv6:2001:2f0:104:e010:862b:2bff:febc:8956]) (authenticated bits=56) by mail.allbsd.org (8.14.9/8.14.8) with ESMTP id s77AOFrw025819 (version=TLSv1/SSLv3
cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Thu, 7 Aug 2014 19:24:25 +0900 (JST) (envelope-from hrs@FreeBSD.org) Received: from localhost (localhost [IPv6:::1]) (authenticated bits=0) by alph.d.allbsd.org (8.14.8/8.14.8) with ESMTP id s77AODGJ029865; Thu, 7 Aug 2014 19:24:14 +0900 (JST) (envelope-from hrs@FreeBSD.org) Date: Thu, 07 Aug 2014 19:24:03 +0900 (JST) Message-Id: <20140807.192403.845244220459089560.hrs@allbsd.org> To: fernando@gont.com.ar Subject: Re: Routing IPv6 packets towards oneself with routing sockets? From: Hiroki Sato In-Reply-To: <53E2B586.3080700@gont.com.ar> References: <53E2B586.3080700@gont.com.ar> X-PGPkey-fingerprint: BDB3 443F A5DD B3D0 A530 FFD7 4F2C D3D8 2793 CF2D X-Mailer: Mew version 6.6 on Emacs 24.3 / Mule 6.0 (HANACHIRUSATO) Mime-Version: 1.0 Content-Type: Multipart/Signed; protocol="application/pgp-signature"; micalg=pgp-sha1; boundary="--Security_Multipart(Thu_Aug__7_19_24_03_2014_114)--" Content-Transfer-Encoding: 7bit X-Virus-Scanned: clamav-milter 0.97.4 at gatekeeper.allbsd.org X-Virus-Status: Clean X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (mail.allbsd.org [IPv6:2001:2f0:104:e001::32]); Thu, 07 Aug 2014 19:24:29 +0900 (JST) X-Spam-Status: No, score=-97.9 required=13.0 tests=CONTENT_TYPE_PRESENT, RDNS_NONE,SPF_SOFTFAIL,USER_IN_WHITELIST autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on gatekeeper.allbsd.org Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Aug 2014 10:24:36 -0000 ----Security_Multipart(Thu_Aug__7_19_24_03_2014_114)-- Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Hi, Fernando Gont wrote in <53E2B586.3080700@gont.com.ar>: fe> However, whenever I lookup an entry for fc00:1::1 with routing sockets, fe> the 
only entry I obtain is fc00:1::/64 (a network route) rather than fe> fc00:1::1/128 (a host route). As a result, I kind of have to figure out fe> that since fc00:1::1 is my own address, I must override whatever I fe> learned via routing sockets, and just send my packets to loopback. fe> fe> I would assume that I must be doing something wrong, since I would fe> expect the host-specific route (i.e. longest-matching route) to be route fe> learned via routing sockets. And that I shouldn't be implementing this fe> "is the dst address my own address?" hack. fe> fe> Any thoughts? fe> fe> P.S.: I can provide a code snippet if that'd be of any help. RTM_GET should return fc00:1::1/128 with ifp == lo0. Can you show the code you are using? -- Hiroki ----Security_Multipart(Thu_Aug__7_19_24_03_2014_114)-- Content-Type: application/pgp-signature Content-Transfer-Encoding: 7bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iEYEABECAAYFAlPjU8MACgkQTyzT2CeTzy32FgCgqdd3V4Ap0oIXDly2EGDNJarS l4wAnjEF5rCAbRQv1mx5oSsMb4whzt+h =SWGN -----END PGP SIGNATURE----- ----Security_Multipart(Thu_Aug__7_19_24_03_2014_114)---- From owner-freebsd-net@FreeBSD.ORG Thu Aug 7 11:07:01 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 618E238D; Thu, 7 Aug 2014 11:07:01 +0000 (UTC) Received: from web01.jbserver.net (web01.jbserver.net [IPv6:2a00:8240:6:a::1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 25F39275C; Thu, 7 Aug 2014 11:07:01 +0000 (UTC) Received: from 18-132-17-190.fibertel.com.ar ([190.17.132.18] helo=[192.168.3.106]) by web01.jbserver.net with esmtpsa (TLSv1.2:DHE-RSA-AES128-SHA:128) (Exim 4.83) (envelope-from ) id 1XFLX9-0004Cg-8H; Thu, 07 Aug 2014 13:06:59 +0200 Message-ID: 
<53E35DA7.4020800@gont.com.ar> Date: Thu, 07 Aug 2014 07:06:15 -0400 From: Fernando Gont User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0 MIME-Version: 1.0 To: Hiroki Sato Subject: Re: Routing IPv6 packets towards oneself with routing sockets? References: <53E2B586.3080700@gont.com.ar> <20140807.192403.845244220459089560.hrs@allbsd.org> In-Reply-To: <20140807.192403.845244220459089560.hrs@allbsd.org> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Aug 2014 11:07:01 -0000 Hi, Hiroki, On 08/07/2014 06:24 AM, Hiroki Sato wrote: > > Fernando Gont wrote > in <53E2B586.3080700@gont.com.ar>: > > fe> However, whenever I lookup an entry for fc00:1::1 with routing sockets, > fe> the only entry I obtain is fc00:1::/64 (a network route) rather than > fe> fc00:1::1/128 (a host route). As a result, I kind of have to figure out > fe> that since fc00:1::1 is my own address, I must override whatever I > fe> learned via routing sockets, and just send my packets to loopback. > fe> > fe> I would assume that I must be doing something wrong, since I would > fe> expect the host-specific route (i.e. longest-matching route) to be route > fe> learned via routing sockets. And that I shouldn't be implementing this > fe> "is the dst address my own address?" hack. > fe> > fe> Any thoughts? > fe> > fe> P.S.: I can provide a code snippet if that'd be of any help. > > RTM_GET should return fc00:1::1/128 with ifp == lo0. Yes, that's what I would have expected. > Can you show > the code you are using? Yes: Run it as: bsd-lookup-simple -v IPV6_DEST_ADDR (or without the "-v" if you don't want much verbosity) Thanks! 
Best regards, -- Fernando Gont e-mail: fernando@gont.com.ar || fgont@si6networks.com PGP Fingerprint: 7809 84F5 322E 45C7 F1C9 3945 96EE A9EF D076 FFF1 From owner-freebsd-net@FreeBSD.ORG Thu Aug 7 20:43:43 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 1C30DD8F for ; Thu, 7 Aug 2014 20:43:43 +0000 (UTC) Received: from mail.allbsd.org (gatekeeper.allbsd.org [IPv6:2001:2f0:104:e001::32]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "*.allbsd.org", Issuer "RapidSSL CA" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 6C1C724CC for ; Thu, 7 Aug 2014 20:43:41 +0000 (UTC) Received: from alph.d.allbsd.org ([IPv6:2001:2f0:104:e010:862b:2bff:febc:8956]) (authenticated bits=56) by mail.allbsd.org (8.14.9/8.14.8) with ESMTP id s77KhHFA087727 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Fri, 8 Aug 2014 05:43:28 +0900 (JST) (envelope-from hrs@FreeBSD.org) Received: from localhost (localhost [IPv6:::1]) (authenticated bits=0) by alph.d.allbsd.org (8.14.8/8.14.8) with ESMTP id s77KhFjf035205; Fri, 8 Aug 2014 05:43:16 +0900 (JST) (envelope-from hrs@FreeBSD.org) Date: Fri, 08 Aug 2014 05:37:57 +0900 (JST) Message-Id: <20140808.053757.1725805140861121363.hrs@allbsd.org> To: fernando@gont.com.ar Subject: Re: Routing IPv6 packets towards oneself with routing sockets? 
From: Hiroki Sato In-Reply-To: <53E35DA7.4020800@gont.com.ar> <53E2B586.3080700@gont.com.ar> References: <53E2B586.3080700@gont.com.ar> <20140807.192403.845244220459089560.hrs@allbsd.org> <53E35DA7.4020800@gont.com.ar> X-PGPkey-fingerprint: BDB3 443F A5DD B3D0 A530 FFD7 4F2C D3D8 2793 CF2D X-Mailer: Mew version 6.6 on Emacs 24.3 / Mule 6.0 (HANACHIRUSATO) Mime-Version: 1.0 Content-Type: Multipart/Signed; protocol="application/pgp-signature"; micalg=pgp-sha1; boundary="--Security_Multipart0(Fri_Aug__8_05_37_57_2014_299)--" Content-Transfer-Encoding: 7bit X-Virus-Scanned: clamav-milter 0.97.4 at gatekeeper.allbsd.org X-Virus-Status: Clean X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (mail.allbsd.org [IPv6:2001:2f0:104:e001::32]); Fri, 08 Aug 2014 05:43:33 +0900 (JST) X-Spam-Status: No, score=-97.9 required=13.0 tests=CONTENT_TYPE_PRESENT, RDNS_NONE,SPF_SOFTFAIL,USER_IN_WHITELIST autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on gatekeeper.allbsd.org Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Aug 2014 20:43:43 -0000 ----Security_Multipart0(Fri_Aug__8_05_37_57_2014_299)-- Content-Type: Multipart/Mixed; boundary="--Next_Part(Fri_Aug__8_05_37_57_2014_990)--" Content-Transfer-Encoding: 7bit ----Next_Part(Fri_Aug__8_05_37_57_2014_990)-- Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Fernando Gont wrote in <53E35DA7.4020800@gont.com.ar>: fe> Yes: fe> fe> Run it as: fe> bsd-lookup-simple -v IPV6_DEST_ADDR Hmm, I tried and it seems it worked as expected. 
"./bsd-lookup-simple -v fc00:1::1" returns RTA_DST with fc00:1::1, and "-v fc00:1::2" returns RTA_DST with fc00:1::/64 like the following:

% netstat -nrf inet6 | grep ^fc00
fc00:1::/64 link#1 U em0
fc00:1::1 link#1 UHS lo0

% ./bsd-lookup-simple -v fc00:1::1
DEBUG: 1 SOCKET_RAW query
DEBUG: Received message
DEBUG: rtm_type: 4 (4), rtm_pid: 15079 (15079), rtm_seq: 1804289383 (1804289383)
DEBUG: RTA_DST was set
RTA_DST: fc00:1::1
DEBUG: RTA_GATEWAY was set
DEBUG: Family: 18, size 54, realsize: 56
DEBUG: sizeof(AF_LINK): 54, sizeof(AF_INET6): 28
DEBUG: RTA_GATEWAY: Name: em0, Index: 1
DEBUG: Quitted loop. onlink_f: 1, queries: 1
Outgoing interface: em0 (Index: 1)

% ./bsd-lookup-simple -v fc00:1::2
DEBUG: 1 SOCKET_RAW query
DEBUG: Received message
DEBUG: rtm_type: 4 (4), rtm_pid: 15085 (15085), rtm_seq: 1804289383 (1804289383)
DEBUG: RTA_DST was set
RTA_DST: fc00:1::
DEBUG: RTA_GATEWAY was set
DEBUG: Family: 18, size 54, realsize: 56
DEBUG: sizeof(AF_LINK): 54, sizeof(AF_INET6): 28
DEBUG: RTA_GATEWAY: Name: em0, Index: 1
DEBUG: Quitted loop. onlink_f: 1, queries: 1
Outgoing interface: em0 (Index: 1)

fe> However, whenever I lookup an entry for fc00:1::1 with routing sockets,
fe> the only entry I obtain is fc00:1::/64 (a network route) rather than
fe> fc00:1::1/128 (a host route).

Does this mean you got RTA_DST with fc00:1::/64 when running "bsd-lookup-simple -v fc00:1::1"? If so, it is very strange. What was returned when you entered "route -n get -inet6 fc00:1::1" and "route -n get -inet6 fc00:1::2" on your box?

Although your code assumes RTA_GATEWAY eventually returns the outgoing interface, that is not always true. RTA_IFP should be used if you want to look it up, instead of looking up gateways until AF_LINK is obtained. Certainly RTA_GATEWAY returns AF_LINK and you can check sdl_index in it, but the index number is not always the same as the actual outgoing interface (one of the examples is a host route).

A revised source file is attached.
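
For readers without the attachment at hand, the idea behind reading RTA_IFP can be sketched apart from the routing socket itself: the sockaddrs in a reply follow the bit order of rtm_addrs, each one padded as SA_SIZE() does on FreeBSD, so they can be split into per-slot pointers and the RTAX_IFP slot read directly. The code below is a simplified model and not the revised program: the AX_* names, sa_roundup() and split_addrs() are hypothetical stand-ins for the RTAX_* constants and SA_SIZE(), and the buffer is treated as raw bytes whose first byte per record plays the role of sa_len.

```c
#include <stddef.h>

/* Stand-ins for the RTAX_* slot indices, in the same order. */
enum { AX_DST, AX_GATEWAY, AX_NETMASK, AX_GENMASK,
       AX_IFP, AX_IFA, AX_AUTHOR, AX_BRD, AX_MAX };

/* Same rounding as the FreeBSD SA_SIZE() macro: pad the length up to a
 * multiple of sizeof(long); a zero length still occupies one long. */
static size_t sa_roundup(size_t len)
{
    size_t a = sizeof(long);

    return len == 0 ? a : 1 + ((len - 1) | (a - 1));
}

/* Split the sockaddr area of a reply into per-slot pointers.
 * Returns the number of slots filled, or -1 if the buffer is short. */
static int split_addrs(const unsigned char *buf, size_t buflen,
                       int addrs, const unsigned char *out[AX_MAX])
{
    size_t off = 0;
    int found = 0;

    for (int i = 0; i < AX_MAX; i++) {
        out[i] = NULL;
        if (!(addrs & (1 << i)))
            continue;
        if (off >= buflen)
            return -1;
        size_t step = sa_roundup(buf[off]); /* first byte is the length */
        if (step > buflen - off)
            return -1;
        out[i] = buf + off;
        off += step;
        found++;
    }
    return found;
}
```

After such a split, out[AX_IFP] (when not NULL) is where a sockaddr_dl carrying the outgoing interface would be found, with no need to chase gateways. Note that sa_roundup(54) is 56, which agrees with the "size 54, realsize: 56" line in the debug output above.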
Some nits are also fixed: 1) SA_SIZE() on MacOSX is not aligned with sizeof(long) and 2) IFACE_LENGTH should be IFNAMSIZ. -- Hiroki ----Next_Part(Fri_Aug__8_05_37_57_2014_990)-- Content-Type: Text/X-Patch; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="bsd-lookup-simple.c.diff" --- bsd-lookup-simple.c.orig 2014-08-08 04:47:55.000000000 +0900 +++ bsd-lookup-simple.c 2014-08-08 04:47:55.000000000 +0900 @@ -38,7 +38,12 @@ #endif #ifndef SA_SIZE -#if defined (__FreeBSD__) || defined(__NetBSD__) || defined (__OpenBSD__) || defined(__APPLE__) +#if defined(__APPLE__) +#define SA_SIZE(sa) \ + ( (!(sa) || ((struct sockaddr *)(sa))->sa_len == 0) ? \ + sizeof(long) : \ + ((struct sockaddr *)(sa))->sa_len ) +#elif defined (__FreeBSD__) || defined(__NetBSD__) || defined (__OpenBSD__) #define SA_SIZE(sa) \ ( (!(sa) || ((struct sockaddr *)(sa))->sa_len == 0) ? \ sizeof(long) : \ @@ -78,7 +83,11 @@ #endif #endif +#ifdef IFNAMSIZ +#define IFACE_LENGTH IFNAMSIZ +#else #define IFACE_LENGTH 255 +#endif unsigned int print_ipv6_address(char *s, struct in6_addr *); @@ -104,6 +113,9 @@ struct sockaddr_in6 *sin6; struct sockaddr_dl *sockpptr; struct sockaddr *sa; + struct sockaddr *so[RTAX_MAX]; + char *cp; + int i; void *end; unsigned char onlink_f=FALSE, nhaddr_f=FALSE, verbose_f=TRUE, debug_f=FALSE; struct in6_addr dstaddr, nhaddr; @@ -139,7 +151,7 @@ rtm->rtm_msglen= sizeof(struct rt_msghdr) + sizeof(struct sockaddr_in6); rtm->rtm_version= RTM_VERSION; rtm->rtm_type= RTM_GET; - rtm->rtm_addrs= RTA_DST; + rtm->rtm_addrs= RTA_DST | RTA_IFP; rtm->rtm_pid= pid= getpid(); rtm->rtm_seq= seq= random(); @@ -181,18 +193,27 @@ }while( rtm->rtm_type != RTM_GET || rtm->rtm_pid != pid || rtm->rtm_seq != seq); /* The rt_msghdr{} structure is followed by sockaddr structures */ - sa= (struct sockaddr *) (rtm+1); + cp = (char *)(rtm + 1); + for (i = 0; i < RTAX_MAX; i++) { + if (rtm->rtm_addrs & (1 << i)) { + so[i] = (struct sockaddr *)cp; + cp += 
SA_SIZE((struct sockaddr *)cp); + } else + so[i] = NULL; + } + + if(so[RTAX_DST] != NULL) { + sa = (struct sockaddr *)so[RTAX_DST]; - if(rtm->rtm_addrs & RTA_DST){ if(debug_f){ puts("DEBUG: RTA_DST was set"); print_ipv6_address("RTA_DST: ", &( ((struct sockaddr_in6 *)sa)->sin6_addr)); } - - SA_NEXT(sa); } - if(rtm->rtm_addrs & RTA_GATEWAY){ + if(so[RTAX_GATEWAY] != NULL){ + sa = (struct sockaddr *)so[RTAX_GATEWAY]; + if(debug_f){ puts("DEBUG: RTA_GATEWAY was set"); printf("DEBUG: Family: %d, size %d, realsize: %lu\n", sa->sa_family, sa->sa_len, SA_SIZE(sa)); @@ -207,20 +228,29 @@ print_ipv6_address("DEBUG: RTA_GATEWAY: ", &nhaddr); } } - else if(sa->sa_family == AF_LINK){ - sockpptr = (struct sockaddr_dl *) (sa); + } + + if (so[RTAX_IFP] != NULL) { + sa = (struct sockaddr *)so[RTAX_IFP]; + + sockpptr = (struct sockaddr_dl *) (sa); + if(debug_f){ + puts("DEBUG: RTA_IFP was set"); + printf("DEBUG: Family: %d, size %d, realsize: %lu\n", sa->sa_family, sa->sa_len, SA_SIZE(sa)); + } + if (sockpptr->sdl_family == AF_LINK) { nhifindex= sockpptr->sdl_index; nhifindex_f=TRUE; - - if(if_indextoname(nhifindex, nhiface) == NULL){ - puts("Error calling if_indextoname() from sel_next_hop()"); + if (sockpptr->sdl_nlen >= sizeof(nhiface)) { + puts("ifname is too long."); return(EXIT_FAILURE); } + strncpy(nhiface, sockpptr->sdl_data, + sockpptr->sdl_nlen); + nhiface[sizeof(nhiface) - 1] = '\0'; - if(debug_f){ - printf("DEBUG: RTA_GATEWAY: Name: %s, Index: %d\n", nhiface, nhifindex); - } - + if(debug_f) + printf("DEBUG: RTA_IFP: Name: %s, Index: %d\n", nhiface, nhifindex); onlink_f=TRUE; } } ----Next_Part(Fri_Aug__8_05_37_57_2014_990)-- Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="bsd-lookup-simple.c" /* * Program: bsd-routing-sockets.c * * Test IPv6 Routing sockets */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include 
#include #include #include #include #include #define TRUE 1 #define FALSE 0 #ifdef __linux__ /* Consulting the routing table */ #define MAX_NLPAYLOAD 1024 #else #define MAX_RTPAYLOAD 1024 #endif #ifndef SA_SIZE #if defined(__APPLE__) #define SA_SIZE(sa) \ ( (!(sa) || ((struct sockaddr *)(sa))->sa_len == 0) ? \ sizeof(long) : \ ((struct sockaddr *)(sa))->sa_len ) #elif defined (__FreeBSD__) || defined(__NetBSD__) || defined (__OpenBSD__) #define SA_SIZE(sa) \ ( (!(sa) || ((struct sockaddr *)(sa))->sa_len == 0) ? \ sizeof(long) : \ 1 + ( (((struct sockaddr *)(sa))->sa_len - 1) | (sizeof(long) - 1) ) ) #else #define SA_SIZE(sa) sizeof(struct sockaddr) #endif #endif #ifndef SA_NEXT #define SA_NEXT(sa) (sa= (struct sockaddr *) ( (char *) sa + SA_SIZE(sa))) #endif #if defined (__FreeBSD__) || defined(__NetBSD__) || defined (__OpenBSD__) || defined(__APPLE__) #ifndef s6_addr16 #define s6_addr16 __u6_addr.__u6_addr16 #endif #ifndef s6_addr #define s6_addr __u6_addr.__u6_addr8 #endif #ifndef s6_addr8 #define s6_addr8 __u6_addr.__u6_addr8 #endif #ifndef s6_addr32 #define s6_addr32 __u6_addr.__u6_addr32 #endif #elif defined __linux__ || ( !defined(__FreeBSD__) && defined(__FreeBSD_kernel__)) #ifndef s6_addr16 #define s6_addr16 __in6_u.__u6_addr16 #endif #ifndef s6_addr32 #define s6_addr32 __in6_u.__u6_addr32 #endif #endif #ifdef IFNAMSIZ #define IFACE_LENGTH IFNAMSIZ #else #define IFACE_LENGTH 255 #endif unsigned int print_ipv6_address(char *s, struct in6_addr *); int main(int argc, char *argv[]){ int sockfd; pid_t pid; int seq; ssize_t r; size_t ssize; unsigned int queries=0; char reply[MAX_RTPAYLOAD]; unsigned char nhifindex_f=0; unsigned int nhifindex; char nhiface[IFACE_LENGTH], pv6addr[INET6_ADDRSTRLEN]; #if defined(__APPLE__) char aflink_f= FALSE; #endif struct rt_msghdr *rtm; struct sockaddr_in6 *sin6; struct sockaddr_dl *sockpptr; struct sockaddr *sa; struct sockaddr *so[RTAX_MAX]; char *cp; int i; void *end; unsigned char onlink_f=FALSE, nhaddr_f=FALSE, 
verbose_f=TRUE, debug_f=FALSE; struct in6_addr dstaddr, nhaddr; if(argc < 2){ puts("usage: lookup [-v] IPV6_ADDRESS"); exit(1); } else if(argc > 2){ debug_f= TRUE; } if( (sockfd=socket(AF_ROUTE, SOCK_RAW, 0)) == -1){ if(verbose_f) puts("Error in socket() call from sel_next_hop()"); return(EXIT_FAILURE); } if ( inet_pton(AF_INET6, (strlen(argv[1]) <= 2 && debug_f)?argv[2]:argv[1], &dstaddr) <= 0){ puts("inet_pton(): Target Address not valid"); exit(EXIT_FAILURE); } nhaddr= dstaddr; do{ if(debug_f) printf("DEBUG: %u SOCKET_RAW query\n", queries+1); rtm= (struct rt_msghdr *) reply; memset(rtm, 0, sizeof(struct rt_msghdr)); rtm->rtm_msglen= sizeof(struct rt_msghdr) + sizeof(struct sockaddr_in6); rtm->rtm_version= RTM_VERSION; rtm->rtm_type= RTM_GET; rtm->rtm_addrs= RTA_DST | RTA_IFP; rtm->rtm_pid= pid= getpid(); rtm->rtm_seq= seq= random(); sin6= (struct sockaddr_in6 *) (rtm + 1); memset(sin6, 0, sizeof(struct sockaddr_in6)); sin6->sin6_len= sizeof(struct sockaddr_in6); sin6->sin6_family= AF_INET6; sin6->sin6_addr= nhaddr; #if defined(__APPLE__) if(IN6_IS_ADDR_LINKLOCAL(&nhaddr)){ aflink_f= TRUE; } #endif if(write(sockfd, rtm, rtm->rtm_msglen) == -1){ if(verbose_f) puts("write() failed. 
No route to the intenteded destination in the local routing table"); exit(EXIT_FAILURE); } do{ if( (r=read(sockfd, rtm, MAX_RTPAYLOAD)) < 0){ if(verbose_f) puts("Error in read() call from sel_next_hop()"); exit(EXIT_FAILURE); } /* The size of the structure should be at least sizof(long) */ end= (char *) rtm + r - (sizeof(long) -1); if(debug_f){ puts("DEBUG: Received message"); printf("DEBUG: rtm_type: %d (%d), rtm_pid: %d (%d), rtm_seq: %d (%d)\n", rtm->rtm_type, RTM_GET, rtm->rtm_pid, pid, \ rtm->rtm_seq, seq); } }while( rtm->rtm_type != RTM_GET || rtm->rtm_pid != pid || rtm->rtm_seq != seq); /* The rt_msghdr{} structure is followed by sockaddr structures */ cp = (char *)(rtm + 1); for (i = 0; i < RTAX_MAX; i++) { if (rtm->rtm_addrs & (1 << i)) { so[i] = (struct sockaddr *)cp; cp += SA_SIZE((struct sockaddr *)cp); } else so[i] = NULL; } if(so[RTAX_DST] != NULL) { sa = (struct sockaddr *)so[RTAX_DST]; if(debug_f){ puts("DEBUG: RTA_DST was set"); print_ipv6_address("RTA_DST: ", &( ((struct sockaddr_in6 *)sa)->sin6_addr)); } } if(so[RTAX_GATEWAY] != NULL){ sa = (struct sockaddr *)so[RTAX_GATEWAY]; if(debug_f){ puts("DEBUG: RTA_GATEWAY was set"); printf("DEBUG: Family: %d, size %d, realsize: %lu\n", sa->sa_family, sa->sa_len, SA_SIZE(sa)); printf("DEBUG: sizeof(AF_LINK): %lu, sizeof(AF_INET6): %lu\n", sizeof(struct sockaddr_dl), sizeof(struct sockaddr_in6)); } if(sa->sa_family == AF_INET6){ nhaddr= ((struct sockaddr_in6 *) sa)->sin6_addr; nhaddr_f=TRUE; if(debug_f){ print_ipv6_address("DEBUG: RTA_GATEWAY: ", &nhaddr); } } } if (so[RTAX_IFP] != NULL) { sa = (struct sockaddr *)so[RTAX_IFP]; sockpptr = (struct sockaddr_dl *) (sa); if(debug_f){ puts("DEBUG: RTA_IFP was set"); printf("DEBUG: Family: %d, size %d, realsize: %lu\n", sa->sa_family, sa->sa_len, SA_SIZE(sa)); } if (sockpptr->sdl_family == AF_LINK) { nhifindex= sockpptr->sdl_index; nhifindex_f=TRUE; if (sockpptr->sdl_nlen >= sizeof(nhiface)) { puts("ifname is too long."); return(EXIT_FAILURE); } strncpy(nhiface, 
sockpptr->sdl_data, sockpptr->sdl_nlen); nhiface[sizeof(nhiface) - 1] = '\0'; if(debug_f) printf("DEBUG: RTA_IFP: Name: %s, Index: %d\n", nhiface, nhifindex); onlink_f=TRUE; } } queries++; }while(!onlink_f && queries < 10); if(debug_f) printf("DEBUG: Quitted loop. onlink_f: %d, queries: %d\n", onlink_f, queries); close(sockfd); if(nhifindex_f){ if(IN6_IS_ADDR_LINKLOCAL(&nhaddr)){ /* BSDs store the interface index in s6_addr16[1], so we must clear it */ nhaddr.s6_addr16[1] =0; nhaddr.s6_addr16[2] =0; nhaddr.s6_addr16[3] =0; } if(nhaddr_f){ if(inet_ntop(AF_INET6, &nhaddr, pv6addr, sizeof(pv6addr)) == NULL){ puts("inet_ntop(): Error converting IPv6 Address to presentation format"); exit(EXIT_FAILURE); } printf("Next-Hop address: %s\n", pv6addr); } printf("Outgoing interface: %s (Index: %d)\n", nhiface, nhifindex); return(EXIT_SUCCESS); } else{ return(EXIT_FAILURE); } } /* * Function: print_ipv6_addresss() * * Prints an IPv6 address with a legend */ unsigned int print_ipv6_address(char *s, struct in6_addr *v6addr){ char pv6addr[INET6_ADDRSTRLEN]; if(inet_ntop(AF_INET6, v6addr, pv6addr, sizeof(pv6addr)) == NULL){ puts("inet_ntop(): Error converting IPv6 Source Address to presentation format"); return(EXIT_FAILURE); } printf("%s%s\n", s, pv6addr); return(EXIT_SUCCESS); } ----Next_Part(Fri_Aug__8_05_37_57_2014_990)---- ----Security_Multipart0(Fri_Aug__8_05_37_57_2014_299)-- Content-Type: application/pgp-signature Content-Transfer-Encoding: 7bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iEYEABECAAYFAlPj46UACgkQTyzT2CeTzy0+dgCfSM+VFavDSY1XB9jAICfmoK0o tn0AoMoDbvE5v/Fy460jYm5XUkHzzIk6 =fVr9 -----END PGP SIGNATURE----- ----Security_Multipart0(Fri_Aug__8_05_37_57_2014_299)---- From owner-freebsd-net@FreeBSD.ORG Thu Aug 7 23:09:21 2014 Return-Path: Delivered-To: freebsd-net@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by 
hub.freebsd.org (Postfix) with ESMTPS id 406FCBF4 for ; Thu, 7 Aug 2014 23:09:21 +0000 (UTC) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 28F2B24CF for ; Thu, 7 Aug 2014 23:09:21 +0000 (UTC) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.14.8/8.14.8) with ESMTP id s77N9LMP062724 for ; Thu, 7 Aug 2014 23:09:21 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-net@FreeBSD.org Subject: [Bug 91311] [aue] aue interface hanging Date: Thu, 07 Aug 2014 23:09:20 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 6.0-STABLE X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: dvl@FreeBSD.org X-Bugzilla-Status: In Discussion X-Bugzilla-Priority: Normal X-Bugzilla-Assigned-To: freebsd-net@FreeBSD.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: cc Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Aug 2014 23:09:21 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=91311 Dan Langille changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |k@free.de --- Comment #2 from Dan Langille --- *** Bug 181160 has been marked as a duplicate of this bug. 
*** -- You are receiving this mail because: You are the assignee for the bug. From owner-freebsd-net@FreeBSD.ORG Fri Aug 8 12:34:45 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E4BB8847 for ; Fri, 8 Aug 2014 12:34:45 +0000 (UTC) Received: from mail-qa0-x22e.google.com (mail-qa0-x22e.google.com [IPv6:2607:f8b0:400d:c00::22e]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id A34BF2B8B for ; Fri, 8 Aug 2014 12:34:45 +0000 (UTC) Received: by mail-qa0-f46.google.com with SMTP id v10so5393353qac.19 for ; Fri, 08 Aug 2014 05:34:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=/O04H460hDerbQhw04GnsmACIIn9QGJl2twccMfQIUU=; b=WpGjVS3VTrcKsN02Bz9CeBrGOsTZ3KDNKqq08RScVHvjGLx7BW7dg7Vt0gaiZfELBS 2F0OJT/2Tv3cdeARGMeWO9QZZC3r+BAGhvr7ubyffsFBsfYMkfBnbcVRJPzsimAGGQ9b vE29Gn83D+txfngK6Xcjvn8rVLUWxoybHNtkWFwiNLUKlqwIxzZLLfW5j11aZEKD22eJ XLnJzWzlRXF4nnOOjHUf2tecaKz5WOnMY9VQxlxFGJupg8nLSd1ZuJIWvvhmYogmbyKa pkF5WEhLVH8lqS6IeK6Z+dMhLp2XvVfH+DkFJqNy+xDM7Zbcj1yoAYoNM9plVIV2EWdb R1PQ== MIME-Version: 1.0 X-Received: by 10.224.137.65 with SMTP id v1mr37212015qat.53.1407501284395; Fri, 08 Aug 2014 05:34:44 -0700 (PDT) Received: by 10.224.137.71 with HTTP; Fri, 8 Aug 2014 05:34:44 -0700 (PDT) Date: Fri, 8 Aug 2014 20:34:44 +0800 Message-ID: Subject: A problem on TCP in High RTT Environment. 
From: Niu Zhixiong To: freebsd-net@freebsd.org, Bill Yuan Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.18 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 08 Aug 2014 12:34:46 -0000

Dear all,

Last month I posted about problems with FTP/TCP in a high-RTT environment. Since then I have set up a simulation environment (dummynet) to test TCP and SCTP under high delay. In these tests TCP is always slower than SCTP, which I think should not be possible (please see the figure in the attachment). When the delay is 200 ms (i.e. RTT = 400 ms), TCP is extremely slow.

All tests: BW = 20 Mbps, delay = 0 to 200 ms, packet loss = 0 (set by dummynet).

These are my parameters:

FreeBSD vfreetest0 10.0-RELEASE FreeBSD 10.0-RELEASE #0: Thu Aug 7 11:04:15 HKT 2014

sysctl net.inet.tcp
net.inet.tcp.rfc1323: 1 net.inet.tcp.mssdflt: 536 net.inet.tcp.keepidle: 7200000 net.inet.tcp.keepintvl: 75000 net.inet.tcp.sendspace: 32768 net.inet.tcp.recvspace: 65536 net.inet.tcp.keepinit: 75000 net.inet.tcp.delacktime: 100 net.inet.tcp.v6mssdflt: 1220 net.inet.tcp.cc.algorithm: newreno net.inet.tcp.cc.available: newreno net.inet.tcp.hostcache.cachelimit: 15360 net.inet.tcp.hostcache.hashsize: 512 net.inet.tcp.hostcache.bucketlimit: 30 net.inet.tcp.hostcache.count: 0 net.inet.tcp.hostcache.expire: 3600 net.inet.tcp.hostcache.prune: 300 net.inet.tcp.hostcache.purge: 0 net.inet.tcp.log_in_vain: 0 net.inet.tcp.blackhole: 0 net.inet.tcp.delayed_ack: 1 net.inet.tcp.drop_synfin: 0 net.inet.tcp.rfc3042: 1 net.inet.tcp.rfc3390: 1 net.inet.tcp.experimental.initcwnd10: 1 net.inet.tcp.rfc3465: 1 net.inet.tcp.abc_l_var: 2 net.inet.tcp.ecn.enable: 0 net.inet.tcp.ecn.maxretries: 1 net.inet.tcp.insecure_rst: 0 net.inet.tcp.recvbuf_auto: 0
net.inet.tcp.recvbuf_inc: 16384 net.inet.tcp.recvbuf_max: 2097152 net.inet.tcp.path_mtu_discovery: 1 net.inet.tcp.tso: 1 net.inet.tcp.sendbuf_auto: 0 net.inet.tcp.sendbuf_inc: 8192 net.inet.tcp.sendbuf_max: 2097152 net.inet.tcp.reass.maxsegments: 15900 net.inet.tcp.reass.cursegments: 0 net.inet.tcp.reass.overflows: 0 net.inet.tcp.sack.enable: 1 net.inet.tcp.sack.maxholes: 128 net.inet.tcp.sack.globalmaxholes: 65536 net.inet.tcp.sack.globalholes: 0 net.inet.tcp.minmss: 216 net.inet.tcp.log_debug: 0 net.inet.tcp.tcbhashsize: 32768 net.inet.tcp.do_tcpdrain: 1 net.inet.tcp.pcbcount: 4 net.inet.tcp.icmp_may_rst: 1 net.inet.tcp.isn_reseed_interval: 0 net.inet.tcp.soreceive_stream: 0 net.inet.tcp.syncookies: 1 net.inet.tcp.syncookies_only: 0 net.inet.tcp.syncache.bucketlimit: 30 net.inet.tcp.syncache.cachelimit: 15375 net.inet.tcp.syncache.count: 0 net.inet.tcp.syncache.hashsize: 512 net.inet.tcp.syncache.rexmtlimit: 3 net.inet.tcp.syncache.rst_on_sock_fail: 1 net.inet.tcp.msl: 30000 net.inet.tcp.rexmit_min: 30 net.inet.tcp.rexmit_slop: 200 net.inet.tcp.always_keepalive: 1 net.inet.tcp.fast_finwait2_recycle: 0 net.inet.tcp.finwait2_timeout: 60000 net.inet.tcp.keepcnt: 8 net.inet.tcp.rexmit_drop_options: 0 net.inet.tcp.per_cpu_timers: 0 net.inet.tcp.timer_race: 0 net.inet.tcp.maxtcptw: 26070 net.inet.tcp.nolocaltimewait: 0 kern.ipc.maxsockbuf: 4000000 kern.ipc.sockbuf_waste_factor: 8 kern.ipc.max_linkhdr: 16 kern.ipc.max_protohdr: 60 kern.ipc.max_hdr: 76 kern.ipc.max_datalen: 92 kern.ipc.maxmbufmem: 2073962496 kern.ipc.nmbclusters: 253170 kern.ipc.nmbjumbop: 126584 kern.ipc.nmbjumbo9: 112518 kern.ipc.nmbjumbo16: 84388 kern.ipc.nmbufs: 1620285 kern.ipc.maxpipekva: 66736128 kern.ipc.pipekva: 16384 kern.ipc.pipefragretry: 0 kern.ipc.pipeallocfail: 0 kern.ipc.piperesizefail: 0 kern.ipc.piperesizeallowed: 1 kern.ipc.msgmax: 16384 kern.ipc.msgmni: 40 kern.ipc.msgmnb: 2048 kern.ipc.msgtql: 40 kern.ipc.msgssz: 8 kern.ipc.msgseg: 2048 kern.ipc.semmni: 50 kern.ipc.semmns: 340 
kern.ipc.semmnu: 150 kern.ipc.semmsl: 340 kern.ipc.semopm: 100 kern.ipc.semume: 50 kern.ipc.semusz: 632 kern.ipc.semvmx: 32767 kern.ipc.semaem: 16384 kern.ipc.shmmax: 536870912 kern.ipc.shmmin: 1 kern.ipc.shmmni: 192 kern.ipc.shmseg: 128 kern.ipc.shmall: 131072 kern.ipc.shm_use_phys: 0 kern.ipc.shm_allow_removed: 0 kern.ipc.soacceptqueue: 128 kern.ipc.numopensockets: 14 kern.ipc.maxsockets: 130350 kern.ipc.sendfile.readahead: 1 sysctl net.inet.sctp net.inet.sctp.sendspace: 2097152 net.inet.sctp.recvspace: 2097152 net.inet.sctp.auto_asconf: 1 net.inet.sctp.ecn_enable: 1 net.inet.sctp.strict_sacks: 1 net.inet.sctp.peer_chkoh: 256 net.inet.sctp.maxburst: 4 net.inet.sctp.fr_maxburst: 4 net.inet.sctp.maxchunks: 31646 net.inet.sctp.tcbhashsize: 1024 net.inet.sctp.pcbhashsize: 256 net.inet.sctp.min_split_point: 2904 net.inet.sctp.chunkscale: 10 net.inet.sctp.delayed_sack_time: 200 net.inet.sctp.sack_freq: 2 net.inet.sctp.sys_resource: 1000 net.inet.sctp.asoc_resource: 10 net.inet.sctp.heartbeat_interval: 30000 net.inet.sctp.pmtu_raise_time: 600 net.inet.sctp.shutdown_guard_time: 180 net.inet.sctp.secret_lifetime: 3600 net.inet.sctp.rto_max: 60000 net.inet.sctp.rto_min: 1000 net.inet.sctp.rto_initial: 3000 net.inet.sctp.init_rto_max: 60000 net.inet.sctp.valid_cookie_life: 60000 net.inet.sctp.init_rtx_max: 8 net.inet.sctp.assoc_rtx_max: 10 net.inet.sctp.path_rtx_max: 5 net.inet.sctp.path_pf_threshold: 65535 net.inet.sctp.add_more_on_output: 1452 net.inet.sctp.incoming_streams: 2048 net.inet.sctp.outgoing_streams: 10 net.inet.sctp.cmt_on_off: 0 net.inet.sctp.nr_sack_on_off: 0 net.inet.sctp.cmt_use_dac: 0 net.inet.sctp.cwnd_maxburst: 1 net.inet.sctp.asconf_auth_nochk: 0 net.inet.sctp.auth_disable: 0 net.inet.sctp.nat_friendly: 1 net.inet.sctp.abc_l_var: 2 net.inet.sctp.max_chained_mbufs: 5 net.inet.sctp.do_sctp_drain: 1 net.inet.sctp.hb_max_burst: 4 net.inet.sctp.abort_at_limit: 0 net.inet.sctp.strict_data_order: 0 net.inet.sctp.min_residual: 1452 
net.inet.sctp.max_retran_chunk: 30 net.inet.sctp.log_level: 0 net.inet.sctp.default_cc_module: 0 net.inet.sctp.default_ss_module: 0 net.inet.sctp.default_frag_interleave: 1 net.inet.sctp.mobility_base: 0 net.inet.sctp.mobility_fasthandoff: 0 net.inet.sctp.udp_tunneling_port: 0 net.inet.sctp.enable_sack_immediately: 0 net.inet.sctp.nat_friendly_init: 0 net.inet.sctp.vtag_time_wait: 60 net.inet.sctp.buffer_splitting: 0 net.inet.sctp.initial_cwnd: 3 net.inet.sctp.rttvar_bw: 4 net.inet.sctp.rttvar_rtt: 5 net.inet.sctp.rttvar_eqret: 0 net.inet.sctp.rttvar_steady_step: 20 net.inet.sctp.use_dcccecn: 1 net.inet.sctp.blackhole: 0 net.inet.sctp.debug: 0 Regards, Niu Zhixiong =EF=BC=8D=EF=BC=8D=EF=BC=8D=EF=BC=8D=EF=BC=8D=EF=BC=8D=EF=BC=8D=EF=BC=8D=EF= =BC=8D=EF=BC=8D=EF=BC=8D=EF=BC=8D=EF=BC=8D=EF=BC=8D=EF=BC=8D kaiaixi@gmail.com From owner-freebsd-net@FreeBSD.ORG Fri Aug 8 12:50:20 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4FC34D9B; Fri, 8 Aug 2014 12:50:20 +0000 (UTC) Received: from mail.rlan.ru (mail.rlan.ru [213.234.25.10]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id C23AA2DBA; Fri, 8 Aug 2014 12:50:19 +0000 (UTC) Message-ID: <53E4BE62.4050303@rlan.ru> Date: Fri, 08 Aug 2014 16:11:14 +0400 From: Dmitry Selivanov User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:24.0) Gecko/20100101 Thunderbird/24.1.1 MIME-Version: 1.0 To: "Alexander V. 
Chernikov" Subject: Re: ipfw named objejcts, table values and syntax change References: <53DC01DE.3000000@FreeBSD.org> <53DCA25C.1000108@FreeBSD.org> <53DF55FA.8010303@FreeBSD.org> <20140804115817.GA13814@onelab2.iet.unipi.it> <53DFE438.5050209@FreeBSD.org> In-Reply-To: <53DFE438.5050209@FreeBSD.org> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Cc: freebsd-ipfw , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 08 Aug 2014 12:50:20 -0000 04.08.2014 23:51, Alexander V. Chernikov пишет: > On 04.08.2014 15:58, Luigi Rizzo wrote: >> On Mon, Aug 04, 2014 at 01:44:26PM +0400, Alexander V. Chernikov wrote: >>> On 02.08.2014 12:33, Alexander V. Chernikov wrote: >>>> On 02.08.2014 10:33, Luigi Rizzo wrote: >>>>> >>>>> >>>>> On Fri, Aug 1, 2014 at 11:08 PM, Alexander V. Chernikov >>>>> > wrote: >>>>> >>>>> Hello all. >>>>> >>>>> I'm currently working on to enhance ipfw in some areas. >>>>> The most notable (and user-visible) change is named table support. >>>>> The other one is support for different lookup algorithms for different >>>>> key types. >>>>> >>>>> For example, new ipfw permits writing this: >>>>> >>>>> ipfw table tb1 create type cidr >>>>> ipfw add allow ip from table(tl1) to any >>>>> ipfw add allow ip from any lookup dst-ip tb1 >>>>> >>>>> ipfw table if1 create type iface >>>>> ipfw add skipto tablearg ip from any to any via table(if1) >>>>> >>>>> or even this: >>>>> ipfw table fl1 create type flow:src-ip,proto,dst-ip,dst-port >>>>> ipfw table fl1 add 10.0.0.5,tcp,10.0.0.6,80 4444 >>>>> ipfw add allow ip from any to any flow table(fl1) >>>>> >>>>> all these changes fully preserve backward compatibility. 
>>>>> (actually tables needs now to be created before use and their type needs >>>>> to match with opcode used, but new ipfw(8) performs auto-creation >>>>> for cidr tables). >>>>> >>>>> There is another thing I'm going to change and I'm not sure I can keep >>>>> the same compatibility level. >>>>> >>>>> Table values, from one point of view, can be classified to the following >>>>> types: >>>>> >>>>> - skipto argument >>>>> - fwd argument (*) >>>>> - link to another object (nat, pipe, queue) >>>>> - plain u32 (not bound to any object) >>>>> (divert/tee,netgraph,tag/utag,limit) >>>>> >>>>> There are the following reasons why I think it is necessary to implement >>>>> explicit table values typing (like tables): >>>>> - Implementing fwd tablearg for IPv6 hosts requires indirection table >>>>> - Converting nat/pipe instance ids to names renders values unusable >>>>> - retiring old hack with storing saved pointer of found object/rule >>>>> inside rule w/o proper locking >>>>> - making faster skipto >>>>> >>>>> >>>>> ??????i don't buy the idea that you need typed arguments >>>>> for all the cases above. Maybe the case that >>>>> may make sense is the fwd argument (and in the future >>>>> something else). >>>>> We already discussed, i think, the fact that now it >>>>> is legal to have references to non existing things >>>>> (skipto, pipes etc.) implemented as u32. >>>>> Removing that would break configurations. >>>> It depends on actual implementation. This can be preserved by >>>> auto-creating necessary objects in kernel and/or in userspace, so >>>> we can (and should) avoid breaking in this particular way. >>> Can you please explain your vision on values another time? >>> As far as I understand, you're not against it in general, but the >>> details matter: >>> * IP address can be one of the types (it won't break much, and we can >>> simply skip that one for MFC) >>> * what about typing for nat/pipes ? we're not going to convert their ids >>> to names? 
(or maybe you can suggest other non-disruptive way?) >>> * everything else is type "u32" >> >> Correct, I am mostly concerned about the details, not on the general concept. >> >> To summarize the discussion Alexander and I had about converting >> identifiers from numbers to arbitrary strings (this is partly related >> to the values stored in tables, but I think we should have a coherent >> behaviour) >> >> 1. CURRENTLY ipfw uses numeric identifiers in a small range (16 bits or less) >> for rules, pipes, queues, tables, probably nat instances. >> >> 2. CURRENTLY, in all the above contexts, it is legal to reference a >> non existing object (rule, pipe, table names, etc.), >> and the kernel will do something reasonable, namely jump to the >> next rule, drop traffic for non existing pipes, and so on. >> >> 3. of course we want to preserve backward compatibility both for >> the ioctl interface, and for user configurations. >> >> 4. The in-kernel representation of identifiers is not visible to users, >> so we can use a numeric representation in the kernel for identifiers. >> Strings like "12345" are converted with atoi() or the like, >> whereas for other identifiers or numbers outside of the 2^16 range >> the kernel manages a translation table, allocating new numeric >> identifiers if a new string appears. >> This permits backward compatibility for old rulesets, and does not >> impact performance because the translation table is only >> used during rules additions or deletion. > Yes. However this requires either holding either (1) 2 pointers (old&new > arrays), or (2) 65k+ index array, or (3) chained hash table. 
> (1) would require additional pointers for each subsystem (and some
> additional management),
> (2) will definitely upset embedded guys and
> (3) is worse in terms of performance
>>
>> With this in mind, I think we should follow a similar approach for
>> objects stored in tables, hence
>>
>> if a u32 value was available in the past, it must be
>> available also in the new implementation.
>>
>> The issue with tables is that some convoluted configuration could
>> use the same table to reference pipes _and_ rules _and_ perhaps
>> other things represented as numbers (the former is not too strange,
>> if I have a large configuration I might place sections at rules
>> 12000, 13000, 14000... and associate pipes with the same numeric
>> identifier to each block of rules).
>>
>> Typed table values would clearly disturb backward compatibility
>> in the above configurations. However it should not be difficult
>> to accept arbitrary strings as the values stored in tables, and
>> then store multiple representations as appropriate, including:
> Well, I've thought about this one. It may be an option, but the details
> are not so promising (below)
>> - the string representation, unconditionally
>> - for names that can be resolved by DNS, the ipv6 and ipv4 address(es)
>> associated with them. ipfw already translates hostnames in rules
>> so this is POLA
> I'm not happy with what ipfw(8) is doing instead of translation. The proper
> way would be not simply using the first AF_INET answer but saving ALL
> IPv4+IPv6 records inside the rule (and some more tracking should be done
> afterwards, but that's a totally different story). Additionally, I'm
> unsure if we really need a next-hop value expressed as a hostname (how can
> we deal with multiple addresses and different AFs?). We may store strings
> (and I think we should do it) but I'm unsure about this particular
> option of interpreting them.
>> - for other strings, a u32 from the translation table as previously
>> indicated
>> - and for numeric values, the u32 representation (truncated if needed,
>> according to whatever is the existing behaviour)
>> -
>> If we cannot generate a u32 we will put some value (e.g. 0)
>> that hopefully will not cause confusion.
> As far as I understand, we accept some string "s" as a table value inside
> the kernel, then we have some logic that says:
> oh, a dummynet pipe has the same name "s"; oh, a nat entity with name "s"
> has just been created, let's save indices.
>
> That would require an additional indirection table like:
>
> index | [ skipto idx | nat idx | pipe idx | queue idx | fwd index ]
> ( so we will have a 2-level indirection table for fwd if we do IPv6)
>
> We can optimize this if we use a "same name -> same kidx" approach
> regardless of the kernel object we're referring to. That might require some
> more memory, but that's OK from my point of view.
>
> So we end up with
> int [ skipto idx | fwd idx | obj idx ]
>
> idx "0" is a special value which means the same as 2.CURRENT
>
> That looks better, but still way too complex.
> I do care about compatibility, but it's hard to improve things without
> changing.
>
> I'd like to propose the following:
> * Split values into 3 types ("ip|nexthop", "number", "object")
> * Do not insist on object existence, use value "0" to mimic 2.CURRENT
> behavior.
> * Retain full compatibility by introducing a special value type "legacy"
> which matches any type and is backed by the given indirection table.
> * Issue a warning in the ipfw(8) binary on all auto-created tables that
> auto-creation is legacy and this behavior will be dropped in the next major
> release (e.g. 11.0)
> * Save this behavior in MFC but drop "legacy" tables in head a
> month after the actual MFC.
>
> What do you think?
>>
>> If we do it this way, we should be able to preserve backward
>> compatibility _and_ add features that people may need.
>>
>> cheers
>> luigi
>>

Here is my idea: a tablearg should be able to carry more than one value. I think getting several values from one table lookup is faster than doing several table lookups that each return one value. Let tablearg be not just a uint32, but an array holding values of different types.

For example, I have many rules like this:
allow src-ip 1.2.3.4 MAC any 11:22:33:44:55:66 recv vlan1234 dst-ip 1.1.1.1

These rules could be replaced with a construction such as:
allow src-ip table(1) MAC any tablearg[1] recv tablearg[2] dst-ip tablearg[3]

But I don't think indexing by value position is a good idea. I think using the starting byte as the index is a better way:
allow src-ip table(1) MAC any tablearg:0 recv tablearg:6 dst-ip tablearg:32
where the MAC's 6 bytes occupy offsets 0 to 5 in tablearg; the iface string starts at offset 6 and runs until \0, but is shorter than 26 bytes; and the IPv4 address's 4 bytes occupy offsets 32 to 35.

So we need to create a table for it:
table 1 set MAC:0 string:6:26 ip:32
table 1 add 1.2.3.4 11:22:33:44:55:66 vlan1234 1.1.1.1

A string can be used both for an iface and for a comment. Other possible value types:
uint16 for nat, pipe, skipto and other 2-byte actions
IPv4 4 bytes
CIDRv4 5 bytes
IPv6 16 bytes
CIDRv6 17 bytes
table_id 2 bytes - link to another table

The table value length could be set, for example, with a loader tunable like net.inet.ip.fw.table_value_length. Even with the default uint32 value length we can get 2 uint16 values or 4 uint8 values, which can help in some configurations.

This way is more complex, but much more flexible, much like the netgraph subsystem. I think it suits both Alexander's and Luigi's requests.
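
To make the byte-offset proposal above a bit more concrete, here is a minimal C sketch of how such a multi-field table value could be packed and read back. Everything in it (struct table_value, tv_pack(), tv_at() and the TV_* constants) is hypothetical and invented for illustration; it is not existing ipfw code, and the offsets simply mirror the example "table 1 set MAC:0 string:6:26 ip:32".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout mirroring "MAC:0 string:6:26 ip:32" from the example. */
enum {
    TV_MAC_OFF = 0,  TV_MAC_LEN = 6,   /* bytes 0..5   */
    TV_STR_OFF = 6,  TV_STR_LEN = 26,  /* bytes 6..31  */
    TV_IP4_OFF = 32, TV_IP4_LEN = 4,   /* bytes 32..35 */
    TV_VALUE_LEN = 36                  /* assumed total value length */
};

/* Hypothetical fixed-length table value: an opaque byte array. */
struct table_value {
    uint8_t bytes[TV_VALUE_LEN];
};

/*
 * Pack a MAC address, an interface name and an IPv4 address into one value.
 * Returns 0 on success, -1 if the name does not fit in its slot together
 * with the terminating NUL.
 */
static int
tv_pack(struct table_value *tv, const uint8_t mac[TV_MAC_LEN],
    const char *iface, const uint8_t ip4[TV_IP4_LEN])
{
    size_t n = strlen(iface);

    if (n >= TV_STR_LEN)
        return (-1);

    memset(tv, 0, sizeof(*tv));
    memcpy(tv->bytes + TV_MAC_OFF, mac, TV_MAC_LEN);
    /* The slot was zeroed above, so the name stays NUL-terminated. */
    memcpy(tv->bytes + TV_STR_OFF, iface, n);
    memcpy(tv->bytes + TV_IP4_OFF, ip4, TV_IP4_LEN);
    return (0);
}

/*
 * What "tablearg:OFF" would resolve to: a pointer to len bytes starting at
 * byte offset off, or NULL if the requested range leaves the value.
 */
static const uint8_t *
tv_at(const struct table_value *tv, size_t off, size_t len)
{
    if (off > TV_VALUE_LEN || len > TV_VALUE_LEN - off)
        return (NULL);
    return (tv->bytes + off);
}
```

With this layout, a rule using "tablearg:6" would read the interface name via tv_at(&tv, 6, 26) and "tablearg:32" would read the IPv4 address via tv_at(&tv, 32, 4). The bounds check in tv_at() is the kind of validation that rule insertion would presumably need so that a rule cannot reference bytes past the configured value length.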
From owner-freebsd-net@FreeBSD.ORG Sat Aug 9 05:53:13 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id C0E93858; Sat, 9 Aug 2014 05:53:13 +0000 (UTC) Received: from web01.jbserver.net (web01.jbserver.net [IPv6:2a00:8240:6:a::1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 85F272149; Sat, 9 Aug 2014 05:53:13 +0000 (UTC) Received: from 18-132-17-190.fibertel.com.ar ([190.17.132.18] helo=[192.168.3.107]) by web01.jbserver.net with esmtpsa (TLSv1.2:DHE-RSA-AES128-SHA:128) (Exim 4.83) (envelope-from ) id 1XFzaX-0002Qb-4R; Sat, 09 Aug 2014 07:53:09 +0200 Message-ID: <53E5B71D.2030500@gont.com.ar> Date: Sat, 09 Aug 2014 01:52:29 -0400 From: Fernando Gont User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0 MIME-Version: 1.0 To: Hiroki Sato Subject: Re: Routing IPv6 packets towards oneself with routing sockets? 
References: <53E2B586.3080700@gont.com.ar> <20140807.192403.845244220459089560.hrs@allbsd.org> <53E35DA7.4020800@gont.com.ar> <20140808.053757.1725805140861121363.hrs@allbsd.org> In-Reply-To: <20140808.053757.1725805140861121363.hrs@allbsd.org> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 09 Aug 2014 05:53:13 -0000 Hi, Hiroki, On 08/07/2014 04:37 PM, Hiroki Sato wrote: > > fe> However, whenever I lookup an entry for fc00:1::1 with routing sockets, > fe> the only entry I obtain is fc00:1::/64 (a network route) rather than > fe> fc00:1::1/128 (a host route). > > Does this mean you got RTA_DST with fc00:1::/64 when > "bsd-lookup-simple -v fc00:1::1"? If so, it is very strange. Nope, you're right. I get fc00:1::1... the only "problem" was that the outgoing interface was incorrect... but as you noted, that had to do with me not setting RTA_IFP. (FWIW, my guide for using routing sockets was Stevens' UNPv1... but IIRC he never mentioned that of setting RTA_IFP, but rather suggested that RTA_GATEWAY could return AF_INET or AF_LINK (in his discussion about IPv4, since there was no discussion about IPv6). > Although your code assumes RTA_GATEWAY eventually returns the > outgoing interface, it is not always true. RTA_IFP should be used if > you want to look up it instead of looking up gateways until AF_LINK > is obtained. Certainly RTA_GATEWAY returns AF_LINK and you can check > sdl_index in it, but the index number is not always the same as the > actual outgoing interface (one of the examples is a host route). Just curious: what's the meaning of the AF_LINK I was reading? > A revised source file is attached. 
Some nits are also fixed: 1) > SA_SIZE() on MacOSX is not aligned with sizeof(long) and 2) > IFACE_LENGTH should be IFNAMSIZ. Thanks so much! -- I'll incorporate these into the ipv6toolkit (that's the reason for which I was playing with this in the first place). Thanks again! Best regards, -- Fernando Gont e-mail: fernando@gont.com.ar || fgont@si6networks.com PGP Fingerprint: 7809 84F5 322E 45C7 F1C9 3945 96EE A9EF D076 FFF1 From owner-freebsd-net@FreeBSD.ORG Sat Aug 9 09:52:42 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 85550788 for ; Sat, 9 Aug 2014 09:52:42 +0000 (UTC) Received: from mx12.netapp.com (mx12.netapp.com [216.240.18.77]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (Client CN "mx12.netapp.com", Issuer "VeriSign Class 3 International Server CA - G3" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 5DCDA2716 for ; Sat, 9 Aug 2014 09:52:41 +0000 (UTC) X-IronPort-AV: E=Sophos;i="5.01,831,1400050800"; d="asc'?scan'208";a="180922884" Received: from vmwexceht04-prd.hq.netapp.com ([10.106.77.34]) by mx12-out.netapp.com with ESMTP; 09 Aug 2014 02:52:41 -0700 Received: from HIOEXCMBX06-PRD.hq.netapp.com (10.122.105.39) by vmwexceht04-prd.hq.netapp.com (10.106.77.34) with Microsoft SMTP Server (TLS) id 14.3.123.3; Sat, 9 Aug 2014 02:52:21 -0700 Received: from HIOEXCMBX07-PRD.hq.netapp.com (10.122.105.40) by hioexcmbx06-prd.hq.netapp.com (10.122.105.39) with Microsoft SMTP Server (TLS) id 15.0.913.22; Sat, 9 Aug 2014 02:52:08 -0700 Received: from HIOEXCMBX07-PRD.hq.netapp.com ([::1]) by hioexcmbx07-prd.hq.netapp.com ([fe80::f0de:b572:dd26:36b5%21]) with mapi id 15.00.0913.011; Sat, 9 Aug 2014 02:51:49 -0700 From: "Eggert, Lars" To: Niu Zhixiong Subject: Re: A problem on TCP in High RTT Environment. 
Thread-Topic: A problem on TCP in High RTT Environment. Thread-Index: AQHPswaXaqmGo9mF3Ua6+epBNevn4ZvIfnSA Date: Sat, 9 Aug 2014 09:51:49 +0000 Message-ID: References: In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: yes X-MS-TNEF-Correlator: x-mailer: Apple Mail (2.1878.6) x-originating-ip: [10.120.60.35] Content-Type: multipart/signed; boundary="Apple-Mail=_F3479225-B71E-45E7-9BAC-EAF4DAEF047F"; protocol="application/pgp-signature"; micalg=pgp-sha1 MIME-Version: 1.0 Cc: "freebsd-net@freebsd.org" , Bill Yuan X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 09 Aug 2014 09:52:42 -0000

--Apple-Mail=_F3479225-B71E-45E7-9BAC-EAF4DAEF047F
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=utf-8

At 400 ms RTT and 20 Mbps, you are probably receive-window limited. The bandwidth-delay product is 20 Mbps x 0.4 s = 1 MB, while your net.inet.tcp.recvspace is 65536 bytes with net.inet.tcp.recvbuf_auto set to 0, so a single connection tops out at roughly 65536 x 8 / 0.4 s = 1.3 Mbps. Bump net.inet.tcp.recvspace. (Your net.inet.sctp.recvspace is much larger, which probably explains the performance difference.)

On 2014-8-8, at 14:34, Niu Zhixiong wrote:

> Dear all,
>
> Last month, I send problems related to FTP/TCP in a high RTT environment.
> After that, I setup a simulation environment(Dummynet) to test TCP and SCTP
> in high delay environment. After finishing the test, I can see TCP is
> always slower than SCTP. But, I think it is not possible. (Plz see the
> figure in the attachment). When the delay is 200ms(means RTT=400ms).
> Besides, the TCP is extremely slow.
>=20 > ALL BW=3D20Mbps, DELAY=3D 0 ~ 200MS, Packet LOSS =3D 0 (by dummynet) >=20 > This is my parameters: > FreeBSD vfreetest0 10.0-RELEASE FreeBSD 10.0-RELEASE #0: Thu Aug 7 > 11:04:15 HKT 2014 >=20 > sysctl net.inet.tcp > net.inet.tcp.rfc1323: 1 > net.inet.tcp.mssdflt: 536 > net.inet.tcp.keepidle: 7200000 > net.inet.tcp.keepintvl: 75000 > net.inet.tcp.sendspace: 32768 > net.inet.tcp.recvspace: 65536 > net.inet.tcp.keepinit: 75000 > net.inet.tcp.delacktime: 100 > net.inet.tcp.v6mssdflt: 1220 > net.inet.tcp.cc.algorithm: newreno > net.inet.tcp.cc.available: newreno > net.inet.tcp.hostcache.cachelimit: 15360 > net.inet.tcp.hostcache.hashsize: 512 > net.inet.tcp.hostcache.bucketlimit: 30 > net.inet.tcp.hostcache.count: 0 > net.inet.tcp.hostcache.expire: 3600 > net.inet.tcp.hostcache.prune: 300 > net.inet.tcp.hostcache.purge: 0 > net.inet.tcp.log_in_vain: 0 > net.inet.tcp.blackhole: 0 > net.inet.tcp.delayed_ack: 1 > net.inet.tcp.drop_synfin: 0 > net.inet.tcp.rfc3042: 1 > net.inet.tcp.rfc3390: 1 > net.inet.tcp.experimental.initcwnd10: 1 > net.inet.tcp.rfc3465: 1 > net.inet.tcp.abc_l_var: 2 > net.inet.tcp.ecn.enable: 0 > net.inet.tcp.ecn.maxretries: 1 > net.inet.tcp.insecure_rst: 0 > net.inet.tcp.recvbuf_auto: 0 > net.inet.tcp.recvbuf_inc: 16384 > net.inet.tcp.recvbuf_max: 2097152 > net.inet.tcp.path_mtu_discovery: 1 > net.inet.tcp.tso: 1 > net.inet.tcp.sendbuf_auto: 0 > net.inet.tcp.sendbuf_inc: 8192 > net.inet.tcp.sendbuf_max: 2097152 > net.inet.tcp.reass.maxsegments: 15900 > net.inet.tcp.reass.cursegments: 0 > net.inet.tcp.reass.overflows: 0 > net.inet.tcp.sack.enable: 1 > net.inet.tcp.sack.maxholes: 128 > net.inet.tcp.sack.globalmaxholes: 65536 > net.inet.tcp.sack.globalholes: 0 > net.inet.tcp.minmss: 216 > net.inet.tcp.log_debug: 0 > net.inet.tcp.tcbhashsize: 32768 > net.inet.tcp.do_tcpdrain: 1 > net.inet.tcp.pcbcount: 4 > net.inet.tcp.icmp_may_rst: 1 > net.inet.tcp.isn_reseed_interval: 0 > net.inet.tcp.soreceive_stream: 0 > net.inet.tcp.syncookies: 1 > 
net.inet.tcp.syncookies_only: 0 > net.inet.tcp.syncache.bucketlimit: 30 > net.inet.tcp.syncache.cachelimit: 15375 > net.inet.tcp.syncache.count: 0 > net.inet.tcp.syncache.hashsize: 512 > net.inet.tcp.syncache.rexmtlimit: 3 > net.inet.tcp.syncache.rst_on_sock_fail: 1 > net.inet.tcp.msl: 30000 > net.inet.tcp.rexmit_min: 30 > net.inet.tcp.rexmit_slop: 200 > net.inet.tcp.always_keepalive: 1 > net.inet.tcp.fast_finwait2_recycle: 0 > net.inet.tcp.finwait2_timeout: 60000 > net.inet.tcp.keepcnt: 8 > net.inet.tcp.rexmit_drop_options: 0 > net.inet.tcp.per_cpu_timers: 0 > net.inet.tcp.timer_race: 0 > net.inet.tcp.maxtcptw: 26070 > net.inet.tcp.nolocaltimewait: 0 >=20 > kern.ipc.maxsockbuf: 4000000 > kern.ipc.sockbuf_waste_factor: 8 > kern.ipc.max_linkhdr: 16 > kern.ipc.max_protohdr: 60 > kern.ipc.max_hdr: 76 > kern.ipc.max_datalen: 92 > kern.ipc.maxmbufmem: 2073962496 > kern.ipc.nmbclusters: 253170 > kern.ipc.nmbjumbop: 126584 > kern.ipc.nmbjumbo9: 112518 > kern.ipc.nmbjumbo16: 84388 > kern.ipc.nmbufs: 1620285 > kern.ipc.maxpipekva: 66736128 > kern.ipc.pipekva: 16384 > kern.ipc.pipefragretry: 0 > kern.ipc.pipeallocfail: 0 > kern.ipc.piperesizefail: 0 > kern.ipc.piperesizeallowed: 1 > kern.ipc.msgmax: 16384 > kern.ipc.msgmni: 40 > kern.ipc.msgmnb: 2048 > kern.ipc.msgtql: 40 > kern.ipc.msgssz: 8 > kern.ipc.msgseg: 2048 > kern.ipc.semmni: 50 > kern.ipc.semmns: 340 > kern.ipc.semmnu: 150 > kern.ipc.semmsl: 340 > kern.ipc.semopm: 100 > kern.ipc.semume: 50 > kern.ipc.semusz: 632 > kern.ipc.semvmx: 32767 > kern.ipc.semaem: 16384 > kern.ipc.shmmax: 536870912 > kern.ipc.shmmin: 1 > kern.ipc.shmmni: 192 > kern.ipc.shmseg: 128 > kern.ipc.shmall: 131072 > kern.ipc.shm_use_phys: 0 > kern.ipc.shm_allow_removed: 0 > kern.ipc.soacceptqueue: 128 > kern.ipc.numopensockets: 14 > kern.ipc.maxsockets: 130350 > kern.ipc.sendfile.readahead: 1 >=20 > sysctl net.inet.sctp > net.inet.sctp.sendspace: 2097152 > net.inet.sctp.recvspace: 2097152 > net.inet.sctp.auto_asconf: 1 > net.inet.sctp.ecn_enable: 1 
> net.inet.sctp.strict_sacks: 1 > net.inet.sctp.peer_chkoh: 256 > net.inet.sctp.maxburst: 4 > net.inet.sctp.fr_maxburst: 4 > net.inet.sctp.maxchunks: 31646 > net.inet.sctp.tcbhashsize: 1024 > net.inet.sctp.pcbhashsize: 256 > net.inet.sctp.min_split_point: 2904 > net.inet.sctp.chunkscale: 10 > net.inet.sctp.delayed_sack_time: 200 > net.inet.sctp.sack_freq: 2 > net.inet.sctp.sys_resource: 1000 > net.inet.sctp.asoc_resource: 10 > net.inet.sctp.heartbeat_interval: 30000 > net.inet.sctp.pmtu_raise_time: 600 > net.inet.sctp.shutdown_guard_time: 180 > net.inet.sctp.secret_lifetime: 3600 > net.inet.sctp.rto_max: 60000 > net.inet.sctp.rto_min: 1000 > net.inet.sctp.rto_initial: 3000 > net.inet.sctp.init_rto_max: 60000 > net.inet.sctp.valid_cookie_life: 60000 > net.inet.sctp.init_rtx_max: 8 > net.inet.sctp.assoc_rtx_max: 10 > net.inet.sctp.path_rtx_max: 5 > net.inet.sctp.path_pf_threshold: 65535 > net.inet.sctp.add_more_on_output: 1452 > net.inet.sctp.incoming_streams: 2048 > net.inet.sctp.outgoing_streams: 10 > net.inet.sctp.cmt_on_off: 0 > net.inet.sctp.nr_sack_on_off: 0 > net.inet.sctp.cmt_use_dac: 0 > net.inet.sctp.cwnd_maxburst: 1 > net.inet.sctp.asconf_auth_nochk: 0 > net.inet.sctp.auth_disable: 0 > net.inet.sctp.nat_friendly: 1 > net.inet.sctp.abc_l_var: 2 > net.inet.sctp.max_chained_mbufs: 5 > net.inet.sctp.do_sctp_drain: 1 > net.inet.sctp.hb_max_burst: 4 > net.inet.sctp.abort_at_limit: 0 > net.inet.sctp.strict_data_order: 0 > net.inet.sctp.min_residual: 1452 > net.inet.sctp.max_retran_chunk: 30 > net.inet.sctp.log_level: 0 > net.inet.sctp.default_cc_module: 0 > net.inet.sctp.default_ss_module: 0 > net.inet.sctp.default_frag_interleave: 1 > net.inet.sctp.mobility_base: 0 > net.inet.sctp.mobility_fasthandoff: 0 > net.inet.sctp.udp_tunneling_port: 0 > net.inet.sctp.enable_sack_immediately: 0 > net.inet.sctp.nat_friendly_init: 0 > net.inet.sctp.vtag_time_wait: 60 > net.inet.sctp.buffer_splitting: 0 > net.inet.sctp.initial_cwnd: 3 > net.inet.sctp.rttvar_bw: 4 > 
net.inet.sctp.rttvar_rtt: 5 > net.inet.sctp.rttvar_eqret: 0 > net.inet.sctp.rttvar_steady_step: 20 > net.inet.sctp.use_dcccecn: 1 > net.inet.sctp.blackhole: 0 > net.inet.sctp.debug: 0 > > > Regards, > Niu Zhixiong > －－－－－－－－－－－－－－－ > kaiaixi@gmail.com > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" --Apple-Mail=_F3479225-B71E-45E7-9BAC-EAF4DAEF047F Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="signature.asc" Content-Type: application/pgp-signature; name="signature.asc" Content-Description: Message signed with OpenPGP using GPGMail -----BEGIN PGP SIGNATURE----- iQCVAwUBU+XvR9ZcnpRveo1xAQJQ6gQApApLZlns8p5k8CSBUNJJMK9P9nUih30J lQJU0YHFORsmmQHDa2TENg/uWoIkWpS8QEzigDq8NQsfpwzpW9KFS4iy7o1RMtx1 cWQHD7qX4l/DkSa/PVI/+bF5bSuadO8WDlrfDAKxwEUaxeN3msfJ8FN8ZFHJlW8V cQBFhZ2pDHQ= =QLRo -----END PGP SIGNATURE----- --Apple-Mail=_F3479225-B71E-45E7-9BAC-EAF4DAEF047F-- From owner-freebsd-net@FreeBSD.ORG Sat Aug 9 18:42:40 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 67F38FB8 for ; Sat, 9 Aug 2014 18:42:40 +0000 (UTC) Received: from h2.funkthat.com (gate2.funkthat.com [208.87.223.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client CN "funkthat.com", Issuer "funkthat.com" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 2AC822A4C for ; Sat, 9 Aug 2014 18:42:39 +0000 (UTC) Received: from h2.funkthat.com (localhost [127.0.0.1]) by h2.funkthat.com (8.14.3/8.14.3) with ESMTP id s79IgWnZ098900 (version=TLSv1/SSLv3 
cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 9 Aug 2014 11:42:33 -0700 (PDT) (envelope-from jmg@h2.funkthat.com) Received: (from jmg@localhost) by h2.funkthat.com (8.14.3/8.14.3/Submit) id s79IgWd0098899; Sat, 9 Aug 2014 11:42:32 -0700 (PDT) (envelope-from jmg) Date: Sat, 9 Aug 2014 11:42:32 -0700 From: John-Mark Gurney To: Niu Zhixiong Subject: Re: A problem on TCP in High RTT Environment. Message-ID: <20140809184232.GF83475@funkthat.com> Mail-Followup-To: Niu Zhixiong , freebsd-net@freebsd.org, Bill Yuan References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Operating-System: FreeBSD 7.2-RELEASE i386 X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html X-TipJar: bitcoin:13Qmb6AeTgQecazTWph4XasEsP7nGRbAPE X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger? X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.2 (h2.funkthat.com [127.0.0.1]); Sat, 09 Aug 2014 11:42:33 -0700 (PDT) Cc: freebsd-net@freebsd.org, Bill Yuan X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 09 Aug 2014 18:42:40 -0000 Niu Zhixiong wrote this message on Fri, Aug 08, 2014 at 20:34 +0800: > Dear all, > > Last month, I send problems related to FTP/TCP in a high RTT environment. > After that, I setup a simulation environment(Dummynet) to test TCP and SCTP > in high delay environment. After finishing the test, I can see TCP is > always slower than SCTP. But, I think it is not possible. (Plz see the > figure in the attachment). When the delay is 200ms(means RTT=400ms). > Besides, the TCP is extremely slow. 
> > ALL BW=20Mbps, DELAY= 0 ~ 200MS, Packet LOSS = 0 (by dummynet) > > This is my parameters: > FreeBSD vfreetest0 10.0-RELEASE FreeBSD 10.0-RELEASE #0: Thu Aug 7 > 11:04:15 HKT 2014 > > sysctl net.inet.tcp [...] > net.inet.tcp.recvbuf_auto: 0 [...] > net.inet.tcp.sendbuf_auto: 0 Try enabling this... This should allow the buffer to grow large enough to deal w/ the higher latency... Also, make sure your program isn't setting the recv buffer size as that will disable the auto growing... If you use netstat -a, you should be able to see the send-q on the sender grow as necessary... -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." From owner-freebsd-net@FreeBSD.ORG Sat Aug 9 19:49:09 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 58FB9CBD for ; Sat, 9 Aug 2014 19:49:09 +0000 (UTC) Received: from mail-n.franken.de (drew.ipv6.franken.de [IPv6:2001:638:a02:a001:20e:cff:fe4a:feaa]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client CN "mail-n.franken.de", Issuer "Thawte DV SSL CA" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id DBAF120BB for ; Sat, 9 Aug 2014 19:49:08 +0000 (UTC) Received: from [192.168.1.200] (p54819F65.dip0.t-ipconnect.de [84.129.159.101]) (Authenticated sender: macmic) by mail-n.franken.de (Postfix) with ESMTP id 962401C104DF2; Sat, 9 Aug 2014 21:49:04 +0200 (CEST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: A problem on TCP in High RTT Environment. 
From: Michael Tuexen In-Reply-To: Date: Sat, 9 Aug 2014 21:49:03 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: References: To: "Eggert, Lars" X-Mailer: Apple Mail (2.1878.6) Cc: "freebsd-net@freebsd.org" , Niu Zhixiong , Bill Yuan X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 09 Aug 2014 19:49:09 -0000 On 09 Aug 2014, at 11:51, Eggert, Lars wrote: > At 400ms @ 20Mbps, you are probably receive window limited. Bump net.inet.tcp.recvspace. (Your net.inet.sctp.recvspace is much larger, which probably explains the performance difference.) The program he is using is netperfmeter and I think it sets the send/recv buffer to 2MB... Best regards Michael > > On 2014-8-8, at 14:34, Niu Zhixiong wrote: > >> Dear all, >> >> Last month, I send problems related to FTP/TCP in a high RTT environment. >> After that, I setup a simulation environment(Dummynet) to test TCP and SCTP >> in high delay environment. After finishing the test, I can see TCP is >> always slower than SCTP. But, I think it is not possible. (Plz see the >> figure in the attachment). When the delay is 200ms(means RTT=400ms). >> Besides, the TCP is extremely slow. 
>> >> ALL BW=20Mbps, DELAY= 0 ~ 200MS, Packet LOSS = 0 (by dummynet) >> >> This is my parameters: >> FreeBSD vfreetest0 10.0-RELEASE FreeBSD 10.0-RELEASE #0: Thu Aug 7 >> 11:04:15 HKT 2014 >> >> sysctl net.inet.tcp >> net.inet.tcp.rfc1323: 1 >> net.inet.tcp.mssdflt: 536 >> net.inet.tcp.keepidle: 7200000 >> net.inet.tcp.keepintvl: 75000 >> net.inet.tcp.sendspace: 32768 >> net.inet.tcp.recvspace: 65536 >> net.inet.tcp.keepinit: 75000 >> net.inet.tcp.delacktime: 100 >> net.inet.tcp.v6mssdflt: 1220 >> net.inet.tcp.cc.algorithm: newreno >> net.inet.tcp.cc.available: newreno >> net.inet.tcp.hostcache.cachelimit: 15360 >> net.inet.tcp.hostcache.hashsize: 512 >> net.inet.tcp.hostcache.bucketlimit: 30 >> net.inet.tcp.hostcache.count: 0 >> net.inet.tcp.hostcache.expire: 3600 >> net.inet.tcp.hostcache.prune: 300 >> net.inet.tcp.hostcache.purge: 0 >> net.inet.tcp.log_in_vain: 0 >> net.inet.tcp.blackhole: 0 >> net.inet.tcp.delayed_ack: 1 >> net.inet.tcp.drop_synfin: 0 >> net.inet.tcp.rfc3042: 1 >> net.inet.tcp.rfc3390: 1 >> net.inet.tcp.experimental.initcwnd10: 1 >> net.inet.tcp.rfc3465: 1 >> net.inet.tcp.abc_l_var: 2 >> net.inet.tcp.ecn.enable: 0 >> net.inet.tcp.ecn.maxretries: 1 >> net.inet.tcp.insecure_rst: 0 >> net.inet.tcp.recvbuf_auto: 0 >> net.inet.tcp.recvbuf_inc: 16384 >> net.inet.tcp.recvbuf_max: 2097152 >> net.inet.tcp.path_mtu_discovery: 1 >> net.inet.tcp.tso: 1 >> net.inet.tcp.sendbuf_auto: 0 >> net.inet.tcp.sendbuf_inc: 8192 >> net.inet.tcp.sendbuf_max: 2097152 >> net.inet.tcp.reass.maxsegments: 15900 >> net.inet.tcp.reass.cursegments: 0 >> net.inet.tcp.reass.overflows: 0 >> net.inet.tcp.sack.enable: 1 >> net.inet.tcp.sack.maxholes: 128 >> net.inet.tcp.sack.globalmaxholes: 65536 >> net.inet.tcp.sack.globalholes: 0 >> net.inet.tcp.minmss: 216 >> net.inet.tcp.log_debug: 0 >> net.inet.tcp.tcbhashsize: 32768 >> net.inet.tcp.do_tcpdrain: 1 >> net.inet.tcp.pcbcount: 4 >> net.inet.tcp.icmp_may_rst: 1 >> net.inet.tcp.isn_reseed_interval: 0 >> 
net.inet.tcp.soreceive_stream: 0 >> net.inet.tcp.syncookies: 1 >> net.inet.tcp.syncookies_only: 0 >> net.inet.tcp.syncache.bucketlimit: 30 >> net.inet.tcp.syncache.cachelimit: 15375 >> net.inet.tcp.syncache.count: 0 >> net.inet.tcp.syncache.hashsize: 512 >> net.inet.tcp.syncache.rexmtlimit: 3 >> net.inet.tcp.syncache.rst_on_sock_fail: 1 >> net.inet.tcp.msl: 30000 >> net.inet.tcp.rexmit_min: 30 >> net.inet.tcp.rexmit_slop: 200 >> net.inet.tcp.always_keepalive: 1 >> net.inet.tcp.fast_finwait2_recycle: 0 >> net.inet.tcp.finwait2_timeout: 60000 >> net.inet.tcp.keepcnt: 8 >> net.inet.tcp.rexmit_drop_options: 0 >> net.inet.tcp.per_cpu_timers: 0 >> net.inet.tcp.timer_race: 0 >> net.inet.tcp.maxtcptw: 26070 >> net.inet.tcp.nolocaltimewait: 0 >> >> kern.ipc.maxsockbuf: 4000000 >> kern.ipc.sockbuf_waste_factor: 8 >> kern.ipc.max_linkhdr: 16 >> kern.ipc.max_protohdr: 60 >> kern.ipc.max_hdr: 76 >> kern.ipc.max_datalen: 92 >> kern.ipc.maxmbufmem: 2073962496 >> kern.ipc.nmbclusters: 253170 >> kern.ipc.nmbjumbop: 126584 >> kern.ipc.nmbjumbo9: 112518 >> kern.ipc.nmbjumbo16: 84388 >> kern.ipc.nmbufs: 1620285 >> kern.ipc.maxpipekva: 66736128 >> kern.ipc.pipekva: 16384 >> kern.ipc.pipefragretry: 0 >> kern.ipc.pipeallocfail: 0 >> kern.ipc.piperesizefail: 0 >> kern.ipc.piperesizeallowed: 1 >> kern.ipc.msgmax: 16384 >> kern.ipc.msgmni: 40 >> kern.ipc.msgmnb: 2048 >> kern.ipc.msgtql: 40 >> kern.ipc.msgssz: 8 >> kern.ipc.msgseg: 2048 >> kern.ipc.semmni: 50 >> kern.ipc.semmns: 340 >> kern.ipc.semmnu: 150 >> kern.ipc.semmsl: 340 >> kern.ipc.semopm: 100 >> kern.ipc.semume: 50 >> kern.ipc.semusz: 632 >> kern.ipc.semvmx: 32767 >> kern.ipc.semaem: 16384 >> kern.ipc.shmmax: 536870912 >> kern.ipc.shmmin: 1 >> kern.ipc.shmmni: 192 >> kern.ipc.shmseg: 128 >> kern.ipc.shmall: 131072 >> kern.ipc.shm_use_phys: 0 >> kern.ipc.shm_allow_removed: 0 >> kern.ipc.soacceptqueue: 128 >> kern.ipc.numopensockets: 14 >> kern.ipc.maxsockets: 130350 >> kern.ipc.sendfile.readahead: 1 >> >> sysctl net.inet.sctp 
>> net.inet.sctp.sendspace: 2097152 >> net.inet.sctp.recvspace: 2097152 >> net.inet.sctp.auto_asconf: 1 >> net.inet.sctp.ecn_enable: 1 >> net.inet.sctp.strict_sacks: 1 >> net.inet.sctp.peer_chkoh: 256 >> net.inet.sctp.maxburst: 4 >> net.inet.sctp.fr_maxburst: 4 >> net.inet.sctp.maxchunks: 31646 >> net.inet.sctp.tcbhashsize: 1024 >> net.inet.sctp.pcbhashsize: 256 >> net.inet.sctp.min_split_point: 2904 >> net.inet.sctp.chunkscale: 10 >> net.inet.sctp.delayed_sack_time: 200 >> net.inet.sctp.sack_freq: 2 >> net.inet.sctp.sys_resource: 1000 >> net.inet.sctp.asoc_resource: 10 >> net.inet.sctp.heartbeat_interval: 30000 >> net.inet.sctp.pmtu_raise_time: 600 >> net.inet.sctp.shutdown_guard_time: 180 >> net.inet.sctp.secret_lifetime: 3600 >> net.inet.sctp.rto_max: 60000 >> net.inet.sctp.rto_min: 1000 >> net.inet.sctp.rto_initial: 3000 >> net.inet.sctp.init_rto_max: 60000 >> net.inet.sctp.valid_cookie_life: 60000 >> net.inet.sctp.init_rtx_max: 8 >> net.inet.sctp.assoc_rtx_max: 10 >> net.inet.sctp.path_rtx_max: 5 >> net.inet.sctp.path_pf_threshold: 65535 >> net.inet.sctp.add_more_on_output: 1452 >> net.inet.sctp.incoming_streams: 2048 >> net.inet.sctp.outgoing_streams: 10 >> net.inet.sctp.cmt_on_off: 0 >> net.inet.sctp.nr_sack_on_off: 0 >> net.inet.sctp.cmt_use_dac: 0 >> net.inet.sctp.cwnd_maxburst: 1 >> net.inet.sctp.asconf_auth_nochk: 0 >> net.inet.sctp.auth_disable: 0 >> net.inet.sctp.nat_friendly: 1 >> net.inet.sctp.abc_l_var: 2 >> net.inet.sctp.max_chained_mbufs: 5 >> net.inet.sctp.do_sctp_drain: 1 >> net.inet.sctp.hb_max_burst: 4 >> net.inet.sctp.abort_at_limit: 0 >> net.inet.sctp.strict_data_order: 0 >> net.inet.sctp.min_residual: 1452 >> net.inet.sctp.max_retran_chunk: 30 >> net.inet.sctp.log_level: 0 >> net.inet.sctp.default_cc_module: 0 >> net.inet.sctp.default_ss_module: 0 >> net.inet.sctp.default_frag_interleave: 1 >> net.inet.sctp.mobility_base: 0 >> net.inet.sctp.mobility_fasthandoff: 0 >> net.inet.sctp.udp_tunneling_port: 0 >> 
net.inet.sctp.enable_sack_immediately: 0 >> net.inet.sctp.nat_friendly_init: 0 >> net.inet.sctp.vtag_time_wait: 60 >> net.inet.sctp.buffer_splitting: 0 >> net.inet.sctp.initial_cwnd: 3 >> net.inet.sctp.rttvar_bw: 4 >> net.inet.sctp.rttvar_rtt: 5 >> net.inet.sctp.rttvar_eqret: 0 >> net.inet.sctp.rttvar_steady_step: 20 >> net.inet.sctp.use_dcccecn: 1 >> net.inet.sctp.blackhole: 0 >> net.inet.sctp.debug: 0 >> >> >> Regards, >> Niu Zhixiong >> －－－－－－－－－－－－－－－ >> kaiaixi@gmail.com >> _______________________________________________ >> freebsd-net@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-net >> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Sat Aug 9 19:52:00 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 14E1AE44 for ; Sat, 9 Aug 2014 19:52:00 +0000 (UTC) Received: from mail-n.franken.de (drew.ipv6.franken.de [IPv6:2001:638:a02:a001:20e:cff:fe4a:feaa]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client CN "mail-n.franken.de", Issuer "Thawte DV SSL CA" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id CABA22156 for ; Sat, 9 Aug 2014 19:51:59 +0000 (UTC) Received: from [192.168.1.200] (p54819F65.dip0.t-ipconnect.de [84.129.159.101]) (Authenticated sender: macmic) by mail-n.franken.de (Postfix) with ESMTP id F2D241C10481A; Sat, 9 Aug 2014 21:51:56 +0200 (CEST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: A problem on TCP in High RTT Environment. 
From: Michael Tuexen In-Reply-To: <20140809184232.GF83475@funkthat.com> Date: Sat, 9 Aug 2014 21:51:55 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <8AE1AC56-D52F-4F13-AAA3-BB96042B37DD@lurchi.franken.de> References: <20140809184232.GF83475@funkthat.com> To: John-Mark Gurney X-Mailer: Apple Mail (2.1878.6) Cc: freebsd-net@freebsd.org, Niu Zhixiong , Bill Yuan X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 09 Aug 2014 19:52:00 -0000 On 09 Aug 2014, at 20:42, John-Mark Gurney wrote: > Niu Zhixiong wrote this message on Fri, Aug 08, 2014 at 20:34 +0800: >> Dear all, >> >> Last month, I send problems related to FTP/TCP in a high RTT environment. >> After that, I setup a simulation environment(Dummynet) to test TCP and SCTP >> in high delay environment. After finishing the test, I can see TCP is >> always slower than SCTP. But, I think it is not possible. (Plz see the >> figure in the attachment). When the delay is 200ms(means RTT=400ms). >> Besides, the TCP is extremely slow. >> >> ALL BW=20Mbps, DELAY= 0 ~ 200MS, Packet LOSS = 0 (by dummynet) >> >> This is my parameters: >> FreeBSD vfreetest0 10.0-RELEASE FreeBSD 10.0-RELEASE #0: Thu Aug 7 >> 11:04:15 HKT 2014 >> >> sysctl net.inet.tcp > > [...] > >> net.inet.tcp.recvbuf_auto: 0 > > [...] > >> net.inet.tcp.sendbuf_auto: 0 > > Try enabling this... This should allow the buffer to grow large enough > to deal w/ the higher latency... > > Also, make sure your program isn't setting the recv buffer size as that > will disable the auto growing... I think the program sets the buffer to 2MB, which it also does for SCTP. So having both statically at the same size makes sense for the comparison. 
I remember that there was a bug in the combination of LRO and delayed ACK, which was fixed, but I don't remember whether it was fixed before 10.0... Best regards Michael > > If you use netstat -a, you should be able to see the send-q on the > sender grow as necessary... > > -- > John-Mark Gurney Voice: +1 415 225 5579 > > "All that I will do, has been done, All that I have, has not." > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Sat Aug 9 20:22:08 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2D4B97D1 for ; Sat, 9 Aug 2014 20:22:08 +0000 (UTC) Received: from mail-la0-x22a.google.com (mail-la0-x22a.google.com [IPv6:2a00:1450:4010:c03::22a]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id A9C7A24C3 for ; Sat, 9 Aug 2014 20:22:07 +0000 (UTC) Received: by mail-la0-f42.google.com with SMTP id pv20so5685761lab.29 for ; Sat, 09 Aug 2014 13:22:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=oH0/t8WR+vNMNGhB8RwR0NYsdx/WUNpYm1t0sp9Ltik=; b=R+9py5J9sMZi88lEznbBVdm16L3oW3LZusl27KnYAlWxEvQGcg4ih9r52PSHSf9GZq sjb+P6nCz9iDV0Rw/DYQDRh4ZSbKTp/4GBLIOniWdZ4ZXgoh0KaDk6jUFqK5Y5mXmzU/ +lNhRnZ6BOzonGUSvd4yk54Sys/p2UgUJgf7IjFKvU27rn1q4Ltl0xVKA7/rgxTkZZxM TpqP8qekXge3URcO+Jh83AWY5kYWriexoeZ7D4mk1XTggRJ09UXAHuBliuBQsUZe4T7X xOqbupFaunnzfmZP1Vx0YatH8GQ81OprpJJlC+uVl3Zib/T0SwDmltwpaNu085TY+r3n F2MA== 
MIME-Version: 1.0 X-Received: by 10.152.8.82 with SMTP id p18mr10518laa.83.1407615725070; Sat, 09 Aug 2014 13:22:05 -0700 (PDT) Received: by 10.114.81.73 with HTTP; Sat, 9 Aug 2014 13:22:05 -0700 (PDT) In-Reply-To: <8AE1AC56-D52F-4F13-AAA3-BB96042B37DD@lurchi.franken.de> References: <20140809184232.GF83475@funkthat.com> <8AE1AC56-D52F-4F13-AAA3-BB96042B37DD@lurchi.franken.de> Date: Sat, 9 Aug 2014 13:22:05 -0700 Message-ID: Subject: Re: A problem on TCP in High RTT Environment. From: hiren panchasara To: Michael Tuexen Content-Type: text/plain; charset=UTF-8 Cc: "freebsd-net@freebsd.org" , John-Mark Gurney , Niu Zhixiong , Bill Yuan X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 09 Aug 2014 20:22:08 -0000 On Sat, Aug 9, 2014 at 12:51 PM, Michael Tuexen wrote: > > On 09 Aug 2014, at 20:42, John-Mark Gurney wrote: > >> Niu Zhixiong wrote this message on Fri, Aug 08, 2014 at 20:34 +0800: >>> Dear all, >>> >>> Last month, I send problems related to FTP/TCP in a high RTT environment. >>> After that, I setup a simulation environment(Dummynet) to test TCP and SCTP >>> in high delay environment. After finishing the test, I can see TCP is >>> always slower than SCTP. But, I think it is not possible. (Plz see the >>> figure in the attachment). When the delay is 200ms(means RTT=400ms). >>> Besides, the TCP is extremely slow. >>> >>> ALL BW=20Mbps, DELAY= 0 ~ 200MS, Packet LOSS = 0 (by dummynet) >>> >>> This is my parameters: >>> FreeBSD vfreetest0 10.0-RELEASE FreeBSD 10.0-RELEASE #0: Thu Aug 7 >>> 11:04:15 HKT 2014 >>> >>> sysctl net.inet.tcp >> >> [...] >> >>> net.inet.tcp.recvbuf_auto: 0 >> >> [...] >> >>> net.inet.tcp.sendbuf_auto: 0 >> >> Try enabling this... This should allow the buffer to grow large enough >> to deal w/ the higher latency... 
>> >> Also, make sure your program isn't setting the recv buffer size as that >> will disable the auto growing... > I think the program sets the buffer to 2MB, which it also does for SCTP. > So having both statically at the same size makes sense for the comparison. > I remember that there was a bug in the combination of LRO and delayed ACK, > which was fixed, but I don't remember it was fixed before 10.0... If you are thinking of r256920, I believe it did make it into 10.0R. cheers, Hiren From owner-freebsd-net@FreeBSD.ORG Sat Aug 9 20:45:02 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id AD48DE3 for ; Sat, 9 Aug 2014 20:45:02 +0000 (UTC) Received: from h2.funkthat.com (gate2.funkthat.com [208.87.223.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client CN "funkthat.com", Issuer "funkthat.com" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 88C90267A for ; Sat, 9 Aug 2014 20:45:01 +0000 (UTC) Received: from h2.funkthat.com (localhost [127.0.0.1]) by h2.funkthat.com (8.14.3/8.14.3) with ESMTP id s79Kj0sX000553 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 9 Aug 2014 13:45:01 -0700 (PDT) (envelope-from jmg@h2.funkthat.com) Received: (from jmg@localhost) by h2.funkthat.com (8.14.3/8.14.3/Submit) id s79Kj0Cg000552; Sat, 9 Aug 2014 13:45:00 -0700 (PDT) (envelope-from jmg) Date: Sat, 9 Aug 2014 13:45:00 -0700 From: John-Mark Gurney To: Michael Tuexen Subject: Re: A problem on TCP in High RTT Environment. 
Message-ID: <20140809204500.GG83475@funkthat.com> Mail-Followup-To: Michael Tuexen , freebsd-net@freebsd.org, Niu Zhixiong , Bill Yuan References: <20140809184232.GF83475@funkthat.com> <8AE1AC56-D52F-4F13-AAA3-BB96042B37DD@lurchi.franken.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <8AE1AC56-D52F-4F13-AAA3-BB96042B37DD@lurchi.franken.de> User-Agent: Mutt/1.4.2.3i X-Operating-System: FreeBSD 7.2-RELEASE i386 X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html X-TipJar: bitcoin:13Qmb6AeTgQecazTWph4XasEsP7nGRbAPE X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger? X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.2 (h2.funkthat.com [127.0.0.1]); Sat, 09 Aug 2014 13:45:01 -0700 (PDT) Cc: freebsd-net@freebsd.org, Niu Zhixiong , Bill Yuan X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 09 Aug 2014 20:45:02 -0000 Michael Tuexen wrote this message on Sat, Aug 09, 2014 at 21:51 +0200: > > On 09 Aug 2014, at 20:42, John-Mark Gurney wrote: > > > Niu Zhixiong wrote this message on Fri, Aug 08, 2014 at 20:34 +0800: > >> Dear all, > >> > >> Last month, I send problems related to FTP/TCP in a high RTT environment. > >> After that, I setup a simulation environment(Dummynet) to test TCP and SCTP > >> in high delay environment. After finishing the test, I can see TCP is > >> always slower than SCTP. But, I think it is not possible. (Plz see the > >> figure in the attachment). When the delay is 200ms(means RTT=400ms). > >> Besides, the TCP is extremely slow. 
> >> > >> ALL BW=20Mbps, DELAY= 0 ~ 200MS, Packet LOSS = 0 (by dummynet) > >> > >> This is my parameters: > >> FreeBSD vfreetest0 10.0-RELEASE FreeBSD 10.0-RELEASE #0: Thu Aug 7 > >> 11:04:15 HKT 2014 > >> > >> sysctl net.inet.tcp > > > > [...] > > > >> net.inet.tcp.recvbuf_auto: 0 > > > > [...] > > > >> net.inet.tcp.sendbuf_auto: 0 > > > > Try enabling this... This should allow the buffer to grow large enough > > to deal w/ the higher latency... > > > > Also, make sure your program isn't setting the recv buffer size as that > > will disable the auto growing... > I think the program sets the buffer to 2MB, which it also does for SCTP. > So having both statically at the same size makes sense for the comparison. > I remember that there was a bug in the combination of LRO and delayed ACK, > which was fixed, but I don't remember it was fixed before 10.0... Sounds like disabling LRO and TSO would be a useful test to see if that improves things... But hiren said that the fix made it, so... > > If you use netstat -a, you should be able to see the send-q on the > > sender grow as necessary... Also, getting the send-q output while it's running would let us know if the buffer is getting to 2MB or not... -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." 
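A rough way to check the window-limit theory raised in this thread is to compare the bandwidth-delay product of the test path with the buffer sizes that were posted. The sketch below is only that: it assumes the dummynet settings quoted above (20 Mbps, 200 ms each way, so 400 ms RTT), no loss, and that a connection can have at most one window of data in flight per round trip. The function names are made up for illustration, and the 2 MB figure is the buffer size netperfmeter is thought to set, not a measured value.

```python
# Back-of-the-envelope numbers for the 20 Mbps / 400 ms RTT test path.
# Integer arithmetic (bits per second, milliseconds) keeps the results exact.

def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_ms // 8000

def max_throughput_bps(window_bytes, bandwidth_bps, rtt_ms):
    """Upper bound on throughput: one window per RTT, capped at the link rate.

    rtt_ms must be greater than zero.
    """
    return min(bandwidth_bps, window_bytes * 8 * 1000 // rtt_ms)

LINK_BPS = 20_000_000   # dummynet bandwidth from the thread
RTT_MS = 400            # 200 ms delay in each direction

if __name__ == "__main__":
    print("BDP:", bdp_bytes(LINK_BPS, RTT_MS), "bytes")
    # 65536 is net.inet.tcp.recvspace from the posted sysctl output;
    # 2097152 is the assumed application buffer size (same as the SCTP default).
    for window in (65536, 2097152):
        print(window, "byte window ->",
              max_throughput_bps(window, LINK_BPS, RTT_MS), "bit/s at most")
```

With these inputs the path needs about 1 MB in flight. A 64 KB window would cap TCP at roughly 1.3 Mbps, while a 2 MB buffer is not the bottleneck on paper, so if the transfer is still slow with 2 MB buffers the cause is probably elsewhere.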
From owner-freebsd-net@FreeBSD.ORG Sat Aug 9 20:57:33 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 406664B3 for ; Sat, 9 Aug 2014 20:57:33 +0000 (UTC) Received: from mail-n.franken.de (drew.ipv6.franken.de [IPv6:2001:638:a02:a001:20e:cff:fe4a:feaa]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client CN "mail-n.franken.de", Issuer "Thawte DV SSL CA" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 002A9278D for ; Sat, 9 Aug 2014 20:57:32 +0000 (UTC) Received: from [192.168.1.200] (p54819F65.dip0.t-ipconnect.de [84.129.159.101]) (Authenticated sender: macmic) by mail-n.franken.de (Postfix) with ESMTP id E77411C10481A; Sat, 9 Aug 2014 22:57:29 +0200 (CEST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: A problem on TCP in High RTT Environment. 
From: Michael Tuexen In-Reply-To: Date: Sat, 9 Aug 2014 22:57:29 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <7D6601F0-268C-4615-8243-9020499D68B0@lurchi.franken.de> References: <20140809184232.GF83475@funkthat.com> <8AE1AC56-D52F-4F13-AAA3-BB96042B37DD@lurchi.franken.de> To: hiren panchasara X-Mailer: Apple Mail (2.1878.6) Cc: "freebsd-net@freebsd.org" , John-Mark Gurney , Niu Zhixiong , Bill Yuan X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 09 Aug 2014 20:57:33 -0000 On 09 Aug 2014, at 22:22, hiren panchasara wrote: > On Sat, Aug 9, 2014 at 12:51 PM, Michael Tuexen > wrote: >> >> On 09 Aug 2014, at 20:42, John-Mark Gurney wrote: >> >>> Niu Zhixiong wrote this message on Fri, Aug 08, 2014 at 20:34 +0800: >>>> Dear all, >>>> >>>> Last month, I send problems related to FTP/TCP in a high RTT environment. >>>> After that, I setup a simulation environment(Dummynet) to test TCP and SCTP >>>> in high delay environment. After finishing the test, I can see TCP is >>>> always slower than SCTP. But, I think it is not possible. (Plz see the >>>> figure in the attachment). When the delay is 200ms(means RTT=400ms). >>>> Besides, the TCP is extremely slow. >>>> >>>> ALL BW=20Mbps, DELAY= 0 ~ 200MS, Packet LOSS = 0 (by dummynet) >>>> >>>> This is my parameters: >>>> FreeBSD vfreetest0 10.0-RELEASE FreeBSD 10.0-RELEASE #0: Thu Aug 7 >>>> 11:04:15 HKT 2014 >>>> >>>> sysctl net.inet.tcp >>> >>> [...] >>> >>>> net.inet.tcp.recvbuf_auto: 0 >>> >>> [...] >>> >>>> net.inet.tcp.sendbuf_auto: 0 >>> >>> Try enabling this... This should allow the buffer to grow large enough >>> to deal w/ the higher latency... >>> >>> Also, make sure your program isn't setting the recv buffer size as that >>> will disable the auto growing... 
>> I think the program sets the buffer to 2MB, which it also does for SCTP. >> So having both statically at the same size makes sense for the comparison. >> I remember that there was a bug in the combination of LRO and delayed ACK, >> which was fixed, but I don't remember it was fixed before 10.0... > > If you are thinking of r256920, I believe it did make it into 10.0R. Yepp, that is what I was thinking of... Best regards Michael > > cheers, > Hiren > From owner-freebsd-net@FreeBSD.ORG Sat Aug 9 20:58:28 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 32307548 for ; Sat, 9 Aug 2014 20:58:28 +0000 (UTC) Received: from mail-n.franken.de (drew.ipv6.franken.de [IPv6:2001:638:a02:a001:20e:cff:fe4a:feaa]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client CN "mail-n.franken.de", Issuer "Thawte DV SSL CA" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id E79D1279D for ; Sat, 9 Aug 2014 20:58:27 +0000 (UTC) Received: from [192.168.1.200] (p54819F65.dip0.t-ipconnect.de [84.129.159.101]) (Authenticated sender: macmic) by mail-n.franken.de (Postfix) with ESMTP id 25B041C10481A; Sat, 9 Aug 2014 22:58:25 +0200 (CEST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: A problem on TCP in High RTT Environment. 
From: Michael Tuexen
In-Reply-To: <20140809204500.GG83475@funkthat.com>
Date: Sat, 9 Aug 2014 22:58:25 +0200
Content-Transfer-Encoding: quoted-printable
Message-Id: <3F6BC212-4223-4AAC-8668-A27075DC55C2@lurchi.franken.de>
References: <20140809184232.GF83475@funkthat.com> <8AE1AC56-D52F-4F13-AAA3-BB96042B37DD@lurchi.franken.de> <20140809204500.GG83475@funkthat.com>
To: John-Mark Gurney
X-Mailer: Apple Mail (2.1878.6)
Cc: freebsd-net@freebsd.org, Niu Zhixiong , Bill Yuan
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 09 Aug 2014 20:58:28 -0000

On 09 Aug 2014, at 22:45, John-Mark Gurney wrote:

> Michael Tuexen wrote this message on Sat, Aug 09, 2014 at 21:51 +0200:
>>
>> On 09 Aug 2014, at 20:42, John-Mark Gurney wrote:
>>
>>> Niu Zhixiong wrote this message on Fri, Aug 08, 2014 at 20:34 +0800:
>>>> Dear all,
>>>>
>>>> Last month, I send problems related to FTP/TCP in a high RTT environment.
>>>> After that, I setup a simulation environment(Dummynet) to test TCP and SCTP
>>>> in high delay environment. After finishing the test, I can see TCP is
>>>> always slower than SCTP. But, I think it is not possible. (Plz see the
>>>> figure in the attachment). When the delay is 200ms(means RTT=400ms).
>>>> Besides, the TCP is extremely slow.
>>>>
>>>> ALL BW=20Mbps, DELAY= 0 ~ 200MS, Packet LOSS = 0 (by dummynet)
>>>>
>>>> This is my parameters:
>>>> FreeBSD vfreetest0 10.0-RELEASE FreeBSD 10.0-RELEASE #0: Thu Aug 7
>>>> 11:04:15 HKT 2014
>>>>
>>>> sysctl net.inet.tcp
>>>
>>> [...]
>>>
>>>> net.inet.tcp.recvbuf_auto: 0
>>>
>>> [...]
>>>
>>>> net.inet.tcp.sendbuf_auto: 0
>>>
>>> Try enabling this... This should allow the buffer to grow large enough
>>> to deal w/ the higher latency...
>>>
>>> Also, make sure your program isn't setting the recv buffer size as that
>>> will disable the auto growing...
>> I think the program sets the buffer to 2MB, which it also does for SCTP.
>> So having both statically at the same size makes sense for the comparison.
>> I remember that there was a bug in the combination of LRO and delayed ACK,
>> which was fixed, but I don't remember it was fixed before 10.0...
>
> Sounds like disabling LRO and TSO would be a useful test to see if that
> improves things... But hiren said that the fix made it, so...
>
>>> If you use netstat -a, you should be able to see the send-q on the
>>> sender grow as necessary...
>
> Also, getting the send-q output while it's running would let us know
> if the buffer is getting to 2MB or not...
That is correct. Niu: Can you provide this?

Best regards
Michael
>
> --
> John-Mark Gurney Voice: +1 415 225 5579
>
> "All that I will do, has been done, All that I have, has not."
>
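
For reference, the buffer sizes discussed in this thread can be sanity-checked against the bandwidth-delay product of the quoted test setup (BW=20Mbps, RTT=400ms). What follows is only a rough sketch: the function names bdp_bytes and max_throughput_bits_per_s are hypothetical helpers, not part of any FreeBSD tool, and the arithmetic ignores header overhead, congestion control, and the LRO/delayed-ACK interaction mentioned above.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product in bytes: data that must be in flight to fill the path."""
    return bandwidth_bits_per_s * rtt_s / 8.0


def max_throughput_bits_per_s(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput when at most window_bytes may be unacknowledged."""
    return window_bytes * 8.0 / rtt_s


if __name__ == "__main__":
    bw = 20e6                # 20 Mbps, from the quoted dummynet parameters
    rtt = 0.400              # 400 ms RTT (200 ms delay in each direction)
    buf = 2 * 1024 * 1024    # 2 MB static socket buffer mentioned in the thread

    need = bdp_bytes(bw, rtt)
    ceiling = max_throughput_bits_per_s(buf, rtt)
    print(f"BDP: {need:.0f} bytes")
    print(f"2 MB buffer covers the BDP: {buf >= need}")
    print(f"Window-limited ceiling: {ceiling / 1e6:.1f} Mbps")
```

Under these assumptions the product comes to roughly 1 MB, so a 2 MB buffer should not be the limiting factor on its own, provided the buffer really reaches that size. That is why the send-q output requested above is the useful next data point.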