From owner-freebsd-net@FreeBSD.ORG Sun Jan 26 01:40:45 2014 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 53913559 for ; Sun, 26 Jan 2014 01:40:45 +0000 (UTC) Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [119.145.14.65]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 5B0D11A2F for ; Sun, 26 Jan 2014 01:40:43 +0000 (UTC) Received: from 172.24.2.119 (EHLO szxeml207-edg.china.huawei.com) ([172.24.2.119]) by szxrg02-dlp.huawei.com (MOS 4.3.7-GA FastPath queued) with ESMTP id BPD48434; Sun, 26 Jan 2014 09:40:23 +0800 (CST) Received: from SZXEML410-HUB.china.huawei.com (10.82.67.137) by szxeml207-edg.china.huawei.com (172.24.2.56) with Microsoft SMTP Server (TLS) id 14.3.158.1; Sun, 26 Jan 2014 09:40:15 +0800 Received: from [127.0.0.1] (10.177.18.75) by szxeml410-hub.china.huawei.com (10.82.67.137) with Microsoft SMTP Server id 14.3.158.1; Sun, 26 Jan 2014 09:40:01 +0800 Message-ID: <52E46770.2000000@huawei.com> Date: Sun, 26 Jan 2014 09:40:00 +0800 From: Wang Weidong User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Vincenzo Maffione Subject: Re: netmap: I got some troubles with netmap References: <52D74E15.1040909@huawei.com> <92C7725B-B30A-4A19-925A-A93A2489A525@iet.unipi.it> <52D8A5E1.9020408@huawei.com> <52DD1914.7090506@iet.unipi.it> <52E1E272.8060009@huawei.com> In-Reply-To: Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 8bit X-Originating-IP: [10.177.18.75] X-CFilter-Loop: Reflected Cc: =?ISO-8859-1?Q?facolt=E0?= , Giuseppe Lettieri , Luigi Rizzo , net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Jan 2014 01:40:45 -0000 On 2014/1/24 22:56, Vincenzo Maffione wrote: > > > > 2014/1/24 Wang Weidong > > [...] > > Hello, > [...] > > You are using the old/deprecated QEMU command line syntax (-net), and therefore honestly It's not clear to me what kind of network configuration you are running. > Here, I use the default configuration which provided by the QEMU. > Please use our scripts "launch-qemu.sh", "prep-taps.sh", according to what described in the README.images file (attached). > Alternatively, use the syntax like in the following examples > > (#1) qemu-system-x86_64 archdisk.qcow -enable-kvm -device virtio-net-pci,netdev=mynet -netdev tap,ifname=tap01,id=mynet,script=no,downscript=no -smp 2 > (#2) qemu-system-x86_64 archdisk.qcow -enable-kvm -device e1000,mitigation=off,mac=00:AA:BB:CC:DD:01,netdev=mynet -netdev netmap,ifname=vale0:01,id=mynet -smp 2 > I will use them, thanks. > so that it's clear to us what network frontend (e.g. emulated NIC) and network backend (e.g. netmap, tap, vde, ecc..) you are using. > In example #1 we are using virtio-net as frontend and tap as backend, while in example #2 we are using e1000 as frontend and netmap as backend. > Also consider giving more than one core (e.g. -smp 2) to each guest, to mitigate receiver livelock problems. > > > > 2. I use the vale below: > qemu-system-x86_64 -m 2048 -boot c -net nic -net netmap,vale0:0 -hda /home/wwd/tinycores/20131019-tinycore-netmap.hdd -enable-kvm -vnc :0 > > Same for here, it's not clear what you are using. I guess each guest has an e1000 device and is connected to a different port of the same vale switch (e.g. vale0:0 and vale0:1)? > > Test with 2 vms from the same host > vale0 without device. > I use the pkt-gen, the speed is 938 Kpps > > > You should get ~4Mpps with e1000 frontend + netmap backend on a reasonably good machine. 
Make sure you have ./configure'd QEMU with --enable-e1000-paravirt. > > > I use netperf -H 10.0.0.2 -t UDP_STREAM, I got the speed is 195M/195M, then add -- -m 8, I only got 1.07M/1.07M. > When using a smaller msg size, will the speed be smaller? > > > If you use e1000 with netperf (without pkt-gen) your performance is doomed to be horrible. Use e1000-paravirt (as a frontend) instead if you are interested in netperf experiments. > Also consider that the point of using the "-- -m 8" option is to experiment with high packet rates, so what you should measure here is not the throughput in Mbps, but the packet rate: netperf reports the number of packets sent and received, so you can obtain the packet rate by dividing by the running time. > The throughput in Mbps is uninteresting; if you want high bulk throughput you just don't use "-- -m 8", but leave the defaults. > Using virtio-net in this case will help because of the TSO offloadings. > > cheers > Vincenzo > Hi Vincenzo, Nice, I will retest them. Thanks, Wang > > > with vale-ctl -a vale0:eth2, > use pkt-gen, the speed is 928 Kpps > I use netperf -H 10.0.0.2 -t UDP_STREAM, I got the speed is 209M/208M, then add -- -m 8, I only got 1.06M/1.06M. > > with vale-ctl -h vale0:eth2, > use pkt-gen, the speed is 928 Kpps > I use netperf -H 10.0.0.2 -t UDP_STREAM, I got the speed is 192M/192M, then add -- -m 8, I only got 1.06M/1.06M. > > Test with 2 vms from two hosts, > I can only test it by vale-ctl -h vale0:eth2 and setting eth2 into promisc > use pkt-gen with the default params, the speed is about 750 Kpps > use netperf -H 10.0.0.2 -t UDP_STREAM, I got the speed is 160M/160M > Is this right? > > > 3. I can't use the l2 utils. > When I run "sudo l2open -t eth0 l2recv[l2send]", I got "l2open ioctl(TUNSETIFF...): Invalid argument" > and with "l2open -r eth0 l2recv", after waiting a moment (only several seconds), I got the result: > TEST-RESULT: 0.901 kpps 1pkts > select/read=100.00 err=0 > > Also, I can't find the l2 utils on the net.
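Vincenzo's point above — for small messages, derive the packet rate from netperf's message counts rather than reading the Mbit/s figure — works out like this (all numbers below are hypothetical placeholders, not results from this thread):

```python
# Derive packet rate from netperf UDP_STREAM results: netperf reports
# messages sent/received and the elapsed time, and dividing the two
# gives the packet rate, which is the interesting metric for tiny
# (-- -m 8) messages.  All values below are made-up placeholders.
msg_size_bytes = 8          # netperf option: -- -m 8
elapsed_s = 10.0            # test duration in seconds
msgs_received = 12_000_000  # hypothetical "Messages Okay" count

pps = msgs_received / elapsed_s
throughput_mbps = pps * msg_size_bytes * 8 / 1e6

print(f"{pps/1e6:.1f} Mpps, {throughput_mbps:.2f} Mbit/s")
# prints: 1.2 Mpps, 76.80 Mbit/s -- a healthy packet rate can look
# like a tiny Mbit/s figure, which is why Mbps is misleading here.
```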
Is it implemented by your team? > > All of them is tested on vms. > > Cheers. > Wang > > > > > > Cheers, > > Giuseppe > > > > Il 17/01/2014 04:39, Wang Weidong ha scritto: > >> On 2014/1/16 18:24, facoltà wrote: > [...] > >> > >> > > > > > > > > > > -- > Vincenzo Maffione From owner-freebsd-net@FreeBSD.ORG Sun Jan 26 01:56:12 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 472DF794; Sun, 26 Jan 2014 01:56:12 +0000 (UTC) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id ED2E41ADF; Sun, 26 Jan 2014 01:56:11 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: X-IronPort-AV: E=Sophos;i="4.95,721,1384318800"; d="scan'208";a="91061178" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 25 Jan 2014 20:55:47 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id B306CB4051; Sat, 25 Jan 2014 20:55:47 -0500 (EST) Date: Sat, 25 Jan 2014 20:55:47 -0500 (EST) From: Rick Macklem To: J David Message-ID: <278396201.16318356.1390701347722.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Jan 2014 01:56:12 -0000 J David wrote: > On Fri, Jan 24, 2014 at 7:10 PM, Rick Macklem > wrote: > > I would like to hear if you find Linux doing read before write when > > you use "-r 2k", since I think that is writing less than a page. > > It doesn't. As I reported in the original test, I used an 8k > rsize/wsize and a 4k write size on the Linux test and no > read-before-write was observed. And just now I did as you asked, a > 2k > test with Linux mounting with 32k rsize/wsize. No extra reads, > excellent performance. FreeBSD, with the same mount options, does > reads even on the appends in this case and can't. > Well, when I get home in April, I'll try the fairly recent Linux client I have at home and see what it does. Not sure what trick they could use to avoid the read before write for partial pages. (I suppose I can look at their sources, but that could be pretty scary;-) If I understand the 15year old commit message, the main problem with not doing the read before write for a partial buffer is that mmap()'d file access will look at entire pages and potentially gets garbage if the entire page isn't valid. At this time, there is a single B_CACHE flag to indicate the buffer cache entry has been filled in. I think it would be possible to add a bitmap that marks which pages are actually allocated to the buffer cache entry, but I suspect the coding would be non-trivial. 
This would help for the case of page size writes on page boundaries, but would require the pages to be read in before write when the writes are not of page size on page boundaries. Well, one application I do have some experience with is software builds and the "ld" stage tends to write lots of chunks of odd sizes at any byte offset. (When I did testing of some code that extended the single dirty byte range to a list of dirty byte ranges, I discovered that "ld" often generates 100+ of these odd sized non-contiguous writes before resulting in a completely written block. I recently added a mount option called "noncontigwr" that would allow the single dirty byte range to cover these non-contiguous writes.) Bottom line, if the pages were read in individually, the "ld" case would result in several (up to 16 for 4K in a 64K buffer) small reads against the server, which isn't nearly as efficient as one larger 64K read. As mentioned above, I don't know how Linux would avoid the read before write for partial blocks/pages being written. rick

>                                              random  random
>              KB  reclen   write  rewrite  read  reread  read  write
> Linux   1048576       2  281082   358672  125687  121964
> FreeBSD 1048576       2   59042    22624   10304    1933
>
> For comparison, here's the same test with 32k reclen (again, both Linux and FreeBSD using 32k rsize/wsize):
>
>                                              random  random
>              KB  reclen   write  rewrite  read  reread  read  write
> Linux   1048576      32  319387   373021  411106  364393
> FreeBSD 1048576      32   74892    73703   34889   66350
>
> Unfortunately it sounds like this state of affairs isn't really going to improve, at least in the near future. If there was one area where I never thought Linux would surpass us, it was NFS. :(
>
> Thanks!
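Rick's per-page bitmap idea above can be sketched with a toy model. This only illustrates the bookkeeping, not FreeBSD's buf(9) structures; the class and method names are invented:

```python
PAGE = 4096          # page size
NPAGES = 16          # pages in a 64K buffer cache block

class Buf:
    """Toy buffer-cache entry with a per-page valid bitmap instead of
    the single all-or-nothing B_CACHE flag.  write() returns the pages
    that would have to be read from the server first; the caller is
    assumed to fetch those before writing, after which each touched
    page is fully populated and can be marked valid."""
    def __init__(self):
        self.valid = 0   # bit i set => page i holds good data

    def write(self, off, length):
        first = off // PAGE
        last = (off + length - 1) // PAGE
        need_read = []
        # Only partially covered edge pages that are not already
        # valid force a read-before-write.
        if off % PAGE and not self.valid >> first & 1:
            need_read.append(first)
        if (off + length) % PAGE and not self.valid >> last & 1 \
                and last not in need_read:
            need_read.append(last)
        for p in range(first, last + 1):
            self.valid |= 1 << p
        return need_read

b = Buf()
print(b.write(0, PAGE))    # page-aligned, page-sized write: no read -> []
print(b.write(5000, 100))  # odd-sized "ld"-style write -> [1]
print(b.write(5100, 8))    # page 1 already valid: no read -> []
```

With the single B_CACHE flag, any write into a not-fully-valid buffer triggers a read of the whole buffer; in this sketch only the first touch of a partially written page does.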
> From owner-freebsd-net@FreeBSD.ORG Sun Jan 26 02:25:33 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4A6C6DD8; Sun, 26 Jan 2014 02:25:33 +0000 (UTC) Received: from hergotha.csail.mit.edu (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id DCAB41DB2; Sun, 26 Jan 2014 02:25:32 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.7/8.14.7) with ESMTP id s0Q2PU2n045130; Sat, 25 Jan 2014 21:25:30 -0500 (EST) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.7/8.14.4/Submit) id s0Q2PUp1045129; Sat, 25 Jan 2014 21:25:30 -0500 (EST) (envelope-from wollman) Date: Sat, 25 Jan 2014 21:25:30 -0500 (EST) Message-Id: <201401260225.s0Q2PUp1045129@hergotha.csail.mit.edu> From: wollman@freebsd.org To: rmacklem@uoguelph.ca Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
X-Newsgroups: mit.lcs.mail.freebsd-net In-Reply-To: <278396201.16318356.1390701347722.JavaMail.root@uoguelph.ca> References: Organization: none X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (hergotha.csail.mit.edu [127.0.0.1]); Sat, 25 Jan 2014 21:25:30 -0500 (EST) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on hergotha.csail.mit.edu Cc: freebsd-fs@freebsd.org, freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Jan 2014 02:25:33 -0000 In article <278396201.16318356.1390701347722.JavaMail.root@uoguelph.ca>, Rick Macklem writes: >Well, when I get home in April, I'll try the fairly recent Linux client >I have at home and see what it does. Not sure what trick they could use >to avoid the read before write for partial pages. (I suppose I can >look at their sources, but that could be pretty scary;-) For what it's worth, our performance for large-block 100%-read workloads is also not what it could (or ought to) be. Between two 20G-attached servers, I can get about 12 Gbit/s with three parallel TCP connections. (Multiple connections are required to trick the lagg hash into balancing the load across both 10G links, because the hash function used for load-balancing uses the source and destination ports.) On the same pair of servers, "dd if=/mnt/test bs=1024k" runs at about 3 Gbit/s, whereas reading from the local filesystem goes anywhere from 1.5 to 3 G*byte*/s (i.e., eight times faster) with much higher CPU utilization. Luckily, most of our users are only connected at 1G so they don't notice. I'm going to lose my test server soon (it has to go into production shortly), so I'm not really able to work on this. 
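The lagg balancing constraint Garrett describes — one TCP connection always lands on one link — can be illustrated with a toy port-based hash. lagg(4)'s real hash function is different; the xor below is purely illustrative:

```python
NLINKS = 2   # two 10G links in the lagg

def pick_link(sport, dport):
    # Toy stand-in for a hash over the port pair: the same flow
    # always yields the same link, so a single connection can never
    # use more than one link's worth of bandwidth.
    return (sport ^ dport) % NLINKS

# Every packet of one flow takes the same link...
assert len({pick_link(50001, 2049) for _ in range(1000)}) == 1

# ...so several connections (different source ports) are needed
# before both links carry traffic.
print({pick_link(sp, 2049) for sp in (50001, 50002, 50003)})
```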
I'll have another test server soon (old hardware being replaced by the new server) and hope to be able to try out the new code that's going to be in 10.1, with the expectation of upgrading to 10.x over summer break. -GAWollman From owner-freebsd-net@FreeBSD.ORG Sun Jan 26 02:36:33 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 64AC9257; Sun, 26 Jan 2014 02:36:33 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 03BF11E4D; Sun, 26 Jan 2014 02:36:32 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ag8UACJ05FKDaFve/2dsb2JhbABagXICAYFPVoJ9tVKDL0+BH3SCJQEBAQMBAQEBICsgCwUWGAICDRkCKQEJJg4HBAEcBIdcCA2rYJwvF4EpjRMBAQ0ONAcWglmBSQSJSIwMhAWQbINLHjF9Bxci X-IronPort-AV: E=Sophos;i="4.95,721,1384318800"; d="scan'208";a="90488526" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 25 Jan 2014 21:36:25 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 026E5B4042; Sat, 25 Jan 2014 21:36:26 -0500 (EST) Date: Sat, 25 Jan 2014 21:36:26 -0500 (EST) From: Rick Macklem To: wollman@freebsd.org Message-ID: <188195924.16327973.1390703786000.JavaMail.root@uoguelph.ca> In-Reply-To: <201401260225.s0Q2PUp1045129@hergotha.csail.mit.edu> Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-fs@freebsd.org, freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Jan 2014 02:36:33 -0000 Garrett Wollman wrote: > In article > <278396201.16318356.1390701347722.JavaMail.root@uoguelph.ca>, Rick > Macklem writes: > > >Well, when I get home in April, I'll try the fairly recent Linux > >client > >I have at home and see what it does. Not sure what trick they could > >use > >to avoid the read before write for partial pages. (I suppose I can > >look at their sources, but that could be pretty scary;-) > > For what it's worth, our performance for large-block 100%-read > workloads is also not what it could (or ought to) be. Between two > 20G-attached servers, I can get about 12 Gbit/s with three parallel > TCP connections. (Multiple connections are required to trick the > lagg > hash into balancing the load across both 10G links, because the hash > function used for load-balancing uses the source and destination > ports.) On the same pair of servers, "dd if=/mnt/test bs=1024k" runs > at about 3 Gbit/s, whereas reading from the local filesystem goes > anywhere from 1.5 to 3 G*byte*/s (i.e., eight times faster) with much > higher CPU utilization. Luckily, most of our users are only > connected > at 1G so they don't notice. > Have you tried increasing readahead by any chance? I think the default is 1, which means the client will make 2 read requests and then wait for those replies before doing any more reads. Since you have fast links, maybe the 2 * 64K reads isn't enough to keep the pipe filled? (This depends on latency, which you didn't mention.) 
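Rick's "keep the pipe filled" reasoning can be made concrete with a bandwidth-delay product estimate. The RTT below is an invented figure for illustration, since Garrett didn't report latency:

```python
import math

link_bps = 10e9        # one 10G link of the lagg
rtt_s = 0.5e-3         # hypothetical request-to-reply latency
rsize = 64 * 1024      # default NFS read size in bytes

# Outstanding read data must cover the bandwidth-delay product,
# or the link sits idle between replies.
bdp_bytes = link_bps / 8 * rtt_s
outstanding = math.ceil(bdp_bytes / rsize)
print(outstanding)     # 64K reads that must be in flight at once

# readahead=1 keeps only 2 reads outstanding -- far short of this,
# which is consistent with readahead=4 nearly doubling the speed.
```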
Might be worth trying, rick ps: If/when you have a test server, you could also try compiling a kernel with MAXBSIZE set to 128Kbytes instead of 64Kbytes. You'll need to boot this kernel on both the server and client (assuming a FreeBSD client) before the default rsize will increase to 128Kbytes. I'm no ZFS guy, but I understand 128Kbytes is the blocksize it likes. > I'm going to lose my test server soon (it has to go into production > shortly), so I'm not really able to work on this. I'll have another > test server soon (old hardware being replaced by the new server) and > hope to be able to try out the new code that's going to be in 10.1, > with the expectation of upgrading to 10.x over summer break. > > -GAWollman > > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Sun Jan 26 03:13:17 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 8AD51B4D; Sun, 26 Jan 2014 03:13:17 +0000 (UTC) Received: from hergotha.csail.mit.edu (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 452041188; Sun, 26 Jan 2014 03:13:17 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.7/8.14.7) with ESMTP id s0Q3DFEw045684; Sat, 25 Jan 2014 22:13:15 -0500 (EST) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.7/8.14.4/Submit) id s0Q3DFYt045681; Sat, 25 Jan 2014 22:13:15 -0500 (EST) (envelope-from 
wollman) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <21220.32074.958702.595502@hergotha.csail.mit.edu> Date: Sat, 25 Jan 2014 22:13:14 -0500 From: Garrett Wollman To: Rick Macklem Subject: Re: Terrible NFS performance under 9.2-RELEASE? In-Reply-To: <188195924.16327973.1390703786000.JavaMail.root@uoguelph.ca> References: <201401260225.s0Q2PUp1045129@hergotha.csail.mit.edu> <188195924.16327973.1390703786000.JavaMail.root@uoguelph.ca> X-Mailer: VM 7.17 under 21.4 (patch 22) "Instant Classic" XEmacs Lucid X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (hergotha.csail.mit.edu [127.0.0.1]); Sat, 25 Jan 2014 22:13:15 -0500 (EST) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on hergotha.csail.mit.edu Cc: freebsd-fs@freebsd.org, freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Jan 2014 03:13:17 -0000 < said: > Have you tried increasing readahead by any chance? I think the default > is 1, which means the client will make 2 read requests and then wait for > those replies before doing any more reads. Since you have fast links, > maybe the 2 * 64K reads isn't enough to keep the pipe filled? (This > depends on latency, which you didn't mention.) -o readahead=4 nearly doubles the speed, to a bit over 5 Gbit/s. Oddly, when I unmount the filesystem, the test client sometimes freezes for 15-30 seconds. Since I'm not on the console I can't tell what it's doing when this happens. 
-GAWollman From owner-freebsd-net@FreeBSD.ORG Sun Jan 26 03:19:24 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 576EFCC4; Sun, 26 Jan 2014 03:19:24 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id EA2CF11C8; Sun, 26 Jan 2014 03:19:23 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqQEAHF95FKDaFve/2dsb2JhbABag0RWgn25AU+BH3SCJQEBAQMBAQEBICsgCwUWGAICDRkCKQEJJgYIBwQBHASHXAgNq1+cLheBKY0TAQEbNAeCb4FJBIlIjAyEBZBsg0seMYEEOQ X-IronPort-AV: E=Sophos;i="4.95,721,1384318800"; d="scan'208";a="90491411" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 25 Jan 2014 22:19:23 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 15FD7B4054; Sat, 25 Jan 2014 22:19:23 -0500 (EST) Date: Sat, 25 Jan 2014 22:19:23 -0500 (EST) From: Rick Macklem To: Garrett Wollman Message-ID: <688905116.16333139.1390706363082.JavaMail.root@uoguelph.ca> In-Reply-To: <21220.32074.958702.595502@hergotha.csail.mit.edu> Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.209] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-fs@freebsd.org, freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Jan 2014 03:19:24 -0000 Garrett Wollman wrote: > < said: > > > Have you tried increasing readahead by any chance? I think the > > default > > is 1, which means the client will make 2 read requests and then > > wait for > > those replies before doing any more reads. Since you have fast > > links, > > maybe the 2 * 64K reads isn't enough to keep the pipe filled? (This > > depends on latency, which you didn't mention.) > > -o readahead=4 nearly doubles the speed, to a bit over 5 Gbit/s. > And "-o readahead=8" is slower or faster? (I think you can go up to at least 16, but I can't remember the upper bound. It's in one of the .h files.;-) > Oddly, when I unmount the filesystem, the test client sometimes > freezes for 15-30 seconds. Since I'm not on the console I can't tell > what it's doing when this happens. > Hmm, no idea. Maybe it takes a while to throw away all the buffer cache blocks? I run such small systems by today's standards that I wouldn't see a delay that "might" occur with a large buffer cache.
At least a little progress, rick > -GAWollman > > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Sun Jan 26 07:43:41 2014 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id DF0CD3DE for ; Sun, 26 Jan 2014 07:43:41 +0000 (UTC) Received: from szxga01-in.huawei.com (szxga01-in.huawei.com [119.145.14.64]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 949C0118B for ; Sun, 26 Jan 2014 07:43:40 +0000 (UTC) Received: from 172.24.2.119 (EHLO szxeml209-edg.china.huawei.com) ([172.24.2.119]) by szxrg01-dlp.huawei.com (MOS 4.3.7-GA FastPath queued) with ESMTP id BQP79666; Sun, 26 Jan 2014 15:41:51 +0800 (CST) Received: from SZXEML420-HUB.china.huawei.com (10.82.67.159) by szxeml209-edg.china.huawei.com (172.24.2.184) with Microsoft SMTP Server (TLS) id 14.3.158.1; Sun, 26 Jan 2014 15:41:43 +0800 Received: from [127.0.0.1] (10.177.18.75) by szxeml420-hub.china.huawei.com (10.82.67.159) with Microsoft SMTP Server id 14.3.158.1; Sun, 26 Jan 2014 15:41:45 +0800 Message-ID: <52E4BC38.7040407@huawei.com> Date: Sun, 26 Jan 2014 15:41:44 +0800 From: Wang Weidong User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Vincenzo Maffione Subject: Re: netmap: I got some troubles with netmap References: <52D74E15.1040909@huawei.com> <92C7725B-B30A-4A19-925A-A93A2489A525@iet.unipi.it> <52D8A5E1.9020408@huawei.com> <52DD1914.7090506@iet.unipi.it> <52E1E272.8060009@huawei.com> In-Reply-To: Content-Type: text/plain; charset="ISO-8859-1" 
Content-Transfer-Encoding: 8bit X-Originating-IP: [10.177.18.75] X-CFilter-Loop: Reflected Cc: =?ISO-8859-1?Q?facolt=E0?= , Giuseppe Lettieri , Luigi Rizzo , net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Jan 2014 07:43:41 -0000 On 2014/1/24 22:56, Vincenzo Maffione wrote: > > > > 2014/1/24 Wang Weidong > > > On 2014/1/20 20:39, Giuseppe Lettieri wrote: > > Hi Wang, > > > > OK, you are using the netmap support in the upstream qemu git. That does not yet include all our modifications, some of which are very important for high throughput with VALE. In particular, the upstream qemu does not include the batching improvements in the frontend/backend interface, and it does not include the "map ring" optimization of the e1000 frontend. Please find attached a gzipped patch that contains all of our qemu code. The patch is against the latest upstream master (commit 1cf892ca). > > > > Please ./configure the patched qemu with the following option, in addition to any other option you may need: > > > > --enable-e1000-paravirt --enable-netmap \ > > --extra-cflags=-I/path/to/netmap/sys/directory > > > > Note that --enable-e1000-paravirt is needed to enable the "map ring" optimization in the e1000 frontend, even if you are not going to use the e1000-paravirt device. > > > > Now you should be able to rerun your tests. I am also attaching a README file that describes some more tests you may want to run. > > > > Hello, > > > Yes, I patch the qemu-netmap-bc767e701.patch to the qemu, download the 20131019-tinycore-netmap.hdd. > And I do some test that: > > 1. I use the bridge below: > qemu-system-x86_64 -m 2048 -boot c -net nic -net bridge,br=br1 -hda /home/wwd/tinycores/20131019-tinycore-netmap.hdd -enable-kvm -vnc :0 > test between two vms. > br1 without device. 
> Use pktgen, I got the 237.95 kpps. > Use the netserver/netperf I got the speed 1037M bits/sec with TCP_STREAM. The max speed is up to 1621M. > Use the netserver/netperf I got the speed 3296/s with TCP_RR > Use the netserver/netperf I got the speed 234M/86M bits/sec with UDP_STREAM > > When I add a device from host to the br1, the speed is 159.86 kpps. > Use the netserver/netperf I got the speed 720M bits/sec with TCP_STREAM. The max speed is up to 1000M. > Use the netserver/netperf I got the speed 3556/s with TCP_RR > Use the netserver/netperf I got the speed 181M/181M bits/sec with UDP_STREAM > > What do you think of these data? > > > You are using the old/deprecated QEMU command line syntax (-net), and therefore honestly It's not clear to me what kind of network configuration you are running. > > Please use our scripts "launch-qemu.sh", "prep-taps.sh", according to what described in the README.images file (attached). > Alternatively, use the syntax like in the following examples > > (#1) qemu-system-x86_64 archdisk.qcow -enable-kvm -device virtio-net-pci,netdev=mynet -netdev tap,ifname=tap01,id=mynet,script=no,downscript=no -smp 2 > (#2) qemu-system-x86_64 archdisk.qcow -enable-kvm -device e1000,mitigation=off,mac=00:AA:BB:CC:DD:01,netdev=mynet -netdev netmap,ifname=vale0:01,id=mynet -smp 2 > Here I use the 20131019-tinycore-netmap.hdd (download from the http://info.iet.unipi.it/~luigi/netmap/) instead archdisk.qcow with #2. And I can't do the "cpufreq-set -g performance # on linux" as the README.image. Although, I use the pkt-gen test the vms, I got the tx speed is ~3Mpps while the other vm's rx speed is only 1.44Mpps. Is it right? I can't get ~4Mpps is the reason that I can't set the "CPU power saving". > so that it's clear to us what network frontend (e.g. emulated NIC) and network backend (e.g. netmap, tap, vde, ecc..) you are using. 
> In example #1 we are using virtio-net as frontend and tap as backend, while in example #2 we are using e1000 as frontend and netmap as backend. > Also consider giving more than one core (e.g. -smp 2) to each guest, to mitigate receiver livelock problems. > > > > 2. I use the vale below: > qemu-system-x86_64 -m 2048 -boot c -net nic -net netmap,vale0:0 -hda /home/wwd/tinycores/20131019-tinycore-netmap.hdd -enable-kvm -vnc :0 > > Same for here, it's not clear what you are using. I guess each guest has an e1000 device and is connected to a different port of the same vale switch (e.g. vale0:0 and vale0:1)? > > Test with 2 vms from the same host > vale0 without device. > I use the pkt-gen, the speed is 938 Kpps > > > You should get ~4Mpps with e1000 frontend + netmap backend on a reasonably good machine. Make sure you have ./configure'd QEMU with --enable-e1000-paravirt. > > > I use netperf -H 10.0.0.2 -t UDP_STREAM, I got the speed is 195M/195M, then add -- -m 8, I only got 1.07M/1.07M. > When use the smaller msg size, the speed will smaller? > > > If you use e1000 with netperf (without pkt-gen) your performance is doomed to be horrible. Use e1000-paravirt (as a frontend) instead if you are interested in netperf experiment. > Also consider that the point in using the "-- -m8" options is experimenting high packet rates, so what you should measure here is not the througput in Mbps, but the packet rate: netperf reports the number of packets sent and received, so you can obtain the packet rate by dividing by the running time. > The throughput in Mbps is uninteresting, if you want high bulk throughput you just don't use "-- -m 8", but leave the defaults. > Using virtio-net in this case will help because of the TSO offloadings. > Here, I am a little interested in netperf. 
So I did that: qemu-system-x86_64 20131019-tinycore-netmap.hdd -enable-kvm -device *e1000-paravirt*,mitigation=off,mac=00:AA:BB:CC:DD:01,netdev=mynet -netdev netmap,ifname=vale0:01,id=mynet -smp 2 -m 2048 -vnc :0 qemu-system-x86_64 20131019-tinycore-netmap.hdd -enable-kvm -device *e1000-paravirt*,mitigation=off,mac=00:AA:BB:CC:DD:02,netdev=mynet -netdev netmap,ifname=vale0:02,id=mynet -smp 2 -m 2048 -vnc :1 I think these commands are wrong, because after I set IPv4 addresses on the devices, I find the two VMs can't communicate with each other. Thanks, Wang > cheers > Vincenzo > > > > with vale-ctl -a vale0:eth2, > use pkt-gen, the speed is 928 Kpps > I use netperf -H 10.0.0.2 -t UDP_STREAM, I got the speed is 209M/208M, then add -- -m 8, I only got 1.06M/1.06M. > > with vale-ctl -h vale0:eth2, > use pkt-gen, the speed is 928 Kpps > I use netperf -H 10.0.0.2 -t UDP_STREAM, I got the speed is 192M/192M, then add -- -m 8, I only got 1.06M/1.06M. > > Test with 2 vms form two host, > I only can test it by vale-ctl -h vale0:eth2 and set eth2 into promisc > use pkt-gen with the default params, the speed is about 750 Kpps > use netperf -H 10.0.0.2 -t UDP_STREAM, I got the speed is 160M/160M > Is this right? > > > 3. I can't use the l2 utils. > When I do the "sudo l2open -t eth0 l2recv[l2send], I got that "l2open ioctl(TUNSETIFF...): Invalid argument" > and "use l2open -r eth0 l2recv", wait a moment (only several seconds), I got the result: > TEST-RESULT: 0.901 kpps 1pkts > select/read=100.00 err=0 > > And I can't find the l2 utils from the net? Is it implemented by your team? > > All of them is tested on vms. > > Cheers. > Wang > > > > > > Cheers, > > Giuseppe > > > > Il 17/01/2014 04:39, Wang Weidong ha scritto: > >> On 2014/1/16 18:24, facoltà wrote: > [...]
> >> > >> > > > > > > > > > > -- > Vincenzo Maffione From owner-freebsd-net@FreeBSD.ORG Sun Jan 26 23:35:47 2014 Return-Path: Delivered-To: net@freebsd.org Received: from mx2.freebsd.org (mx2.freebsd.org [8.8.178.116]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 54D9F9F2; Sun, 26 Jan 2014 23:35:47 +0000 (UTC) Received: from butcher-nb.yandex.net (hub.freebsd.org [IPv6:2001:1900:2254:206c::16:88]) by mx2.freebsd.org (Postfix) with ESMTP id 1FE04228D; Sun, 26 Jan 2014 23:35:45 +0000 (UTC) Message-ID: <52E59B93.90304@FreeBSD.org> Date: Mon, 27 Jan 2014 03:34:43 +0400 From: "Andrey V. Elsukov" User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0 MIME-Version: 1.0 To: "Alexander V. Chernikov" , "net@freebsd.org" Subject: Re: "slow path" in network code || IPv6 panic on inteface removal References: <52E21721.5010309@yandex-team.ru> In-Reply-To: <52E21721.5010309@yandex-team.ru> X-Enigmail-Version: 1.6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org, hackers@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Jan 2014 23:35:47 -0000 Hello, Alexander, probably it would be better, it you split your patch into two. 
The one, that implements this: > What exactly is proposed: > - Another one netisr queue for handling different types of packets > - metainfo is stored in mbuf_tag attached to packet > - ifnet departure handler taking care of packets queued from/to killed > ifnet > - API to register/unregister/dispath given type of traffic And second, that shows usage example: > #5 T2 calls nd6_ifptomac() which reads interface MAC from ifp->if_addr > > #6 User inspects core generated by previous call > > Using new API, we can avoid #6 by making the following code changes: > * LLE timer does not drop/reacquire LLE lock > * we require nd6_ns_output callers to lock LLE if it is provided > * nd6_ns_output() uses "slow" path instead of sending mbuf to > ip6_output() immediately if LLE is not NULL. -- WBR, Andrey V. Elsukov From owner-freebsd-net@FreeBSD.ORG Mon Jan 27 02:16:56 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 3D174B64 for ; Mon, 27 Jan 2014 02:16:56 +0000 (UTC) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id B6DAE12B0 for ; Mon, 27 Jan 2014 02:16:55 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: X-IronPort-AV: E=Sophos;i="4.95,726,1384318800"; d="scan'208";a="91214176" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 26 Jan 2014 21:16:54 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 51E6FB3F48; Sun, 26 Jan 2014 21:16:54 -0500 (EST) Date: Sun, 26 Jan 2014 21:16:54 -0500 (EST) From: Rick Macklem To: Adam McDougall Message-ID: <1629593139.16590858.1390789014324.JavaMail.root@uoguelph.ca> In-Reply-To: 
<52DC1241.7010004@egr.msu.edu> Subject: Re: Terrible NFS performance under 9.2-RELEASE? MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_Part_16590856_824730477.1390789014322" X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Jan 2014 02:16:56 -0000 ------=_Part_16590856_824730477.1390789014322 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Adam McDougall wrote: > Also try rsize=32768,wsize=32768 in your mount options, made a huge > difference for me. I've noticed slow file transfers on NFS in 9 and > finally did some searching a couple months ago, someone suggested it > and > they were on to something. > I have a "hunch" that might explain why 64K NFS reads/writes perform poorly for some network environments. A 64K NFS read reply/write request consists of a list of 34 mbufs when passed to TCP via sosend() and a total data length of around 65680bytes. Looking at a couple of drivers (virtio and ixgbe), they seem to expect no more than 32-33 mbufs in a list for a 65535 byte TSO xmit. I think (I don't have anything that does TSO to confirm this) that NFS will pass a list that is longer (34 plus a TCP/IP header). At a glance, it appears that the drivers call m_defrag() or m_collapse() when the mbuf list won't fit in their scatter table (32 or 33 elements) and if this fails, just silently drop the data without sending it. If I'm right, there would considerable overhead from m_defrag()/m_collapse() and near disaster if they fail to fix the problem and the data is silently dropped instead of xmited. Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE clusters, so the mbuf count drops from 34 to 18. 
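Rick's mbuf counts can be sanity-checked with a quick arithmetic sketch. The cluster sizes are the standard FreeBSD values; the two-mbuf allowance for the RPC/NFS headers is my assumption to make the totals line up with his figures, not something stated in the mail:

```python
import math

MCLBYTES = 2048        # standard mbuf cluster size
MJUMPAGESIZE = 4096    # page-sized jumbo cluster used by the patch
NFS_IO = 64 * 1024     # default 64K rsize/wsize
HDR_MBUFS = 2          # rough allowance for RPC/NFS header mbufs (assumption)

def chain_len(cluster_size):
    # number of mbufs in the chain handed to sosend() for one 64K I/O
    return math.ceil(NFS_IO / cluster_size) + HDR_MBUFS

print(chain_len(MCLBYTES))      # -> 34, over the 32-33 segment TSO limit
print(chain_len(MJUMPAGESIZE))  # -> 18, comfortably under it
```

With 2K clusters the chain (plus a TCP/IP header mbuf) exceeds the 32-33 entry scatter tables, triggering the m_defrag()/m_collapse() path; 4K clusters halve the data mbuf count.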
If anyone has a TSO scatter/gather enabled net interface and can test this patch on it with NFS I/O (default of 64K rsize/wsize) when TSO is enabled and see what effect it has, that would be appreciated. Btw, thanks go to Garrett Wollman for suggesting the change to MJUMPAGESIZE clusters. rick ps: If the attachment doesn't make it through and you want the patch, just email me and I'll send you a copy. > On 01/19/2014 09:32, Alfred Perlstein wrote: > > 9.x has pretty poor mbuf tuning by default. > > > > I hit nearly the same problem and raising the mbufs worked for me. > > > > I'd suggest raising that and retrying. > > > > -Alfred > > > > On 1/19/14 12:47 AM, J David wrote: > >> While setting up a test for other purposes, I noticed some really > >> horrible NFS performance issues. > >> > >> To explore this, I set up a test environment with two FreeBSD > >> 9.2-RELEASE-p3 virtual machines running under KVM. The NFS server > >> is > >> configured to serve a 2 gig mfs on /mnt. > >> > >> The performance of the virtual network is outstanding: > >> > >> Server: > >> > >> $ iperf -c 172.20.20.169 > >> > >> ------------------------------------------------------------ > >> > >> Client connecting to 172.20.20.169, TCP port 5001 > >> > >> TCP window size: 1.00 MByte (default) > >> > >> ------------------------------------------------------------ > >> > >> [ 3] local 172.20.20.162 port 59717 connected with 172.20.20.169 > >> port > >> 5001 > >> > >> [ ID] Interval Transfer Bandwidth > >> > >> [ 3] 0.0-10.0 sec 16.1 GBytes 13.8 Gbits/sec > >> > >> $ iperf -s > >> > >> ------------------------------------------------------------ > >> > >> Server listening on TCP port 5001 > >> > >> TCP window size: 1.00 MByte (default) > >> > >> ------------------------------------------------------------ > >> > >> [ 4] local 172.20.20.162 port 5001 connected with 172.20.20.169 > >> port > >> 45655 > >> > >> [ ID] Interval Transfer Bandwidth > >> > >> [ 4] 0.0-10.0 sec 15.8 GBytes 13.6 Gbits/sec 
> >> > >> > >> Client: > >> > >> > >> $ iperf -s > >> > >> ------------------------------------------------------------ > >> > >> Server listening on TCP port 5001 > >> > >> TCP window size: 1.00 MByte (default) > >> > >> ------------------------------------------------------------ > >> > >> [ 4] local 172.20.20.169 port 5001 connected with 172.20.20.162 > >> port > >> 59717 > >> > >> [ ID] Interval Transfer Bandwidth > >> > >> [ 4] 0.0-10.0 sec 16.1 GBytes 13.8 Gbits/sec > >> > >> ^C$ iperf -c 172.20.20.162 > >> > >> ------------------------------------------------------------ > >> > >> Client connecting to 172.20.20.162, TCP port 5001 > >> > >> TCP window size: 1.00 MByte (default) > >> > >> ------------------------------------------------------------ > >> > >> [ 3] local 172.20.20.169 port 45655 connected with 172.20.20.162 > >> port > >> 5001 > >> > >> [ ID] Interval Transfer Bandwidth > >> > >> [ 3] 0.0-10.0 sec 15.8 GBytes 13.6 Gbits/sec > >> > >> > >> The performance of the mfs filesystem on the server is also good. > >> > >> Server: > >> > >> $ sudo mdconfig -a -t swap -s 2g > >> > >> md0 > >> > >> $ sudo newfs -U -b 4k -f 4k /dev/md0 > >> > >> /dev/md0: 2048.0MB (4194304 sectors) block size 4096, fragment > >> size 4096 > >> > >> using 43 cylinder groups of 48.12MB, 12320 blks, 6160 inodes. 
> >> > >> with soft updates > >> > >> super-block backups (for fsck_ffs -b #) at: > >> > >> 144, 98704, 197264, 295824, 394384, 492944, 591504, 690064, > >> 788624, > >> 887184, > >> > >> 985744, 1084304, 1182864, 1281424, 1379984, 1478544, 1577104, > >> 1675664, > >> > >> 1774224, 1872784, 1971344, 2069904, 2168464, 2267024, 2365584, > >> 2464144, > >> > >> 2562704, 2661264, 2759824, 2858384, 2956944, 3055504, 3154064, > >> 3252624, > >> > >> 3351184, 3449744, 3548304, 3646864, 3745424, 3843984, 3942544, > >> 4041104, > >> > >> 4139664 > >> > >> $ sudo mount /dev/md0 /mnt > >> > >> $ cd /mnt > >> > >> $ sudo iozone -e -I -s 512m -r 4k -i 0 -i 1 -i 2 > >> > >> Iozone: Performance Test of File I/O > >> > >> Version $Revision: 3.420 $ > >> > >> [...] > >> > >> random > >> random > >> > >> KB reclen write rewrite read reread > >> read > >> write > >> > >> 524288 4 560145 1114593 933699 831902 > >> 56347 > >> 158904 > >> > >> > >> iozone test complete. > >> > >> > >> But introduce NFS into the mix and everything falls apart. > >> > >> Client: > >> > >> $ sudo mount -o tcp,nfsv3 f12.phxi:/mnt /mnt > >> > >> $ cd /mnt > >> > >> $ sudo iozone -e -I -s 512m -r 4k -i 0 -i 1 -i 2 > >> > >> Iozone: Performance Test of File I/O > >> > >> Version $Revision: 3.420 $ > >> > >> [...] > >> > >> random > >> random > >> > >> KB reclen write rewrite read reread > >> read > >> write > >> > >> 524288 4 67246 2923 103295 1272407 > >> 172475 > >> 196 > >> > >> > >> And the above took 48 minutes to run, compared to 14 seconds for > >> the > >> local version. So it's 200x slower over NFS. The random write > >> test > >> is over 800x slower. Of course NFS is slower, that's expected, > >> but it > >> definitely wasn't this exaggerated in previous releases. > >> > >> To emphasize that iozone reflects real workloads here, I tried > >> doing > >> an svn co of the 9-STABLE source tree over NFS but after two hours > >> it > >> was still in llvm so I gave up. 
> >> > >> While all this not-much-of-anything NFS traffic is going on, both > >> systems are essentially idle. The process on the client sits in > >> "newnfs" wait state with nearly no CPU. The server is completely > >> idle > >> except for the occasional 0.10% in an nfsd thread, which otherwise > >> spend their lives in rpcsvc wait state. > >> > >> Server iostat: > >> > >> $ iostat -x -w 10 md0 > >> > >> extended device statistics > >> > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> > >> [...] > >> > >> md0 0.0 36.0 0.0 0.0 0 1.2 0 > >> md0 0.0 38.8 0.0 0.0 0 1.5 0 > >> md0 0.0 73.6 0.0 0.0 0 1.0 0 > >> md0 0.0 53.3 0.0 0.0 0 2.5 0 > >> md0 0.0 33.7 0.0 0.0 0 1.1 0 > >> md0 0.0 45.5 0.0 0.0 0 1.8 0 > >> > >> Server nfsstat: > >> > >> $ nfsstat -s -w 10 > >> > >> GtAttr Lookup Rdlink Read Write Rename Access Rddir > >> > >> [...] > >> > >> 0 0 0 471 816 0 0 0 > >> > >> 0 0 0 480 751 0 0 0 > >> > >> 0 0 0 481 36 0 0 0 > >> > >> 0 0 0 469 550 0 0 0 > >> > >> 0 0 0 485 814 0 0 0 > >> > >> 0 0 0 467 503 0 0 0 > >> > >> 0 0 0 473 345 0 0 0 > >> > >> > >> Client nfsstat: > >> > >> $ nfsstat -c -w 10 > >> > >> GtAttr Lookup Rdlink Read Write Rename Access Rddir > >> > >> [...] > >> > >> 0 0 0 0 518 0 0 0 > >> > >> 0 0 0 0 498 0 0 0 > >> > >> 0 0 0 0 503 0 0 0 > >> > >> 0 0 0 0 474 0 0 0 > >> > >> 0 0 0 0 525 0 0 0 > >> > >> 0 0 0 0 497 0 0 0 > >> > >> > >> Server vmstat: > >> > >> $ vmstat -w 10 > >> > >> procs memory page disks > >> faults cpu > >> > >> r b w avm fre flt re pi po fr sr vt0 vt1 in > >> sy > >> cs us sy id > >> > >> [...] 
> >> > >> 0 4 0 634M 6043M 37 0 0 0 1 0 0 0 1561 > >> 46 > >> 3431 0 2 98 > >> > >> 0 4 0 640M 6042M 62 0 0 0 28 0 0 0 1598 > >> 94 > >> 3552 0 2 98 > >> > >> 0 4 0 648M 6042M 38 0 0 0 0 0 0 0 1609 > >> 47 > >> 3485 0 1 99 > >> > >> 0 4 0 648M 6042M 37 0 0 0 0 0 0 0 1615 > >> 46 > >> 3667 0 2 98 > >> > >> 0 4 0 648M 6042M 37 0 0 0 0 0 0 0 1606 > >> 45 > >> 3678 0 2 98 > >> > >> 0 4 0 648M 6042M 37 0 0 0 0 0 1 0 1561 > >> 45 > >> 3377 0 2 98 > >> > >> > >> Client vmstat: > >> > >> $ vmstat -w 10 > >> > >> procs memory page disks > >> faults cpu > >> > >> r b w avm fre flt re pi po fr sr md0 da0 in > >> sy > >> cs us sy id > >> > >> [...] > >> > >> 0 0 0 639M 593M 33 0 0 0 1237 0 0 0 281 > >> 5575 > >> 1043 0 3 97 > >> > >> 0 0 0 639M 591M 0 0 0 0 712 0 0 0 235 > >> 122 > >> 889 0 2 98 > >> > >> 0 0 0 639M 583M 0 0 0 0 571 0 0 1 227 > >> 120 > >> 851 0 2 98 > >> > >> 0 0 0 639M 592M 198 0 0 0 1212 0 0 0 251 > >> 2497 > >> 950 0 3 97 > >> > >> 0 0 0 639M 586M 0 0 0 0 614 0 0 0 250 > >> 121 > >> 924 0 2 98 > >> > >> 0 0 0 639M 586M 0 0 0 0 765 0 0 0 250 > >> 120 > >> 918 0 3 97 > >> > >> > >> Top on the KVM host says it is 93-95% idle and that each VM sits > >> around 7-10% CPU. So basically nobody is doing anything. There's > >> no > >> visible bottleneck, and I've no idea where to go from here to > >> figure > >> out what's going on. > >> > >> Does anyone have any suggestions for debugging this? > >> > >> Thanks! 
> >> _______________________________________________ > >> freebsd-net@freebsd.org mailing list > >> http://lists.freebsd.org/mailman/listinfo/freebsd-net > >> To unsubscribe, send any mail to > >> "freebsd-net-unsubscribe@freebsd.org" > >> > > > > _______________________________________________ > > freebsd-net@freebsd.org mailing list > > http://lists.freebsd.org/mailman/listinfo/freebsd-net > > To unsubscribe, send any mail to > > "freebsd-net-unsubscribe@freebsd.org" > > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > ------=_Part_16590856_824730477.1390789014322 Content-Type: text/x-patch; name=4kmcl.patch Content-Disposition: attachment; filename=4kmcl.patch Content-Transfer-Encoding: base64 LS0tIGZzL25mcy9uZnNwb3J0Lmguc2F2MgkyMDE0LTAxLTI2IDE4OjQzOjQ3LjAwMDAwMDAwMCAt MDUwMAorKysgZnMvbmZzL25mc3BvcnQuaAkyMDE0LTAxLTI2IDE5OjA0OjI3LjAwMDAwMDAwMCAt MDUwMApAQCAtMTUzLDE0ICsxNTMsMjcgQEAKIAkJCU1HRVRIRFIoKG0pLCBNX1dBSVRPSywgTVRf REFUQSk7IAlcCiAJCX0gCQkJCQkJXAogCX0gd2hpbGUgKDApCi0jZGVmaW5lCU5GU01DTEdFVCht LCB3KQlkbyB7IAkJCQkJXAotCQlNR0VUKChtKSwgTV9XQUlUT0ssIE1UX0RBVEEpOyAJCQlcCi0J CXdoaWxlICgobSkgPT0gTlVMTCApIHsgCQkJCVwKLQkJCSh2b2lkKSBuZnNfY2F0bmFwKFBaRVJP LCAwLCAibmZzbWdldCIpOwlcCi0JCQlNR0VUKChtKSwgTV9XQUlUT0ssIE1UX0RBVEEpOyAJCVwK LQkJfSAJCQkJCQlcCi0JCU1DTEdFVCgobSksICh3KSk7CQkJCVwKKyNpZiBNSlVNUEFHRVNJWkUg PiBNQ0xCWVRFUworI2RlZmluZQlORlNNQ0xHRVQobSwgdykJZG8gewkgCQkJCQlcCisJCShtKSA9 IG1fZ2V0amNsKE1fV0FJVE9LLCBNVF9EQVRBLCAwLCBNSlVNUEFHRVNJWkUpOwlcCisJCXdoaWxl ICgobSkgPT0gTlVMTCkgewkgCQkJCVwKKwkJCSh2b2lkKW5mc19jYXRuYXAoUFpFUk8sIDAsICJu ZnNtZ2V0Iik7CQlcCisJCQlNR0VUKChtKSwgTV9XQUlUT0ssIE1UX0RBVEEpOwkgCQlcCisJCQlp ZiAoKG0pICE9IE5VTEwpCQkJCVwKKwkJCQlNQ0xHRVQoKG0pLCAodykpOwkJCVwKKwkJfQkgCQkJ CQkJXAogCX0gd2hpbGUgKDApCisjZWxzZQorI2RlZmluZQlORlNNQ0xHRVQobSwgdykJZG8gewkg 
CQkJCQlcCisJCShtKSA9IG1fZ2V0amNsKE1fV0FJVE9LLCBNVF9EQVRBLCAwLCBNQ0xCWVRFUyk7 CQlcCisJCXdoaWxlICgobSkgPT0gTlVMTCkgewkgCQkJCVwKKwkJCSh2b2lkKW5mc19jYXRuYXAo UFpFUk8sIDAsICJuZnNtZ2V0Iik7CQlcCisJCQlNR0VUKChtKSwgTV9XQUlUT0ssIE1UX0RBVEEp OwkgCQlcCisJCQlpZiAoKG0pICE9IE5VTEwpCQkJCVwKKwkJCQlNQ0xHRVQoKG0pLCAodykpOwkJ CVwKKwkJfQkgCQkJCQkJXAorCX0gd2hpbGUgKDApCisjZW5kaWYKICNkZWZpbmUJTkZTTUNMR0VU SERSKG0sIHcpIGRvIHsgCQkJCVwKIAkJTUdFVEhEUigobSksIE1fV0FJVE9LLCBNVF9EQVRBKTsJ CVwKIAkJd2hpbGUgKChtKSA9PSBOVUxMICkgeyAJCQkJXAotLS0gZnMvbmZzc2VydmVyL25mc19u ZnNkcG9ydC5jLnNhdjIJMjAxNC0wMS0yNiAxODo1NDoyOS4wMDAwMDAwMDAgLTA1MDAKKysrIGZz L25mc3NlcnZlci9uZnNfbmZzZHBvcnQuYwkyMDE0LTAxLTI2IDE4OjU2OjA4LjAwMDAwMDAwMCAt MDUwMApAQCAtNTY2LDggKzU2Niw3IEBAIG5mc3Zub19yZWFkbGluayhzdHJ1Y3Qgdm5vZGUgKnZw LCBzdHJ1Y3QKIAlsZW4gPSAwOwogCWkgPSAwOwogCXdoaWxlIChsZW4gPCBORlNfTUFYUEFUSExF TikgewotCQlORlNNR0VUKG1wKTsKLQkJTUNMR0VUKG1wLCBNX1dBSVRPSyk7CisJCU5GU01DTEdF VChtcCwgTV9XQUlUT0spOwogCQltcC0+bV9sZW4gPSBORlNNU0laKG1wKTsKIAkJaWYgKGxlbiA9 PSAwKSB7CiAJCQltcDMgPSBtcDIgPSBtcDsKQEAgLTYzNiw4ICs2MzUsNyBAQCBuZnN2bm9fcmVh ZChzdHJ1Y3Qgdm5vZGUgKnZwLCBvZmZfdCBvZmYsCiAJICovCiAJaSA9IDA7CiAJd2hpbGUgKGxl ZnQgPiAwKSB7Ci0JCU5GU01HRVQobSk7Ci0JCU1DTEdFVChtLCBNX1dBSVRPSyk7CisJCU5GU01D TEdFVChtLCBNX1dBSVRPSyk7CiAJCW0tPm1fbGVuID0gMDsKIAkJc2l6ID0gbWluKE1fVFJBSUxJ TkdTUEFDRShtKSwgbGVmdCk7CiAJCWxlZnQgLT0gc2l6Owo= ------=_Part_16590856_824730477.1390789014322-- From owner-freebsd-net@FreeBSD.ORG Mon Jan 27 03:23:50 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 05458AF5 for ; Mon, 27 Jan 2014 03:23:50 +0000 (UTC) Received: from h2.funkthat.com (gate2.funkthat.com [208.87.223.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id D27B4187B for ; Mon, 27 
Jan 2014 03:23:49 +0000 (UTC) Received: from h2.funkthat.com (localhost [127.0.0.1]) by h2.funkthat.com (8.14.3/8.14.3) with ESMTP id s0R3NcFh047099 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 26 Jan 2014 19:23:39 -0800 (PST) (envelope-from jmg@h2.funkthat.com) Received: (from jmg@localhost) by h2.funkthat.com (8.14.3/8.14.3/Submit) id s0R3Nct6047098; Sun, 26 Jan 2014 19:23:38 -0800 (PST) (envelope-from jmg) Date: Sun, 26 Jan 2014 19:23:38 -0800 From: John-Mark Gurney To: Rick Macklem Subject: Re: Terrible NFS performance under 9.2-RELEASE? Message-ID: <20140127032338.GP13704@funkthat.com> Mail-Followup-To: Rick Macklem , Adam McDougall , freebsd-net@freebsd.org References: <52DC1241.7010004@egr.msu.edu> <1629593139.16590858.1390789014324.JavaMail.root@uoguelph.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1629593139.16590858.1390789014324.JavaMail.root@uoguelph.ca> User-Agent: Mutt/1.4.2.3i X-Operating-System: FreeBSD 7.2-RELEASE i386 X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger? X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.2 (h2.funkthat.com [127.0.0.1]); Sun, 26 Jan 2014 19:23:39 -0800 (PST) Cc: freebsd-net@freebsd.org, Adam McDougall X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Jan 2014 03:23:50 -0000 Rick Macklem wrote this message on Sun, Jan 26, 2014 at 21:16 -0500: > Btw, thanks go to Garrett Wollman for suggesting the change to MJUMPAGESIZE > clusters. 
> > rick > ps: If the attachment doesn't make it through and you want the patch, just > email me and I'll send you a copy. The patch looks good, but we probably shouldn't change _readlink.. The chances of a link being >2k are pretty slim, and the chances of the link being >32k are even smaller... In fact, we might want to switch _readlink to MGET (could be conditional upon cnt) so that if it fits in an mbuf we don't allocate a cluster for it... -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." From owner-freebsd-net@FreeBSD.ORG Mon Jan 27 05:50:57 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id A949CD66 for ; Mon, 27 Jan 2014 05:50:57 +0000 (UTC) Received: from mail-pb0-x229.google.com (mail-pb0-x229.google.com [IPv6:2607:f8b0:400e:c01::229]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 7B5211298 for ; Mon, 27 Jan 2014 05:50:57 +0000 (UTC) Received: by mail-pb0-f41.google.com with SMTP id up15so5468494pbc.14 for ; Sun, 26 Jan 2014 21:50:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:date:to:cc:subject:message-id:reply-to:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=Bw/K5LPfCjrQMxv+KYJgyeb0epoQ04Nt6bAuB1SjFps=; b=MyYnhd1dPuBQvMdoU0cdOv2MJ1qWTOlbQ/j8eqjueHtdo6MPKLsX2hFYrzYq2e+4wD j6JGLdLd1MklYq8wbPUj79E5aYtGuCp2zOhE/vkOY67yATVT5EtdTMISLQchaExM0CFh ozcB+LQKCyxIDPsDAsEyX6QJY921hLzzA/8HNP6qARBea4IBgtQoCEFjMk+LSjWHX0A3 8dSs/x0yok+e62MUzRYoMCO2AuUP+HnofatjTqC7YQkwZzbBrt0A/dRD+6cyQ68hIiz9 klOvj1Es1OPu2rkxYL9Gk55PX1qCRO2TzHMNKXchgKccqqoOW7Dog949ednxcVnLy0dx 8nDQ== X-Received: by 10.68.198.97 with SMTP id 
jb1mr28355539pbc.104.1390801856275; Sun, 26 Jan 2014 21:50:56 -0800 (PST) Received: from pyunyh@gmail.com (lpe4.p59-icn.cdngp.net. [114.111.62.249]) by mx.google.com with ESMTPSA id e6sm28142111pbg.4.2014.01.26.21.50.52 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Sun, 26 Jan 2014 21:50:55 -0800 (PST) Received: by pyunyh@gmail.com (sSMTP sendmail emulation); Mon, 27 Jan 2014 14:50:47 +0900 From: Yonghyeon PYUN Date: Mon, 27 Jan 2014 14:50:47 +0900 To: Rick Macklem Subject: Re: Terrible NFS performance under 9.2-RELEASE? Message-ID: <20140127055047.GA1368@michelle.cdnetworks.com> References: <52DC1241.7010004@egr.msu.edu> <1629593139.16590858.1390789014324.JavaMail.root@uoguelph.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1629593139.16590858.1390789014324.JavaMail.root@uoguelph.ca> User-Agent: Mutt/1.4.2.3i Cc: freebsd-net@freebsd.org, Adam McDougall X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list Reply-To: pyunyh@gmail.com List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Jan 2014 05:50:57 -0000 On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote: > Adam McDougall wrote: > > Also try rsize=32768,wsize=32768 in your mount options, made a huge > > difference for me. I've noticed slow file transfers on NFS in 9 and > > finally did some searching a couple months ago, someone suggested it > > and > > they were on to something. > > > I have a "hunch" that might explain why 64K NFS reads/writes perform > poorly for some network environments. > A 64K NFS read reply/write request consists of a list of 34 mbufs when > passed to TCP via sosend() and a total data length of around 65680bytes. > Looking at a couple of drivers (virtio and ixgbe), they seem to expect > no more than 32-33 mbufs in a list for a 65535 byte TSO xmit. 
I think > (I don't have anything that does TSO to confirm this) that NFS will pass > a list that is longer (34 plus a TCP/IP header). > At a glance, it appears that the drivers call m_defrag() or m_collapse() > when the mbuf list won't fit in their scatter table (32 or 33 elements) > and if this fails, just silently drop the data without sending it. > If I'm right, there would considerable overhead from m_defrag()/m_collapse() > and near disaster if they fail to fix the problem and the data is silently > dropped instead of xmited. > I think the actual number of DMA segments allocated for the mbuf chain is determined by bus_dma(9). bus_dma(9) will coalesce current segment with previous segment if possible. I'm not sure whether you're referring to ixgbe(4) or ix(4) but I see the total length of all segment size of ix(4) is 65535 so it has no room for ethernet/VLAN header of the mbuf chain. The driver should be fixed to transmit a 64KB datagram. I think the use of m_defrag(9) in TSO is suboptimal. All TSO capable controllers are able to handle multiple TX buffers so it should have used m_collapse(9) rather than copying entire chain with m_defrag(9). > Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE clusters, > so the mbuf count drops from 34 to 18. > Could we make it conditional on size? > If anyone has a TSO scatter/gather enabled net interface and can test this > patch on it with NFS I/O (default of 64K rsize/wsize) when TSO is enabled > and see what effect it has, that would be appreciated. > > Btw, thanks go to Garrett Wollman for suggesting the change to MJUMPAGESIZE > clusters. > > rick > ps: If the attachment doesn't make it through and you want the patch, just > email me and I'll send you a copy. 
> From owner-freebsd-net@FreeBSD.ORG Mon Jan 27 10:21:36 2014 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 0705A872 for ; Mon, 27 Jan 2014 10:21:36 +0000 (UTC) Received: from szxga03-in.huawei.com (szxga03-in.huawei.com [119.145.14.66]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 9EEA216C2 for ; Mon, 27 Jan 2014 10:21:34 +0000 (UTC) Received: from 172.24.2.119 (EHLO szxeml209-edg.china.huawei.com) ([172.24.2.119]) by szxrg03-dlp.huawei.com (MOS 4.4.3-GA FastPath queued) with ESMTP id AJX01520; Mon, 27 Jan 2014 18:15:00 +0800 (CST) Received: from SZXEML405-HUB.china.huawei.com (10.82.67.60) by szxeml209-edg.china.huawei.com (172.24.2.184) with Microsoft SMTP Server (TLS) id 14.3.158.1; Mon, 27 Jan 2014 18:14:56 +0800 Received: from [127.0.0.1] (10.177.18.75) by szxeml405-hub.china.huawei.com (10.82.67.60) with Microsoft SMTP Server id 14.3.158.1; Mon, 27 Jan 2014 18:14:51 +0800 Message-ID: <52E6319A.8070601@huawei.com> Date: Mon, 27 Jan 2014 18:14:50 +0800 From: Wang Weidong User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Vincenzo Maffione Subject: Re: netmap: I got some troubles with netmap References: <52D74E15.1040909@huawei.com> <92C7725B-B30A-4A19-925A-A93A2489A525@iet.unipi.it> <52D8A5E1.9020408@huawei.com> <52DD1914.7090506@iet.unipi.it> <52E1E272.8060009@huawei.com> <52E4BC38.7040407@huawei.com> In-Reply-To: Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 8bit X-Originating-IP: [10.177.18.75] X-CFilter-Loop: Reflected Cc: =?ISO-8859-1?Q?facolt=E0?= , Giuseppe Lettieri , Luigi Rizzo , net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and 
TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Jan 2014 10:21:36 -0000 On 2014/1/27 17:37, Vincenzo Maffione wrote: > > > > 2014/1/26 Wang Weidong > > > On 2014/1/24 22:56, Vincenzo Maffione wrote: > > > > > > > > 2014/1/24 Wang Weidong >> > > > > On 2014/1/20 20:39, Giuseppe Lettieri wrote: > > > Hi Wang, > > > > > > OK, you are using the netmap support in the upstream qemu git. That does not yet include all our modifications, some of which are very important for high throughput with VALE. In particular, the upstream qemu does not include the batching improvements in the frontend/backend interface, and it does not include the "map ring" optimization of the e1000 frontend. Please find attached a gzipped patch that contains all of our qemu code. The patch is against the latest upstream master (commit 1cf892ca). > > > > > > Please ./configure the patched qemu with the following option, in addition to any other option you may need: > > > > > > --enable-e1000-paravirt --enable-netmap \ > > > --extra-cflags=-I/path/to/netmap/sys/directory > > > > > > Note that --enable-e1000-paravirt is needed to enable the "map ring" optimization in the e1000 frontend, even if you are not going to use the e1000-paravirt device. > > > > > > Now you should be able to rerun your tests. I am also attaching a README file that describes some more tests you may want to run. > > > > > > > Hello, > > > > > > Yes, I patch the qemu-netmap-bc767e701.patch to the qemu, download the 20131019-tinycore-netmap.hdd. > > And I do some test that: > > > > 1. I use the bridge below: > > qemu-system-x86_64 -m 2048 -boot c -net nic -net bridge,br=br1 -hda /home/wwd/tinycores/20131019-tinycore-netmap.hdd -enable-kvm -vnc :0 > > test between two vms. > > br1 without device. > > Use pktgen, I got the 237.95 kpps. > > Use the netserver/netperf I got the speed 1037M bits/sec with TCP_STREAM. The max speed is up to 1621M. 
> > Use the netserver/netperf I got the speed 3296/s with TCP_RR > > Use the netserver/netperf I got the speed 234M/86M bits/sec with UDP_STREAM > > > > When I add a device from host to the br1, the speed is 159.86 kpps. > > Use the netserver/netperf I got the speed 720M bits/sec with TCP_STREAM. The max speed is up to 1000M. > > Use the netserver/netperf I got the speed 3556/s with TCP_RR > > Use the netserver/netperf I got the speed 181M/181M bits/sec with UDP_STREAM > > > > What do you think of these data? > > > > > > You are using the old/deprecated QEMU command line syntax (-net), and therefore honestly It's not clear to me what kind of network configuration you are running. > > > > Please use our scripts "launch-qemu.sh", "prep-taps.sh", according to what described in the README.images file (attached). > > Alternatively, use the syntax like in the following examples > > > > (#1) qemu-system-x86_64 archdisk.qcow -enable-kvm -device virtio-net-pci,netdev=mynet -netdev tap,ifname=tap01,id=mynet,script=no,downscript=no -smp 2 > > (#2) qemu-system-x86_64 archdisk.qcow -enable-kvm -device e1000,mitigation=off,mac=00:AA:BB:CC:DD:01,netdev=mynet -netdev netmap,ifname=vale0:01,id=mynet -smp 2 > > > Here I use the 20131019-tinycore-netmap.hdd (download from the http://info.iet.unipi.it/~luigi/netmap/) instead archdisk.qcow with #2. > And I can't do the "cpufreq-set -g performance # on linux" as the README.image. > > > This is our fault, thanks for reporting (probably the tinycore kernel doesn't include the cpufreq governors). However, if you use a different linux O.S. as host machine you should be able to cpufreq-set -gperformance on the host machine, while keep using tinycore into the vms. > > Although, I use the pkt-gen test the vms, I got the tx speed is ~3Mpps while the other vm's rx speed is only 1.44Mpps. > > Is it right? I can't get ~4Mpps is the reason that I can't set the "CPU power saving". 
> > > It can be OK, depending on your machine and maybe on the CPU power saving. On my machine > - Processor: Intel i7-3770K CPU @ 3.50GHz (8 cores) > - Memory @ 1333 MHz > - Host O.S.: Archlinux with Linux 3.12 > I get: > > - 3.9 Mpps TX, 3.5 Mpps RX when guests are given 1 vCPU each > - 4.5 Mpps TX, 3.2 Mpps RX when guests are given 2 vCPUs each. > > > > > > so that it's clear to us what network frontend (e.g. emulated NIC) and network backend (e.g. netmap, tap, vde, etc.) you are using. > > In example #1 we are using virtio-net as frontend and tap as backend, while in example #2 we are using e1000 as frontend and netmap as backend. > > Also consider giving more than one core (e.g. -smp 2) to each guest, to mitigate receiver livelock problems. > > > > > > > > 2. I use the vale switch as below: > > qemu-system-x86_64 -m 2048 -boot c -net nic -net netmap,vale0:0 -hda /home/wwd/tinycores/20131019-tinycore-netmap.hdd -enable-kvm -vnc :0 > > > > Same here, it's not clear what you are using. I guess each guest has an e1000 device and is connected to a different port of the same vale switch (e.g. vale0:0 and vale0:1)? > > > > Test with 2 VMs from the same host, > > vale0 without a device attached. > > Using pkt-gen, the speed is 938 Kpps. > > > > > > You should get ~4 Mpps with e1000 frontend + netmap backend on a reasonably good machine. Make sure you have ./configure'd QEMU with --enable-e1000-paravirt. > > > > > > I use netperf -H 10.0.0.2 -t UDP_STREAM and I got a speed of 195M/195M; then, adding -- -m 8, I only got 1.07M/1.07M. > > When using a smaller message size, will the speed be smaller? > > > > > > If you use e1000 with netperf (without pkt-gen) your performance is doomed to be horrible. Use e1000-paravirt (as a frontend) instead if you are interested in netperf experiments.
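The Kpps/Mpps figures in this exchange come from netmap's pkt-gen tool run inside the guests. As a reminder of the two roles, here is a small sketch; the guest interface name (eth0) is an assumption, and newer netmap trees spell the interface as "netmap:eth0":

```shell
#!/bin/sh
# Sketch: print the sender/receiver pkt-gen invocations used for the
# packet-rate tests in this thread. The interface name is an assumption.
pktgen_cmds() {
    ifname="$1"
    echo "receiver guest: pkt-gen -i $ifname -f rx"
    echo "sender guest:   pkt-gen -i $ifname -f tx"
}

pktgen_cmds eth0
```

pkt-gen prints a per-second rate while running and an average at the end, which is where numbers like "938 Kpps" come from.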
> > Also consider that the point of using the "-- -m 8" option is experimenting with high packet rates, so what you should measure here is not the throughput in Mbps, but the packet rate: netperf reports the number of packets sent and received, so you can obtain the packet rate by dividing by the running time. > > The throughput in Mbps is uninteresting; if you want high bulk throughput you just don't use "-- -m 8", but leave the defaults. > > Using virtio-net in this case will help because of the TSO offloading. > > > Here, I am a little interested in netperf. So I did this: > qemu-system-x86_64 20131019-tinycore-netmap.hdd -enable-kvm -device *e1000-paravirt*,mitigation=off,mac=00:AA:BB:CC:DD:01,netdev=mynet -netdev netmap,ifname=vale0:01,id=mynet -smp 2 -m 2048 -vnc :0 > qemu-system-x86_64 20131019-tinycore-netmap.hdd -enable-kvm -device *e1000-paravirt*,mitigation=off,mac=00:AA:BB:CC:DD:02,netdev=mynet -netdev netmap,ifname=vale0:02,id=mynet -smp 2 -m 2048 -vnc :1 > I think this command is wrong, because after I set IPv4 addresses on the devices, I find the two VMs can't communicate with each other. > > > This is the command that I use on my host machine (it can be generated using the script launch-qemu.sh): > > qemu-system-x86_64 20131019-tinycore-netmap.hdd -enable-kvm -device e1000-paravirt,mitigation=off,ioeventfd=on,v1000=off,mac=00:AA:BB:CC:DD:01,netdev=mynet -netdev netmap,ifname=vale0:01,id=mynet -smp 2 -m 3G -vga std > > and it works for me. It may be that your problem is due to the fact that you are using a netmap version that mismatches the QEMU version. > Please use the netmap version I attached. > > Also try "virtio-net-pci" as frontend, like in the following: > > qemu-system-x86_64 20131019-tinycore-netmap.hdd -enable-kvm -device virtio-net-pci,mrg_rxbuf=on,ioeventfd=on,mac=00:AA:BB:CC:DD:01,netdev=mynet -netdev netmap,ifname=vale0:01,id=mynet -smp 2 -m 3G -vga std > > Consider that: > - e1000-paravirt is optimized for high packet rate (e.g.
UDP_STREAM -- -m8) > - virtio-net-pci is optimized for TCP_STREAM and TCP_RR. > and you should be able to deduce this from the numbers you measure. > > > Cheers, > Vincenzo > > Nice, I will test again. Regards, Wang > > > Thanks, > Wang > > > cheers > > Vincenzo > > > > > > > > With vale-ctl -a vale0:eth2, > > using pkt-gen, the speed is 928 Kpps. > > I use netperf -H 10.0.0.2 -t UDP_STREAM and I got a speed of 209M/208M; then, adding -- -m 8, I only got 1.06M/1.06M. > > > > With vale-ctl -h vale0:eth2, > > using pkt-gen, the speed is 928 Kpps. > > I use netperf -H 10.0.0.2 -t UDP_STREAM and I got a speed of 192M/192M; then, adding -- -m 8, I only got 1.06M/1.06M. > > > > Test with 2 VMs from two hosts: > > I can only test it with vale-ctl -h vale0:eth2 and setting eth2 to promiscuous mode. > > Using pkt-gen with the default parameters, the speed is about 750 Kpps. > > Using netperf -H 10.0.0.2 -t UDP_STREAM, I got a speed of 160M/160M. > > Is this right? > > > > > > 3. I can't use the l2 utils. > > When I run "sudo l2open -t eth0 l2recv" (or l2send), I get "l2open ioctl(TUNSETIFF...): Invalid argument", > > and with "l2open -r eth0 l2recv", after waiting a moment (only several seconds), I got the result: > > TEST-RESULT: 0.901 kpps 1pkts > > select/read=100.00 err=0 > > > > And I can't find the l2 utils on the net. Are they implemented by your team? > > > > All of them were tested on VMs. > > > > Cheers. > > Wang > > > > > > > > > > Cheers, > > > Giuseppe > > > > > > Il 17/01/2014 04:39, Wang Weidong ha scritto: > > >> On 2014/1/16 18:24, facoltà wrote: > > [...]
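Following the advice above to read a "-- -m 8" UDP_STREAM run as a packet rate rather than as Mbit/s, here is a minimal sketch of the conversion (both input values are placeholders, not figures from this thread):

```shell
#!/bin/sh
# Sketch: turn the message count netperf reports for a UDP_STREAM test
# into a packet rate. Placeholder values, not measured data.
msgs_ok=12000000   # "Messages Okay" count from the netperf output
duration=10        # test length in seconds (netperf's default)

kpps=$(( msgs_ok / duration / 1000 ))
echo "${kpps} kpps"
```

With 8-byte messages the Mbit/s figure stays tiny even at a high packet rate, which is why the raw throughput numbers with -m 8 look so poor.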
> > >> > > >> > > > > > > > > > > > > > > > > > > -- > > Vincenzo Maffione > > > > > > -- > Vincenzo Maffione From owner-freebsd-net@FreeBSD.ORG Mon Jan 27 11:06:51 2014 Return-Path: Delivered-To: freebsd-net@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2761BC86 for ; Mon, 27 Jan 2014 11:06:51 +0000 (UTC) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 1171F1A8C for ; Mon, 27 Jan 2014 11:06:51 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id s0RB6oqh013078 for ; Mon, 27 Jan 2014 11:06:50 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id s0RB6oEf013075 for freebsd-net@FreeBSD.org; Mon, 27 Jan 2014 11:06:50 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 27 Jan 2014 11:06:50 GMT Message-Id: <201401271106.s0RB6oEf013075@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-net@FreeBSD.org Subject: Current problem reports assigned to freebsd-net@FreeBSD.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Jan 2014 11:06:51 -0000 Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. 
These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/185496 net [re] RTL8169 doesn't receive unicast ethernet packets o kern/185427 net [igb] freebsd 8.4, 9.1 and 9.2 panic Double-Fault with o kern/185023 net [tun] Closing tun interface deconfigures IP address o kern/185022 net [tun] ls /dev/tun creates tun interface o kern/184311 net [bge] [panic] kernel panic with bge(4) on SunFire X210 o kern/184084 net [ral] kernel crash by ral (RT3090) o bin/183687 net [patch] route(8): route add -net 172.20 add wrong host o kern/183659 net [tcp] ]TCP stack lock contention with short-lived conn o conf/183407 net [rc.d] [patch] Routing restart returns non-zero exitco o kern/183391 net [oce] 10gigabit networking problems with Emulex OCE 11 o kern/183390 net [ixgbe] 10gigabit networking problems o kern/182917 net [igb] strange out traffic with igb interfaces o kern/182847 net [netinet6] [patch] Remove dead code o kern/182665 net [wlan] Kernel panic when creating second wlandev. 
o kern/182382 net [tcp] sysctl to set TCP CC method on BIG ENDIAN system o kern/182297 net [cm] ArcNet driver fails to detect the link address - o kern/182212 net [patch] [ng_mppc] ng_mppc(4) blocks on network errors o kern/181970 net [re] LAN Realtek® 8111G is not supported by re driver o kern/181931 net [vlan] [lagg] vlan over lagg over mlxen crashes the ke o kern/181823 net [ip6] [patch] make ipv6 mroute return same errror code o kern/181741 net [kernel] [patch] Packet loss when 'control' messages a o kern/181703 net [re] [patch] Fix Realtek 8111G Ethernet controller not o kern/181657 net [bpf] [patch] BPF_COP/BPF_COPX instruction reservation o kern/181257 net [bge] bge link status change o kern/181236 net [igb] igb driver unstable work o kern/181135 net [netmap] [patch] sys/dev/netmap patch for Linux compat o kern/181131 net [netmap] [patch] sys/dev/netmap memory allocation impr o kern/181006 net [run] [patch] mbuf leak in run(4) driver o kern/180893 net [if_ethersubr] [patch] Packets received with own LLADD o kern/180844 net [panic] [re] Intermittent panic (re driver?) 
o kern/180775 net [bxe] if_bxe driver broken with Broadcom BCM57711 card o kern/180722 net [bluetooth] bluetooth takes 30-50 attempts to pair to s kern/180468 net [request] LOCAL_PEERCRED support for PF_INET o kern/180065 net [netinet6] [patch] Multicast loopback to own host brok o kern/179926 net [lacp] [patch] active aggregator selection bug o kern/179824 net [ixgbe] System (9.1-p4) hangs on heavy ixgbe network t o kern/179733 net [lagg] [patch] interface loses capabilities when proto o kern/179429 net [tap] STP enabled tap bridge o kern/179299 net [igb] Intel X540-T2 - unstable driver a kern/179264 net [vimage] [pf] Core dump with Packet filter and VIMAGE o kern/178947 net [arp] arp rejecting not working o kern/178782 net [ixgbe] 82599EB SFP does not work with passthrough und o kern/178612 net [run] kernel panic due the problems with run driver o kern/178472 net [ip6] [patch] make return code consistent with IPv4 co o kern/178079 net [tcp] Switching TCP CC algorithm panics on sparc64 wit s kern/178071 net FreeBSD unable to recongize Kontron (Industrial Comput o kern/177905 net [xl] [panic] ifmedia_set when pluging CardBus LAN card o kern/177618 net [bridge] Problem with bridge firewall with trunk ports o kern/177402 net [igb] [pf] problem with ethernet driver igb + pf / alt o kern/177400 net [jme] JMC25x 1000baseT establishment issues o kern/177366 net [ieee80211] negative malloc(9) statistics for 80211nod f kern/177362 net [netinet] [patch] Wrong control used to return TOS o kern/177194 net [netgraph] Unnamed netgraph nodes for vlan interfaces o kern/177184 net [bge] [patch] enable wake on lan o kern/177139 net [igb] igb drops ethernet ports 2 and 3 o kern/176884 net [re] re0 flapping up/down o kern/176671 net [epair] MAC address for epair device not unique o kern/176484 net [ipsec] [enc] [patch] panic: IPsec + enc(4); device na o kern/176446 net [netinet] [patch] Concurrency in ixgbe driving out-of- o kern/176420 net [kernel] [patch] incorrect errno for 
LOCAL_PEERCRED o kern/176419 net [kernel] [patch] socketpair support for LOCAL_PEERCRED o kern/176401 net [netgraph] page fault in netgraph o kern/176167 net [ipsec][lagg] using lagg and ipsec causes immediate pa o kern/176027 net [em] [patch] flow control systcl consistency for em dr o kern/176026 net [tcp] [patch] TCP wrappers caused quite a lot of warni o kern/175864 net [re] Intel MB D510MO, onboard ethernet not working aft o kern/175852 net [amd64] [patch] in_cksum_hdr() behaves differently on o kern/175734 net no ethernet detected on system with EG20T PCH chipset o kern/175267 net [pf] [tap] pf + tap keep state problem o kern/175236 net [epair] [gif] epair and gif Devices On Bridge o kern/175182 net [panic] kernel panic on RADIX_MPATH when deleting rout o kern/175153 net [tcp] will there miss a FIN when do TSO? o kern/174959 net [net] [patch] rnh_walktree_from visits spurious nodes o kern/174958 net [net] [patch] rnh_walktree_from makes unreasonable ass o kern/174897 net [route] Interface routes are broken o kern/174851 net [bxe] [patch] UDP checksum offload is wrong in bxe dri o kern/174850 net [bxe] [patch] bxe driver does not receive multicasts o kern/174849 net [bxe] [patch] bxe driver can hang kernel when reset o kern/174822 net [tcp] Page fault in tcp_discardcb under high traffic o kern/174602 net [gif] [ipsec] traceroute issue on gif tunnel with ipse o kern/174535 net [tcp] TCP fast retransmit feature works strange o kern/173871 net [gif] process of 'ifconfig gif0 create hangs' when if_ o kern/173475 net [tun] tun(4) stays opened by PID after process is term o kern/173201 net [ixgbe] [patch] Missing / broken ixgbe sysctl's and tu o kern/173137 net [em] em(4) unable to run at gigabit with 9.1-RC2 o kern/173002 net [patch] data type size problem in if_spppsubr.c o kern/172895 net [ixgb] [ixgbe] do not properly determine link-state o kern/172683 net [ip6] Duplicate IPv6 Link Local Addresses o kern/172675 net [netinet] [patch] sysctl_tcp_hc_list 
(net.inet.tcp.hos p kern/172113 net [panic] [e1000] [patch] 9.1-RC1/amd64 panices in igb(4 o kern/171840 net [ip6] IPv6 packets transmitting only on queue 0 o kern/171739 net [bce] [panic] bce related kernel panic o kern/171711 net [dummynet] [panic] Kernel panic in dummynet o kern/171532 net [ndis] ndis(4) driver includes 'pccard'-specific code, o kern/171531 net [ndis] undocumented dependency for ndis(4) o kern/171524 net [ipmi] ipmi driver crashes kernel by reboot or shutdow s kern/171508 net [epair] [request] Add the ability to name epair device o kern/171228 net [re] [patch] if_re - eeprom write issues o kern/170701 net [ppp] killl ppp or reboot with active ppp connection c o kern/170267 net [ixgbe] IXGBE_LE32_TO_CPUS is probably an unintentiona o kern/170081 net [fxp] pf/nat/jails not working if checksum offloading o kern/169898 net ifconfig(8) fails to set MTU on multiple interfaces. o kern/169676 net [bge] [hang] system hangs, fully or partially after re o kern/169620 net [ng] [pf] ng_l2tp incoming packet bypass pf firewall o kern/169459 net [ppp] umodem/ppp/3g stopped working after update from o kern/169438 net [ipsec] ipv4-in-ipv6 tunnel mode IPsec does not work p kern/168294 net [ixgbe] [patch] ixgbe driver compiled in kernel has no o kern/168246 net [em] Multiple em(4) not working with qemu o kern/168245 net [arp] [regression] Permanent ARP entry not deleted on o kern/168244 net [arp] [regression] Unable to manually remove permanent o kern/168183 net [bce] bce driver hang system o kern/167603 net [ip] IP fragment reassembly's broken: file transfer ov o kern/167500 net [em] [panic] Kernel panics in em driver o kern/167325 net [netinet] [patch] sosend sometimes return EINVAL with o kern/167202 net [igmp]: Sending multiple IGMP packets crashes kernel o kern/166462 net [gre] gre(4) when using a tunnel source address from c o kern/166285 net [arp] FreeBSD v8.1 REL p8 arp: unknown hardware addres o kern/166255 net [net] [patch] It should be possible to 
disable "promis p kern/165903 net mbuf leak o kern/165622 net [ndis][panic][patch] Unregistered use of FPU in kernel s kern/165562 net [request] add support for Intel i350 in FreeBSD 7.4 o kern/165526 net [bxe] UDP packets checksum calculation whithin if_bxe o kern/165488 net [ppp] [panic] Fatal trap 12 jails and ppp , kernel wit o kern/165305 net [ip6] [request] Feature parity between IP_TOS and IPV6 o kern/165296 net [vlan] [patch] Fix EVL_APPLY_VLID, update EVL_APPLY_PR o kern/165181 net [igb] igb freezes after about 2 weeks of uptime o kern/165174 net [patch] [tap] allow tap(4) to keep its address on clos o kern/165152 net [ip6] Does not work through the issue of ipv6 addresse o kern/164495 net [igb] connect double head igb to switch cause system t o kern/164490 net [pfil] Incorrect IP checksum on pfil pass from ip_outp o kern/164475 net [gre] gre misses RUNNING flag after a reboot o kern/164265 net [netinet] [patch] tcp_lro_rx computes wrong checksum i o kern/163903 net [igb] "igb0:tx(0)","bpf interface lock" v2.2.5 9-STABL o kern/163481 net freebsd do not add itself to ping route packet o kern/162927 net [tun] Modem-PPP error ppp[1538]: tun0: Phase: Clearing o kern/162558 net [dummynet] [panic] seldom dummynet panics o kern/162153 net [em] intel em driver 7.2.4 don't compile o kern/162110 net [igb] [panic] RELENG_9 panics on boot in IGB driver - o kern/162028 net [ixgbe] [patch] misplaced #endif in ixgbe.c o kern/161277 net [em] [patch] BMC cannot receive IPMI traffic after loa o kern/160873 net [igb] igb(4) from HEAD fails to build on 7-STABLE o kern/160750 net Intel PRO/1000 connection breaks under load until rebo o kern/160693 net [gif] [em] Multicast packet are not passed from GIF0 t o kern/160293 net [ieee80211] ppanic] kernel panic during network setup o kern/160206 net [gif] gifX stops working after a while (IPv6 tunnel) o kern/159817 net [udp] write UDPv4: No buffer space available (code=55) o kern/159629 net [ipsec] [panic] kernel panic with IPsec in 
transport m o kern/159621 net [tcp] [panic] panic: soabort: so_count o kern/159603 net [netinet] [patch] in_ifscrubprefix() - network route c o kern/159601 net [netinet] [patch] in_scrubprefix() - loopback route re o kern/159294 net [em] em watchdog timeouts o kern/159203 net [wpi] Intel 3945ABG Wireless LAN not support IBSS o kern/158930 net [bpf] BPF element leak in ifp->bpf_if->bif_dlist o kern/158726 net [ip6] [patch] ICMPv6 Router Announcement flooding limi o kern/158694 net [ix] [lagg] ix0 is not working within lagg(4) o kern/158665 net [ip6] [panic] kernel pagefault in in6_setscope() o kern/158635 net [em] TSO breaks BPF packet captures with em driver f kern/157802 net [dummynet] [panic] kernel panic in dummynet o kern/157785 net amd64 + jail + ipfw + natd = very slow outbound traffi o kern/157418 net [em] em driver lockup during boot on Supermicro X9SCM- o kern/157410 net [ip6] IPv6 Router Advertisements Cause Excessive CPU U o kern/157287 net [re] [panic] INVARIANTS panic (Memory modified after f o kern/157200 net [network.subr] [patch] stf(4) can not communicate betw o kern/157182 net [lagg] lagg interface not working together with epair o kern/156877 net [dummynet] [panic] dummynet move_pkt() null ptr derefe o kern/156667 net [em] em0 fails to init on CURRENT after March 17 o kern/156408 net [vlan] Routing failure when using VLANs vs. 
Physical e o kern/156328 net [icmp]: host can ping other subnet but no have IP from o kern/156317 net [ip6] Wrong order of IPv6 NS DAD/MLD Report o kern/156279 net [if_bridge][divert][ipfw] unable to correctly re-injec o kern/156226 net [lagg]: failover does not announce the failover to swi o kern/156030 net [ip6] [panic] Crash in nd6_dad_start() due to null ptr o kern/155680 net [multicast] problems with multicast s kern/155642 net [new driver] [request] Add driver for Realtek RTL8191S o kern/155597 net [panic] Kernel panics with "sbdrop" message o kern/155420 net [vlan] adding vlan break existent vlan o kern/155177 net [route] [panic] Panic when inject routes in kernel o kern/155010 net [msk] ntfs-3g via iscsi using msk driver cause kernel o kern/154943 net [gif] ifconfig gifX create on existing gifX clears IP s kern/154851 net [new driver] [request]: Port brcm80211 driver from Lin o kern/154850 net [netgraph] [patch] ng_ether fails to name nodes when t o kern/154679 net [em] Fatal trap 12: "em1 taskq" only at startup (8.1-R o kern/154600 net [tcp] [panic] Random kernel panics on tcp_output o kern/154557 net [tcp] Freeze tcp-session of the clients, if in the gat o kern/154443 net [if_bridge] Kernel module bridgestp.ko missing after u o kern/154286 net [netgraph] [panic] 8.2-PRERELEASE panic in netgraph o kern/154255 net [nfs] NFS not responding o kern/154214 net [stf] [panic] Panic when creating stf interface o kern/154185 net race condition in mb_dupcl p kern/154169 net [multicast] [ip6] Node Information Query multicast add o kern/154134 net [ip6] stuck kernel state in LISTEN on ipv6 daemon whic o kern/154091 net [netgraph] [panic] netgraph, unaligned mbuf? 
o conf/154062 net [vlan] [patch] change to way of auto-generatation of v o kern/153937 net [ral] ralink panics the system (amd64 freeBSDD 8.X) wh o kern/153936 net [ixgbe] [patch] MPRC workaround incorrectly applied to o kern/153816 net [ixgbe] ixgbe doesn't work properly with the Intel 10g o kern/153772 net [ixgbe] [patch] sysctls reference wrong XON/XOFF varia o kern/153497 net [netgraph] netgraph panic due to race conditions o kern/153454 net [patch] [wlan] [urtw] Support ad-hoc and hostap modes o kern/153308 net [em] em interface use 100% cpu o kern/153244 net [em] em(4) fails to send UDP to port 0xffff o kern/152893 net [netgraph] [panic] 8.2-PRERELEASE panic in netgraph o kern/152853 net [em] tftpd (and likely other udp traffic) fails over e o kern/152828 net [em] poor performance on 8.1, 8.2-PRE o kern/152569 net [net]: Multiple ppp connections and routing table prob o kern/152235 net [arp] Permanent local ARP entries are not properly upd o kern/152141 net [vlan] [patch] encapsulate vlan in ng_ether before out o kern/152036 net [libc] getifaddrs(3) returns truncated sockaddrs for n o kern/151690 net [ep] network connectivity won't work until dhclient is o kern/151681 net [nfs] NFS mount via IPv6 leads to hang on client with o kern/151593 net [igb] [panic] Kernel panic when bringing up igb networ o kern/150920 net [ixgbe][igb] Panic when packets are dropped with heade o kern/150557 net [igb] igb0: Watchdog timeout -- resetting o kern/150251 net [patch] [ixgbe] Late cable insertion broken o kern/150249 net [ixgbe] Media type detection broken o bin/150224 net ppp(8) does not reassign static IP after kill -KILL co f kern/149969 net [wlan] [ral] ralink rt2661 fails to maintain connectio o kern/149643 net [rum] device not sending proper beacon frames in ap mo o kern/149609 net [panic] reboot after adding second default route o kern/149117 net [inet] [patch] in_pcbbind: redundant test o kern/149086 net [multicast] Generic multicast join failure in 8.1 o kern/148018 
net [flowtable] flowtable crashes on ia64 o kern/147912 net [boot] FreeBSD 8 Beta won't boot on Thinkpad i1300 11 o kern/147894 net [ipsec] IPv6-in-IPv4 does not work inside an ESP-only o kern/147155 net [ip6] setfb not work with ipv6 o kern/146845 net [libc] close(2) returns error 54 (connection reset by f kern/146792 net [flowtable] flowcleaner 100% cpu's core load o kern/146719 net [pf] [panic] PF or dumynet kernel panic o kern/146534 net [icmp6] wrong source address in echo reply o kern/146427 net [mwl] Additional virtual access points don't work on m f kern/146394 net [vlan] IP source address for outgoing connections o bin/146377 net [ppp] [tun] Interface doesn't clear addresses when PPP o kern/146358 net [vlan] wrong destination MAC address o kern/146165 net [wlan] [panic] Setting bssid in adhoc mode causes pani o kern/146037 net [panic] mpd + CoA = kernel panic o kern/145825 net [panic] panic: soabort: so_count o kern/145728 net [lagg] Stops working lagg between two servers. p kern/145600 net TCP/ECN behaves different to CE/CWR than ns2 reference f kern/144917 net [flowtable] [panic] flowtable crashes system [regressi o kern/144882 net MacBookPro =>4.1 does not connect to BSD in hostap wit o kern/144874 net [if_bridge] [patch] if_bridge frees mbuf after pfil ho o conf/144700 net [rc.d] async dhclient breaks stuff for too many people o kern/144616 net [nat] [panic] ip_nat panic FreeBSD 7.2 f kern/144315 net [ipfw] [panic] freebsd 8-stable reboot after add ipfw o kern/144231 net bind/connect/sendto too strict about sockaddr length o kern/143846 net [gif] bringing gif3 tunnel down causes gif0 tunnel to s kern/143673 net [stf] [request] there should be a way to support multi o kern/143622 net [pfil] [patch] unlock pfil lock while calling firewall o kern/143593 net [ipsec] When using IPSec, tcpdump doesn't show outgoin o kern/143591 net [ral] RT2561C-based DLink card (DWL-510) fails to work o kern/143208 net [ipsec] [gif] IPSec over gif interface not working o 
kern/143034 net [panic] system reboots itself in tcp code [regression] o kern/142877 net [hang] network-related repeatable 8.0-STABLE hard hang o kern/142774 net Problem with outgoing connections on interface with mu o kern/142772 net [libc] lla_lookup: new lle malloc failed f kern/142518 net [em] [lagg] Problem on 8.0-STABLE with em and lagg o kern/142018 net [iwi] [patch] Possibly wrong interpretation of beacon- o kern/141861 net [wi] data garbled with WEP and wi(4) with Prism 2.5 f kern/141741 net Etherlink III NIC won't work after upgrade to FBSD 8, o kern/140742 net rum(4) Two asus-WL167G adapters cannot talk to each ot o kern/140682 net [netgraph] [panic] random panic in netgraph f kern/140634 net [vlan] destroying if_lagg interface with if_vlan membe o kern/140619 net [ifnet] [patch] refine obsolete if_var.h comments desc o kern/140346 net [wlan] High bandwidth use causes loss of wlan connecti o kern/140142 net [ip6] [panic] FreeBSD 7.2-amd64 panic w/IPv6 o kern/140066 net [bwi] install report for 8.0 RC 2 (multiple problems) o kern/139387 net [ipsec] Wrong lenth of PF_KEY messages in promiscuous o bin/139346 net [patch] arp(8) add option to remove static entries lis o kern/139268 net [if_bridge] [patch] allow if_bridge to forward just VL p kern/139204 net [arp] DHCP server replies rejected, ARP entry lost bef o kern/139117 net [lagg] + wlan boot timing (EBUSY) o kern/138850 net [dummynet] dummynet doesn't work correctly on a bridge o kern/138782 net [panic] sbflush_internal: cc 0 || mb 0xffffff004127b00 o kern/138688 net [rum] possibly broken on 8 Beta 4 amd64: able to wpa a o kern/138678 net [lo] FreeBSD does not assign linklocal address to loop o kern/138407 net [gre] gre(4) interface does not come up after reboot o kern/138332 net [tun] [lor] ifconfig tun0 destroy causes LOR if_adata/ o kern/138266 net [panic] kernel panic when udp benchmark test used as r f kern/138029 net [bpf] [panic] periodically kernel panic and reboot o kern/137881 net [netgraph] 
[panic] ng_pppoe fatal trap 12 p bin/137841 net [patch] wpa_supplicant(8) cannot verify SHA256 signed p kern/137776 net [rum] panic in rum(4) driver on 8.0-BETA2 o bin/137641 net ifconfig(8): various problems with "vlan_device.vlan_i o kern/137392 net [ip] [panic] crash in ip_nat.c line 2577 o kern/137372 net [ral] FreeBSD doesn't support wireless interface from o kern/137089 net [lagg] lagg falsely triggers IPv6 duplicate address de o kern/136911 net [netgraph] [panic] system panic on kldload ng_bpf.ko t o kern/136618 net [pf][stf] panic on cloning interface without unit numb o kern/135502 net [periodic] Warning message raised by rtfree function i o kern/134583 net [hang] Machine with jail freezes after random amount o o kern/134531 net [route] [panic] kernel crash related to routes/zebra o kern/134157 net [dummynet] dummynet loads cpu for 100% and make a syst o kern/133969 net [dummynet] [panic] Fatal trap 12: page fault while in o kern/133968 net [dummynet] [panic] dummynet kernel panic o kern/133736 net [udp] ip_id not protected ... 
o kern/133595 net [panic] Kernel Panic at pcpu.h:195 o kern/133572 net [ppp] [hang] incoming PPTP connection hangs the system o kern/133490 net [bpf] [panic] 'kmem_map too small' panic on Dell r900 o kern/133235 net [netinet] [patch] Process SIOCDLIFADDR command incorre f kern/133213 net arp and sshd errors on 7.1-PRERELEASE o kern/133060 net [ipsec] [pfsync] [panic] Kernel panic with ipsec + pfs o kern/132889 net [ndis] [panic] NDIS kernel crash on load BCM4321 AGN d o conf/132851 net [patch] rc.conf(5): allow to setfib(1) for service run o kern/132734 net [ifmib] [panic] panic in net/if_mib.c o kern/132705 net [libwrap] [patch] libwrap - infinite loop if hosts.all o kern/132672 net [ndis] [panic] ndis with rt2860.sys causes kernel pani o kern/132354 net [nat] Getting some packages to ipnat(8) causes crash o kern/131781 net [ndis] ndis keeps dropping the link o kern/131776 net [wi] driver fails to init o kern/131753 net [altq] [panic] kernel panic in hfsc_dequeue o bin/131365 net route(8): route add changes interpretation of network f kern/130820 net [ndis] wpa_supplicant(8) returns 'no space on device' o kern/130628 net [nfs] NFS / rpc.lockd deadlock on 7.1-R o kern/130525 net [ndis] [panic] 64 bit ar5008 ndisgen-erated driver cau o kern/130311 net [wlan_xauth] [panic] hostapd restart causing kernel pa o kern/130109 net [ipfw] Can not set fib for packets originated from loc f kern/130059 net [panic] Leaking 50k mbufs/hour f kern/129719 net [nfs] [panic] Panic during shutdown, tcp_ctloutput: in o kern/129517 net [ipsec] [panic] double fault / stack overflow f kern/129508 net [carp] [panic] Kernel panic with EtherIP (may be relat o kern/129219 net [ppp] Kernel panic when using kernel mode ppp o kern/129197 net [panic] 7.0 IP stack related panic o kern/129036 net [ipfw] 'ipfw fwd' does not change outgoing interface n o bin/128954 net ifconfig(8) deletes valid routes o bin/128602 net [an] wpa_supplicant(8) crashes with an(4) o kern/128448 net [nfs] 6.4-RC1 Boot Fails 
if NFS Hostname cannot be res o bin/128295 net [patch] ifconfig(8) does not print TOE4 or TOE6 capabi o bin/128001 net wpa_supplicant(8), wlan(4), and wi(4) issues o kern/127826 net [iwi] iwi0 driver has reduced performance and connecti o kern/127815 net [gif] [patch] if_gif does not set vlan attributes from o kern/127724 net [rtalloc] rtfree: 0xc5a8f870 has 1 refs f bin/127719 net [arp] arp: Segmentation fault (core dumped) f kern/127528 net [icmp]: icmp socket receives icmp replies not owned by p kern/127360 net [socket] TOE socket options missing from sosetopt() o bin/127192 net routed(8) removes the secondary alias IP of interface f kern/127145 net [wi]: prism (wi) driver crash at bigger traffic o kern/126895 net [patch] [ral] Add antenna selection (marked as TBD) o kern/126874 net [vlan]: Zebra problem if ifconfig vlanX destroy o kern/126695 net rtfree messages and network disruption upon use of if_ o kern/126339 net [ipw] ipw driver drops the connection o kern/126075 net [inet] [patch] internet control accesses beyond end of o bin/125922 net [patch] Deadlock in arp(8) o kern/125920 net [arp] Kernel Routing Table loses Ethernet Link status o kern/125845 net [netinet] [patch] tcp_lro_rx() should make use of hard o kern/125258 net [socket] socket's SO_REUSEADDR option does not work o kern/125239 net [gre] kernel crash when using gre o kern/124341 net [ral] promiscuous mode for wireless device ral0 looses o kern/124225 net [ndis] [patch] ndis network driver sometimes loses net o kern/124160 net [libc] connect(2) function loops indefinitely o kern/124021 net [ip6] [panic] page fault in nd6_output() o kern/123968 net [rum] [panic] rum driver causes kernel panic with WPA. 
o kern/123892 net [tap] [patch] No buffer space available
o kern/123890 net [ppp] [panic] crash & reboot on work with PPP low-spee
o kern/123858 net [stf] [patch] stf not usable behind a NAT
o kern/123758 net [panic] panic while restarting net/freenet6
o bin/123633 net ifconfig(8) doesn't set inet and ether address in one
o kern/123559 net [iwi] iwi periodically disassociates/associates [regre
o bin/123465 net [ip6] route(8): route add -inet6 -interfac
o kern/123463 net [ipsec] [panic] repeatable crash related to ipsec-tool
o conf/123330 net [nsswitch.conf] Enabling samba wins in nsswitch.conf c
o kern/123160 net [ip] Panic and reboot at sysctl kern.polling.enable=0
o kern/122989 net [swi] [panic] 6.3 kernel panic in swi1: net
o kern/122954 net [lagg] IPv6 EUI64 incorrectly chosen for lagg devices
f kern/122780 net [lagg] tcpdump on lagg interface during high pps wedge
o kern/122685 net It is not visible passing packets in tcpdump(1)
o kern/122319 net [wi] imposible to enable ad-hoc demo mode with Orinoco
o kern/122290 net [netgraph] [panic] Netgraph related "kmem_map too smal
o kern/122252 net [ipmi] [bge] IPMI problem with BCM5704 (does not work
o kern/122033 net [ral] [lor] Lock order reversal in ral0 at bootup ieee
o bin/121895 net [patch] rtsol(8)/rtsold(8) doesn't handle managed netw
s kern/121774 net [swi] [panic] 6.3 kernel panic in swi1: net
o kern/121555 net [panic] Fatal trap 12: current process = 12 (swi1: net
o kern/121534 net [ipl] [nat] FreeBSD Release 6.3 Kernel Trap 12:
o kern/121443 net [gif] [lor] icmp6_input/nd6_lookup
o kern/121437 net [vlan] Routing to layer-2 address does not work on VLA
o bin/121359 net [patch] [security] ppp(8): fix local stack overflow in
o kern/121257 net [tcp] TSO + natd -> slow outgoing tcp traffic
o kern/121181 net [panic] Fatal trap 3: breakpoint instruction fault whi
o kern/120966 net [rum] kernel panic with if_rum and WPA encryption
o kern/120566 net [request]: ifconfig(8) make order of arguments more fr
o kern/120304 net [netgraph] [patch] netgraph source assumes 32-bit time
o kern/120266 net [udp] [panic] gnugk causes kernel panic when closing U
o bin/120060 net routed(8) deletes link-level routes in the presence of
o kern/119945 net [rum] [panic] rum device in hostap mode, cause kernel
o kern/119791 net [nfs] UDP NFS mount of aliased IP addresses from a Sol
o kern/119617 net [nfs] nfs error on wpa network when reseting/shutdown
f kern/119516 net [ip6] [panic] _mtx_lock_sleep: recursed on non-recursi
o kern/119432 net [arp] route add -host -iface causes arp e
o kern/119225 net [wi] 7.0-RC1 no carrier with Prism 2.5 wifi card [regr
o kern/118727 net [netgraph] [patch] [request] add new ng_pf module
o kern/117423 net [vlan] Duplicate IP on different interfaces
o bin/117339 net [patch] route(8): loading routing management commands
o bin/116643 net [patch] [request] fstat(1): add INET/INET6 socket deta
o kern/116185 net [iwi] if_iwi driver leads system to reboot
o kern/115239 net [ipnat] panic with 'kmem_map too small' using ipnat
o kern/115019 net [netgraph] ng_ether upper hook packet flow stops on ad
o kern/115002 net [wi] if_wi timeout. failed allocation (busy bit). ifco
o kern/114915 net [patch] [pcn] pcn (sys/pci/if_pcn.c) ethernet driver f
o kern/113432 net [ucom] WARNING: attempt to net_add_domain(netgraph) af
o kern/112722 net [ipsec] [udp] IP v4 udp fragmented packet reject
o kern/112686 net [patm] patm driver freezes System (FreeBSD 6.2-p4) i38
o bin/112557 net [patch] ppp(8) lock file should not use symlink name
o kern/112528 net [nfs] NFS over TCP under load hangs with "impossible p
o kern/111537 net [inet6] [patch] ip6_input() treats mbuf cluster wrong
o kern/111457 net [ral] ral(4) freeze
o kern/110284 net [if_ethersubr] Invalid Assumption in SIOCSIFADDR in et
o kern/110249 net [kernel] [regression] [patch] setsockopt() error regre
o kern/109470 net [wi] Orinoco Classic Gold PC Card Can't Channel Hop
o bin/108895 net pppd(8): PPPoE dead connections on 6.2 [regression]
f kern/108197 net [panic] [gif] [ip6] if_delmulti reference counting pan
o kern/107944 net [wi] [patch] Forget to unlock mutex-locks
o conf/107035 net [patch] bridge(8): bridge interface given in rc.conf n
o kern/106444 net [netgraph] [panic] Kernel Panic on Binding to an ip to
o kern/106316 net [dummynet] dummynet with multipass ipfw drops packets
o kern/105945 net Address can disappear from network interface
s kern/105943 net Network stack may modify read-only mbuf chain copies
o bin/105925 net problems with ifconfig(8) and vlan(4) [regression]
o kern/104851 net [inet6] [patch] On link routes not configured when usi
o kern/104751 net [netgraph] kernel panic, when getting info about my tr
o kern/104738 net [inet] [patch] Reentrant problem with inet_ntoa in the
o kern/103191 net Unpredictable reboot
o kern/103135 net [ipsec] ipsec with ipfw divert (not NAT) encodes a pac
o kern/102540 net [netgraph] [patch] supporting vlan(4) by ng_fec(4)
o conf/102502 net [netgraph] [patch] ifconfig name does't rename netgrap
o kern/102035 net [plip] plip networking disables parallel port printing
o kern/100709 net [libc] getaddrinfo(3) should return TTL info
o kern/100519 net [netisr] suggestion to fix suboptimal network polling
o kern/98597 net [inet6] Bug in FreeBSD 6.1 IPv6 link-local DAD procedu
o bin/98218 net wpa_supplicant(8) blacklist not working
o kern/97306 net [netgraph] NG_L2TP locks after connection with failed
o conf/97014 net [gif] gifconfig_gif? in rc.conf does not recognize IPv
f kern/96268 net [socket] TCP socket performance drops by 3000% if pack
o kern/95519 net [ral] ral0 could not map mbuf
o kern/95288 net [pppd] [tty] [panic] if_ppp panic in sys/kern/tty_subr
o kern/95277 net [netinet] [patch] IP Encapsulation mask_match() return
o kern/95267 net packet drops periodically appear
f kern/93378 net [tcp] Slow data transfer in Postfix and Cyrus IMAP (wo
o kern/93019 net [ppp] ppp and tunX problems: no traffic after restarti
o kern/92880 net [libc] [patch] almost rewritten inet_network(3) functi
s kern/92279 net [dc] Core faults everytime I reboot, possible NIC issu
o kern/91859 net [ndis] if_ndis does not work with Asus WL-138
o kern/91364 net [ral] [wep] WF-511 RT2500 Card PCI and WEP
o kern/91311 net [aue] aue interface hanging
o kern/87421 net [netgraph] [panic]: ng_ether + ng_eiface + if_bridge
o kern/86871 net [tcp] [patch] allocation logic for PCBs in TIME_WAIT s
o kern/86427 net [lor] Deadlock with FASTIPSEC and nat
o kern/85780 net 'panic: bogus refcnt 0' in routing/ipv6
o bin/85445 net ifconfig(8): deprecated keyword to ifconfig inoperativ
o bin/82975 net route change does not parse classfull network as given
o kern/82881 net [netgraph] [panic] ng_fec(4) causes kernel panic after
o kern/82468 net Using 64MB tcp send/recv buffers, trafficflow stops, i
o bin/82185 net [patch] ndp(8) can delete the incorrect entry
o kern/81095 net IPsec connection stops working if associated network i
o kern/78968 net FreeBSD freezes on mbufs exhaustion (network interface
o kern/78090 net [ipf] ipf filtering on bridged packets doesn't work if
o kern/77341 net [ip6] problems with IPV6 implementation
o kern/75873 net Usability problem with non-RFC-compliant IP spoof prot
s kern/75407 net [an] an(4): no carrier after short time
a kern/71474 net [route] route lookup does not skip interfaces marked d
o kern/71469 net default route to internet magically disappears with mu
o kern/68889 net [panic] m_copym, length > size of mbuf chain
o kern/66225 net [netgraph] [patch] extend ng_eiface(4) control message
o kern/65616 net IPSEC can't detunnel GRE packets after real ESP encryp
s kern/60293 net [patch] FreeBSD arp poison patch
a kern/56233 net IPsec tunnel (ESP) over IPv6: MTU computation is wrong
s bin/41647 net ifconfig(8) doesn't accept lladdr along with inet addr
o kern/39937 net ipstealth issue
a kern/38554 net [patch] changing interface ipaddress doesn't seem to w
o kern/31940 net ip queue length too short for >500kpps
o kern/31647 net [libc] socket calls can return undocumented EINVAL
o kern/30186 net [libc] getaddrinfo(3) does not handle incorrect servna
f kern/24959 net [patch] proper TCP_NOPUSH/TCP_CORK compatibility
o conf/23063 net [arp] [patch] for static ARP tables in rc.network
o kern/21998 net [socket] [patch] ident only for outgoing connections
o kern/5877 net [socket] sb_cc counts control data as well as data dat

475 problems total.
From owner-freebsd-net@FreeBSD.ORG Mon Jan 27 20:23:09 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id F40EDDC8; Mon, 27 Jan 2014 20:23:08 +0000 (UTC) Received: from mail-qc0-x22c.google.com (mail-qc0-x22c.google.com [IPv6:2607:f8b0:400d:c01::22c]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id A4C801DB2; Mon, 27 Jan 2014 20:23:08 +0000 (UTC) Received: by mail-qc0-f172.google.com with SMTP id c9so8901270qcz.31 for ; Mon, 27 Jan 2014 12:23:07 -0800 (PST) MIME-Version: 1.0 X-Received: by 10.224.122.208 with SMTP id m16mr46425650qar.55.1390854187930; Mon, 27 Jan 2014 12:23:07 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.224.52.8 with HTTP; Mon, 27 Jan 2014 12:23:07 -0800 (PST) Date: Mon, 27 Jan 2014 12:23:07 -0800 X-Google-Sender-Auth: VrWtfFIAKoQaqamrf79udvoViok Message-ID: Subject: flowtable - FL_HASH_ALL From: Adrian Chadd To: "freebsd-arch@freebsd.org" , FreeBSD Net Content-Type: text/plain; charset=ISO-8859-1
X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Jan 2014 20:23:09 -0000 Hi, What's FL_HASH_ALL supposed to do? Is the flowtable code going to do any kind of 4-tuple hashing if it isn't set? -a From owner-freebsd-net@FreeBSD.ORG Mon Jan 27 21:09:41 2014 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 0C24DDEC; Mon, 27 Jan 2014 21:09:41 +0000 (UTC) Received: from acme.spoerlein.net (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 7AE331146; Mon, 27 Jan 2014 21:09:40 +0000 (UTC) Received: from localhost (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1]) by acme.spoerlein.net (8.14.7/8.14.7) with ESMTP id s0RL9cF5027666 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 27 Jan 2014 22:09:38 +0100 (CET) (envelope-from uqs@FreeBSD.org) Date: Mon, 27 Jan 2014 22:09:37 +0100 From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= To: Luigi Rizzo Subject: Re: unused in_cksum_update() ? 
Message-ID: <20140127210937.GB93124@acme.spoerlein.net> Mail-Followup-To: Luigi Rizzo , Gleb Smirnoff , wollman@freebsd.org, current@freebsd.org, net@freebsd.org References: <20140109192114.GA49934@onelab2.iet.unipi.it> <20140110103140.GD73147@FreeBSD.org> <20140110182448.GA62317@onelab2.iet.unipi.it> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140110182448.GA62317@onelab2.iet.unipi.it> User-Agent: Mutt/1.5.22 (2013-10-16) Cc: wollman@freebsd.org, current@freebsd.org, net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Jan 2014 21:09:41 -0000 On Fri, 2014-01-10 at 19:24:48 +0100, Luigi Rizzo wrote: > On Fri, Jan 10, 2014 at 02:31:40PM +0400, Gleb Smirnoff wrote: > > On Thu, Jan 09, 2014 at 08:21:14PM +0100, Luigi Rizzo wrote: > > L> a lot of arch-specific headers (sys/${ARCH}/include/in_cksum.h) > > L> have a lengthy definition for > > L> > > L> in_cksum_update(struct ip *ip) > > L> > > L> which seems completely unused in our source tree. > > L> Time to remove it perhaps ? > > L> > > L> grep cannot find any use at least since stable/8 > > > > I'd prefer not to hurry with its removal. Might be that pf will use it. > > Since it lives in a header file, it doesn't add a single bit to kernel > > size. > > we should care more about obfuscation and correctness, and this is > a killer in both respects.
> Depending on $arch the function is either not available or wrong: > > In particular, the basic code follows the description in > http://tools.ietf.org/html/rfc1141 with ntohs/htons to deal > with endianness (note that the '256' should not be converted): > > tmp = ntohs(sum)+256; > tmp = tmp + (tmp >> 16); > sum = htons(tmp); // also truncates high bits > > It is correctly implemented (but in a totally generic way, so no > point to have it in the arch-specific files) for amd64, i386, > ia64, mips, powerpc; it is not implemented for arm, and it is wrong > for sparc64 (where the 256 is incorrectly replaced by a 1). > > In terms of usage: the svn repo suggests that it was added in r15884 > in 1996 (stable/2.2 is the first branch where it appears): > > http://svnweb.freebsd.org/base/head/sys/i386/include/in_cksum.h?r1=15884&r2=15883&pathrev=15884 > > As far as I can tell it was never used anywhere, and was copied from > place to place when we started to support different architectures. > > Shall we wait until it becomes 18 ? :) > > I am adding Garrett to the list as he may have more details. Git's "pickaxe" is a very good tool for this sort of code archeology. There's only a handful of commits that touched anything related to "in_cksum_update". I'm not going to dump the output of git log -S"in_cksum_update" here, just the revisions that add/remove that string.
r15884 r36849 r66458 r86144 r99040 r158458 r163022 r178172 r180010 hth Uli From owner-freebsd-net@FreeBSD.ORG Mon Jan 27 23:27:29 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4F99E9CE for ; Mon, 27 Jan 2014 23:27:29 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 125001D18 for ; Mon, 27 Jan 2014 23:27:28 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqQEABnr5lKDaFve/2dsb2JhbABag0RWgn25EE+BMXSCJQEBAQMBAQEBICsgCwUWGAICDRkCIwYBCSYOAgUEARwEh1ADCQgNqXWXJg2FVheBKYtOgTQQAgEbNAeCb4FJBIlIjAxngx6LK4VBg0seMYE9 X-IronPort-AV: E=Sophos;i="4.95,732,1384318800"; d="scan'208";a="90909892" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 27 Jan 2014 18:27:21 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 2F446B40EF; Mon, 27 Jan 2014 18:27:19 -0500 (EST) Date: Mon, 27 Jan 2014 18:27:19 -0500 (EST) From: Rick Macklem To: pyunyh@gmail.com Message-ID: <1168237133.17228249.1390865239175.JavaMail.root@uoguelph.ca> In-Reply-To: <20140127055047.GA1368@michelle.cdnetworks.com> Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: Daniel Braniss , freebsd-net@freebsd.org, Adam McDougall X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Jan 2014 23:27:29 -0000 pyunyh@gmail.com wrote: > On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote: > > Adam McDougall wrote: > > > Also try rsize=32768,wsize=32768 in your mount options, made a > > > huge > > > difference for me. I've noticed slow file transfers on NFS in 9 > > > and > > > finally did some searching a couple months ago, someone suggested > > > it > > > and > > > they were on to something. > > > > > I have a "hunch" that might explain why 64K NFS reads/writes > > perform > > poorly for some network environments. > > A 64K NFS read reply/write request consists of a list of 34 mbufs > > when > > passed to TCP via sosend() and a total data length of around > > 65680 bytes. > > Looking at a couple of drivers (virtio and ixgbe), they seem to > > expect > > no more than 32-33 mbufs in a list for a 65535 byte TSO xmit. I > > think > > (I don't have anything that does TSO to confirm this) that NFS will > > pass > > a list that is longer (34 plus a TCP/IP header). > > At a glance, it appears that the drivers call m_defrag() or > > m_collapse() > > when the mbuf list won't fit in their scatter table (32 or 33 > > elements) > > and if this fails, just silently drop the data without sending it. > > If I'm right, there would be considerable overhead from > > m_defrag()/m_collapse() > > and near disaster if they fail to fix the problem and the data is > > silently > > dropped instead of xmited.
> > > > I think the actual number of DMA segments allocated for the mbuf > chain is determined by bus_dma(9). bus_dma(9) will coalesce > the current segment with the previous segment if possible. > Ok, I'll have to take a look, but I thought that an array sized by "num_segs" is passed in as an argument. (And num_segs is set to either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).) It looked to me that the ixgbe driver called itself ix, so it isn't obvious to me which we are talking about. (I know that Daniel Braniss had an ix0 and ix1, which were fixed for NFS by disabling TSO.) I'll admit I mostly looked at virtio's network driver, since that was the one being used by J David. Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have been cropping up for quite a while, and I am just trying to find out why. (I have no hardware/software that exhibits the problem, so I can only look at the sources and ask others to try testing stuff.) > I'm not sure whether you're referring to ixgbe(4) or ix(4) but I > see the total length of all segments for ix(4) is 65535 so > it has no room for the ethernet/VLAN header of the mbuf chain. The > driver should be fixed to transmit a 64KB datagram. Well, if_hw_tsomax is set to 65535 by the generic code (the driver doesn't set it) and the code in tcp_output() seems to subtract the size of a tcp/ip header from that before passing data to the driver, so I think the mbuf chain passed to the driver will fit in one ip datagram. (I'd assume all sorts of stuff would break for TSO enabled drivers if that wasn't the case?) > I think the use of m_defrag(9) in TSO is suboptimal. All TSO > capable controllers are able to handle multiple TX buffers so it > should have used m_collapse(9) rather than copying the entire chain > with m_defrag(9). > I haven't looked at these closely yet (plan on doing so to-day), but even m_collapse() looked like it copied data between mbufs and that is certainly suboptimal, imho.
I don't see why a driver can't split the mbuf list, if there are too many entries for the scatter/gather and do it in two iterations (much like tcp_output() does already, since the data length exceeds 65535 - tcp/ip header size). However, at this point, I just want to find out if the long chain of mbufs is why TSO is problematic for these drivers, since I'll admit I'm getting tired of telling people to disable TSO (and I suspect some don't believe me and never try it). > > Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE > > clusters, > > so the mbuf count drops from 34 to 18. > > > > Could we make it conditional on size? > Not sure what you mean? If you mean "the size of the read/write", that would be possible for NFSv3, but less so for NFSv4. (The read/write is just one Op. in the compound for NFSv4 and there is no way to predict how much more data is going to be generated by subsequent Ops.) If by "size" you mean amount of memory in the machine then, yes, it certainly could be conditional on that. (I plan to try and look at the allocator to-day as well, but if others know of disadvantages with using MJUMPAGESIZE instead of MCLBYTES, please speak up.) Garrett Wollman already alluded to the MCLBYTES case being pre-allocated, but I'll admit I have no idea what the implications of that are at this time. > > If anyone has a TSO scatter/gather enabled net interface and can > > test this > > patch on it with NFS I/O (default of 64K rsize/wsize) when TSO is > > enabled > > and see what effect it has, that would be appreciated. > > > > Btw, thanks go to Garrett Wollman for suggesting the change to > > MJUMPAGESIZE > > clusters. > > > > rick > > ps: If the attachment doesn't make it through and you want the > > patch, just > > email me and I'll send you a copy. 
> > > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Mon Jan 27 23:47:15 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id ADE3C47E for ; Mon, 27 Jan 2014 23:47:15 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 706201E4E for ; Mon, 27 Jan 2014 23:47:14 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqQEALHv5lKDaFve/2dsb2JhbABXA4NEVoJ9uRFPgTF0giUBAQEDAQEBASArIAsFFhgCAg0ZAikBCSYGCAcEARwBA4dcCA2peJ0IF4EpjQIKBgIBGyQQBxGCHkCBSQSJSIwMhAWQbINLHjF7Qg X-IronPort-AV: E=Sophos;i="4.95,732,1384318800"; d="scan'208";a="90913571" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 27 Jan 2014 18:47:10 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 7775DB40E7; Mon, 27 Jan 2014 18:47:10 -0500 (EST) Date: Mon, 27 Jan 2014 18:47:10 -0500 (EST) From: Rick Macklem To: John-Mark Gurney Message-ID: <222089865.17245782.1390866430479.JavaMail.root@uoguelph.ca> In-Reply-To: <20140127032338.GP13704@funkthat.com> Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.203] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, Adam McDougall X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Jan 2014 23:47:15 -0000 John-Mark Gurney wrote: > Rick Macklem wrote this message on Sun, Jan 26, 2014 at 21:16 -0500: > > Btw, thanks go to Garrett Wollman for suggesting the change to > > MJUMPAGESIZE > > clusters. > > > > rick > > ps: If the attachment doesn't make it through and you want the > > patch, just > > email me and I'll send you a copy. > > The patch looks good, but we probably shouldn't change _readlink.. > The chances of a link being >2k are pretty slim, and the chances of > the link being >32k are even smaller... > Yea, I already thought of that, actually. However, see below w.r.t. NFSv4. However, at this point I mostly want to find out if it is the long mbuf chain that causes problems for TSO enabled network interfaces. > In fact, we might want to switch _readlink to MGET (could be > conditional > upon cnt) so that if it fits in an mbuf we don't allocate a cluster > for > it... > For NFSv4, what was an RPC for NFSv3 becomes one of several Ops. in a compound RPC. As such, there is no way to know how much additional RPC message there will be. So, although the readlink reply won't use much of the 4K allocation, replies for subsequent Ops. in the compound certainly could. (Is it more efficient to allocate 4K now and use part of it for subsequent message reply stuff or allocate additional mbuf clusters later for subsequent stuff, as required?
On a small memory constrained machine, I suspect the latter is correct, but for the kind of hardware that has TSO scatter/gather enabled network interfaces, I'm not so sure. At this point, I wouldn't even say that using 4K clusters is going to be a win and my hunch is that any win wouldn't apply to small memory constrained machines.) My test server has 256Mbytes of ram and it certainly doesn't show any improvement (big surprise;-), but it also doesn't show any degradation for the limited testing I've done. Again, my main interest at this point is whether reducing the number of mbufs in the chain fixes the TSO issues. I think the question of whether or not 4K clusters are a performance improvement in general is an interesting one that comes later. rick > -- > John-Mark Gurney Voice: +1 415 225 5579 > > "All that I will do, has been done, All that I have, has not." > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 00:28:31 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4CCA03B0 for ; Tue, 28 Jan 2014 00:28:31 +0000 (UTC) Received: from h2.funkthat.com (gate2.funkthat.com [208.87.223.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 2815E1156 for ; Tue, 28 Jan 2014 00:28:30 +0000 (UTC) Received: from h2.funkthat.com (localhost [127.0.0.1]) by h2.funkthat.com (8.14.3/8.14.3) with ESMTP id s0S0SRwC063807 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 27 Jan 2014 16:28:27 -0800 (PST) (envelope-from jmg@h2.funkthat.com)
Received: (from jmg@localhost) by h2.funkthat.com (8.14.3/8.14.3/Submit) id s0S0SQ4E063806; Mon, 27 Jan 2014 16:28:26 -0800 (PST) (envelope-from jmg) Date: Mon, 27 Jan 2014 16:28:26 -0800 From: John-Mark Gurney To: Rick Macklem Subject: Re: Terrible NFS performance under 9.2-RELEASE? Message-ID: <20140128002826.GU13704@funkthat.com> Mail-Followup-To: Rick Macklem , freebsd-net@freebsd.org, Adam McDougall References: <20140127032338.GP13704@funkthat.com> <222089865.17245782.1390866430479.JavaMail.root@uoguelph.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <222089865.17245782.1390866430479.JavaMail.root@uoguelph.ca> User-Agent: Mutt/1.4.2.3i X-Operating-System: FreeBSD 7.2-RELEASE i386 X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger? X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.2 (h2.funkthat.com [127.0.0.1]); Mon, 27 Jan 2014 16:28:27 -0800 (PST) Cc: freebsd-net@freebsd.org, Adam McDougall X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 00:28:31 -0000 Rick Macklem wrote this message on Mon, Jan 27, 2014 at 18:47 -0500: > John-Mark Gurney wrote: > > Rick Macklem wrote this message on Sun, Jan 26, 2014 at 21:16 -0500: > > > Btw, thanks go to Garrett Wollman for suggesting the change to > > > MJUMPAGESIZE > > > clusters. > > > > > > rick > > > ps: If the attachment doesn't make it through and you want the > > > patch, just > > > email me and I'll send you a copy. > > > > The patch looks good, but we probably shouldn't change _readlink.. 
> > The chances of a link being >2k are pretty slim, and the chances of > > the link being >32k are even smaller... > > > Yea, I already thought of that, actually. However, see below w.r.t. > NFSv4. > > However, at this point I > mostly want to find out if it the long mbuf chain that causes problems > for TSO enabled network interfaces. I agree, though a long mbuf chain is more of a driver issue than an NFS issue... > > In fact, we might want to switch _readlink to MGET (could be > > conditional > > upon cnt) so that if it fits in an mbuf we don't allocate a cluster > > for > > it... > > > For NFSv4, what was an RPC for NFSv3 becomes one of several Ops. in > a compound RPC. As such, there is no way to know how much additional > RPC message there will be. So, although the readlink reply won't use > much of the 4K allocation, replies for subsequent Ops. in the compound > certainly could. (Is it more efficient to allocate 4K now and use > part of it for subsequent message reply stuff or allocate additional > mbuf clusters later for subsequent stuff, as required? On a small > memory constrained machine, I suspect the latter is correct, but for > the kind of hardware that has TSO scatter/gather enabled network > interfaces, I'm not so sure. At this point, I wouldn't even say > that using 4K clusters is going to be a win and my hunch is that > any win wouldn't apply to small memory constrained machines.) Though the code that was patched wasn't using any partial buffers, it was always allocating a new buffer... If the code in _read/_readlinks starts using a previous mbuf chain, then obviously things are different and I'd agree, always allocating a 2k/4k cluster makes sense... > My test server has 256Mbytes of ram and it certainly doesn't show > any improvement (big surprise;-), but it also doesn't show any > degradation for the limited testing I've done. 
I'm not too surprised, unless you're on a heavy server pushing >200MB/sec, the allocation cost is probably cheap enough that it doesn't show up... going to 4k means immediately half as many mbufs are needed/allocated, and as they are page sized, don't have the problems of physical memory fragmentation, nor do they have to do an IPI/tlb shoot down in the case of multipage allocations... (I'm dealing w/ this for geli.) > Again, my main interest at this point is whether reducing the > number of mbufs in the chain fixes the TSO issues. I think > the question of whether or not 4K clusters are performance > improvement in general, is an interesting one that comes later. Another thing I noticed is that we are getting an mbuf and then allocating a cluster... Is there a reason we aren't using something like m_getm or m_getcl? We have a special uma zone that has mbuf and mbuf cluster already paired meaning we save some lock operations for each segment allocated... -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." 
From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 00:58:27 2014
Date: Tue, 28 Jan 2014 09:58:18 +0900
From: Yonghyeon PYUN
Reply-To: pyunyh@gmail.com
To: Rick Macklem
Cc: Daniel Braniss, freebsd-net@freebsd.org, Adam McDougall
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <20140128005818.GB2722@michelle.cdnetworks.com>
In-Reply-To: <1168237133.17228249.1390865239175.JavaMail.root@uoguelph.ca>

On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
> pyunyh@gmail.com wrote:
> > On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> > > Adam McDougall wrote:
> > > > Also try rsize=32768,wsize=32768 in your mount options, made a
> > > > huge difference for me. I've noticed slow file transfers on NFS
> > > > in 9 and finally did some searching a couple months ago, someone
> > > > suggested it and they were on to something.
> > >
> > > I have a "hunch" that might explain why 64K NFS reads/writes
> > > perform poorly for some network environments.
> > > A 64K NFS read reply/write request consists of a list of 34 mbufs
> > > when passed to TCP via sosend() and a total data length of around
> > > 65680 bytes.
> > > Looking at a couple of drivers (virtio and ixgbe), they seem to
> > > expect no more than 32-33 mbufs in a list for a 65535 byte TSO
> > > xmit. I think (I don't have anything that does TSO to confirm
> > > this) that NFS will pass a list that is longer (34 plus a TCP/IP
> > > header).
> > > At a glance, it appears that the drivers call m_defrag() or > > > m_collapse() > > > when the mbuf list won't fit in their scatter table (32 or 33 > > > elements) > > > and if this fails, just silently drop the data without sending it. > > > If I'm right, there would considerable overhead from > > > m_defrag()/m_collapse() > > > and near disaster if they fail to fix the problem and the data is > > > silently > > > dropped instead of xmited. > > > > > > > I think the actual number of DMA segments allocated for the mbuf > > chain is determined by bus_dma(9). bus_dma(9) will coalesce > > current segment with previous segment if possible. > > > Ok, I'll have to take a look, but I thought that an array of sized > by "num_segs" is passed in as an argument. (And num_segs is set to > either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).) > It looked to me that the ixgbe driver called itself ix, so it isn't > obvious to me which we are talking about. (I know that Daniel Braniss > had an ix0 and ix1, which were fixed for NFS by disabling TSO.) > It's ix(4). ixbge(4) is a different driver. > I'll admit I mostly looked at virtio's network driver, since that > was the one being used by J David. > > Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have been > cropping up for quite a while, and I am just trying to find out why. > (I have no hardware/software that exhibits the problem, so I can > only look at the sources and ask others to try testing stuff.) > > > I'm not sure whether you're referring to ixgbe(4) or ix(4) but I > > see the total length of all segment size of ix(4) is 65535 so > > it has no room for ethernet/VLAN header of the mbuf chain. The > > driver should be fixed to transmit a 64KB datagram. 
> Well, if_hw_tsomax is set to 65535 by the generic code (the driver
> doesn't set it) and the code in tcp_output() seems to subtract the
> size of an tcp/ip header from that before passing data to the driver,
> so I think the mbuf chain passed to the driver will fit in one
> ip datagram. (I'd assume all sorts of stuff would break for TSO
> enabled drivers if that wasn't the case?)

I believe the generic code is doing the right thing. I'm under the
impression that non-working TSO indicates a bug in the driver. Some
drivers didn't account for the additional ethernet/VLAN header, so the
total size of the DMA segments exceeded 65535. I've attached a diff
for ix(4). It wasn't tested at all as I don't have hardware to test.

> > I think the use of m_defrag(9) in TSO is suboptimal. All TSO
> > capable controllers are able to handle multiple TX buffers so it
> > should have used m_collapse(9) rather than copying entire chain
> > with m_defrag(9).
> >
> I haven't looked at these closely yet (plan on doing so to-day), but
> even m_collapse() looked like it copied data between mbufs and that
> is certainly suboptimal, imho. I don't see why a driver can't split
> the mbuf list, if there are too many entries for the scatter/gather
> and do it in two iterations (much like tcp_output() does already,
> since the data length exceeds 65535 - tcp/ip header size).

It can split the mbuf list if the controller supports an increased
number of TX buffers. But because the controller consumes the same
number of DMA descriptors for the mbuf list, drivers tend to impose a
limit on the number of TX buffers to save resources.

> However, at this point, I just want to find out if the long chain
> of mbufs is why TSO is problematic for these drivers, since I'll
> admit I'm getting tired of telling people to disable TSO (and I
> suspect some don't believe me and never try it).

TSO capable controllers tend to have various limitations (the first TX
buffer should have the complete ethernet/IP/TCP header, ip_len of the
IP header should be reset to 0, the TCP pseudo checksum should be
recomputed, etc.) and cheap controllers need more assistance from the
driver to let their firmware know the various IP/TCP header offset
locations in the mbuf. Because this requires IP/TCP header parsing,
it's error prone and very complex.

> > > Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE
> > > clusters, so the mbuf count drops from 34 to 18.
> > >
> > Could we make it conditional on size?
>
> Not sure what you mean? If you mean "the size of the read/write",
> that would be possible for NFSv3, but less so for NFSv4. (The
> read/write is just one Op. in the compound for NFSv4 and there is no
> way to predict how much more data is going to be generated by
> subsequent Ops.)

Sorry, I should have been clearer. You already answered my question.
Thanks.

> If by "size" you mean amount of memory in the machine then, yes, it
> certainly could be conditional on that. (I plan to try and look at
> the allocator to-day as well, but if others know of disadvantages
> with using MJUMPAGESIZE instead of MCLBYTES, please speak up.)
>
> Garrett Wollman already alluded to the MCLBYTES case being
> pre-allocated, but I'll admit I have no idea what the implications of
> that are at this time.
>
> > > If anyone has a TSO scatter/gather enabled net interface and can
> > > test this patch on it with NFS I/O (default of 64K rsize/wsize)
> > > when TSO is enabled and see what effect it has, that would be
> > > appreciated.
> > >
> > > Btw, thanks go to Garrett Wollman for suggesting the change to
> > > MJUMPAGESIZE clusters.
> > >
> > > rick
> > > ps: If the attachment doesn't make it through and you want the
> > > patch, just email me and I'll send you a copy.
> > >

[Attachment: ix.TSO.diff]

Index: sys/dev/ixgbe/ixv.h
===================================================================
--- sys/dev/ixgbe/ixv.h	(revision 260903)
+++ sys/dev/ixgbe/ixv.h	(working copy)
@@ -172,7 +172,7 @@
 #define IXV_SCATTER 32
 #define IXV_RX_HDR 128
 #define MSIX_BAR 3
-#define IXV_TSO_SIZE 65535
+#define IXV_TSO_SIZE (65535 + sizeof(struct ether_vlan_header))
 #define IXV_BR_SIZE 4096
 #define IXV_LINK_ITR 2000
 #define TX_BUFFER_SIZE ((u32) 1514)

From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 01:15:31 2014
Date: Mon, 27 Jan 2014 17:15:29 -0800
From: Jack Vogel
To: Pyun YongHyeon
Cc: Daniel Braniss, FreeBSD Net, Adam McDougall, Rick Macklem
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
In-Reply-To: <20140128005818.GB2722@michelle.cdnetworks.com>

That header file is for the VF driver :) which I don't believe is being
used in this case. The driver is capable of handling 256K but it's
limited by the stack to 64K (look in ixgbe.h), so it's not a few bytes
off due to the vlan header. The scatter size is not an arbitrary one,
it's due to hardware limitations in Niantic (82599).

Turning off TSO in the 10G environment is not practical, you will have
trouble getting good performance.

Jack

On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN wrote:

> On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
> > pyunyh@gmail.com wrote:
> > > On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> > > > Adam McDougall wrote:
> > > > > Also try rsize=32768,wsize=32768 in your mount options, made a
> > > > > huge difference for me. I've noticed slow file transfers on
> > > > > NFS in 9 and finally did some searching a couple months ago,
> > > > > someone suggested it and they were on to something.
> > > > > > > > > I have a "hunch" that might explain why 64K NFS reads/writes > > > > perform > > > > poorly for some network environments. > > > > A 64K NFS read reply/write request consists of a list of 34 mbufs > > > > when > > > > passed to TCP via sosend() and a total data length of around > > > > 65680bytes. > > > > Looking at a couple of drivers (virtio and ixgbe), they seem to > > > > expect > > > > no more than 32-33 mbufs in a list for a 65535 byte TSO xmit. I > > > > think > > > > (I don't have anything that does TSO to confirm this) that NFS will > > > > pass > > > > a list that is longer (34 plus a TCP/IP header). > > > > At a glance, it appears that the drivers call m_defrag() or > > > > m_collapse() > > > > when the mbuf list won't fit in their scatter table (32 or 33 > > > > elements) > > > > and if this fails, just silently drop the data without sending it. > > > > If I'm right, there would considerable overhead from > > > > m_defrag()/m_collapse() > > > > and near disaster if they fail to fix the problem and the data is > > > > silently > > > > dropped instead of xmited. > > > > > > > > > > I think the actual number of DMA segments allocated for the mbuf > > > chain is determined by bus_dma(9). bus_dma(9) will coalesce > > > current segment with previous segment if possible. > > > > > Ok, I'll have to take a look, but I thought that an array of sized > > by "num_segs" is passed in as an argument. (And num_segs is set to > > either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).) > > It looked to me that the ixgbe driver called itself ix, so it isn't > > obvious to me which we are talking about. (I know that Daniel Braniss > > had an ix0 and ix1, which were fixed for NFS by disabling TSO.) > > > > It's ix(4). ixbge(4) is a different driver. > > > I'll admit I mostly looked at virtio's network driver, since that > > was the one being used by J David. > > > > Problems w.r.t. 
TSO enabled for NFS using 64K rsize/wsize have been > > cropping up for quite a while, and I am just trying to find out why. > > (I have no hardware/software that exhibits the problem, so I can > > only look at the sources and ask others to try testing stuff.) > > > > > I'm not sure whether you're referring to ixgbe(4) or ix(4) but I > > > see the total length of all segment size of ix(4) is 65535 so > > > it has no room for ethernet/VLAN header of the mbuf chain. The > > > driver should be fixed to transmit a 64KB datagram. > > Well, if_hw_tsomax is set to 65535 by the generic code (the driver > > doesn't set it) and the code in tcp_output() seems to subtract the > > size of an tcp/ip header from that before passing data to the driver, > > so I think the mbuf chain passed to the driver will fit in one > > ip datagram. (I'd assume all sorts of stuff would break for TSO > > enabled drivers if that wasn't the case?) > > I believe the generic code is doing right. I'm under the > impression the non-working TSO indicates a bug in driver. Some > drivers didn't account for additional ethernet/VLAN header so the > total size of DMA segments exceeded 65535. I've attached a diff > for ix(4). It wasn't tested at all as I don't have hardware to > test. > > > > > > I think the use of m_defrag(9) in TSO is suboptimal. All TSO > > > capable controllers are able to handle multiple TX buffers so it > > > should have used m_collapse(9) rather than copying entire chain > > > with m_defrag(9). > > > > > I haven't looked at these closely yet (plan on doing so to-day), but > > even m_collapse() looked like it copied data between mbufs and that > > is certainly suboptimal, imho. I don't see why a driver can't split > > the mbuf list, if there are too many entries for the scatter/gather > > and do it in two iterations (much like tcp_output() does already, > > since the data length exceeds 65535 - tcp/ip header size). 
> > > > It can split the mbuf list if controllers supports increased number > of TX buffers. Because controller shall consume the same number of > DMA descriptors for the mbuf list, drivers tend to impose a limit > on the number of TX buffers to save resources. > > > However, at this point, I just want to find out if the long chain > > of mbufs is why TSO is problematic for these drivers, since I'll > > admit I'm getting tired of telling people to disable TSO (and I > > suspect some don't believe me and never try it). > > > > TSO capable controllers tend to have various limitations(the first > TX buffer should have complete ethernet/IP/TCP header, ip_len of IP > header should be reset to 0, TCP pseudo checksum should be > recomputed etc) and cheap controllers need more assistance from > driver to let its firmware know various IP/TCP header offset > location in the mbuf. Because this requires a IP/TCP header > parsing, it's error prone and very complex. > > > > > Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE > > > > clusters, > > > > so the mbuf count drops from 34 to 18. > > > > > > > > > > Could we make it conditional on size? > > > > > Not sure what you mean? If you mean "the size of the read/write", > > that would be possible for NFSv3, but less so for NFSv4. (The read/write > > is just one Op. in the compound for NFSv4 and there is no way to > > predict how much more data is going to be generated by subsequent Ops.) > > > > Sorry, I should have been more clearer. You already answered my > question. Thanks. > > > If by "size" you mean amount of memory in the machine then, yes, it > > certainly could be conditional on that. (I plan to try and look at > > the allocator to-day as well, but if others know of disadvantages with > > using MJUMPAGESIZE instead of MCLBYTES, please speak up.) 
> > Garrett Wollman already alluded to the MCLBYTES case being
> > pre-allocated, but I'll admit I have no idea what the implications
> > of that are at this time.
> >
> > > > If anyone has a TSO scatter/gather enabled net interface and can
> > > > test this patch on it with NFS I/O (default of 64K rsize/wsize)
> > > > when TSO is enabled and see what effect it has, that would be
> > > > appreciated.
> > > >
> > > > Btw, thanks go to Garrett Wollman for suggesting the change to
> > > > MJUMPAGESIZE clusters.
> > > >
> > > > rick
> > > > ps: If the attachment doesn't make it through and you want the
> > > > patch, just email me and I'll send you a copy.

From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 01:33:01 2014
Date: Mon, 27 Jan 2014 20:32:59 -0500 (EST)
From: Rick Macklem
To: John-Mark Gurney
Cc: freebsd-net@freebsd.org, Adam McDougall
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca>
In-Reply-To: <20140128002826.GU13704@funkthat.com>

John-Mark Gurney wrote:
> Rick Macklem wrote this message on Mon, Jan 27, 2014 at 18:47 -0500:
> > John-Mark Gurney wrote:
> > > Rick Macklem wrote this message on Sun, Jan 26, 2014 at 21:16
> > > -0500:
> > > > Btw, thanks go to Garrett Wollman for suggesting the change to
> > > > MJUMPAGESIZE clusters.
> > > >
> > > > rick
> > > > ps: If the attachment doesn't make it through and you want the
> > > > patch, just email me and I'll send you a copy.
> > >
> > > The patch looks good, but we probably shouldn't change _readlink..
> > > The chances of a link being >2k are pretty slim, and the chances
> > > of the link being >32k are even smaller...
> >
> > Yea, I already thought of that, actually. However, see below w.r.t.
> > NFSv4.
> >
> > However, at this point I mostly want to find out if it the long
> > mbuf chain that causes problems for TSO enabled network interfaces.
>
> I agree, though a long mbuf chain is more of a driver issue than an
> NFS issue...
>
Yes, if my hunch is correct, it is. If my hunch gets verified, I will
be posting w.r.t. how best to deal with the problem.
I suspect a patch like this one might serve as a useful work-around while the drivers gets fixed, if the hunch is correct. > > > In fact, we might want to switch _readlink to MGET (could be > > > conditional > > > upon cnt) so that if it fits in an mbuf we don't allocate a > > > cluster > > > for > > > it... > > > > > For NFSv4, what was an RPC for NFSv3 becomes one of several Ops. in > > a compound RPC. As such, there is no way to know how much > > additional > > RPC message there will be. So, although the readlink reply won't > > use > > much of the 4K allocation, replies for subsequent Ops. in the > > compound > > certainly could. (Is it more efficient to allocate 4K now and use > > part of it for subsequent message reply stuff or allocate > > additional > > mbuf clusters later for subsequent stuff, as required? On a small > > memory constrained machine, I suspect the latter is correct, but > > for > > the kind of hardware that has TSO scatter/gather enabled network > > interfaces, I'm not so sure. At this point, I wouldn't even say > > that using 4K clusters is going to be a win and my hunch is that > > any win wouldn't apply to small memory constrained machines.) > > Though the code that was patched wasn't using any partial buffers, > it was always allocating a new buffer... If the code in > _read/_readlinks starts using a previous mbuf chain, then obviously > things are different and I'd agree, always allocating a 2k/4k > cluster makes sense... > Yes, but nd_mb and nd_bpos are set, which means subsequent replies can use the remainder of the cluster. Why does it always allocate a new cluster? Well, because the code is OLD. It was written for OpenBSD2.6 and, at that time, I tried to make it portable across the BSDen. I'm not so concerned w.r.t. its portability now, since no one else is porting it and I don't plan to, but I still think it would be nice if it were portable to other BSDen. 
Back when I wrote it, I believe that MCLBYTES was 1K and an entire cluster was needed. (To be honest, I found out that FreeBSD's NCLBYTES is 2K about 2 days ago, when I started looking at this stuff.) Could it now look to see if enough bytes (a little over 1K) were available in the current cluster and use that. Yes, but it would reduce the portability of the code and I don't think it would make a measurable difference performance wise. > > My test server has 256Mbytes of ram and it certainly doesn't show > > any improvement (big surprise;-), but it also doesn't show any > > degradation for the limited testing I've done. > > I'm not too surprised, unless you're on a heavy server pushing > >200MB/sec, the allocation cost is probably cheap enough that it > doesn't show up... going to 4k means immediately half as many mbufs > are needed/allocated, and as they are page sized, don't have the > problems of physical memory fragmentation, nor do they have to do an > IPI/tlb shoot down in the case of multipage allocations... (I'm > dealing w/ this for geli.) > Yes, Garrett Wollman proposed this and I suspect there might be a performance gain for larger systems. He has a more involved patch. To be honest, if Garrett is convinced that his patch is of benefit performance wise, I will do a separate posting w.r.t. it and whether or not it is appropriate to be committed to head, etc. > > Again, my main interest at this point is whether reducing the > > number of mbufs in the chain fixes the TSO issues. I think > > the question of whether or not 4K clusters are performance > > improvement in general, is an interesting one that comes later. > > Another thing I noticed is that we are getting an mbuf and then > allocating a cluster... Is there a reason we aren't using something > like m_getm or m_getcl? We have a special uma zone that has > mbuf and mbuf cluster already paired meaning we save some lock > operations for each segment allocated... > See above w.r.t. OLD portable code. 
There was a time when MGETCL() wasn't guaranteed to succeed even when
M_WAITOK is specified. This is also why there is that weird loop in
the NFSMCLGET() macro. (I think there was a time in FreeBSD's past
when allocation was never guaranteed and the rest of the code doesn't
tolerate a NULL mbuf ptr. Something like M_TRYWAIT in old versions of
FreeBSD?)

Btw, Garrett Wollman's patch uses m_getm2() to get the mbuf list.

rick

> --
> John-Mark Gurney    Voice: +1 415 225 5579
>
> "All that I will do, has been done, All that I have, has not."

From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 01:46:23 2014
Date: Mon, 27 Jan 2014 20:46:21 -0500 (EST)
From: Rick Macklem
To: pyunyh@gmail.com
Cc: Daniel Braniss, freebsd-net@freebsd.org, Adam McDougall
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <944293786.17288188.1390873581393.JavaMail.root@uoguelph.ca>
In-Reply-To: <20140128005818.GB2722@michelle.cdnetworks.com>

pyunyh@gmail.com wrote:
> On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
> > pyunyh@gmail.com wrote:
> > > On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> > > > Adam McDougall wrote:
> > > > > Also try rsize=32768,wsize=32768 in your mount options, made a
> > > > > huge difference for me. I've noticed slow file transfers on
> > > > > NFS in 9 and finally did some searching a couple months ago,
> > > > > someone suggested it and they were on to something.
> > > >
> > > > I have a "hunch" that might explain why 64K NFS reads/writes
> > > > perform poorly for some network environments.
> > > > A 64K NFS read reply/write request consists of a list of 34
> > > > mbufs when passed to TCP via sosend() and a total data length of
> > > > around 65680 bytes.
> > > > Looking at a couple of drivers (virtio and ixgbe), they seem to
> > > > expect no more than 32-33 mbufs in a list for a 65535 byte TSO
> > > > xmit. I think (I don't have anything that does TSO to confirm
> > > > this) that NFS will pass a list that is longer (34 plus a TCP/IP
> > > > header).
> > > > At a glance, it appears that the drivers call m_defrag() or > > > > m_collapse() > > > > when the mbuf list won't fit in their scatter table (32 or 33 > > > > elements) > > > > and if this fails, just silently drop the data without sending > > > > it. > > > > If I'm right, there would considerable overhead from > > > > m_defrag()/m_collapse() > > > > and near disaster if they fail to fix the problem and the data > > > > is > > > > silently > > > > dropped instead of xmited. > > > > > > > > > > I think the actual number of DMA segments allocated for the mbuf > > > chain is determined by bus_dma(9). bus_dma(9) will coalesce > > > current segment with previous segment if possible. > > > Btw, I looked at ixgbe.c and it uses bus_dmamap_load_mbuf_sg(), which seems to used the fixed size scatter/gather list provided as an argument. > > Ok, I'll have to take a look, but I thought that an array of sized > > by "num_segs" is passed in as an argument. (And num_segs is set to > > either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).) > > It looked to me that the ixgbe driver called itself ix, so it isn't > > obvious to me which we are talking about. (I know that Daniel > > Braniss > > had an ix0 and ix1, which were fixed for NFS by disabling TSO.) > > > > It's ix(4). ixbge(4) is a different driver. > Ok, well I was looking at ixgbe.c and that one seems like it might have the problem, for the 82599 case. > > I'll admit I mostly looked at virtio's network driver, since that > > was the one being used by J David. > > > > Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have been > > cropping up for quite a while, and I am just trying to find out > > why. > > (I have no hardware/software that exhibits the problem, so I can > > only look at the sources and ask others to try testing stuff.) 
> > > > > I'm not sure whether you're referring to ixgbe(4) or ix(4) but I > > > see the total length of all segment sizes for ix(4) is 65535 so > > > it has no room for the ethernet/VLAN header of the mbuf chain. The > > > driver should be fixed to transmit a 64KB datagram. > > Well, if_hw_tsomax is set to 65535 by the generic code (the driver > > doesn't set it) and the code in tcp_output() seems to subtract the > > size of a tcp/ip header from that before passing data to the > > driver, > > so I think the mbuf chain passed to the driver will fit in one > > ip datagram. (I'd assume all sorts of stuff would break for TSO > > enabled drivers if that wasn't the case?) > > I believe the generic code is doing it right. I'm under the > impression the non-working TSO indicates a bug in the driver. Some > drivers didn't account for the additional ethernet/VLAN header so the > total size of DMA segments exceeded 65535. I've attached a diff > for ix(4). It wasn't tested at all as I don't have hardware to > test. > I agree that if my hunch is correct, the drivers aren't correct. But since the problem seems to have shown up a lot and it is always reported as an NFS issue, I really want to get to the bottom of it. And, if changing to 4K clusters is a useful work-around for any breakage in the drivers, then that might be useful. If the problem isn't the number of mbufs in the mbuf chain, then changing to 4K clusters won't have any effect, since the total data length in the chain remains the same. That will tell us that the problem is something else. > > > > > I think the use of m_defrag(9) in TSO is suboptimal. All TSO > > > capable controllers are able to handle multiple TX buffers so it > > > should have used m_collapse(9) rather than copying the entire chain > > > with m_defrag(9). > > > > > I haven't looked at these closely yet (plan on doing so to-day), > > but > > even m_collapse() looked like it copied data between mbufs and that > > is certainly suboptimal, imho.
I don't see why a driver can't split > > the mbuf list, if there are too many entries for the scatter/gather > > and do it in two iterations (much like tcp_output() does already, > > since the data length exceeds 65535 - tcp/ip header size). > > > > It can split the mbuf list if the controller supports an increased number > of TX buffers. Because the controller will consume the same number of > DMA descriptors for the mbuf list, drivers tend to impose a limit > on the number of TX buffers to save resources. > > > However, at this point, I just want to find out if the long chain > > of mbufs is why TSO is problematic for these drivers, since I'll > > admit I'm getting tired of telling people to disable TSO (and I > > suspect some don't believe me and never try it). > > > > TSO capable controllers tend to have various limitations (the first > TX buffer should have the complete ethernet/IP/TCP header, ip_len of the IP > header should be reset to 0, the TCP pseudo checksum should be > recomputed, etc.) and cheap controllers need more assistance from the > driver to let their firmware know the various IP/TCP header offset > locations in the mbuf. Because this requires IP/TCP header > parsing, it's error prone and very complex. > > > > > Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE > > > > clusters, > > > > so the mbuf count drops from 34 to 18. > > > > > > > > > > Could we make it conditional on size? > > > > > Not sure what you mean? If you mean "the size of the read/write", > > that would be possible for NFSv3, but less so for NFSv4. (The > > read/write > > is just one Op. in the compound for NFSv4 and there is no way to > > predict how much more data is going to be generated by subsequent > > Ops.) > > > > Sorry, I should have been clearer. You already answered my > question. Thanks. > > > If by "size" you mean amount of memory in the machine then, yes, it > > certainly could be conditional on that.
(I plan to try and look at > > the allocator to-day as well, but if others know of disadvantages > > with > > using MJUMPAGESIZE instead of MCLBYTES, please speak up.) > > > > Garrett Wollman already alluded to the MCLBYTES case being > > pre-allocated, > > but I'll admit I have no idea what the implications of that are at > > this > > time. > > > > > > If anyone has a TSO scatter/gather enabled net interface and > > > > can > > > > test this > > > > patch on it with NFS I/O (default of 64K rsize/wsize) when TSO > > > > is > > > > enabled > > > > and see what effect it has, that would be appreciated. > > > > > > > > Btw, thanks go to Garrett Wollman for suggesting the change to > > > > MJUMPAGESIZE > > > > clusters. > > > > > > > > rick > > > > ps: If the attachment doesn't make it through and you want the > > > > patch, just > > > > email me and I'll send you a copy. > > > > > From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 01:51:14 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 8F09C74C for ; Tue, 28 Jan 2014 01:51:14 +0000 (UTC) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 2485217BE for ; Tue, 28 Jan 2014 01:51:13 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: X-IronPort-AV: E=Sophos;i="4.95,732,1384318800"; d="scan'208";a="91495648" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 27 Jan 2014 20:51:12 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 3AC49B4032; Mon, 27 Jan 2014 20:51:12 -0500 (EST) Date: Mon, 27 Jan 2014 20:51:12 -0500 (EST) From: Rick Macklem To: Jack Vogel 
Message-ID: <482557096.17290094.1390873872231.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 9.2-RELEASE? MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.209] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: Daniel Braniss , FreeBSD Net , Adam McDougall , Pyun YongHyeon X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 01:51:14 -0000 Jack Vogel wrote: > That header file is for the VF driver :) which I don't believe is > being > used in this case. > The driver is capable of handling 256K but its limited by the stack > to 64K > (look in > ixgbe.h), so its not a few bytes off due to the vlan header. > > The scatter size is not an arbitrary one, its due to hardware > limitations > in Niantic > (82599). Turning off TSO in the 10G environment is not practical, > you will > have > trouble getting good performance. > > Jack > Well, if you look at this thread, Daniel got much better performance by turning off TSO. However, I agree that this is not an ideal solution. http://docs.FreeBSD.org/cgi/mid.cgi?2C287272-7B57-4AAD-B22F-6A65D9F8677B rick > > > On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN > wrote: > > > On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote: > > > pyunyh@gmail.com wrote: > > > > On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote: > > > > > Adam McDougall wrote: > > > > > > Also try rsize=32768,wsize=32768 in your mount options, > > > > > > made a > > > > > > huge > > > > > > difference for me. 
I've noticed slow file transfers on NFS > > > > > > in 9 > > > > > > and > > > > > > finally did some searching a couple months ago, someone > > > > > > suggested > > > > > > it > > > > > > and > > > > > > they were on to something. > > > > > > > > > > > I have a "hunch" that might explain why 64K NFS reads/writes > > > > > perform > > > > > poorly for some network environments. > > > > > A 64K NFS read reply/write request consists of a list of 34 > > > > > mbufs > > > > > when > > > > > passed to TCP via sosend() and a total data length of around > > > > > 65680 bytes. > > > > > Looking at a couple of drivers (virtio and ixgbe), they seem > > > > > to > > > > > expect > > > > > no more than 32-33 mbufs in a list for a 65535 byte TSO xmit. > > > > > I > > > > > think > > > > > (I don't have anything that does TSO to confirm this) that > > > > > NFS will > > > > > pass > > > > > a list that is longer (34 plus a TCP/IP header). > > > > > At a glance, it appears that the drivers call m_defrag() or > > > > > m_collapse() > > > > > when the mbuf list won't fit in their scatter table (32 or 33 > > > > > elements) > > > > > and if this fails, just silently drop the data without > > > > > sending it. > > > > > If I'm right, there would be considerable overhead from > > > > > m_defrag()/m_collapse() > > > > > and near disaster if they fail to fix the problem and the > > > > > data is > > > > > silently > > > > > dropped instead of xmited. > > > > > > > > > > > > > I think the actual number of DMA segments allocated for the > > > > mbuf > > > > chain is determined by bus_dma(9). bus_dma(9) will coalesce > > > > current segment with previous segment if possible. > > > > > > > Ok, I'll have to take a look, but I thought that an array > > > sized > > > by "num_segs" is passed in as an argument. (And num_segs is set > > > to > > > either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).)
> > > It looked to me that the ixgbe driver called itself ix, so it > > > isn't > > > obvious to me which we are talking about. (I know that Daniel > > > Braniss > > > had an ix0 and ix1, which were fixed for NFS by disabling TSO.) > > > > > > It's ix(4). ixgb(4) is a different driver. > > > > > I'll admit I mostly looked at virtio's network driver, since that > > > was the one being used by J David. > > > > > > Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have > > > been > > > cropping up for quite a while, and I am just trying to find out > > > why. > > > (I have no hardware/software that exhibits the problem, so I can > > > only look at the sources and ask others to try testing stuff.) > > > > > > > I'm not sure whether you're referring to ixgbe(4) or ix(4) but > > > > I > > > > see the total length of all segment sizes for ix(4) is 65535 so > > > > it has no room for the ethernet/VLAN header of the mbuf chain. The > > > > driver should be fixed to transmit a 64KB datagram. > > > Well, if_hw_tsomax is set to 65535 by the generic code (the > > > driver > > > doesn't set it) and the code in tcp_output() seems to subtract > > > the > > > size of a tcp/ip header from that before passing data to the > > > driver, > > > so I think the mbuf chain passed to the driver will fit in one > > > ip datagram. (I'd assume all sorts of stuff would break for TSO > > > enabled drivers if that wasn't the case?) > > > > I believe the generic code is doing it right. I'm under the > > impression the non-working TSO indicates a bug in the driver. Some > > drivers didn't account for the additional ethernet/VLAN header so the > > total size of DMA segments exceeded 65535. I've attached a diff > > for ix(4). It wasn't tested at all as I don't have hardware to > > test. > > > > > > > > > I think the use of m_defrag(9) in TSO is suboptimal.
All TSO > > > > capable controllers are able to handle multiple TX buffers so > > > > it > > > > should have used m_collapse(9) rather than copying the entire chain > > > > with m_defrag(9). > > > > > > > I haven't looked at these closely yet (plan on doing so to-day), > > > but > > > even m_collapse() looked like it copied data between mbufs and > > > that > > > is certainly suboptimal, imho. I don't see why a driver can't > > > split > > > the mbuf list, if there are too many entries for the > > > scatter/gather > > > and do it in two iterations (much like tcp_output() does already, > > > since the data length exceeds 65535 - tcp/ip header size). > > > > > > > It can split the mbuf list if the controller supports an increased number > > of TX buffers. Because the controller will consume the same number of > > DMA descriptors for the mbuf list, drivers tend to impose a limit > > on the number of TX buffers to save resources. > > > > > However, at this point, I just want to find out if the long chain > > > of mbufs is why TSO is problematic for these drivers, since I'll > > > admit I'm getting tired of telling people to disable TSO (and I > > > suspect some don't believe me and never try it). > > > > > > > TSO capable controllers tend to have various limitations (the first > > TX buffer should have the complete ethernet/IP/TCP header, ip_len of the IP > > header should be reset to 0, the TCP pseudo checksum should be > > recomputed, etc.) and cheap controllers need more assistance from the > > driver to let their firmware know the various IP/TCP header offset > > locations in the mbuf. Because this requires IP/TCP header > > parsing, it's error prone and very complex. > > > > > > > Anyhow, I have attached a patch that makes NFS use > > > > > MJUMPAGESIZE > > > > > clusters, > > > > > so the mbuf count drops from 34 to 18. > > > > > > > > > > Could we make it conditional on size? > > > > > > > Not sure what you mean?
If you mean "the size of the read/write", > > > that would be possible for NFSv3, but less so for NFSv4. (The > > > read/write > > > is just one Op. in the compound for NFSv4 and there is no way to > > > predict how much more data is going to be generated by subsequent > > > Ops.) > > > > > > > Sorry, I should have been clearer. You already answered my > > question. Thanks. > > > > > If by "size" you mean amount of memory in the machine then, yes, > > > it > > > certainly could be conditional on that. (I plan to try and look > > > at > > > the allocator to-day as well, but if others know of disadvantages > > > with > > > using MJUMPAGESIZE instead of MCLBYTES, please speak up.) > > > > > > Garrett Wollman already alluded to the MCLBYTES case being > > > pre-allocated, > > > but I'll admit I have no idea what the implications of that are > > > at this > > > time. > > > > > > > > If anyone has a TSO scatter/gather enabled net interface and > > > > > can > > > > > test this > > > > > patch on it with NFS I/O (default of 64K rsize/wsize) when > > > > > TSO is > > > > > enabled > > > > > and see what effect it has, that would be appreciated. > > > > > > > > > > Btw, thanks go to Garrett Wollman for suggesting the change > > > > > to > > > > > MJUMPAGESIZE > > > > > clusters. > > > > > > > > > > rick > > > > > ps: If the attachment doesn't make it through and you want > > > > > the > > > > > patch, just > > > > > email me and I'll send you a copy.
> > > > > > > > > _______________________________________________ > > freebsd-net@freebsd.org mailing list > > http://lists.freebsd.org/mailman/listinfo/freebsd-net > > To unsubscribe, send any mail to > > "freebsd-net-unsubscribe@freebsd.org" > > > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 02:14:55 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2A16EA9F for ; Tue, 28 Jan 2014 02:14:55 +0000 (UTC) Received: from h2.funkthat.com (gate2.funkthat.com [208.87.223.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id F32271953 for ; Tue, 28 Jan 2014 02:14:54 +0000 (UTC) Received: from h2.funkthat.com (localhost [127.0.0.1]) by h2.funkthat.com (8.14.3/8.14.3) with ESMTP id s0S2Ep3C065392 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 27 Jan 2014 18:14:51 -0800 (PST) (envelope-from jmg@h2.funkthat.com) Received: (from jmg@localhost) by h2.funkthat.com (8.14.3/8.14.3/Submit) id s0S2EobG065391; Mon, 27 Jan 2014 18:14:50 -0800 (PST) (envelope-from jmg) Date: Mon, 27 Jan 2014 18:14:50 -0800 From: John-Mark Gurney To: Rick Macklem Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
Message-ID: <20140128021450.GY13704@funkthat.com> Mail-Followup-To: Rick Macklem , freebsd-net@freebsd.org, Adam McDougall References: <20140128002826.GU13704@funkthat.com> <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca> User-Agent: Mutt/1.4.2.3i X-Operating-System: FreeBSD 7.2-RELEASE i386 X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger? X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.2 (h2.funkthat.com [127.0.0.1]); Mon, 27 Jan 2014 18:14:51 -0800 (PST) Cc: freebsd-net@freebsd.org, Adam McDougall X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 02:14:55 -0000 Rick Macklem wrote this message on Mon, Jan 27, 2014 at 20:32 -0500: > John-Mark Gurney wrote: > > Rick Macklem wrote this message on Mon, Jan 27, 2014 at 18:47 -0500: > > > John-Mark Gurney wrote: > > > > Rick Macklem wrote this message on Sun, Jan 26, 2014 at 21:16 > > > > -0500: > > > > > Btw, thanks go to Garrett Wollman for suggesting the change to > > > > > MJUMPAGESIZE > > > > > clusters. > > > > > > > > > > rick > > > > > ps: If the attachment doesn't make it through and you want the > > > > > patch, just > > > > > email me and I'll send you a copy. > > > > > > > > The patch looks good, but we probably shouldn't change > > > > _readlink.. > > > > The chances of a link being >2k are pretty slim, and the chances > > > > of > > > > the link being >32k are even smaller... 
> > > > > > > Yea, I already thought of that, actually. However, see below w.r.t. > > > NFSv4. > > > > > > However, at this point I > > > mostly want to find out if it the long mbuf chain that causes > > > problems > > > for TSO enabled network interfaces. > > > > I agree, though a long mbuf chain is more of a driver issue than an > > NFS issue... > > > Yes, if my hunch is correct, it is. If my hunch gets verified, I will > be posting w.r.t. how best to deal with the problem. I suspect a patch > like this one might serve as a useful work-around while the drivers > gets fixed, if the hunch is correct. It would be nice to have a way to force such a segment to go out to the drivers to make debugging/testing drivers easier... I'm not sure the best way to handle that though... > > > > In fact, we might want to switch _readlink to MGET (could be > > > > conditional > > > > upon cnt) so that if it fits in an mbuf we don't allocate a > > > > cluster > > > > for > > > > it... > > > > > > > For NFSv4, what was an RPC for NFSv3 becomes one of several Ops. in > > > a compound RPC. As such, there is no way to know how much > > > additional > > > RPC message there will be. So, although the readlink reply won't > > > use > > > much of the 4K allocation, replies for subsequent Ops. in the > > > compound > > > certainly could. (Is it more efficient to allocate 4K now and use > > > part of it for subsequent message reply stuff or allocate > > > additional > > > mbuf clusters later for subsequent stuff, as required? On a small > > > memory constrained machine, I suspect the latter is correct, but > > > for > > > the kind of hardware that has TSO scatter/gather enabled network > > > interfaces, I'm not so sure. At this point, I wouldn't even say > > > that using 4K clusters is going to be a win and my hunch is that > > > any win wouldn't apply to small memory constrained machines.) 
> > Though the code that was patched wasn't using any partial buffers, > > it was always allocating a new buffer... If the code in > > _read/_readlinks starts using a previous mbuf chain, then obviously > > things are different and I'd agree, always allocating a 2k/4k > > cluster makes sense... > > > Yes, but nd_mb and nd_bpos are set, which means subsequent replies can > use the remainder of the cluster. Couldn't we scan the list of replies, find out how much data we need, m_getm the space for it all (which will use 4k clusters as necessary)? > Why does it always allocate a new cluster? Well, because the code is > OLD. It was written for OpenBSD 2.6 and, at that time, I tried to make > it portable across the BSDen. I'm not so concerned w.r.t. its portability > now, since no one else is porting it and I don't plan to, but I still > think it would be nice if it were portable to other BSDen. > Back when I wrote it, I believe that MCLBYTES was 1K and an entire > cluster was needed. (To be honest, I found out that FreeBSD's MCLBYTES > is 2K about 2 days ago, when I started looking at this stuff.) > > Could it now look to see if enough bytes (a little over 1K) were available > in the current cluster and use that. Yes, but it would reduce the portability > of the code and I don't think it would make a measurable difference performance > wise. Are you sure it would reduce the portability? I can't think of a way it would... Some code will always need to be written for portability.. > > > My test server has 256Mbytes of ram and it certainly doesn't show > > > any improvement (big surprise;-), but it also doesn't show any > > > degradation for the limited testing I've done. > > I'm not too surprised, unless you're on a heavy server pushing > > >200MB/sec, the allocation cost is probably cheap enough that it > > doesn't show up...
going to 4k means immediately half as many mbufs > > are needed/allocated, and as they are page sized, don't have the > > problems of physical memory fragmentation, nor do they have to do an > > IPI/tlb shoot down in the case of multipage allocations... (I'm > > dealing w/ this for geli.) > > > Yes, Garrett Wollman proposed this and I suspect there might be a > performance gain for larger systems. He has a more involved patch. > To be honest, if Garrett is convinced that his patch is of benefit > performance wise, I will do a separate posting w.r.t. it and whether > or not it is appropriate to be committed to head, etc. > > > > Again, my main interest at this point is whether reducing the > > > number of mbufs in the chain fixes the TSO issues. I think > > > the question of whether or not 4K clusters are a performance > > > improvement in general, is an interesting one that comes later. > > > > Another thing I noticed is that we are getting an mbuf and then > > allocating a cluster... Is there a reason we aren't using something > > like m_getm or m_getcl? We have a special uma zone that has > > mbuf and mbuf cluster already paired meaning we save some lock > > operations for each segment allocated... > > > See above w.r.t. OLD portable code. There was a time when MCLGET() > wasn't guaranteed to succeed even when M_WAITOK is specified. > This is also why there is that weird loop in the NFSMCLGET() macro. Correct, but as you wrapped them in NFS* macros, it doesn't mean you can't merge the MCLGET w/ NFSMCLGET into a new function that merges the two... It's just another (not too difficult) wrapper that the porter has to write... Though apparently portability has been given up since you use MCLGET directly in nfsserver/nfs_nfsdport.c instead of NFSMCLGET... Sounds like nfsport.h needs some updating.... > (I think there was a time in FreeBSD's past when allocation was never > guaranteed and the rest of the code doesn't tolerate a NULL mbuf ptr.
> Something like M_TRYWAIT in old versions of FreeBSD?) Correct, there was a time that M_WAITOK could still return, but it was many years ago and many releases ago... > Btw, Garrett Wollman's patch uses m_getm2() to get the mbuf list. Interestingly, m_getm2 will use 4k clusters as necessary, and in the _readlink case, do the correct thing... Hmmm... m_getm2 isn't documented... It was added by andre almost 7 years ago... It does appear to be a public interface as ofed, sctp iscsi and ng(_tty) all use it, though only sctp appears to use it any differently than m_getm.. The rest could simply use m_getm instead of m_getm2... Considering it was committed the day before SCTP was committed, I'm not too surprised... P.S. if someone wants to submit a patch to mbuf.9 to update the docs that would be helpful... I'll review and commit... and m_append is also undocumented... -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 04:27:33 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 70081213 for ; Tue, 28 Jan 2014 04:27:33 +0000 (UTC) Received: from hergotha.csail.mit.edu (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id EE7C914CC for ; Tue, 28 Jan 2014 04:27:32 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.7/8.14.7) with ESMTP id s0S4RUAj077762; Mon, 27 Jan 2014 23:27:30 -0500 (EST) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.7/8.14.4/Submit) id s0S4RTVn077761; Mon, 27 Jan 2014 23:27:29 -0500 
(EST) (envelope-from wollman) Date: Mon, 27 Jan 2014 23:27:29 -0500 (EST) Message-Id: <201401280427.s0S4RTVn077761@hergotha.csail.mit.edu> From: wollman@freebsd.org To: rmacklem@uoguelph.ca Subject: Re: Terrible NFS performance under 9.2-RELEASE? In-Reply-To: <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca> References: <20140128002826.GU13704@funkthat.com> Organization: none X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (hergotha.csail.mit.edu [127.0.0.1]); Mon, 27 Jan 2014 23:27:30 -0500 (EST) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on hergotha.csail.mit.edu Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 04:27:33 -0000 In article <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca>, Rick Macklem writes: >Btw, Garrett Wollman's patch uses m_getm2() to get the mbuf list. I do two things in my version that should provide an improvement. The first is, as you say, using m_getm2() to allocate a list of mbufs. The second is to use a fixed-size iovec array and a special-purpose UMA zone to allocate the iovec and a preinitialized uio as a single allocation. I haven't tested this approach at all (not even compilation testing), so I don't know whether it will work or not, and I don't know if it actually provides the sort of performance improvement I expect. The real big improvement, which I have not tried to implement, would be to use physical pages (via sfbufs) by sharing the inner loop of sendfile(2). Since I use ZFS as my backing filesystem, I'm not sure this would have any benefit for me, but it should be a measurable improvement for UFS-backed NFS servers. My patch follows. 
Note that I haven't even compile-tested it yet, and there is likely to be some fuzz if you apply it to stock kernel sources. -GAWollman

--- nfs_nfsdport.c.orig	2014-01-26 23:38:58.296234939 -0500
+++ nfs_nfsdport.c	2014-01-26 23:46:17.901236792 -0500
@@ -50,6 +50,14 @@
 FEATURE(nfsd, "NFSv4 server");
 
+#define NFS_NIOVEC (NFS_SRVMAXDATA / MCLBYTES + 2)
+struct nfsd_iovec {
+	struct uio nfsiov_uio;
+	struct iovec nfsiov_iov[NFS_NIOVEC];
+};
+static struct uma_zone *nfsd_iovec_zone;
+static void nfsd_iovec_construct(struct uio **, struct mbuf **, struct mbuf **,
+    int);
 extern u_int32_t newnfs_true, newnfs_false, newnfs_xdrneg1;
 extern int nfsrv_useacl;
 extern int newnfs_numnfsd;
@@ -626,7 +634,7 @@
 	struct iovec *iv2;
 	int error = 0, len, left, siz, tlen, ioflag = 0;
 	struct mbuf *m2 = NULL, *m3;
-	struct uio io, *uiop = &io;
+	struct uio *uiop;
 	struct nfsheur *nh;
 
 	len = left = NFSM_RNDUP(cnt);
@@ -634,49 +642,11 @@
 	/*
 	 * Generate the mbuf list with the uio_iov ref. to it.
 	 */
-	i = 0;
-	while (left > 0) {
-		NFSMGET(m);
-		MCLGET(m, M_WAIT);
-		m->m_len = 0;
-		siz = min(M_TRAILINGSPACE(m), left);
-		left -= siz;
-		i++;
-		if (m3)
-			m2->m_next = m;
-		else
-			m3 = m;
-		m2 = m;
-	}
-	MALLOC(iv, struct iovec *, i * sizeof (struct iovec),
-	    M_TEMP, M_WAITOK);
-	uiop->uio_iov = iv2 = iv;
-	m = m3;
-	left = len;
-	i = 0;
-	while (left > 0) {
-		if (m == NULL)
-			panic("nfsvno_read iov");
-		siz = min(M_TRAILINGSPACE(m), left);
-		if (siz > 0) {
-			iv->iov_base = mtod(m, caddr_t) + m->m_len;
-			iv->iov_len = siz;
-			m->m_len += siz;
-			left -= siz;
-			iv++;
-			i++;
-		}
-		m = m->m_next;
-	}
-	uiop->uio_iovcnt = i;
+	nfsd_iovec_construct(&uiop, &m3, &m2, len);
 	uiop->uio_offset = off;
-	uiop->uio_resid = len;
-	uiop->uio_rw = UIO_READ;
-	uiop->uio_segflg = UIO_SYSSPACE;
 	nh = nfsrv_sequential_heuristic(uiop, vp);
 	ioflag |= nh->nh_seqcount << IO_SEQSHIFT;
 	error = VOP_READ(vp, uiop, IO_NODELOCKED | ioflag, cred);
-	FREE((caddr_t)iv2, M_TEMP);
 	if (error) {
 		m_freem(m3);
 		*mpp = NULL;
@@ -695,6 +665,7 @@
 	*mpendp = m2;
 out:
+	uma_zfree(nfsd_iovec_zone, uiop);	/* now safe to free */
 	NFSEXITCODE(error);
 	return (error);
 }
@@ -3284,6 +3255,74 @@
 	}
 }
 
+/*
+ * UMA initializer for nfsd_iovec objects.
+ */
+static int
+nfsd_iovec_init(void *mem, int size, int flags)
+{
+	struct nfsd_iovec *nfsiov = mem;
+	struct uio *uio = &nfsiov->nfsiov_uio;
+
+	KASSERT(size == sizeof(struct nfsd_iovec),
+	    ("nfsd_iovec_init: bad size"));
+	uio->uio_iov = nfsiov->nfsiov_iov;
+	uio->uio_iovcnt = 0;
+	/* don't care about state of uio_offset */
+	uio->uio_resid = 0;
+	uio->uio_segflg = UIO_SYSSPACE;
+	uio->uio_rw = UIO_READ;
+	uio->uio_td = NULL;
+	return (0);
+}
+
+/*
+ * The destructor doesn't need to do anything different from the
+ * initializer.
+ */
+static void
+nfsd_iovec_dtor(void *mem, int size, void *arg)
+{
+	(void)nfsd_iovec_init(mem, size, 0);
+}
+
+static void
+nfsd_iovec_construct(struct uio **uiop, struct mbuf **mp, struct mbuf **tailp,
+    int left)
+{
+	struct nfsd_iovec *nfsiov;
+	struct iovec *iov;
+	struct mbuf *m, *m2;
+	struct uio *uio;
+	int siz;
+
+	/* uma_zalloc is guaranteed to succeed or deadlock with M_WAITOK */
+	nfsiov = uma_zalloc(nfsd_iovec_zone, M_WAITOK);
+	*uiop = uio = &nfsiov->nfsiov_uio;
+	for (;;) {
+		m = m_getm2(NULL, left, M_WAITOK, MT_DATA, 0);
+		if (m != NULL)	/* should always be taken with M_WAITOK */
+			break;
+		nfs_catnap(PZERO, 0, "nfsiovec");
+	}
+	*mp = m;
+	uio->uio_resid = left;
+	iov = uio->uio_iov;
+
+	while (m != NULL && left > 0) {
+		if (++uio->uio_iovcnt > NFS_NIOVEC)
+			panic("nfsd_iovec_construct: mbuf chain exceeded size");
+		iov->iov_base = mtod(m, char *);
+		m->m_len = iov->iov_len = siz = min(M_TRAILINGSPACE(m), left);
+		left -= siz;
+		iov++;
+		if ((m2 = m->m_next) == NULL && tailp != NULL)	/* last one? */
+			*tailp = m;
+		m = m2;
+	}
+}
+
 extern int (*nfsd_call_nfsd)(struct thread *, struct nfssvc_args *);
 
 /*
@@ -3319,6 +3358,10 @@
 	vn_deleg_ops.vndeleg_recall = nfsd_recalldelegation;
 	vn_deleg_ops.vndeleg_disable = nfsd_disabledelegation;
 #endif
+	nfsd_iovec_zone = uma_zcreate("nfsd iovec",
+	    sizeof(struct nfsd_iovec), NULL /* ctor */,
+	    nfsd_iovec_dtor, nfsd_iovec_init, NULL /* fini */,
+	    sizeof(void *) - 1 /* alignment mask */, 0 /* flags */);
 	nfsd_call_servertimer = nfsrv_servertimer;
 	nfsd_call_nfsd = nfssvc_nfsd;
 	loaded = 1;
@@ -3347,6 +3390,9 @@
 	if (nfsrvd_pool != NULL)
 		svcpool_destroy(nfsrvd_pool);
 
+	/* Release memory in the iovec zone */
+	uma_zdestroy(nfsd_iovec_zone);
+
 	/* and get rid of the locks */
 	for (i = 0; i < NFSRVCACHE_HASHSIZE; i++)
 		mtx_destroy(&nfsrc_tcpmtx[i]);

From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 06:28:44 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id C96C9B14; Tue, 28 Jan 2014 06:28:44 +0000 (UTC) Received: from mail-ie0-x232.google.com (mail-ie0-x232.google.com [IPv6:2607:f8b0:4001:c03::232]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 8DC941C9A; Tue, 28 Jan 2014 06:28:44 +0000 (UTC) Received: by mail-ie0-f178.google.com with SMTP id x13so7121234ief.37 for ; Mon, 27 Jan 2014 22:28:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=MIYIa4w/O3tQmMg88kcxETTAMgoN0ggwnVRCLgpfLfk=; b=QkYKol6EZhtDNhAWdiAL+dfYPeurYjjSDdyIza32q1N6oobbrzBH2jpHm0vZEsVbo+ siYsVuPa45xQv+HzopCFgKthF5//1PvAASBzc21Oy9ZM7ZxbyBjnCuo81K0hfhTA5c6W G7Pe06UcA88V0Xh1Ci0G6eRj+2HCkO2p9/OIX7stGwvAIql5mI8qeiL9W2Z5fjRKG8TN
Ik00GfTKdib6dtt6Zc1DO4JSpDUZW83PueTIw/oYbbvb7Jncl3z+jx5jgDbeYFOTrM3t GoUyXqhcJ4OmnuOZcbWes7hw+zjjDX6sqtIpCJUzgSpWMgW8HSCG1lvIUMgzH76BDoVG e+Ew== MIME-Version: 1.0 X-Received: by 10.50.13.9 with SMTP id d9mr21433238igc.25.1390890522907; Mon, 27 Jan 2014 22:28:42 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Mon, 27 Jan 2014 22:28:42 -0800 (PST) In-Reply-To: <201401280427.s0S4RTVn077761@hergotha.csail.mit.edu> References: <20140128002826.GU13704@funkthat.com> <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca> <201401280427.s0S4RTVn077761@hergotha.csail.mit.edu> Date: Tue, 28 Jan 2014 01:28:42 -0500 X-Google-Sender-Auth: mRtwRKtE_tp1KE4p2pUa9Xn1F4I Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? From: J David To: wollman@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-net@freebsd.org, Rick Macklem X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 06:28:44 -0000 A few questions as I try to parse the various patches floating around for this. What's the difference between src/sys/nfsserver and src/sys/fs/nfsserver? It looks like maybe the former is for NFSv3 and the latter is for NFSv4? If so, these patches appear to be for the NFSv4 server. Since we are using the NFSv3 server exclusively, does that mean we would need to do something similar somewhere in the neighborhood of line 930 of src/sys/nfsserver/nfs_serv.c? Also, these patches appear server-side. To make sure things flow smoothly, will additional client-side changes be necessary? There is some MGET/MCLGET in src/sys/nfsclient/nfs_subs.c. (The equivalent in src/sys/fs/nfsclient/nfs_clcomsubs.c appear to be using the NFSMGET/NFSMCLGET macros, so presumably those are handled?) 
In any case, the switch from 2k to 4k mbufs and m_getm2 seems well worthwhile regardless of whether it addresses this specific issue. It should reduce a lot of overhead in many common cases. If my understanding isn't too far off, I can take a whack at testing the result, but only on NFSv3. Thanks! From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 06:55:14 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E22C8DA3 for ; Tue, 28 Jan 2014 06:55:14 +0000 (UTC) Received: from hergotha.csail.mit.edu (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 9A0641E40 for ; Tue, 28 Jan 2014 06:55:14 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.7/8.14.7) with ESMTP id s0S6tCsE079255; Tue, 28 Jan 2014 01:55:12 -0500 (EST) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.7/8.14.4/Submit) id s0S6tBWj079252; Tue, 28 Jan 2014 01:55:11 -0500 (EST) (envelope-from wollman) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <21223.21583.878646.673835@hergotha.csail.mit.edu> Date: Tue, 28 Jan 2014 01:55:11 -0500 From: Garrett Wollman To: J David Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
In-Reply-To: References: <20140128002826.GU13704@funkthat.com> <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca> <201401280427.s0S4RTVn077761@hergotha.csail.mit.edu> X-Mailer: VM 7.17 under 21.4 (patch 22) "Instant Classic" XEmacs Lucid X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (hergotha.csail.mit.edu [127.0.0.1]); Tue, 28 Jan 2014 01:55:12 -0500 (EST) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on hergotha.csail.mit.edu Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 06:55:14 -0000 < said: > What's the difference between src/sys/nfsserver and > src/sys/fs/nfsserver? It looks like maybe the former is for NFSv3 and > the latter is for NFSv4? /sys/nfs* is the "old" (read: obsolete) NFS client and server. /sys/fs/nfs* is the "new" (default) NFS client and server. Both implementations do both NFSv2 and NFSv3; only the "new" implementation does NFSv4. Even if you are only using NFSv3, you want to be using the "new" implementation, and the "old" one should go away before the stable/11 branch happens. We're running a mix of 9.1 (with some earlier versions of Rick's DRC patches and FHA for NFSv3) and 9.2 (with the DRC patches) currently, and I'm looking through a bunch of changes to pull forward from stable/9. 
-GAWollman From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 06:59:18 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id A8F99E93; Tue, 28 Jan 2014 06:59:18 +0000 (UTC) Received: from mail-ie0-x22b.google.com (mail-ie0-x22b.google.com [IPv6:2607:f8b0:4001:c03::22b]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 6BA2D1E5B; Tue, 28 Jan 2014 06:59:18 +0000 (UTC) Received: by mail-ie0-f171.google.com with SMTP id as1so7191458iec.2 for ; Mon, 27 Jan 2014 22:59:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=/8sKFAiOTxgqgN0SxKq5sKBZZiVNiFDOP5opMVNTEO8=; b=rt+CP83wxfjJyiXH+yxoFI21VFG3DuZZ+HT6If4A0AsF3nzjwwtozwALJYNG4DaRoB ExpENXBcr1hnLisx2Mt+2P/xuuwYE3nH6oJNA9jNqKNPRYNDQVtowuxfPxPPJSMJwZ4e g55D71k9S/u7TS0ewUXKZm/dfCJ7uiNDqSlBkFjgcIEFG4wjc4amJCSxp6+q8h/MdY4z CF34PhZrtaHO6qSoBvJQe5rfFT7c8j0rKeYZDy7nT6pFEySNqo5dqbitykrL4lbrzZ1n hRW6p49um8YsoC4sRcJg0PlNxPnt8QeZwi5G/vxTnsE2Vjh6sGVBgLatdKJIXk62rPVL RFBw== MIME-Version: 1.0 X-Received: by 10.42.121.147 with SMTP id j19mr25148037icr.13.1390892357869; Mon, 27 Jan 2014 22:59:17 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Mon, 27 Jan 2014 22:59:17 -0800 (PST) In-Reply-To: References: <20140128002826.GU13704@funkthat.com> <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca> <201401280427.s0S4RTVn077761@hergotha.csail.mit.edu> Date: Tue, 28 Jan 2014 01:59:17 -0500 X-Google-Sender-Auth: A8EVTsraayRwaKYEwWl0Vd1hdM4 Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
From: J David To: wollman@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-net@freebsd.org, Rick Macklem X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 06:59:18 -0000

Another way to test this is to instrument the virtio driver, which turned out to be very straightforward:

Index: if_vtnet.c
===================================================================
--- if_vtnet.c	(revision 260701)
+++ if_vtnet.c	(working copy)
@@ -1886,6 +1887,7 @@
 	return (virtqueue_enqueue(vq, txhdr, &sg, sg.sg_nseg, 0));

 fail:
+	sc->vtnet_stats.tx_excess_mbuf_drop++;
 	m_freem(*m_head);
 	*m_head = NULL;

@@ -2645,6 +2647,9 @@
 	SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "tx_task_rescheduled",
 	    CTLFLAG_RD, &stats->tx_task_rescheduled,
 	    "Times the transmit interrupt task rescheduled itself");
+	SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "tx_excess_mbuf_drop",
+	    CTLFLAG_RD, &stats->tx_excess_mbuf_drop,
+	    "Times packets were dropped due to excess mbufs");
 }

 static int

Index: if_vtnetvar.h
===================================================================
--- if_vtnetvar.h	(revision 260701)
+++ if_vtnetvar.h	(working copy)
@@ -48,6 +48,7 @@
 	unsigned long tx_csum_bad_ethtype;
 	unsigned long tx_tso_bad_ethtype;
 	unsigned long tx_task_rescheduled;
+	unsigned long tx_excess_mbuf_drop;
 };

 struct vtnet_softc {

This patch didn't seem harmful from a performance standpoint since if things are working, the counter increment never gets hit.

With this change, I re-ran some 64k tests. I found that the number of drops was very small, but not zero.
On the client, doing the write-append test (which has no reads), it seems like it slowly builds up 8 with what appears to be some sort of back off (each one takes longer to appear than the last): $ sysctl dev.vtnet.1.tx_excess_mbuf_drop dev.vtnet.1.tx_excess_mbuf_drop: 8 But after 8, it appears congestion control is clamped down so hard that no more happen. Once read activity starts, the server builds up more: dev.vtnet.1.tx_excess_mbuf_drop: 53 So while there aren't a lot of these, they definitely do exist and there's just no way they're good for performance. Thanks! From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 11:46:31 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id BE4105B3 for ; Tue, 28 Jan 2014 11:46:31 +0000 (UTC) Received: from sam.nabble.com (sam.nabble.com [216.139.236.26]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id A140813DC for ; Tue, 28 Jan 2014 11:46:31 +0000 (UTC) Received: from [192.168.236.26] (helo=sam.nabble.com) by sam.nabble.com with esmtp (Exim 4.72) (envelope-from ) id 1W877e-0004Xu-4D for freebsd-net@freebsd.org; Tue, 28 Jan 2014 03:46:30 -0800 Date: Tue, 28 Jan 2014 03:46:30 -0800 (PST) From: Beeblebrox To: freebsd-net@freebsd.org Message-ID: <1390909590119-5880672.post@n5.nabble.com> Subject: Jails on fib problem MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 11:46:31 -0000 Hi. I'm trying to setup a pool of jails, with the gateway to the jails as a fib device. 
All jails reside on cloned interface IP xxx.xxx.x.1/28 as gateway (fib 1). Jail IP's start from xxx.xxx.x.2/32. The fib seems to be limited to one jail only. That is, the first jail to grab the fib seems to keep control of it and traffic from other jails does not get routed to the public gateway. Do I need to be using one-fib-per-jail? Does each /32 jail require its own fib device? Thanks. ----- FreeBSD-11-current_amd64_root-on-zfs_RadeonKMS -- View this message in context: http://freebsd.1045724.n5.nabble.com/Jails-on-fib-problem-tp5880672.html Sent from the freebsd-net mailing list archive at Nabble.com. From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 13:07:48 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id B359D243 for ; Tue, 28 Jan 2014 13:07:48 +0000 (UTC) Received: from mail2.dataoppdrag.no (mail2.dataoppdrag.no [IPv6:2a02:f58:7:2::2]) by mx1.freebsd.org (Postfix) with ESMTP id 6E6F01ADA for ; Tue, 28 Jan 2014 13:07:48 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by mail2.dataoppdrag.no (Postfix) with ESMTP id A88004330A for ; Tue, 28 Jan 2014 14:07:39 +0100 (CET) Received: from mail2.dataoppdrag.no ([127.0.0.1]) by localhost (mail2.dataoppdrag.no [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id pS4AXgRqKAhE for ; Tue, 28 Jan 2014 14:07:39 +0100 (CET) Received: from [172.20.10.252] (42-80-141-95.net.dataoppdrag.no [95.141.80.42]) by mail2.dataoppdrag.no (Postfix) with ESMTP id 82B8F43307 for ; Tue, 28 Jan 2014 14:07:39 +0100 (CET) Message-ID: <52E7AB9B.5050707@dataoppdrag.no> Date: Tue, 28 Jan 2014 14:07:39 +0100 From: Ole Myhre User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: freebsd-net@freebsd.org Subject: carp and rtadvd Content-Type: text/plain; 
charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 13:07:48 -0000

Hi, I have a simple setup with two 10.0-RELEASE firewalls running carp, a virtual IPv6 address, and rtadvd:

(applied to both firewalls)

# kldload carp
# ifconfig em2 inet6 2001:db8::1/64 vhid 1 up
# sysctl net.inet6.ip6.forwarding=1
# echo 'rtadvd_enable="YES"' >> /etc/rc.conf
# echo 'rtadvd_interfaces="em2"' >> /etc/rc.conf
# service rtadvd start

This works fine: one firewall is MASTER, the other BACKUP, and the clients behind em2 get a prefix in the 2001:db8::/64 subnet. However, both firewalls are sending router advertisements (even though only one is MASTER), each with the LL-address of its physical em2 interface as the gateway. This causes clients that support multiple default gateways to select both firewalls as their default gateway and send traffic to both the MASTER and the BACKUP firewall.

Is there a way to make only the MASTER send router advertisements, or (preferably) to have only the MASTER send router advertisements carrying a virtual LL-address?
Thanks, Ole Myhre From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 13:16:02 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 74C6E39F for ; Tue, 28 Jan 2014 13:16:02 +0000 (UTC) Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 419161C87 for ; Tue, 28 Jan 2014 13:16:02 +0000 (UTC) Received: from Julian-MBP3.local (50-196-156-133-static.hfc.comcastbusiness.net [50.196.156.133]) (authenticated bits=0) by vps1.elischer.org (8.14.7/8.14.7) with ESMTP id s0SD0ETh016578 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 28 Jan 2014 05:00:26 -0800 (PST) (envelope-from julian@freebsd.org) Message-ID: <52E7A9D8.30604@freebsd.org> Date: Tue, 28 Jan 2014 21:00:08 +0800 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Beeblebrox , freebsd-net@freebsd.org Subject: Re: Jails on fib problem References: <1390909590119-5880672.post@n5.nabble.com> In-Reply-To: <1390909590119-5880672.post@n5.nabble.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 13:16:02 -0000 On 1/28/14, 7:46 PM, Beeblebrox wrote: > Hi. > I'm trying to setup a pool of jails, with the gateway to the jails as a fib > device. what's a fib device? Do you mean each jail has a different default fib? you are not using vimage jails? 
> All jails reside on cloned interface IP xxx.xxx.x.1/28 as gateway
> (fib 1).

so they all have the same address?? can you even do that? or you mean that they all have the same default route?

> Jail IP's start from xxx.xxx.x.2/32. The fib seems to be limited to
> one jail only. That is, the first jail to grab the fib seems to keep control
> of it and traffic from other jails does not get routed to the public
> gateway.

multiple jails can use the same fib, but I think you are confused about what is going on.

> Do I need to be using one-fib-per-jail? Does each /32 jail require its own
> fib device?

fibs don't have devices. I'm having a hard time working out what you are trying to do.

> Thanks.
>
> -----
> FreeBSD-11-current_amd64_root-on-zfs_RadeonKMS
> --
> View this message in context: http://freebsd.1045724.n5.nabble.com/Jails-on-fib-problem-tp5880672.html
> Sent from the freebsd-net mailing list archive at Nabble.com.
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 13:18:59 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id C751453F for ; Tue, 28 Jan 2014 13:18:59 +0000 (UTC) Received: from mail-pd0-x232.google.com (mail-pd0-x232.google.com [IPv6:2607:f8b0:400e:c02::232]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 9D3221CA8 for ; Tue, 28 Jan 2014 13:18:59 +0000 (UTC) Received: by mail-pd0-f178.google.com with SMTP id y13so345466pdi.9 for ; Tue, 28 Jan 2014 05:18:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=NppNPFbtAZaMNcuSlf8zVCmYHc78jXk09R1HsX3FnEM=; b=n0tQ+VlKn94saS80o1hM9IHXKOzg1jTEDbV9FhiBpJzhGTu8n7n4nBsiedCjRXZZqN ks2qW6aSZkTlxgkg0J53U+uh/YLOcBt2bt3rq08BhFa/dYn9w4dWbUOX6ZRmHVijct6k 2KDM4OZ5pBlAqA7HiHKClSHy6/bh38143yxiHumA2DAHwf0yNAkDAumFJrwWNUv+KK5D NHpIdP7Muk6jtQX8K9+fM804NGwYBA2cUP4ojr3U2PqK2znIrFR2wBsQgW+p8zEPcmpp UjdtoWdzYH4Ue0n6vGpFiqzmJOOJWDYpSTSj0tRp5duof2yBgdRLbUkUZYoF3GWRElvl 3gSQ== MIME-Version: 1.0 X-Received: by 10.66.221.199 with SMTP id qg7mr1530641pac.88.1390915139241; Tue, 28 Jan 2014 05:18:59 -0800 (PST) Sender: ermal.luci@gmail.com Received: by 10.70.46.42 with HTTP; Tue, 28 Jan 2014 05:18:59 -0800 (PST) In-Reply-To: <52E7AB9B.5050707@dataoppdrag.no> References: <52E7AB9B.5050707@dataoppdrag.no> Date: Tue, 28 Jan 2014 14:18:59 +0100 X-Google-Sender-Auth: ZJ4gFAQdDsvV9AUAKF9BCJTvfqE Message-ID: Subject: Re: carp and rtadvd From: =?ISO-8859-1?Q?Ermal_Lu=E7i?= To: Ole Myhre Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.17 Cc: freebsd-net X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 13:18:59 -0000 On Tue, Jan 28, 2014 at 2:07 PM, Ole Myhre wrote: > Hi, > > I have a simple setup with two 10.0-RELEASE firewalls running carp, a > virtual IPv6 address and running rtadvd: > > (applied to both firewalls) > > # kldload carp > # ifconfig em2 inet6 2001:db8::1/64 vhid 1 up > # sysctl net.inet6.ip6.forwarding=1 > # echo 'rtadvd_enable="YES"' >> /etc/rc.conf > # echo 'rtadvd_interfaces="em2"' >> /etc/rc.conf > # service rtadvd start > > This works fine, one firewall is MASTER, the other BACKUP and the > clients behind em2 gets a prefix in the 2001:db8::/64 subnet. 
However > both firewalls are sending router advertisements (only one being MASTER) > with the LL-address of the physical em2 interface as the gateway. This > causes clients that supports multiple default gateways to select both > firewalls as their default gateway, and sending traffic to both the > MASTER and BACKUP firewall. > > Is there a way to make only the MASTER send router advertisements or > (preferably only the MASTER) sending router advertisements with a > virtual LL-address? > > You have to use the rtadvd patched from pfSense. Look at our tools repo to get the code. > Thanks, > Ole Myhre > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > -- Ermal From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 14:55:59 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id A5DA17F9; Tue, 28 Jan 2014 14:55:59 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 42B59151A; Tue, 28 Jan 2014 14:55:58 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqQEAFjE51KDaFve/2dsb2JhbABag0RWgn25DE+BJXSCJQEBAQMBAQEBIAQnIAsFFhgCAg0ZAikBCSYGCAcEARwEh1wIDal5n3MXgSmNBQEBGwEzB4JvgUkEiUmMDIQFkG2DSx4xgQQ5 X-IronPort-AV: E=Sophos;i="4.95,736,1384318800"; d="scan'208";a="91045198" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 28 Jan 2014 09:55:57 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 70951B4089; Tue, 28 Jan 2014 09:55:57 -0500 (EST) Date: Tue, 28 Jan 2014 
09:55:57 -0500 (EST) From: Rick Macklem To: J David Message-ID: <1098090585.17554698.1390920957454.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 9.2-RELEASE? MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.209] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, wollman@freebsd.org, Bryan Venteicher X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 14:55:59 -0000 J David wrote: > Another way to test this is to instrument the virtio driver, which > turned out to be very straightforward: > > Index: if_vtnet.c > > =================================================================== > > --- if_vtnet.c (revision 260701) > > +++ if_vtnet.c (working copy) > > @@ -1886,6 +1887,7 @@ > > return (virtqueue_enqueue(vq, txhdr, &sg, sg.sg_nseg, 0)); > > > > fail: > > + sc->vtnet_stats.tx_excess_mbuf_drop++; > > m_freem(*m_head); > > *m_head = NULL; > > > > @@ -2645,6 +2647,9 @@ > > SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "tx_task_rescheduled", > > CTLFLAG_RD, &stats->tx_task_rescheduled, > > "Times the transmit interrupt task rescheduled itself"); > > + SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "tx_excess_mbuf_drop", > > + CTLFLAG_RD, &stats->tx_excess_mbuf_drop, > > + "Times packets were dropped due to excess mbufs"); > > } > > > > static int > > Index: if_vtnetvar.h > > =================================================================== > > --- if_vtnetvar.h (revision 260701) > > +++ if_vtnetvar.h (working copy) > > @@ -48,6 +48,7 @@ > > unsigned long tx_csum_bad_ethtype; > > unsigned long tx_tso_bad_ethtype; > > unsigned long tx_task_rescheduled; > > + unsigned long tx_excess_mbuf_drop; > > }; > > > > struct vtnet_softc { > > > This 
patch didn't seem harmful from a performance standpoint since if > things are working, the counter increment never gets hit. > > With this change, I re-ran some 64k tests. I found that the number > of > drops was very small, but not zero. > > On the client, doing the write-append test (which has no reads), it > seems like it slowly builds up 8 with what appears to be some sort of > back off (each one takes longer to appear than the last): > > > $ sysctl dev.vtnet.1.tx_excess_mbuf_drop > > dev.vtnet.1.tx_excess_mbuf_drop: 8 > > > But after 8, it appears congestion control is clamped down so hard > that no more happen. > > Once read activity starts, the server builds up more: > > dev.vtnet.1.tx_excess_mbuf_drop: 53 > > > So while there aren't a lot of these, they definitely do exist and > there's just no way they're good for performance. > It would be nice to also count the number of times m_collapse() gets called, since that will generate a lot of overhead that I think will show up on your test, since you don't have any disk activity. And I'd state that having any of these is near-disastrous for performance, since it means a timeout/retransmit of a TCP segment. For a lan environment, I would consider 1 timeout/retransmit in a million packets as a lot. rick ps: I've cc'd Bryan, since he's the guy handling virtio, I think. > Thanks! 
> _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 15:10:22 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 9104BD63; Tue, 28 Jan 2014 15:10:22 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 4075F1666; Tue, 28 Jan 2014 15:10:21 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAN/H51KDaFve/2dsb2JhbABahBqCfblbgSV0giUBAQEEI1YbGAICDRkCWQaIGKoJn3MXgSmNIjQHgm+BSQSJSaB+g0segW4 X-IronPort-AV: E=Sophos;i="4.95,736,1384318800"; d="scan'208";a="91051562" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 28 Jan 2014 10:10:20 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id EC933B4022; Tue, 28 Jan 2014 10:10:20 -0500 (EST) Date: Tue, 28 Jan 2014 10:10:20 -0500 (EST) From: Rick Macklem To: J David Message-ID: <1614542711.17567039.1390921820957.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.209] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, wollman@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 15:10:22 -0000 J David wrote: > A few questions as I try to parse the various patches floating around > for this. > > What's the difference between src/sys/nfsserver and > src/sys/fs/nfsserver? It looks like maybe the former is for NFSv3 > and > the latter is for NFSv4? > > If so, these patches appear to be for the NFSv4 server. Since we are > using the NFSv3 server exclusively, does that mean we would need to > do > something similar somewhere in the neighborhood of line 930 of > src/sys/nfsserver/nfs_serv.c? > > Also, these patches appear server-side. To make sure things flow > smoothly, will additional client-side changes be necessary? There is > some MGET/MCLGET in src/sys/nfsclient/nfs_subs.c. (The equivalent in > src/sys/fs/nfsclient/nfs_clcomsubs.c appear to be using the > NFSMGET/NFSMCLGET macros, so presumably those are handled?) > > In any case, the switch from 2k to 4k mbufs and m_getm2 seems well > worthwhile regardless of whether it addresses this specific issue. > It > should reduce a lot of overhead in many common cases. > > If my understanding isn't too far off, I can take a whack at testing > the result, but only on NFSv3. > > Thanks! > I think Garrett clarified which sources are which. The attached simple patch makes both the new/default client and new/default server use MJUMPAGESIZE clusters. (It is the one I already mentioned, called 4kmcl.patch.) Garrett's patch using m_getm2() would only affect the server side read, but not client write or server side readdir. 
(It can probably be combined with my simple one, but I haven't tested that.) 4kmcl.patch is not ready for head (as John Mark-Gurney pointed out, it does 4K clusters for readlink and it also does 4K clusters for all the small RPC messages), but it works ok for testing to see if it gets rid of the drops and calls to m_collapse(). Since you are using 9.2-release, you have the DRC changes. At some point, you can try setting these in the server (they reduce CPU overheads by allowing the DRC to grow, holding onto more mbufs). Btw, head (and I think stable/9,10) have been significantly changed by Alexander Motin's recent commits, although these sysctls still exist. vfs.nfsd.tcphighwater=100000 vfs.nfsd.tcpcachetimeout=600 rick From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 15:35:43 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id A9D11D58 for ; Tue, 28 Jan 2014 15:35:43 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 5B7E91869 for ; Tue, 28 Jan 2014 15:35:42 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqQEANzN51KDaFve/2dsb2JhbABXA4NEVoJ9uQxPgSV0giUBAQEDAQEBASArHgIIAwUWGAICDRkCKQEJJgYIBwQBHAEDh1wIDaoHn3QXgSmMdAoGAgEbJBAHEYIeQIFJBIlJjAyEBZBtg0seMXtC X-IronPort-AV: E=Sophos;i="4.95,736,1384318800"; d="scan'208";a="91060302" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 28 Jan 2014 10:35:41 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 50745B4065; Tue, 28 Jan 2014 10:35:41 -0500 (EST) Date: Tue, 28 Jan 2014 10:35:41 -0500 (EST) From: Rick Macklem To: John-Mark Gurney 
Message-ID: <372707859.17587309.1390923341323.JavaMail.root@uoguelph.ca> In-Reply-To: <20140128021450.GY13704@funkthat.com> Subject: Re: Terrible NFS performance under 9.2-RELEASE? MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, Adam McDougall X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 15:35:43 -0000 John-Mark Gurney wrote: > Rick Macklem wrote this message on Mon, Jan 27, 2014 at 20:32 -0500: > > John-Mark Gurney wrote: > > > Rick Macklem wrote this message on Mon, Jan 27, 2014 at 18:47 > > > -0500: > > > > John-Mark Gurney wrote: > > > > > Rick Macklem wrote this message on Sun, Jan 26, 2014 at 21:16 > > > > > -0500: > > > > > > Btw, thanks go to Garrett Wollman for suggesting the change > > > > > > to > > > > > > MJUMPAGESIZE > > > > > > clusters. > > > > > > > > > > > > rick > > > > > > ps: If the attachment doesn't make it through and you want > > > > > > the > > > > > > patch, just > > > > > > email me and I'll send you a copy. > > > > > > > > > > The patch looks good, but we probably shouldn't change > > > > > _readlink.. > > > > > The chances of a link being >2k are pretty slim, and the > > > > > chances > > > > > of > > > > > the link being >32k are even smaller... > > > > > > > > > Yea, I already thought of that, actually. However, see below > > > > w.r.t. > > > > NFSv4. > > > > > > > > However, at this point I > > > > mostly want to find out if it the long mbuf chain that causes > > > > problems > > > > for TSO enabled network interfaces. > > > > > > I agree, though a long mbuf chain is more of a driver issue than > > > an > > > NFS issue... 
> > > > > Yes, if my hunch is correct, it is. If my hunch gets verified, I > > will > > be posting w.r.t. how best to deal with the problem. I suspect a > > patch > > like this one might serve as a useful work-around while the drivers > > get fixed, if the hunch is correct. > > It would be nice to have a way to force such a segment to go out to > the drivers to make debugging/testing drivers easier... I'm not sure > the best way to handle that though... > > > > > > In fact, we might want to switch _readlink to MGET (could be > > > > > conditional > > > > > upon cnt) so that if it fits in an mbuf we don't allocate a > > > > > cluster > > > > > for > > > > > it... > > > > > > > > > For NFSv4, what was an RPC for NFSv3 becomes one of several > > > > Ops. in > > > > a compound RPC. As such, there is no way to know how much > > > > additional > > > > RPC message there will be. So, although the readlink reply > > > > won't > > > > use > > > > much of the 4K allocation, replies for subsequent Ops. in the > > > > compound > > > > certainly could. (Is it more efficient to allocate 4K now and > > > > use > > > > part of it for subsequent message reply stuff or allocate > > > > additional > > > > mbuf clusters later for subsequent stuff, as required? On a > > > > small > > > > memory constrained machine, I suspect the latter is correct, > > > > but > > > > for > > > > the kind of hardware that has TSO scatter/gather enabled > > > > network > > > > interfaces, I'm not so sure. At this point, I wouldn't even say > > > > that using 4K clusters is going to be a win and my hunch is > > > > that > > > > any win wouldn't apply to small memory constrained machines.) > > > > > > Though the code that was patched wasn't using any partial > > > buffers, > > > it was always allocating a new buffer...
If the code in > > > _read/_readlinks starts using a previous mbuf chain, then > > > obviously > > > things are different and I'd agree, always allocating a 2k/4k > > > cluster makes sense... > > > > > Yes, but nd_mb and nd_bpos are set, which means subsequent replies > > can > > use the remainder of the cluster. > > Couldn't we scan the list of replies, find out how much data we need, > m_getm the space for it all (which will use 4k clusters as > necessary)? > The NFSv4 server parses the compound as it processes it. It must keep things like current-filehandle and saved-filehandle between RPCs and things like the attributes are a lot of work to parse, so I don't think two passes through a request is warranted. Also, there is no way of knowing how big a reply is until you execute the reply, although you can "guess" at it. I never intended to imply that the patch I emailed is ready for head. It does 4K clusters for all RPCs, even ones known to be small (as in client side Getattr/Lookup requests). Since messages are sent quickly and then mbufs released, except for the DRC in the server, I think large allocations for server replies that may be cached are the case to try and avoid. Fortunately the large replies will be for read and readdir and these don't need to be cached by the DRC. As such, a patch that uses 4K clusters in the server for read, readdir and 4K clusters for write requests in the client, should be appropriate, I think? (And, yes, I think you are correct that readlink is better off with a MCLBYTES cluster.) The coding is straightforward, but the patch will be fairly large, since readdir in the server uses NFSM_BUILD(), that in turn uses NFSMCLGET(). These will need an extra "do a big cluster" argument. For initial testing, it was just simpler to make them all big. rick > > Why does it always allocate a new cluster? Well, because the code > > is > > OLD.
It was written for OpenBSD 2.6 and, at that time, I tried to > > make > > it portable across the BSDen. I'm not so concerned w.r.t. its > > portability > > now, since no one else is porting it and I don't plan to, but I > > still > > think it would be nice if it were portable to other BSDen. > > Back when I wrote it, I believe that MCLBYTES was 1K and an entire > > cluster was needed. (To be honest, I found out that FreeBSD's > > MCLBYTES > > is 2K about 2 days ago, when I started looking at this stuff.) > > > > Could it now look to see if enough bytes (a little over 1K) were > > available > > in the current cluster and use that. Yes, but it would reduce the > > portability > > of the code and I don't think it would make a measurable difference > > performance > > wise. > > Are you sure it would reduce the portability? I can't think of a way > it would... Some code will always need to be written for > portability.. > Well, I had it ported to OpenBSD, FreeBSD6 and Mac OS X 10.3 by using the NFSMCLGET() macro. If it uses things like m_getm2() and separate uma zones, I don't know how much extra work would be needed for other BSDen, since I have no idea which BSDen have these things? > > > > My test server has 256Mbytes of ram and it certainly doesn't > > > > show > > > > any improvement (big surprise;-), but it also doesn't show any > > > > degradation for the limited testing I've done. > > > > > > I'm not too surprised, unless you're on a heavy server pushing > > > >200MB/sec, the allocation cost is probably cheap enough that it > > > doesn't show up... going to 4k means immediately half as many > > > mbufs > > > are needed/allocated, and as they are page sized, don't have the > > > problems of physical memory fragmentation, nor do they have to do > > > an > > > IPI/tlb shoot down in the case of multipage allocations... (I'm > > > dealing w/ this for geli.)
> > > > > Yes, Garrett Wollman proposed this and I suspect there might be a > > performance gain for larger systems. He has a more involved patch. > > To be honest, if Garrett is convinced that his patch is of benefit > > performance wise, I will do a separate posting w.r.t. it and > > whether > > or not it is appropriate to be committed to head, etc. > > > > > > Again, my main interest at this point is whether reducing the > > > > number of mbufs in the chain fixes the TSO issues. I think > > > > the question of whether or not 4K clusters are performance > > > > improvement in general, is an interesting one that comes later. > > > > > > Another thing I noticed is that we are getting an mbuf and then > > > allocating a cluster... Is there a reason we aren't using > > > something > > > like m_getm or m_getcl? We have a special uma zone that has > > > mbuf and mbuf cluster already paired meaning we save some lock > > > operations for each segment allocated... > > > > > See above w.r.t. OLD portable code. There was a time when MGETCL() > > wasn't guaranteed to succeed even when M_WAITOK is specified. > > This is also why there is that weird loop in the NFSMCLGET() macro. > > Correct, but as you wrapped them in NFS* macros, it doesn't mean you > can't merge the MGETCL w/ NFSMCLGET into a new function that merges > the two... It's just another (not too difficult) wrapper that the > porter has to write... > > Though apparently portability has been given up since you use MCLGET > directly in nfsserver/nfs_nfsdport.c instead of NFSMCLGET... > > Sounds like nfsport.h needs some updating.... > The files with "port" in the names are re-written for each port. They were generated by cribbing code from the extant client/server. (Without looking, I'd guess you find MGET(), MCLGET() in the old FreeBSD server, or maybe it was inherited from OpenBSD 2.6.) Everything can be re-written, but why do so if the old code still works. 
I'm one guy who does this as a spare time unpaid hobby and I'm working on 4.1 server code these days. > > (I think there was a time in FreeBSD's past when allocation was > > never > > guaranteed and the rest of the code doesn't tolerate a NULL mbuf > > ptr. > > Something like M_TRYWAIT in old versions of FreeBSD?) > > Correct, there was a time that M_WAITOK could still return, but it > was > many years ago and many releases ago... > > > Btw, Garrett Wollman's patch uses m_getm2() to get the mbuf list. > > Interestingly, m_getm2 will use 4k clusters as necessary, and in > the _readlink case, do the correct thing... > > Hmmm... m_getm2 isn't documented... It was added by andre almost 7 > years ago... It does appear to be a public interface as ofed, sctp > iscsi and ng(_tty) all use it, though only sctp appears to use it any > differently than m_getm.. The rest could simply use m_getm instead > of m_getm2... Considering it was committed the day before SCTP was > committed, I'm not too surprised... > > P.S. if someone wants to submit a patch to mbuf.9 to update the docs > that would be helpful... I'll review and commit... and m_append is > also undocumented... > -- > John-Mark Gurney Voice: +1 415 225 5579 > > "All that I will do, has been done, All that I have, has not." 
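[Archive editor's note: the mbuf-count arithmetic running through this exchange can be checked with a short calculation. This is a sketch of the accounting only, not FreeBSD code; it assumes MCLBYTES = 2048, MJUMPAGESIZE = 4096, one leading header mbuf, and the ~65680-byte 64K NFS request size quoted elsewhere in the thread.]

```python
import math

# Approximate size of a 64K NFS read reply/write request as quoted in
# the thread: 64K of file data plus RPC/NFS headers (~65680 bytes).
TOTAL_BYTES = 65680

def chain_length(cluster_size):
    """Cluster mbufs needed to hold TOTAL_BYTES, plus one leading
    header mbuf. A sketch of the accounting, not the real allocator."""
    return 1 + math.ceil(TOTAL_BYTES / cluster_size)

mcl = chain_length(2048)       # MCLBYTES (2K) clusters
mjumpage = chain_length(4096)  # MJUMPAGESIZE (page-size, 4K) clusters

print(mcl, mjumpage)  # 34 and 18, matching the counts in the thread
# 34 exceeds the 32-33 element scatter/gather limit mentioned for some
# TSO drivers, forcing m_defrag()/m_collapse(); 18 fits comfortably.
```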
> _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 15:37:39 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E553CE08 for ; Tue, 28 Jan 2014 15:37:38 +0000 (UTC) Received: from kabab.cs.huji.ac.il (kabab.cs.huji.ac.il [132.65.116.12]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 687CB1882 for ; Tue, 28 Jan 2014 15:37:38 +0000 (UTC) Received: from th-04.cs.huji.ac.il ([132.65.80.125]) by kabab.cs.huji.ac.il with esmtp id 1W8AjD-0004Ad-30; Tue, 28 Jan 2014 17:37:31 +0200 Mime-Version: 1.0 (Mac OS X Mail 7.1 \(1827\)) Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
From: Daniel Braniss In-Reply-To: <482557096.17290094.1390873872231.JavaMail.root@uoguelph.ca> Date: Tue, 28 Jan 2014 17:37:20 +0200 Message-Id: <59178C23-A863-40AF-922E-C0A16D12ECE9@cs.huji.ac.il> References: <482557096.17290094.1390873872231.JavaMail.root@uoguelph.ca> To: Rick Macklem X-Mailer: Apple Mail (2.1827) Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.17 Cc: Pyun YongHyeon , FreeBSD Net , Adam McDougall , Jack Vogel X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 15:37:39 -0000 On Jan 28, 2014, at 3:51 AM, Rick Macklem wrote: > Jack Vogel wrote: >> That header file is for the VF driver :) which I don't believe is >> being >> used in this case. >> The driver is capable of handling 256K but its limited by the stack >> to 64K >> (look in >> ixgbe.h), so its not a few bytes off due to the vlan header. >> >> The scatter size is not an arbitrary one, its due to hardware >> limitations >> in Niantic >> (82599). Turning off TSO in the 10G environment is not practical, >> you will >> have >> trouble getting good performance. >> >> Jack >> > Well, if you look at this thread, Daniel got much better performance > by turning off TSO. However, I agree that this is not an ideal solution. > http://docs.FreeBSD.org/cgi/mid.cgi?2C287272-7B57-4AAD-B22F-6A65D9F8677B > > rick > >> >> >> On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN >> wrote: >> >>> On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote: >>>> pyunyh@gmail.com wrote: >>>>> On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote: >>>>>> Adam McDougall wrote: >>>>>>> Also try rsize=32768,wsize=32768 in your mount options, >>>>>>> made a >>>>>>> huge >>>>>>> difference for me.
I've noticed slow file transfers on NFS >>>>>>> in 9 >>>>>>> and >>>>>>> finally did some searching a couple months ago, someone >>>>>>> suggested >>>>>>> it >>>>>>> and >>>>>>> they were on to something. >>>>>>> >>>>>> I have a "hunch" that might explain why 64K NFS reads/writes >>>>>> perform >>>>>> poorly for some network environments. >>>>>> A 64K NFS read reply/write request consists of a list of 34 >>>>>> mbufs >>>>>> when >>>>>> passed to TCP via sosend() and a total data length of around >>>>>> 65680 bytes. >>>>>> Looking at a couple of drivers (virtio and ixgbe), they seem >>>>>> to >>>>>> expect >>>>>> no more than 32-33 mbufs in a list for a 65535 byte TSO xmit. >>>>>> I >>>>>> think >>>>>> (I don't have anything that does TSO to confirm this) that >>>>>> NFS will >>>>>> pass >>>>>> a list that is longer (34 plus a TCP/IP header). >>>>>> At a glance, it appears that the drivers call m_defrag() or >>>>>> m_collapse() >>>>>> when the mbuf list won't fit in their scatter table (32 or 33 >>>>>> elements) >>>>>> and if this fails, just silently drop the data without >>>>>> sending it. >>>>>> If I'm right, there would be considerable overhead from >>>>>> m_defrag()/m_collapse() >>>>>> and near disaster if they fail to fix the problem and the >>>>>> data is >>>>>> silently >>>>>> dropped instead of xmited. >>>>>> >>>>> >>>>> I think the actual number of DMA segments allocated for the >>>>> mbuf >>>>> chain is determined by bus_dma(9). bus_dma(9) will coalesce >>>>> current segment with previous segment if possible. >>>>> >>>> Ok, I'll have to take a look, but I thought that an array sized >>>> by "num_segs" is passed in as an argument. (And num_segs is set >>>> to >>>> either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).) >>>> It looked to me that the ixgbe driver called itself ix, so it >>>> isn't >>>> obvious to me which we are talking about.
(I know that Daniel >>>> Braniss >>>> had an ix0 and ix1, which were fixed for NFS by disabling TSO.) >>>> >>> >>> It's ix(4). ixgb(4) is a different driver. >>> this brings a sore problem, in 9.2-stable there is no man page for ix. also, the man page for ixgbe does not mention the 82599EB; the only way I know it's the ixgbe is because i did: pciconf -lv then grep -r 82599EB sys/dev and found the driver source. I will try rick's patch over the weekend. danny >>>> I'll admit I mostly looked at virtio's network driver, since that >>>> was the one being used by J David. >>>> >>>> Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have >>>> been >>>> cropping up for quite a while, and I am just trying to find out >>>> why. >>>> (I have no hardware/software that exhibits the problem, so I can >>>> only look at the sources and ask others to try testing stuff.) >>>> >>>>> I'm not sure whether you're referring to ixgbe(4) or ix(4) but >>>>> I >>>>> see the total length of all segment size of ix(4) is 65535 so >>>>> it has no room for ethernet/VLAN header of the mbuf chain. The >>>>> driver should be fixed to transmit a 64KB datagram. >>>> Well, if_hw_tsomax is set to 65535 by the generic code (the >>>> driver >>>> doesn't set it) and the code in tcp_output() seems to subtract >>>> the >>>> size of a tcp/ip header from that before passing data to the >>>> driver, >>>> so I think the mbuf chain passed to the driver will fit in one >>>> ip datagram. (I'd assume all sorts of stuff would break for TSO >>>> enabled drivers if that wasn't the case?) >>> >>> I believe the generic code is doing right. I'm under the >>> impression the non-working TSO indicates a bug in driver. Some >>> drivers didn't account for additional ethernet/VLAN header so the >>> total size of DMA segments exceeded 65535. I've attached a diff >>> for ix(4). It wasn't tested at all as I don't have hardware to >>> test.
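[Archive editor's note: Pyun's point about the unaccounted ethernet/VLAN header can be made concrete with some hedged arithmetic. Header sizes below are the standard untagged/tagged IPv4/TCP values; 65535 is IP_MAXPACKET, the default if_hw_tsomax set by the generic stack as discussed above.]

```python
IF_HW_TSOMAX = 65535   # default TSO limit from the generic stack (IP_MAXPACKET)
TCPIP_HDR = 40         # IPv4 (20) + TCP (20) headers, no options
ETHER_HDR = 14         # ethernet header
VLAN_TAG = 4           # optional 802.1Q tag

# tcp_output() subtracts the TCP/IP header before passing data down, so
# the largest chain handed to the driver is one full IP datagram:
tso_payload = IF_HW_TSOMAX - TCPIP_HDR   # 65495 bytes of TCP payload

# On the wire, however, the NIC must also DMA the link-layer header, so
# the total the driver's DMA segments must cover is:
wire_total = IF_HW_TSOMAX + ETHER_HDR + VLAN_TAG   # 65553 bytes

# A driver whose DMA segments may total only 65535 bytes therefore
# overflows unless it accounts for the ethernet/VLAN header itself:
print(wire_total - IF_HW_TSOMAX)  # 18 bytes over the limit
```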
>>> >>>> >>>>> I think the use of m_defrag(9) in TSO is suboptimal. All TSO >>>>> capable controllers are able to handle multiple TX buffers so >>>>> it >>>>> should have used m_collapse(9) rather than copying entire chain >>>>> with m_defrag(9). >>>>> >>>> I haven't looked at these closely yet (plan on doing so to-day), >>>> but >>>> even m_collapse() looked like it copied data between mbufs and >>>> that >>>> is certainly suboptimal, imho. I don't see why a driver can't >>>> split >>>> the mbuf list, if there are too many entries for the >>>> scatter/gather >>>> and do it in two iterations (much like tcp_output() does already, >>>> since the data length exceeds 65535 - tcp/ip header size). >>>> >>> >>> It can split the mbuf list if controllers support increased number >>> of TX buffers. Because controller shall consume the same number of >>> DMA descriptors for the mbuf list, drivers tend to impose a limit >>> on the number of TX buffers to save resources. >>> >>>> However, at this point, I just want to find out if the long chain >>>> of mbufs is why TSO is problematic for these drivers, since I'll >>>> admit I'm getting tired of telling people to disable TSO (and I >>>> suspect some don't believe me and never try it). >>>> >>> >>> TSO capable controllers tend to have various limitations (the first >>> TX buffer should have complete ethernet/IP/TCP header, ip_len of IP >>> header should be reset to 0, TCP pseudo checksum should be >>> recomputed etc) and cheap controllers need more assistance from >>> driver to let its firmware know various IP/TCP header offset >>> location in the mbuf. Because this requires IP/TCP header >>> parsing, it's error prone and very complex. >>> >>>>>> Anyhow, I have attached a patch that makes NFS use >>>>>> MJUMPAGESIZE >>>>>> clusters, >>>>>> so the mbuf count drops from 34 to 18. >>>>>> >>>>> >>>>> Could we make it conditional on size? >>>>> >>>> Not sure what you mean?
If you mean "the size of the read/write", >>>> that would be possible for NFSv3, but less so for NFSv4. (The >>>> read/write >>>> is just one Op. in the compound for NFSv4 and there is no way to >>>> predict how much more data is going to be generated by subsequent >>>> Ops.) >>>> >>> >>> Sorry, I should have been clearer. You already answered my >>> question. Thanks. >>> >>>> If by "size" you mean amount of memory in the machine then, yes, >>>> it >>>> certainly could be conditional on that. (I plan to try and look >>>> at >>>> the allocator to-day as well, but if others know of disadvantages >>>> with >>>> using MJUMPAGESIZE instead of MCLBYTES, please speak up.) >>>> >>>> Garrett Wollman already alluded to the MCLBYTES case being >>>> pre-allocated, >>>> but I'll admit I have no idea what the implications of that are >>>> at this >>>> time. >>>> >>>>>> If anyone has a TSO scatter/gather enabled net interface and >>>>>> can >>>>>> test this >>>>>> patch on it with NFS I/O (default of 64K rsize/wsize) when >>>>>> TSO is >>>>>> enabled >>>>>> and see what effect it has, that would be appreciated. >>>>>> >>>>>> Btw, thanks go to Garrett Wollman for suggesting the change >>>>>> to >>>>>> MJUMPAGESIZE >>>>>> clusters. >>>>>> >>>>>> rick >>>>>> ps: If the attachment doesn't make it through and you want >>>>>> the >>>>>> patch, just >>>>>> email me and I'll send you a copy.
>>>>>>=20 >>>=20 >>> _______________________________________________ >>> freebsd-net@freebsd.org mailing list >>> http://lists.freebsd.org/mailman/listinfo/freebsd-net >>> To unsubscribe, send any mail to >>> "freebsd-net-unsubscribe@freebsd.org" >>>=20 >> _______________________________________________ >> freebsd-net@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-net >> To unsubscribe, send any mail to >> "freebsd-net-unsubscribe@freebsd.org" From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 17:17:46 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id C71FDDDB for ; Tue, 28 Jan 2014 17:17:46 +0000 (UTC) Received: from mail-vc0-x234.google.com (mail-vc0-x234.google.com [IPv6:2607:f8b0:400c:c03::234]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 7C1A412E9 for ; Tue, 28 Jan 2014 17:17:46 +0000 (UTC) Received: by mail-vc0-f180.google.com with SMTP id ks9so438281vcb.11 for ; Tue, 28 Jan 2014 09:17:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=berentweb.com; s=google; h=mime-version:reply-to:sender:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=SnwwIytR+mJVZbBac9wiD0A0jNZD0tbQPgMJHOblfh0=; b=emZx+8wo3aeOGoIxNTjKqGsr2NdPCatFdyx6VPLmq1NPO6r5blX2A8mR6b7ZpVaENk 4ohNYdu7Qjl0mgdxDp83Z7qMSd+Rox4nOXL9m1uPfubxlIl5UBfz5XMszpJlAcKVNMZt 1ekxhTRpoLfrCfZGib2enBEDvAJIRvRvBEkek= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:reply-to:sender:in-reply-to :references:date:message-id:subject:from:to:cc:content-type; bh=SnwwIytR+mJVZbBac9wiD0A0jNZD0tbQPgMJHOblfh0=; b=h4PNqRXJbZIxlR4HY5fR2OWpRe+sspZfBc4q6e9RFX743OGYdQIYHnRmyx6iCigKtp 
7E/OwgZdRqtBsWRiY/Q83+FEu1o/BZ380yfCelgDqeRYZD0/jHOnbnIWdl/tqeTNajG1 TMghJmfoxTPMt7UCd8mMWgdFZ8+hOTfOOWDbj3y7K1mcmhb4kLXKXI2YaFvoM2gLayPr g/7l+JaZRl2yXfjb2r4EJBKbC0bnlHsx/Xb631AhNkS68OgyVKFmE7SaBWknNDf047e3 H21LV2XIzdP1O00eEdZ0MtONQU7jQqL6CesQpzeZ5B6RNmZn+uLHPKXPOaMh8BF42SA2 oIKg== X-Gm-Message-State: ALoCoQlhuPj+UdBYzwEmkQdiCWpgJU4ANsqFGNq5Q26SFjw5kZU48VRx3jwvQu3LGW+ExHzuy5oI MIME-Version: 1.0 X-Received: by 10.52.116.71 with SMTP id ju7mr954206vdb.31.1390929465500; Tue, 28 Jan 2014 09:17:45 -0800 (PST) Sender: rsb@berentweb.com Received: by 10.220.146.145 with HTTP; Tue, 28 Jan 2014 09:17:45 -0800 (PST) X-Originating-IP: [83.66.215.241] In-Reply-To: <52E7A9D8.30604@freebsd.org> References: <1390909590119-5880672.post@n5.nabble.com> <52E7A9D8.30604@freebsd.org> Date: Tue, 28 Jan 2014 19:17:45 +0200 X-Google-Sender-Auth: -IV9VPWfFpVpJ8UvXMb-CBHEQ0U Message-ID: Subject: Re: Jails on fib problem From: Beeblebrox To: Julian Elischer Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list Reply-To: zaphod@berentweb.com List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 17:17:46 -0000 > what's a fib device? Do you mean each jail has a different default fib? > you are not using vimage jails? Hi Julian. * No vimage * All jails use the same fib. /etc/rc.conf: cloned_interfaces="lo2" ifconfig_lo2="inet 127.0.1.1/28" static_routes="jail default" route_jail="default 127.0.1.1 -fib 1" route_default="default 192.168.1.1" > so they all have the same address?? can you even do that? or you mean that > they all have the same default route? 
I mean same default route, jail IP's start from 127.0.1.2/32 and go to 127.0.1.6/32 jail.conf assigns fib with "exec.fib = 1;" jails on the 127.0.1.1/28 subnet range should be able to route traffic through the 127.0.0.1 gateway regardless of the fact that the jails themselves reside on a /32 subnet. However, it's not working smoothly > fibs don't have devices. Yes, I know - a misnomer.

setfib 1 netstat -rn
Destination        Gateway      Flags    Netif  Expire
default            127.0.1.1    UGS      lo2
127.0.0.1          link#3       UH       lo0
127.0.1.1          link#4       UH       lo2
127.0.1.2          link#4       UH       lo2
127.0.1.3          link#4       UH       lo2
127.0.1.4          link#4       UH       lo2
192.168.1.0/24     link#1       U        re0  (Ext_If)
192.168.2.0/26     link#2       U        re1  (Lan_If)

To complicate things further, I also have a vboxnet0 for VBox guests. 127.0.1.2 is a dns jail for example. The Internal LAN clients, vboxnet0 guests and lo0 need to resolve names from that jail. From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 19:57:00 2014 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id ED002CA9; Tue, 28 Jan 2014 19:57:00 +0000 (UTC) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id BD7F31353; Tue, 28 Jan 2014 19:57:00 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id s0SJv0oP089165; Tue, 28 Jan 2014 19:57:00 GMT (envelope-from jmg@freefall.freebsd.org) Received: (from jmg@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id s0SJv0WX089164; Tue, 28 Jan 2014 19:57:00 GMT (envelope-from jmg) Date: Tue, 28 Jan 2014 19:57:00 GMT Message-Id: <201401281957.s0SJv0WX089164@freefall.freebsd.org> To: jmg@FreeBSD.org,
freebsd-net@FreeBSD.org, jvf@FreeBSD.org From: jmg@FreeBSD.org Subject: Re: kern/176446: [netinet] [patch] Concurrency in ixgbe driving out-of-order packet process and spurious RST X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 19:57:01 -0000 Synopsis: [netinet] [patch] Concurrency in ixgbe driving out-of-order packet process and spurious RST Responsible-Changed-From-To: freebsd-net->jvf Responsible-Changed-By: jmg Responsible-Changed-When: Tue Jan 28 19:56:21 UTC 2014 Responsible-Changed-Why: assign this to Jack so he gets bugged about it weekly.. :) http://www.freebsd.org/cgi/query-pr.cgi?pr=176446 From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 20:29:12 2014 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id D179D7EC; Tue, 28 Jan 2014 20:29:12 +0000 (UTC) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id A3D311645; Tue, 28 Jan 2014 20:29:12 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id s0SKTCNp097102; Tue, 28 Jan 2014 20:29:12 GMT (envelope-from jmg@freefall.freebsd.org) Received: (from jmg@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id s0SKTBfS097101; Tue, 28 Jan 2014 20:29:11 GMT (envelope-from jmg) Date: Tue, 28 Jan 2014 20:29:11 GMT Message-Id: <201401282029.s0SKTBfS097101@freefall.freebsd.org> To: sysop@prisjakt.nu, jmg@FreeBSD.org, freebsd-net@FreeBSD.org, 
freebsd-j@FreeBSD.org From: jmg@FreeBSD.org Subject: Re: kern/179299: [igb] Intel X540-T2 - unstable driver X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 20:29:12 -0000 Synopsis: [igb] Intel X540-T2 - unstable driver State-Changed-From-To: open->closed State-Changed-By: jmg State-Changed-When: Tue Jan 28 20:27:25 UTC 2014 State-Changed-Why: looks like you don't have the hardware anymore.. if you can reproduce, we can open this up again.. Responsible-Changed-From-To: freebsd-net->freebsd-j Responsible-Changed-By: jmg Responsible-Changed-When: Tue Jan 28 20:27:25 UTC 2014 Responsible-Changed-Why: looks like you don't have the hardware anymore.. if you can reproduce, we can open this up again.. http://www.freebsd.org/cgi/query-pr.cgi?pr=179299 From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 20:30:01 2014 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 65C59887 for ; Tue, 28 Jan 2014 20:30:01 +0000 (UTC) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 438291654 for ; Tue, 28 Jan 2014 20:30:01 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id s0SKU17c097274 for ; Tue, 28 Jan 2014 20:30:01 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id s0SKU15n097273; Tue, 28 Jan 2014 20:30:01 GMT (envelope-from gnats) Date: 
Tue, 28 Jan 2014 20:30:01 GMT Message-Id: <201401282030.s0SKU15n097273@freefall.freebsd.org> To: freebsd-net@FreeBSD.org Cc: From: dfilter@FreeBSD.ORG (dfilter service) Subject: Re: kern/183659: commit references a PR X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list Reply-To: dfilter service List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 20:30:01 -0000 The following reply was made to PR kern/183659; it has been noted by GNATS. From: dfilter@FreeBSD.ORG (dfilter service) To: bug-followup@FreeBSD.org Cc: Subject: Re: kern/183659: commit references a PR Date: Tue, 28 Jan 2014 20:28:45 +0000 (UTC)

Author: gnn
Date: Tue Jan 28 20:28:32 2014
New Revision: 261242
URL: http://svnweb.freebsd.org/changeset/base/261242

Log:
  Decrease lock contention within the TCP accept case by removing
  the INP_INFO lock from tcp_usr_accept. As the PR/patch states
  this was following the advice already in the code. See the PR
  below for a full discussion of this change and its measured effects.

  PR: 183659
  Submitted by: Julian Charbon
  Reviewed by: jhb

Modified:
  head/sys/netinet/tcp_syncache.c
  head/sys/netinet/tcp_usrreq.c

Modified: head/sys/netinet/tcp_syncache.c
==============================================================================
--- head/sys/netinet/tcp_syncache.c	Tue Jan 28 19:12:31 2014	(r261241)
+++ head/sys/netinet/tcp_syncache.c	Tue Jan 28 20:28:32 2014	(r261242)
@@ -682,7 +682,7 @@ syncache_socket(struct syncache *sc, str
 	 * connection when the SYN arrived.  If we can't create
 	 * the connection, abort it.
 	 */
-	so = sonewconn(lso, SS_ISCONNECTED);
+	so = sonewconn(lso, 0);
 	if (so == NULL) {
 		/*
 		 * Drop the connection; we will either send a RST or
@@ -922,6 +922,8 @@ syncache_socket(struct syncache *sc, str
 
 	INP_WUNLOCK(inp);
 
+	soisconnected(so);
+
 	TCPSTAT_INC(tcps_accepts);
 	return (so);

Modified: head/sys/netinet/tcp_usrreq.c
==============================================================================
--- head/sys/netinet/tcp_usrreq.c	Tue Jan 28 19:12:31 2014	(r261241)
+++ head/sys/netinet/tcp_usrreq.c	Tue Jan 28 20:28:32 2014	(r261242)
@@ -610,13 +610,6 @@ out:
 /*
  * Accept a connection.  Essentially all the work is done at higher levels;
  * just return the address of the peer, storing through addr.
- *
- * The rationale for acquiring the tcbinfo lock here is somewhat complicated,
- * and is described in detail in the commit log entry for r175612.  Acquiring
- * it delays an accept(2) racing with sonewconn(), which inserts the socket
- * before the inpcb address/port fields are initialized.  A better fix would
- * prevent the socket from being placed in the listen queue until all fields
- * are fully initialized.
 */
static int
tcp_usr_accept(struct socket *so, struct sockaddr **nam)
@@ -633,7 +626,6 @@ tcp_usr_accept(struct socket *so, struct
 	inp = sotoinpcb(so);
 	KASSERT(inp != NULL, ("tcp_usr_accept: inp == NULL"));
-	INP_INFO_RLOCK(&V_tcbinfo);
 	INP_WLOCK(inp);
 	if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
 		error = ECONNABORTED;
@@ -653,7 +645,6 @@ tcp_usr_accept(struct socket *so, struct
 out:
 	TCPDEBUG2(PRU_ACCEPT);
 	INP_WUNLOCK(inp);
-	INP_INFO_RUNLOCK(&V_tcbinfo);
 	if (error == 0)
 		*nam = in_sockaddr(port, &addr);
 	return error;

_______________________________________________ svn-src-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-all To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org" From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 20:30:04 2014 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4A6EB89E; Tue, 28 Jan 2014 20:30:04 +0000 (UTC) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 1B86D1657; Tue, 28 Jan 2014 20:30:04 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id s0SKU3KD097294; Tue, 28 Jan 2014 20:30:03 GMT (envelope-from jmg@freefall.freebsd.org) Received: (from jmg@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id s0SKU3Ji097293; Tue, 28 Jan 2014 20:30:03 GMT (envelope-from jmg) Date: Tue, 28 Jan 2014 20:30:03 GMT Message-Id: <201401282030.s0SKU3Ji097293@freefall.freebsd.org> To: jmg@FreeBSD.org, freebsd-j@FreeBSD.org, freebsd-net@FreeBSD.org From: jmg@FreeBSD.org Subject: Re: kern/179299: [igb] Intel
X540-T2 - unstable driver X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 20:30:04 -0000 Synopsis: [igb] Intel X540-T2 - unstable driver Responsible-Changed-From-To: freebsd-j->freebsd-net Responsible-Changed-By: jmg Responsible-Changed-When: Tue Jan 28 20:29:36 UTC 2014 Responsible-Changed-Why: fix responsible typo.. http://www.freebsd.org/cgi/query-pr.cgi?pr=179299 From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 20:33:03 2014 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 143EBB01; Tue, 28 Jan 2014 20:33:03 +0000 (UTC) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id DA09A16DB; Tue, 28 Jan 2014 20:33:02 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id s0SKX2hC099110; Tue, 28 Jan 2014 20:33:02 GMT (envelope-from gnn@freefall.freebsd.org) Received: (from gnn@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id s0SKX2W8099109; Tue, 28 Jan 2014 20:33:02 GMT (envelope-from gnn) Date: Tue, 28 Jan 2014 20:33:02 GMT Message-Id: <201401282033.s0SKX2W8099109@freefall.freebsd.org> To: jcharbon@verisign.com, gnn@FreeBSD.org, freebsd-net@FreeBSD.org From: gnn@FreeBSD.org Subject: Re: kern/183659: [tcp] ]TCP stack lock contention with short-lived connections X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 20:33:03 -0000 Synopsis: [tcp] ]TCP stack lock contention with short-lived connections State-Changed-From-To: open->patched State-Changed-By: gnn State-Changed-When: Tue Jan 28 20:32:31 UTC 2014 State-Changed-Why: Patched with commit 261242 http://www.freebsd.org/cgi/query-pr.cgi?pr=183659 From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 20:37:58 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 36574D0D for ; Tue, 28 Jan 2014 20:37:58 +0000 (UTC) Received: from na3sys010aog111.obsmtp.com (na3sys010aog111.obsmtp.com [74.125.245.90]) by mx1.freebsd.org (Postfix) with SMTP id A94931733 for ; Tue, 28 Jan 2014 20:37:57 +0000 (UTC) Received: from mail-ee0-f54.google.com ([74.125.83.54]) (using TLSv1) by na3sys010aob111.postini.com ([74.125.244.12]) with SMTP ID DSNKUugVJOaUD1rOkrp/4ioDVB5SfuULZzjs@postini.com; Tue, 28 Jan 2014 12:37:57 PST Received: by mail-ee0-f54.google.com with SMTP id e53so466820eek.41 for ; Tue, 28 Jan 2014 12:37:55 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:content-type:mime-version:subject:from :in-reply-to:date:cc:content-transfer-encoding:message-id:references :to; bh=jhNQVNyXS4+2PwXqSBVa5dKLEVylrIgl8Ul9GTsnEfY=; b=PAAdPKNT5U+emTnINI5UqCZ5EdVRmuAZoILpmyD0KZ2B07uMOIQQMXptB5kkudxYsi WDtApREoPqrBL/EGVAc3jpWFZanSDr/UF2yCrazJdpsn6Ip5gVdb+4loIBLjaQ08hG6Z WWg5bfKfUMKgdRBWlqJgfE8zl6JDB6Ux5l6oTtHLVVW9NxOhfhJNNGcyKE6Bl/Hb2Rig Zw7G1/x9LfzqRZc43L5i5eRLNYN+RBy1qgr1Kxj5wV7nkw5mrIvujnn/RExWiM9NiFJi VKMyQZ7Z7HXWvyWlGrLvJktx6GjVsYPuQSSp5B0LkBbaM1cmRFoTkjROKRF8D99B4NiD 0VYw== X-Received: by 10.14.211.71 with SMTP id v47mr3897742eeo.37.1390941101826; Tue, 28 
Jan 2014 12:31:41 -0800 (PST) X-Gm-Message-State: ALoCoQkVYtdDUdwoGgO/JEspVz5QxI9sLK8CiJfe8aXq2bclF7aDXCXNZZ6SMV71j6Yf47307kCy/GNBV/CUoTg53aKWk1IAc4sC8wNzqFbD+Cg7t59GnFvGlqBpeN1JcpTOOjQHIuzoaYXJOzgL8ThlTsSEVW/ns3iz6zWZW/ky88n/EahejkU= X-Received: by 10.14.211.71 with SMTP id v47mr3897732eeo.37.1390941101739; Tue, 28 Jan 2014 12:31:41 -0800 (PST) Received: from grey.home.unixconn.com (h-74-23.a183.priv.bahnhof.se. [46.59.74.23]) by mx.google.com with ESMTPSA id b41sm59859144eef.16.2014.01.28.12.31.39 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 28 Jan 2014 12:31:40 -0800 (PST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 7.1 \(1827\)) Subject: Re: kern/179299: [igb] Intel X540-T2 - unstable driver From: Maxim Bourmistrov In-Reply-To: <201401282029.s0SKTBfS097101@freefall.freebsd.org> Date: Tue, 28 Jan 2014 21:31:38 +0100 Content-Transfer-Encoding: 7bit Message-Id: <8AB7DCF4-BE5C-4F11-BB66-CED78A58F51A@prisjakt.nu> References: <201401282029.s0SKTBfS097101@freefall.freebsd.org> To: jmg@FreeBSD.org X-Mailer: Apple Mail (2.1827) Cc: freebsd-net@FreeBSD.org, freebsd-j@FreeBSD.org, sysop@prisjakt.nu X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 20:37:58 -0000 Agree. On 28 jan 2014, at 21:29, jmg@FreeBSD.org wrote: > Synopsis: [igb] Intel X540-T2 - unstable driver > > State-Changed-From-To: open->closed > State-Changed-By: jmg > State-Changed-When: Tue Jan 28 20:27:25 UTC 2014 > State-Changed-Why: > looks like you don't have the hardware anymore.. if you can reproduce, > we can open this up again.. > > > Responsible-Changed-From-To: freebsd-net->freebsd-j > Responsible-Changed-By: jmg > Responsible-Changed-When: Tue Jan 28 20:27:25 UTC 2014 > Responsible-Changed-Why: > looks like you don't have the hardware anymore.. 
if you can reproduce, > we can open this up again.. > > http://www.freebsd.org/cgi/query-pr.cgi?pr=179299 From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 21:00:01 2014 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id C3F6D357 for ; Tue, 28 Jan 2014 21:00:01 +0000 (UTC) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id AFD2E18CD for ; Tue, 28 Jan 2014 21:00:01 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id s0SL01iC004063 for ; Tue, 28 Jan 2014 21:00:01 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id s0SL01HY004062; Tue, 28 Jan 2014 21:00:01 GMT (envelope-from gnats) Date: Tue, 28 Jan 2014 21:00:01 GMT Message-Id: <201401282100.s0SL01HY004062@freefall.freebsd.org> To: freebsd-net@FreeBSD.org Cc: From: Vlad Movchan Subject: Re: kern/165622: [ndis][panic][patch] Unregistered use of FPU in kernel on amd64 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list Reply-To: Vlad Movchan List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 21:00:01 -0000 The following reply was made to PR kern/165622; it has been noted by GNATS. 
From: Vlad Movchan To: bug-followup@FreeBSD.org, Vlad Movchan Cc: Subject: Re: kern/165622: [ndis][panic][patch] Unregistered use of FPU in kernel on amd64 Date: Tue, 28 Jan 2014 22:56:22 +0200 --001a11343fbc90848404f10e0ef4 Content-Type: multipart/alternative; boundary=001a11343fbc90847f04f10e0ef2 --001a11343fbc90847f04f10e0ef2 Content-Type: text/plain; charset=ISO-8859-1 Here is a corrected patch. Previous version could not be compiled on i386.
--001a11343fbc90847f04f10e0ef2-- --001a11343fbc90848404f10e0ef4 Content-Type: text/plain; charset=US-ASCII; name="fpu_patch3.txt" Content-Disposition: attachment; filename="fpu_patch3.txt" Content-Transfer-Encoding: base64 X-Attachment-Id: f_hqzly1yi0 SW5kZXg6IHN5cy9jb21wYXQvbmRpcy9rZXJuX3dpbmRydi5jCj09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT0KLS0tIHN5cy9j b21wYXQvbmRpcy9rZXJuX3dpbmRydi5jCShyZXZpc2lvbiAyNjEyMzkpCisrKyBzeXMvY29tcGF0 L25kaXMva2Vybl93aW5kcnYuYwkod29ya2luZyBjb3B5KQpAQCAtNTYsNiArNTYsMTAgQEAKICNp bmNsdWRlIDxtYWNoaW5lL3NlZ21lbnRzLmg+CiAjZW5kaWYKIAorI2lmZGVmIF9fYW1kNjRfXwor I2luY2x1ZGUgPG1hY2hpbmUvZnB1Lmg+CisjZW5kaWYKKwogI2luY2x1ZGUgPGRldi91c2IvdXNi Lmg+CiAKICNpbmNsdWRlIDxjb21wYXQvbmRpcy9wZV92YXIuaD4KQEAgLTY2LDYgKzcwLDE2IEBA CiAjaW5jbHVkZSA8Y29tcGF0L25kaXMvaGFsX3Zhci5oPgogI2luY2x1ZGUgPGNvbXBhdC9uZGlz L3VzYmRfdmFyLmg+CiAKKyNpZmRlZiBfX2FtZDY0X18KK3N0cnVjdCBmcHVfY2NfZW50IHsKKwlj aGFyCQl1c2VkOworCXN0cnVjdCBmcHVfa2Vybl9jdHggKmN0eDsKKwlTTElTVF9FTlRSWShmcHVf Y2NfZW50KSBsaW5rOworfTsKK3N0YXRpYyBTTElTVF9IRUFEKGZwdV9jdHhfY2FjaGUsIGZwdV9j Y19lbnQpIGZwdV9jY19oZWFkOworc3RhdGljIHN0cnVjdCBtdHggZnB1X2NhY2hlX210eDsKKyNl bmRpZgorCiBzdGF0aWMgc3RydWN0IG10eCBkcnZkYl9tdHg7CiBzdGF0aWMgU1RBSUxRX0hFQUQo ZHJ2ZGIsIGRydmRiX2VudCkgZHJ2ZGJfaGVhZDsKIApAQCAtOTYsNiArMTEwLDExIEBACiAJbXR4 X2luaXQoJmRydmRiX210eCwgIldpbmRvd3MgZHJpdmVyIERCIGxvY2siLAogCSAgICAiV2luZG93 cyBpbnRlcm5hbCBsb2NrIiwgTVRYX0RFRik7CiAKKyNpZmRlZiBfX2FtZDY0X18KKwlTTElTVF9J TklUKCZmcHVfY2NfaGVhZCk7CisJbXR4X2luaXQoJmZwdV9jYWNoZV9tdHgsICJmcHUgY29udGV4 dCBjYWNoZSBsb2NrIiwgTlVMTCwgTVRYX0RFRik7CisjZW5kaWYKKwogCS8qCiAJICogUENJIGFu ZCBwY2NhcmQgZGV2aWNlcyBkb24ndCBuZWVkIHRvIHVzZSBJUlBzIHRvCiAJICogaW50ZXJhY3Qg d2l0aCB0aGVpciBidXMgZHJpdmVycyAodXN1YWxseSksIHNvIG91cgpAQCAtMTMwLDYgKzE0OSw5 IEBACiB3aW5kcnZfbGliZmluaSh2b2lkKQogewogCXN0cnVjdCBkcnZkYl9lbnQJKmQ7CisjaWZk ZWYgX19hbWQ2NF9fCisJc3RydWN0IGZwdV9jY19lbnQgKmVudDsKKyNlbmRpZgogCiAJbXR4X2xv 
Y2soJmRydmRiX210eCk7IAogCXdoaWxlKFNUQUlMUV9GSVJTVCgmZHJ2ZGJfaGVhZCkgIT0gTlVM TCkgewpAQCAtMTQ4LDYgKzE3MCwxNSBAQAogCXNtcF9yZW5kZXp2b3VzKE5VTEwsIHg4Nl9vbGRs ZHQsIE5VTEwsIE5VTEwpOwogCUV4RnJlZVBvb2wobXlfdGlkcyk7CiAjZW5kaWYKKyNpZmRlZiBf X2FtZDY0X18KKwl3aGlsZSAoKGVudCA9IFNMSVNUX0ZJUlNUKCZmcHVfY2NfaGVhZCkpICE9IE5V TEwpIHsKKwkJU0xJU1RfUkVNT1ZFX0hFQUQoJmZwdV9jY19oZWFkLCBsaW5rKTsKKwkJZnB1X2tl cm5fZnJlZV9jdHgoZW50LT5jdHgpOworCQlmcmVlKGVudCwgTV9ERVZCVUYpOworCX0KKworCW10 eF9kZXN0cm95KCZmcHVfY2FjaGVfbXR4KTsKKyNlbmRpZgogCXJldHVybiAoMCk7CiB9CiAKQEAg LTYxMyw2ICs2NDQsMTQyIEBACiAKIAlyZXR1cm4gKDApOwogfQorCitzdGF0aWMgc3RydWN0IGZw dV9jY19lbnQgKgorcmVxdWVzdF9mcHVfY2NfZW50KHZvaWQpCit7CisJc3RydWN0IGZwdV9jY19l bnQgKmVudDsKKworCW10eF9sb2NrKCZmcHVfY2FjaGVfbXR4KTsKKwlTTElTVF9GT1JFQUNIKGVu dCwgJmZwdV9jY19oZWFkLCBsaW5rKSB7CisJCWlmKGVudC0+dXNlZCA9PSAwKSB7CisJCQllbnQt PnVzZWQgPSAxOworCQkJbXR4X3VubG9jaygmZnB1X2NhY2hlX210eCk7CisJCQlyZXR1cm4gKGVu dCk7CisJCX0KKwl9CisJbXR4X3VubG9jaygmZnB1X2NhY2hlX210eCk7CisKKwlpZiAoKGVudCA9 IG1hbGxvYyhzaXplb2Yoc3RydWN0IGZwdV9jY19lbnQpLCBNX0RFVkJVRiwgTV9OT1dBSVQgfAor CSAgICBNX1pFUk8pKSAhPSBOVUxMKSB7CisJCWVudC0+Y3R4ID0gZnB1X2tlcm5fYWxsb2NfY3R4 KEZQVV9LRVJOX05PUk1BTCB8CisJCSAgICBGUFVfS0VSTl9OT1dBSVQpOworCQlpZiAoZW50LT5j dHggIT0gTlVMTCkgeworCQkJZW50LT51c2VkID0gMTsKKwkJCW10eF9sb2NrKCZmcHVfY2FjaGVf bXR4KTsKKwkJCVNMSVNUX0lOU0VSVF9IRUFEKCZmcHVfY2NfaGVhZCwgZW50LCBsaW5rKTsKKwkJ CW10eF91bmxvY2soJmZwdV9jYWNoZV9tdHgpOworCQl9IGVsc2UKKwkJCWZyZWUoZW50LCBNX0RF VkJVRik7CisJfQorCisJcmV0dXJuIChlbnQpOworfQorCitzdGF0aWMgdm9pZAorcmVsZWFzZV9m cHVfY2NfZW50KHN0cnVjdCBmcHVfY2NfZW50ICplbnQpCit7CisKKwllbnQtPnVzZWQgPSAwOwor fQorCit1aW50NjRfdAorX3g4Nl82NF9jYWxsMSh2b2lkICpmbiwgdWludDY0X3QgYSkKK3sKKwlz dHJ1Y3QgZnB1X2NjX2VudCAqZW50OworCXVpbnQ2NF90IHJldDsKKworCWlmICgoZW50ID0gcmVx dWVzdF9mcHVfY2NfZW50KCkpID09IE5VTEwpCisJCXJldHVybiAoRU5PTUVNKTsKKwlmcHVfa2Vy bl9lbnRlcihjdXJ0aHJlYWQsIGVudC0+Y3R4LCBGUFVfS0VSTl9OT1JNQUwpOworCXJldCA9IHg4 
Nl82NF9jYWxsMShmbiwgYSk7CisJZnB1X2tlcm5fbGVhdmUoY3VydGhyZWFkLCBlbnQtPmN0eCk7 CisJcmVsZWFzZV9mcHVfY2NfZW50KGVudCk7CisKKwlyZXR1cm4gKHJldCk7Cit9CisKK3VpbnQ2 NF90CitfeDg2XzY0X2NhbGwyKHZvaWQgKmZuLCB1aW50NjRfdCBhLCB1aW50NjRfdCBiKQorewor CXN0cnVjdCBmcHVfY2NfZW50ICplbnQ7CisJdWludDY0X3QgcmV0OworCisJaWYgKChlbnQgPSBy ZXF1ZXN0X2ZwdV9jY19lbnQoKSkgPT0gTlVMTCkKKwkJcmV0dXJuIChFTk9NRU0pOworCWZwdV9r ZXJuX2VudGVyKGN1cnRocmVhZCwgZW50LT5jdHgsIEZQVV9LRVJOX05PUk1BTCk7CisJcmV0ID0g eDg2XzY0X2NhbGwyKGZuLCBhLCBiKTsKKwlmcHVfa2Vybl9sZWF2ZShjdXJ0aHJlYWQsIGVudC0+ Y3R4KTsKKwlyZWxlYXNlX2ZwdV9jY19lbnQoZW50KTsKKworCXJldHVybiAocmV0KTsKK30KKwor dWludDY0X3QKK194ODZfNjRfY2FsbDModm9pZCAqZm4sIHVpbnQ2NF90IGEsIHVpbnQ2NF90IGIs IHVpbnQ2NF90IGMpCit7CisJc3RydWN0IGZwdV9jY19lbnQgKmVudDsKKwl1aW50NjRfdCByZXQ7 CisKKwlpZiAoKGVudCA9IHJlcXVlc3RfZnB1X2NjX2VudCgpKSA9PSBOVUxMKQorCQlyZXR1cm4g KEVOT01FTSk7CisJZnB1X2tlcm5fZW50ZXIoY3VydGhyZWFkLCBlbnQtPmN0eCwgRlBVX0tFUk5f Tk9STUFMKTsKKwlyZXQgPSB4ODZfNjRfY2FsbDMoZm4sIGEsIGIsIGMpOworCWZwdV9rZXJuX2xl YXZlKGN1cnRocmVhZCwgZW50LT5jdHgpOworCXJlbGVhc2VfZnB1X2NjX2VudChlbnQpOworCisJ cmV0dXJuIChyZXQpOworfQorCit1aW50NjRfdAorX3g4Nl82NF9jYWxsNCh2b2lkICpmbiwgdWlu dDY0X3QgYSwgdWludDY0X3QgYiwgdWludDY0X3QgYywgdWludDY0X3QgZCkKK3sKKwlzdHJ1Y3Qg ZnB1X2NjX2VudCAqZW50OworCXVpbnQ2NF90IHJldDsKKworCWlmICgoZW50ID0gcmVxdWVzdF9m cHVfY2NfZW50KCkpID09IE5VTEwpCisJCXJldHVybiAoRU5PTUVNKTsKKwlmcHVfa2Vybl9lbnRl cihjdXJ0aHJlYWQsIGVudC0+Y3R4LCBGUFVfS0VSTl9OT1JNQUwpOworCXJldCA9IHg4Nl82NF9j YWxsNChmbiwgYSwgYiwgYywgZCk7CisJZnB1X2tlcm5fbGVhdmUoY3VydGhyZWFkLCBlbnQtPmN0 eCk7CisJcmVsZWFzZV9mcHVfY2NfZW50KGVudCk7CisKKwlyZXR1cm4gKHJldCk7Cit9CisKK3Vp bnQ2NF90CitfeDg2XzY0X2NhbGw1KHZvaWQgKmZuLCB1aW50NjRfdCBhLCB1aW50NjRfdCBiLCB1 aW50NjRfdCBjLCB1aW50NjRfdCBkLAorICAgIHVpbnQ2NF90IGUpCit7CisJc3RydWN0IGZwdV9j Y19lbnQgKmVudDsKKwl1aW50NjRfdCByZXQ7CisKKwlpZiAoKGVudCA9IHJlcXVlc3RfZnB1X2Nj X2VudCgpKSA9PSBOVUxMKQorCQlyZXR1cm4gKEVOT01FTSk7CisJZnB1X2tlcm5fZW50ZXIoY3Vy 
dGhyZWFkLCBlbnQtPmN0eCwgRlBVX0tFUk5fTk9STUFMKTsKKwlyZXQgPSB4ODZfNjRfY2FsbDUo Zm4sIGEsIGIsIGMsIGQsIGUpOworCWZwdV9rZXJuX2xlYXZlKGN1cnRocmVhZCwgZW50LT5jdHgp OworCXJlbGVhc2VfZnB1X2NjX2VudChlbnQpOworCisJcmV0dXJuIChyZXQpOworfQorCit1aW50 NjRfdAorX3g4Nl82NF9jYWxsNih2b2lkICpmbiwgdWludDY0X3QgYSwgdWludDY0X3QgYiwgdWlu dDY0X3QgYywgdWludDY0X3QgZCwKKyAgICB1aW50NjRfdCBlLCB1aW50NjRfdCBmKQoreworCXN0 cnVjdCBmcHVfY2NfZW50ICplbnQ7CisJdWludDY0X3QgcmV0OworCisJaWYgKChlbnQgPSByZXF1 ZXN0X2ZwdV9jY19lbnQoKSkgPT0gTlVMTCkKKwkJcmV0dXJuIChFTk9NRU0pOworCWZwdV9rZXJu X2VudGVyKGN1cnRocmVhZCwgZW50LT5jdHgsIEZQVV9LRVJOX05PUk1BTCk7CisJcmV0ID0geDg2 XzY0X2NhbGw2KGZuLCBhLCBiLCBjLCBkLCBlLCBmKTsKKwlmcHVfa2Vybl9sZWF2ZShjdXJ0aHJl YWQsIGVudC0+Y3R4KTsKKwlyZWxlYXNlX2ZwdV9jY19lbnQoZW50KTsKKworCXJldHVybiAocmV0 KTsKK30KICNlbmRpZiAvKiBfX2FtZDY0X18gKi8KIAogCkluZGV4OiBzeXMvY29tcGF0L25kaXMv cGVfdmFyLmgKPT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PQotLS0gc3lzL2NvbXBhdC9uZGlzL3BlX3Zhci5oCShyZXZpc2lv biAyNjEyMzkpCisrKyBzeXMvY29tcGF0L25kaXMvcGVfdmFyLmgJKHdvcmtpbmcgY29weSkKQEAg LTQ2MCwyMiArNDYwLDMwIEBACiBleHRlcm4gdWludDY0X3QgeDg2XzY0X2NhbGw2KHZvaWQgKiwg dWludDY0X3QsIHVpbnQ2NF90LCB1aW50NjRfdCwgdWludDY0X3QsCiAJdWludDY0X3QsIHVpbnQ2 NF90KTsKIAordWludDY0X3QgX3g4Nl82NF9jYWxsMSh2b2lkICosIHVpbnQ2NF90KTsKK3VpbnQ2 NF90IF94ODZfNjRfY2FsbDIodm9pZCAqLCB1aW50NjRfdCwgdWludDY0X3QpOwordWludDY0X3Qg X3g4Nl82NF9jYWxsMyh2b2lkICosIHVpbnQ2NF90LCB1aW50NjRfdCwgdWludDY0X3QpOwordWlu dDY0X3QgX3g4Nl82NF9jYWxsNCh2b2lkICosIHVpbnQ2NF90LCB1aW50NjRfdCwgdWludDY0X3Qs IHVpbnQ2NF90KTsKK3VpbnQ2NF90IF94ODZfNjRfY2FsbDUodm9pZCAqLCB1aW50NjRfdCwgdWlu dDY0X3QsIHVpbnQ2NF90LCB1aW50NjRfdCwKKyAgICB1aW50NjRfdCk7Cit1aW50NjRfdCBfeDg2 XzY0X2NhbGw2KHZvaWQgKiwgdWludDY0X3QsIHVpbnQ2NF90LCB1aW50NjRfdCwgdWludDY0X3Qs CisgICAgdWludDY0X3QsIHVpbnQ2NF90KTsKIAogI2RlZmluZQlNU0NBTEwxKGZuLCBhKQkJCQkJ CVwKLQl4ODZfNjRfY2FsbDEoKGZuKSwgKHVpbnQ2NF90KShhKSkKKwlfeDg2XzY0X2NhbGwxKChm 
biksICh1aW50NjRfdCkoYSkpCiAjZGVmaW5lCU1TQ0FMTDIoZm4sIGEsIGIpCQkJCQlcCi0JeDg2 XzY0X2NhbGwyKChmbiksICh1aW50NjRfdCkoYSksICh1aW50NjRfdCkoYikpCisJX3g4Nl82NF9j YWxsMigoZm4pLCAodWludDY0X3QpKGEpLCAodWludDY0X3QpKGIpKQogI2RlZmluZQlNU0NBTEwz KGZuLCBhLCBiLCBjKQkJCQkJXAotCXg4Nl82NF9jYWxsMygoZm4pLCAodWludDY0X3QpKGEpLCAo dWludDY0X3QpKGIpLAkJXAorCV94ODZfNjRfY2FsbDMoKGZuKSwgKHVpbnQ2NF90KShhKSwgKHVp bnQ2NF90KShiKSwJCVwKIAkodWludDY0X3QpKGMpKQogI2RlZmluZQlNU0NBTEw0KGZuLCBhLCBi LCBjLCBkKQkJCQkJXAotCXg4Nl82NF9jYWxsNCgoZm4pLCAodWludDY0X3QpKGEpLCAodWludDY0 X3QpKGIpLAkJXAorCV94ODZfNjRfY2FsbDQoKGZuKSwgKHVpbnQ2NF90KShhKSwgKHVpbnQ2NF90 KShiKSwJCVwKIAkodWludDY0X3QpKGMpLCAodWludDY0X3QpKGQpKQogI2RlZmluZQlNU0NBTEw1 KGZuLCBhLCBiLCBjLCBkLCBlKQkJCQlcCi0JeDg2XzY0X2NhbGw1KChmbiksICh1aW50NjRfdCko YSksICh1aW50NjRfdCkoYiksCQlcCisJX3g4Nl82NF9jYWxsNSgoZm4pLCAodWludDY0X3QpKGEp LCAodWludDY0X3QpKGIpLAkJXAogCSh1aW50NjRfdCkoYyksICh1aW50NjRfdCkoZCksICh1aW50 NjRfdCkoZSkpCiAjZGVmaW5lCU1TQ0FMTDYoZm4sIGEsIGIsIGMsIGQsIGUsIGYpCQkJCVwKLQl4 ODZfNjRfY2FsbDYoKGZuKSwgKHVpbnQ2NF90KShhKSwgKHVpbnQ2NF90KShiKSwJCVwKKwlfeDg2 XzY0X2NhbGw2KChmbiksICh1aW50NjRfdCkoYSksICh1aW50NjRfdCkoYiksCQlcCiAJKHVpbnQ2 NF90KShjKSwgKHVpbnQ2NF90KShkKSwgKHVpbnQ2NF90KShlKSwgKHVpbnQ2NF90KShmKSkKIAog I2VuZGlmIC8qIF9fYW1kNjRfXyAqLwo= --001a11343fbc90848404f10e0ef4-- From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 21:42:42 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 9ADA8694; Tue, 28 Jan 2014 21:42:42 +0000 (UTC) Received: from mail2.dataoppdrag.no (mail2.dataoppdrag.no [IPv6:2a02:f58:7:2::2]) by mx1.freebsd.org (Postfix) with ESMTP id 518851D66; Tue, 28 Jan 2014 21:42:42 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by mail2.dataoppdrag.no (Postfix) with ESMTP id 875F543038; Tue, 28 Jan 2014 22:42:40 +0100 (CET) Received: from 
mail2.dataoppdrag.no ([127.0.0.1]) by localhost (mail2.dataoppdrag.no [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id MuClo+YO5gwm; Tue, 28 Jan 2014 22:42:40 +0100 (CET) Received: from [172.20.10.252] (42-80-141-95.net.dataoppdrag.no [95.141.80.42]) by mail2.dataoppdrag.no (Postfix) with ESMTP id 688FB43037; Tue, 28 Jan 2014 22:42:40 +0100 (CET) Message-ID: <52E82450.60107@dataoppdrag.no> Date: Tue, 28 Jan 2014 22:42:40 +0100 From: Ole Myhre User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: =?ISO-8859-1?Q?Ermal_Lu=E7i?= Subject: Re: carp and rtadvd References: <52E7AB9B.5050707@dataoppdrag.no> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 21:42:42 -0000 On 28.01.2014 14:18, Ermal Luçi wrote: > You have to use the rtadvd patched from pfSense. > Look at our tools repo to get the code. Doesn't pfSense use radvd and not rtadvd? I've tried the patched radvd from the tools repo, however it does not seem to work. Maybe it's created for < 10.0? 
Thanks, Ole From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 22:50:53 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 83090DEE; Tue, 28 Jan 2014 22:50:53 +0000 (UTC) Received: from mail-vb0-x236.google.com (mail-vb0-x236.google.com [IPv6:2607:f8b0:400c:c02::236]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 2B8691221; Tue, 28 Jan 2014 22:50:53 +0000 (UTC) Received: by mail-vb0-f54.google.com with SMTP id w20so678395vbb.13 for ; Tue, 28 Jan 2014 14:50:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=iaaxAPvs6SStV4r2phgDCLVn8FFTUUwXEItzFgHSAtU=; b=Ty10IHEY0dqgribTVJ3Q+mkRm1InnNWA19etliL868IhwrD8yUXqlJFUdDxBtn5Cyr nC+g1sgFeaCTkfSF2sEn798nQCvg2iN3rfmqcRPBwkV6ayqfNGbQ/d6bfUON3LvTiyXP X5M8HqLxw2wphlch3A/Nen+TBWkVi87K4SZa5qK0t+IGkbHKh0aAKerH/0/slvFKNF9+ TyhUOxCOt00gSIqKqLbqLQnVBTI5wAFe9VsLLYmDZTKSWrDerSL0GGGS1veKvIFDNpJ2 d5AZnfgDw/o12xmvIE3BkQV//r2bXKJYYCDViQ4BzMyirJnQlZn8w5/YReQm288tpfra UEJQ== MIME-Version: 1.0 X-Received: by 10.221.26.10 with SMTP id rk10mr3315716vcb.0.1390949452156; Tue, 28 Jan 2014 14:50:52 -0800 (PST) Sender: ndenev@gmail.com Received: by 10.220.78.84 with HTTP; Tue, 28 Jan 2014 14:50:52 -0800 (PST) In-Reply-To: References: <1390909590119-5880672.post@n5.nabble.com> <52E7A9D8.30604@freebsd.org> Date: Tue, 28 Jan 2014 22:50:52 +0000 X-Google-Sender-Auth: w6PDynSsQKBZiQIQzs1fKBQzj8o Message-ID: Subject: Re: Jails on fib problem From: Nikolay Denev To: zaphod@berentweb.com Content-Type: text/plain; charset=ISO-8859-1 Cc: "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 
2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 22:50:53 -0000

On Tue, Jan 28, 2014 at 5:17 PM, Beeblebrox wrote:
>> what's a fib device? Do you mean each jail has a different default fib?
>> you are not using vimage jails?
>
> Hi Julian.
> * No vimage
> * All jails use the same fib. /etc/rc.conf:
> cloned_interfaces="lo2"
> ifconfig_lo2="inet 127.0.1.1/28"
> static_routes="jail default"
> route_jail="default 127.0.1.1 -fib 1"
> route_default="default 192.168.1.1"
>
>> so they all have the same address?? can you even do that? or you mean that
>> they all have the same default route?
> I mean same default route, jail IP's start from 127.0.1.2/32 and go to
> 127.0.1.6/32
> jail.conf assigns fib with "exec.fib = 1;"
> jails on the 127.0.1.1/28 subnet range should be able to route traffic
> through the 127.0.0.1 gateway regardless of the fact that the jails
> themselves reside on a /32 subnet. However, it's not working smoothly
>
>> fibs don't have devices.
> Yes, I know - a misnomer.
>
> setfib 1 netstat -rn
> Destination        Gateway      Flags    Netif Expire
> default            127.0.1.1    UGS      lo2
> 127.0.0.1          link#3       UH       lo0
> 127.0.1.1          link#4       UH       lo2
> 127.0.1.2          link#4       UH       lo2
> 127.0.1.3          link#4       UH       lo2
> 127.0.1.4          link#4       UH       lo2
> 192.168.1.0/24     link#1       U        re0 (Ext_If)
> 192.168.2.0/26     link#2       U        re1 (Lan_If)
>
> To complicate things further, I also have a vboxnet0 for VBox guests.
> 127.0.1.2 is a dns jail for example. The Internal LAN clients,
> vboxnet0 guests and lo0 need to resolve names from that jail.
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

You can't use 127/8 addresses and expect them to be routed/forwarded. See rfc1122.
--Nikolay From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 23:29:00 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id DDF75FB3 for ; Tue, 28 Jan 2014 23:29:00 +0000 (UTC) Received: from mail-ig0-x232.google.com (mail-ig0-x232.google.com [IPv6:2607:f8b0:4001:c05::232]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id AC74A158D for ; Tue, 28 Jan 2014 23:29:00 +0000 (UTC) Received: by mail-ig0-f178.google.com with SMTP id uq10so3086975igb.5 for ; Tue, 28 Jan 2014 15:29:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:content-type; bh=zClW/XfweItmgB6x4uQgGMoKtWGxMYr9oVqjZycfAAw=; b=tDJtkBk16eYXhOvovvf+jHrFYa/P4ICK16u5ENaY7dexEogwzmpJOn7rKW79pdK+xs Rkc1U9G5sQvEk5kqzHThK2Nvc1c0Q4MGQVJhjvchYEuhMKAtbR/R8k0l+tuZ/nHJ/tyg 5Le19Hg1lFmMIlt587eY6tSaEnJfaE7TplA6FpKFCn00EJq/FSUb4yn7YnVWIYWKlwPS 2c9gid5NBGli9pnQsjBlizG1nGjGv6YdRnYWlcZeqeTJkWY7JsRjbnW9nNeIdF4WJHUZ JerpkkShDSvLP/D1Kw/4wzj45oQReNO3L+rs4c/1TPdexfQJlhd+tM4WMJcHseuYyVjU WhEg== MIME-Version: 1.0 X-Received: by 10.51.17.101 with SMTP id gd5mr5486745igd.25.1390951739913; Tue, 28 Jan 2014 15:28:59 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Tue, 28 Jan 2014 15:28:59 -0800 (PST) In-Reply-To: <20140128021450.GY13704@funkthat.com> References: <20140128002826.GU13704@funkthat.com> <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca> <20140128021450.GY13704@funkthat.com> Date: Tue, 28 Jan 2014 18:28:59 -0500 X-Google-Sender-Auth: kfx4cadvc10boT96qtjos3WGgc0 Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
From: J David To: freebsd-net@freebsd.org, jmg@funkthat.com Content-Type: text/plain; charset=ISO-8859-1 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 23:29:00 -0000

On Mon, Jan 27, 2014 at 9:14 PM, John-Mark Gurney wrote:
> P.S. if someone wants to submit a patch to mbuf.9 to update the docs
> that would be helpful... I'll review and commit... and m_append is
> also undocumented...

Would something like this be a start in that direction?

http://pastebin.com/UVir1BET

This is all very new to me, so I apologize if that's completely wrong.

m_append does appear to be documented; it's between m_adj and m_prepend. (At least it is on 9.2, which is the latest tree I have access to.)

It does also look like m_getm is just a macro in mbuf.h that calls m_getm2 with flags set to M_PKTHDR, not a function as described in the man page. It was not immediately obvious whether that is intentional, something that should be fixed, or whether it is meant to be treated as a function from an API standpoint.

Thanks!
From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 23:41:38 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 9A8F6782 for ; Tue, 28 Jan 2014 23:41:38 +0000 (UTC) Received: from h2.funkthat.com (gate2.funkthat.com [208.87.223.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 5D28916B2 for ; Tue, 28 Jan 2014 23:41:37 +0000 (UTC) Received: from h2.funkthat.com (localhost [127.0.0.1]) by h2.funkthat.com (8.14.3/8.14.3) with ESMTP id s0SNfbqk082788 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 28 Jan 2014 15:41:37 -0800 (PST) (envelope-from jmg@h2.funkthat.com) Received: (from jmg@localhost) by h2.funkthat.com (8.14.3/8.14.3/Submit) id s0SNfaqM082787; Tue, 28 Jan 2014 15:41:36 -0800 (PST) (envelope-from jmg) Date: Tue, 28 Jan 2014 15:41:36 -0800 From: John-Mark Gurney To: J David Subject: Re: Terrible NFS performance under 9.2-RELEASE? Message-ID: <20140128234136.GJ13704@funkthat.com> Mail-Followup-To: J David , freebsd-net@freebsd.org References: <20140128002826.GU13704@funkthat.com> <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca> <20140128021450.GY13704@funkthat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Operating-System: FreeBSD 7.2-RELEASE i386 X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger? 
X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.2 (h2.funkthat.com [127.0.0.1]); Tue, 28 Jan 2014 15:41:37 -0800 (PST) Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Jan 2014 23:41:38 -0000

J David wrote this message on Tue, Jan 28, 2014 at 18:28 -0500:
> On Mon, Jan 27, 2014 at 9:14 PM, John-Mark Gurney wrote:
> > P.S. if someone wants to submit a patch to mbuf.9 to update the docs
> > that would be helpful... I'll review and commit... and m_append is
> > also undocumented...
>
> Would something like this be a start in that direction?
>
> http://pastebin.com/UVir1BET

It might be better to move most of m_getm's docs under m_getm2, and document that m_getm is just m_getm2 w/ the M_PKTHDR flag set. Could you also document that only M_PKTHDR and M_EOR are valid flags for m_getm2?

> This is all very new to me, so I apologize if that's completely wrong.

Nope, a good first start...

> m_append does appear to be documented; it's between m_adj and
> m_prepend. (At least it is on 9.2, which is the latest tree I have
> access to.)

You are correct.. the problem is that the MLINK isn't set up in the Makefile, so:

	$ man m_append
	No manual entry for m_append

I've fixed that, r261254...

> It does also look like m_getm is just a macro in mbuf.h that calls
> m_getm2 with flags set to M_PKTHDR, not a function as described in the
> man page. It was not immediately obvious if that was intentional or
> something that should be fixed or if it's intentionally meant to be
> treated as a function from an API standpoint.

It's common to use a macro when the change isn't complicated, i.e. just adding a flag...

Thanks.

-- 
John-Mark Gurney    Voice: +1 415 225 5579
"All that I will do, has been done, All that I have, has not."
From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 00:06:50 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id B602CF4A for ; Wed, 29 Jan 2014 00:06:50 +0000 (UTC) Received: from mail-ie0-x22a.google.com (mail-ie0-x22a.google.com [IPv6:2607:f8b0:4001:c03::22a]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 83AF5185C for ; Wed, 29 Jan 2014 00:06:50 +0000 (UTC) Received: by mail-ie0-f170.google.com with SMTP id u16so1377678iet.1 for ; Tue, 28 Jan 2014 16:06:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=xcf2TVbtRhc6egPITwsSspZRXi7QWrV3eOZVZG3Qaak=; b=F5B+6qVoXhrwQ9tUyEFzzM6m2huoMHUTEJ46X3WW15FZF3PVb5Md3ZDnf3jY/dauhf FCjjVtLts3Jd8OawDaVe/OB30hT07dcwl7NwAZzGHxL54IjsxMnXgTDqrtecDVUBkTgq O75SBdBiaLMD4DIvBaLL2jgrEuqbNlZkMhU6SDvkZnmimmGJUnyqMGsqKU21ouzZZZSq 4CMrvbJtZDLytZsQEUDCN4ji9/EJExfN3F8ZNB0h6ZUTaXqfILx1sbYzJ3jxfaxKpuak sQNvsQtowA2UFd0gc+BFeKIbsYddYfzIjdw0x6mA3S07eWOYqwdY+YRIm/wwCWvnWMQG ATlg== MIME-Version: 1.0 X-Received: by 10.50.154.102 with SMTP id vn6mr25502784igb.1.1390954009442; Tue, 28 Jan 2014 16:06:49 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Tue, 28 Jan 2014 16:06:49 -0800 (PST) In-Reply-To: <372707859.17587309.1390923341323.JavaMail.root@uoguelph.ca> References: <20140128021450.GY13704@funkthat.com> <372707859.17587309.1390923341323.JavaMail.root@uoguelph.ca> Date: Tue, 28 Jan 2014 19:06:49 -0500 X-Google-Sender-Auth: qNyyKJhBRRg54dHCKEQmwcr46rw Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
From: J David
To: Rick Macklem
Cc: freebsd-net@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1

On Tue, Jan 28, 2014 at 10:35 AM, Rick Macklem wrote:
> Since messages are sent quickly and then mbufs released, except for
> the DRC in the server, I think avoiding large allocations for server
> replies that may be cached is the case to try and avoid. Fortunately
> the large replies will be for read and readdir and these don't need
> to be cached by the DRC. As such, a patch that uses 4K clusters in
> the server for read, readdir and 4K clusters for write requests in
> the client, should be appropriate, I think?

m_getm2 appears to consistently produce "right-sized" results. The relevant code is:

	while (len > 0) {
		if (len > MCLBYTES)
			mb = m_getjcl(how, type, (flags & M_PKTHDR),
			    MJUMPAGESIZE);
		else if (len >= MINCLSIZE)
			mb = m_getcl(how, type, (flags & M_PKTHDR));
		else if (flags & M_PKTHDR)
			mb = m_gethdr(how, type);
		else
			mb = m_get(how, type);
		/* ... */
	}

So it allocates the shortest possible chain and uses the best-fit cluster for the last (or only) block in the chain.

It's probably the use of this function in m_uiotombuf, or somewhere very similar, that prevents tools like iperf from encountering this same issue.

Getting this same logic into the NFS code seems like it would be a good thing, in terms of reducing code duplication, increasing performance, and leveraging a well-tested code path.

It may raise portability concerns, but it does seem likely that other OS's to which the NFS code could potentially be ported have similar mechanisms these days. Possibly it would be worthwhile to examine whether the NFS code could choose a slightly different point of abstraction.
Or, if that's undesirable, maybe asking the hypothetical person doing such a port to cross that bridge when they come to it is not unreasonable, since that would be the person most likely to be intimately familiar with the relevant details of both OS's.

Also, looking at GAWollman's patch, an mbuf+cluster allocator that kicks back a prewired iovec seems really handy. Is that something that would be useful elsewhere in the kernel, or is NFS just kind of a special case because it's just moving data around, not across weird boundaries like device drivers and anything user mode-facing does?

Thanks!

From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 00:32:22 2014
Delivered-To: freebsd-net@freebsd.org
Date: Tue, 28 Jan 2014 19:32:13 -0500 (EST)
From: Rick Macklem
To: J David
Message-ID: <312973812.17975525.1390955533440.JavaMail.root@uoguelph.ca>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_Part_17975523_770461322.1390955533437"
X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790)
Cc: freebsd-net@freebsd.org, Garrett Wollman

------=_Part_17975523_770461322.1390955533437
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

J David wrote:
> On Tue, Jan 28, 2014 at 10:35 AM, Rick Macklem wrote:
> > Since messages are sent quickly and then mbufs released, except for
> > the DRC in the server, I think avoiding large allocations for server
> > replies that may be cached is the case to try and avoid. Fortunately
> > the large replies will be for read and readdir and these don't need
> > to be cached by the DRC. As such, a patch that uses 4K clusters in
> > the server for read, readdir and 4K clusters for write requests in
> > the client, should be appropriate, I think?
>
> m_getm2 appears to consistently produce "right-sized" results. The
> relevant code is:
>
>	while (len > 0) {
>		if (len > MCLBYTES)
>			mb = m_getjcl(how, type, (flags & M_PKTHDR),
>			    MJUMPAGESIZE);
>		else if (len >= MINCLSIZE)
>			mb = m_getcl(how, type, (flags & M_PKTHDR));
>		else if (flags & M_PKTHDR)
>			mb = m_gethdr(how, type);
>		else
>			mb = m_get(how, type);
>		/* ... */
>	}
>
> So it allocates the shortest possible chain and uses the best-fit
> cluster for the last (or only) block in the chain.
>
> It's probably the use of this function in m_uiotombuf or somewhere
> very similar that prevents tools like iperf from encountering this
> same issue.
> > Getting this same logic into the NFS code seems like it would be a > good thing, in terms of reducing code duplication, increasing > performance, and leveraging a well-tested code path. > For the server generating read replies, I suspect this is the case and that is what Garrett Wollman's patch does. However, readdir builds up the reply in small chunks via NFSM_BUILD() and this will require an extra argument that says "allocate a big cluster". Since it builds the reply in small chunks, it cannot use m_getm2(). I haven't looked at the client side write yet, so I don't know if m_getm2() is feasible for it or not. Hopefully Garrett and/or you will be able to do some testing of it and report back w.r.t. performance gains, etc. Once we have that, we can decide if this is an appropriate commit to head. Since I suspect it will take some time for Garrett to do this, please try my simple patch in your test environment, mostly to determine if the fail count goes to 0 (and also count calls to m_collapse() without/with the patch, since those will impact performance, too). Thanks in advance for trying the patch, rick ps: Attached again, just in case you don't already have it. > It may raise portability concerns, but it does seem likely that other > OS's to which the NFS code could potentially be ported have similar > mechanisms these days. Possibly it would be worthwhile to examine > whether the NFS code could choose a slightly different point of > abstraction. Or, if that's undesirable, maybe asking the > hypothetical > person doing such a port to cross that bridge when they come to it is > not unreasonable, since that would be the person most likely to be > intimately familiar with the relevant details of both OS's. > As I mentioned before, I am no longer concerned about portability. 
The discussion about portability was meant to explain why the code was written the way it was and, yes, I did note that "portability is nice" but did not intend to imply that that should limit modifications to the code that improve it for FreeBSD. > Also, looking at GAWollman's patch, an mbuf+cluster allocator that > kicks back a prewired iovec seems really handy. Is that something > that would be useful elsewhere in the kernel, or is NFS just kind of > a > special case because it's just moving data around, not across weird > boundaries like device drivers and anything user mode-facing does? > > Thanks! > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > ------=_Part_17975523_770461322.1390955533437 Content-Type: text/x-patch; name=4kmcl.patch Content-Disposition: attachment; filename=4kmcl.patch Content-Transfer-Encoding: base64 LS0tIGZzL25mcy9uZnNwb3J0Lmguc2F2MgkyMDE0LTAxLTI2IDE4OjQzOjQ3LjAwMDAwMDAwMCAt MDUwMAorKysgZnMvbmZzL25mc3BvcnQuaAkyMDE0LTAxLTI2IDE5OjA0OjI3LjAwMDAwMDAwMCAt MDUwMApAQCAtMTUzLDE0ICsxNTMsMjcgQEAKIAkJCU1HRVRIRFIoKG0pLCBNX1dBSVRPSywgTVRf REFUQSk7IAlcCiAJCX0gCQkJCQkJXAogCX0gd2hpbGUgKDApCi0jZGVmaW5lCU5GU01DTEdFVCht LCB3KQlkbyB7IAkJCQkJXAotCQlNR0VUKChtKSwgTV9XQUlUT0ssIE1UX0RBVEEpOyAJCQlcCi0J CXdoaWxlICgobSkgPT0gTlVMTCApIHsgCQkJCVwKLQkJCSh2b2lkKSBuZnNfY2F0bmFwKFBaRVJP LCAwLCAibmZzbWdldCIpOwlcCi0JCQlNR0VUKChtKSwgTV9XQUlUT0ssIE1UX0RBVEEpOyAJCVwK LQkJfSAJCQkJCQlcCi0JCU1DTEdFVCgobSksICh3KSk7CQkJCVwKKyNpZiBNSlVNUEFHRVNJWkUg PiBNQ0xCWVRFUworI2RlZmluZQlORlNNQ0xHRVQobSwgdykJZG8gewkgCQkJCQlcCisJCShtKSA9 IG1fZ2V0amNsKE1fV0FJVE9LLCBNVF9EQVRBLCAwLCBNSlVNUEFHRVNJWkUpOwlcCisJCXdoaWxl ICgobSkgPT0gTlVMTCkgewkgCQkJCVwKKwkJCSh2b2lkKW5mc19jYXRuYXAoUFpFUk8sIDAsICJu ZnNtZ2V0Iik7CQlcCisJCQlNR0VUKChtKSwgTV9XQUlUT0ssIE1UX0RBVEEpOwkgCQlcCisJCQlp 
ZiAoKG0pICE9IE5VTEwpCQkJCVwKKwkJCQlNQ0xHRVQoKG0pLCAodykpOwkJCVwKKwkJfQkgCQkJ CQkJXAogCX0gd2hpbGUgKDApCisjZWxzZQorI2RlZmluZQlORlNNQ0xHRVQobSwgdykJZG8gewkg CQkJCQlcCisJCShtKSA9IG1fZ2V0amNsKE1fV0FJVE9LLCBNVF9EQVRBLCAwLCBNQ0xCWVRFUyk7 CQlcCisJCXdoaWxlICgobSkgPT0gTlVMTCkgewkgCQkJCVwKKwkJCSh2b2lkKW5mc19jYXRuYXAo UFpFUk8sIDAsICJuZnNtZ2V0Iik7CQlcCisJCQlNR0VUKChtKSwgTV9XQUlUT0ssIE1UX0RBVEEp OwkgCQlcCisJCQlpZiAoKG0pICE9IE5VTEwpCQkJCVwKKwkJCQlNQ0xHRVQoKG0pLCAodykpOwkJ CVwKKwkJfQkgCQkJCQkJXAorCX0gd2hpbGUgKDApCisjZW5kaWYKICNkZWZpbmUJTkZTTUNMR0VU SERSKG0sIHcpIGRvIHsgCQkJCVwKIAkJTUdFVEhEUigobSksIE1fV0FJVE9LLCBNVF9EQVRBKTsJ CVwKIAkJd2hpbGUgKChtKSA9PSBOVUxMICkgeyAJCQkJXAotLS0gZnMvbmZzc2VydmVyL25mc19u ZnNkcG9ydC5jLnNhdjIJMjAxNC0wMS0yNiAxODo1NDoyOS4wMDAwMDAwMDAgLTA1MDAKKysrIGZz L25mc3NlcnZlci9uZnNfbmZzZHBvcnQuYwkyMDE0LTAxLTI2IDE4OjU2OjA4LjAwMDAwMDAwMCAt MDUwMApAQCAtNTY2LDggKzU2Niw3IEBAIG5mc3Zub19yZWFkbGluayhzdHJ1Y3Qgdm5vZGUgKnZw LCBzdHJ1Y3QKIAlsZW4gPSAwOwogCWkgPSAwOwogCXdoaWxlIChsZW4gPCBORlNfTUFYUEFUSExF TikgewotCQlORlNNR0VUKG1wKTsKLQkJTUNMR0VUKG1wLCBNX1dBSVRPSyk7CisJCU5GU01DTEdF VChtcCwgTV9XQUlUT0spOwogCQltcC0+bV9sZW4gPSBORlNNU0laKG1wKTsKIAkJaWYgKGxlbiA9 PSAwKSB7CiAJCQltcDMgPSBtcDIgPSBtcDsKQEAgLTYzNiw4ICs2MzUsNyBAQCBuZnN2bm9fcmVh ZChzdHJ1Y3Qgdm5vZGUgKnZwLCBvZmZfdCBvZmYsCiAJICovCiAJaSA9IDA7CiAJd2hpbGUgKGxl ZnQgPiAwKSB7Ci0JCU5GU01HRVQobSk7Ci0JCU1DTEdFVChtLCBNX1dBSVRPSyk7CisJCU5GU01D TEdFVChtLCBNX1dBSVRPSyk7CiAJCW0tPm1fbGVuID0gMDsKIAkJc2l6ID0gbWluKE1fVFJBSUxJ TkdTUEFDRShtKSwgbGVmdCk7CiAJCWxlZnQgLT0gc2l6Owo= ------=_Part_17975523_770461322.1390955533437-- From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 00:33:23 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 48279511 for ; Wed, 29 Jan 2014 00:33:23 +0000 (UTC) Received: from mail-ie0-x22e.google.com 
Date: Tue, 28 Jan 2014 19:33:22 -0500
Sender: jdavidlists@gmail.com
In-Reply-To: <20140128234136.GJ13704@funkthat.com>
References: <20140128002826.GU13704@funkthat.com> <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca> <20140128021450.GY13704@funkthat.com> <20140128234136.GJ13704@funkthat.com>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
From: J David
To: freebsd-net@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1

On Tue, Jan 28, 2014 at 6:41 PM, John-Mark Gurney wrote:
> It might be better to move most of m_getm's docs under m_getm2, and
> document that m_getm is just m_getm2 w/ M_PKTHDR flag set.
>
> Could you also document that only M_PKTHDR and M_EOR are valid
> flags for m_getm2?

OK, try this: http://pastebin.com/39kwExsc

> You are correct.. the problem is that the MLINK isn't setup in the
> Makefile, so:
> $ man m_append
> No manual entry for m_append

Oh drat, I just saw another one of those the other day and didn't make a note of it. Oh yeah, it's uma_zalloc_arg. Thanks, shell history!

> It's common to use a macro when the change isn't complicated, i.e.
> just adding a flag...

Sure, but it's a little unsettling that the man page is explicitly separated into macros and functions and this macro is the second entry under "The functions are:". It's like seeing a library book shelved in the wrong section.

Probably m_getm was a function for a long time and got turned into a macro when m_getm2 was born, and the documentation was either not updated or intentionally left as-is. Since I didn't know which was the case, I've left it where it is for now.

Thanks!
From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 02:37:27 2014
Delivered-To: freebsd-net@freebsd.org
Date: Tue, 28 Jan 2014 21:37:26 -0500
Sender: jdavidlists@gmail.com
In-Reply-To: <312973812.17975525.1390955533440.JavaMail.root@uoguelph.ca>
References: <312973812.17975525.1390955533440.JavaMail.root@uoguelph.ca>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
From: J David
To: Rick Macklem
Cc: freebsd-net@freebsd.org, Garrett Wollman
Content-Type: text/plain; charset=ISO-8859-1

On Tue, Jan 28, 2014 at 7:32 PM, Rick Macklem wrote:
> Hopefully Garrett and/or you will be able to do some testing of it
> and report back w.r.t. performance gains, etc.

OK, it has seen light testing. As predicted, the vtnet drops are eliminated and CPU load is reduced. The performance is also improved:

Test     Before     After
SeqWr      1506      7461
SeqRd       566    192015
RndRd       602    218730
RndWr        44     13972

All numbers in kiB/sec.

There were initially still some problems with lousy hostcache values on the client after the test, which is what causes the iperf performance to tank after the NFS test, but after a reboot of both sides and a fresh retest, I haven't reproduced that again. If it comes back, I'll try to figure out what's going on. But this definitely looks like a move in the right direction.

Thanks!
From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 05:54:46 2014
Delivered-To: freebsd-net@freebsd.org
Date: Wed, 29 Jan 2014 07:54:45 +0200
From: Beeblebrox
To: Nikolay Denev
Cc: "freebsd-net@freebsd.org"
Reply-To: zaphod@berentweb.com
Subject: Re: Jails on fib problem

> You can't use 127/8 addresses and expect them to be routed/forwarded.
> See rfc1122.
> --Nikolay

Thank you very much Nikolay. To correct the setup, I could either:

a. Remove fib-1, placing everything on fib-0 with the jail IPs remaining at 127/32, or
b. Migrate the jails and fib-1 to a 192.168 range.

Will I have various problems if I try to reconfigure as described in (a)?
Regards

From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 12:22:58 2014
Delivered-To: freebsd-net@freebsd.org
Date: Wed, 29 Jan 2014 14:22:56 +0200
From: Beeblebrox
To: Nikolay Denev
Cc: "freebsd-net@freebsd.org"
Reply-To: zaphod@berentweb.com
Subject: Re: Jails on fib problem

Since there was no answer on the a-or-b question, I assumed option (a) was also flawed, so I went with (b) and moved the jails to a 192.168 address range. Current rc.conf:

cloned_interfaces="lo2"
ifconfig_lo2="inet 192.168.2.110/28"
static_routes="jail default"
route_jail="default 192.168.2.110 -fib 1"
route_default="default 192.168.1.1"

# setfib 1 netstat -rn
Destination        Gateway          Flags    Netif  Expire
default            192.168.2.110    UGS      lo2
127.0.0.1          link#3           UH       lo0
192.168.1.0/24     link#1           U        re0
192.168.2.99       link#4           UH       lo2    (privoxy)
192.168.2.100      link#4           UH       lo2    (http cache)
192.168.2.110      link#4           UH       lo2
192.168.56.0/28    link#6           U        vboxnet0

Traffic for any internet IP gets passed to the http cache -> privoxy jail (99), but does not get forwarded to the 192.168.1.1 gateway. If I try to access the 192.168.1.1 ADSL modem page, it does come up correctly (I presume because it is within a defined address range in the routing table). What am I missing, so that traffic from the jails knows to exit via re0 and on to the default gateway? In pf.conf I have one NAT rule:

nat on $ExtIf from !($ExtIf) -> $ExtIf

Should I be natting on lo2 as well?

Regards.
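[For reference, one possible shape of the extra rule being asked about: explicitly NAT the jail subnet out the external interface. This is a sketch only, not tested; the macro names follow the poster's config, the subnet is inferred from the /28 above, and whether it actually fixes the forwarding depends on the fib setup.]

```pf
# pf.conf sketch (illustrative): translate traffic sourced from the
# jail subnet on lo2 as it leaves the external interface.
ExtIf="re0"
JailNet="192.168.2.96/28"
nat on $ExtIf from $JailNet to any -> ($ExtIf)
```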
From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 13:12:17 2014
Delivered-To: net@freebsd.org
Date: Wed, 29 Jan 2014 14:45:24 +0200
From: Vladislav Prodan
To: questions@freebsd.org
Cc: "net@freebsd.org"
Message-Id: <1390999493.115887823.pfbg2ep5@frv35.ukr.net>
Subject: Necessary to implement static NAT 1:1

I need to implement static 1:1 NAT:

10.1.2.3 -> 100.1.2.3
10.1.2.4 -> 100.1.2.4
10.1.2.5 -> 100.1.2.5
10.1.2.6 -> 100.1.2.6
...

There are over 20k such IP addresses. How would you suggest implementing this?

--
Vladislav V. Prodan
System & Network Administrator
http://support.od.ua
+380 67 4584408, +380 99 4060508
VVP88-RIPE

From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 15:25:48 2014
Delivered-To: freebsd-net@freebsd.org
Date: Wed, 29 Jan 2014 17:25:46 +0200
From: Beeblebrox
Cc: "freebsd-net@freebsd.org"
Reply-To: zaphod@berentweb.com
Subject: Re: Jails on fib problem

I forgot about this detail. From the ifconfig man page's FIB section:

    The FIB is not inherited, e.g. vlans or other sub-interfaces will use
    the default FIB (0) irrespective of the parent interface's FIB.

What alternatives are there to get this setup working? FIB, it seems, is not the answer?
From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 17:52:09 2014
Delivered-To: freebsd-net@freebsd.org
Date: Wed, 29 Jan 2014 18:52:04 +0100
From: Eric Masson
To: Mailing List FreeBSD Network, Mailing List FreeBSD ipfw
In-Reply-To: <868uu4rshh.fsf@srvbsdfenssv.interne.associated-bears.org> (Eric Masson's message of "Sat, 25 Jan 2014 16:28:10 +0100")
References: <868uu4rshh.fsf@srvbsdfenssv.interne.associated-bears.org>
Message-ID: <861tzqwu9n.fsf@srvbsdfenssv.interne.associated-bears.org>
Subject: Re: [FreeBSD 10.0] nat before vpn, incoming packets not translated

Eric Masson writes:

Hi,

No ideas on this subject? Forwarding to freebsd-ipfw.

Regards
Éric Masson

> Hi,
>
> I've set up a lab to experiment with a NAT-before-IPsec scenario.
> Architecture : > - 3 host only interfaces have been set up on the host > - 4 FreeBSD10 guests have been set up : > - 2 clients connected to their respective gateways via dedicated host > only interfaces. > - 2 gateways connected together via dedicated host only interface > > Client 1 setup : > <-----------------------------------------------------------------> > emss@client1:~ % more /etc/rc.conf > hostname="client1" > keymap="fr.iso.acc.kbd" > ifconfig_em0="inet 192.168.11.100 netmask 255.255.255.0" > ifconfig_em0_ipv6="inet6 accept_rtadv" > defaultrouter="192.168.11.15" > sshd_enable="YES" > dumpdev="AUTO" > sendmail_enable="NO" > sendmail_submit_enable="NO" > sendmail_outbound_enable="NO" > sendmail_msp_queue_enable="NO" > <-----------------------------------------------------------------> > > Gateway 1 setup : > <-----------------------------------------------------------------> > emss@gateway1:~ % more /etc/rc.conf > hostname="gateway1" > keymap="fr.iso.acc.kbd" > ifconfig_em1="inet 192.168.11.15 netmask 255.255.255.0" > ifconfig_em1_ipv6="inet6 accept_rtadv" > ifconfig_em0="inet 10.0.0.5 netmask 255.255.255.0" > gateway_enable="YES" > ipsec_enable="YES" > ipsec_file="/etc/ipsec.conf" > firewall_enable="YES" > firewall_script="/etc/ipfw.rules" > firewall_logging="YES" > sshd_enable="YES" > # Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable > dumpdev="AUTO" > sendmail_enable="NO" > sendmail_submit_enable="NO" > sendmail_outbound_enable="NO" > sendmail_msp_queue_enable="NO" > emss@gateway1:~ % more /etc/ipfw.rules > #!/bin/sh > cmd="/sbin/ipfw" > $cmd -f flush > $cmd add 00100 nat 100 all from 192.168.11.0/24 to 192.168.21.0/24 > $cmd nat 100 config log ip 172.16.0.1 reverse > emss@gateway1:~ % more /etc/ipsec.conf > flush; > spdflush; > > add 10.0.0.5 10.0.0.6 esp 0x1000 -E 3des-cbc "123456789012345678901234"; > add 10.0.0.6 10.0.0.5 esp 0x1001 -E 3des-cbc "432109876543210987654321"; > > add 10.0.0.5 10.0.0.6 ipcomp 0x2000 -C deflate; > add 
10.0.0.6 10.0.0.5 ipcomp 0x2001 -C deflate; > > spdadd 192.168.21.0/24 172.16.0.1/32 any -P in ipsec > ipcomp/tunnel/10.0.0.6-10.0.0.5/require > esp/tunnel/10.0.0.6-10.0.0.5/require; > > spdadd 172.16.0.1/32 192.168.21.0/24 any -P out ipsec > ipcomp/tunnel/10.0.0.5-10.0.0.6/require > esp/tunnel/10.0.0.5-10.0.0.6/require; > emss@gateway1:~ % more /boot/loader.conf > ipfw_load="YES" > ipfw_nat_load="YES" > > net.inet.ip.fw.default_to_accept="1" > <-----------------------------------------------------------------> > > Gateway 2 setup : > <-----------------------------------------------------------------> > emss@gateway2:~ % more /etc/rc.conf > hostname="gateway2" > keymap="fr.iso.acc.kbd" > ifconfig_em1="inet 10.0.0.6 netmask 255.255.255.0" > ifconfig_em0="inet 192.168.21.15 netmask 255.255.255.0" > ifconfig_em0_ipv6="inet6 accept_rtadv" > gateway_enable="YES" > ipsec_enable="YES" > ipsec_file="/etc/ipsec.conf" > sshd_enable="YES" > # Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable > dumpdev="AUTO" > sendmail_enable="NO" > sendmail_submit_enable="NO" > sendmail_outbound_enable="NO" > sendmail_msp_queue_enable="NO" > emss@gateway2:~ % more /etc/ipsec.conf > flush; > spdflush; > > add 10.0.0.5 10.0.0.6 esp 0x1000 -E 3des-cbc "123456789012345678901234"; > add 10.0.0.6 10.0.0.5 esp 0x1001 -E 3des-cbc "432109876543210987654321"; > > add 10.0.0.5 10.0.0.6 ipcomp 0x2000 -C deflate; > add 10.0.0.6 10.0.0.5 ipcomp 0x2001 -C deflate; > > spdadd 192.168.21.0/24 172.16.0.1/32 any -P out ipsec > ipcomp/tunnel/10.0.0.6-10.0.0.5/require > esp/tunnel/10.0.0.6-10.0.0.5/require; > > spdadd 172.16.0.1/32 192.168.21.0/24 any -P in ipsec > ipcomp/tunnel/10.0.0.5-10.0.0.6/require > esp/tunnel/10.0.0.5-10.0.0.6/require; > <-----------------------------------------------------------------> > > Client 2 setup : > <-----------------------------------------------------------------> > emss@client2:~ % more /etc/rc.conf > hostname="client2" > keymap="fr.iso.acc.kbd" > 
ifconfig_em0="inet 192.168.21.100 netmask 255.255.255.0" > ifconfig_em0_ipv6="inet6 accept_rtadv" > defaultrouter="192.168.21.15" > sshd_enable="YES" > # Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable > dumpdev="AUTO" > sendmail_enable="NO" > sendmail_submit_enable="NO" > sendmail_outbound_enable="NO" > sendmail_msp_queue_enable="NO" > <-----------------------------------------------------------------> > > Test setup by pinging client2 from client1 : > > On client1 : > emss@client1:~ % ping 192.168.21.100 > PING 192.168.21.100 (192.168.21.100): 56 data bytes > > On gateway1 inside interface : > > root@gateway1:~ # tcpdump -i em1 > 17:16:08.600154 IP 192.168.11.100 > 192.168.21.100: ICMP echo request, id 10499, seq 7207, length 64 > 17:16:08.600660 IP 192.168.11.100 > 192.168.21.100: ICMP echo request, id 59651, seq 213, length 64 > ... > > On gateway1 outside interface : > root@gateway1:~ # tcpdump -i em0 > 17:16:48.501317 IP 10.0.0.5 > 10.0.0.6: ESP(spi=0x00001000,seq=0x1ed4), length 128 > 17:16:48.501612 IP 10.0.0.5 > 10.0.0.6: ESP(spi=0x00001000,seq=0x1ed5), length 128 > 17:16:48.502665 IP 10.0.0.6 > 10.0.0.5: ESP(spi=0x00001001,seq=0x1e67), length 128 > 17:16:48.502938 IP 10.0.0.6 > 10.0.0.5: ESP(spi=0x00001001,seq=0x1e68), length 128 > ... > > On client2 : > root@client2:~ # tcpdump -i em0 > 17:14:17.671181 IP 172.16.0.1 > 192.168.21.100: ICMP echo request, id 59651, seq 107, length 64 > 17:14:17.671230 IP 192.168.21.100 > 172.16.0.1: ICMP echo reply, id 59651, seq 107, length 64 > ... > > So, the only remaining issue is that gateway1 doesn't nat back ipsec > decapsulated packets (if no nat in scenario, everything works fine). > > Setting net.inet.ip.fw.one_pass to 0 doesn't change anything. > > Any idea, please ? > > Regards > > Éric Masson -- Intéressant votre témoignage, quoique un peu long. Pourriez-vous en écrire davantage ! -+- LL in GNU n'a qu'un mot à dire : assez, encore ! 
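An editorial aside, not part of the thread: one thing worth checking in the setup above is whether the inbound leg of the translation is ever matched. A hedged sketch of a gateway1 ruleset with an explicit inbound nat rule follows; the rule numbers and the `in`/`out` qualifiers are illustrative additions, not Éric's actual configuration:

```shell
#!/bin/sh
# Sketch only (untested): variant of gateway1's /etc/ipfw.rules that
# makes both directions of the translation explicit, so replies to
# 172.16.0.1 decapsulated from the ESP tunnel are NATed back to
# 192.168.11.0/24.
cmd="/sbin/ipfw"
$cmd -f flush
# Configure the nat instance before any rule references it.
$cmd nat 100 config log ip 172.16.0.1 reverse
# Outbound leg: translate LAN sources before the SPD sees them.
$cmd add 00100 nat 100 all from 192.168.11.0/24 to 192.168.21.0/24 out
# Inbound leg: translate decapsulated packets addressed to the nat ip.
$cmd add 00200 nat 100 all from 192.168.21.0/24 to 172.16.0.1 in
```

Note also that whether decapsulated tunnel-mode packets traverse ipfw again at all is governed by the kernel's IPsec filtering knob (net.inet.ipsec.filtertunnel, 0 by default on recent FreeBSD), which could be the actual culprit here.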
-+- From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 18:08:46 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4B2BA4B0 for ; Wed, 29 Jan 2014 18:08:46 +0000 (UTC) Received: from exprod6og122.obsmtp.com (exprod6og122.obsmtp.com [64.18.1.238]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id DBB421572 for ; Wed, 29 Jan 2014 18:08:45 +0000 (UTC) Received: from osprey.verisign.com ([216.168.239.75]) (using TLSv1) by exprod6ob122.postini.com ([64.18.5.12]) with SMTP ID DSNKUulDptsto1iLNMsm37kh2BdptmO2lk3b@postini.com; Wed, 29 Jan 2014 10:08:45 PST Received: from BRN1WNEXCHM01.vcorp.ad.vrsn.com (brn1wnexchm01.vcorp.ad.vrsn.com [10.173.152.255]) by osprey.verisign.com (8.13.6/8.13.4) with ESMTP id s0TI8ZUH017321 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=FAIL) for ; Wed, 29 Jan 2014 13:08:38 -0500 Received: from BRN1WNEXMBX01.vcorp.ad.vrsn.com ([::1]) by BRN1WNEXCHM01.vcorp.ad.vrsn.com ([::1]) with mapi id 14.02.0342.003; Wed, 29 Jan 2014 13:08:34 -0500 From: "Bentkofsky, Michael" To: "freebsd-net@freebsd.org" Subject: RE: kern/176446: [netinet] [patch] Concurrency in ixgbe Thread-Topic: kern/176446: [netinet] [patch] Concurrency in ixgbe Thread-Index: Ac8dG8u9zjMR+0+wTzaH1G8/XnwkLAAAUZjQ Date: Wed, 29 Jan 2014 18:08:33 +0000 Message-ID: <080FBD5B7A09F845842100A6DE79623346E60505@BRN1WNEXMBX01.vcorp.ad.vrsn.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.173.152.4] MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.17 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: 
Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 18:08:46 -0000 I believe this has been fixed in r240968. From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 18:54:09 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 69504390 for ; Wed, 29 Jan 2014 18:54:09 +0000 (UTC) Received: from khavrinen.csail.mit.edu (khavrinen.csail.mit.edu [IPv6:2001:470:8b2d:1e1c:21b:21ff:feb8:d7b0]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 2229519A4 for ; Wed, 29 Jan 2014 18:54:09 +0000 (UTC) Received: from khavrinen.csail.mit.edu (localhost [127.0.0.1]) by khavrinen.csail.mit.edu (8.14.7/8.14.7) with ESMTP id s0TIs7Qp047007 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL CN=khavrinen.csail.mit.edu issuer=Client+20CA) for ; Wed, 29 Jan 2014 13:54:08 -0500 (EST) (envelope-from wollman@khavrinen.csail.mit.edu) Received: (from wollman@localhost) by khavrinen.csail.mit.edu (8.14.7/8.14.7/Submit) id s0TIs7K5047004; Wed, 29 Jan 2014 13:54:07 -0500 (EST) (envelope-from wollman) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <21225.20047.947384.390241@khavrinen.csail.mit.edu> Date: Wed, 29 Jan 2014 13:54:07 -0500 From: Garrett Wollman To: freebsd-net@freebsd.org Subject: Big physically contiguous mbuf clusters X-Mailer: VM 7.17 under 21.4 (patch 22) "Instant Classic" XEmacs Lucid X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (khavrinen.csail.mit.edu [127.0.0.1]); Wed, 29 Jan 2014 13:54:08 -0500 (EST) X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and 
TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 18:54:09 -0000 Resolved: that mbuf clusters longer than one page ought not be supported. There is too much physical-memory fragmentation for them to be of use on a moderately active server. 9k mbufs are especially bad, since in the fragmented case they waste 3k per allocation. -GAWollman From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 19:23:00 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id D119EA18 for ; Wed, 29 Jan 2014 19:23:00 +0000 (UTC) Received: from h2.funkthat.com (gate2.funkthat.com [208.87.223.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id ABFEF1D0A for ; Wed, 29 Jan 2014 19:23:00 +0000 (UTC) Received: from h2.funkthat.com (localhost [127.0.0.1]) by h2.funkthat.com (8.14.3/8.14.3) with ESMTP id s0TJLcuF002771 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 29 Jan 2014 11:21:38 -0800 (PST) (envelope-from jmg@h2.funkthat.com) Received: (from jmg@localhost) by h2.funkthat.com (8.14.3/8.14.3/Submit) id s0TJLbe6002770; Wed, 29 Jan 2014 11:21:37 -0800 (PST) (envelope-from jmg) Date: Wed, 29 Jan 2014 11:21:37 -0800 From: John-Mark Gurney To: Garrett Wollman Subject: Re: Big physically contiguous mbuf clusters Message-ID: <20140129192137.GF93141@funkthat.com> Mail-Followup-To: Garrett Wollman , freebsd-net@freebsd.org References: <21225.20047.947384.390241@khavrinen.csail.mit.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <21225.20047.947384.390241@khavrinen.csail.mit.edu> User-Agent: Mutt/1.4.2.3i X-Operating-System: FreeBSD 7.2-RELEASE i386 
X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html X-TipJar: bitcoin:13Qmb6AeTgQecazTWph4XasEsP7nGRbAPE X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger? X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.2 (h2.funkthat.com [127.0.0.1]); Wed, 29 Jan 2014 11:21:38 -0800 (PST) Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 19:23:00 -0000 Garrett Wollman wrote this message on Wed, Jan 29, 2014 at 13:54 -0500: > Resolved: that mbuf clusters longer than one page ought not be > supported. There is too much physical-memory fragmentation for them > to be of use on a moderately active server. 9k mbufs are especially > bad, since in the fragmented case they waste 3k per allocation. I agree, but I am split on removing the code, as there are still very broken controllers that may require them. In those cases, it might be helpful to have a tunable that lets you set how many jumbo frames are allocated at boot, with those pages never released back to the system... We definitely need to fix all the drivers that use MJUM9BYTES, which apparently are quite a few: http://fxr.watson.org/fxr/ident?im=excerpts;i=MJUM9BYTES -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." 
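The 3k waste figure quoted above is simple page arithmetic: a 9k cluster must span three physically contiguous 4k pages. A quick sanity check in shell (plain arithmetic, no kernel interfaces involved):

```shell
#!/bin/sh
# A 9k (MJUM9BYTES) cluster is 9216 bytes; with 4096-byte pages it
# needs ceil(9216/4096) contiguous pages, pinning the whole span.
cluster=9216
page=4096
pages=$(( (cluster + page - 1) / page ))
waste=$(( pages * page - cluster ))
echo "pages=$pages waste=$waste"   # prints: pages=3 waste=3072 (the 3k above)
```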
From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 19:26:01 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 34251BCC for ; Wed, 29 Jan 2014 19:26:01 +0000 (UTC) Received: from web01.jbserver.net (web01.jbserver.net [IPv6:2a00:d10:2000:e::3]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id ECD961D2C for ; Wed, 29 Jan 2014 19:26:00 +0000 (UTC) Received: from 75-138-17-190.fibertel.com.ar ([190.17.138.75] helo=[192.168.3.102]) by web01.jbserver.net with esmtpsa (TLSv1:DHE-RSA-CAMELLIA256-SHA:256) (Exim 4.82) (envelope-from ) id 1W8alp-0007w8-5W; Wed, 29 Jan 2014 20:25:57 +0100 Message-ID: <52E955BA.9060908@gont.com.ar> Date: Wed, 29 Jan 2014 16:25:46 -0300 From: Fernando Gont User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: FreeBSD Net Subject: Fwd: RFC 7112 on Implications of Oversized IPv6 Header Chains References: <20140129173044.D475C7FC17B@rfc-editor.org> In-Reply-To: <20140129173044.D475C7FC17B@rfc-editor.org> X-Enigmail-Version: 1.5.2 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 19:26:01 -0000 Folks, FYI. 
This one has important implications -- it allows stateless filtering in IPv6 (otherwise not really possible) -------- Original Message -------- Subject: RFC 7112 on Implications of Oversized IPv6 Header Chains Date: Wed, 29 Jan 2014 09:30:44 -0800 (PST) From: rfc-editor@rfc-editor.org To: ietf-announce@ietf.org, rfc-dist@rfc-editor.org CC: drafts-update-ref@iana.org, ipv6@ietf.org, rfc-editor@rfc-editor.org A new Request for Comments is now available in online RFC libraries. RFC 7112 Title: Implications of Oversized IPv6 Header Chains Author: F. Gont, V. Manral, R. Bonica Status: Standards Track Stream: IETF Date: January 2014 Mailbox: fgont@si6networks.com, vishwas@ionosnetworks.com, rbonica@juniper.net Pages: 8 Characters: 15897 Updates: RFC 2460 I-D Tag: draft-ietf-6man-oversized-header-chain-09.txt URL: http://www.rfc-editor.org/rfc/rfc7112.txt The IPv6 specification allows IPv6 Header Chains of an arbitrary size. The specification also allows options that can, in turn, extend each of the headers. In those scenarios in which the IPv6 Header Chain or options are unusually long and packets are fragmented, or scenarios in which the fragment size is very small, the First Fragment of a packet may fail to include the entire IPv6 Header Chain. This document discusses the interoperability and security problems of such traffic, and updates RFC 2460 such that the First Fragment of a packet is required to contain the entire IPv6 Header Chain. This document is a product of the IPv6 Maintenance Working Group of the IETF. This is now a Proposed Standard. STANDARDS TRACK: This document specifies an Internet standards track protocol for the Internet community,and requests discussion and suggestions for improvements. Please refer to the current edition of the Internet Official Protocol Standards (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited. This announcement is sent to the IETF-Announce and rfc-dist lists. 
To subscribe or unsubscribe, see http://www.ietf.org/mailman/listinfo/ietf-announce http://mailman.rfc-editor.org/mailman/listinfo/rfc-dist For searching the RFC series, see http://www.rfc-editor.org/search/rfc_search.php For downloading RFCs, see http://www.rfc-editor.org/rfc.html Requests for special distribution should be addressed to either the author of the RFC in question, or to rfc-editor@rfc-editor.org. Unless specifically noted otherwise on the RFC itself, all RFCs are for unlimited distribution. The RFC Editor Team Association Management Solutions, LLC -------------------------------------------------------------------- IETF IPv6 working group mailing list ipv6@ietf.org Administrative Requests: https://www.ietf.org/mailman/listinfo/ipv6 -------------------------------------------------------------------- -- Fernando Gont e-mail: fernando@gont.com.ar || fgont@si6networks.com PGP Fingerprint: 7809 84F5 322E 45C7 F1C9 3945 96EE A9EF D076 FFF1 From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 21:30:01 2014 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id CFFDAE49 for ; Wed, 29 Jan 2014 21:30:01 +0000 (UTC) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id BA9701753 for ; Wed, 29 Jan 2014 21:30:01 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id s0TLU1vh067521 for ; Wed, 29 Jan 2014 21:30:01 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id s0TLU1dp067519; Wed, 29 Jan 2014 21:30:01 GMT (envelope-from gnats) Date: 
Wed, 29 Jan 2014 21:30:01 GMT Message-Id: <201401292130.s0TLU1dp067519@freefall.freebsd.org> To: freebsd-net@FreeBSD.org Cc: From: =?ISO-8859-1?Q?Olivier_Cochard=2DLabb=E9?= Subject: Re: kern/177905: [xl] [panic] ifmedia_set when pluging CardBus LAN card, xl(4) driver X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list Reply-To: =?ISO-8859-1?Q?Olivier_Cochard=2DLabb=E9?= List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 21:30:01 -0000 The following reply was made to PR kern/177905; it has been noted by GNATS. From: =?ISO-8859-1?Q?Olivier_Cochard=2DLabb=E9?= To: bug-followup@freebsd.org, olivier@cochard.me Cc: Subject: Re: kern/177905: [xl] [panic] ifmedia_set when pluging CardBus LAN card, xl(4) driver Date: Wed, 29 Jan 2014 22:25:52 +0100 --001a11362cd417b57a04f12297a8 Content-Type: text/plain; charset=ISO-8859-1 Just a keepalive: Still the same problem on FreeBSD 10.0. --001a11362cd417b57a04f12297a8 Content-Type: text/html; charset=ISO-8859-1
--001a11362cd417b57a04f12297a8-- From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 22:21:22 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 6B0457D4 for ; Wed, 29 Jan 2014 22:21:22 +0000 (UTC) Received: from mail-qc0-x22a.google.com (mail-qc0-x22a.google.com [IPv6:2607:f8b0:400d:c01::22a]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 2A19D1CED for ; Wed, 29 Jan 2014 22:21:22 +0000 (UTC) Received: by mail-qc0-f170.google.com with SMTP id e9so3859618qcy.1 for ; Wed, 29 Jan 2014 14:21:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=UnyI2eYYxSEcf7NFs3EDoEZfO6pNK1Ca8wpGCcfymGE=; b=pmgHc5QV8KHTmF2/SPKALh0qx25Adtrgx0FOEBCRWWrkiSTZ94nRURq7e2fgpQh9gL Hjzjya9Xeg3m3jz3CgscLPx9oCaT6A+Oyno6w/JvhNcg57zRoEJm6potOb42gbGFRWgW /hMTM7sFaqIoIsUziFvqA1UF4KZzjzlzIfETKkQe9s4DSnMadH/rJINMvo+j+9Hs8nsa ncH6OI4pLP1kK8rl3rrwpnCsQIx/EyW08r6CRsRx65E3jGerpwpyPQVS40ImOKO372wL xPZY9cXe1QwSkWQXWphPVRjM83MrpdKtFvLbOEPUr1FUk0IxVCE5BSkAEwGe3L5z/YEj 1YEw== MIME-Version: 1.0 X-Received: by 10.224.52.3 with SMTP id f3mr16517773qag.26.1391034081362; Wed, 29 Jan 2014 14:21:21 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.224.52.8 with HTTP; Wed, 29 Jan 2014 14:21:21 -0800 (PST) In-Reply-To: <21225.20047.947384.390241@khavrinen.csail.mit.edu> References: <21225.20047.947384.390241@khavrinen.csail.mit.edu> Date: Wed, 29 Jan 2014 14:21:21 -0800 X-Google-Sender-Auth: h3t7oDAulNFeRilUMr0HheGsL3c Message-ID: Subject: Re: Big physically contiguous mbuf clusters From: Adrian Chadd To: Garrett Wollman Content-Type: text/plain; charset=ISO-8859-1 Cc: 
FreeBSD Net X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 22:21:22 -0000 Hi, On 29 January 2014 10:54, Garrett Wollman wrote: > Resolved: that mbuf clusters longer than one page ought not be > supported. There is too much physical-memory fragmentation for them > to be of use on a moderately active server. 9k mbufs are especially > bad, since in the fragmented case they waste 3k per allocation. I've been wondering whether it'd be feasible to teach the physical memory allocator about >page sized allocations and to create zones of slightly more physically contiguous memory. For servers with lots of memory we could then keep these around and only dip into them for temporary allocations (eg not VM pages that may be held for some unknown amount of time.) Question is - can we enforce that kind of behaviour? 
-a From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 22:26:34 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id DD2E2A3E; Wed, 29 Jan 2014 22:26:34 +0000 (UTC) Received: from khavrinen.csail.mit.edu (khavrinen.csail.mit.edu [IPv6:2001:470:8b2d:1e1c:21b:21ff:feb8:d7b0]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id A94701D23; Wed, 29 Jan 2014 22:26:34 +0000 (UTC) Received: from khavrinen.csail.mit.edu (localhost [127.0.0.1]) by khavrinen.csail.mit.edu (8.14.7/8.14.7) with ESMTP id s0TMQXeC049098 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL CN=khavrinen.csail.mit.edu issuer=Client+20CA); Wed, 29 Jan 2014 17:26:33 -0500 (EST) (envelope-from wollman@khavrinen.csail.mit.edu) Received: (from wollman@localhost) by khavrinen.csail.mit.edu (8.14.7/8.14.7/Submit) id s0TMQXxo049095; Wed, 29 Jan 2014 17:26:33 -0500 (EST) (envelope-from wollman) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <21225.32793.237629.329516@khavrinen.csail.mit.edu> Date: Wed, 29 Jan 2014 17:26:33 -0500 From: Garrett Wollman To: Adrian Chadd Subject: Re: Big physically contiguous mbuf clusters In-Reply-To: References: <21225.20047.947384.390241@khavrinen.csail.mit.edu> X-Mailer: VM 7.17 under 21.4 (patch 22) "Instant Classic" XEmacs Lucid X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (khavrinen.csail.mit.edu [127.0.0.1]); Wed, 29 Jan 2014 17:26:33 -0500 (EST) Cc: FreeBSD Net X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 
22:26:34 -0000 < said: > For servers with lots of memory we could then keep these around and > only dip into them for temporary allocations (eg not VM pages that may > be held for some unknown amount of time.) mbufs may also be held for some unknown amount of time, so I don't think that helps at all. -GAWollman From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 22:27:19 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 35F7DBE6; Wed, 29 Jan 2014 22:27:19 +0000 (UTC) Received: from h2.funkthat.com (gate2.funkthat.com [208.87.223.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id E8CE81D3A; Wed, 29 Jan 2014 22:27:18 +0000 (UTC) Received: from h2.funkthat.com (localhost [127.0.0.1]) by h2.funkthat.com (8.14.3/8.14.3) with ESMTP id s0TMRE5r006197 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 29 Jan 2014 14:27:14 -0800 (PST) (envelope-from jmg@h2.funkthat.com) Received: (from jmg@localhost) by h2.funkthat.com (8.14.3/8.14.3/Submit) id s0TMREGH006196; Wed, 29 Jan 2014 14:27:14 -0800 (PST) (envelope-from jmg) Date: Wed, 29 Jan 2014 14:27:14 -0800 From: John-Mark Gurney To: Adrian Chadd Subject: Re: Big physically contiguous mbuf clusters Message-ID: <20140129222714.GK93141@funkthat.com> Mail-Followup-To: Adrian Chadd , Garrett Wollman , FreeBSD Net References: <21225.20047.947384.390241@khavrinen.csail.mit.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Operating-System: FreeBSD 7.2-RELEASE i386 X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: 
http://resnet.uoregon.edu/~gurney_j/resume.html X-TipJar: bitcoin:13Qmb6AeTgQecazTWph4XasEsP7nGRbAPE X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger? X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.2 (h2.funkthat.com [127.0.0.1]); Wed, 29 Jan 2014 14:27:14 -0800 (PST) Cc: Garrett Wollman , FreeBSD Net X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 22:27:19 -0000 Adrian Chadd wrote this message on Wed, Jan 29, 2014 at 14:21 -0800: > On 29 January 2014 10:54, Garrett Wollman wrote: > > Resolved: that mbuf clusters longer than one page ought not be > > supported. There is too much physical-memory fragmentation for them > > to be of use on a moderately active server. 9k mbufs are especially > > bad, since in the fragmented case they waste 3k per allocation. > > I've been wondering whether it'd be feasible to teach the physical > memory allocator about >page sized allocations and to create zones of > slightly more physically contiguous memory. > > For servers with lots of memory we could then keep these around and > only dip into them for temporary allocations (eg not VM pages that may > be held for some unknown amount of time.) > > Question is - can we enforce that kind of behaviour? It shouldn't be too hard to do... Since everything pretty much goes through uma we can adopt a scheme similar to what Solaris does (read Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources)... Instead of dealing w/ page size allocations, everything is larger, say 16KB, and broken down from there... -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." 
From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 23:01:45 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 89D9AE0D; Wed, 29 Jan 2014 23:01:45 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 27134108D; Wed, 29 Jan 2014 23:01:44 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqQEAIiH6VKDaFve/2dsb2JhbABZg0RWgn65QU+BGXSCJQEBAQMBAQEBIAQnHQECCwUWGAICDRkCKQEJJgYIBwQBGQMEh1wIDaploEYXgSmNBQEBGzQHgm+BSQSJSYp3gRWEBZBtg0seMYEEOQ X-IronPort-AV: E=Sophos;i="4.95,744,1384318800"; d="scan'208";a="91674231" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 29 Jan 2014 18:01:43 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id A3442B40A3; Wed, 29 Jan 2014 18:01:43 -0500 (EST) Date: Wed, 29 Jan 2014 18:01:43 -0500 (EST) From: Rick Macklem To: J David Message-ID: <1352428787.18632865.1391036503658.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, Garrett Wollman , Bryan Venteicher X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 23:01:45 -0000 J David wrote: > On Tue, Jan 28, 2014 at 7:32 PM, Rick Macklem > wrote: > > Hopefully Garrett and/or you will be able to do some testing of it > > and report back w.r.t. performance gains, etc. > > OK, it has seen light testing. > > As predicted the vtnet drops are eliminated and CPU load is reduced. > Ok, that's good news. Bryan, is increasing VTNET_MAX_TX_SEGS in the driver feasible? However, I do suspect we'll be putting a refined version of the patch in head someday (maybe April, sooner would have to be committed by someone else). I suspect that Garrett's code for server read will work well and I'll cobble something together for server readdir and client write. > The performance is also improved: > > Test Before After > SeqWr 1506 7461 > SeqRd 566 192015 > RndRd 602 218730 > RndWr 44 13972 > > All numbers in kiB/sec. > If you get the chance, you can try a few tunables on the server. vfs.nfsd.fha.enable=0 - ken@ found that FHA was necessary for ZFS exports, to avoid out-of-order reads from confusing ZFS's sequential reading heuristic. However, FHA also means that all readaheads for a file are serialized with the reads for the file (same fh->same nfsd thread). Somehow, it seems to me that doing reads concurrently in the server (given shared vnode locks) could be a good thing. --> I wonder what the story is for UFS? So, it would be interesting to see what disabling FHA does for the sequential read test. 
I think I already mentioned the DRC cache ones: vfs.nfsd.tcphighwater=100000 vfs.nfsd.tcpcachetimeo=600 (actually I think Garrett uses 300) Good to see some progress, rick ps: Daniel reports that he will be able to test the patch this weekend, to see if it fixes his problem that required TSO to be disabled, so we'll wait and see. > There were initially still some problems with lousy hostcache values > on the client after the test, which is what causes the iperf > performance to tank after the NFS test, but after a reboot of both > sides and fresh retest, I haven't reproduced that again. If it comes > back, I'll try to figure out what's going on. > Hopefully a networking type might know what is going on, because this is way out of my area of expertise. > But this definitely looks like a move in the right direction. > > Thanks! > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 23:03:07 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id CF34CEE2; Wed, 29 Jan 2014 23:03:07 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 7FB0C10A5; Wed, 29 Jan 2014 23:03:07 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqQEAIiH6VKDaFve/2dsb2JhbABZg0RWgn65QU+BGXSCJQEBAQMBAQEBICseAgsFFhgCAg0ZAikBCSYGCAcEARwEh1wIDaploEYXgSmNBQEBGzQHgm+BSQSJSYwMhAWQbYNLHjGBBDk X-IronPort-AV: E=Sophos;i="4.95,744,1384318800"; d="scan'208";a="91674458" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by 
esa-annu.net.uoguelph.ca with ESMTP; 29 Jan 2014 18:03:06 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 6E3F0B3F00; Wed, 29 Jan 2014 18:03:06 -0500 (EST) Date: Wed, 29 Jan 2014 18:03:06 -0500 (EST) From: Rick Macklem To: J David Message-ID: <1869703796.18633714.1391036586445.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 9.2-RELEASE? MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, Garrett Wollman X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 23:03:07 -0000 J David wrote: > On Tue, Jan 28, 2014 at 7:32 PM, Rick Macklem > wrote: > > Hopefully Garrett and/or you will be able to do some testing of it > > and report back w.r.t. performance gains, etc. > > OK, it has seen light testing. > > As predicted the vtnet drops are eliminated and CPU load is reduced. > Oh, and I forgot to say thanks for doing this testing, rick
> The performance is also improved:
>
> Test     Before      After
> SeqWr      1506       7461
> SeqRd       566     192015
> RndRd       602     218730
> RndWr        44      13972
>
> All numbers in kiB/sec.
>
> There were initially still some problems with lousy hostcache values > on the client after the test, which is what causes the iperf > performance to tank after the NFS test, but after a reboot of both > sides and fresh retest, I haven't reproduced that again. If it comes > back, I'll try to figure out what's going on. > > But this definitely looks like a move in the right direction. > > Thanks! 
> _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 23:08:44 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 615542BC; Wed, 29 Jan 2014 23:08:44 +0000 (UTC) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 1393710F2; Wed, 29 Jan 2014 23:08:43 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: X-IronPort-AV: E=Sophos;i="4.95,744,1384318800"; d="scan'208";a="92187781" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 29 Jan 2014 18:08:31 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 8EDFCB4062; Wed, 29 Jan 2014 18:08:31 -0500 (EST) Date: Wed, 29 Jan 2014 18:08:31 -0500 (EST) From: Rick Macklem To: J David Message-ID: <2032299860.18637455.1391036911579.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.209] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, Garrett Wollman X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 23:08:44 -0000 J David wrote: > On Tue, Jan 28, 2014 at 7:32 PM, Rick Macklem > wrote: > > Hopefully Garrett and/or you will be able to do some testing of it > > and report back w.r.t. performance gains, etc. > > OK, it has seen light testing. > > As predicted the vtnet drops are eliminated and CPU load is reduced. >
> The performance is also improved:
>
> Test     Before      After
> SeqWr      1506       7461
> SeqRd       566     192015
> RndRd       602     218730
> RndWr        44      13972
>
> All numbers in kiB/sec.
>
Oops, ignore most of what I said about FHA. I now see that the default is 8 nfsd per FH, which should handle readaheads. However, it does remind me that it would be nice to try cranking up the readahead value for the client mount. "-o readahead=8" would be a good one to try (you can go as high as 16, if you'd like). Have fun with it, rick > There were initially still some problems with lousy hostcache values > on the client after the test, which is what causes the iperf > performance to tank after the NFS test, but after a reboot of both > sides and fresh retest, I haven't reproduced that again. If it comes > back, I'll try to figure out what's going on. > > But this definitely looks like a move in the right direction. > > Thanks! 
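Rick's readahead suggestion above translates to a client-side mount roughly like the following; the server name and paths are placeholders, and readahead is the standard mount_nfs option (valid values 0-16):

```shell
# Hypothetical client-side test of the readahead suggestion.
mount -t nfs -o readahead=8 nfsserver:/export /mnt
# ... run the sequential read benchmark, then retry with readahead=16 ...
umount /mnt
```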
> _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 23:11:27 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4D55F561; Wed, 29 Jan 2014 23:11:27 +0000 (UTC) Received: from mail-pb0-x22e.google.com (mail-pb0-x22e.google.com [IPv6:2607:f8b0:400e:c01::22e]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 191CA118C; Wed, 29 Jan 2014 23:11:27 +0000 (UTC) Received: by mail-pb0-f46.google.com with SMTP id um1so2364408pbc.19 for ; Wed, 29 Jan 2014 15:11:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:mail-followup-to:references :mime-version:content-type:content-disposition:in-reply-to :user-agent; bh=DY5Bn7ksljXPjIRVBlEkpA6d9joXOdvyunVoYwCRcQ0=; b=GCPISSPhA2hQ50K2ncGUe2/qjFs3fCX8ObF/Ag8uNzyYiEPPnYiNrn/2DUPTu49+fY SkHSmy6ch/L9zgekJUNHGu1mkeBzyEC5WWYjoBs7DBXU3ksGuUj5l2T6I73cLB7P/LMw kWIZubTW1GBkDWROPLHqLuLl2mWu/yY4GJKvTo1GRjpl4Hn1WEUSjPtj5MlzUtYw8AOE VgRh4UWiG37CfwIcTsUoZU4oL+TLQtywwBoidRZFcHGS/3pJxy6dwPD7fKQvsdtSbgYs 0SFbIt28QTmXozQ5MaNy6cr45kql59aLcWJYbtVBCxvTe4Ygcc9OiZo3igVeYDm7ewXc EqIA== X-Received: by 10.68.130.202 with SMTP id og10mr10744421pbb.133.1391037086785; Wed, 29 Jan 2014 15:11:26 -0800 (PST) Received: from ox (c-24-6-44-228.hsd1.ca.comcast.net. 
[24.6.44.228]) by mx.google.com with ESMTPSA id sy10sm26834530pac.15.2014.01.29.15.11.25 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Wed, 29 Jan 2014 15:11:26 -0800 (PST) Date: Wed, 29 Jan 2014 15:11:21 -0800 From: Navdeep Parhar To: Adrian Chadd Subject: Re: Big physically contiguous mbuf clusters Message-ID: <20140129231121.GA18434@ox> Mail-Followup-To: Adrian Chadd , Garrett Wollman , FreeBSD Net References: <21225.20047.947384.390241@khavrinen.csail.mit.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Garrett Wollman , FreeBSD Net X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 23:11:27 -0000 On Wed, Jan 29, 2014 at 02:21:21PM -0800, Adrian Chadd wrote: > Hi, > > On 29 January 2014 10:54, Garrett Wollman wrote: > > Resolved: that mbuf clusters longer than one page ought not be > > supported. There is too much physical-memory fragmentation for them > > to be of use on a moderately active server. 9k mbufs are especially > > bad, since in the fragmented case they waste 3k per allocation. > > I've been wondering whether it'd be feasible to teach the physical > memory allocator about >page sized allocations and to create zones of > slightly more physically contiguous memory. I think this would be very useful. For example, a zone_jumbo32 would hit a sweet spot -- enough to fit 3 jumbo frames and some loose change for metadata. I'd like to see us improve our allocators and VM system to work better with larger contiguous allocations, rather than deprecating the larger zones. It seems backwards to push towards smaller allocation units when installed physical memory in a typical system continues to rise. 
Allocating 3 x 4K instead of 1 x 9K for a jumbo means 3x the number of vtophys translations, 3x the phys_addr/len traffic on the PCIe bus (scatter list has to be fed to the chip and now it's 3x what it has to be), 3x the number of "wrapper" mbuf allocations (one for each 4K cluster) which will then be stitched together to form a frame, etc. etc. Regards, Navdeep > > For servers with lots of memory we could then keep these around and > only dip into them for temporary allocations (eg not VM pages that may > be held for some unknown amount of time.) > > Question is - can we enforce that kind of behaviour? > > > > -a > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" From owner-freebsd-net@FreeBSD.ORG Wed Jan 29 23:31:14 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 92320330; Wed, 29 Jan 2014 23:31:14 +0000 (UTC) Received: from mail-ig0-x230.google.com (mail-ig0-x230.google.com [IPv6:2607:f8b0:4001:c05::230]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 484AE139A; Wed, 29 Jan 2014 23:31:14 +0000 (UTC) Received: by mail-ig0-f176.google.com with SMTP id j1so16305315iga.3 for ; Wed, 29 Jan 2014 15:31:13 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=DL42o1oFSdz/gGZv93VjQhF9DRGOaiIuIXnIgx0Asd4=; b=tZjwsn/8KCGC98Db76d3JnodEAQKeqvxD3jXWuagDzc+uAFiMPlYxd0dRSBjY9GFDe 5yiBDRjlHLMcb3H+XPbtqPI+WcIrGvZNPVPP3/uUfKO8UVhLkSkqVVFmE8GOqfvGhmxG 
6RDj+xV3D1w3WC5A24otXcXoZe/2DqqJiZLgY0Skx9y2TUKaEJ6PqQQdu7t4aJ4qxoUF Qmm4rFql4DXlZnvabjD4/s20r1nbw8W+B6yATRqsJaZmhP4tre+dD8teZeBBk5asQkwL JVwcreW+odYR6e/k7aIChrIHaUp8W3h1JxjQxCwThgGl7iWy1FC/fXjD7OgtloBo6G7P f/9A== X-Received: by 10.50.114.4 with SMTP id jc4mr31370497igb.0.1391038273623; Wed, 29 Jan 2014 15:31:13 -0800 (PST) MIME-Version: 1.0 Sender: mr.kodiak@gmail.com Received: by 10.64.96.73 with HTTP; Wed, 29 Jan 2014 15:30:43 -0800 (PST) In-Reply-To: <1352428787.18632865.1391036503658.JavaMail.root@uoguelph.ca> References: <1352428787.18632865.1391036503658.JavaMail.root@uoguelph.ca> From: Bryan Venteicher Date: Wed, 29 Jan 2014 17:30:43 -0600 X-Google-Sender-Auth: TYs9uu6M0ndJjNQBihmikRszTDg Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? To: Rick Macklem Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.17 Cc: freebsd-net@freebsd.org, J David , Garrett Wollman , Bryan Venteicher X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jan 2014 23:31:14 -0000 On Wed, Jan 29, 2014 at 5:01 PM, Rick Macklem wrote: > J David wrote: > > On Tue, Jan 28, 2014 at 7:32 PM, Rick Macklem > > wrote: > > > Hopefully Garrett and/or you will be able to do some testing of it > > > and report back w.r.t. performance gains, etc. > > > > OK, it has seen light testing. > > > > As predicted the vtnet drops are eliminated and CPU load is reduced. > > > Ok, that's good news. Bryan, is increasing VTNET_MAX_TX_SEGS in the > driver feasible? > > I've been busy the last few days, and won't be able to get to any code until the weekend. The current MAX_TX_SEGS value is mostly arbitrary - the implicit limit is VIRTIO_MAX_INDIRECT. 
This value is used in virtqueue.c to allocate an array of 'struct vring_desc' entries, each 16 bytes; since that allocation gets rounded up to the next power of two anyway, we can make the limit bigger without any real additional memory usage. But also note I do put a MAX_TX_SEGS-sized array of 'struct sglist_segs' on the stack, so it cannot be made too big. Even what is currently there is probably already pushing what's a Good Idea to put on the stack anyways (especially since it is near the bottom of a typically pretty deep call stack). I've been meaning to move that to hanging on the 'struct vtnet_txq' instead. I think all TSO capable drivers that use m_collapse(..., 32) (and don't set if_hw_tsomax) are broken - there look to be several. I was slightly on top of my game by using 33, since it appears m_collapse() does not touch the pkthdr mbuf (I think that was my thinking 3 years ago, and it seems to be the case from a quick glance at the code). I think drivers using m_defrag(..., 32) are OK, but that function can be much, much more expensive. However, I do suspect we'll be putting a refined version of the patch > in head someday (maybe April, sooner would have to be committed by > someone else). I suspect that Garrett's code for server read will work > well and I'll cobble something together for server readdir and client > write.
> > The performance is also improved:
> >
> > Test     Before      After
> > SeqWr      1506       7461
> > SeqRd       566     192015
> > RndRd       602     218730
> > RndWr        44      13972
> >
> > All numbers in kiB/sec.
> >
> If you get the chance, you can try a few tunables on the server. > vfs.nfsd.fha.enable=0 > - ken@ found that FHA was necessary for ZFS exports, to avoid out > of order reads from confusing ZFS's sequential reading heuristic. > However, FHA also means that all readaheads for a file are serialized > with the reads for the file (same fh->same nfsd thread). Somehow, it > seems to me that doing reads concurrently in the server (given shared > vnode locks) could be a good thing. 
> --> I wonder what the story is for UFS? > So, it would be interesting to see what disabling FHA does for the > sequential read test. > > I think I already mentioned the DRC cache ones: > vfs.nfsd.tcphighwater=100000 > vfs.nfsd.tcpcachetimeo=600 (actually I think Garrett uses 300) > > Good to see some progress, rick > ps: Daniel reports that he will be able to test the patch this > weekend, to see if it fixes his problem that required TSO > to be disabled, so we'll wait and see. > > > There were initially still some problems with lousy hostcache values > > on the client after the test, which is what causes the iperf > > performance to tank after the NFS test, but after a reboot of both > > sides and fresh retest, I haven't reproduced that again. If it comes > > back, I'll try to figure out what's going on. > > > Hopefully a networking type might know what is going on, because this > is way out of my area of expertise. > > > But this definitely looks like a move in the right direction. > > > > Thanks! 
> > _______________________________________________ > > freebsd-net@freebsd.org mailing list > > http://lists.freebsd.org/mailman/listinfo/freebsd-net > > To unsubscribe, send any mail to > > "freebsd-net-unsubscribe@freebsd.org" > > > From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 00:55:20 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 9637138D for ; Thu, 30 Jan 2014 00:55:20 +0000 (UTC) Received: from out4-smtp.messagingengine.com (out4-smtp.messagingengine.com [66.111.4.28]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 659971EA1 for ; Thu, 30 Jan 2014 00:55:20 +0000 (UTC) Received: from compute1.internal (compute1.nyi.mail.srv.osa [10.202.2.41]) by gateway1.nyi.mail.srv.osa (Postfix) with ESMTP id 15BBC20C66 for ; Wed, 29 Jan 2014 19:55:18 -0500 (EST) Received: from frontend1 ([10.202.2.160]) by compute1.internal (MEProxy); Wed, 29 Jan 2014 19:55:18 -0500 DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d= messagingengine.com; h=content-type:mime-version:subject:from :in-reply-to:date:cc:content-transfer-encoding:message-id :references:to; s=smtpout; bh=NX/OrH7/1OJxJdubTttkDG2E+OA=; b=bD k2ZJt1u2+gf/d/Fs7KEOxqoMMGrC0RQVbLeNr4stwiN3NCA5cX54csmGsewRdPPN L5zyjXJxGZUjJqpks/Mzs+10IJb3FWSyoyl11ce4lUqTh7DsPJcwrDUAAAvClswx BaagBSlkr9KhS4JVxf+2hLIhgg5k9HaOwTh4qIs9M= X-Sasl-enc: MIBf6EKWPZeWy2mXG1OjSkVRX+42FYOXOP1mRmFvsw8G 1391043317 Received: from [172.16.1.145] (unknown [68.117.126.78]) by mail.messagingengine.com (Postfix) with ESMTPA id 8DEEBC00E81; Wed, 29 Jan 2014 19:55:17 -0500 (EST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 7.1 \(1827\)) Subject: Re: carp and rtadvd From: Mark Felder In-Reply-To: 
<52E7AB9B.5050707@dataoppdrag.no> Date: Wed, 29 Jan 2014 18:55:16 -0600 Content-Transfer-Encoding: quoted-printable Message-Id: References: <52E7AB9B.5050707@dataoppdrag.no> To: Ole Myhre X-Mailer: Apple Mail (2.1827) Cc: FreeBSD Net X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 00:55:20 -0000 On Jan 28, 2014, at 7:07, Ole Myhre wrote: > Hi, > > I have a simple setup with two 10.0-RELEASE firewalls running carp, a > virtual IPv6 address and running rtadvd: > > (applied to both firewalls)
>
> # kldload carp
> # ifconfig em2 inet6 2001:db8::1/64 vhid 1 up
> # sysctl net.inet6.ip6.forwarding=1
> # echo 'rtadvd_enable="YES"' >> /etc/rc.conf
> # echo 'rtadvd_interfaces="em2"' >> /etc/rc.conf
> # service rtadvd start
>
> This works fine: one firewall is MASTER, the other BACKUP, and the > clients behind em2 get a prefix in the 2001:db8::/64 subnet. However > both firewalls are sending router advertisements (only one being MASTER) > with the LL-address of the physical em2 interface as the gateway. This > causes clients that support multiple default gateways to select both > firewalls as their default gateway, sending traffic to both the > MASTER and BACKUP firewall. > > Is there a way to make only the MASTER send router advertisements, or > (preferably only the MASTER) send router advertisements with a > virtual LL-address? > What I would do is use devd to start/stop the rtadvd service based on whether or not you're master. 
# notify 30 {
#     match "system" "IFNET";
#     match "subsystem" "carp0";
#     match "type" "LINK_UP";
#     action "/path/to/script/or/command";
# };
#
# notify 30 {
#     match "system" "IFNET";
#     match "subsystem" "carp0";
#     match "type" "LINK_DOWN";
#     action "/path/to/script/or/command";
# };
From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 01:34:42 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id C509D7FD; Thu, 30 Jan 2014 01:34:42 +0000 (UTC) Received: from h2.funkthat.com (gate2.funkthat.com [208.87.223.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 95EC91246; Thu, 30 Jan 2014 01:34:41 +0000 (UTC) Received: from h2.funkthat.com (localhost [127.0.0.1]) by h2.funkthat.com (8.14.3/8.14.3) with ESMTP id s0U1YZl6008675 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 29 Jan 2014 17:34:35 -0800 (PST) (envelope-from jmg@h2.funkthat.com) Received: (from jmg@localhost) by h2.funkthat.com (8.14.3/8.14.3/Submit) id s0U1YYV7008674; Wed, 29 Jan 2014 17:34:34 -0800 (PST) (envelope-from jmg) Date: Wed, 29 Jan 2014 17:34:34 -0800 From: John-Mark Gurney To: Adrian Chadd , Garrett Wollman , FreeBSD Net Subject: Re: Big physically contiguous mbuf clusters Message-ID: <20140130013434.GP93141@funkthat.com> Mail-Followup-To: Adrian Chadd , Garrett Wollman , FreeBSD Net References: <21225.20047.947384.390241@khavrinen.csail.mit.edu> <20140129231121.GA18434@ox> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140129231121.GA18434@ox> User-Agent: Mutt/1.4.2.3i X-Operating-System: FreeBSD 7.2-RELEASE i386 X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396 X-Files: The truth is out there X-URL: 
http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html X-TipJar: bitcoin:13Qmb6AeTgQecazTWph4XasEsP7nGRbAPE X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger? X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.2 (h2.funkthat.com [127.0.0.1]); Wed, 29 Jan 2014 17:34:35 -0800 (PST) X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 01:34:42 -0000 Navdeep Parhar wrote this message on Wed, Jan 29, 2014 at 15:11 -0800: > On Wed, Jan 29, 2014 at 02:21:21PM -0800, Adrian Chadd wrote: > > Hi, > > > > On 29 January 2014 10:54, Garrett Wollman wrote: > > > Resolved: that mbuf clusters longer than one page ought not be > > > supported. There is too much physical-memory fragmentation for them > > > to be of use on a moderately active server. 9k mbufs are especially > > > bad, since in the fragmented case they waste 3k per allocation. > > > > I've been wondering whether it'd be feasible to teach the physical > > memory allocator about >page sized allocations and to create zones of > > slightly more physically contiguous memory. > > I think this would be very useful. For example, a zone_jumbo32 would > hit a sweet spot -- enough to fit 3 jumbo frames and some loose change > for metadata. I'd like to see us improve our allocators and VM system Actually, that is what currently happens... I just verified this on -current... http://fxr.watson.org/fxr/source/vm/uma_core.c#L880 is where the allocation happens, notice the uk_ppera, and kgdb says: print zone_jumbo9[0].uz_kegs.lh_first[0].kl_keg[0].uk_ppera $7 = 3 > to work better with larger contiguous allocations, rather than > deprecating the larger zones. 
It seems backwards to push towards > smaller allocation units when installed physical memory in a typical > system continues to rise. > > Allocating 3 x 4K instead of 1 x 9K for a jumbo means 3x the number of > vtophys translations, 3x the phys_addr/len traffic on the PCIe bus I don't think that this will be an issue... If we support a 9k jumbo that is not physically contiguous (easy on main memory), the table we use to fetch the first physical page will likely have the next two pages in it as well, so I doubt there will be a significant performance penalty; yes, we'll loop a few more times, but main memory access is the real speed limiter in these situations... > (scatter list has to be fed to the chip and now it's 3x what it has to > be), 3x the number of "wrapper" mbuf allocations (one for each 4K > cluster) which will then be stitched together to form a frame, etc. etc. And what is that as a percentage of overall traffic? .4% (assuming 16 bytes per 4k page)... If your PCIe bus is saturating and you need that extra .4% of bandwidth, then you have a serious issue w/ your bus layout... -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." 
From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 02:05:27 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 3DF5EDB1; Thu, 30 Jan 2014 02:05:27 +0000 (UTC) Received: from mail-pb0-x230.google.com (mail-pb0-x230.google.com [IPv6:2607:f8b0:400e:c01::230]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 09F92149D; Thu, 30 Jan 2014 02:05:27 +0000 (UTC) Received: by mail-pb0-f48.google.com with SMTP id rr13so2504520pbb.21 for ; Wed, 29 Jan 2014 18:05:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:subject:message-id:mail-followup-to:references :mime-version:content-type:content-disposition:in-reply-to :user-agent; bh=BYVlWTcCBZVAZbvxJEeExuf3OB4kxF3bOLO3wrj0Qx8=; b=C7rLbW3saSuaYMczwhkmx2ff4KJh/4VWLQhfmKDZurfSSwZ+GoB/iDJbwaFz0w0QYI AqfymKhS2JfGGYF4YLXT12XrjW+VI4Afxna60MQv7QuuXHxSmihnEe1bwezVkI3OxT6u 8N5BY8kqFO/V4BE78nYOhmQ2VMahspB2DxwhhaXQ/40HWR/QCvse39qaz93qvQxqDZ9M cWs0k9O6iH6dZ+m13zTDPRy6+8gFaQwGr36Nry1qjCAlzav3qPa0G5b9ulMUMb/OYlTL /eRxclfsVVX/1VGCRNIFuMjMUYOqPE1RISCDcYw3MPjuMWb0x7NSU75NztfpyKCcT3Wa TIhQ== X-Received: by 10.66.163.164 with SMTP id yj4mr11461589pab.91.1391047526709; Wed, 29 Jan 2014 18:05:26 -0800 (PST) Received: from ox (c-24-6-44-228.hsd1.ca.comcast.net. 
[24.6.44.228]) by mx.google.com with ESMTPSA id ns7sm11640705pbc.32.2014.01.29.18.05.25 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Wed, 29 Jan 2014 18:05:25 -0800 (PST) Date: Wed, 29 Jan 2014 18:05:23 -0800 From: Navdeep Parhar To: Adrian Chadd , Garrett Wollman , FreeBSD Net Subject: Re: Big physically contiguous mbuf clusters Message-ID: <20140130020523.GB18434@ox> Mail-Followup-To: Adrian Chadd , Garrett Wollman , FreeBSD Net References: <21225.20047.947384.390241@khavrinen.csail.mit.edu> <20140129231121.GA18434@ox> <20140130013434.GP93141@funkthat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140130013434.GP93141@funkthat.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 02:05:27 -0000 On Wed, Jan 29, 2014 at 05:34:34PM -0800, John-Mark Gurney wrote: > Navdeep Parhar wrote this message on Wed, Jan 29, 2014 at 15:11 -0800: > > > (scatter list has to be fed to the chip and now it's 3x what it has to > > be), 3x the number of "wrapper" mbuf allocations (one for each 4K > > cluster) which will then be stitched together to form a frame, etc. etc. > > And what is that in percentage of overall traffic? .4% (assuming 16 bytes > per 4k page)... If your PCIe bus is saturating and you need that extra > .4% traffic, then you have a serious issue w/ your bus layout... The 16B and 4KB are in different directions, the former is from host to chip and the latter from chip to host memory. So the 16B eats into the transmit bandwidth. FWIW, I do deal with cards where PCIe is the limiting factor (a 4x10G card with a pcie gen2 x8 block, a 2x40G card with pcie gen3 x8 block) and the effects of 4K vs. 9K rx on the transmit bandwidth are measurable. 
These days chips can even place multiple frames into a single buffer (if they'd fit) and that's another reason I tend to advocate for larger contiguous buffer sizes. Regards, Navdeep From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 03:08:23 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 0CDF0E1E for ; Thu, 30 Jan 2014 03:08:23 +0000 (UTC) Received: from hergotha.csail.mit.edu (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id B083F1964 for ; Thu, 30 Jan 2014 03:08:22 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.7/8.14.7) with ESMTP id s0U38KQk009863; Wed, 29 Jan 2014 22:08:20 -0500 (EST) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.7/8.14.4/Submit) id s0U38JSM009860; Wed, 29 Jan 2014 22:08:19 -0500 (EST) (envelope-from wollman) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <21225.49699.916951.881502@hergotha.csail.mit.edu> Date: Wed, 29 Jan 2014 22:08:19 -0500 From: Garrett Wollman To: Rick Macklem Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
In-Reply-To: <1352428787.18632865.1391036503658.JavaMail.root@uoguelph.ca> References: <1352428787.18632865.1391036503658.JavaMail.root@uoguelph.ca> X-Mailer: VM 7.17 under 21.4 (patch 22) "Instant Classic" XEmacs Lucid X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (hergotha.csail.mit.edu [127.0.0.1]); Wed, 29 Jan 2014 22:08:20 -0500 (EST) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on hergotha.csail.mit.edu Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 03:08:23 -0000 < said: > However, I do suspect we'll be putting a refined version of the patch > in head someday (maybe April, sooner would have to be committed by > someone else). I suspect that Garrett's code for server read will work > well and I'll cobble something to-gether for server readdir and client write. Once I can get this mps(4) issue ironed out, I should be in a position to get some real data on this. 
-GAWollman From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 03:22:30 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 11C454E5 for ; Thu, 30 Jan 2014 03:22:30 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id CB1321A9F for ; Thu, 30 Jan 2014 03:22:29 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAJfE6VKDaFve/2dsb2JhbABZhBuDAboagRl0giUBAQEEI1YbGAICDRkCWQYTiAWqeqBxF4EpjSI0B4JvgUkEiUmgfoNLHoFu X-IronPort-AV: E=Sophos;i="4.95,746,1384318800"; d="scan'208";a="91740667" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 29 Jan 2014 22:22:22 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id D571DB40D2; Wed, 29 Jan 2014 22:22:22 -0500 (EST) Date: Wed, 29 Jan 2014 22:22:22 -0500 (EST) From: Rick Macklem To: Garrett Wollman Message-ID: <1315174039.18735121.1391052142869.JavaMail.root@uoguelph.ca> In-Reply-To: <21225.49699.916951.881502@hergotha.csail.mit.edu> Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 03:22:30 -0000 Garrett Wollman wrote: > < said: > > > However, I do suspect we'll be putting a refined version of the > > patch > > in head someday (maybe April, sooner would have to be committed by > > someone else). I suspect that Garrett's code for server read will > > work > > well and I'll cobble something to-gether for server readdir and > > client write. > > Once I can get this mps(4) issue ironed out, I should be in a > position > to get some real data on this. > If you can check the network device driver you use and if it looks like it has a scatter size of less than 36 (often a constant with "TXSEG" or "TX_SEG" in the name) and calls either m_defrag() or m_collapse(), adding a counter to see if those functions are being called, would be nice. If the m_collapse()/m_defrag() function is being called without the patch and not with the patch, the performance difference may be avoiding that call and not a more generic benefit. I just did a quick find/grep and it looks like a lot of drivers have *TXSEGS* set to around 32 and then call one of two functions for more than that. Since without a patch, 64K NFS reads/writes hand sosend() an mbuf list of 34 entries, it seems like this could be happening a lot. (I didn't look to see which ones set if_hw_tsomax to significantly less than 64K.) Thanks for working on this, rick. ps: you might want to combine your patch with mine, so readdir and client side writes use 4K clusters. 
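[The 34-entry figure above falls straight out of the cluster size. A back-of-the-envelope sketch, in Python purely for the arithmetic (the real code is C in the krpc/NFS path); the two non-cluster mbufs assumed for the RPC header are my assumption, chosen so the totals match the 34 and 18 quoted in this thread:]

```python
# Rough mbuf-chain length for an NFS I/O: payload in fixed-size clusters
# plus a couple of small mbufs for the RPC header.  The hdr_mbufs=2
# default is an assumption made to match the numbers in this thread.
def mbuf_chain_len(io_size, cluster_size, hdr_mbufs=2):
    data_mbufs = -(-io_size // cluster_size)  # ceiling division
    return hdr_mbufs + data_mbufs

print(mbuf_chain_len(64 * 1024, 2048))  # 2K clusters -> 34
print(mbuf_chain_len(64 * 1024, 4096))  # 4K clusters -> 18
```

[A driver with a 32-element scatter/gather list overflows on the 34-entry chain but comfortably fits the 18-entry one, which is why the 4K-cluster patch sidesteps the m_defrag()/m_collapse() path.]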
> -GAWollman > > From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 03:22:58 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 6508D572 for ; Thu, 30 Jan 2014 03:22:58 +0000 (UTC) Received: from hergotha.csail.mit.edu (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 061511AA4 for ; Thu, 30 Jan 2014 03:22:57 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.7/8.14.7) with ESMTP id s0U3MukO010030; Wed, 29 Jan 2014 22:22:56 -0500 (EST) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.7/8.14.4/Submit) id s0U3Mt3s010029; Wed, 29 Jan 2014 22:22:55 -0500 (EST) (envelope-from wollman) Date: Wed, 29 Jan 2014 22:22:55 -0500 (EST) From: Garrett Wollman Message-Id: <201401300322.s0U3Mt3s010029@hergotha.csail.mit.edu> To: nparhar@gmail.com Subject: Re: Big physically contiguous mbuf clusters X-Newsgroups: mit.lcs.mail.freebsd-net In-Reply-To: <20140129231138$3db6@grapevine.csail.mit.edu> References: <21225.20047.947384.390241@khavrinen.csail.mit.edu> Organization: none X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (hergotha.csail.mit.edu [127.0.0.1]); Wed, 29 Jan 2014 22:22:56 -0500 (EST) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED,LOTS_OF_MONEY autolearn=disabled version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on hergotha.csail.mit.edu Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 03:22:58 -0000 In article <20140129231138$3db6@grapevine.csail.mit.edu>, nparhar@gmail.com writes: >I think this would be very useful. For example, a zone_jumbo32 would >hit a sweet spot -- enough to fit 3 jumbo frames and some loose change >for metadata. I'd like to see us improve our allocators and VM system >to work better with larger contiguous allocations, rather than >deprecating the larger zones. It seems backwards to push towards >smaller allocation units when installed physical memory in a typical >system continues to rise. In order to resist fragmentation, you need to be willing to dedicate some partition of physical memory to larger allocations. That's fine for a special-purpose device like a switch, but is not so good for a general-purpose operating system. But if you were willing to reserve, say, 1/64th of physical memory at boot time, make it all direct-mapped using superpages, and allocate it in fixed-power-of-two-sized chunks, you would probably get a performance win. But the chunks *have* to be fixed-size, otherwise you are nearly guaranteed to get your arena checkerboarded. I'd consider giving 2 GB on a 128-GB machine for that. For NFS performance, you'd probably want to be able to take a whole chunk, read the desired data into it in a single VOP, then pass the whole thing to the socket layer wrapped in an mbuf. 
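[The numbers in Wollman's reservation proposal work out as follows; a hypothetical Python sketch (function names are mine, not any real allocator API):]

```python
# Sketch of the proposed boot-time reservation: 1/64th of physical
# memory set aside, handed out in fixed-power-of-two-sized chunks so
# the arena cannot get checkerboarded.
def reserve_bytes(phys_bytes, fraction=64):
    return phys_bytes // fraction

def chunk_size(request):
    # Round a request up to a power of two; if an arena only ever
    # hands out one fixed size, freed chunks always fit new requests.
    size = 1
    while size < request:
        size *= 2
    return size

GB = 1 << 30
print(reserve_bytes(128 * GB) // GB)  # 2 (GB), the example in the mail
print(chunk_size(9 * 1024))           # a 9k jumbo rounds up to 16384
```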
-GAWollman From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 03:31:10 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2C7096B1; Thu, 30 Jan 2014 03:31:10 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 956461B3B; Thu, 30 Jan 2014 03:31:09 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqQEAPXG6VKDaFve/2dsb2JhbABZg0RXgwG5S0+BGnSCJQEBAQMBAQEBIAQnHQECCwUWGAICDRkCKQEJJgYIBwQBGQMEh1wIDapuoHAXgSmMfwYBAQEaNAeCb4FJBIlJineBFYQFkG2DSx4xfAgXIg X-IronPort-AV: E=Sophos;i="4.95,746,1384318800"; d="scan'208";a="91741811" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 29 Jan 2014 22:31:08 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 2F49BB4184; Wed, 29 Jan 2014 22:31:08 -0500 (EST) Date: Wed, 29 Jan 2014 22:31:08 -0500 (EST) From: Rick Macklem To: Bryan Venteicher Message-ID: <1879662319.18746958.1391052668182.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.203] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, J David , Garrett Wollman X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 03:31:10 -0000 Bryan Venteicher wrote: > On Wed, Jan 29, 2014 at 5:01 PM, Rick Macklem > wrote: > > > J David wrote: > > > On Tue, Jan 28, 2014 at 7:32 PM, Rick Macklem > > > > > > wrote: > > > > Hopefully Garrett and/or you will be able to do some testing of > > > > it > > > > and report back w.r.t. performance gains, etc. > > > > > > OK, it has seen light testing. > > > > > > As predicted the vtnet drops are eliminated and CPU load is > > > reduced. > > > > > Ok, that's good news. Bryan, is increasing VTNET_MAX_TX_SEGS in the > > driver feasible? > > > > > > I've been busy the last few days, and won't be able to get to any > code > until the weekend. > > The current MAX_TX_SEGS value is mostly arbitrary - the implicit > limit is > VIRTIO_MAX_INDIRECT. This value is used in virtqueue.c to allocate an > array > of 'struct vring_desc' which is 16 bytes so we have some next power > of 2 > rounding going on, so we can make it bigger without using any real > additional memory usage. > > But also note I do put an MAX_TX_SEGS sized array of 'struct > sglist_segs' > on the stack so it cannot be made too big. Even what is currently > there is > probably already pushing what's a Good Idea to put on the stack > anyways > (especially since it is near the bottom of a typically pretty deep > call > stack). I've been meaning to move that to hanging on the 'struct > vtnet_txq' > instead. > Well, NFS hands TCP a list of 34 mbufs. 
If TCP only adds one, then increasing it from 34 to 35 would be all it takes. However, see below. > I think all TSO capable drivers that use m_collapse(..., 32) (and > don't set > if_hw_tsomax) are broken - there looks to be several. I was slightly > on top > of my game by using 33 since it appears m_collapse() does not touch > the > pkthdr mbuf (I think that was my thinking 3 years ago, and seems to > be the > case by a quick glance at the code). I think drivers using > m_defrag(..., > 32) are OK, but that function can be much, much more expensive. > Well, even m_defrag(..M_NOWAIT..) can fail and then it means a TCP layer timeout/retransmit. If the allocator is constipated, this could be pretty much a trainwreck, I think. I also agree that m_defrag() adds a lot of overhead, but calling m_collapse() a lot will be quite a bit of overhead, as well. (Also, I don't think that m_collapse() is more likely to fail, since it only copies data to the previous mbuf when the entire mbuf that follows will fit and it's allowed. I'd assume that a ref count copied mbuf cluster doesn't allow this copy or things would be badly broken.) Bottom line, I think calling either m_collapse() or m_defrag() should be considered a "last resort". Maybe the driver could reduce the size of if_hw_tsomax whenever it finds it needs to call one of these functions, to try and avoid a re-occurrence? rick > > However, I do suspect we'll be putting a refined version of the patch > > in head someday (maybe April, sooner would have to be committed by > > someone else). I suspect that Garrett's code for server read will > > work > > well and I'll cobble something together for server readdir and > > client > > write. > >
> > > The performance is also improved:
> > >
> > > Test    Before     After
> > > SeqWr     1506      7461
> > > SeqRd      566    192015
> > > RndRd      602    218730
> > > RndWr       44     13972
> > >
> > > All numbers in kiB/sec.
> > >
If you get the chance, you can try a few tunables on the server.
> > vfs.nfsd.fha.enable=0 > > - ken@ found that FHA was necessary for ZFS exports, to avoid out > > of order reads from confusing ZFS's sequential reading heuristic. > > However, FHA also means that all readaheads for a file are > > serialized > > with the reads for the file (same fh->same nfsd thread). Somehow, > > it > > seems to me that doing reads concurrently in the server (given > > shared > > vnode locks) could be a good thing. > > --> I wonder what the story is for UFS? > > So, it would be interesting to see what disabling FHA does for the > > sequential read test. > > > > I think I already mentioned the DRC cache ones: > > vfs.nfsd.tcphighwater=100000 > > vfs.nfsd.tcpcachetimeo=600 (actually I think Garrett uses 300) > > > > Good to see some progress, rick > > ps: Daniel reports that he will be able to test the patch this > > weekend, to see if it fixes his problem that required TSO > > to be disabled, so we'll wait and see. > > > > > There were initially still some problems with lousy hostcache > > > values > > > on the client after the test, which is what causes the iperf > > > performance to tank after the NFS test, but after a reboot of > > > both > > > sides and fresh retest, I haven't reproduced that again. If it > > > comes > > > back, I'll try to figure out what's going on. > > > > > Hopefully a networking type might know what is going on, because > > this > > is way out of my area of expertise. > > > > > But this definitely looks like a move in the right direction. > > > > > > Thanks! 
> > > _______________________________________________ > > > freebsd-net@freebsd.org mailing list > > > http://lists.freebsd.org/mailman/listinfo/freebsd-net > > > To unsubscribe, send any mail to > > > "freebsd-net-unsubscribe@freebsd.org" > > > > > > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 03:37:26 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 316ED84F for ; Thu, 30 Jan 2014 03:37:26 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id E73981B5E for ; Thu, 30 Jan 2014 03:37:25 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Aj8MACXI6VKDaFve/2dsb2JhbABZDoFkBoFMV4MBuUtPgRp0giUBAQEEAQEBICsgCxsYAgINGQIjBgEJJgYIBwQBHAEDh1ADEQ2qbpdHDYkcF4Epi0GBPgYBAQEaNAeCb4FJBIlJineBFWeDHosshUGCbl0eMXwIFyI X-IronPort-AV: E=Sophos;i="4.95,746,1384318800"; d="scan'208";a="91742974" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 29 Jan 2014 22:37:24 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id D2A4479283; Wed, 29 Jan 2014 22:37:24 -0500 (EST) Date: Wed, 29 Jan 2014 22:37:24 -0500 (EST) From: Rick Macklem To: Garrett Wollman Message-ID: <323566728.18752313.1391053044849.JavaMail.root@uoguelph.ca> In-Reply-To: <201401300322.s0U3Mt3s010029@hergotha.csail.mit.edu> Subject: Re: Big physically contiguous mbuf clusters MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 
Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.209] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, nparhar@gmail.com X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 03:37:26 -0000 Garrett Wollman wrote: > In article <20140129231138$3db6@grapevine.csail.mit.edu>, > nparhar@gmail.com writes: > > >I think this would be very useful. For example, a zone_jumbo32 > >would > >hit a sweet spot -- enough to fit 3 jumbo frames and some loose > >change > >for metadata. I'd like to see us improve our allocators and VM > >system > >to work better with larger contiguous allocations, rather than > >deprecating the larger zones. It seems backwards to push towards > >smaller allocation units when installed physical memory in a typical > >system continues to rise. > > In order to resist fragmentation, you need to be willing to dedicate > some partition of physical memory to larger allocations. That's fine > for a special-purpose device like a switch, but is not so good for a > general-purpose operating system. But if you were willing to > reserve, > say, 1/64th of physical memory at boot time, make it all > direct-mapped > using superpages, and allocate it in fixed-power-of-two-sized chunks, > you would probably get a performance win. But the chunks *have* to > be > fixed-size, otherwise you are nearly guaranteed to get your arena > checkerboarded. I'd consider giving 2 GB on a 128-GB machine for > that. > > For NFS performance, you'd probably want to be able to take a whole > chunk, read the desired data into it in a single VOP, then pass the > whole thing to the socket layer wrapped in an mbuf. > Yep, 1 64K (or 128K soon) mbuf would be nice for read, readdir, write. 
(Assuming tcp_output knows how to split it up for net interfaces that can't handle TSO segments that large.) I'm not sure why, but most use 65535 (max IP datagram size) as if_hw_tsomax. (This guarantees the 64K NFS send gets split up. Doesn't TSO split it up into MTU-sized segments? If so, I don't see why if_hw_tsomax would be a limit?) I'm not knowledgeable w.r.t. TSO, so feel free to ignore or correct this. rick > -GAWollman > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 03:50:01 2014 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id B34299B7 for ; Thu, 30 Jan 2014 03:50:01 +0000 (UTC) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 86EAB1D19 for ; Thu, 30 Jan 2014 03:50:01 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id s0U3o1TR066908 for ; Thu, 30 Jan 2014 03:50:01 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id s0U3o1ii066905; Thu, 30 Jan 2014 03:50:01 GMT (envelope-from gnats) Date: Thu, 30 Jan 2014 03:50:01 GMT Message-Id: <201401300350.s0U3o1ii066905@freefall.freebsd.org> To: freebsd-net@FreeBSD.org Cc: From: Takefu Subject: Re: kern/121257: [tcp] TSO + natd -> slow outgoing tcp traffic X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list Reply-To: Takefu
List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 03:50:01 -0000 The following reply was made to PR kern/121257; it has been noted by GNATS. From: Takefu To: bug-followup@FreeBSD.org Cc: vnovy@vnovy.net Subject: Re: kern/121257: [tcp] TSO + natd -> slow outgoing tcp traffic Date: Thu, 30 Jan 2014 12:42:33 +0900 Limited improvement method 8.4-RELEASE 9.2-RELEASE 10.0-RELEASE --- /usr/src/etc/rc.d/natd 2013-07-01 15:47:09.000000000 +0900 +++ /etc/rc.d/natd 2014-01-30 12:26:43.000000000 +0900 @@ -36,6 +36,7 @@ fi fi + sysctl net.inet.tcp.tso=0 > /dev/null return 0 } From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 03:56:31 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 27A70B03 for ; Thu, 30 Jan 2014 03:56:31 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id E23921DA2 for ; Thu, 30 Jan 2014 03:56:30 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAM7M6VKDaFve/2dsb2JhbABZhBuDAboRCYEadIJPBIEHAg0ZAl+IGJthjxGgcReBKY0igyqBSQSJSaB+g0segW4 X-IronPort-AV: E=Sophos;i="4.95,746,1384318800"; d="scan'208";a="91744702" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 29 Jan 2014 22:56:29 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id DBB80B4063 for ; Wed, 29 Jan 2014 22:56:29 -0500 (EST) Date: Wed, 29 Jan 2014 22:56:29 -0500 (EST) From: Rick Macklem To: FreeBSD Net Message-ID: <24918548.18766184.1391054189890.JavaMail.root@uoguelph.ca> Subject: 64K NFS I/O generates a 34mbuf list for 
TCP which breaks TSO MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.209] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 03:56:31 -0000 For some time, I've been seeing reports of NFS related issues that get resolved by the user either disabling TSO or reducing the rsize/wsize to 32K. I now think I know why this is happening, although the evidence is just coming in. (I have no hardware/software that does TSO, so I never see these problems during testing.) A 64K NFS read reply, readdir reply or write request results in the krpc handing the TCP socket an mbuf list with 34 entries via sosend(). Now, I am really rusty w.r.t. TCP, but it looks like this will result in a TCP/IP header + 34 data mbufs being handed to the network device driver, if if_hw_tsomax has the default setting of 65535 (max IP datagram). At a glance, many drivers use a scatter/gather list of around 32 elements for transmission. If the mbuf list doesn't fit in this scatter/gather list (which looks to me like it will be the case), then the driver either calls m_defrag() or m_collapse() to try and fix the problem. This seems like a serious problem to me. 1 - If m_collapse()/m_defrag() fails, the transmit doesn't happen and things wedge until a TCP timeout retransmit gets things going again. It looks like m_defrag() is less likely to fail, but generates a lot of overhead. m_collapse() seems to be less overhead, but seems less likely to succeed. (Since m_defrag() is called with M_NOWAIT, it can fail in that extreme case. I'm not sure if it will fail otherwise?) So, how to fix this? 
1 - Change NFS to use 4K clusters for these 64K reads/writes, reducing the mbuf list from 34->18. Preliminary patches for this are being tested.
    --> However, this seems to be more of a work-around than a fix.
2 - As soon as a driver needs to call m_defrag() or m_collapse() because the length of the TSO transmit mbuf list is too long, reduce if_hw_tsomax by a significant amount to try and get tcp_output() to generate shorter mbuf lists. Not great, but at least better than calling m_defrag()/m_collapse() over and over and over again.
    --> As a starting point, instrumenting the device drivers so that counts of # of calls to m_defrag()/m_collapse() and counts of failed calls would help to confirm how serious this problem is.
3 - ??? Any ideas from folk familiar with TSO and these drivers.
rick
ps: Until this gets resolved, please tell anyone with serious NFS performance/reliability issues to try either disabling TSO or doing client mounts with "-o rsize=32768,wsize=32768". I'm not sure how many believe me when I tell them, but at least I now have a theory as to why it can help a lot.
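[Option 2 could be modelled roughly as below. This is a userspace sketch of the heuristic only, not driver code; the 25% backoff factor, the class/field names, and the 2K-cluster chain-length estimate are all invented for illustration:]

```python
# Model of a driver that shrinks its advertised TSO limit each time a
# transmit chain overflows the scatter/gather list, so tcp_output()
# starts handing it shorter chains instead of forcing m_defrag() on
# every send.  Names and numbers are illustrative, not from any driver.
MAX_TX_SEGS = 32

class TsoModel:
    def __init__(self, tsomax=65535):
        self.tsomax = tsomax
        self.defrag_calls = 0

    def chain_len(self):
        # 2K clusters plus two header mbufs, as with a 64K NFS write.
        return 2 + -(-self.tsomax // 2048)

    def transmit(self):
        if self.chain_len() > MAX_TX_SEGS:
            self.defrag_calls += 1              # would call m_defrag() here
            self.tsomax = self.tsomax * 3 // 4  # back off by 25%
            return False                        # retry with shorter chains
        return True

m = TsoModel()
while not m.transmit():
    pass
print(m.defrag_calls, m.tsomax)  # 1 49151: one defrag, then chains fit
```

[One backoff from the 65535 default already brings the chain under the 32-segment limit, so the expensive copy happens once instead of on every 64K I/O.]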
From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 08:47:08 2014 Return-Path: Delivered-To: freebsd-net@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 75D9F310 for ; Thu, 30 Jan 2014 08:47:08 +0000 (UTC) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id C57F8127F for ; Thu, 30 Jan 2014 08:47:07 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id KAA10574 for ; Thu, 30 Jan 2014 10:46:59 +0200 (EET) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1W8nH0-0007cE-TK for freebsd-net@freebsd.org; Thu, 30 Jan 2014 10:46:58 +0200 Message-ID: <52EA114C.40908@FreeBSD.org> Date: Thu, 30 Jan 2014 10:46:04 +0200 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: FreeBSD Net Subject: Re: Big physically contiguous mbuf clusters References: <21225.20047.947384.390241@khavrinen.csail.mit.edu> <20140129222714.GK93141@funkthat.com> In-Reply-To: <20140129222714.GK93141@funkthat.com> X-Enigmail-Version: 1.6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 08:47:08 -0000 on 30/01/2014 00:27 John-Mark Gurney said the following: > Adrian Chadd wrote this message on Wed, Jan 29, 2014 at 14:21 -0800: >> On 29 January 2014 10:54, Garrett Wollman wrote: >>> Resolved: that mbuf clusters longer than one page ought not be >>> supported. 
There is too much physical-memory fragmentation for them >>> to be of use on a moderately active server. 9k mbufs are especially >>> bad, since in the fragmented case they waste 3k per allocation. >> >> I've been wondering whether it'd be feasible to teach the physical >> memory allocator about >page sized allocations and to create zones of >> slightly more physically contiguous memory. >> >> For servers with lots of memory we could then keep these around and >> only dip into them for temporary allocations (eg not VM pages that may >> be held for some unknown amount of time.) >> >> Question is - can we enforce that kind of behaviour? > > It shouldn't be too hard to do... Since everything pretty much goes > through uma we can adopt a scheme similar to what Solaris does (read > Magazines and Vmem: Extending the Slab Allocator to Many CPUs and > Arbitrary Resources)... Instead of dealing w/ page size allocations, > everything is larger, say 16KB, and broken down from there... > FWIW, this is not how it is currently implemented in Solaris judging from OpenSolaris / illumos code. They try to find a slab size where the waste would be minimal. There is a cap on the maximum slab size, of course. This is also done for sub-page items. E.g. if an item size is 3KB, then FreeBSD uma would use 4KB slabs and waste about 1KB in each slab. On the other hand, illumos kmem cache code would pick 12KB slab size. 
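[The slab-size trade-off Andriy describes is easy to quantify. A small illustrative Python sketch (the real logic lives in uma(9) and the illumos kmem cache code; the 16-page cap and the tie-breaking rule are my assumptions) comparing per-slab waste for a 3KB item, ignoring per-slab metadata:]

```python
PAGE = 4096

def waste(item_size, slab_size):
    # Bytes left over once the slab is packed with whole items.
    return slab_size % item_size

# uma-style: one page per slab for a 3KB item.
print(waste(3 * 1024, PAGE))  # 1024 bytes wasted per 4KB slab

# illumos-kmem-style: scan multi-page slab sizes (capped here at 16
# pages) for the one with minimal fractional waste, smallest first.
best = min(range(PAGE, 16 * PAGE + 1, PAGE),
           key=lambda s: (waste(3 * 1024, s) / s, s))
print(best, waste(3 * 1024, best))  # 12288 0: a 12KB slab fits 4 items exactly
```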
-- Andriy Gapon From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 09:36:49 2014 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 23A66E0A; Thu, 30 Jan 2014 09:36:49 +0000 (UTC) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id EC2D5170F; Thu, 30 Jan 2014 09:36:48 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id s0U9amgi073278; Thu, 30 Jan 2014 09:36:48 GMT (envelope-from vanhu@freefall.freebsd.org) Received: (from vanhu@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id s0U9amBJ073277; Thu, 30 Jan 2014 09:36:48 GMT (envelope-from vanhu) Date: Thu, 30 Jan 2014 09:36:48 GMT Message-Id: <201401300936.s0U9amBJ073277@freefall.freebsd.org> To: vanhu@FreeBSD.org, freebsd-net@FreeBSD.org, vanhu@FreeBSD.org From: vanhu@FreeBSD.org Subject: Re: kern/169438: [ipsec] ipv4-in-ipv6 tunnel mode IPsec does not work X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 09:36:49 -0000 Synopsis: [ipsec] ipv4-in-ipv6 tunnel mode IPsec does not work Responsible-Changed-From-To: freebsd-net->vanhu Responsible-Changed-By: vanhu Responsible-Changed-When: jeu 30 jan 2014 09:34:17 UTC Responsible-Changed-Why: Hi. Your hack solves the issue for ipv4-in-ipv6, but the same issue exists for ipv6-in-ipv4, and requires some more refactoring of the code. 
We're working on such a patch for both ways, and I hope we'll have a version ready to commit within the next few weeks. http://www.freebsd.org/cgi/query-pr.cgi?pr=169438 From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 13:45:30 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id B70737C9 for ; Thu, 30 Jan 2014 13:45:30 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 3712B1E50 for ; Thu, 30 Jan 2014 13:45:30 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.7/8.14.7) with ESMTP id s0UDjKQm032687 for ; Thu, 30 Jan 2014 15:45:20 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.3 kib.kiev.ua s0UDjKQm032687 Received: (from kostik@localhost) by tom.home (8.14.7/8.14.7/Submit) id s0UDjKXU032686 for freebsd-net@freebsd.org; Thu, 30 Jan 2014 15:45:20 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 30 Jan 2014 15:45:19 +0200 From: Konstantin Belousov To: FreeBSD Net Subject: Re: Big physically contiguous mbuf clusters Message-ID: <20140130134519.GU24664@kib.kiev.ua> References: <21225.20047.947384.390241@khavrinen.csail.mit.edu> <20140129231121.GA18434@ox> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="oGy11dVowAZA6eXT" Content-Disposition: inline In-Reply-To: <20140129231121.GA18434@ox> User-Agent: Mutt/1.5.22 (2013-10-16) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jan 2014 13:45:30 -0000 --oGy11dVowAZA6eXT Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Wed, Jan 29, 2014 at 03:11:21PM -0800, Navdeep Parhar wrote: > On Wed, Jan 29, 2014 at 02:21:21PM -0800, Adrian Chadd wrote: > > Hi, > > > > On 29 January 2014 10:54, Garrett Wollman wrote: > > > Resolved: that mbuf clusters longer than one page ought not be > > > supported. There is too much physical-memory fragmentation for them > > > to be of use on a moderately active server. 9k mbufs are especially > > > bad, since in the fragmented case they waste 3k per allocation. > > > > I've been wondering whether it'd be feasible to teach the physical > > memory allocator about >page sized allocations and to create zones of > > slightly more physically contiguous memory. > > I think this would be very useful. For example, a zone_jumbo32 would > hit a sweet spot -- enough to fit 3 jumbo frames and some loose change > for metadata. I'd like to see us improve our allocators and VM system > to work better with larger contiguous allocations, rather than > deprecating the larger zones. It seems backwards to push towards > smaller allocation units when installed physical memory in a typical > system continues to rise. > > Allocating 3 x 4K instead of 1 x 9K for a jumbo means 3x the number of > vtophys translations, 3x the phys_addr/len traffic on the PCIe bus > (scatter list has to be fed to the chip and now it's 3x what it has to > be), 3x the number of "wrapper" mbuf allocations (one for each 4K > cluster) which will then be stitched together to form a frame, etc. etc.
If the platform supports an IOMMU, then the physical contiguity of the pages could be ignored: with a proper busdma tag, the VT-d driver allocates contiguous bus address space for the device-view mapping. Of course, this is moot right now, because drivers have no idea whether an IOMMU is present, and because the IOMMU busdma is both disabled by default and has a non-trivial setup cost.

>
> Regards,
> Navdeep
>
> >
> > For servers with lots of memory we could then keep these around and
> > only dip into them for temporary allocations (eg not VM pages that may
> > be held for some unknown amount of time.)
> >
> > Question is - can we enforce that kind of behaviour?
> >
> > -a

From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 15:06:29 2014
Date: Thu, 30 Jan 2014 10:06:28 -0500 (EST)
From: Rick Macklem
To: FreeBSD Net
Subject: Re: 64K NFS I/O generates a 34mbuf list for TCP which breaks TSO

Hi, just adding one more idea on what to do about this to the list:

- Add an if_hw_tsomaxseg and modify the loop in tcp_output() so that it
  uses both if_hw_tsomax and if_hw_tsomaxseg to decide how much to hand
  to the device driver in each mbuf list.
(I haven't looked to see how easy it would be to change this loop.)

rick

From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 15:12:08 2014
Date: Thu, 30 Jan 2014 07:12:07 -0800
From: Adrian Chadd
To: Rick Macklem
Subject: Re: 64K NFS I/O generates a 34mbuf list for TCP which breaks TSO

On 30 January 2014 07:06, Rick Macklem wrote:
> Hi, just adding one more idea on what to do about this
> to the list:
> - Add an if_hw_tsomaxseg and modify the loop in tcp_output()
> so that it uses both if_hw_tsomax and if_hw_tsomaxseg to
> decide how much to hand to the device driver in each mbuf list.
> (I haven't looked to see how easy it would be to change this loop.)

I don't think that's a hack. I think adding that and setting tsomaxseg to, say, 30 for now would be a good compromise.

-a

From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 15:18:29 2014
Date: Thu, 30 Jan 2014 07:18:28 -0800
From: Adrian Chadd
To: FreeBSD Net, "freebsd-arch@freebsd.org"
Subject:

Hi,

I'd like to disable the code in flowtable.c that assigns the mbuf flowid.

I'd like to ensure that any mbuf flowid that's set is (eventually) going to be consistently Toeplitz in the future (to match what NICs are doing on the RX side), and this may cause the flowid to be set to something completely different.

I've only done some light production testing with this so far, to no visible ill effects.

What do people think?
Thanks,

-a

Index: sys/net/flowtable.c
===================================================================
--- sys/net/flowtable.c	(revision 261001)
+++ sys/net/flowtable.c	(working copy)
@@ -1102,10 +1102,12 @@
 	if (af == AF_INET6)
 		fle = flowtable_lookup_mbuf6(ft, m);
 #endif
+#if 0
 	if (fle != NULL && m != NULL && (m->m_flags & M_FLOWID) == 0) {
 		m->m_flags |= M_FLOWID;
 		m->m_pkthdr.flowid = fle->f_fhash;
 	}
+#endif
 	return (fle);
 }

From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 18:40:44 2014
Date: Thu, 30 Jan 2014 10:40:43 -0800
From: Adrian Chadd
To: FreeBSD Net, "freebsd-arch@freebsd.org"
Subject: Re: (removing mbuf flowid setup in flowtable.c)

On 30 January 2014 07:18, Adrian Chadd wrote:
> Hi,
>
> I'd like to disable the code in flowtable.c that assigns the mbuf flowid.
>
> I'd like to ensure that any mbuf flowid that's set is (eventually)
> going to be consistently toeplitz in the future (to match what NICs
> are doing on the RX side) and this may cause the flowid to be set to
> something completely different.
>
> I've only done some light production testing with this so far, to no
> visible ill effects.
>
> What do people think?

Someone pointed out privately that doing this would mean that UDP flows without flow ids would suddenly not have flowids any longer and thus wouldn't use multiple output queues.

So, I'll leave this alone for now until I can import the toeplitz hash code into -HEAD and add an option to tag outbound UDP frames with this particular flowid hash.
Thanks,

-a

From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 18:49:12 2014
Date: Thu, 30 Jan 2014 11:48:57 -0700
From: Warner Losh
To: Adrian Chadd
Subject: Re: (removing mbuf flowid setup in flowtable.c)

On Jan 30, 2014, at 11:40 AM, Adrian Chadd wrote:
> Someone pointed out privately that doing this would mean that UDP
> flows without flow ids would suddenly not have flowids any longer and
> thus wouldn't use multiple output queues.
>
> So, I'll leave this alone for now until I can import the toeplitz hash
> code into -HEAD and add an option to tag outbound udp frames with this
> particular flowid hash.

Toeplitz is a funky kind of matrix, according to Google. What does that have to do with mbufs? :)

Warner

From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 18:50:32 2014
Date: Thu, 30 Jan 2014 10:50:31 -0800
From: Adrian Chadd
To: Warner Losh
Subject: Re: (removing mbuf flowid setup in flowtable.c)

On 30 January 2014 10:48, Warner Losh wrote:
> Toeplitz is a funky kind of matrix, according to Google. What does
> that have to do with mbufs? :)

Google "toeplitz hash rss". :-)

-a

From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 20:30:18 2014
Date: Thu, 30 Jan 2014 15:30:16 -0500
From: J David
To: Rick Macklem
Cc: Bryan Venteicher, Garrett Wollman, freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?

On Wed, Jan 29, 2014 at 10:31 PM, Rick Macklem wrote:
>> I've been busy the last few days, and won't be able to get to any
>> code until the weekend.

Is there likely to be more to it than just cranking the MAX_TX_SEGS value and recompiling? If so, is it something I could take on?

> Well, NFS hands TCP a list of 34 mbufs. If TCP only adds one, then
> increasing it from 34 to 35 would be all it takes. However, see below.

One thing I don't want to miss here is that an NFS block size of 65,536 is really suboptimal. The largest size of a TCP datagram is 65,535. So by the time NFS adds its overhead and the total amount of data to be sent winds up in that ~65k range, it guarantees that the operation has to be split into at least two TCP packets, one max-size and one tiny one. This doubles a lot of the network stack overhead, regardless of whether the packet ends up being segmented into tiny bits down the road or not.

If NFS could be modified to respect the actual size of a TCP packet, generating a steady stream of 63.9k (or thereabouts) writes instead of the current 64k-1k-64k-1k pattern, performance would likely see another significant boost.
This would nearly double the average throughput per packet, which would help with network latency and CPU load.

It's also not 100% clear, but it seems like in some cases the existing behavior also causes the TCP stack to park on the "leftover" bit and wait for more data, which comes in another >64k chunk, and from there on out there's no more correlation between TCP packets and NFS operations, so an operation doesn't begin on a packet boundary. That continues as long as load keeps up. That's probably not good for performance either. And it certainly confuses the heck out of tcpdump.

Probably 60k would be the next most reasonable size, since it's the largest page-size multiple that will fit into a TCP packet while still leaving room for overhead.

Since the max size of TCP packets is not an area where there's really any flexibility, what would have to happen to NFS to make that (or arbitrary values) perform at its best within that constraint?

It's apparent from even trivial testing that performance is dramatically affected if the "use a power of two for NFS rsize/wsize" recommendation isn't followed, but what is the origin of that? Is it something that could be changed?

> I don't think that m_collapse() is more likely to fail, since it
> only copies data to the previous mbuf when the entire mbuf that
> follows will fit and it's allowed. (I'd assume that a ref count
> copied mbuf cluster doesn't allow this copy or things would be
> badly broken.)

m_collapse checks M_WRITEABLE, which appears to cover the ref count case. (It's a dense macro, but it seems to require a ref count of 1 if a cluster is used.)

The cases where m_collapse can succeed are pretty slim. It pretty much requires two consecutive underutilized buffers, which probably explains why it fails so often in this code path.
Since one of its two methods outright skips the packet header mbuf (to avoid the risk of moving it), possibly the only case where it succeeds is when the last data mbuf is short enough that whatever NFS trailers are being appended can fit with it.

> Bottom line, I think calling either m_collapse() or m_defrag()
> should be considered a "last resort".

It definitely seems more designed for a case where 8 different stack layers each put their own little header/trailer fingerprint on the packet, and that's not what's happening here.

> Maybe the driver could reduce the size of if_hw_tsomax whenever
> it finds it needs to call one of these functions, to try and avoid
> a re-occurrence?

Since the issue is one of segment length rather than packet length, this seems risky. If one of those touched-by-everybody packets goes by, it may not be that large, but it would risk permanently (until reboot) dropping the throughput of that interface.

Thanks!

From owner-freebsd-net@FreeBSD.ORG Thu Jan 30 22:44:06 2014
Date: Thu, 30 Jan 2014 17:44:03 -0500 (EST)
From: Rick Macklem
To: J David
Cc: Bryan Venteicher, Garrett Wollman, freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?

J David wrote:
> One thing I don't want to miss here is that an NFS block size of
> 65,536 is really suboptimal. The largest size of a TCP datagram is
> 65,535. So by the time NFS adds the overhead on and the total amount
> of data to be sent winds up in that ~65k range, it guarantees that
> the operation has to be split into at least two TCP packets, one
> max-size and one tiny one.

For your virtual network, yes. For the underlying file system on the server (which would not normally be in memory), a large block size will normally be good. (No one size fits all, which is why there are the rsize/wsize mount options.)
To be honest, the limit is MAXBSIZE, which just happens to be 64K at this time. I'd like to see MAXBSIZE increased to at least 128K, since that is the default blocksize for ZFS, I've been told. Also, for real networks, the NFS RPC message will be broken into quite a few packets to go on the wire, as far as I know. (I don't think there are real networks using a 64K jumbo packet, is there?) For my hardware, the packets will be 1500bytes each on the wire, since nothing I have does jumbo packets. Unfortunately, NFS adds a little bit to the front of the data, so an NFS RPC will always be a little bit more than a power of 2 in size for reads/writes of a power of 2. Also, most NFS RPC messages are small, so NFS traffic is always going to have a lot of small TCP segments interspersed with a few large ones (and going in both directions on the TCP connection concurrently). Now, I am not sure why 65535 (largest ip datagram) has been chosen as the default limit for TSO segments? (From my point of view, it would be nice if the limit were larger, assuming there is a limit on the number of mbufs in the list, so that calls to m_collapse()/m_defrag() are avoided. I am hoping the networking types consider my recent post and maybe the suggestion of having a if_hw_tsomaxseg limit along with if_hw_tsomax.) > If NFS could be modified to respect the actual size of a TCP packet, > generating a steady stream of 63.9k (or thereabout) writes instead of > the current 64k-1k-64k-1k, performance would likely see another > significant boost. This would nearly double the average throughput > per packet, which would help with network latency and CPU load. > > It's also not 100% clear but it seems like in some cases the existing > behavior also causes the TCP stack to park on the "leftover" bit and > wait for more data, which comes in another >64k chunk, and from there > on out there's no more correlation between TCP packets and NFS > operations, so an operation doesn't begin on a packet boundary. 
That > continues as long as load keeps up. That's probably not good for > performance either. And it certainly confuses the heck out of > tcpdump. > Well, since NFS sets the TCP_NODELAY socket option, that shouldn't occur in the TCP layer. If some network device driver is delaying, waiting for more to send, then I'd say that device driver is broken. > Probably 60k would be the next most reasonable size, since it's the > largest page size multiple that will fit into a TCP packet while > still > leaving room for overhead. > > Since the max size of TCP packets is not an area where there's really > any flexibility, what would have to happen to NFS to make that (or > arbitrary values) perform at its best within that constraint? > For real NFS environments, the performance of the file system and underlying disk subsystem is generally more important than the network. (Your benchmark has artificially taken the file system on disk out of the mix, so you will see an exaggerated effect from network performance. This is fine if you are looking for network bottlenecks, but not if you want to relate this to performance of a real NFS environment.) I already mentioned that the Linux client doing file_sync 8K writes will result in poor performance of a server's disk file system. (Some NAS vendors avoid this by using non-volatile ram in the server as stable storage, but a FreeBSD server can't expect such hardware to be available.) > It's apparent from even trivial testing that performance is > dramatically affected if the "use a power of two for NFS rsize/wsize" > recommendation isn't followed, but what is the origin of that? Is it > something that could be changed? > Because disk file systems on file servers always use block sizes that are a power of 2. > > I don't think that m_collapse() is more likely to fail, since it > > only copies data to the previous mbuf when the entire mbuf that > > follows will fit and it's allowed. 
I'd assume that a ref count > > copied mbuf cluster doesn't allow this copy or things would be > > badly broken.) > > m_collapse checks M_WRITABLE which appears to cover the ref count > case. (It's a dense macro, but it seems to require a ref count of 1 > if a cluster is used.) > > The cases where m_collapse can succeed are pretty slim. It pretty > much requires two consecutive underutilized buffers, which probably > explains why it fails so often in this code path. Since one of its > two methods outright skips the packet header mbuf (to avoid risk of > moving it), possibly the only case where it succeeds is when the last > data mbuf is short enough that whatever NFS trailers are being > appended can fit with it. > Yes, I would agree with this. (I think I somehow mistyped what I meant to say. I didn't mean to imply that m_collapse() will usually succeed for these long NFS mbuf list RPC messages.) > > Bottom line, I think calling either m_collapse() or m_defrag() > > should be considered a "last resort". > > It definitely seems more designed for a case where 8 different stack > layers each put their own little header/trailer fingerprint on the > packet, and that's not what's happening here. > > > Maybe the driver could reduce the size of if_hw_tsomax whenever > > it finds it needs to call one of these functions, to try and avoid > > a re-occurrence? > > Since the issue is one of segment length rather than packet length, > this seems risky. If one of those touched-by-everybody packets goes > by, it may not be that large, but it would risk permanently (until > reboot) dropping the throughput of that interface. > Agreed. I think adding an if_hw_tsomaxseg that TCP can use is preferable. I didn't think of that until after sending the first post. Also, I think adding it implies a driver KPI change, which means it can't be done for 9.n or 10.n. rick > Thanks!
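The m_collapse() success condition discussed in this exchange can be illustrated with a small user-space sketch. This is a toy model, not the real mbuf API: data from the following mbuf is folded into its predecessor only when it fits whole and the predecessor is writable, which is why two consecutive underutilized buffers are needed for it to succeed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy mbuf: a small fixed buffer plus a fill count and a writable flag.
 * (Hypothetical simplification; the real struct mbuf is far richer.) */
#define TOYBUFSZ 256
struct toy_mbuf {
    struct toy_mbuf *m_next;
    int m_len;        /* bytes used in m_data */
    int m_writable;   /* stands in for the M_WRITABLE check */
    char m_data[TOYBUFSZ];
};

/* Try to fold m->m_next into m, in the spirit of m_collapse(): only
 * when the *entire* next buffer fits in the free space of the current
 * one and the current one is writable.  Returns 1 on success. */
static int
toy_collapse_pair(struct toy_mbuf *m)
{
    struct toy_mbuf *n = m->m_next;

    if (n == NULL || !m->m_writable)
        return (0);
    if (n->m_len > TOYBUFSZ - m->m_len)
        return (0);             /* next mbuf does not fit whole: give up */
    memcpy(m->m_data + m->m_len, n->m_data, n->m_len);
    m->m_len += n->m_len;
    m->m_next = n->m_next;      /* unlink the now-empty mbuf */
    return (1);
}
```

Two nearly full buffers fail the "fits whole" test, which matches the observation that the call rarely helps on these long NFS chains.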
> _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Fri Jan 31 03:32:44 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 14BE032D; Fri, 31 Jan 2014 03:32:44 +0000 (UTC) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 5C6CD13D3; Fri, 31 Jan 2014 03:32:40 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: X-IronPort-AV: E=Sophos;i="4.95,754,1384318800"; d="scan'208";a="92529657" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 30 Jan 2014 22:32:32 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 23612B403B; Thu, 30 Jan 2014 22:32:32 -0500 (EST) Date: Thu, 30 Jan 2014 22:32:32 -0500 (EST) From: Rick Macklem To: Adrian Chadd Message-ID: <1856284835.584005.1391139152133.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: 64K NFS I/O generates a 34mbuf list for TCP which breaks TSO MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_Part_584003_402802933.1391139152131" X-Originating-IP: [172.17.91.203] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: FreeBSD Net X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Jan 2014 03:32:44 -0000 
------=_Part_584003_402802933.1391139152131 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Adrian Chadd wrote: > On 30 January 2014 07:06, Rick Macklem wrote: > > Hi, just adding one more idea on what to do about this > > to the list: > > - Add an if_hw_tsomaxseg and modify the loop in tcp_output() > > so that it uses both if_hw_tsomax and if_hw_tsomaxseg to > > decide how much to hand to the device driver in each mbuf list. > > (I haven't looked to see how easy it would be to change this > > loop.) > > I don't think that's a hack. I think adding that and setting > tsomaxseg > to say 30 for now would be a good compromise. > Well, my TCP is very rusty and I have no way to test it (I don't have anything that does TSO), but I've attached a stab at a patch to do this. Maybe it can be used as a starting point for this, if others think it makes sense. The "#ifdef notyet" in the patch would become something like: # if __FreeBSD_version >= NNNN when a change to add if_hw_tsomaxseg is done, was what I was thinking.
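The idea in the attached patch can be sketched in a few lines of user-space C. This is a hypothetical simplification, not the patch itself: walk the mbuf chain to be sent, stop at the driver's segment limit, and clamp the burst length so a follow-up send covers the rest (the toy mbuf struct and the name clamp_tso_len are illustrative).

```c
#include <assert.h>
#include <stddef.h>

/* Toy mbuf: just a length and a next pointer. */
struct toy_mbuf {
    struct toy_mbuf *m_next;
    long m_len;
};

/* Given the chain to send and the driver's segment (mbuf) limit, clamp
 * "len" so the TSO burst spans at most maxsegs mbufs.  Returns the
 * (possibly reduced) length; *sendalot is set when the send must be
 * split and tcp_output() should loop again. */
static long
clamp_tso_len(struct toy_mbuf *m, long len, int maxsegs, int *sendalot)
{
    long tlen = 0;
    int cnt = 0;

    *sendalot = 0;
    while (m != NULL && cnt < maxsegs && tlen < len) {
        tlen += m->m_len;
        cnt++;
        m = m->m_next;
    }
    if (m != NULL && tlen < len) {
        /* Hit the segment limit before covering len: send less now. */
        len = tlen;
        *sendalot = 1;
    }
    return (len);
}
```

With a 34-mbuf chain and a limit of 30 the burst gets split into two sends; a limit of 35 or more passes the whole chain through untouched, avoiding the m_defrag()/m_collapse() path in the driver.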
rick > > > -a > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > ------=_Part_584003_402802933.1391139152131 Content-Type: text/x-patch; name=tsomaxseg.patch Content-Disposition: attachment; filename=tsomaxseg.patch Content-Transfer-Encoding: base64 LS0tIGtlcm4vdWlwY19zb2NrYnVmLmMuc2F2CTIwMTQtMDEtMzAgMjA6Mjc6MTcuMDAwMDAwMDAw IC0wNTAwCisrKyBrZXJuL3VpcGNfc29ja2J1Zi5jCTIwMTQtMDEtMzAgMjI6MTI6MDguMDAwMDAw MDAwIC0wNTAwCkBAIC05NjUsNiArOTY1LDM5IEBAIHNic25kcHRyKHN0cnVjdCBzb2NrYnVmICpz YiwgdV9pbnQgb2ZmLCAKIH0KIAogLyoKKyAqIFJldHVybiB0aGUgZmlyc3QgbWJ1ZiBmb3IgdGhl IHByb3ZpZGVkIG9mZnNldC4KKyAqLworc3RydWN0IG1idWYgKgorc2JzbmRtYnVmKHN0cnVjdCBz b2NrYnVmICpzYiwgdV9pbnQgb2ZmLCBsb25nICpmaXJzdF9sZW4pCit7CisJc3RydWN0IG1idWYg Km07CisKKwlLQVNTRVJUKHNiLT5zYl9tYiAhPSBOVUxMLCAoIiVzOiBzYl9tYiBpcyBOVUxMIiwg X19mdW5jX18pKTsKKworCSpmaXJzdF9sZW4gPSAwOworCS8qCisJICogSXMgb2ZmIGJlbG93IHN0 b3JlZCBvZmZzZXQ/IEhhcHBlbnMgb24gcmV0cmFuc21pdHMuCisJICogSWYgc28sIGp1c3QgdXNl IHNiX21iLgorCSAqLworCWlmIChzYi0+c2Jfc25kcHRyID09IE5VTEwgfHwgc2ItPnNiX3NuZHB0 cm9mZiA+IG9mZikKKwkJbSA9IHNiLT5zYl9tYjsKKwllbHNlIHsKKwkJbSA9IHNiLT5zYl9zbmRw dHI7CisJCW9mZiAtPSBzYi0+c2Jfc25kcHRyb2ZmOworCX0KKwl3aGlsZSAob2ZmID4gMCAmJiBt ICE9IE5VTEwpIHsKKwkJaWYgKG9mZiA8IG0tPm1fbGVuKQorCQkJYnJlYWs7CisJCW9mZiAtPSBt LT5tX2xlbjsKKwkJbSA9IG0tPm1fbmV4dDsKKwl9CisJaWYgKG0gIT0gTlVMTCkKKwkJKmZpcnN0 X2xlbiA9IG0tPm1fbGVuIC0gb2ZmOworCisJcmV0dXJuIChtKTsKK30KKworLyoKICAqIERyb3Ag YSByZWNvcmQgb2ZmIHRoZSBmcm9udCBvZiBhIHNvY2tidWYgYW5kIG1vdmUgdGhlIG5leHQgcmVj b3JkIHRvIHRoZQogICogZnJvbnQuCiAgKi8KLS0tIHN5cy9zb2NrYnVmLmguc2F2CTIwMTQtMDEt MzAgMjA6NDI6MjguMDAwMDAwMDAwIC0wNTAwCisrKyBzeXMvc29ja2J1Zi5oCTIwMTQtMDEtMzAg MjI6MDg6NDMuMDAwMDAwMDAwIC0wNTAwCkBAIC0xNTMsNiArMTUzLDggQEAgaW50CXNicmVzZXJ2 ZV9sb2NrZWQoc3RydWN0IHNvY2tidWYgKnNiLAogCSAgICBzdHJ1Y3QgdGhyZWFkICp0ZCk7CiBz 
dHJ1Y3QgbWJ1ZiAqCiAJc2JzbmRwdHIoc3RydWN0IHNvY2tidWYgKnNiLCB1X2ludCBvZmYsIHVf aW50IGxlbiwgdV9pbnQgKm1vZmYpOworc3RydWN0IG1idWYgKgorCXNic25kbWJ1ZihzdHJ1Y3Qg c29ja2J1ZiAqc2IsIHVfaW50IG9mZiwgbG9uZyAqZmlyc3RfbGVuKTsKIHZvaWQJc2J0b3hzb2Nr YnVmKHN0cnVjdCBzb2NrYnVmICpzYiwgc3RydWN0IHhzb2NrYnVmICp4c2IpOwogaW50CXNid2Fp dChzdHJ1Y3Qgc29ja2J1ZiAqc2IpOwogaW50CXNibG9jayhzdHJ1Y3Qgc29ja2J1ZiAqc2IsIGlu dCBmbGFncyk7Ci0tLSBuZXRpbmV0L3RjcF9pbnB1dC5jLnNhdgkyMDE0LTAxLTMwIDE5OjM3OjUy LjAwMDAwMDAwMCAtMDUwMAorKysgbmV0aW5ldC90Y3BfaW5wdXQuYwkyMDE0LTAxLTMwIDE5OjM5 OjA3LjAwMDAwMDAwMCAtMDUwMApAQCAtMzYyNyw2ICszNjI3LDcgQEAgdGNwX21zcyhzdHJ1Y3Qg dGNwY2IgKnRwLCBpbnQgb2ZmZXIpCiAJaWYgKGNhcC5pZmNhcCAmIENTVU1fVFNPKSB7CiAJCXRw LT50X2ZsYWdzIHw9IFRGX1RTTzsKIAkJdHAtPnRfdHNvbWF4ID0gY2FwLnRzb21heDsKKwkJdHAt PnRfdHNvbWF4c2VncyA9IGNhcC50c29tYXhzZWdzOwogCX0KIH0KIAotLS0gbmV0aW5ldC90Y3Bf b3V0cHV0LmMuc2F2CTIwMTQtMDEtMzAgMTg6NTU6MTUuMDAwMDAwMDAwIC0wNTAwCisrKyBuZXRp bmV0L3RjcF9vdXRwdXQuYwkyMDE0LTAxLTMwIDIyOjE4OjU2LjAwMDAwMDAwMCAtMDUwMApAQCAt MTY2LDggKzE2Niw4IEBAIGludAogdGNwX291dHB1dChzdHJ1Y3QgdGNwY2IgKnRwKQogewogCXN0 cnVjdCBzb2NrZXQgKnNvID0gdHAtPnRfaW5wY2ItPmlucF9zb2NrZXQ7Ci0JbG9uZyBsZW4sIHJl Y3dpbiwgc2VuZHdpbjsKLQlpbnQgb2ZmLCBmbGFncywgZXJyb3IgPSAwOwkvKiBLZWVwIGNvbXBp bGVyIGhhcHB5ICovCisJbG9uZyBsZW4sIHJlY3dpbiwgc2VuZHdpbiwgdHNvX3RsZW47CisJaW50 IGNudCwgb2ZmLCBmbGFncywgZXJyb3IgPSAwOwkvKiBLZWVwIGNvbXBpbGVyIGhhcHB5ICovCiAJ c3RydWN0IG1idWYgKm07CiAJc3RydWN0IGlwICppcCA9IE5VTEw7CiAJc3RydWN0IGlwb3ZseSAq aXBvdiA9IE5VTEw7CkBAIC03ODAsNiArNzgwLDI0IEBAIHNlbmQ6CiAJCQl9CiAKIAkJCS8qCisJ CQkgKiBMaW1pdCB0aGUgbnVtYmVyIG9mIFRTTyB0cmFuc21pdCBzZWdtZW50cyAobWJ1ZnMKKwkJ CSAqIGluIG1idWYgbGlzdCkgdG8gdHAtPnRfdHNvbWF4c2Vncy4KKwkJCSAqLworCQkJY250ID0g MDsKKwkJCW0gPSBzYnNuZG1idWYoJnNvLT5zb19zbmQsIG9mZiwgJnRzb190bGVuKTsKKwkJCXdo aWxlIChtICE9IE5VTEwgJiYgY250IDwgdHAtPnRfdHNvbWF4c2VncyAmJgorCQkJICAgIHRzb190 bGVuIDwgbGVuKSB7CisJCQkJaWYgKGNudCA+IDApCisJCQkJCXRzb190bGVuICs9IG0tPm1fbGVu 
OworCQkJCWNudCsrOworCQkJCW0gPSBtLT5tX25leHQ7CisJCQl9CisJCQlpZiAobSAhPSBOVUxM ICYmIHRzb190bGVuIDwgbGVuKSB7CisJCQkJbGVuID0gdHNvX3RsZW47CisJCQkJc2VuZGFsb3Qg PSAxOworCQkJfQorCisJCQkvKgogCQkJICogUHJldmVudCB0aGUgbGFzdCBzZWdtZW50IGZyb20g YmVpbmcKIAkJCSAqIGZyYWN0aW9uYWwgdW5sZXNzIHRoZSBzZW5kIHNvY2tidWYgY2FuCiAJCQkg KiBiZSBlbXB0aWVkLgotLS0gbmV0aW5ldC90Y3Bfc3Vici5jLnNhdgkyMDE0LTAxLTMwIDE5OjQ0 OjM1LjAwMDAwMDAwMCAtMDUwMAorKysgbmV0aW5ldC90Y3Bfc3Vici5jCTIwMTQtMDEtMzAgMjA6 NTY6MTIuMDAwMDAwMDAwIC0wNTAwCkBAIC0xODAwLDYgKzE4MDAsMTIgQEAgdGNwX21heG10dShz dHJ1Y3QgaW5fY29ubmluZm8gKmluYywgc3RydQogCQkJICAgIGlmcC0+aWZfaHdhc3Npc3QgJiBD U1VNX1RTTykKIAkJCQljYXAtPmlmY2FwIHw9IENTVU1fVFNPOwogCQkJCWNhcC0+dHNvbWF4ID0g aWZwLT5pZl9od190c29tYXg7CisjaWZkZWYgbm90eWV0CisJCQkJY2FwLT50c29tYXhzZWdzID0g aWZwLT5pZl9od190c29tYXhzZWdzOworI2VuZGlmCisJCQkJaWYgKGNhcC0+dHNvbWF4c2VncyA9 PSAwKQorCQkJCQljYXAtPnRzb21heHNlZ3MgPQorCQkJCQkgICAgVENQVFNPX01BWF9UWF9TRUdT X0RFRkFVTFQ7CiAJCX0KIAkJUlRGUkVFKHNyby5yb19ydCk7CiAJfQotLS0gbmV0aW5ldC90Y3Bf dmFyLmguc2F2CTIwMTQtMDEtMzAgMTk6Mzk6MjIuMDAwMDAwMDAwIC0wNTAwCisrKyBuZXRpbmV0 L3RjcF92YXIuaAkyMDE0LTAxLTMwIDIwOjUyOjU3LjAwMDAwMDAwMCAtMDUwMApAQCAtMjA5LDYg KzIwOSw3IEBAIHN0cnVjdCB0Y3BjYiB7CiAJdV9pbnQJdF9rZWVwY250OwkJLyogbnVtYmVyIG9m IGtlZXBhbGl2ZXMgYmVmb3JlIGNsb3NlICovCiAKIAl1X2ludAl0X3Rzb21heDsJCS8qIHRzbyBi dXJzdCBsZW5ndGggbGltaXQgKi8KKwl1X2ludAl0X3Rzb21heHNlZ3M7CQkvKiB0c28gYnVyc3Qg c2VnbWVudCBsaW1pdCAqLwogCiAJdWludDMyX3QgdF9pc3BhcmVbOF07CQkvKiA1IFVUTywgMyBU QkQgKi8KIAl2b2lkCSp0X3BzcGFyZTJbNF07CQkvKiA0IFRCRCAqLwpAQCAtMjY4LDYgKzI2OSwx MSBAQCBzdHJ1Y3QgdGNwY2IgewogI2RlZmluZQlUQ1BPT0JfSEFWRURBVEEJMHgwMQogI2RlZmlu ZQlUQ1BPT0JfSEFEREFUQQkweDAyCiAKKy8qCisgKiBEZWZhdWx0IHZhbHVlIGZvciBUU08gbWF4 aW11bSBudW1iZXIgb2YgdHJhbnNtaXQgc2VnbWVudHMgKGNvdW50IG9mIG1idWZzKS4KKyAqLwor I2RlZmluZQlUQ1BUU09fTUFYX1RYX1NFR1NfREVGQVVMVAkzMAorCiAjaWZkZWYgVENQX1NJR05B VFVSRQogLyoKICAqIERlZmluZXMgd2hpY2ggYXJlIG5lZWRlZCBieSB0aGUgeGZvcm1fdGNwIG1v 
ZHVsZSBhbmQgdGNwX1tpbnxvdXRdcHV0CkBAIC0zMzMsNiArMzM5LDcgQEAgc3RydWN0IGhjX21l dHJpY3NfbGl0ZSB7CS8qIG11c3Qgc3RheSBpbgogc3RydWN0IHRjcF9pZmNhcCB7CiAJaW50CWlm Y2FwOwogCXVfaW50CXRzb21heDsKKwl1X2ludAl0c29tYXhzZWdzOwogfTsKIAogI2lmbmRlZiBf TkVUSU5FVF9JTl9QQ0JfSF8K ------=_Part_584003_402802933.1391139152131-- From owner-freebsd-net@FreeBSD.ORG Fri Jan 31 03:37:24 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id BE83C3C8; Fri, 31 Jan 2014 03:37:24 +0000 (UTC) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 5A2D5142B; Fri, 31 Jan 2014 03:37:24 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: X-IronPort-AV: E=Sophos;i="4.95,754,1384318800"; d="scan'208";a="92530012" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 30 Jan 2014 22:37:23 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 0A2A3B4066; Thu, 30 Jan 2014 22:37:23 -0500 (EST) Date: Thu, 30 Jan 2014 22:37:23 -0500 (EST) From: Rick Macklem To: J David Message-ID: <122461163.585673.1391139443031.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: Bryan Venteicher , Garrett Wollman , freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Jan 2014 03:37:24 -0000 J David wrote: > On Wed, Jan 29, 2014 at 10:31 PM, Rick Macklem > wrote: > >> I've been busy the last few days, and won't be able to get to any > >> code > >> until the weekend. > > Is there likely to be more to it than just cranking the MAX_TX_SEGS > value and recompiling? If so, is it something I could take on? > > > Well, NFS hands TCP a list of 34 mbufs. If TCP only adds one, then > > increasing it from 34 to 35 would be all it takes. However, see > > below. > > One thing I don't want to miss here is that an NFS block size of > 65,536 is really suboptimal. The largest size of a TCP datagram is > 65535. So by the time NFS adds the overhead on and the total amount > of data to be sent winds up in that ~65k range, it guarantees that > the > operation has to be split into at least two TCP packets, one > max-size and one tiny one. This doubles a lot of the network stack > overhead, regardless of whether the packet ends up being segmented > into tiny bits down the road or not. > > If NFS could be modified to respect the actual size of a TCP packet, > generating a steady stream of 63.9k (or thereabout) writes instead of > the current 64k-1k-64k-1k, performance would likely see another > significant boost. This would nearly double the average throughput > per packet, which would help with network latency and CPU load.
> > It's also not 100% clear but it seems like in some cases the existing > behavior also causes the TCP stack to park on the "leftover" bit and > wait for more data, which comes in another >64k chunk, and from there > on out there's no more correlation between TCP packets and NFS > operations, so an operation doesn't begin on a packet boundary. That > continues as long as load keeps up. That's probably not good for > performance either. And it certainly confuses the heck out of > tcpdump. > > Probably 60k would be the next most reasonable size, since it's the > largest page size multiple that will fit into a TCP packet while > still > leaving room for overhead. > > Since the max size of TCP packets is not an area where there's really > any flexibility, what would have to happen to NFS to make that (or > arbitrary values) perform at its best within that constraint? > > It's apparent from even trivial testing that performance is > dramatically affected if the "use a power of two for NFS rsize/wsize" > recommendation isn't followed, but what is the origin of that? Is it > something that could be changed? > > > I don't think that m_collapse() is more likely to fail, since it > > only copies data to the previous mbuf when the entire mbuf that > > follows will fit and it's allowed. I'd assume that a ref count > > copied mbuf cluster doesn't allow this copy or things would be > > badly broken.) > > m_collapse checks M_WRITABLE which appears to cover the ref count > case. (It's a dense macro, but it seems to require a ref count of 1 > if a cluster is used.) > > The cases where m_collapse can succeed are pretty slim. It pretty > much requires two consecutive underutilized buffers, which probably > explains why it fails so often in this code path.
Since one of its > two methods outright skips the packet header mbuf (to avoid risk of > moving it), possibly the only case where it succeeds is when the last > data mbuf is short enough that whatever NFS trailers are being > appended can fit with it. > Btw, in the previous post I agreed "in general". For this specific case of the 64K NFS read reply/write request the first two mbufs don't have much data in them. The first is the Sun RPC header generated by the krpc and the 2nd is the first part of the NFS args that precedes the data. As such, I suspect that m_collapse() will often succeed in copying the 2nd mbuf's data into the first and reducing the mbuf count to 33. (You could find out by adding a counter for calls to m_collapse() and testing 64K without my patch.) rick > > Bottom line, I think calling either m_collapse() or m_defrag() > > should be considered a "last resort". > > It definitely seems more designed for a case where 8 different stack > layers each put their own little header/trailer fingerprint on the > packet, and that's not what's happening here. > > > Maybe the driver could reduce the size of if_hw_tsomax whenever > > it finds it needs to call one of these functions, to try and avoid > > a re-occurrence? > > Since the issue is one of segment length rather than packet length, > this seems risky. If one of those touched-by-everybody packets goes > by, it may not be that large, but it would risk permanently (until > reboot) dropping the throughput of that interface. > > Thanks!
> From owner-freebsd-net@FreeBSD.ORG Fri Jan 31 03:53:05 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 0C36A726; Fri, 31 Jan 2014 03:53:05 +0000 (UTC) Received: from h2.funkthat.com (gate2.funkthat.com [208.87.223.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id D654E154A; Fri, 31 Jan 2014 03:53:04 +0000 (UTC) Received: from h2.funkthat.com (localhost [127.0.0.1]) by h2.funkthat.com (8.14.3/8.14.3) with ESMTP id s0V3r3HF029166 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 30 Jan 2014 19:53:03 -0800 (PST) (envelope-from jmg@h2.funkthat.com) Received: (from jmg@localhost) by h2.funkthat.com (8.14.3/8.14.3/Submit) id s0V3r3QN029165; Thu, 30 Jan 2014 19:53:03 -0800 (PST) (envelope-from jmg) Date: Thu, 30 Jan 2014 19:53:03 -0800 From: John-Mark Gurney To: Rick Macklem Subject: Re: 64K NFS I/O generates a 34mbuf list for TCP which breaks TSO Message-ID: <20140131035303.GT93141@funkthat.com> Mail-Followup-To: Rick Macklem , Adrian Chadd , FreeBSD Net References: <1856284835.584005.1391139152133.JavaMail.root@uoguelph.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1856284835.584005.1391139152133.JavaMail.root@uoguelph.ca> User-Agent: Mutt/1.4.2.3i X-Operating-System: FreeBSD 7.2-RELEASE i386 X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html X-TipJar: bitcoin:13Qmb6AeTgQecazTWph4XasEsP7nGRbAPE X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger? 
X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.2 (h2.funkthat.com [127.0.0.1]); Thu, 30 Jan 2014 19:53:03 -0800 (PST) Cc: FreeBSD Net , Adrian Chadd X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Jan 2014 03:53:05 -0000 Rick Macklem wrote this message on Thu, Jan 30, 2014 at 22:32 -0500: > Adrian Chadd wrote: > > On 30 January 2014 07:06, Rick Macklem wrote: > > > Hi, just adding one more idea on what to do about this > > > to the list: > > > - Add an if_hw_tsomaxseg and modify the loop in tcp_output() > > > so that it uses both if_hw_tsomax and if_hw_tsomaxseg to > > > decide how much to hand to the device driver in each mbuf list. > > > (I haven't looked to see how easy it would be to change this > > > loop.) > > > > I don't think that's a hack. I think adding that and setting > > tsomaxseg > > to say 30 for now would be a good compromise. > > > Well, my TCP is very rusty and I have no way to test it (I don't > have anything that does TSO), but I've attached a stab at a patch > to do this. > > Maybe it can be used as a starting point for this, if others think > it makes sense. > > The "#ifdef notyet" in the patch would become something like: > # if __FreeBSD_version >= NNNN > when a change to add if_hw_tsomaxseg is done, was what I was > thinking. Definitely need to make sure you fix the drivers that support large enough sg arrays like ixgb which supports 100...
Just a sampling of ones that use a _SCATTER define: ./e1000/if_igb.h:#define IGB_MAX_SCATTER 64 ./e1000/if_lem.h:#define EM_MAX_SCATTER 64 ./e1000/if_em.h:#define EM_MAX_SCATTER 32 ./nfe/if_nfereg.h:#define NFE_MAX_SCATTER 32 ./ixgbe/ixgbe.h:#define IXGBE_82598_SCATTER 100 ./ixgbe/ixgbe.h:#define IXGBE_82599_SCATTER 32 ./ixgb/if_ixgb.h:#define IXGB_MAX_SCATTER 100 I wonder how many of these are hardware limits, or just I don't want to allocate too much space on the stack, as 16 bytes per bus_dma_segment_t (on amd64) adds up... The other question is should the drivers w/ a limit on the segments reduce the size of the TSO packet so that we don't need to m_defrag/m_collapse which are expensive operations... -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." From owner-freebsd-net@FreeBSD.ORG Fri Jan 31 04:36:19 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id BA598C9; Fri, 31 Jan 2014 04:36:19 +0000 (UTC) Received: from mail-ig0-x232.google.com (mail-ig0-x232.google.com [IPv6:2607:f8b0:4001:c05::232]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 728B51779; Fri, 31 Jan 2014 04:36:19 +0000 (UTC) Received: by mail-ig0-f178.google.com with SMTP id uq10so8705475igb.5 for ; Thu, 30 Jan 2014 20:36:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=aagJ+YALSALUvr3mHWx7BykmqSFhuPfJR6jgzI5O/W0=; b=ACOTWZ3Vs8rwaqpmcxcKSYdq6E/SJAwybpmZvADtLItLslyKfGImiiX5eblRgUyu7Z DA/vxbCud1fGcl23L/vaYV08J31goWMupdHyLxDWr6/f/OnmJ/I/M2kCfXaLifo9xctL 
0nHDDDjXt/fJHjXq3xOnaHx1pLvKg/WicfhaGF9V/g7xVMB+4czuiEmPnKVwoYzVSFAX sXn2BO7ocqs5QBy5bZJ749NP3kotbZ+kYSViHsCLjCpDvw6d1M1JsQXeFA8nCiSXzfgg QsH5ui7IahSFfaGv8Q5aXnvt03fw+YGBfcO4oxvSng9pX5OGJn0niSGZGsMwPd/BjwdY jNrg== MIME-Version: 1.0 X-Received: by 10.43.51.65 with SMTP id vh1mr13779261icb.24.1391142978559; Thu, 30 Jan 2014 20:36:18 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Thu, 30 Jan 2014 20:36:18 -0800 (PST) In-Reply-To: <87942875.478893.1391121843834.JavaMail.root@uoguelph.ca> References: <87942875.478893.1391121843834.JavaMail.root@uoguelph.ca> Date: Thu, 30 Jan 2014 23:36:18 -0500 X-Google-Sender-Auth: ziCZJJVV4QHzrdWTvg3C_eY9gU4 Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? From: J David To: Rick Macklem Content-Type: text/plain; charset=ISO-8859-1 Cc: Bryan Venteicher , Garrett Wollman , freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Jan 2014 04:36:19 -0000 On Thu, Jan 30, 2014 at 5:44 PM, Rick Macklem wrote: > I'd like to see MAXBSIZE > increased to at least 128K, since that is the default block size for > ZFS, I've been told. Regrettably, that is incomplete. The ZFS record size is variable *up to* 128kiB by default; it's more of an upper limit than a hard and fast rule. Also, it is configurable at runtime on a per-filesystem basis. Although any file >128kiB probably does use 128kiB blocks, ZFS has ARC and L2ARC and manages its own prefetch. Probably as long as NFS treats the rsize/wsize as a fixed-sized block, the number of workloads benefited by pushing it to 128kiB may be very limited. > Also, for real networks, the NFS RPC message will be broken into > quite a few packets to go on the wire, as far as I know. (I don't > think there are real networks using a 64K jumbo packet, is there?) 
> For my hardware, the packets will be 1500 bytes each on the wire, > since nothing I have does jumbo packets. Real environments for NFS in 2014 are 10gig LANs with hardware TSO that makes the overhead of TSO negligible. As someone else on this thread has already pointed out, efficiently utilizing TSO is essentially mandatory to make good use of 10gig hardware. So as far as FreeBSD is concerned, yes, many networks effectively have a 64k MTU (for TCP only since FreeBSD does not implement GSO at this time) and it should act accordingly when dealing with them. This NFS buffer size is nearly doubling the number of TCP packets it takes to move the same amount of data. Regardless of how those packets are eventually segmented -- which can be effectively ignored in the real world of hardware TSO -- the overhead of TCP and IP is not nil, cannot be offloaded, and doubling it is not a good thing. It doubles every step down to the very bottom, including optional stuff like PF if it is hanging around in there. > Unfortunately, NFS adds a little bit to the front of the data, so > an NFS RPC will always be a little bit more than a power of 2 in > size for reads/writes of a power of 2. That's why NFS should be able to operate on page-sized multiples rather than powers of 2. Then it can operate on the filesystem using the best size for that, operate on the network using the best size for that, and mediate the two using page-sized jumbo clusters. If you know the underlying filesystem block size, by all means, read or write based on it where appropriate. > Now, I am not sure why 65535 (the largest IP datagram) has been chosen > as the default limit for TSO segments? The process of TCP segmentation, whether offloaded or not, is performed on a single TCP packet. It operates by reusing that packet's header over and over for each segment with slight modifications. Consequently the maximum size that can be offloaded is the maximum size that can be segmented: one packet.
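The doubling claim is just ceiling arithmetic, which a short check makes concrete. The 400-byte RPC overhead used below is an assumed, illustrative figure, not a number from the thread:

```c
#include <assert.h>

/* Number of maximum-size TCP sends needed to move "nbytes" of stream
 * data when at most "maxpkt" bytes fit in one TSO burst/packet. */
static long
tcp_sends_needed(long nbytes, long maxpkt)
{
    return ((nbytes + maxpkt - 1) / maxpkt);   /* ceiling division */
}
```

A 64KiB write plus a few hundred bytes of RPC header just overflows the 65535-byte limit and needs two sends (one full, one tiny), while a 60KiB write plus the same header fits in one.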
> Well, since NFS sets the TCP_NODELAY socket option, that shouldn't > occur in the TCP layer. If some network device driver is delaying, > waiting for more to send, then I'd say that device driver is broken. This is not a driver issue. TCP_NODELAY means "don't wait for more data." It doesn't mean "don't send more data that is ready to be sent." If there's more data already present on the stream by the time the TCP stack gets to it, which is possible in an SMP environment, TCP_NODELAY won't, as far as I know, prevent it from being sent in the next available packet. This isn't necessarily something that happens every time, or even consistently, but when you're sending a hundred thousand packets per second, it looks like the chain can indeed come off the bicycle. NFS is not sending packets to the TCP stack, it is sending stream data. With TCP_NODELAY it should be possible to engineer a one send = one packet correlation, but that's true if and only if that send is less than the max packet size. > For real NFS environments, the performance of the file system and > underlying disk subsystem is generally more important than the network. Maybe this is the case if NFS is serving from one spinning disk. It's definitely not the case for ZFS installs with 128GiB RAM, shelves of SAS drives, TB of SSD L2ARC, and STEC slog devices. The performance of the virtual environment we're using as a test platform is remarkably close to that. It just has the benefit of being two orders of magnitude cheaper and therefore something that can be set aside for testing stuff like this. > (Some > NAS vendors avoid this by using non-volatile ram in the server as stable > storage, but a FreeBSD server can't expect such hardware to be available.) Nonvolatile slogs are all but mandatory in any ZFS-backed-NFS fileserver deployment. Like TSO, it's not hypothetical, it is standard for production deployments. >> but what is the origin of that? Is it >> something that could be changed? 
>> > Because disk file systems on file servers always use block sizes that > are a power of 2. Maybe my question wasn't phrased well. What is the origin of the huge performance drop when a non-multiple-of-2 size is used? This is visible under small random ops where the data difference between a 60k read and a 64k read isn't ever used and the next block is almost certainly not going to be read next. So it's very weird (to me) that performance drops as much as it does. > Agreed. I think adding an if_hw_tsomaxseg that TCP can use is preferable. It may be valuable for other workloads to prevent drops on some kind of pathologically sliced-up packets, but jumbo cluster support in NFS should pretty much guarantee that it is not going to have a problem in this area with any interface in common use. Thanks! From owner-freebsd-net@FreeBSD.ORG Fri Jan 31 06:18:34 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id CF903D91 for ; Fri, 31 Jan 2014 06:18:34 +0000 (UTC) Received: from hergotha.csail.mit.edu (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 9ED5D1E26 for ; Fri, 31 Jan 2014 06:18:34 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.7/8.14.7) with ESMTP id s0V6IVEN027168; Fri, 31 Jan 2014 01:18:31 -0500 (EST) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.7/8.14.4/Submit) id s0V6IVJv027167; Fri, 31 Jan 2014 01:18:31 -0500 (EST) (envelope-from wollman) Date: Fri, 31 Jan 2014 01:18:31 -0500 (EST) Message-Id: <201401310618.s0V6IVJv027167@hergotha.csail.mit.edu> From: wollman@freebsd.org To:
j.david.lists@gmail.com Subject: Re: Terrible NFS performance under 9.2-RELEASE? X-Newsgroups: mit.lcs.mail.freebsd-net In-Reply-To: References: <87942875.478893.1391121843834.JavaMail.root@uoguelph.ca> Organization: none X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (hergotha.csail.mit.edu [127.0.0.1]); Fri, 31 Jan 2014 01:18:31 -0500 (EST) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on hergotha.csail.mit.edu Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Jan 2014 06:18:35 -0000 In article , J David writes: >The process of TCP segmentation, whether offloaded or not, is >performed on a single TCP packet. It operates by reusing that >packet's header over and over for each segment with slight >modifications. Consequently the maximum size that can be offloaded is >the maximum size that can be segmented: one packet. This is almost entirely wrong in its description of the non-offload case. A segment is a PDU at the transport layer. In normal operation, TCP figures out how much it can send, constructs a header, and copies an mbuf chain referencing one segment's worth of data out of the socket's transmit buffer. tcp_output() repeats this process (possibly using the same mbuf cluster multiple times, if it's larger than the receiver's or the path's maximum segment size) until it either runs out of stuff to send, or runs out of transmit window to send into. 
THAT IS WHY TSO IS A WIN: as you describe, the packet headers are mostly identical, and (if the transmit window allows) it's much cheaper to build the header and do the DMA setup once, then let the NIC take over from there, rather than having to DMA a different (but nearly identical) header for every individual segment. >NFS is not sending packets to the TCP stack, it is sending stream >data. With TCP_NODELAY it should be possible to engineer a one send = >one packet correlation, but that's true if and only if that send is >less than the max packet size. Yes and no. NFS constructs a chain of mbufs and calls the socket's sosend() routine. This ultimately results in a call to tcp_output(), and in the normal case where there is no data awaiting transmission, that mbuf chain will be shallow-copied (bumping all the mbuf cluster reference counts) up to the limit of what the transmit window allows, and Ethernet, IP, and TCP headers will be prepended (possibly in a separate mbuf). The whole mess is then passed on to the hardware for offload, if it fits. RPC responses will only get smushed together if tcp_output() wasn't able to schedule the transmit immediately, and if the network is working properly, that will only happen if there's more than one client-side-receive-window's-worth of data to be transmitted. This shallow-copy behavior, by the way, is why the drivers need m_defrag() rather than m_collapse(): M_WRITABLE is never true for clusters coming out of tcp_output(), because the refcount will never be less than 2 (one for the socket buffer and at least one for the interface's transmit queue, depending on how many segments include some data from the cluster). But it's also part of why having a "gigantic" cluster (e.g., 128k) would be a big win for NFS. 
-GAWollman From owner-freebsd-net@FreeBSD.ORG Fri Jan 31 17:58:16 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id C69861A1; Fri, 31 Jan 2014 17:58:16 +0000 (UTC) Received: from mail-ie0-x232.google.com (mail-ie0-x232.google.com [IPv6:2607:f8b0:4001:c03::232]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 895181945; Fri, 31 Jan 2014 17:58:16 +0000 (UTC) Received: by mail-ie0-f178.google.com with SMTP id x13so4720460ief.9 for ; Fri, 31 Jan 2014 09:58:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=OuiCkJqdf7ZDhRD3EbTuIQDZsZXdHn1CR3hukVW+0tI=; b=ZFlmxfNWy3UAL1DvI9ja4/X7Fc1SRLbmhUVRi0GVQqcnoR/buj/CbfEGN7I2VbTTqm 6QHg8ni0bHYDadbkM5mftBxLqMEfcHg4ntNlGaaTTxv0Ap3fqaKfrGwnVAWBezgIGCSS wRmHKYvt44rQ6BYjNlxf5S/9x9EqpMwVu3FXUbF+EVVQENQ2otSjV2ARqCucaHHq9CTG 86QfqKd7JvTQOADpTmBRfXEH7MGRtV93I5JUoA1v/x5kQb/R8d+6keWOEZ9KLebb0akP Cb6RDbQC98e37hyk8Xsi31eTaEEtMjc4O1bB94wdBqjJjfYWPZlHDZwowCf07g8efrDG UdBQ== MIME-Version: 1.0 X-Received: by 10.43.82.69 with SMTP id ab5mr992946icc.95.1391191096052; Fri, 31 Jan 2014 09:58:16 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Fri, 31 Jan 2014 09:58:15 -0800 (PST) In-Reply-To: <201401310618.s0V6IVJv027167@hergotha.csail.mit.edu> References: <87942875.478893.1391121843834.JavaMail.root@uoguelph.ca> <201401310618.s0V6IVJv027167@hergotha.csail.mit.edu> Date: Fri, 31 Jan 2014 12:58:15 -0500 X-Google-Sender-Auth: RGZ1-iXDuUwPxh2EdH2xGmc6rCo Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
From: J David To: Garrett Wollman Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Jan 2014 17:58:16 -0000 On Fri, Jan 31, 2014 at 1:18 AM, wrote: > This is almost entirely wrong in its description of the non-offload > case. Yes, you're quite right; I confused myself. GSO works a little differently, but FreeBSD doesn't use that. > The whole mess is then passed on to the hardware for > offload, if it fits. That's the point, NFS is creating a situation where it never fits. It can't shove 65k into 64k, so it ends up looping back through the whole output routine again for a tiny tail of data, and then the same for the input routine on the other side. Arguably that makes rsize/wsize 65536 negligibly different than rsize/wsize 32768 in the long run because the average data output per pass is about the same (64k + 1k vs 33k + 33k). Except, of course, in the case where almost all files are between 32k and 60k. Please don't get me wrong, I'm not suggesting there's anything more than a small CPU reduction to be obtained by changing this. Which is not nothing if the client is CPU-limited due to the other work it's doing, but it's not much. To get real speedups from NFS would require a change to the punishing read-before-write behavior, which is pretty clearly not going to happen. > RPC responses will only get smushed together if > tcp_output() wasn't able to schedule the transmit immediately, and if > the network is working properly, that will only happen if there's more > than one client-side-receive-window's-worth of data to be transmitted. 
This is something I have seen live in tcpdump, but then I have had so many problems with NFS and congestion control that the "network is working properly" condition probably isn't satisfied. Hopefully the jumbo cluster changes will resolve that once and for all. Thanks! From owner-freebsd-net@FreeBSD.ORG Fri Jan 31 19:41:39 2014 Return-Path: Delivered-To: freebsd-net@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 74D7FA62 for ; Fri, 31 Jan 2014 19:41:39 +0000 (UTC) Received: from mx1.sbone.de (mx1.sbone.de [IPv6:2a01:4f8:130:3ffc::401:25]) (using TLSv1 with cipher ADH-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 2BC86125C for ; Fri, 31 Jan 2014 19:41:39 +0000 (UTC) Received: from mail.sbone.de (mail.sbone.de [IPv6:fde9:577b:c1a9:31::2013:587]) (using TLSv1 with cipher ADH-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by mx1.sbone.de (Postfix) with ESMTPS id 56D9825D3897 for ; Fri, 31 Jan 2014 19:41:35 +0000 (UTC) Received: from content-filter.sbone.de (content-filter.sbone.de [IPv6:fde9:577b:c1a9:31::2013:2742]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.sbone.de (Postfix) with ESMTPS id D4012C22C60 for ; Fri, 31 Jan 2014 19:41:34 +0000 (UTC) X-Virus-Scanned: amavisd-new at sbone.de Received: from mail.sbone.de ([IPv6:fde9:577b:c1a9:31::2013:587]) by content-filter.sbone.de (content-filter.sbone.de [fde9:577b:c1a9:31::2013:2742]) (amavisd-new, port 10024) with ESMTP id DPEHQt2VJF5y for ; Fri, 31 Jan 2014 19:41:33 +0000 (UTC) Received: from nv.sbone.de (nv.sbone.de [IPv6:fde9:577b:c1a9:31::2013:138]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.sbone.de (Postfix) with ESMTPSA id 38EC9C22C0D 
for ; Fri, 31 Jan 2014 19:41:33 +0000 (UTC) Date: Fri, 31 Jan 2014 19:41:29 +0000 (UTC) From: "Bjoern A. Zeeb" To: FreeBSD Net Subject: 10.0-R noinet snapshots available Message-ID: X-OpenPGP-Key-Id: 0x14003F198FEFA3E77207EE8D2B58B8F83CCF1842 MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Jan 2014 19:41:39 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, it's been a while but I have produced a new set of noinet snapshots for FreeBSD 10.0-RELEASE. Download i386 or amd64 noinet-snapshot-10.0-RELEASE-r260789 install media from any of the mirrors mentioned on https://wiki.freebsd.org/IPv6Only or learn how to build a noinet system yourself on https://www.freebsd.org/ipv6/ipv6only.html . With the new package system in place you can turn these snapshots into an IPv6-only desktop or server in minutes. Try it out! Make sure your applications, your research project, your web presence, you name it works with IPv6. 
Enjoy, Bjoern Checksums: amd64: SHA256 (FreeBSD-10.0-RELEASE-amd64-bootonly.iso) = 24400220df2a2728ab45b85d9aa48a8bc5cbd3166f39c1d9d5d1ce1f41bb05bc SHA256 (FreeBSD-10.0-RELEASE-amd64-disc1.iso) = 629f041dee6d127ca94c62d6aa41991f25d208dbe19c424484004bdc92bf5150 SHA256 (FreeBSD-10.0-RELEASE-amd64-memstick.img) = bf9b04754dc809d47cad4d4bbb8893a7a29d0c6e2988c27a080f1380547c23e6 SHA256 (ftp/MANIFEST) = 8d1eeb8d12892a69d2402d1294293962bb31b588ab52666490d98ac2b19642ba MD5 (FreeBSD-10.0-RELEASE-amd64-bootonly.iso) = 68f0be479177a698686bf956632140e2 MD5 (FreeBSD-10.0-RELEASE-amd64-disc1.iso) = cd5f5e5575919082317fd378fa4b5105 MD5 (FreeBSD-10.0-RELEASE-amd64-memstick.img) = edbc9b48ebac4e7f8166c91105ffafdb MD5 (ftp/MANIFEST) = 4ee8ba1f71c04caca70e021830bde370 i386: SHA256 (FreeBSD-10.0-RELEASE-i386-bootonly.iso) = e3c81250dd0cdabc78cdd767bc5ee0f3a81e992923aa9aae722336161d67198f SHA256 (FreeBSD-10.0-RELEASE-i386-disc1.iso) = da8bfc78464997baf00b179ce4d307e6fe24aa8c8fc5aec84a680d00b21ac080 SHA256 (FreeBSD-10.0-RELEASE-i386-memstick.img) = bea592a2a87344722cf127ed89462a3a4f7a6a970ca0d89182bd1c9218495846 SHA256 (ftp/MANIFEST) = a17d5fe9b8bb27340d125d3d0aa6ffe68861a769e004f97a159069f10802fbbc MD5 (FreeBSD-10.0-RELEASE-i386-bootonly.iso) = 5a1753c397a5f58811c41f759e68b0e3 MD5 (FreeBSD-10.0-RELEASE-i386-disc1.iso) = 26f3cfbdd5f8f2fd0b182fec624e1abc MD5 (FreeBSD-10.0-RELEASE-i386-memstick.img) = 7781d6e5f0c41225311b7752d80e0ea5 MD5 (ftp/MANIFEST) = 0d7f4fe729870b206b23a7f1c56c5773 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAlLr/GkACgkQK1i4+DzPGEJBDACcCHKSonlGKkBu7wJZY7pPk3um 6m4AoKI716/125C7bIr5Y8cDBq5jZB7i =YCT+ -----END PGP SIGNATURE----- From owner-freebsd-net@FreeBSD.ORG Fri Jan 31 23:17:02 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with 
ESMTPS id 701F2D5C; Fri, 31 Jan 2014 23:17:02 +0000 (UTC) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 21E5B1364; Fri, 31 Jan 2014 23:17:01 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqQEAKwu7FKDaFve/2dsb2JhbABZg0RXgwG6CU+BInSCJQEBAQMBAQEBICsgCwUWGAICDRkCKQEJJgYIBwQBHASHXAgNrAChMBeBKY0BBwEBARo0B4JvgUkEiUmMDoQFkG+DSx4xewkXIg X-IronPort-AV: E=Sophos;i="4.95,760,1384318800"; d="scan'208";a="92176026" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 31 Jan 2014 18:16:45 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 79BA8B3EFE; Fri, 31 Jan 2014 18:16:45 -0500 (EST) Date: Fri, 31 Jan 2014 18:16:45 -0500 (EST) From: Rick Macklem To: J David Message-ID: <1622306213.1079665.1391210205488.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 9.2-RELEASE? MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, Garrett Wollman X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Jan 2014 23:17:02 -0000 J David wrote: > On Fri, Jan 31, 2014 at 1:18 AM, wrote: > > This is almost entirely wrong in its description of the non-offload > > case. > > Yes, you're quite right; I confused myself. GSO works a little > differently, but FreeBSD doesn't use that. > > > The whole mess is then passed on to the hardware for > > offload, if it fits. > > That's the point, NFS is creating a situation where it never fits. 
> It > can't shove 65k into 64k, so it ends up looping back through the > whole > output routine again for a tiny tail of data, and then the same for > the input routine on the other side. Arguably that makes rsize/wsize > 65536 negligibly different than rsize/wsize 32768 in the long run > because the average data output per pass is about the same (64k + 1k > vs 33k + 33k). Except, of course, in the case where almost all files > are between 32k and 60k. > You can certainly try "-o rsize=61440,wsize=61440" (assuming a 4K page size) for the mount, if you'd like. There is a bug (that is a 1 line patch I keep forgetting to put in) where, if you choose an rsize,wsize not an exact multiple of PAGE_SIZE, mmap'd files can get garbage from the partially valid pages. However, I'm pretty sure you are safe so long as you specify exact multiples of PAGE_SIZE. The default size is the size recommended by the NFS server, capped at MAXBSIZE. (Btw, Solaris10 recommends 256K and allows 1Mbyte. FreeBSD recommends and allows MAXBSIZE.) I'll admit I'm not convinced that the reduced overheads of using 61440 outweight the fact that the server file systems use blocksizes that are always a power of 2. Without good evidence that using 61440 is better, I wouldn't want the server recommending that. (And I don't know how NFS would know that it is sending on a TSO enabled interface.) rick > Please don't get me wrong, I'm not suggesting there's anything more > than a small CPU reduction to be obtained by changing this. Which is > not nothing if the client is CPU-limited due to the other work it's > doing, but it's not much. To get real speedups from NFS would > require > a change to the punishing read-before-write behavior, which is pretty > clearly not going to happen. 
> > > RPC responses will only get smushed together if > > tcp_output() wasn't able to schedule the transmit immediately, and > > if > > the network is working properly, that will only happen if there's > > more > > than one client-side-receive-window's-worth of data to be > > transmitted. > > This is something I have seen live in tcpdump, but then I have had so > many problems with NFS and congestion control that the "network is > working properly" condition probably isn't satisfied. Hopefully the > jumbo cluster changes will resolve that once and for all. > > Thanks! > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Fri Jan 31 23:20:58 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id B456DBA; Fri, 31 Jan 2014 23:20:58 +0000 (UTC) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 676DD13E2; Fri, 31 Jan 2014 23:20:58 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: X-IronPort-AV: E=Sophos;i="4.95,760,1384318800"; d="scan'208";a="92702741" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 31 Jan 2014 18:20:56 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id A57A0B3F43; Fri, 31 Jan 2014 18:20:56 -0500 (EST) Date: Fri, 31 Jan 2014 18:20:56 -0500 (EST) From: Rick Macklem To: J David Message-ID: <1609454808.1083115.1391210456671.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 
9.2-RELEASE? MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.203] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, Garrett Wollman X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Jan 2014 23:20:58 -0000 J David wrote: > On Fri, Jan 31, 2014 at 1:18 AM, wrote: > > This is almost entirely wrong in its description of the non-offload > > case. > > Yes, you're quite right; I confused myself. GSO works a little > differently, but FreeBSD doesn't use that. > > > The whole mess is then passed on to the hardware for > > offload, if it fits. > > That's the point, NFS is creating a situation where it never fits. > It > can't shove 65k into 64k, so it ends up looping back through the > whole > output routine again for a tiny tail of data, and then the same for > the input routine on the other side. Arguably that makes rsize/wsize > 65536 negligibly different than rsize/wsize 32768 in the long run > because the average data output per pass is about the same (64k + 1k > vs 33k + 33k). Except, of course, in the case where almost all files > are between 32k and 60k. > Oh, and remember to try setting readahead=8 in your mounts, too. NFS will do a read + N readaheads (where N == 1 by default) and then wait for replies to those before continuing on. If the product of rsize * readahead isn't enough data to fill the pipe (bandwidth * transit delay), then you won't be using the bandwidth your network interface provides. rick ps: And you probably want your nfsd threads to be at least 16 instead of the default of 4. > Please don't get me wrong, I'm not suggesting there's anything more > than a small CPU reduction to be obtained by changing this.
Which is > not nothing if the client is CPU-limited due to the other work it's > doing, but it's not much. To get real speedups from NFS would > require > a change to the punishing read-before-write behavior, which is pretty > clearly not going to happen. > > > RPC responses will only get smushed together if > > tcp_output() wasn't able to schedule the transmit immediately, and > > if > > the network is working properly, that will only happen if there's > > more > > than one client-side-receive-window's-worth of data to be > > transmitted. > > This is something I have seen live in tcpdump, but then I have had so > many problems with NFS and congestion control that the "network is > working properly" condition probably isn't satisfied. Hopefully the > jumbo cluster changes will resolve that once and for all. > > Thanks! > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Fri Jan 31 23:45:12 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 9F37086D for ; Fri, 31 Jan 2014 23:45:12 +0000 (UTC) Received: from mail-ee0-x22a.google.com (mail-ee0-x22a.google.com [IPv6:2a00:1450:4013:c00::22a]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 3B66F15DC for ; Fri, 31 Jan 2014 23:45:12 +0000 (UTC) Received: by mail-ee0-f42.google.com with SMTP id b15so751833eek.29 for ; Fri, 31 Jan 2014 15:45:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; 
bh=ItOilkexPfeiNLizJhiOTpVezO51thfPcHeAy8iIWDI=; b=YXT0x1D56gYZM2e4kupKHP3SaRNkKBNFireiCdS+DYXjKzlfLaEs3m+19qGJSSBqjn L7bGfL0IknU8V+ljJcRkUNnPw66VBGTIwNhTZMKqeGv3dafAIB8Kgj/MB0m2wAqdPZMM 262t4qMlV1J8pyMF0lJ7AXfjf96cmgHACB+/xzLmroREH7N39grArxa5giARYNzjb4fo RRDT5vpcHoVp4QjYQ7j1UxdS9mD4GjUC34flROaScMXluO6mfQZf8NH5EkzozRbdFmE8 lrIUbFfxedYojdfEcKmO9ztDdKFyMTcGr1GLm2S33+o/Ua4xfEj3tW/ChW03Vw0W8D5g 6dyA== MIME-Version: 1.0 X-Received: by 10.14.126.9 with SMTP id a9mr5850552eei.95.1391211910555; Fri, 31 Jan 2014 15:45:10 -0800 (PST) Received: by 10.14.65.4 with HTTP; Fri, 31 Jan 2014 15:45:10 -0800 (PST) Date: Fri, 31 Jan 2014 15:45:10 -0800 Message-ID: Subject: Errors using span interface on if_bridge(4) From: hiren panchasara To: "freebsd-net@freebsd.org" Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Jan 2014 23:45:12 -0000 Below is my setup: 11.0-CURRENT FreeBSD 11.0-CURRENT #1 r260789:260806M: Thu Jan 23 21:18:08 UTC 2014 (n/w stack is untouched) ix1: flags=8943 metric 0 mtu 1500 options=8400b8 ether 38:ea:a7:8b:af:c4 inet6 fe80::3aea:a7ff:fe8b:afc4%ix1 prefixlen 64 scopeid 0x6 inet 10.73.149.91 netmask 0xffffff00 broadcast 10.73.149.255 nd6 options=29 media: Ethernet autoselect (10Gbase-Twinax ) status: active ix2: flags=8943 metric 0 mtu 1500 options=8400b8 ether 90:e2:ba:30:73:40 inet6 fe80::92e2:baff:fe30:7340%ix2 prefixlen 64 scopeid 0x7 inet 192.168.0.2 netmask 0xffffff00 broadcast 192.168.0.255 nd6 options=29 media: Ethernet autoselect (10Gbase-Twinax ) status: active ix3: flags=8943 metric 0 mtu 1500 options=8400b8 ether 90:e2:ba:30:73:41 inet6 fe80::92e2:baff:fe30:7341%ix3 prefixlen 64 scopeid 0x8 inet 192.168.0.3 netmask 0xffffff00 broadcast 192.168.0.255 nd6 options=29 media: Ethernet autoselect (autoselect ) status: active bridge0: 
flags=8843 metric 0 mtu 1500 ether 02:a1:25:9a:8f:00 nd6 options=9 id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15 maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200 root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0 member: ix1 flags=143 ifmaxaddr 0 port 6 priority 128 path cost 2000 member: ix2 flags=8 ifmaxaddr 0 port 7 priority 128 path cost 2000 ix2 and ix3 are connected back to back via a cable so that I can snoop any traffic arriving on bridge0 on to ix3. I have tcpdump going on all 3 interfaces. What I am seeing is interesting when I send data to ix1 via iperf3 (iperf3 -c 10.73.149.91) . I see packets coming to ix1, getting copied to ix2 but on ix3 I only see a few packets making it successfully, for rest I see: 23:30:01.308691 IP bad-hlen 0 23:30:01.308700 IP bad-hlen 0 23:30:01.308711 IP bad-hlen 0 Failure is intermittent. Some packets get through but I see this error for others. Looking at the packet carefully, for all those packets with errors, header length for ipv4 is being reported as 0. Only other indication I could see was: -bash-4.2$ sysctl -a | grep checksum_errs dev.ix.0.mac_stats.checksum_errs: 0 dev.ix.1.mac_stats.checksum_errs: 0 dev.ix.2.mac_stats.checksum_errs: 0 dev.ix.3.mac_stats.checksum_errs: 5686743 I also disabled tso and lro on all of them. Looking at the code: if_bridge.c has bridge_span() which does m_copypacket() to span interface. 2549 mc = m_copypacket(m, M_NOWAIT); 2550 if (mc == NULL) { 2551 sc->sc_ifp->if_oerrors++; 2552 continue; 2553 } 2554 2555 bridge_enqueue(sc, dst_if, mc); Now, I am not sure if its failing at m_copypacket() or after that in bridge_enqueue(). Not sure how do I look at if_oerrors count. Any further help in debugging would be great. 
cheers, Hiren From owner-freebsd-net@FreeBSD.ORG Sat Feb 1 01:20:01 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2975B96A; Sat, 1 Feb 2014 01:20:01 +0000 (UTC) Received: from mail-ie0-x233.google.com (mail-ie0-x233.google.com [IPv6:2607:f8b0:4001:c03::233]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id E2A7A1CA7; Sat, 1 Feb 2014 01:20:00 +0000 (UTC) Received: by mail-ie0-f179.google.com with SMTP id ar20so4977053iec.10 for ; Fri, 31 Jan 2014 17:20:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=S5aYt//r4llUtWOUnBsX/w7f4ApMl+41Eh8RwIVf65s=; b=eHYjYbP/vEfYlRhYyANgSY1Ee9HEvpj8KE0Kfkjlriz0x0nGCX5sD4M5K574/o4BQ3 UC750yAC0JCYvMqBnj2BH0gk3C7fn97XmNVck/amSksH7Tvj6qDNfMDIsT1bJWhiF1KX gFE2WRmnOpDGeZ1X4ZRq+FYmXRmyU3PUBGXzRYK/dXx4Lb0LBLJMVYbNXfl5Ps4JIm66 JfbU08s1M0AUqbOJojMfrIgsdxH9VfnlzvI65k3QftHggL8HMpNfzlLHJ5Bv4JFV1rXc B/lXxte2RHxgLRrLqzq7uvhlC6DAItCTiqrVcKPaXYzWeIrrnjFNjLWKNzTTe91Q/MH5 TfhQ== MIME-Version: 1.0 X-Received: by 10.50.60.105 with SMTP id g9mr1447813igr.14.1391217600360; Fri, 31 Jan 2014 17:20:00 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Fri, 31 Jan 2014 17:20:00 -0800 (PST) In-Reply-To: <1622306213.1079665.1391210205488.JavaMail.root@uoguelph.ca> References: <1622306213.1079665.1391210205488.JavaMail.root@uoguelph.ca> Date: Fri, 31 Jan 2014 20:20:00 -0500 X-Google-Sender-Auth: fACMXLJidVL3k7PeVpegChTlAJY Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
From: J David To: Rick Macklem Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-net@freebsd.org, Garrett Wollman X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 01 Feb 2014 01:20:01 -0000 On Fri, Jan 31, 2014 at 6:16 PM, Rick Macklem wrote: > You can certainly try "-o rsize=61440,wsize=61440" (assuming a 4K page size) > for the mount, if you'd like. This has previously been tested with all 4k steps between 16k and 32k. All of them perform worse. With 61440, NFS fails outright on the random read test:

$ iozone -e -I -s 1g -r 4k -i 0 -i 2
	Iozone: Performance Test of File I/O
	        Version $Revision: 3.420 $
	Compiled for 64 bit mode.
	Build: freebsd
[...]
	Include fsync in write timing
	O_DIRECT feature enabled
	File size set to 1048576 KB
	Record Size 4 KB
	Command line used: iozone -e -I -s 1g -r 4k -i 0 -i 2
	Output is in Kbytes/sec
	Time Resolution = 0.000005 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
                                                    random    random      bkwd    record    stride
              KB  reclen    write  rewrite    read    reread      read     write      read   rewrite      read    fwrite  frewrite     fread   freread
         1048576       4    24688    23891
Error reading block at 1073729536
read: Bad file descriptor

Upon using the -w option, which leaves the file intact on exit, it's possible to see that it's not even 1gig in length:

$ ls -aln iozone.tmp
-rw-r-----  1 1000  0  1073709056 Feb  1 01:18 iozone.tmp

It's 32k short, which is a pretty surprising result. Thanks!
From owner-freebsd-net@FreeBSD.ORG Sat Feb 1 01:24:00 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E2B95A34 for ; Sat, 1 Feb 2014 01:24:00 +0000 (UTC) Received: from mail-ea0-x229.google.com (mail-ea0-x229.google.com [IPv6:2a00:1450:4013:c01::229]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 7CFE91D14 for ; Sat, 1 Feb 2014 01:24:00 +0000 (UTC) Received: by mail-ea0-f169.google.com with SMTP id h10so2697882eak.28 for ; Fri, 31 Jan 2014 17:23:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=DldUvhk2YYHjl7UFf7K2u2lvsRgxn091VS6CsQw3f0M=; b=l6jdXn+2wjwuNp2EHC+4vVMyvcnDfw1yjMFieqOxhfsD0uOvt93IeSUO46C7RsT9uc DS1udWMaXm4bQ7303SKhajinu6gGvaVhBfCD0mV+5E9XGakFYogTE1ALqKkjKe05fPbq 9pRPGqaUSPA0LRPIJ7f05cFAgITuBzk8nByALAnjr53Bhp30tZgF/8Rwi6JDKAtdfeKx dbmcOR2AoQXgQKmJ14Tk2R81PoOt19RPUSEw/gvuTLHm7ttfBfG7ndEdsbT7qeSCyeNw 6ru+cW3e5+XahqmqkG0m2MynVcWOYJkrLg8qNjuHwKcHSCbx9z4nbllWAIL24NtNIZHH 46bg== MIME-Version: 1.0 X-Received: by 10.14.6.5 with SMTP id 5mr21987956eem.51.1391217838906; Fri, 31 Jan 2014 17:23:58 -0800 (PST) Received: by 10.14.65.4 with HTTP; Fri, 31 Jan 2014 17:23:58 -0800 (PST) In-Reply-To: References: Date: Fri, 31 Jan 2014 17:23:58 -0800 Message-ID: Subject: Re: Errors using span interface on if_bridge(4) From: hiren panchasara To: "freebsd-net@freebsd.org" Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 01 
Feb 2014 01:24:00 -0000 On Fri, Jan 31, 2014 at 3:45 PM, hiren panchasara wrote: > Looking at the code: if_bridge.c has bridge_span() which does > m_copypacket() to span interface. > > 2549 mc = m_copypacket(m, M_NOWAIT); > 2550 if (mc == NULL) { > 2551 sc->sc_ifp->if_oerrors++; > 2552 continue; > 2553 } > 2554 > 2555 bridge_enqueue(sc, dst_if, mc); > > Now, I am not sure if its failing at m_copypacket() or after that in > bridge_enqueue(). Not sure how do I look at if_oerrors count. -bash-4.2$ netstat -I ix3 Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs Coll ix3 1500 90:e2:ba:30:73:41 9869468123 0 439521 28167217 0 0 ix3 - fe80::92e2:ba fe80::92e2:baff:f 0 - - 2 - - ix3 - 192.168.0.0 192.168.0.3 0 - - 0 - - (sorry if this doesn't format/line-wrap correctly). Basically Oerrs is 0 here. So I _think_ its failing in/after bridge_enqueue()?? cheers, Hiren From owner-freebsd-net@FreeBSD.ORG Sat Feb 1 01:41:06 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id CC1C0E2D; Sat, 1 Feb 2014 01:41:06 +0000 (UTC) Received: from mail-ig0-x229.google.com (mail-ig0-x229.google.com [IPv6:2607:f8b0:4001:c05::229]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 910071E27; Sat, 1 Feb 2014 01:41:06 +0000 (UTC) Received: by mail-ig0-f169.google.com with SMTP id uq10so2217954igb.0 for ; Fri, 31 Jan 2014 17:41:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=809WyS4AowujPVBOoSqcmfMWqqLhGWjfV5hkWBXl1Zg=; b=H0NuY9jQ63yowaFL2gZKDpBBw0wl2OAyHHRhJhZv1PIn75iQEIlbNHD6yjLj6L0pxt pllsj2XNL3HXdCGItxjOqR1bHvM2sRfnmv8EoJnqnEq3tihDuHsqsnebNMSN8SI1hcq5 
VbG1jwVFUX1n4dvCtIR22S7OithFqNHZQ2yHQbr8pRvUpSHiOf+RWiEoxGGnYEE9/1h7 rCAK1X05pqpFqA/WvnhOffXhfqBuuCTjB2QFkxvCicKQGU1MhP3jr2ewfAtAm1ssHL7v uE0AHSkKon2LS+FJyxFO7Ip+QFYR6bHTzOz8Afu2RP4u+/aBNfS94sIVngi5x/lQVGJI WHdg== MIME-Version: 1.0 X-Received: by 10.42.52.209 with SMTP id k17mr17146362icg.1.1391218866028; Fri, 31 Jan 2014 17:41:06 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Fri, 31 Jan 2014 17:41:05 -0800 (PST) In-Reply-To: <1609454808.1083115.1391210456671.JavaMail.root@uoguelph.ca> References: <1609454808.1083115.1391210456671.JavaMail.root@uoguelph.ca> Date: Fri, 31 Jan 2014 20:41:05 -0500 X-Google-Sender-Auth: B8v5sRIw6xBSVZsi1DwpfAzKP5Y Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? From: J David To: Rick Macklem Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-net@freebsd.org, Garrett Wollman X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 01 Feb 2014 01:41:06 -0000 On Fri, Jan 31, 2014 at 6:20 PM, Rick Macklem wrote: > Oh, and remember to try setting readahead=8 in your mounts, too. NFS will > do a read + N readaheads (where N == 1 by default) and then wait for > replies to those before continuing on. Predictably, this has no effect on anything but sequential reads. No tuning is going to change the fact that writing 14MiB/sec from the client to the server results in 200+ MiB/sec of wasted traffic being sent from the server back to the client. This is from the client's interface during a write-only test: Interface Traffic Peak Total vtnet1 in 202.838 MB/s 219.467 MB/s 359.898 GB out 14.127 MB/s 14.346 MB/s 96.503 GB If write performance did get to wire speed on this workload, the most it could ever do would be <128MiB/sec, because the unused backflow of 2GiB/sec would max out the interface. Thanks! 
From owner-freebsd-net@FreeBSD.ORG Sat Feb 1 02:09:27 2014
Message-ID: <52EC573B.109@sentex.net>
Date: Fri, 31 Jan 2014 21:08:59 -0500
From: Mike Tancsa
Organization: Sentex Communications
To: "freebsd-net@freebsd.org"
Subject: missing packets in igb stats?

Hi Jack,

I was testing the forwarding and firewalling speeds of the igb driver on
RELENG_10 and noticed something odd. I have two boxes connected to a
FreeBSD box in the middle:

FreeBSD-A(em1)-----------(igb1)Router-1(igb0)----------(em1)FreeBSD-B

Box A generates packets as fast as it can to FreeBSD box B's em1 NIC.
Router-1 is a FreeBSD box running RELENG_10.
Watching ifstat on Router-1 as I execute the command on FreeBSD-A:

# ./netblast 1.1.1.2 500 100 20

start:             1391219372.477992294
finish:            1391219392.496952108
send calls:        10877557
send errors:       0
approx send rate:  543877
approx error rate: 0

I see this on the Router-1 box:

         igb0                  igb1
  Kbps in  Kbps out    Kbps in  Kbps out
     0.00      0.00       0.00      0.00
  1600.61  191639.1   280888.7      0.00
  3669.84  434348.6   636134.9      0.00
  3706.56  438636.7   596650.5      0.00
  3755.10  444358.9   562814.3      0.00
  3714.89  439478.5   562056.4      0.00
  3796.79  449397.9   562042.9      0.00
  3786.02  447957.2   577561.4      0.00
  3629.18  429453.4   601285.7      0.00
  3728.48  441312.7   597785.3      0.00
  3806.67  450401.2   596247.0      0.00
  3854.79  456150.2   597865.7      0.00
  3690.11  436552.1   596695.8      0.00
  3676.08  435002.6   596462.8      0.00
  3730.35  441535.2   597132.1      0.00
  3680.43  435518.2   596960.3      0.00
  3741.41  442685.3   597750.8      0.00
  3691.93  436870.6   596236.9      0.00
  3627.31  429120.5   594116.8      0.00
  3661.97  433492.7   595812.0      0.00
  3693.86  437169.0   597826.9      0.00
  2046.18  240635.3   331656.3      0.00
     0.00      0.00       0.00      0.00
     0.00      0.00       0.00      0.00

Notice that the rate of traffic coming in on igb1 is higher than what is
going out on igb0. Box A thinks it sent traffic at some 536,616 packets
per second, or 590 Mb/s. However, the traffic going out is slower, and
what is seen at box B is less still: it sees the traffic at 286 Mb/s and
357,873 pps.

Given the lost packets, should this not show up somewhere in the igb
statistics?
dev.igb.0.%desc: Intel(R) PRO/1000 Network Connection version - 2.4.0 dev.igb.0.%driver: igb dev.igb.0.%location: slot=0 function=0 handle=\_SB_.PCI0.PEG0.PEGP dev.igb.0.%pnpinfo: vendor=0x8086 device=0x10c9 subvendor=0x8086 subdevice=0xa03c class=0x020000 dev.igb.0.%parent: pci1 dev.igb.0.nvm: -1 dev.igb.0.enable_aim: 1 dev.igb.0.fc: 3 dev.igb.0.rx_processing_limit: 100 dev.igb.0.link_irq: 2 dev.igb.0.dropped: 0 dev.igb.0.tx_dma_fail: 0 dev.igb.0.rx_overruns: 0 dev.igb.0.watchdog_timeouts: 0 dev.igb.0.device_control: 1488978497 dev.igb.0.rx_control: 67141634 dev.igb.0.interrupt_mask: 4 dev.igb.0.extended_int_mask: 2147483655 dev.igb.0.tx_buf_alloc: 0 dev.igb.0.rx_buf_alloc: 0 dev.igb.0.fc_high_water: 58976 dev.igb.0.fc_low_water: 58960 dev.igb.0.queue0.no_desc_avail: 19682298 dev.igb.0.queue0.tx_packets: 20962740 dev.igb.0.queue0.rx_packets: 1101622 dev.igb.0.queue0.rx_bytes: 66097424 dev.igb.0.queue0.lro_queued: 0 dev.igb.0.queue0.lro_flushed: 0 dev.igb.0.queue1.no_desc_avail: 32582207 dev.igb.0.queue1.tx_packets: 50082567 dev.igb.0.queue1.rx_packets: 6598 dev.igb.0.queue1.rx_bytes: 462728 dev.igb.0.queue1.lro_queued: 0 dev.igb.0.queue1.lro_flushed: 0 dev.igb.0.mac_stats.excess_coll: 0 dev.igb.0.mac_stats.single_coll: 0 dev.igb.0.mac_stats.multiple_coll: 0 dev.igb.0.mac_stats.late_coll: 0 dev.igb.0.mac_stats.collision_count: 0 dev.igb.0.mac_stats.symbol_errors: 0 dev.igb.0.mac_stats.sequence_errors: 0 dev.igb.0.mac_stats.defer_count: 138912 dev.igb.0.mac_stats.missed_packets: 0 dev.igb.0.mac_stats.recv_no_buff: 0 dev.igb.0.mac_stats.recv_undersize: 0 dev.igb.0.mac_stats.recv_fragmented: 0 dev.igb.0.mac_stats.recv_oversize: 0 dev.igb.0.mac_stats.recv_jabber: 0 dev.igb.0.mac_stats.recv_errs: 0 dev.igb.0.mac_stats.crc_errs: 0 dev.igb.0.mac_stats.alignment_errs: 0 dev.igb.0.mac_stats.coll_ext_errs: 0 dev.igb.0.mac_stats.xon_recvd: 550808 dev.igb.0.mac_stats.xon_txd: 0 dev.igb.0.mac_stats.xoff_recvd: 550808 dev.igb.0.mac_stats.xoff_txd: 0 
dev.igb.0.mac_stats.total_pkts_recvd: 1108220 dev.igb.0.mac_stats.good_pkts_recvd: 6604 dev.igb.0.mac_stats.bcast_pkts_recvd: 0 dev.igb.0.mac_stats.mcast_pkts_recvd: 0 dev.igb.0.mac_stats.rx_frames_64: 1 dev.igb.0.mac_stats.rx_frames_65_127: 6603 dev.igb.0.mac_stats.rx_frames_128_255: 0 dev.igb.0.mac_stats.rx_frames_256_511: 0 dev.igb.0.mac_stats.rx_frames_512_1023: 0 dev.igb.0.mac_stats.rx_frames_1024_1522: 0 dev.igb.0.mac_stats.good_octets_recvd: 489608 dev.igb.0.mac_stats.good_octets_txd: 10120060648 dev.igb.0.mac_stats.total_pkts_txd: 71045307 dev.igb.0.mac_stats.good_pkts_txd: 71045307 dev.igb.0.mac_stats.bcast_pkts_txd: 2 dev.igb.0.mac_stats.mcast_pkts_txd: 0 dev.igb.0.mac_stats.tx_frames_64: 2 dev.igb.0.mac_stats.tx_frames_65_127: 5051081 dev.igb.0.mac_stats.tx_frames_128_255: 65994224 dev.igb.0.mac_stats.tx_frames_256_511: 0 dev.igb.0.mac_stats.tx_frames_512_1023: 0 dev.igb.0.mac_stats.tx_frames_1024_1522: 0 dev.igb.0.mac_stats.tso_txd: 0 dev.igb.0.mac_stats.tso_ctx_fail: 0 dev.igb.0.interrupts.asserts: 6564060 dev.igb.0.interrupts.rx_pkt_timer: 1108207 dev.igb.0.interrupts.rx_abs_timer: 0 dev.igb.0.interrupts.tx_pkt_timer: 0 dev.igb.0.interrupts.tx_abs_timer: 1108220 dev.igb.0.interrupts.tx_queue_empty: 71044772 dev.igb.0.interrupts.tx_queue_min_thresh: 0 dev.igb.0.interrupts.rx_desc_min_thresh: 0 dev.igb.0.interrupts.rx_overrun: 0 dev.igb.0.host.breaker_tx_pkt: 0 dev.igb.0.host.host_tx_pkt_discard: 0 dev.igb.0.host.rx_pkt: 13 dev.igb.0.host.breaker_rx_pkts: 0 dev.igb.0.host.breaker_rx_pkt_drop: 0 dev.igb.0.host.tx_good_pkt: 535 dev.igb.0.host.breaker_tx_pkt_drop: 0 dev.igb.0.host.rx_good_bytes: 70993032 dev.igb.0.host.tx_good_bytes: 10120060648 dev.igb.0.host.length_errors: 0 dev.igb.0.host.serdes_violation_pkt: 0 dev.igb.0.host.header_redir_missed: 0 dev.igb.0.wake: 0 dev.igb.1.%desc: Intel(R) PRO/1000 Network Connection version - 2.4.0 dev.igb.1.%driver: igb dev.igb.1.%location: slot=0 function=1 dev.igb.1.%pnpinfo: vendor=0x8086 device=0x10c9 
subvendor=0x8086 subdevice=0xa03c class=0x020000 dev.igb.1.%parent: pci1 dev.igb.1.nvm: -1 dev.igb.1.enable_aim: 1 dev.igb.1.fc: 3 dev.igb.1.rx_processing_limit: 100 dev.igb.1.link_irq: 2 dev.igb.1.dropped: 0 dev.igb.1.tx_dma_fail: 0 dev.igb.1.rx_overruns: 0 dev.igb.1.watchdog_timeouts: 0 dev.igb.1.device_control: 1488978497 dev.igb.1.rx_control: 67141634 dev.igb.1.interrupt_mask: 4 dev.igb.1.extended_int_mask: 2147483655 dev.igb.1.tx_buf_alloc: 0 dev.igb.1.rx_buf_alloc: 0 dev.igb.1.fc_high_water: 58976 dev.igb.1.fc_low_water: 58960 dev.igb.1.queue0.no_desc_avail: 0 dev.igb.1.queue0.tx_packets: 14 dev.igb.1.queue0.rx_packets: 27770289 dev.igb.1.queue0.rx_bytes: 3632804418 dev.igb.1.queue0.lro_queued: 0 dev.igb.1.queue0.lro_flushed: 0 dev.igb.1.queue1.no_desc_avail: 0 dev.igb.1.queue1.tx_packets: 6599 dev.igb.1.queue1.rx_packets: 58098597 dev.igb.1.queue1.rx_bytes: 8250006086 dev.igb.1.queue1.lro_queued: 0 dev.igb.1.queue1.lro_flushed: 0 dev.igb.1.mac_stats.excess_coll: 0 dev.igb.1.mac_stats.single_coll: 0 dev.igb.1.mac_stats.multiple_coll: 0 dev.igb.1.mac_stats.late_coll: 0 dev.igb.1.mac_stats.collision_count: 0 dev.igb.1.mac_stats.symbol_errors: 0 dev.igb.1.mac_stats.sequence_errors: 0 dev.igb.1.mac_stats.defer_count: 0 dev.igb.1.mac_stats.missed_packets: 0 dev.igb.1.mac_stats.recv_no_buff: 0 dev.igb.1.mac_stats.recv_undersize: 0 dev.igb.1.mac_stats.recv_fragmented: 0 dev.igb.1.mac_stats.recv_oversize: 0 dev.igb.1.mac_stats.recv_jabber: 0 dev.igb.1.mac_stats.recv_errs: 0 dev.igb.1.mac_stats.crc_errs: 0 dev.igb.1.mac_stats.alignment_errs: 0 dev.igb.1.mac_stats.coll_ext_errs: 0 dev.igb.1.mac_stats.xon_recvd: 0 dev.igb.1.mac_stats.xon_txd: 0 dev.igb.1.mac_stats.xoff_recvd: 0 dev.igb.1.mac_stats.xoff_txd: 0 dev.igb.1.mac_stats.total_pkts_recvd: 85868886 dev.igb.1.mac_stats.good_pkts_recvd: 85868886 dev.igb.1.mac_stats.bcast_pkts_recvd: 31 dev.igb.1.mac_stats.mcast_pkts_recvd: 0 dev.igb.1.mac_stats.rx_frames_64: 5 dev.igb.1.mac_stats.rx_frames_65_127: 6211527 
dev.igb.1.mac_stats.rx_frames_128_255: 79657327 dev.igb.1.mac_stats.rx_frames_256_511: 27 dev.igb.1.mac_stats.rx_frames_512_1023: 0 dev.igb.1.mac_stats.rx_frames_1024_1522: 0 dev.igb.1.mac_stats.good_octets_recvd: 12226286048 dev.igb.1.mac_stats.good_octets_txd: 490260 dev.igb.1.mac_stats.total_pkts_txd: 6613 dev.igb.1.mac_stats.good_pkts_txd: 6613 dev.igb.1.mac_stats.bcast_pkts_txd: 4 dev.igb.1.mac_stats.mcast_pkts_txd: 0 dev.igb.1.mac_stats.tx_frames_64: 8 dev.igb.1.mac_stats.tx_frames_65_127: 6605 dev.igb.1.mac_stats.tx_frames_128_255: 0 dev.igb.1.mac_stats.tx_frames_256_511: 0 dev.igb.1.mac_stats.tx_frames_512_1023: 0 dev.igb.1.mac_stats.tx_frames_1024_1522: 0 dev.igb.1.mac_stats.tso_txd: 0 dev.igb.1.mac_stats.tso_ctx_fail: 0 dev.igb.1.interrupts.asserts: 8707927 dev.igb.1.interrupts.rx_pkt_timer: 85867976 dev.igb.1.interrupts.rx_abs_timer: 0 dev.igb.1.interrupts.tx_pkt_timer: 0 dev.igb.1.interrupts.tx_abs_timer: 85868886 dev.igb.1.interrupts.tx_queue_empty: 6613 dev.igb.1.interrupts.tx_queue_min_thresh: 0 dev.igb.1.interrupts.rx_desc_min_thresh: 0 dev.igb.1.interrupts.rx_overrun: 0 dev.igb.1.host.breaker_tx_pkt: 0 dev.igb.1.host.host_tx_pkt_discard: 0 dev.igb.1.host.rx_pkt: 910 dev.igb.1.host.breaker_rx_pkts: 0 dev.igb.1.host.breaker_rx_pkt_drop: 0 dev.igb.1.host.tx_good_pkt: 0 dev.igb.1.host.breaker_tx_pkt_drop: 0 dev.igb.1.host.rx_good_bytes: 12226288092 dev.igb.1.host.tx_good_bytes: 490260 dev.igb.1.host.length_errors: 0 dev.igb.1.host.serdes_violation_pkt: 0 dev.igb.1.host.header_redir_missed: 0 Motherboard is Intel Base Board Information Manufacturer: Intel Corporation Product Name: DH87RL Version: AAG74240-401 Serial Number: BQRL330000Q9 NIC is dual port igb0@pci0:1:0:0: class=0x020000 card=0xa03c8086 chip=0x10c98086 rev=0x01 hdr=0x00 vendor = 'Intel Corporation' device = '82576 Gigabit Network Connection' class = network subclass = ethernet bar [10] = type Memory, range 32, base 0xf7c20000, size 131072, enabled bar [14] = type Memory, range 32, base 
0xf7800000, size 4194304, enabled bar [18] = type I/O Port, range 32, base 0xe020, size 32, enabled bar [1c] = type Memory, range 32, base 0xf7c44000, size 16384, enabled cap 01[40] = powerspec 3 supports D0 D3 current D0 cap 05[50] = MSI supports 1 message, 64 bit, vector masks cap 11[70] = MSI-X supports 10 messages, enabled Table in map 0x1c[0x0], PBA in map 0x1c[0x2000] cap 10[a0] = PCI-Express 2 endpoint max data 128(512) FLR link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) ecap 0001[100] = AER 1 0 fatal 0 non-fatal 2 corrected ecap 0003[140] = Serial 1 90e2baffff5eb48a ecap 000e[150] = ARI 1 ecap 0010[160] = SRIOV 1 igb1@pci0:1:0:1: class=0x020000 card=0xa03c8086 chip=0x10c98086 rev=0x01 hdr=0x00 vendor = 'Intel Corporation' device = '82576 Gigabit Network Connection' class = network subclass = ethernet bar [10] = type Memory, range 32, base 0xf7c00000, size 131072, enabled bar [14] = type Memory, range 32, base 0xf7000000, size 4194304, enabled bar [18] = type I/O Port, range 32, base 0xe000, size 32, enabled bar [1c] = type Memory, range 32, base 0xf7c40000, size 16384, enabled cap 01[40] = powerspec 3 supports D0 D3 current D0 cap 05[50] = MSI supports 1 message, 64 bit, vector masks cap 11[70] = MSI-X supports 10 messages, enabled Table in map 0x1c[0x0], PBA in map 0x1c[0x2000] cap 10[a0] = PCI-Express 2 endpoint max data 128(512) FLR link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) ecap 0001[100] = AER 1 0 fatal 0 non-fatal 2 corrected ecap 0003[140] = Serial 1 90e2baffff5eb48a ecap 000e[150] = ARI 1 ecap 0010[160] = SRIOV 1 root@intel4gen-9:/usr/home/mdtancsa # netstat -m 6141/6489/12630 mbufs in use (current/cache/total) 6139/5871/12010/487416 mbuf clusters in use (current/cache/total/max) 6139/5861 mbuf+clusters out of packet secondary zone in use (current/cache) 0/5/5/243708 4k (page size) jumbo clusters in use (current/cache/total/max) 0/0/0/72209 9k jumbo clusters in use (current/cache/total/max) 0/0/0/40618 16k jumbo clusters in use 
(current/cache/total/max)
13813K/13384K/27197K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
root@intel4gen-9:/usr/home/mdtancsa #

--
-------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mike@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada http://www.tancsa.com/

From owner-freebsd-net@FreeBSD.ORG Sat Feb 1 16:05:59 2014
Date: Sat, 1 Feb 2014 11:05:52 -0500 (EST)
From: Rick Macklem
To: J David
Cc: freebsd-net@freebsd.org, Garrett Wollman
Message-ID: <1966386250.1241234.1391270752429.JavaMail.root@uoguelph.ca>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?

J David wrote:
> On Fri, Jan 31, 2014 at 6:16 PM, Rick Macklem wrote:
> > You can certainly try "-o rsize=61440,wsize=61440" (assuming a 4K
> > page size) for the mount, if you'd like.
>
> This has previously been tested with all 4k steps between 16k and 32k.
> All of them perform worse than
>
> With 61440, NFS fails outright on the random read test:
>
> $ iozone -e -I -s 1g -r 4k -i 0 -i 2
>
> Iozone: Performance Test of File I/O
>         Version $Revision: 3.420 $
>         Compiled for 64 bit mode.
>         Build: freebsd
> [...]
>         Include fsync in write timing
>         O_DIRECT feature enabled
>         File size set to 1048576 KB
>         Record Size 4 KB
>         Command line used: iozone -e -I -s 1g -r 4k -i 0 -i 2
>         Output is in Kbytes/sec
>         Time Resolution = 0.000005 seconds.
>         Processor cache size set to 1024 Kbytes.
>         Processor cache line size set to 32 bytes.
>         File stride size set to 17 * record size.
>
>                                            random  random  bkwd  record  stride
>            KB  reclen  write  rewrite  read  reread  read  write  read  rewrite  read  fwrite  frewrite  fread  freread
>       1048576       4  24688    23891
>
> Error reading block at 1073729536
> read: Bad file descriptor
>
> Upon using the -w option, which leaves the file intact on exit, it's
> possible to see that it's not even 1 GiB in length:
>
> $ ls -aln iozone.tmp
> -rw-r-----  1 1000  0  1073709056 Feb  1 01:18 iozone.tmp
>
> It's 32k short, which is a pretty surprising result.

Ok, I knew that non-powers of 2 could result in problems.
I thought they only occurred when the size wasn't an exact multiple of
the page size, but it seems there are also non-power-of-2 problems.

rick

From owner-freebsd-net@FreeBSD.ORG Sat Feb 1 18:53:31 2014
Date: Sat, 1 Feb 2014 13:53:30 -0500
From: J David
To: Rick Macklem
Cc: freebsd-net@freebsd.org, Garrett Wollman
Subject: Re: Terrible NFS performance under 9.2-RELEASE?

On Sat, Feb 1, 2014 at 11:05 AM, Rick Macklem wrote:
> Ok, I knew that non-powers of 2 could result in problems. I thought
> they only occurred when the size wasn't an exact multiple of page
> size, but it seems there are non-power of 2 problems.

What can I do to help identify and resolve these problems?

Thanks!