From owner-freebsd-net@FreeBSD.ORG Sun Jan 19 01:42:54 2014
From: Olivier Cochard-Labbé
Date: Sun, 19 Jan 2014 02:42:32 +0100
Subject: Re: Regression on 10-RC5 with a multicast routing daemon
To: Gleb Smirnoff
In-Reply-To: <20140115113430.GK26504@FreeBSD.org>
Cc: 
"freebsd-net@freebsd.org", "freebsd-current@freebsd.org", Andre Oppermann

On Wed, Jan 15, 2014 at 12:34 PM, Gleb Smirnoff wrote:
> Olivier,
>
> TL;DR version: you need not subtract iphdrlen in 10.0. Code in
> igmp.c:accept_igmp() should be something like:
>
>     iphdrlen = ip->ip_hl << 2;
> #ifdef RAW_INPUT_IS_RAW /* Linux */
>     ipdatalen = ntohs(ip->ip_len) - iphdrlen;
> #else
> #if __FreeBSD_version >= 1000000
>     ipdatalen = ip->ip_len - iphdrlen;
> #else
>     ipdatalen = ip->ip_len;
> #endif
> #endif

With this patch I no longer get the message "warning - Received packet
from x.x.x.x shorter (28 bytes) than hdr+data length (20+28)". Thanks!

But there is still a regression in the PIM socket behavior that is not
related to the packet format. pim.c includes two functions (pim_read()
and pim_accept()) that are called when the socket receives a packet.
These functions are never triggered when PIM packets are received on
10.0, while igmp_read() and igmp_accept() are correctly triggered on
both 9.2 and 10.0. tcpdump in non-promiscuous mode correctly sees the
incoming PIM packets. This should confirm that once the daemon is
started, it correctly opens a PIM socket and the multicast filter is
updated.
From owner-freebsd-net@FreeBSD.ORG Sun Jan 19 08:47:27 2014
Date: Sun, 19 Jan 2014 03:47:25 -0500
Subject: Terrible NFS performance under 9.2-RELEASE?
From: J David
To: freebsd-net@freebsd.org, freebsd-stable, freebsd-virtualization@freebsd.org

While setting up a test for other purposes, I noticed some really
horrible NFS performance issues.

To explore this, I set up a test environment with two FreeBSD
9.2-RELEASE-p3 virtual machines running under KVM. The NFS server is
configured to serve a 2 gig mfs on /mnt.

The performance of the virtual network is outstanding:

Server:

$ iperf -c 172.20.20.169
------------------------------------------------------------
Client connecting to 172.20.20.169, TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  3] local 172.20.20.162 port 59717 connected with 172.20.20.169 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  16.1 GBytes  13.8 Gbits/sec

$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  4] local 172.20.20.162 port 5001 connected with 172.20.20.169 port 45655
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  15.8 GBytes  13.6 Gbits/sec

Client:

$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  4] local 172.20.20.169 port 5001 connected with 172.20.20.162 port 59717
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  16.1 GBytes  13.8 Gbits/sec
^C$ iperf -c 172.20.20.162
------------------------------------------------------------
Client connecting to 172.20.20.162, TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  3] local 172.20.20.169 port 45655 connected with 172.20.20.162 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  15.8 GBytes  13.6 Gbits/sec

The performance of the mfs filesystem on the server is also good.

Server:

$ sudo mdconfig -a -t swap -s 2g
md0
$ sudo newfs -U -b 4k -f 4k /dev/md0
/dev/md0: 2048.0MB (4194304 sectors) block size 4096, fragment size 4096
	using 43 cylinder groups of 48.12MB, 12320 blks, 6160 inodes.
	with soft updates
super-block backups (for fsck_ffs -b #) at:
 144, 98704, 197264, 295824, 394384, 492944, 591504, 690064, 788624, 887184,
 985744, 1084304, 1182864, 1281424, 1379984, 1478544, 1577104, 1675664,
 1774224, 1872784, 1971344, 2069904, 2168464, 2267024, 2365584, 2464144,
 2562704, 2661264, 2759824, 2858384, 2956944, 3055504, 3154064, 3252624,
 3351184, 3449744, 3548304, 3646864, 3745424, 3843984, 3942544, 4041104,
 4139664
$ sudo mount /dev/md0 /mnt
$ cd /mnt
$ sudo iozone -e -I -s 512m -r 4k -i 0 -i 1 -i 2
	Iozone: Performance Test of File I/O
	        Version $Revision: 3.420 $
[...]
                                                random   random
      KB  reclen    write  rewrite     read   reread     read    write
  524288       4   560145  1114593   933699   831902    56347   158904

iozone test complete.

But introduce NFS into the mix and everything falls apart.

Client:

$ sudo mount -o tcp,nfsv3 f12.phxi:/mnt /mnt
$ cd /mnt
$ sudo iozone -e -I -s 512m -r 4k -i 0 -i 1 -i 2
	Iozone: Performance Test of File I/O
	        Version $Revision: 3.420 $
[...]
                                                random   random
      KB  reclen    write  rewrite     read   reread     read    write
  524288       4    67246     2923   103295  1272407   172475      196

And the above took 48 minutes to run, compared to 14 seconds for the
local version. So it's 200x slower over NFS. The random write test is
over 800x slower. Of course NFS is slower, that's expected, but it
definitely wasn't this exaggerated in previous releases.
To emphasize that iozone reflects real workloads here, I tried doing an
svn co of the 9-STABLE source tree over NFS, but after two hours it was
still in llvm, so I gave up.

While all this not-much-of-anything NFS traffic is going on, both
systems are essentially idle. The process on the client sits in
"newnfs" wait state with nearly no CPU. The server is completely idle
except for the occasional 0.10% in an nfsd thread; the nfsd threads
otherwise spend their lives in rpcsvc wait state.

Server iostat:

$ iostat -x -w 10 md0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
[...]
md0        0.0  36.0     0.0     0.0    0   1.2   0
md0        0.0  38.8     0.0     0.0    0   1.5   0
md0        0.0  73.6     0.0     0.0    0   1.0   0
md0        0.0  53.3     0.0     0.0    0   2.5   0
md0        0.0  33.7     0.0     0.0    0   1.1   0
md0        0.0  45.5     0.0     0.0    0   1.8   0

Server nfsstat:

$ nfsstat -s -w 10
 GtAttr Lookup Rdlink   Read  Write Rename Access  Rddir
[...]
      0      0      0    471    816      0      0      0
      0      0      0    480    751      0      0      0
      0      0      0    481     36      0      0      0
      0      0      0    469    550      0      0      0
      0      0      0    485    814      0      0      0
      0      0      0    467    503      0      0      0
      0      0      0    473    345      0      0      0

Client nfsstat:

$ nfsstat -c -w 10
 GtAttr Lookup Rdlink   Read  Write Rename Access  Rddir
[...]
      0      0      0      0    518      0      0      0
      0      0      0      0    498      0      0      0
      0      0      0      0    503      0      0      0
      0      0      0      0    474      0      0      0
      0      0      0      0    525      0      0      0
      0      0      0      0    497      0      0      0

Server vmstat:

$ vmstat -w 10
 procs      memory      page                    disks     faults         cpu
 r b w     avm    fre   flt  re  pi  po    fr  sr vt0 vt1   in   sy   cs us sy id
[...]
 0 4 0    634M  6043M    37   0   0   0     1   0   0   0 1561   46 3431  0  2 98
 0 4 0    640M  6042M    62   0   0   0    28   0   0   0 1598   94 3552  0  2 98
 0 4 0    648M  6042M    38   0   0   0     0   0   0   0 1609   47 3485  0  1 99
 0 4 0    648M  6042M    37   0   0   0     0   0   0   0 1615   46 3667  0  2 98
 0 4 0    648M  6042M    37   0   0   0     0   0   0   0 1606   45 3678  0  2 98
 0 4 0    648M  6042M    37   0   0   0     0   0   1   0 1561   45 3377  0  2 98

Client vmstat:

$ vmstat -w 10
 procs      memory      page                    disks     faults         cpu
 r b w     avm    fre   flt  re  pi  po    fr  sr md0 da0   in   sy   cs us sy id
[...]
 0 0 0    639M   593M    33   0   0   0  1237   0   0   0  281 5575 1043  0  3 97
 0 0 0    639M   591M     0   0   0   0   712   0   0   0  235  122  889  0  2 98
 0 0 0    639M   583M     0   0   0   0   571   0   0   1  227  120  851  0  2 98
 0 0 0    639M   592M   198   0   0   0  1212   0   0   0  251 2497  950  0  3 97
 0 0 0    639M   586M     0   0   0   0   614   0   0   0  250  121  924  0  2 98
 0 0 0    639M   586M     0   0   0   0   765   0   0   0  250  120  918  0  3 97

Top on the KVM host says it is 93-95% idle and that each VM sits around
7-10% CPU. So basically nobody is doing anything. There's no visible
bottleneck, and I've no idea where to go from here to figure out what's
going on.

Does anyone have any suggestions for debugging this?

Thanks!

From owner-freebsd-net@FreeBSD.ORG Sun Jan 19 14:32:16 2014
From: Alfred Perlstein
Date: Sun, 19 Jan 2014 06:32:16 -0800
To: freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
9.x has pretty poor mbuf tuning by default.

I hit nearly the same problem and raising the mbufs worked for me.

I'd suggest raising that and retrying.

-Alfred

On 1/19/14 12:47 AM, J David wrote:
> While setting up a test for other purposes, I noticed some really
> horrible NFS performance issues.
>
> [...]
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

From owner-freebsd-net@FreeBSD.ORG Sun Jan 19 17:58:35 2014
From: Adam McDougall
Date: Sun, 19 Jan 2014 12:58:25 -0500
To: freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?

Also try rsize=32768,wsize=32768 in your mount options, made a huge
difference for me.
I've noticed slow file transfers over NFS on 9, and when I finally did
some searching a couple of months ago, someone suggested it; they were
on to something.

On 01/19/2014 09:32, Alfred Perlstein wrote:
> 9.x has pretty poor mbuf tuning by default.
>
> I hit nearly the same problem and raising the mbufs worked for me.
>
> I'd suggest raising that and retrying.
>
> -Alfred
>
> On 1/19/14 12:47 AM, J David wrote:
>> While setting up a test for other purposes, I noticed some really
>> horrible NFS performance issues.
>>
>> [...]
From owner-freebsd-net@FreeBSD.ORG Sun Jan 19 23:23:25 2014
Date: Sun, 19 Jan 2014 15:23:23 -0800
From: Adrian Chadd
To: FreeBSD Net, "freebsd-arch@freebsd.org"
Subject: Re: [rfc] set inp_flowid on initial TCP connection

Ok, I've committed this to -HEAD.

Thanks,

-a

On 16 January 2014 12:28, Adrian Chadd wrote:
> Hi,
>
> This patch sets the inp_flowid on incoming connections. Without this,
> the initial connection has no flowid, so things like the per-CPU TCP
> callwheel stuff would map to a different CPU on the initial incoming
> setup.
>
> -a
>
> Index: sys/netinet/tcp_syncache.c
> ===================================================================
> --- sys/netinet/tcp_syncache.c  (revision 260499)
> +++ sys/netinet/tcp_syncache.c  (working copy)
> @@ -722,6 +722,16 @@
>  #endif
>
>  	/*
> +	 * If there's an mbuf and it has a flowid, then let's initialise the
> +	 * inp with that particular flowid.
> +	 */
> +	if (m != NULL && m->m_flags & M_FLOWID) {
> +		inp->inp_flags |= INP_HW_FLOWID;
> +		inp->inp_flags &= ~INP_SW_FLOWID;
> +		inp->inp_flowid = m->m_pkthdr.flowid;
> +	}
> +
> +	/*
>  	 * Install in the reservation hash table for now, but don't yet
>  	 * install a connection group since the full 4-tuple isn't yet
>  	 * configured.
From owner-freebsd-net@FreeBSD.ORG Sun Jan 19 23:36:19 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 72D7F1C1 for ; Sun, 19 Jan 2014 23:36:19 +0000 (UTC) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id ECE1A16CA for ; Sun, 19 Jan 2014 23:36:18 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: X-IronPort-AV: E=Sophos;i="4.95,687,1384318800"; d="scan'208";a="89050440" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 19 Jan 2014 18:36:17 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 8D306B40D0; Sun, 19 Jan 2014 18:36:17 -0500 (EST) Date: Sun, 19 Jan 2014 18:36:17 -0500 (EST) From: Rick Macklem To: Adam McDougall Message-ID: <1349281953.12559529.1390174577569.JavaMail.root@uoguelph.ca> In-Reply-To: <52DC1241.7010004@egr.msu.edu> Subject: Re: Terrible NFS performance under 9.2-RELEASE? MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.209] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 19 Jan 2014 23:36:19 -0000 Adam McDougall wrote: > Also try rsize=32768,wsize=32768 in your mount options, made a huge > difference for me. 
I've noticed slow file transfers on NFS in 9 and > finally did some searching a couple months ago, someone suggested it > and > they were on to something. > Yes, it shouldn't make a big difference but it sometimes does. When it does, I believe that indicates there is a problem with your network fabric. The problem might be TSO for segments near 64K in size or some network device that can't handle the larger burst of received packets. (Slight differences could be related to vm issues related to fragmentation caused by the larger mapped buffer cache blocks, but I'm pretty sure this wouldn't cause a massive difference.) rick > On 01/19/2014 09:32, Alfred Perlstein wrote: > > 9.x has pretty poor mbuf tuning by default. > > > > I hit nearly the same problem and raising the mbufs worked for me. > > > > I'd suggest raising that and retrying. > > > > -Alfred > > > > On 1/19/14 12:47 AM, J David wrote: > >> While setting up a test for other purposes, I noticed some really > >> horrible NFS performance issues. > >> > >> To explore this, I set up a test environment with two FreeBSD > >> 9.2-RELEASE-p3 virtual machines running under KVM. The NFS server > >> is > >> configured to serve a 2 gig mfs on /mnt. 
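Adam's rsize/wsize suggestion and Rick's TSO theory above can both be tried directly from the client. This is only a sketch: the hostname, export path, mount point, and interface name below are examples taken from elsewhere in the thread and may differ on your system.

```shell
# Remount the export with 32k transfer sizes instead of the 64k default:
umount /mnt
mount -t nfs -o tcp,nfsv3,rsize=32768,wsize=32768 f12.phxi:/mnt /mnt

# To test the TSO theory separately, disable TSO on the client NIC
# (interface name is an example) and re-run the same benchmark:
ifconfig vtnet1 -tso
```

If only one of the two changes restores throughput, that narrows down whether the problem is the larger RPC size itself or the TSO path handling segments near 64K.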
> >> > >> The performance of the virtual network is outstanding: > >> > >> Server: > >> > >> $ iperf -c 172.20.20.169 > >> > >> ------------------------------------------------------------ > >> > >> Client connecting to 172.20.20.169, TCP port 5001 > >> > >> TCP window size: 1.00 MByte (default) > >> > >> ------------------------------------------------------------ > >> > >> [ 3] local 172.20.20.162 port 59717 connected with 172.20.20.169 > >> port > >> 5001 > >> > >> [ ID] Interval Transfer Bandwidth > >> > >> [ 3] 0.0-10.0 sec 16.1 GBytes 13.8 Gbits/sec > >> > >> $ iperf -s > >> > >> ------------------------------------------------------------ > >> > >> Server listening on TCP port 5001 > >> > >> TCP window size: 1.00 MByte (default) > >> > >> ------------------------------------------------------------ > >> > >> [ 4] local 172.20.20.162 port 5001 connected with 172.20.20.169 > >> port > >> 45655 > >> > >> [ ID] Interval Transfer Bandwidth > >> > >> [ 4] 0.0-10.0 sec 15.8 GBytes 13.6 Gbits/sec > >> > >> > >> Client: > >> > >> > >> $ iperf -s > >> > >> ------------------------------------------------------------ > >> > >> Server listening on TCP port 5001 > >> > >> TCP window size: 1.00 MByte (default) > >> > >> ------------------------------------------------------------ > >> > >> [ 4] local 172.20.20.169 port 5001 connected with 172.20.20.162 > >> port > >> 59717 > >> > >> [ ID] Interval Transfer Bandwidth > >> > >> [ 4] 0.0-10.0 sec 16.1 GBytes 13.8 Gbits/sec > >> > >> ^C$ iperf -c 172.20.20.162 > >> > >> ------------------------------------------------------------ > >> > >> Client connecting to 172.20.20.162, TCP port 5001 > >> > >> TCP window size: 1.00 MByte (default) > >> > >> ------------------------------------------------------------ > >> > >> [ 3] local 172.20.20.169 port 45655 connected with 172.20.20.162 > >> port > >> 5001 > >> > >> [ ID] Interval Transfer Bandwidth > >> > >> [ 3] 0.0-10.0 sec 15.8 GBytes 13.6 Gbits/sec > >> > >> > >> The performance 
of the mfs filesystem on the server is also good. > >> > >> Server: > >> > >> $ sudo mdconfig -a -t swap -s 2g > >> > >> md0 > >> > >> $ sudo newfs -U -b 4k -f 4k /dev/md0 > >> > >> /dev/md0: 2048.0MB (4194304 sectors) block size 4096, fragment > >> size 4096 > >> > >> using 43 cylinder groups of 48.12MB, 12320 blks, 6160 inodes. > >> > >> with soft updates > >> > >> super-block backups (for fsck_ffs -b #) at: > >> > >> 144, 98704, 197264, 295824, 394384, 492944, 591504, 690064, > >> 788624, > >> 887184, > >> > >> 985744, 1084304, 1182864, 1281424, 1379984, 1478544, 1577104, > >> 1675664, > >> > >> 1774224, 1872784, 1971344, 2069904, 2168464, 2267024, 2365584, > >> 2464144, > >> > >> 2562704, 2661264, 2759824, 2858384, 2956944, 3055504, 3154064, > >> 3252624, > >> > >> 3351184, 3449744, 3548304, 3646864, 3745424, 3843984, 3942544, > >> 4041104, > >> > >> 4139664 > >> > >> $ sudo mount /dev/md0 /mnt > >> > >> $ cd /mnt > >> > >> $ sudo iozone -e -I -s 512m -r 4k -i 0 -i 1 -i 2 > >> > >> Iozone: Performance Test of File I/O > >> > >> Version $Revision: 3.420 $ > >> > >> [...] > >> > >> random > >> random > >> > >> KB reclen write rewrite read reread > >> read > >> write > >> > >> 524288 4 560145 1114593 933699 831902 > >> 56347 > >> 158904 > >> > >> > >> iozone test complete. > >> > >> > >> But introduce NFS into the mix and everything falls apart. > >> > >> Client: > >> > >> $ sudo mount -o tcp,nfsv3 f12.phxi:/mnt /mnt > >> > >> $ cd /mnt > >> > >> $ sudo iozone -e -I -s 512m -r 4k -i 0 -i 1 -i 2 > >> > >> Iozone: Performance Test of File I/O > >> > >> Version $Revision: 3.420 $ > >> > >> [...] > >> > >> random > >> random > >> > >> KB reclen write rewrite read reread > >> read > >> write > >> > >> 524288 4 67246 2923 103295 1272407 > >> 172475 > >> 196 > >> > >> > >> And the above took 48 minutes to run, compared to 14 seconds for > >> the > >> local version. So it's 200x slower over NFS. The random write > >> test > >> is over 800x slower. 
Of course NFS is slower, that's expected, > >> but it > >> definitely wasn't this exaggerated in previous releases. > >> > >> To emphasize that iozone reflects real workloads here, I tried > >> doing > >> an svn co of the 9-STABLE source tree over NFS but after two hours > >> it > >> was still in llvm so I gave up. > >> > >> While all this not-much-of-anything NFS traffic is going on, both > >> systems are essentially idle. The process on the client sits in > >> "newnfs" wait state with nearly no CPU. The server is completely > >> idle > >> except for the occasional 0.10% in an nfsd thread, which otherwise > >> spend their lives in rpcsvc wait state. > >> > >> Server iostat: > >> > >> $ iostat -x -w 10 md0 > >> > >> extended device statistics > >> > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> > >> [...] > >> > >> md0 0.0 36.0 0.0 0.0 0 1.2 0 > >> md0 0.0 38.8 0.0 0.0 0 1.5 0 > >> md0 0.0 73.6 0.0 0.0 0 1.0 0 > >> md0 0.0 53.3 0.0 0.0 0 2.5 0 > >> md0 0.0 33.7 0.0 0.0 0 1.1 0 > >> md0 0.0 45.5 0.0 0.0 0 1.8 0 > >> > >> Server nfsstat: > >> > >> $ nfsstat -s -w 10 > >> > >> GtAttr Lookup Rdlink Read Write Rename Access Rddir > >> > >> [...] > >> > >> 0 0 0 471 816 0 0 0 > >> > >> 0 0 0 480 751 0 0 0 > >> > >> 0 0 0 481 36 0 0 0 > >> > >> 0 0 0 469 550 0 0 0 > >> > >> 0 0 0 485 814 0 0 0 > >> > >> 0 0 0 467 503 0 0 0 > >> > >> 0 0 0 473 345 0 0 0 > >> > >> > >> Client nfsstat: > >> > >> $ nfsstat -c -w 10 > >> > >> GtAttr Lookup Rdlink Read Write Rename Access Rddir > >> > >> [...] > >> > >> 0 0 0 0 518 0 0 0 > >> > >> 0 0 0 0 498 0 0 0 > >> > >> 0 0 0 0 503 0 0 0 > >> > >> 0 0 0 0 474 0 0 0 > >> > >> 0 0 0 0 525 0 0 0 > >> > >> 0 0 0 0 497 0 0 0 > >> > >> > >> Server vmstat: > >> > >> $ vmstat -w 10 > >> > >> procs memory page disks > >> faults cpu > >> > >> r b w avm fre flt re pi po fr sr vt0 vt1 in > >> sy > >> cs us sy id > >> > >> [...] 
> >> > >> 0 4 0 634M 6043M 37 0 0 0 1 0 0 0 1561 > >> 46 > >> 3431 0 2 98 > >> > >> 0 4 0 640M 6042M 62 0 0 0 28 0 0 0 1598 > >> 94 > >> 3552 0 2 98 > >> > >> 0 4 0 648M 6042M 38 0 0 0 0 0 0 0 1609 > >> 47 > >> 3485 0 1 99 > >> > >> 0 4 0 648M 6042M 37 0 0 0 0 0 0 0 1615 > >> 46 > >> 3667 0 2 98 > >> > >> 0 4 0 648M 6042M 37 0 0 0 0 0 0 0 1606 > >> 45 > >> 3678 0 2 98 > >> > >> 0 4 0 648M 6042M 37 0 0 0 0 0 1 0 1561 > >> 45 > >> 3377 0 2 98 > >> > >> > >> Client vmstat: > >> > >> $ vmstat -w 10 > >> > >> procs memory page disks > >> faults cpu > >> > >> r b w avm fre flt re pi po fr sr md0 da0 in > >> sy > >> cs us sy id > >> > >> [...] > >> > >> 0 0 0 639M 593M 33 0 0 0 1237 0 0 0 281 > >> 5575 > >> 1043 0 3 97 > >> > >> 0 0 0 639M 591M 0 0 0 0 712 0 0 0 235 > >> 122 > >> 889 0 2 98 > >> > >> 0 0 0 639M 583M 0 0 0 0 571 0 0 1 227 > >> 120 > >> 851 0 2 98 > >> > >> 0 0 0 639M 592M 198 0 0 0 1212 0 0 0 251 > >> 2497 > >> 950 0 3 97 > >> > >> 0 0 0 639M 586M 0 0 0 0 614 0 0 0 250 > >> 121 > >> 924 0 2 98 > >> > >> 0 0 0 639M 586M 0 0 0 0 765 0 0 0 250 > >> 120 > >> 918 0 3 97 > >> > >> > >> Top on the KVM host says it is 93-95% idle and that each VM sits > >> around 7-10% CPU. So basically nobody is doing anything. There's > >> no > >> visible bottleneck, and I've no idea where to go from here to > >> figure > >> out what's going on. > >> > >> Does anyone have any suggestions for debugging this? > >> > >> Thanks! 
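One way to start gathering the evidence asked for above is to capture the NFS conversation during a slow run and watch the TCP retransmit counters. A sketch; the interface name and server address are taken from the outputs in this thread and may differ:

```shell
# Capture the NFS conversation during a slow iozone run for offline
# analysis (look for retransmissions, zero-window stalls, long gaps):
tcpdump -i vtnet1 -s 0 -w /tmp/nfs.pcap host 172.20.20.162 and port 2049

# While the test runs, watch the stack-wide retransmit counters grow:
netstat -sp tcp | grep -i retrans
```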
> >> _______________________________________________ > >> freebsd-net@freebsd.org mailing list > >> http://lists.freebsd.org/mailman/listinfo/freebsd-net > >> To unsubscribe, send any mail to > >> "freebsd-net-unsubscribe@freebsd.org" > >> > > > > _______________________________________________ > > freebsd-net@freebsd.org mailing list > > http://lists.freebsd.org/mailman/listinfo/freebsd-net > > To unsubscribe, send any mail to > > "freebsd-net-unsubscribe@freebsd.org" > > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Mon Jan 20 04:11:09 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id EFB5CEB7 for ; Mon, 20 Jan 2014 04:11:09 +0000 (UTC) Received: from mail-ig0-x22c.google.com (mail-ig0-x22c.google.com [IPv6:2607:f8b0:4001:c05::22c]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id BCD4D1D44 for ; Mon, 20 Jan 2014 04:11:09 +0000 (UTC) Received: by mail-ig0-f172.google.com with SMTP id k19so6911980igc.5 for ; Sun, 19 Jan 2014 20:11:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=c4kVwfaA/IHNcabUnGkf8lt3HJkxn1iMhcF2dl0HeF8=; b=OnLTdbbRxjieq1DSK6tIWRPuHsP/ww8CgX/FrSg9Z8NESixZSm9oSdg9bWh09+LJMO QLyX+UDVajurg07QhiZEuS4sVqxJb+zMNkyiWUllJ90ItU9bhO+nA0zi4C4OczQL1edj QqPYmKTLbgtltrPVt22i/qX8Em9FziCmFmHs14TgaIPBkXWi+RCJdPsZT97gnBUxtabi GqJn2Uddv+NTRmEH1HNcPOJajdcTllCIQHN2ZllSN6CIDQG7k5FMI6zwG4eVnGK5gSwP 
GrqDy1AEd8zUj1LzXCTP1scK9Xbgjfr703w/eKDWyUTrv9pDvYscK/zAeRQT8QNoeDs/ 4gYg== MIME-Version: 1.0 X-Received: by 10.42.53.10 with SMTP id l10mr12143131icg.33.1390191069203; Sun, 19 Jan 2014 20:11:09 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Sun, 19 Jan 2014 20:11:09 -0800 (PST) In-Reply-To: <1349281953.12559529.1390174577569.JavaMail.root@uoguelph.ca> References: <52DC1241.7010004@egr.msu.edu> <1349281953.12559529.1390174577569.JavaMail.root@uoguelph.ca> Date: Sun, 19 Jan 2014 23:11:09 -0500 X-Google-Sender-Auth: E8I1h8lGK6_LM73w_DcQSgDd6Pk Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? From: J David To: Rick Macklem Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-net@freebsd.org, Adam McDougall X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 20 Jan 2014 04:11:10 -0000 MIME-Version: 1.0 Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Sun, 19 Jan 2014 20:08:04 -0800 (PST) In-Reply-To: <1349281953.12559529.1390174577569.JavaMail.root@uoguelph.ca> References: <52DC1241.7010004@egr.msu.edu> <1349281953.12559529.1390174577569.JavaMail.root@uoguelph.ca> Date: Sun, 19 Jan 2014 23:08:04 -0500 Delivered-To: jdavidlists@gmail.com X-Google-Sender-Auth: 2XgnsPkoaEEkfTqW1ZVFM_Lel3o Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? From: J David To: Rick Macklem Content-Type: text/plain; charset=ISO-8859-1 On Sun, Jan 19, 2014 at 9:32 AM, Alfred Perlstein wrote: > I hit nearly the same problem and raising the mbufs worked for me. > > I'd suggest raising that and retrying. That doesn't seem to be an issue here; mbufs are well below max on both client and server and all the "delayed"/"denied" lines are 0/0/0. 
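For completeness, the mbuf ceiling and the "delayed"/"denied" counters referred to here can be checked with the stock tools (a sketch, nothing beyond base-system commands):

```shell
# All-zero denied/delayed lines mean no mbuf starvation occurred:
netstat -m | grep -E 'denied|delayed'

# Current cluster ceiling, to compare against the in-use figures:
sysctl kern.ipc.nmbclusters
```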
On Sun, Jan 19, 2014 at 12:58 PM, Adam McDougall wrote:
> Also try rsize=32768,wsize=32768 in your mount options, made a huge
> difference for me.

This does make a difference, but inconsistently. In order to test this
further, I created a Debian guest on the same host as these two FreeBSD
hosts and re-ran the tests with it acting as both client and server, and
ran them for both 32k and 64k. Findings:

                                                        random  random
                      write  rewrite    read   reread     read   write
S:FBSD,C:FBSD,Z:64k   67246     2923  103295  1272407   172475     196
S:FBSD,C:FBSD,Z:32k   11951    99896  223787  1051948   223276   13686
S:FBSD,C:DEB,Z:64k    11414    14445   31554    30156    30368   13799
S:FBSD,C:DEB,Z:32k    11215    14442   31439    31026    29608   13769
S:DEB,C:FBSD,Z:64k    36844   173312  313919  1169426   188432   14273
S:DEB,C:FBSD,Z:32k    66928   120660  257830  1048309   225807   18103

So the rsize/wsize makes a difference between two FreeBSD nodes, but with
a Debian node as either client or server, it no longer seems to matter
much. And /proc/mounts on the debian box confirms that it negotiates and
honors the 64k size as a client.

On Sun, Jan 19, 2014 at 6:36 PM, Rick Macklem wrote:
> Yes, it shouldn't make a big difference but it sometimes does. When it
> does, I believe that indicates there is a problem with your network
> fabric.

Given that this is an entirely virtual environment, if your belief is
correct, where would supporting evidence be found? As far as I can tell,
there are no interface errors reported on the host (checking both taps
and the bridge) or any of the guests, nothing in sysctl dev.vtnet of
concern, etc. Also the improvement from using debian on either side, even
with 64k sizes, seems counterintuitive.
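To quantify the swing in the FreeBSD-to-FreeBSD rows above, here is the random-write ratio between the two mount sizes (plain arithmetic on the reported numbers, nothing more):

```shell
# random-write throughput, 32k mount vs 64k mount, both ends FreeBSD
# (13686 and 196 are the figures from the table above)
echo $((13686 / 196))   # -> 69, i.e. roughly 70x faster at 32k
```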
To try to help vindicate the network stack, I did iperf -d between the
two FreeBSD nodes while the iozone was running:

Server:

$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  4] local 172.20.20.162 port 5001 connected with 172.20.20.169 port 37449
------------------------------------------------------------
Client connecting to 172.20.20.169, TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  6] local 172.20.20.162 port 28634 connected with 172.20.20.169 port 5001
Waiting for server threads to complete. Interrupt again to force quit.
[ ID] Interval       Transfer     Bandwidth
[  6]  0.0-10.0 sec  15.8 GBytes  13.6 Gbits/sec
[  4]  0.0-10.0 sec  15.6 GBytes  13.4 Gbits/sec

Client:

$ iperf -c 172.20.20.162 -d
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 172.20.20.162, TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  5] local 172.20.20.169 port 32533 connected with 172.20.20.162 port 5001
[  4] local 172.20.20.169 port 5001 connected with 172.20.20.162 port 36617
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  15.6 GBytes  13.4 Gbits/sec
[  4]  0.0-10.0 sec  15.5 GBytes  13.3 Gbits/sec

mbuf usage is pretty low.
Server:

$ netstat -m
545/4075/4620 mbufs in use (current/cache/total)
535/1819/2354/131072 mbuf clusters in use (current/cache/total/max)
535/1641 mbuf+clusters out of packet secondary zone in use (current/cache)
0/2034/2034/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
1206K/12792K/13999K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Client:

$ netstat -m
1841/3544/5385 mbufs in use (current/cache/total)
1172/1198/2370/32768 mbuf clusters in use (current/cache/total/max)
512/896 mbuf+clusters out of packet secondary zone in use (current/cache)
0/2314/2314/16384 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/8192 9k jumbo clusters in use (current/cache/total/max)
0/0/0/4096 16k jumbo clusters in use (current/cache/total/max)
2804K/12538K/15342K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Here's 60 seconds of netstat -ss for ip and tcp from the server with the
64k mount running iozone:

ip:
        4776 total packets received
        4758 packets for this host
        18 packets for unknown/unsupported protocol
        2238 packets sent from this host
tcp:
        2244 packets sent
                1427 data packets (238332 bytes)
                5 data packets (820 bytes) retransmitted
                812 ack-only packets (587 delayed)
        2235 packets received
                1428 acks (for 238368 bytes)
                2007 packets (91952792 bytes) received in-sequence
                225 out-of-order packets (325800 bytes)
        1428 segments updated rtt (of 1426 attempts)
        5 retransmit timeouts
        587 correct data packet header predictions
        225 SACK options (SACK blocks) sent

And with the 32k mount:

ip:
        24172 total packets received
        24167 packets for this host
        5 packets for unknown/unsupported protocol
        26130 packets sent from this host
tcp:
        26130 packets sent
                23506 data packets (5362120 bytes)
                2624 ack-only packets (454 delayed)
        21671 packets received
                18143 acks (for 5362192 bytes)
                20278 packets (756617316 bytes) received in-sequence
                96 out-of-order packets (145964 bytes)
        18143 segments updated rtt (of 17469 attempts)
        1093 correct ACK header predictions
        3449 correct data packet header predictions
        111 SACK options (SACK blocks) sent

So the 32k mount sends about 6x the packet volume. (This is on iozone's
linear write test.)

One thing I've noticed is that when the 64k connection bogs down, it
seems to "poison" things for awhile. For example, iperf will start doing
this afterward:

From the client to the server:

$ iperf -c 172.20.20.162
------------------------------------------------------------
Client connecting to 172.20.20.162, TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  3] local 172.20.20.169 port 14337 connected with 172.20.20.162 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  4.88 MBytes  4.05 Mbits/sec

Ouch! That's quite a drop from 13Gbit/sec.
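One sanity check worth doing on the netstat -ss numbers above: the average payload per "received in-sequence" packet is far above any plausible MTU for both mounts, which would be consistent with LRO on vtnet merging segments before the stack counts them (an inference from the arithmetic, not something the counters prove):

```shell
# bytes received in-sequence / packets received in-sequence,
# figures taken from the netstat -ss samples above
echo $((91952792 / 2007))     # 64k mount -> 45816 bytes per counted packet
echo $((756617316 / 20278))   # 32k mount -> 37312 bytes per counted packet
```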
Weirdly, iperf to the debian node is not affected:

From the client to the debian node:

$ iperf -c 172.20.20.166
------------------------------------------------------------
Client connecting to 172.20.20.166, TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  3] local 172.20.20.169 port 24376 connected with 172.20.20.166 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  20.4 GBytes  17.5 Gbits/sec

From the debian node to the server:

$ iperf -c 172.20.20.162
------------------------------------------------------------
Client connecting to 172.20.20.162, TCP port 5001
TCP window size: 23.5 KByte (default)
------------------------------------------------------------
[  3] local 172.20.20.166 port 43166 connected with 172.20.20.162 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  12.9 GBytes  11.1 Gbits/sec

But if I let it run for longer, it will apparently figure things out and
creep back up to normal speed and stay there until NFS strikes again.
It's like the kernel is caching some sort of hint that connectivity to
that other host sucks, and it has to either expire or be slowly overcome.
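One concrete candidate for that kind of cached hint is FreeBSD's TCP host cache, which retains per-destination RTT and ssthresh estimates across connections. The sysctls below assume a stock 9.x/10.x kernel; this is a suggestion of where to look, not a confirmed diagnosis:

```shell
sysctl net.inet.tcp.hostcache.list       # dump cached per-host entries
sysctl net.inet.tcp.hostcache.expire     # entry lifetime, in seconds
sysctl net.inet.tcp.hostcache.purge=1    # flush entries on the next sweep
```

If the slow-start behaviour disappears right after a purge, the cached ssthresh from the NFS-induced congestion episode is the likely culprit.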
Client: $ iperf -c 172.20.20.162 -t 60 ------------------------------------------------------------ Client connecting to 172.20.20.162, TCP port 5001 TCP window size: 1.00 MByte (default) ------------------------------------------------------------ [ 3] local 172.20.20.169 port 59367 connected with 172.20.20.162 port 5001 [ ID] Interval Transfer Bandwidth [ 3] 0.0-60.0 sec 56.2 GBytes 8.04 Gbits/sec Server: $ netstat -I vtnet1 -ihw 1 input (vtnet1) output packets errs idrops bytes packets errs bytes colls 7 0 0 420 0 0 0 0 7 0 0 420 0 0 0 0 8 0 0 480 0 0 0 0 8 0 0 480 0 0 0 0 7 0 0 420 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 11 0 0 12k 3 0 206 0 <--- starts here 17 0 0 227k 10 0 660 0 17 0 0 408k 10 0 660 0 17 0 0 417k 10 0 660 0 17 0 0 425k 10 0 660 0 17 0 0 438k 10 0 660 0 17 0 0 444k 10 0 660 0 16 0 0 453k 10 0 660 0 input (vtnet1) output packets errs idrops bytes packets errs bytes colls 16 0 0 463k 10 0 660 0 16 0 0 469k 10 0 660 0 16 0 0 482k 10 0 660 0 16 0 0 487k 10 0 660 0 16 0 0 496k 10 0 660 0 16 0 0 504k 10 0 660 0 18 0 0 510k 10 0 660 0 16 0 0 521k 10 0 660 0 17 0 0 524k 10 0 660 0 17 0 0 538k 10 0 660 0 17 0 0 540k 10 0 660 0 17 0 0 552k 10 0 660 0 17 0 0 554k 10 0 660 0 17 0 0 567k 10 0 660 0 16 0 0 568k 10 0 660 0 16 0 0 581k 10 0 660 0 16 0 0 582k 10 0 660 0 16 0 0 595k 10 0 660 0 16 0 0 595k 10 0 660 0 16 0 0 609k 10 0 660 0 16 0 0 609k 10 0 660 0 input (vtnet1) output packets errs idrops bytes packets errs bytes colls 16 0 0 620k 10 0 660 0 16 0 0 623k 10 0 660 0 17 0 0 632k 10 0 660 0 17 0 0 637k 10 0 660 0 8.7k 0 0 389M 4.4k 0 288k 0 42k 0 0 2.1G 21k 0 1.4M 0 41k 0 0 2.1G 20k 0 1.4M 0 38k 0 0 1.9G 19k 0 1.2M 0 40k 0 0 2.0G 20k 0 1.3M 0 40k 0 0 2.0G 20k 0 1.3M 0 40k 0 0 2G 20k 0 1.3M 0 39k 0 0 2G 20k 0 1.3M 0 43k 0 0 2.2G 22k 0 1.4M 0 42k 0 0 2.2G 21k 0 1.4M 0 39k 0 0 2G 19k 0 1.3M 0 38k 0 0 1.9G 19k 0 1.2M 0 42k 0 0 2.1G 21k 0 1.4M 0 
44k 0 0 2.2G 22k 0 1.4M 0 41k 0 0 2.1G 20k 0 1.3M 0 41k 0 0 2.1G 21k 0 1.4M 0 40k 0 0 2.0G 20k 0 1.3M 0 input (vtnet1) output packets errs idrops bytes packets errs bytes colls 43k 0 0 2.2G 22k 0 1.4M 0 41k 0 0 2.1G 20k 0 1.3M 0 40k 0 0 2.0G 20k 0 1.3M 0 42k 0 0 2.2G 21k 0 1.4M 0 39k 0 0 2G 19k 0 1.3M 0 42k 0 0 2.1G 21k 0 1.4M 0 40k 0 0 2.0G 20k 0 1.3M 0 42k 0 0 2.1G 21k 0 1.4M 0 38k 0 0 2G 19k 0 1.3M 0 39k 0 0 2G 20k 0 1.3M 0 45k 0 0 2.3G 23k 0 1.5M 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 It almost looks like something is limiting it to 10 packets per second. So confusing! TCP super slow start? Thanks! (Sorry Rick, forgot to reply all so you got an extra! :( ) Also, here's the netstat from the client side showing the 10 packets per second limit and eventual recovery: $ netstat -I net1 -ihw 1 input (net1) output packets errs idrops bytes packets errs bytes colls 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 15 0 0 962 11 0 114k 0 17 0 0 1.1k 10 0 368k 0 17 0 0 1.1k 10 0 411k 0 17 0 0 1.1k 10 0 425k 0 17 0 0 1.1k 10 0 432k 0 17 0 0 1.1k 10 0 439k 0 17 0 0 1.1k 10 0 452k 0 16 0 0 1k 10 0 457k 0 16 0 0 1k 10 0 467k 0 16 0 0 1k 10 0 477k 0 16 0 0 1k 10 0 481k 0 16 0 0 1k 10 0 495k 0 16 0 0 1k 10 0 498k 0 16 0 0 1k 10 0 510k 0 16 0 0 1k 10 0 515k 0 16 0 0 1k 10 0 524k 0 17 0 0 1.1k 10 0 532k 0 input (net1) output packets errs idrops bytes packets errs bytes colls 17 0 0 1.1k 10 0 538k 0 17 0 0 1.1k 10 0 548k 0 17 0 0 1.1k 10 0 552k 0 17 0 0 1.1k 10 0 562k 0 17 0 0 1.1k 10 0 566k 0 16 0 0 1k 10 0 576k 0 16 0 0 1k 10 0 580k 0 16 0 0 1k 10 0 590k 0 17 0 0 1.1k 10 0 594k 0 16 0 0 1k 10 0 603k 0 16 0 0 1k 10 0 609k 0 16 0 0 1k 10 0 614k 0 16 0 0 1k 10 0 623k 0 16 0 0 1k 10 0 626k 0 17 0 0 1.1k 10 0 637k 0 18 0 0 1.1k 10 0 637k 0 17k 0 0 1.1M 34k 0 1.7G 0 21k 0 0 1.4M 42k 0 2.1G 0 20k 0 0 1.3M 39k 0 2G 0 19k 0 0 1.2M 38k 0 1.9G 0 20k 0 0 1.3M 41k 0 2.0G 0 input (net1) output packets errs idrops bytes packets errs 
bytes colls 20k 0 0 1.3M 40k 0 2.0G 0 19k 0 0 1.2M 38k 0 1.9G 0 22k 0 0 1.5M 45k 0 2.3G 0 20k 0 0 1.3M 40k 0 2.1G 0 20k 0 0 1.3M 40k 0 2.1G 0 18k 0 0 1.2M 36k 0 1.9G 0 21k 0 0 1.4M 41k 0 2.1G 0 22k 0 0 1.4M 44k 0 2.2G 0 21k 0 0 1.4M 43k 0 2.2G 0 20k 0 0 1.3M 41k 0 2.1G 0 20k 0 0 1.3M 40k 0 2.0G 0 21k 0 0 1.4M 43k 0 2.2G 0 21k 0 0 1.4M 43k 0 2.2G 0 20k 0 0 1.3M 40k 0 2.0G 0 21k 0 0 1.4M 43k 0 2.2G 0 19k 0 0 1.2M 38k 0 1.9G 0 21k 0 0 1.4M 42k 0 2.1G 0 20k 0 0 1.3M 40k 0 2.0G 0 21k 0 0 1.4M 42k 0 2.1G 0 20k 0 0 1.3M 40k 0 2.0G 0 20k 0 0 1.3M 40k 0 2.0G 0 input (net1) output packets errs idrops bytes packets errs bytes colls 24k 0 0 1.6M 48k 0 2.5G 0 6.3k 0 0 417k 12k 0 647M 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 6 0 0 360 0 0 0 0 From owner-freebsd-net@FreeBSD.ORG Mon Jan 20 11:06:49 2014 Return-Path: Delivered-To: freebsd-net@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id AD87AA76 for ; Mon, 20 Jan 2014 11:06:49 +0000 (UTC) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 97D931D6F for ; Mon, 20 Jan 2014 11:06:49 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id s0KB6nID088410 for ; Mon, 20 Jan 2014 11:06:49 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id s0KB6nSF088408 for freebsd-net@FreeBSD.org; Mon, 20 Jan 2014 11:06:49 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 20 Jan 2014 11:06:49 GMT Message-Id: <201401201106.s0KB6nSF088408@freefall.freebsd.org> X-Authentication-Warning: 
freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-net@FreeBSD.org Subject: Current problem reports assigned to freebsd-net@FreeBSD.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 20 Jan 2014 11:06:49 -0000 Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/185496 net [re] RTL8169 doesn't receive unicast ethernet packets o kern/185427 net [igb] freebsd 8.4, 9.1 and 9.2 panic Double-Fault with o kern/185023 net [tun] Closing tun interface deconfigures IP address o kern/185022 net [tun] ls /dev/tun creates tun interface o kern/184311 net [bge] [panic] kernel panic with bge(4) on SunFire X210 o kern/184084 net [ral] kernel crash by ral (RT3090) o bin/183687 net [patch] route(8): route add -net 172.20 add wrong host o kern/183659 net [tcp] ]TCP stack lock contention with short-lived conn o conf/183407 net [rc.d] [patch] Routing restart returns non-zero exitco o kern/183391 net [oce] 10gigabit networking problems with Emulex OCE 11 o kern/183390 net [ixgbe] 10gigabit networking problems o kern/182917 net [igb] strange out traffic with igb interfaces o kern/182847 net [netinet6] [patch] Remove dead code o kern/182665 net [wlan] Kernel panic when creating second wlandev. 
o kern/182382 net [tcp] sysctl to set TCP CC method on BIG ENDIAN system
o kern/182297 net [cm] ArcNet driver fails to detect the link address -
o kern/182212 net [patch] [ng_mppc] ng_mppc(4) blocks on network errors
o kern/181970 net [re] LAN Realtek® 8111G is not supported by re driver
o kern/181931 net [vlan] [lagg] vlan over lagg over mlxen crashes the ke
o kern/181823 net [ip6] [patch] make ipv6 mroute return same errror code
o kern/181741 net [kernel] [patch] Packet loss when 'control' messages a
o kern/181703 net [re] [patch] Fix Realtek 8111G Ethernet controller not
o kern/181657 net [bpf] [patch] BPF_COP/BPF_COPX instruction reservation
o kern/181257 net [bge] bge link status change
o kern/181236 net [igb] igb driver unstable work
o kern/181135 net [netmap] [patch] sys/dev/netmap patch for Linux compat
o kern/181131 net [netmap] [patch] sys/dev/netmap memory allocation impr
o kern/181006 net [run] [patch] mbuf leak in run(4) driver
o kern/180893 net [if_ethersubr] [patch] Packets received with own LLADD
o kern/180844 net [panic] [re] Intermittent panic (re driver?)
o kern/180775 net [bxe] if_bxe driver broken with Broadcom BCM57711 card
o kern/180722 net [bluetooth] bluetooth takes 30-50 attempts to pair to
s kern/180468 net [request] LOCAL_PEERCRED support for PF_INET
o kern/180065 net [netinet6] [patch] Multicast loopback to own host brok
o kern/179926 net [lacp] [patch] active aggregator selection bug
o kern/179824 net [ixgbe] System (9.1-p4) hangs on heavy ixgbe network t
o kern/179733 net [lagg] [patch] interface loses capabilities when proto
o kern/179429 net [tap] STP enabled tap bridge
o kern/179299 net [igb] Intel X540-T2 - unstable driver
a kern/179264 net [vimage] [pf] Core dump with Packet filter and VIMAGE
o kern/178947 net [arp] arp rejecting not working
o kern/178782 net [ixgbe] 82599EB SFP does not work with passthrough und
o kern/178612 net [run] kernel panic due the problems with run driver
o kern/178472 net [ip6] [patch] make return code consistent with IPv4 co
o kern/178079 net [tcp] Switching TCP CC algorithm panics on sparc64 wit
s kern/178071 net FreeBSD unable to recongize Kontron (Industrial Comput
o kern/177905 net [xl] [panic] ifmedia_set when pluging CardBus LAN card
o kern/177618 net [bridge] Problem with bridge firewall with trunk ports
o kern/177402 net [igb] [pf] problem with ethernet driver igb + pf / alt
o kern/177400 net [jme] JMC25x 1000baseT establishment issues
o kern/177366 net [ieee80211] negative malloc(9) statistics for 80211nod
f kern/177362 net [netinet] [patch] Wrong control used to return TOS
o kern/177194 net [netgraph] Unnamed netgraph nodes for vlan interfaces
o kern/177184 net [bge] [patch] enable wake on lan
o kern/177139 net [igb] igb drops ethernet ports 2 and 3
o kern/176884 net [re] re0 flapping up/down
o kern/176671 net [epair] MAC address for epair device not unique
o kern/176484 net [ipsec] [enc] [patch] panic: IPsec + enc(4); device na
o kern/176446 net [netinet] [patch] Concurrency in ixgbe driving out-of-
o kern/176420 net [kernel] [patch] incorrect errno for LOCAL_PEERCRED
o kern/176419 net [kernel] [patch] socketpair support for LOCAL_PEERCRED
o kern/176401 net [netgraph] page fault in netgraph
o kern/176167 net [ipsec][lagg] using lagg and ipsec causes immediate pa
o kern/176027 net [em] [patch] flow control systcl consistency for em dr
o kern/176026 net [tcp] [patch] TCP wrappers caused quite a lot of warni
o kern/175864 net [re] Intel MB D510MO, onboard ethernet not working aft
o kern/175852 net [amd64] [patch] in_cksum_hdr() behaves differently on
o kern/175734 net no ethernet detected on system with EG20T PCH chipset
o kern/175267 net [pf] [tap] pf + tap keep state problem
o kern/175236 net [epair] [gif] epair and gif Devices On Bridge
o kern/175182 net [panic] kernel panic on RADIX_MPATH when deleting rout
o kern/175153 net [tcp] will there miss a FIN when do TSO?
o kern/174959 net [net] [patch] rnh_walktree_from visits spurious nodes
o kern/174958 net [net] [patch] rnh_walktree_from makes unreasonable ass
o kern/174897 net [route] Interface routes are broken
o kern/174851 net [bxe] [patch] UDP checksum offload is wrong in bxe dri
o kern/174850 net [bxe] [patch] bxe driver does not receive multicasts
o kern/174849 net [bxe] [patch] bxe driver can hang kernel when reset
o kern/174822 net [tcp] Page fault in tcp_discardcb under high traffic
o kern/174602 net [gif] [ipsec] traceroute issue on gif tunnel with ipse
o kern/174535 net [tcp] TCP fast retransmit feature works strange
o kern/173871 net [gif] process of 'ifconfig gif0 create hangs' when if_
o kern/173475 net [tun] tun(4) stays opened by PID after process is term
o kern/173201 net [ixgbe] [patch] Missing / broken ixgbe sysctl's and tu
o kern/173137 net [em] em(4) unable to run at gigabit with 9.1-RC2
o kern/173002 net [patch] data type size problem in if_spppsubr.c
o kern/172895 net [ixgb] [ixgbe] do not properly determine link-state
o kern/172683 net [ip6] Duplicate IPv6 Link Local Addresses
o kern/172675 net [netinet] [patch] sysctl_tcp_hc_list (net.inet.tcp.hos
p kern/172113 net [panic] [e1000] [patch] 9.1-RC1/amd64 panices in igb(4
o kern/171840 net [ip6] IPv6 packets transmitting only on queue 0
o kern/171739 net [bce] [panic] bce related kernel panic
o kern/171711 net [dummynet] [panic] Kernel panic in dummynet
o kern/171532 net [ndis] ndis(4) driver includes 'pccard'-specific code,
o kern/171531 net [ndis] undocumented dependency for ndis(4)
o kern/171524 net [ipmi] ipmi driver crashes kernel by reboot or shutdow
s kern/171508 net [epair] [request] Add the ability to name epair device
o kern/171228 net [re] [patch] if_re - eeprom write issues
o kern/170701 net [ppp] killl ppp or reboot with active ppp connection c
o kern/170267 net [ixgbe] IXGBE_LE32_TO_CPUS is probably an unintentiona
o kern/170081 net [fxp] pf/nat/jails not working if checksum offloading
o kern/169898 net ifconfig(8) fails to set MTU on multiple interfaces.
o kern/169676 net [bge] [hang] system hangs, fully or partially after re
o kern/169620 net [ng] [pf] ng_l2tp incoming packet bypass pf firewall
o kern/169459 net [ppp] umodem/ppp/3g stopped working after update from
o kern/169438 net [ipsec] ipv4-in-ipv6 tunnel mode IPsec does not work
p kern/168294 net [ixgbe] [patch] ixgbe driver compiled in kernel has no
o kern/168246 net [em] Multiple em(4) not working with qemu
o kern/168245 net [arp] [regression] Permanent ARP entry not deleted on
o kern/168244 net [arp] [regression] Unable to manually remove permanent
o kern/168183 net [bce] bce driver hang system
o kern/167603 net [ip] IP fragment reassembly's broken: file transfer ov
o kern/167500 net [em] [panic] Kernel panics in em driver
o kern/167325 net [netinet] [patch] sosend sometimes return EINVAL with
o kern/167202 net [igmp]: Sending multiple IGMP packets crashes kernel
o kern/166462 net [gre] gre(4) when using a tunnel source address from c
o kern/166285 net [arp] FreeBSD v8.1 REL p8 arp: unknown hardware addres
o kern/166255 net [net] [patch] It should be possible to disable "promis
p kern/165903 net mbuf leak
o kern/165622 net [ndis][panic][patch] Unregistered use of FPU in kernel
s kern/165562 net [request] add support for Intel i350 in FreeBSD 7.4
o kern/165526 net [bxe] UDP packets checksum calculation whithin if_bxe
o kern/165488 net [ppp] [panic] Fatal trap 12 jails and ppp , kernel wit
o kern/165305 net [ip6] [request] Feature parity between IP_TOS and IPV6
o kern/165296 net [vlan] [patch] Fix EVL_APPLY_VLID, update EVL_APPLY_PR
o kern/165181 net [igb] igb freezes after about 2 weeks of uptime
o kern/165174 net [patch] [tap] allow tap(4) to keep its address on clos
o kern/165152 net [ip6] Does not work through the issue of ipv6 addresse
o kern/164495 net [igb] connect double head igb to switch cause system t
o kern/164490 net [pfil] Incorrect IP checksum on pfil pass from ip_outp
o kern/164475 net [gre] gre misses RUNNING flag after a reboot
o kern/164265 net [netinet] [patch] tcp_lro_rx computes wrong checksum i
o kern/163903 net [igb] "igb0:tx(0)","bpf interface lock" v2.2.5 9-STABL
o kern/163481 net freebsd do not add itself to ping route packet
o kern/162927 net [tun] Modem-PPP error ppp[1538]: tun0: Phase: Clearing
o kern/162558 net [dummynet] [panic] seldom dummynet panics
o kern/162153 net [em] intel em driver 7.2.4 don't compile
o kern/162110 net [igb] [panic] RELENG_9 panics on boot in IGB driver -
o kern/162028 net [ixgbe] [patch] misplaced #endif in ixgbe.c
o kern/161277 net [em] [patch] BMC cannot receive IPMI traffic after loa
o kern/160873 net [igb] igb(4) from HEAD fails to build on 7-STABLE
o kern/160750 net Intel PRO/1000 connection breaks under load until rebo
o kern/160693 net [gif] [em] Multicast packet are not passed from GIF0 t
o kern/160293 net [ieee80211] ppanic] kernel panic during network setup
o kern/160206 net [gif] gifX stops working after a while (IPv6 tunnel)
o kern/159817 net [udp] write UDPv4: No buffer space available (code=55)
o kern/159629 net [ipsec] [panic] kernel panic with IPsec in transport m
o kern/159621 net [tcp] [panic] panic: soabort: so_count
o kern/159603 net [netinet] [patch] in_ifscrubprefix() - network route c
o kern/159601 net [netinet] [patch] in_scrubprefix() - loopback route re
o kern/159294 net [em] em watchdog timeouts
o kern/159203 net [wpi] Intel 3945ABG Wireless LAN not support IBSS
o kern/158930 net [bpf] BPF element leak in ifp->bpf_if->bif_dlist
o kern/158726 net [ip6] [patch] ICMPv6 Router Announcement flooding limi
o kern/158694 net [ix] [lagg] ix0 is not working within lagg(4)
o kern/158665 net [ip6] [panic] kernel pagefault in in6_setscope()
o kern/158635 net [em] TSO breaks BPF packet captures with em driver
f kern/157802 net [dummynet] [panic] kernel panic in dummynet
o kern/157785 net amd64 + jail + ipfw + natd = very slow outbound traffi
o kern/157418 net [em] em driver lockup during boot on Supermicro X9SCM-
o kern/157410 net [ip6] IPv6 Router Advertisements Cause Excessive CPU U
o kern/157287 net [re] [panic] INVARIANTS panic (Memory modified after f
o kern/157200 net [network.subr] [patch] stf(4) can not communicate betw
o kern/157182 net [lagg] lagg interface not working together with epair
o kern/156877 net [dummynet] [panic] dummynet move_pkt() null ptr derefe
o kern/156667 net [em] em0 fails to init on CURRENT after March 17
o kern/156408 net [vlan] Routing failure when using VLANs vs. Physical e
o kern/156328 net [icmp]: host can ping other subnet but no have IP from
o kern/156317 net [ip6] Wrong order of IPv6 NS DAD/MLD Report
o kern/156279 net [if_bridge][divert][ipfw] unable to correctly re-injec
o kern/156226 net [lagg]: failover does not announce the failover to swi
o kern/156030 net [ip6] [panic] Crash in nd6_dad_start() due to null ptr
o kern/155680 net [multicast] problems with multicast
s kern/155642 net [new driver] [request] Add driver for Realtek RTL8191S
o kern/155597 net [panic] Kernel panics with "sbdrop" message
o kern/155420 net [vlan] adding vlan break existent vlan
o kern/155177 net [route] [panic] Panic when inject routes in kernel
o kern/155010 net [msk] ntfs-3g via iscsi using msk driver cause kernel
o kern/154943 net [gif] ifconfig gifX create on existing gifX clears IP
s kern/154851 net [new driver] [request]: Port brcm80211 driver from Lin
o kern/154850 net [netgraph] [patch] ng_ether fails to name nodes when t
o kern/154679 net [em] Fatal trap 12: "em1 taskq" only at startup (8.1-R
o kern/154600 net [tcp] [panic] Random kernel panics on tcp_output
o kern/154557 net [tcp] Freeze tcp-session of the clients, if in the gat
o kern/154443 net [if_bridge] Kernel module bridgestp.ko missing after u
o kern/154286 net [netgraph] [panic] 8.2-PRERELEASE panic in netgraph
o kern/154255 net [nfs] NFS not responding
o kern/154214 net [stf] [panic] Panic when creating stf interface
o kern/154185 net race condition in mb_dupcl
p kern/154169 net [multicast] [ip6] Node Information Query multicast add
o kern/154134 net [ip6] stuck kernel state in LISTEN on ipv6 daemon whic
o kern/154091 net [netgraph] [panic] netgraph, unaligned mbuf?
o conf/154062 net [vlan] [patch] change to way of auto-generatation of v
o kern/153937 net [ral] ralink panics the system (amd64 freeBSDD 8.X) wh
o kern/153936 net [ixgbe] [patch] MPRC workaround incorrectly applied to
o kern/153816 net [ixgbe] ixgbe doesn't work properly with the Intel 10g
o kern/153772 net [ixgbe] [patch] sysctls reference wrong XON/XOFF varia
o kern/153497 net [netgraph] netgraph panic due to race conditions
o kern/153454 net [patch] [wlan] [urtw] Support ad-hoc and hostap modes
o kern/153308 net [em] em interface use 100% cpu
o kern/153244 net [em] em(4) fails to send UDP to port 0xffff
o kern/152893 net [netgraph] [panic] 8.2-PRERELEASE panic in netgraph
o kern/152853 net [em] tftpd (and likely other udp traffic) fails over e
o kern/152828 net [em] poor performance on 8.1, 8.2-PRE
o kern/152569 net [net]: Multiple ppp connections and routing table prob
o kern/152235 net [arp] Permanent local ARP entries are not properly upd
o kern/152141 net [vlan] [patch] encapsulate vlan in ng_ether before out
o kern/152036 net [libc] getifaddrs(3) returns truncated sockaddrs for n
o kern/151690 net [ep] network connectivity won't work until dhclient is
o kern/151681 net [nfs] NFS mount via IPv6 leads to hang on client with
o kern/151593 net [igb] [panic] Kernel panic when bringing up igb networ
o kern/150920 net [ixgbe][igb] Panic when packets are dropped with heade
o kern/150557 net [igb] igb0: Watchdog timeout -- resetting
o kern/150251 net [patch] [ixgbe] Late cable insertion broken
o kern/150249 net [ixgbe] Media type detection broken
o bin/150224 net ppp(8) does not reassign static IP after kill -KILL co
f kern/149969 net [wlan] [ral] ralink rt2661 fails to maintain connectio
o kern/149643 net [rum] device not sending proper beacon frames in ap mo
o kern/149609 net [panic] reboot after adding second default route
o kern/149117 net [inet] [patch] in_pcbbind: redundant test
o kern/149086 net [multicast] Generic multicast join failure in 8.1
o kern/148018 net [flowtable] flowtable crashes on ia64
o kern/147912 net [boot] FreeBSD 8 Beta won't boot on Thinkpad i1300 11
o kern/147894 net [ipsec] IPv6-in-IPv4 does not work inside an ESP-only
o kern/147155 net [ip6] setfb not work with ipv6
o kern/146845 net [libc] close(2) returns error 54 (connection reset by
f kern/146792 net [flowtable] flowcleaner 100% cpu's core load
o kern/146719 net [pf] [panic] PF or dumynet kernel panic
o kern/146534 net [icmp6] wrong source address in echo reply
o kern/146427 net [mwl] Additional virtual access points don't work on m
f kern/146394 net [vlan] IP source address for outgoing connections
o bin/146377 net [ppp] [tun] Interface doesn't clear addresses when PPP
o kern/146358 net [vlan] wrong destination MAC address
o kern/146165 net [wlan] [panic] Setting bssid in adhoc mode causes pani
o kern/146037 net [panic] mpd + CoA = kernel panic
o kern/145825 net [panic] panic: soabort: so_count
o kern/145728 net [lagg] Stops working lagg between two servers.
p kern/145600 net TCP/ECN behaves different to CE/CWR than ns2 reference
f kern/144917 net [flowtable] [panic] flowtable crashes system [regressi
o kern/144882 net MacBookPro =>4.1 does not connect to BSD in hostap wit
o kern/144874 net [if_bridge] [patch] if_bridge frees mbuf after pfil ho
o conf/144700 net [rc.d] async dhclient breaks stuff for too many people
o kern/144616 net [nat] [panic] ip_nat panic FreeBSD 7.2
f kern/144315 net [ipfw] [panic] freebsd 8-stable reboot after add ipfw
o kern/144231 net bind/connect/sendto too strict about sockaddr length
o kern/143846 net [gif] bringing gif3 tunnel down causes gif0 tunnel to
s kern/143673 net [stf] [request] there should be a way to support multi
o kern/143622 net [pfil] [patch] unlock pfil lock while calling firewall
o kern/143593 net [ipsec] When using IPSec, tcpdump doesn't show outgoin
o kern/143591 net [ral] RT2561C-based DLink card (DWL-510) fails to work
o kern/143208 net [ipsec] [gif] IPSec over gif interface not working
o kern/143034 net [panic] system reboots itself in tcp code [regression]
o kern/142877 net [hang] network-related repeatable 8.0-STABLE hard hang
o kern/142774 net Problem with outgoing connections on interface with mu
o kern/142772 net [libc] lla_lookup: new lle malloc failed
f kern/142518 net [em] [lagg] Problem on 8.0-STABLE with em and lagg
o kern/142018 net [iwi] [patch] Possibly wrong interpretation of beacon-
o kern/141861 net [wi] data garbled with WEP and wi(4) with Prism 2.5
f kern/141741 net Etherlink III NIC won't work after upgrade to FBSD 8,
o kern/140742 net rum(4) Two asus-WL167G adapters cannot talk to each ot
o kern/140682 net [netgraph] [panic] random panic in netgraph
f kern/140634 net [vlan] destroying if_lagg interface with if_vlan membe
o kern/140619 net [ifnet] [patch] refine obsolete if_var.h comments desc
o kern/140346 net [wlan] High bandwidth use causes loss of wlan connecti
o kern/140142 net [ip6] [panic] FreeBSD 7.2-amd64 panic w/IPv6
o kern/140066 net [bwi] install report for 8.0 RC 2 (multiple problems)
o kern/139387 net [ipsec] Wrong lenth of PF_KEY messages in promiscuous
o bin/139346 net [patch] arp(8) add option to remove static entries lis
o kern/139268 net [if_bridge] [patch] allow if_bridge to forward just VL
p kern/139204 net [arp] DHCP server replies rejected, ARP entry lost bef
o kern/139117 net [lagg] + wlan boot timing (EBUSY)
o kern/138850 net [dummynet] dummynet doesn't work correctly on a bridge
o kern/138782 net [panic] sbflush_internal: cc 0 || mb 0xffffff004127b00
o kern/138688 net [rum] possibly broken on 8 Beta 4 amd64: able to wpa a
o kern/138678 net [lo] FreeBSD does not assign linklocal address to loop
o kern/138407 net [gre] gre(4) interface does not come up after reboot
o kern/138332 net [tun] [lor] ifconfig tun0 destroy causes LOR if_adata/
o kern/138266 net [panic] kernel panic when udp benchmark test used as r
f kern/138029 net [bpf] [panic] periodically kernel panic and reboot
o kern/137881 net [netgraph] [panic] ng_pppoe fatal trap 12
p bin/137841 net [patch] wpa_supplicant(8) cannot verify SHA256 signed
p kern/137776 net [rum] panic in rum(4) driver on 8.0-BETA2
o bin/137641 net ifconfig(8): various problems with "vlan_device.vlan_i
o kern/137392 net [ip] [panic] crash in ip_nat.c line 2577
o kern/137372 net [ral] FreeBSD doesn't support wireless interface from
o kern/137089 net [lagg] lagg falsely triggers IPv6 duplicate address de
o kern/136911 net [netgraph] [panic] system panic on kldload ng_bpf.ko t
o kern/136618 net [pf][stf] panic on cloning interface without unit numb
o kern/135502 net [periodic] Warning message raised by rtfree function i
o kern/134583 net [hang] Machine with jail freezes after random amount o
o kern/134531 net [route] [panic] kernel crash related to routes/zebra
o kern/134157 net [dummynet] dummynet loads cpu for 100% and make a syst
o kern/133969 net [dummynet] [panic] Fatal trap 12: page fault while in
o kern/133968 net [dummynet] [panic] dummynet kernel panic
o kern/133736 net [udp] ip_id not protected ...
o kern/133595 net [panic] Kernel Panic at pcpu.h:195
o kern/133572 net [ppp] [hang] incoming PPTP connection hangs the system
o kern/133490 net [bpf] [panic] 'kmem_map too small' panic on Dell r900
o kern/133235 net [netinet] [patch] Process SIOCDLIFADDR command incorre
f kern/133213 net arp and sshd errors on 7.1-PRERELEASE
o kern/133060 net [ipsec] [pfsync] [panic] Kernel panic with ipsec + pfs
o kern/132889 net [ndis] [panic] NDIS kernel crash on load BCM4321 AGN d
o conf/132851 net [patch] rc.conf(5): allow to setfib(1) for service run
o kern/132734 net [ifmib] [panic] panic in net/if_mib.c
o kern/132705 net [libwrap] [patch] libwrap - infinite loop if hosts.all
o kern/132672 net [ndis] [panic] ndis with rt2860.sys causes kernel pani
o kern/132354 net [nat] Getting some packages to ipnat(8) causes crash
o kern/132277 net [crypto] [ipsec] poor performance using cryptodevice f
o kern/131781 net [ndis] ndis keeps dropping the link
o kern/131776 net [wi] driver fails to init
o kern/131753 net [altq] [panic] kernel panic in hfsc_dequeue
o bin/131365 net route(8): route add changes interpretation of network
f kern/130820 net [ndis] wpa_supplicant(8) returns 'no space on device'
o kern/130628 net [nfs] NFS / rpc.lockd deadlock on 7.1-R
o kern/130525 net [ndis] [panic] 64 bit ar5008 ndisgen-erated driver cau
o kern/130311 net [wlan_xauth] [panic] hostapd restart causing kernel pa
o kern/130109 net [ipfw] Can not set fib for packets originated from loc
f kern/130059 net [panic] Leaking 50k mbufs/hour
f kern/129719 net [nfs] [panic] Panic during shutdown, tcp_ctloutput: in
o kern/129517 net [ipsec] [panic] double fault / stack overflow
f kern/129508 net [carp] [panic] Kernel panic with EtherIP (may be relat
o kern/129219 net [ppp] Kernel panic when using kernel mode ppp
o kern/129197 net [panic] 7.0 IP stack related panic
o kern/129036 net [ipfw] 'ipfw fwd' does not change outgoing interface n
o bin/128954 net ifconfig(8) deletes valid routes
o bin/128602 net [an] wpa_supplicant(8) crashes with an(4)
o kern/128448 net [nfs] 6.4-RC1 Boot Fails if NFS Hostname cannot be res
o bin/128295 net [patch] ifconfig(8) does not print TOE4 or TOE6 capabi
o bin/128001 net wpa_supplicant(8), wlan(4), and wi(4) issues
o kern/127826 net [iwi] iwi0 driver has reduced performance and connecti
o kern/127815 net [gif] [patch] if_gif does not set vlan attributes from
o kern/127724 net [rtalloc] rtfree: 0xc5a8f870 has 1 refs
f bin/127719 net [arp] arp: Segmentation fault (core dumped)
f kern/127528 net [icmp]: icmp socket receives icmp replies not owned by
p kern/127360 net [socket] TOE socket options missing from sosetopt()
o bin/127192 net routed(8) removes the secondary alias IP of interface
f kern/127145 net [wi]: prism (wi) driver crash at bigger traffic
o kern/126895 net [patch] [ral] Add antenna selection (marked as TBD)
o kern/126874 net [vlan]: Zebra problem if ifconfig vlanX destroy
o kern/126695 net rtfree messages and network disruption upon use of if_
o kern/126339 net [ipw] ipw driver drops the connection
o kern/126075 net [inet] [patch] internet control accesses beyond end of
o bin/125922 net [patch] Deadlock in arp(8)
o kern/125920 net [arp] Kernel Routing Table loses Ethernet Link status
o kern/125845 net [netinet] [patch] tcp_lro_rx() should make use of hard
o kern/125258 net [socket] socket's SO_REUSEADDR option does not work
o kern/125239 net [gre] kernel crash when using gre
o kern/124341 net [ral] promiscuous mode for wireless device ral0 looses
o kern/124225 net [ndis] [patch] ndis network driver sometimes loses net
o kern/124160 net [libc] connect(2) function loops indefinitely
o kern/124021 net [ip6] [panic] page fault in nd6_output()
o kern/123968 net [rum] [panic] rum driver causes kernel panic with WPA.
o kern/123892 net [tap] [patch] No buffer space available
o kern/123890 net [ppp] [panic] crash & reboot on work with PPP low-spee
o kern/123858 net [stf] [patch] stf not usable behind a NAT
o kern/123758 net [panic] panic while restarting net/freenet6
o bin/123633 net ifconfig(8) doesn't set inet and ether address in one
o kern/123559 net [iwi] iwi periodically disassociates/associates [regre
o bin/123465 net [ip6] route(8): route add -inet6 -interfac
o kern/123463 net [ipsec] [panic] repeatable crash related to ipsec-tool
o conf/123330 net [nsswitch.conf] Enabling samba wins in nsswitch.conf c
o kern/123160 net [ip] Panic and reboot at sysctl kern.polling.enable=0
o kern/122989 net [swi] [panic] 6.3 kernel panic in swi1: net
o kern/122954 net [lagg] IPv6 EUI64 incorrectly chosen for lagg devices
f kern/122780 net [lagg] tcpdump on lagg interface during high pps wedge
o kern/122685 net It is not visible passing packets in tcpdump(1)
o kern/122319 net [wi] imposible to enable ad-hoc demo mode with Orinoco
o kern/122290 net [netgraph] [panic] Netgraph related "kmem_map too smal
o kern/122252 net [ipmi] [bge] IPMI problem with BCM5704 (does not work
o kern/122033 net [ral] [lor] Lock order reversal in ral0 at bootup ieee
o bin/121895 net [patch] rtsol(8)/rtsold(8) doesn't handle managed netw
s kern/121774 net [swi] [panic] 6.3 kernel panic in swi1: net
o kern/121555 net [panic] Fatal trap 12: current process = 12 (swi1: net
o kern/121534 net [ipl] [nat] FreeBSD Release 6.3 Kernel Trap 12:
o kern/121443 net [gif] [lor] icmp6_input/nd6_lookup
o kern/121437 net [vlan] Routing to layer-2 address does not work on VLA
o bin/121359 net [patch] [security] ppp(8): fix local stack overflow in
o kern/121257 net [tcp] TSO + natd -> slow outgoing tcp traffic
o kern/121181 net [panic] Fatal trap 3: breakpoint instruction fault whi
o kern/120966 net [rum] kernel panic with if_rum and WPA encryption
o kern/120566 net [request]: ifconfig(8) make order of arguments more fr
o kern/120304 net [netgraph] [patch] netgraph source assumes 32-bit time
o kern/120266 net [udp] [panic] gnugk causes kernel panic when closing U
o bin/120060 net routed(8) deletes link-level routes in the presence of
o kern/119945 net [rum] [panic] rum device in hostap mode, cause kernel
o kern/119791 net [nfs] UDP NFS mount of aliased IP addresses from a Sol
o kern/119617 net [nfs] nfs error on wpa network when reseting/shutdown
f kern/119516 net [ip6] [panic] _mtx_lock_sleep: recursed on non-recursi
o kern/119432 net [arp] route add -host -iface causes arp e
o kern/119225 net [wi] 7.0-RC1 no carrier with Prism 2.5 wifi card [regr
o kern/118727 net [netgraph] [patch] [request] add new ng_pf module
o kern/117423 net [vlan] Duplicate IP on different interfaces
o bin/117339 net [patch] route(8): loading routing management commands
o bin/116643 net [patch] [request] fstat(1): add INET/INET6 socket deta
o kern/116185 net [iwi] if_iwi driver leads system to reboot
o kern/115239 net [ipnat] panic with 'kmem_map too small' using ipnat
o kern/115019 net [netgraph] ng_ether upper hook packet flow stops on ad
o kern/115002 net [wi] if_wi timeout. failed allocation (busy bit). ifco
o kern/114915 net [patch] [pcn] pcn (sys/pci/if_pcn.c) ethernet driver f
o kern/113432 net [ucom] WARNING: attempt to net_add_domain(netgraph) af
o kern/112722 net [ipsec] [udp] IP v4 udp fragmented packet reject
o kern/112686 net [patm] patm driver freezes System (FreeBSD 6.2-p4) i38
o bin/112557 net [patch] ppp(8) lock file should not use symlink name
o kern/112528 net [nfs] NFS over TCP under load hangs with "impossible p
o kern/111537 net [inet6] [patch] ip6_input() treats mbuf cluster wrong
o kern/111457 net [ral] ral(4) freeze
o kern/110284 net [if_ethersubr] Invalid Assumption in SIOCSIFADDR in et
o kern/110249 net [kernel] [regression] [patch] setsockopt() error regre
o kern/109470 net [wi] Orinoco Classic Gold PC Card Can't Channel Hop
o bin/108895 net pppd(8): PPPoE dead connections on 6.2 [regression]
f kern/108197 net [panic] [gif] [ip6] if_delmulti reference counting pan
o kern/107944 net [wi] [patch] Forget to unlock mutex-locks
o conf/107035 net [patch] bridge(8): bridge interface given in rc.conf n
o kern/106444 net [netgraph] [panic] Kernel Panic on Binding to an ip to
o kern/106316 net [dummynet] dummynet with multipass ipfw drops packets
o kern/105945 net Address can disappear from network interface
s kern/105943 net Network stack may modify read-only mbuf chain copies
o bin/105925 net problems with ifconfig(8) and vlan(4) [regression]
o kern/104851 net [inet6] [patch] On link routes not configured when usi
o kern/104751 net [netgraph] kernel panic, when getting info about my tr
o kern/104738 net [inet] [patch] Reentrant problem with inet_ntoa in the
o kern/103191 net Unpredictable reboot
o kern/103135 net [ipsec] ipsec with ipfw divert (not NAT) encodes a pac
o kern/102540 net [netgraph] [patch] supporting vlan(4) by ng_fec(4)
o conf/102502 net [netgraph] [patch] ifconfig name does't rename netgrap
o kern/102035 net [plip] plip networking disables parallel port printing
o kern/100709 net [libc] getaddrinfo(3) should return TTL info
o kern/100519 net [netisr] suggestion to fix suboptimal network polling
o kern/98597 net [inet6] Bug in FreeBSD 6.1 IPv6 link-local DAD procedu
o bin/98218 net wpa_supplicant(8) blacklist not working
o kern/97306 net [netgraph] NG_L2TP locks after connection with failed
o conf/97014 net [gif] gifconfig_gif? in rc.conf does not recognize IPv
f kern/96268 net [socket] TCP socket performance drops by 3000% if pack
o kern/95519 net [ral] ral0 could not map mbuf
o kern/95288 net [pppd] [tty] [panic] if_ppp panic in sys/kern/tty_subr
o kern/95277 net [netinet] [patch] IP Encapsulation mask_match() return
o kern/95267 net packet drops periodically appear
f kern/93378 net [tcp] Slow data transfer in Postfix and Cyrus IMAP (wo
o kern/93019 net [ppp] ppp and tunX problems: no traffic after restarti
o kern/92880 net [libc] [patch] almost rewritten inet_network(3) functi
s kern/92279 net [dc] Core faults everytime I reboot, possible NIC issu
o kern/91859 net [ndis] if_ndis does not work with Asus WL-138
o kern/91364 net [ral] [wep] WF-511 RT2500 Card PCI and WEP
o kern/91311 net [aue] aue interface hanging
o kern/87421 net [netgraph] [panic]: ng_ether + ng_eiface + if_bridge
o kern/86871 net [tcp] [patch] allocation logic for PCBs in TIME_WAIT s
o kern/86427 net [lor] Deadlock with FASTIPSEC and nat
o kern/85780 net 'panic: bogus refcnt 0' in routing/ipv6
o bin/85445 net ifconfig(8): deprecated keyword to ifconfig inoperativ
o bin/82975 net route change does not parse classfull network as given
o kern/82881 net [netgraph] [panic] ng_fec(4) causes kernel panic after
o kern/82468 net Using 64MB tcp send/recv buffers, trafficflow stops, i
o bin/82185 net [patch] ndp(8) can delete the incorrect entry
o kern/81095 net IPsec connection stops working if associated network i
o kern/78968 net FreeBSD freezes on mbufs exhaustion (network interface
o kern/78090 net [ipf] ipf filtering on bridged packets doesn't work if
o kern/77341 net [ip6] problems with IPV6 implementation
o kern/75873 net Usability problem with non-RFC-compliant IP spoof prot
s kern/75407 net [an] an(4): no carrier after short time
a kern/71474 net [route] route lookup does not skip interfaces marked d
o kern/71469 net default route to internet magically disappears with mu
o kern/68889 net [panic] m_copym, length > size of mbuf chain
o kern/66225 net [netgraph] [patch] extend ng_eiface(4) control message
o kern/65616 net IPSEC can't detunnel GRE packets after real ESP encryp
s kern/60293 net [patch] FreeBSD arp poison patch
a kern/56233 net IPsec tunnel (ESP) over IPv6: MTU computation is wrong
s bin/41647 net ifconfig(8) doesn't accept lladdr along with inet addr
o kern/39937 net ipstealth issue
a kern/38554 net [patch] changing interface ipaddress doesn't seem to w
o kern/31940 net ip queue length too short for >500kpps
o kern/31647 net [libc] socket calls can return undocumented EINVAL
o kern/30186 net [libc] getaddrinfo(3) does not handle incorrect servna
f kern/24959 net [patch] proper TCP_NOPUSH/TCP_CORK compatibility
o conf/23063 net [arp] [patch] for static ARP tables in rc.network
o kern/21998 net [socket] [patch] ident only for outgoing connections
o kern/5877 net [socket] sb_cc counts control data as well as data dat

476 problems total.
From owner-freebsd-net@FreeBSD.ORG Mon Jan 20 12:50:22 2014
Message-ID: <52DD1914.7090506@iet.unipi.it>
Date: Mon, 20 Jan 2014 13:39:48 +0100
From: Giuseppe Lettieri
To: Wang Weidong, facoltà
Cc: Luigi Rizzo, Vincenzo Maffione, net@freebsd.org
Subject: Re: netmap: I got some troubles with netmap
In-Reply-To: <52D8A5E1.9020408@huawei.com>
References: <52D74E15.1040909@huawei.com> <92C7725B-B30A-4A19-925A-A93A2489A525@iet.unipi.it> <52D8A5E1.9020408@huawei.com>
List-Id: Networking and TCP/IP with FreeBSD
--------------080404040009080203030301 Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 8bit Hi Wang, OK, you are using the netmap support in the upstream qemu git. That does not yet include all our modifications, some of which are very important for high throughput with VALE. In particular, the upstream qemu does not include the batching improvements in the frontend/backend interface, and it does not include the "map ring" optimization of the e1000 frontend. Please find attached a gzipped patch that contains all of our qemu code. The patch is against the latest upstream master (commit 1cf892ca). Please ./configure the patched qemu with the following option, in addition to any other option you may need: --enable-e1000-paravirt --enable-netmap \ --extra-cflags=-I/path/to/netmap/sys/directory Note that --enable-e1000-paravirt is needed to enable the "map ring" optimization in the e1000 frontend, even if you are not going to use the e1000-paravirt device. Now you should be able to rerun your tests. I am also attaching a README file that describes some more tests you may want to run. Cheers, Giuseppe Il 17/01/2014 04:39, Wang Weidong ha scritto: > On 2014/1/16 18:24, facoltà wrote: >> Hi Wang, >> >> I work with Luigi, please check the replies below. >> >> >> Il giorno 16/gen/2014, alle ore 04:53, Luigi Rizzo > ha scritto: >> >>> >>> > [...] >>> Problem 3: >>> "qemu-system-x86_64 -m 1024 -boot c -net nic -net netmap,ifname=vale0:1 -hda /home/disk/nm_d0 >>> -enable-kvm -vnc :0", Use that command to start a vm. >>> >>> I test on the vm. >>> #pkt-gen -i eth0 -f tx -l 60 -n 20000000, >>> the speed is up to 1.02 Mpps. >> >>> >>> I do "vale-ctl -h vale0:eth2", then I test on the vm, the speed is up to 558.57 Kpps. >>> While "vale-ctl -a vale0:eth2", the speed is up to 800 kpps. >>> >> >> The number you obtain in the first test is quite low. 
vale-ctl -h vale0:eth2 connects the host stack, which is very slow, so ~500 Kpps is not unexpected. I don’t know about the third test at the moment, I have to check. >> >> What version of our modified qemu are you using? Please note that there might be a qemu patch in the netmap sources, but that is only a leftover from our first attempts, so you should not use that. >> > Here, the qemu I use is from 'git clone git://git.qemu-project.org/qemu.git' origin/master and the commit is f976b09ea249 > ("PPC: Fix compilation with TCG debug"). The netmap backend was merged into qemu in commit 58952137b0 ("net: Adding netmap > network backend"). Is the version I used not right? Because netmap-20131019 doesn't support qemu, I took the > newest qemu. > > I also tried netmap-20120813, which does support qemu: I downloaded qemu-1.0.1 from http://wiki.qemu-project.org/download/, > then applied patch-zz-netmap-1 and copied qemu-netmap into the qemu tree. I tested "pkt-gen -i eth0 -f tx -l 60 -n 20000000" on the vm > (the pkt-gen is from netmap-20131019), and the speed is unsteady, sometimes up to 2 Mpps or 1.44, with an avg of 1.74 Mpps. 
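For scale, a rough conversion of these packet rates (a sketch; it assumes a 10 Gbit/s link, and that pkt-gen's 60-byte frames cost 60 B + 4 B CRC + 20 B preamble/inter-frame gap each on the wire):

```shell
# 10G line rate for minimum-size frames, and where 1.74 Mpps sits relative to it
awk 'BEGIN {
  wire_bytes = 60 + 4 + 20              # per-packet cost on the wire
  line_pps   = 10e9 / (wire_bytes * 8)  # packets per second at 10 Gbit/s
  printf "10G line rate at 60B frames: %.2f Mpps\n", line_pps / 1e6
  printf "1.74 Mpps is %.1f%% of that\n", 1.74e6 / line_pps * 100
}'
```

So even the best figure above is roughly a tenth of what the wire could carry, which is why the frontend/backend batching matters so much here.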
> But when I use "./bridge -i vale0:eth2" on the host, then test "pkt-gen -i eth0 -f tx -l 60 -n 20000000" on the vm, > I got a NULL pointer dereference BUG that: > > -------------- > [ 2313.454871] BUG: unable to handle kernel NULL pointer dereference at (null) > [ 2313.547751] IP: [] get_rps_cpu+0x44/0x390 > [ 2313.613802] PGD 1f7cbe5067 PUD 1f7d792067 PMD 0 > [ 2313.668509] Oops: 0000 [#1] SMP > [ 2313.706703] CPU 0 > [ 2313.728373] Modules linked in: ixgbe(N) netmap_lin(N) edd(N) bridge(N) stp(N) llc(N) mperf(N) microcode(N) fuse(N) loop(N) dm_mod(N) vhost_net(N) macvtap(N) macvlan(N) tun(N) kvm_intel(N) sg(N) i2c_i801(N) ipv6(N) kvm(N) ipv6_lib(N) i2c_core(N) i7core_edac(N) mptctl(N) iTCO_wdt(N) igb(N) pcspkr(N) edac_core(N) rtc_cmos(N) serio_raw(N) iTCO_vendor_support(N) mdio(N) dca(N) button(N) ext3(N) jbd(N) mbcache(N) usbhid(N) hid(N) uhci_hcd(N) ehci_hcd(N) usbcore(N) usb_common(N) sd_mod(N) crc_t10dif(N) processor(N) thermal_sys(N) hwmon(N) scsi_dh_alua(N) scsi_dh_hp_sw(N) scsi_dh_rdac(N) scsi_dh_emc(N) scsi_dh(N) ata_generic(N) ata_piix(N) libata(N) mptsas(N) mptscsih(N) mptbase(N) scsi_transport_sas(N) scsi_mod(N) [last unloaded: ixgbe] > [ 2314.498465] Supported: Yes > [ 2314.530455] > [ 2314.548001] Pid: 10708, comm: bridge Tainted: G N 3.0.58-0.6.6-default #2 Huawei Technologies Co., Ltd. 
Tecal XH620 /BC21THSA > [ 2314.718261] RIP: 0010:[] [] get_rps_cpu+0x44/0x390 > [ 2314.813196] RSP: 0018:ffff881f5af75928 EFLAGS: 00010246 > [ 2314.876137] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 > [ 2314.960745] RDX: ffff881f5af75990 RSI: ffff881f5b1da480 RDI: ffff881f59098000 > [ 2315.045354] RBP: ffff881f5b1da480 R08: 0000000000000000 R09: 0000000000000004 > [ 2315.129963] R10: 0000000080042000 R11: 0000000000000001 R12: ffff881f59098000 > [ 2315.214570] R13: ffff881f7a480000 R14: ffff881f5b1da480 R15: 00000000000003ff > [ 2315.299179] FS: 00007f948e25c700(0000) GS:ffff88203f200000(0000) knlGS:0000000000000000 > [ 2315.395135] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 2315.463237] CR2: 0000000000000000 CR3: 0000001f7bb55000 CR4: 00000000000026e0 > [ 2315.547845] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 2315.632454] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 2315.717064] Process bridge (pid: 10708, threadinfo ffff881f5af74000, task ffff881f5903a3c0) > [ 2315.816120] Stack: > [ 2315.839856] ffff881f5af7598f 0000000000000258 ffff881f81aa1280 ffffffff8137ed57 > [ 2315.927586] ffff881f5af75990 0000000000000000 ffff881f5b1da480 0000000000000296 > [ 2316.015317] ffff881f7a480000 ffff881f5b1da480 00000000000003ff ffffffff8138e998 > [ 2316.103044] Call Trace: > [ 2316.131948] [] netif_rx+0xf8/0x190 > [ 2316.191799] [] netmap_sync_to_host+0x1de/0x2b0 [netmap_lin] > [ 2316.277452] [] netmap_poll+0x495/0x610 [netmap_lin] > [ 2316.354846] [] do_poll+0x115/0x2a0 > [ 2316.414696] [] do_sys_poll+0x18e/0x200 > [ 2316.478676] [] sys_poll+0x66/0x100 > [ 2316.538526] [] system_call_fastpath+0x16/0x1b > [ 2316.609726] [<00007f948d7724bf>] 0x7f948d7724be > [ 2316.664418] Code: 24 40 49 89 fc 4c 89 74 24 48 4c 89 7c 24 50 48 89 54 24 20 0f b7 86 ac 00 00 00 66 85 c0 0f 85 d3 00 00 00 48 8b 9f d8 02 00 00 <4c> 8b 2b 4d 85 ed 0f 84 83 01 00 00 41 83 7d 00 01 0f 84 05 01 > [ 2316.888727] RIP [] 
get_rps_cpu+0x44/0x390 > [ 2316.955804] RSP > ------------------------- > > As you point out, I shouldn't use these old versions. So the BUG does not occur with netmap-20131019 and the newest qemu, which integrates the netmap backend. > > Btw, how can I use the bridge command for testing? > > Thanks, > Wang > >> Cheers, >> Giuseppe >> >>> Did I do something wrong? >>> ------ >>> >>> thanks, >>> >>> Wang >>> >>> >>> >>> >>> >>> >>> -- >>> -----------------------------------------+------------------------------- >>> Prof. Luigi RIZZO, rizzo@iet.unipi.it . Dip. di Ing. dell'Informazione >>> http://www.iet.unipi.it/~luigi/ . Universita` di Pisa >>> TEL +39-050-2211611 . via Diotisalvi 2 >>> Mobile +39-338-6809875 . 56122 PISA (Italy) >>> -----------------------------------------+------------------------------- >> > > -- Dr. Ing. Giuseppe Lettieri Dipartimento di Ingegneria della Informazione Universita' di Pisa Largo Lucio Lazzarino 1, 56122 Pisa - Italy Ph. : (+39) 050-2217.649 (direct) .599 (switch) Fax : (+39) 050-2217.600 e-mail: g.lettieri@iet.unipi.it --------------080404040009080203030301 Content-Type: text/plain; charset=UTF-8; name="README.images" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="README.images" EXPERIMENTING WITH NETMAP, VALE AND FAST QEMU --------------------------------------------- To ease experiments with Netmap, the VALE switch and our Qemu enhancements we have prepared a couple of bootable images (linux and FreeBSD). You can find them on the netmap page http://info.iet.unipi.it/~luigi/netmap/ where you can also look at more recent versions of this file. Below are step-by-step instructions on experiments you can run with these images. The two main versions are picobsd.hdd -> FreeBSD HEAD (netmap + VALE) tinycore.hdd -> Linux (qemu + netmap + VALE) Booting the image ----------------- For all experiments you need to copy the image on a USB stick and boot a PC with it. 
Alternatively, you can use the image with VirtualBox, Qemu or other emulators, as an example qemu-system-x86_64 -hda IMAGE_FILE -m 1G -machine accel=kvm ... (remove 'accel=kvm' if your host does not support kvm). The images do not install anything on the hard disk. Both systems have preloaded drivers for a number of network cards (including the intel 10 Gbit ones) with netmap extensions. The VALE switch is also available (it is part of the netmap module). ssh, scp and a few other utilities are also included. FreeBSD image: + the OS boots directly in console mode, you can switch between terminals with ALT-Fn. The password for the 'root' account is 'setup' + if you are connected to a network, you can use dhclient em0 # or other interface name to obtain an IP address and external connectivity. Linux image: + in addition to the netmap/VALE modules, the KVM kernel module is also preloaded. + the boot-loader gives you two main options (each with a variant to delay boot in case you have slow devices): + "Boot TinyCore" boots in an X11 environment as user 'tc'. You can create a few terminals using the icon at the bottom. You can use "sudo -s" to get root access. In case no suitable video card is available/detected, it falls back to command line mode. + "Boot Core (command line only)" boots in console mode with virtual terminals. You're automatically logged in as user 'tc'. To log in the other terminals use the same username (no password required). + The system should automatically recognize the existing ethernet devices, and load the appropriate netmap-capable device drivers when available. Interfaces are configured through DHCP when possible. General test recommendations ---------------------------- NOTE: The tests outlined in the following sections can generate very high packet rates, and some hardware misconfiguration problems may prevent you from achieving maximum speed. Common problems are: + slow link autonegotiation. 
Our programs typically wait 2-4 seconds for link negotiation to complete, but some NIC/switch combinations are much slower. In this case you should increase the delay (pkt-gen has the -w XX option for that) or possibly force the link speed and duplex mode on both sides. Check the link speed to make sure there are no negotiation problems, and that you see the expected speed. ethtool IFNAME # on linux ifconfig IFNAME # on FreeBSD + ethernet flow control. If the receiving port is slow (often the case in presence of multicast/broadcast traffic, or also unicast if you are sending to non-netmap receivers), it will generate ethernet flow control frames that throttle down the sender. We recommend disabling BOTH RX and TX ethernet flow control on BOTH sender and receiver. On Linux this can be done with ethtool: ethtool -A IFNAME tx off rx off whereas on FreeBSD there are device-specific sysctls: sysctl dev.ix.0.queue0.flow_control=0 + CPU power saving. The CPU governor on linux, or its equivalent in FreeBSD, tends to throttle down the clock rate, reducing performance. Unlike other similar systems, netmap does not have busy-wait loops, so the CPU load is generally low and this can trigger the clock slowdown. Make sure that ALL CPUs run at maximum speed by disabling the dynamic frequency-scaling mechanisms. cpufreq-set -gperformance # on linux sysctl dev.cpu.0.freq=3401 # on FreeBSD + wrong MAC address. netmap does not put the NIC in promiscuous mode, so unless the application does it, the NIC will only receive broadcast traffic or unicast directed to its own MAC address. STANDARD SOCKET TESTS --------------------- For most socket-based experiments you can use the "netperf" tool installed on the system (version 2.6.0). Be careful to use a matching version for the other netperf endpoint (e.g. netserver) when running tests between different machines. 
Interesting experiments are: netperf -H x.y.z.w -tTCP_STREAM # test TCP throughput netperf -H x.y.z.w -tTCP_RR # test latency netperf -H x.y.z.w -tUDP_STREAM -- -m8 # test UDP throughput with short packets where x.y.z.w is the host running "netserver". RAW SOCKET AND TAP TESTS ------------------------ For experiments with raw sockets and tap devices you can use the l2 utilities (l2open, l2send, l2recv) installed on the system. With these utilities you can send/receive custom network packets to/from raw sockets or tap file descriptors. The receiver can be run with one of the following commands l2open -r IFNAME l2recv # receive from a raw socket attached to IFNAME l2open -t IFNAME l2recv # receive from a file descriptor opened on the tap IFNAME The receiver process will wait indefinitely for the first packet and then keep receiving as long as packets keep coming. When the flow stops (after a 2 seconds timeout) the process terminates and prints the received packet rate and packet count. To run the sender in an easy way, you can use the script l2-send.sh in the home directory. This script defines several shell variables that can be manually changed to customize the test (see the comments in the script itself). As an example, you can test configurations with Virtual Machines attached to host tap devices bridged together. Tests using the Linux in-kernel pktgen -------------------------------------- To use the Linux in-kernel packet generator, you can use the script "linux-pktgen.sh" in the home directory. The pktgen creates a kernel thread for each hardware TX queue of a given NIC. By manually changing the script shell variable definitions you can change the test configuration (e.g. addresses in the generated packet). Please change the "NCPU" variable to match the number of CPUs on your machine. The script has an argument which specifies the number of NIC queues (i.e. kernel threads) to use minus one. 
For example: ./linux-pktgen.sh 2 # Uses 3 NIC queues When the script terminates, it prints the per-queue rates and the total rate achieved. NETMAP AND VALE EXPERIMENTS --------------------------- For most experiments with netmap you can use the "pkt-gen" command (do not confuse it with the Linux in-kernel pktgen), which has a large number of options to send and receive traffic (also on TAP devices). pkt-gen normally generates UDP traffic for a specific IP address, using the broadcast MAC address. Netmap testing with network interfaces -------------------------------------- Remember that you need a netmap-capable driver in order to use netmap on a specific NIC. Currently supported drivers are e1000, e1000e, ixgbe, igb. For updated information please visit http://info.iet.unipi.it/~luigi/netmap/ Before running pkt-gen, make sure that the link is up. Run pkt-gen on an interface called "IFNAME": pkt-gen -i IFNAME -f tx # run a pkt-gen sender pkt-gen -i IFNAME -f rx # run a pkt-gen receiver pkt-gen without arguments will show other options, e.g. + -w sec modifies the wait time for link negotiation + -l len modifies the packet size + -d, -s set the IP destination/source addresses and ports + -D, -S set the MAC destination/source addresses and more. Testing the VALE switch ------------------------ To use the VALE switch instead of physical ports you only need to change the interface name in the pkt-gen command. As an example, on a single machine, you can run senders and receivers on multiple ports of a VALE switch as follows (run the commands in separate terminals to see the output) pkt-gen -ivale0:01 -ftx # run a sender on the port 01 of the switch vale0 pkt-gen -ivale0:02 -frx # receiver on the port 02 of same switch pkt-gen -ivale0:03 -ftx # another sender on the port 03 The VALE switches and ports are created (and destroyed) on the fly. 
Transparent connection of physical ports to the VALE switch ----------------------------------------------------------- It is also possible to use a network device as a port of a VALE switch. You can do this with the following command: vale-ctl -h vale0:eth0 # attach interface "eth0" to the "vale0" switch To detach an interface from a bridge: vale-ctl -d vale0:eth0 # detach interface "eth0" from the "vale0" switch These operations can be issued at any moment. Tests with our modified QEMU ---------------------------- The Linux image also contains our modified QEMU, with the VALE backend and the "e1000-paravirt" frontend (a paravirtualized e1000 emulation). After you have booted the image on a physical machine (so you can exploit KVM), you can boot the same image a second time (recursively) with QEMU. Therefore, you can run all the tests above also from within the virtual machine environment. To make VM testing easier, the home directory contains some useful scripts to set up and launch VMs on the physical machine. + "prep-taps.sh" creates and sets up two permanent tap interfaces ("tap01" and "tap02") and a Linux in-kernel bridge. The tap interfaces are then bridged together on the same bridge. The bridge interface ("br0") is given the address 10.0.0.200/24. This setup can be used to make two VMs communicate through the host bridge, or to test the speed of a linux switch using l2open. + "unprep-taps.sh" undoes the above setup. + "launch-qemu.sh" can be used to run QEMU virtual machines. It takes four arguments: + The first argument can be "qemu" or "kvm", depending on whether we want to use the standard QEMU binary translation or the hardware virtualization acceleration. + The third argument can be "--tap", "--netuser" or "--vale", and tells QEMU what network backend to use: a tap device, the QEMU user networking (slirp), or a VALE switch port. + When the third argument is "--tap" or "--vale", the fourth argument specifies an index (e.g. "01", "02", etc..) 
which tells QEMU what tap device or VALE port to use as backend. You can manually modify the script to set the shell variables that select the type of emulated device (e.g. e1000, virtio-net-pci, ...) and related options (ioeventfd, virtio vhost, e1000 mitigation, ....). The default setup has an "e1000" device with interrupt mitigation disabled. You can try the paravirtualized e1000 device ("e1000-paravirt") or the "virtio-net" device to get better performance. However, bear in mind that these paravirtualized devices don't have netmap support (whereas the standard e1000 does have netmap support). Examples: # Run a kvm VM attached to the port 01 of a VALE switch ./launch-qemu.sh kvm --vale 01 # Run a kvm VM attached to the port 02 of the same VALE switch ./launch-qemu.sh kvm --vale 02 # Run a kvm VM attached to the tap called "tap01" ./launch-qemu.sh kvm --tap 01 # Run a kvm VM attached to the tap called "tap02" ./launch-qemu.sh kvm --tap 02 Guest-to-guest tests -------------------- If you run two VMs attached to the same switch (which can be a Linux bridge or a VALE switch), you can run guest-to-guest experiments. All the tests reported in the previous sections are possible (normal sockets, raw sockets, pkt-gen, ...), independently of the backend used. In the following examples we assume that: + Each VM has an ethernet interface called "eth0". + The interface of the first VM is given the IP 10.0.0.1/24. + The interface of the second VM is given the IP 10.0.0.2/24. + The Linux bridge interface "br0" on the host is given the IP 10.0.0.200/24. 
Examples: [1] ### Test UDP short packets over traditional sockets ### # On the guest 10.0.0.2 run netserver # on the guest 10.0.0.1 run netperf -H10.0.0.2 -tUDP_STREAM -- -m8 [2] ### Test UDP short packets with pkt-gen ### # On the guest 10.0.0.2 run pkt-gen -ieth0 -frx # On the guest 10.0.0.1 run pkt-gen -ieth0 -ftx [3] ### Test guest-to-guest latency ### # On the guest 10.0.0.2 run netserver # On the guest 10.0.0.1 run netperf -H10.0.0.2 -tTCP_RR Note that you can use pkt-gen into a VM only if the emulated ethernet device is supported by netmap. The default emulated device is "e1000", which has netmap support. If you try to run pkt-gen on an unsupported device, pkt-gen will not work, reporting that it is unable to register the interface. Guest-to-host tests (follows from the previous section) ------------------------------------------------------- If you run only a VM on your host machine, you can measure the network performance between the VM and the host machine. In this case the experiment setup depends on the backend you are using. With the tap backend, you can use the bridge interface "br0" as a communication endpoint. You can run normal/raw sockets experiments, but you cannot use pkt-gen on the "br0" interface, since the Linux bridge interface is not supported by netmap. Examples with the tap backend: [1] ### Test TCP throughput over traditional sockets ### # On the host run netserver # on the guest 10.0.0.1 run netperf -H10.0.0.200 -tTCP_STREAM [2] ### Test UDP short packets with pkt-gen and l2 ### # On the host run l2open -r br0 l2recv # On the guest 10.0.0.1 run (xx:yy:zz:ww:uu:vv is the # "br0" hardware address) pkt-gen -ieth0 -ftx -d10.0.0.200:7777 -Dxx:yy:zz:ww:uu:vv With the VALE backend you can perform only UDP tests, since we don't have a netmap application which implements a TCP endpoint: pkt-gen generates UDP packets. As a communication endpoint on the host, you can use a virtual VALE port opened on the fly by a pkt-gen instance. 
Examples with the VALE backend: [1] ### Test UDP short packets ### # On the host run pkt-gen -ivale0:99 -frx # On the guest 10.0.0.1 run pkt-gen -ieth0 -ftx [2] ### Test UDP big packets (receiver on the guest) ### # On the guest 10.0.0.1 run pkt-gen -ieth0 -frx # On the host run pkt-gen -ivale0:99 -ftx -l1460 --------------080404040009080203030301-- From owner-freebsd-net@FreeBSD.ORG Tue Jan 21 01:27:26 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 197FA434 for ; Tue, 21 Jan 2014 01:27:26 +0000 (UTC) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 899561175 for ; Tue, 21 Jan 2014 01:27:24 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: X-IronPort-AV: E=Sophos;i="4.95,693,1384318800"; d="scan'208";a="89303299" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 20 Jan 2014 20:27:22 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 9BCC8B4026; Mon, 20 Jan 2014 20:27:22 -0500 (EST) Date: Mon, 20 Jan 2014 20:27:22 -0500 (EST) From: Rick Macklem To: J David Message-ID: <800819196.13362204.1390267642625.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.203] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, Adam McDougall X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 21 Jan 2014 01:27:26 -0000 J. David wrote: > MIME-Version: 1.0 > Sender: jdavidlists@gmail.com > Received: by 10.42.170.8 with HTTP; Sun, 19 Jan 2014 20:08:04 -0800 > (PST) > In-Reply-To: > <1349281953.12559529.1390174577569.JavaMail.root@uoguelph.ca> > References: <52DC1241.7010004@egr.msu.edu> > <1349281953.12559529.1390174577569.JavaMail.root@uoguelph.ca> > Date: Sun, 19 Jan 2014 23:08:04 -0500 > Delivered-To: jdavidlists@gmail.com > X-Google-Sender-Auth: 2XgnsPkoaEEkfTqW1ZVFM_Lel3o > Message-ID: > > Subject: Re: Terrible NFS performance under 9.2-RELEASE? > From: J David > To: Rick Macklem > Content-Type: text/plain; charset=ISO-8859-1 > > On Sun, Jan 19, 2014 at 9:32 AM, Alfred Perlstein > wrote: > > I hit nearly the same problem and raising the mbufs worked for me. > > > > I'd suggest raising that and retrying. > > That doesn't seem to be an issue here; mbufs are well below max on > both client and server and all the "delayed"/"denied" lines are > 0/0/0. > > > On Sun, Jan 19, 2014 at 12:58 PM, Adam McDougall > wrote: > > Also try rsize=32768,wsize=32768 in your mount options, made a huge > > difference for me. > > This does make a difference, but inconsistently. > > In order to test this further, I created a Debian guest on the same > host as these two FreeBSD hosts and re-ran the tests with it acting > as > both client and server, and ran them for both 32k and 64k. 
> > Findings: > > write rewrite read reread random-read random-write > > S:FBSD,C:FBSD,Z:64k > 67246 2923 103295 1272407 172475 196 > > S:FBSD,C:FBSD,Z:32k > 11951 99896 223787 1051948 223276 13686 > > S:FBSD,C:DEB,Z:64k > 11414 14445 31554 30156 30368 13799 > > S:FBSD,C:DEB,Z:32k > 11215 14442 31439 31026 29608 13769 > > S:DEB,C:FBSD,Z:64k > 36844 173312 313919 1169426 188432 14273 > > S:DEB,C:FBSD,Z:32k > 66928 120660 257830 1048309 225807 18103 > Since I've never used the benchmark you're using, I'll admit I have no idea what these numbers mean. (Are big values fast or slow or???) > So the rsize/wsize makes a difference between two FreeBSD nodes, but > with a Debian node as either client or server, it no longer seems to > matter much. And /proc/mounts on the debian box confirms that it > negotiates and honors the 64k size as a client. > > On Sun, Jan 19, 2014 at 6:36 PM, Rick Macklem > wrote: > > Yes, it shouldn't make a big difference but it sometimes does. When > > it > > does, I believe that indicates there is a problem with your network > > fabric. > > Given that this is an entirely virtual environment, if your belief is > correct, where would supporting evidence be found? > I'd be looking at a packet trace in wireshark and looking for TCP retransmits and delays between packets. However, any significant improvement (like 50% or more faster for a smaller I/O size) indicates that something is broken in the network (in your case virtual) fabric, imho. (You saw that the smaller I/O size results in more RPCs, so that would suggest slower, not faster. bde@ has argued that a larger I/O size results in vm related fragmentation, but I don't think that causes large (50+%) differences.) > As far as I can tell, there are no interface errors reported on the > host (checking both taps and the bridge) or any of the guests, > nothing > in sysctl dev.vtnet of concern, etc. 
Also the improvement from using > debian on either side, even with 64k sizes, seems counterintuitive. > > To try to help vindicate the network stack, I did iperf -d between > the > two FreeBSD nodes while the iozone was running: > NFS traffic looks very different than a typical network benchmark load. NFS traffic consists of bi-directional (both directions concurrently) RPC messages that are mostly small ones, except for the reads/writes, which will be a little larger than the 32k or 64k size specified. The most common problem is that a network interface will miss reception of a packet while under heavy transmit load. For a typical TCP load, all that is going in the opposite direction are ACKs and losing one may not have much impact, since the next ACK will cover for it (and TCP window sizes are pretty large these days). However, loss of a TCP segment with part/all of an RPC message will hammer NFS. It is message latency that will impact NFS performance, so any packet loss/retransmit will be significant. Bandwidth is much less an issue. > Server: > > $ iperf -s > > ------------------------------------------------------------ > > Server listening on TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > [ 4] local 172.20.20.162 port 5001 connected with 172.20.20.169 port > 37449 > > ------------------------------------------------------------ > > Client connecting to 172.20.20.169, TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > [ 6] local 172.20.20.162 port 28634 connected with 172.20.20.169 > port 5001 > > Waiting for server threads to complete. Interrupt again to force > quit. 
> > [ ID] Interval Transfer Bandwidth > > [ 6] 0.0-10.0 sec 15.8 GBytes 13.6 Gbits/sec > > [ 4] 0.0-10.0 sec 15.6 GBytes 13.4 Gbits/sec > > > Client: > > $ iperf -c 172.20.20.162 -d > > ------------------------------------------------------------ > > Server listening on TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > ------------------------------------------------------------ > > Client connecting to 172.20.20.162, TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > [ 5] local 172.20.20.169 port 32533 connected with 172.20.20.162 > port 5001 > > [ 4] local 172.20.20.169 port 5001 connected with 172.20.20.162 port > 36617 > > [ ID] Interval Transfer Bandwidth > > [ 5] 0.0-10.0 sec 15.6 GBytes 13.4 Gbits/sec > > [ 4] 0.0-10.0 sec 15.5 GBytes 13.3 Gbits/sec > > > mbuf usage is pretty low. > > Server: > > $ netstat -m > > 545/4075/4620 mbufs in use (current/cache/total) > > 535/1819/2354/131072 mbuf clusters in use (current/cache/total/max) > > 535/1641 mbuf+clusters out of packet secondary zone in use > (current/cache) > > 0/2034/2034/12800 4k (page size) jumbo clusters in use > (current/cache/total/max) > > 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max) > > 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max) > > 1206K/12792K/13999K bytes allocated to network (current/cache/total) > > 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) > > 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters) > > 0/0/0 requests for jumbo clusters delayed (4k/9k/16k) > > 0/0/0 requests for jumbo clusters denied (4k/9k/16k) > > 0/0/0 sfbufs in use (current/peak/max) > > 0 requests for sfbufs denied > > 0 requests for sfbufs delayed > > 0 requests for I/O initiated by sendfile > > 0 calls to protocol drain routines > > > Client: > > $ netstat -m > > 1841/3544/5385 mbufs in use 
(current/cache/total) > > 1172/1198/2370/32768 mbuf clusters in use (current/cache/total/max) > > 512/896 mbuf+clusters out of packet secondary zone in use > (current/cache) > > 0/2314/2314/16384 4k (page size) jumbo clusters in use > (current/cache/total/max) > > 0/0/0/8192 9k jumbo clusters in use (current/cache/total/max) > > 0/0/0/4096 16k jumbo clusters in use (current/cache/total/max) > > 2804K/12538K/15342K bytes allocated to network (current/cache/total) > > 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) > > 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters) > > 0/0/0 requests for jumbo clusters delayed (4k/9k/16k) > > 0/0/0 requests for jumbo clusters denied (4k/9k/16k) > > 0/0/0 sfbufs in use (current/peak/max) > > 0 requests for sfbufs denied > > 0 requests for sfbufs delayed > > 0 requests for I/O initiated by sendfile > > 0 calls to protocol drain routines > > > > Here's 60 seconds of netstat -ss for ip and tcp from the server with > 64k mount running ozone: > > ip: > > 4776 total packets received > > 4758 packets for this host > > 18 packets for unknown/unsupported protocol > > 2238 packets sent from this host > > tcp: > > 2244 packets sent > > 1427 data packets (238332 bytes) > > 5 data packets (820 bytes) retransmitted > > 812 ack-only packets (587 delayed) > I'm not sure what (587 delayed) means, but in the old days delayed acknowledgement used to trainwreck performance. My TCP is very rusty, but maybe I'll take a look and see what (587 delayed) means. This would be something I'd be looking at in wireshark. In other words, I'd be looking at when the ACKs were sent and whether there were significant time delays because ACKs didn't happen quickly enough. Does anyone else know what these "delayed acks" refer to? 
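For what it's worth, the fraction can be read straight off the counters just above (a sketch; "delayed" here counts ACKs held back by the delayed-ACK timer rather than sent immediately):

```shell
# 64k-mount run: 812 ack-only packets, of which 587 were delayed ACKs
awk 'BEGIN { printf "delayed acks: %.0f%% of ack-only packets\n", 587/812*100 }'
```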
> 2235 packets received > > 1428 acks (for 238368 bytes) > > 2007 packets (91952792 bytes) received in-sequence > > 225 out-of-order packets (325800 bytes) > > 1428 segments updated rtt (of 1426 attempts) > > 5 retransmit timeouts > 5 retransmits out of 2244 sounds like a lot to me. > 587 correct data packet header predictions > > 225 SACK options (SACK blocks) sent > > > And with 32k mount: > > ip: > > 24172 total packets received > > 24167 packets for this host > > 5 packets for unknown/unsupported protocol > > 26130 packets sent from this host > > tcp: > > 26130 packets sent > > 23506 data packets (5362120 bytes) > > 2624 ack-only packets (454 delayed) > > 21671 packets received > > 18143 acks (for 5362192 bytes) > > 20278 packets (756617316 bytes) received in-sequence > > 96 out-of-order packets (145964 bytes) > > 18143 segments updated rtt (of 17469 attempts) > > 1093 correct ACK header predictions > > 3449 correct data packet header predictions > > 111 SACK options (SACK blocks) sent > I don't see any retransmit timeouts here, which is what I would expect. > > So the 32k mount sends about 6x the packet volume. (This is on > iozone's linear write test.) > Yep, I'd guess the 64k is wedging for quite a while each time one of those retransmit timeouts occurs and those delays result in a lot less traffic. (This would be what I'd be looking for in wireshark.) > One thing I've noticed is that when the 64k connection bogs down, it > seems to "poison" things for awhile. For example, iperf will start > doing this afterward: > Just a wild guess, but something in the "virtual net interface" has gotten badly broken by the NFS traffic load. Any networking type able to explain this? 
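Putting numbers on the two runs (a sketch using only the ip/tcp counters quoted above; 64k run: 4776 received / 2238 sent / 5 retransmits out of 1427 data packets; 32k run: 24172 received / 26130 sent):

```shell
# ratios between the 32k-mount and 64k-mount netstat counters
awk 'BEGIN {
  printf "packets received, 32k vs 64k: %.1fx\n", 24172 / 4776
  printf "packets sent,     32k vs 64k: %.1fx\n", 26130 / 2238
  printf "64k retransmits: %.2f%% of data packets\n", 5 / 1427 * 100
}'
```

A sub-1% retransmit rate sounds small, but with each retransmit timeout stalling the stream for at least one RTO, a handful of timeouts in a short run can dominate the elapsed time.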
> From the client to the server: > > $ iperf -c 172.20.20.162 > > ------------------------------------------------------------ > > Client connecting to 172.20.20.162, TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > [ 3] local 172.20.20.169 port 14337 connected with 172.20.20.162 > port 5001 > > [ ID] Interval Transfer Bandwidth > > [ 3] 0.0-10.1 sec 4.88 MBytes 4.05 Mbits/sec > > > Ouch! That's quite a drop from 13Gbit/sec. Weirdly, iperf to the > debian node is not affected: > > From the client to the debian node: > > $ iperf -c 172.20.20.166 > > ------------------------------------------------------------ > > Client connecting to 172.20.20.166, TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > [ 3] local 172.20.20.169 port 24376 connected with 172.20.20.166 > port 5001 > > [ ID] Interval Transfer Bandwidth > > [ 3] 0.0-10.0 sec 20.4 GBytes 17.5 Gbits/sec > > > From the debian node to the server: > > $ iperf -c 172.20.20.162 > > ------------------------------------------------------------ > > Client connecting to 172.20.20.162, TCP port 5001 > > TCP window size: 23.5 KByte (default) > > ------------------------------------------------------------ > > [ 3] local 172.20.20.166 port 43166 connected with 172.20.20.162 > port 5001 > > [ ID] Interval Transfer Bandwidth > > [ 3] 0.0-10.0 sec 12.9 GBytes 11.1 Gbits/sec > > > But if I let it run for longer, it will apparently figure things out > and creep back up to normal speed and stay there until NFS strikes > again. It's like the kernel is caching some sort of hint that > connectivity to that other host sucks, and it has to either expire or > be slowly overcome.
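[Editor's note] The "cached hint" guess matches a real mechanism: FreeBSD's TCP host cache records per-destination metrics (RTT, RTT variance, ssthresh, cwnd) from finished connections and seeds new connections to the same address with them, so a connection that ended in heavy loss can depress later ones until the entry expires or is slowly overwritten. Assuming a FreeBSD 9.x/10.x host, the cache can be inspected and flushed with sysctls (names as in those releases):

```shell
# FreeBSD TCP host cache: per-destination metrics reused by new
# connections to the same address.  Sysctl names as of FreeBSD 9/10.
sysctl net.inet.tcp.hostcache.list      # dump the current entries
sysctl net.inet.tcp.hostcache.expire    # entry lifetime, in seconds
sysctl net.inet.tcp.hostcache.purge=1   # purge all entries at the next timeout
```

Purging the cache after an NFS-induced stall and re-running iperf would confirm or rule out this mechanism.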
> > Client: > > $ iperf -c 172.20.20.162 -t 60 > > ------------------------------------------------------------ > > Client connecting to 172.20.20.162, TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > [ 3] local 172.20.20.169 port 59367 connected with 172.20.20.162 > port 5001 > > [ ID] Interval Transfer Bandwidth > > [ 3] 0.0-60.0 sec 56.2 GBytes 8.04 Gbits/sec > > > Server: > > $ netstat -I vtnet1 -ihw 1 > > input (vtnet1) output > > packets errs idrops bytes packets errs bytes colls > > 7 0 0 420 0 0 0 0 > > 7 0 0 420 0 0 0 0 > > 8 0 0 480 0 0 0 0 > > 8 0 0 480 0 0 0 0 > > 7 0 0 420 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 11 0 0 12k 3 0 206 0 > <--- starts here > > 17 0 0 227k 10 0 660 0 > > 17 0 0 408k 10 0 660 0 > > 17 0 0 417k 10 0 660 0 > > 17 0 0 425k 10 0 660 0 > > 17 0 0 438k 10 0 660 0 > > 17 0 0 444k 10 0 660 0 > > 16 0 0 453k 10 0 660 0 > > input (vtnet1) output > > packets errs idrops bytes packets errs bytes colls > > 16 0 0 463k 10 0 660 0 > > 16 0 0 469k 10 0 660 0 > > 16 0 0 482k 10 0 660 0 > > 16 0 0 487k 10 0 660 0 > > 16 0 0 496k 10 0 660 0 > > 16 0 0 504k 10 0 660 0 > > 18 0 0 510k 10 0 660 0 > > 16 0 0 521k 10 0 660 0 > > 17 0 0 524k 10 0 660 0 > > 17 0 0 538k 10 0 660 0 > > 17 0 0 540k 10 0 660 0 > > 17 0 0 552k 10 0 660 0 > > 17 0 0 554k 10 0 660 0 > > 17 0 0 567k 10 0 660 0 > > 16 0 0 568k 10 0 660 0 > > 16 0 0 581k 10 0 660 0 > > 16 0 0 582k 10 0 660 0 > > 16 0 0 595k 10 0 660 0 > > 16 0 0 595k 10 0 660 0 > > 16 0 0 609k 10 0 660 0 > > 16 0 0 609k 10 0 660 0 > > input (vtnet1) output > > packets errs idrops bytes packets errs bytes colls > > 16 0 0 620k 10 0 660 0 > > 16 0 0 623k 10 0 660 0 > > 17 0 0 632k 10 0 660 0 > > 17 0 0 637k 10 0 660 0 > > 8.7k 0 0 389M 4.4k 0 288k 0 > > 42k 0 0 2.1G 21k 0 1.4M 0 > > 41k 0 0 
2.1G 20k 0 1.4M 0 > > 38k 0 0 1.9G 19k 0 1.2M 0 > > 40k 0 0 2.0G 20k 0 1.3M 0 > > 40k 0 0 2.0G 20k 0 1.3M 0 > > 40k 0 0 2G 20k 0 1.3M 0 > > 39k 0 0 2G 20k 0 1.3M 0 > > 43k 0 0 2.2G 22k 0 1.4M 0 > > 42k 0 0 2.2G 21k 0 1.4M 0 > > 39k 0 0 2G 19k 0 1.3M 0 > > 38k 0 0 1.9G 19k 0 1.2M 0 > > 42k 0 0 2.1G 21k 0 1.4M 0 > > 44k 0 0 2.2G 22k 0 1.4M 0 > > 41k 0 0 2.1G 20k 0 1.3M 0 > > 41k 0 0 2.1G 21k 0 1.4M 0 > > 40k 0 0 2.0G 20k 0 1.3M 0 > > input (vtnet1) output > > packets errs idrops bytes packets errs bytes colls > > 43k 0 0 2.2G 22k 0 1.4M 0 > > 41k 0 0 2.1G 20k 0 1.3M 0 > > 40k 0 0 2.0G 20k 0 1.3M 0 > > 42k 0 0 2.2G 21k 0 1.4M 0 > > 39k 0 0 2G 19k 0 1.3M 0 > > 42k 0 0 2.1G 21k 0 1.4M 0 > > 40k 0 0 2.0G 20k 0 1.3M 0 > > 42k 0 0 2.1G 21k 0 1.4M 0 > > 38k 0 0 2G 19k 0 1.3M 0 > > 39k 0 0 2G 20k 0 1.3M 0 > > 45k 0 0 2.3G 23k 0 1.5M 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > > It almost looks like something is limiting it to 10 packets per > second. So confusing! TCP super slow start? > Well, I'm not a networking guy, but the packet losses might have triggered congestion avoidance. I don't know diddly about the congestion avoidance algorithms in use these days (and don't know if they are applied across multiple TCP connections between the same host addresses?). > Thanks! > > (Sorry Rick, forgot to reply all so you got an extra! 
:( ) > > Also, here's the netstat from the client side showing the 10 packets > per second limit and eventual recovery: > > $ netstat -I net1 -ihw 1 > > input (net1) output > > packets errs idrops bytes packets errs bytes colls > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 15 0 0 962 11 0 114k 0 > > 17 0 0 1.1k 10 0 368k 0 > > 17 0 0 1.1k 10 0 411k 0 > > 17 0 0 1.1k 10 0 425k 0 > > 17 0 0 1.1k 10 0 432k 0 > > 17 0 0 1.1k 10 0 439k 0 > > 17 0 0 1.1k 10 0 452k 0 > > 16 0 0 1k 10 0 457k 0 > > 16 0 0 1k 10 0 467k 0 > > 16 0 0 1k 10 0 477k 0 > > 16 0 0 1k 10 0 481k 0 > > 16 0 0 1k 10 0 495k 0 > > 16 0 0 1k 10 0 498k 0 > > 16 0 0 1k 10 0 510k 0 > > 16 0 0 1k 10 0 515k 0 > > 16 0 0 1k 10 0 524k 0 > > 17 0 0 1.1k 10 0 532k 0 > > input (net1) output > > packets errs idrops bytes packets errs bytes colls > > 17 0 0 1.1k 10 0 538k 0 > > 17 0 0 1.1k 10 0 548k 0 > > 17 0 0 1.1k 10 0 552k 0 > > 17 0 0 1.1k 10 0 562k 0 > > 17 0 0 1.1k 10 0 566k 0 > > 16 0 0 1k 10 0 576k 0 > > 16 0 0 1k 10 0 580k 0 > > 16 0 0 1k 10 0 590k 0 > > 17 0 0 1.1k 10 0 594k 0 > > 16 0 0 1k 10 0 603k 0 > > 16 0 0 1k 10 0 609k 0 > > 16 0 0 1k 10 0 614k 0 > > 16 0 0 1k 10 0 623k 0 > > 16 0 0 1k 10 0 626k 0 > > 17 0 0 1.1k 10 0 637k 0 > > 18 0 0 1.1k 10 0 637k 0 > > 17k 0 0 1.1M 34k 0 1.7G 0 > > 21k 0 0 1.4M 42k 0 2.1G 0 > > 20k 0 0 1.3M 39k 0 2G 0 > > 19k 0 0 1.2M 38k 0 1.9G 0 > > 20k 0 0 1.3M 41k 0 2.0G 0 > > input (net1) output > > packets errs idrops bytes packets errs bytes colls > > 20k 0 0 1.3M 40k 0 2.0G 0 > > 19k 0 0 1.2M 38k 0 1.9G 0 > > 22k 0 0 1.5M 45k 0 2.3G 0 > > 20k 0 0 1.3M 40k 0 2.1G 0 > > 20k 0 0 1.3M 40k 0 2.1G 0 > > 18k 0 0 1.2M 36k 0 1.9G 0 > > 21k 0 0 1.4M 41k 0 2.1G 0 > > 22k 0 0 1.4M 44k 0 2.2G 0 > > 21k 0 0 1.4M 43k 0 2.2G 0 > > 20k 0 0 1.3M 41k 0 2.1G 0 > > 20k 0 0 1.3M 40k 0 2.0G 0 > > 21k 0 0 1.4M 43k 0 2.2G 0 > > 21k 0 0 1.4M 43k 0 2.2G 0 > > 20k 0 0 1.3M 40k 0 2.0G 0 > > 21k 0 0 1.4M 43k 0 2.2G 0 > > 19k 0 0 1.2M 38k 0 1.9G 0 > > 21k 
0 0 1.4M 42k 0 2.1G 0 > > 20k 0 0 1.3M 40k 0 2.0G 0 > > 21k 0 0 1.4M 42k 0 2.1G 0 > > 20k 0 0 1.3M 40k 0 2.0G 0 > > 20k 0 0 1.3M 40k 0 2.0G 0 > > input (net1) output > > packets errs idrops bytes packets errs bytes colls > > 24k 0 0 1.6M 48k 0 2.5G 0 > > 6.3k 0 0 417k 12k 0 647M 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Tue Jan 21 01:47:47 2014 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4944977F for ; Tue, 21 Jan 2014 01:47:47 +0000 (UTC) Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [119.145.14.65]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 2AC4512BD for ; Tue, 21 Jan 2014 01:47:45 +0000 (UTC) Received: from 172.24.2.119 (EHLO szxeml207-edg.china.huawei.com) ([172.24.2.119]) by szxrg02-dlp.huawei.com (MOS 4.3.7-GA FastPath queued) with ESMTP id BOW42630; Tue, 21 Jan 2014 09:47:23 +0800 (CST) Received: from SZXEML454-HUB.china.huawei.com (10.82.67.197) by szxeml207-edg.china.huawei.com (172.24.2.56) with Microsoft SMTP Server (TLS) id 14.3.158.1; Tue, 21 Jan 2014 09:46:45 +0800 Received: from [127.0.0.1] (10.177.18.75) by SZXEML454-HUB.china.huawei.com (10.82.67.197) with Microsoft SMTP Server id 14.3.158.1; Tue, 21 Jan 2014 09:46:40 +0800 Message-ID: <52DDD17D.6060104@huawei.com> Date: Tue, 21 Jan 2014 09:46:37 +0800 From: Wang Weidong User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 
1.0 To: Giuseppe Lettieri , =?UTF-8?B?ZmFjb2x0w6A=?= Subject: Re: netmap: I got some troubles with netmap References: <52D74E15.1040909@huawei.com> <92C7725B-B30A-4A19-925A-A93A2489A525@iet.unipi.it> <52D8A5E1.9020408@huawei.com> <52DD1914.7090506@iet.unipi.it> In-Reply-To: <52DD1914.7090506@iet.unipi.it> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit X-Originating-IP: [10.177.18.75] X-CFilter-Loop: Reflected Cc: Luigi Rizzo , Vincenzo Maffione , net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 21 Jan 2014 01:47:47 -0000 On 2014/1/20 20:39, Giuseppe Lettieri wrote: > Hi Wang, > > OK, you are using the netmap support in the upstream qemu git. That does not yet include all our modifications, some of which are very important for high throughput with VALE. In particular, the upstream qemu does not include the batching improvements in the frontend/backend interface, and it does not include the "map ring" optimization of the e1000 frontend. Please find attached a gzipped patch that contains all of our qemu code. The patch is against the latest upstream master (commit 1cf892ca). > > Please ./configure the patched qemu with the following option, in addition to any other option you may need: > > --enable-e1000-paravirt --enable-netmap \ > --extra-cflags=-I/path/to/netmap/sys/directory > > Note that --enable-e1000-paravirt is needed to enable the "map ring" optimization in the e1000 frontend, even if you are not going to use the e1000-paravirt device. > > Now you should be able to rerun your tests. I am also attaching a README file that describes some more tests you may want to run. > Thanks for your answers. I will do it soon. 
Cheers, Wang > > Cheers, > Giuseppe > > Il 17/01/2014 04:39, Wang Weidong ha scritto: >> On 2014/1/16 18:24, facoltà wrote: >>> Hi Wang, >>> >>> I work with Luigi, please check the replies below. >>> >>> >>> Il giorno 16/gen/2014, alle ore 04:53, Luigi Rizzo > ha scritto: >>> >>>> >>>> >> [...] >>>> Problem 3: >>>> "qemu-system-x86_64 -m 1024 -boot c -net nic -net netmap,ifname=vale0:1 -hda /home/disk/nm_d0 >>>> -enable-kvm -vnc :0", I use that command to start a vm. >>>> >>>> I test on the vm. >>>> #pkt-gen -i eth0 -f tx -l 60 -n 20000000, >>>> the speed is up to 1.02 Mpps. >>> >>>> >>>> I do "vale-ctl -h vale0:eth2", then I test on the vm, the speed is up to 558.57 Kpps. >>>> While "vale-ctl -a vale0:eth2", the speed is up to 800 kpps. >>> >>> The number you obtain in the first test is quite low. vale-ctl -h vale0:eth2 connects the host stack, which is very slow, so ~500 Kpps is not unexpected. I don’t know about the third test at the moment, I have to check. >>> >>> What version of our modified qemu are you using? Please note that there might be a qemu patch in the netmap sources, but that is only a leftover from our first attempts, so you should not use that. >>> >> Here, the qemu I use is from 'git clone git://git.qemu-project.org/qemu.git' origin/master and the commit is f976b09ea249 >> ("PPC: Fix compilation with TCG debug"). Netmap support was merged into qemu in commit 58952137b0 ("net: Adding netmap >> network backend"). Is the version I used not right? Because netmap-20131019 doesn't support qemu, I took the >> newest qemu. >> >> I also tried netmap-20120813, which supports qemu: I downloaded qemu-1.0.1 from http://wiki.qemu-project.org/download/, >> then applied patch-zz-netmap-1 and copied qemu-netmap into the qemu tree. I ran "pkt-gen -i eth0 -f tx -l 60 -n 20000000" on the vm >> (the pkt-gen is from netmap-20131019), and the speed is unsteady, sometimes up to 2 Mpps or 1.44 Mpps, averaging 1.74 Mpps.
>> But when I use "./bridge -i vale0:eth2" on the host, then test "pkt-gen -i eth0 -f tx -l 60 -n 20000000" on the vm, >> I got a NULL pointer dereference BUG that: >> >> -------------- >> [ 2313.454871] BUG: unable to handle kernel NULL pointer dereference at (null) >> [ 2313.547751] IP: [] get_rps_cpu+0x44/0x390 >> [ 2313.613802] PGD 1f7cbe5067 PUD 1f7d792067 PMD 0 >> [ 2313.668509] Oops: 0000 [#1] SMP >> [ 2313.706703] CPU 0 >> [ 2313.728373] Modules linked in: ixgbe(N) netmap_lin(N) edd(N) bridge(N) stp(N) llc(N) mperf(N) microcode(N) fuse(N) loop(N) dm_mod(N) vhost_net(N) macvtap(N) macvlan(N) tun(N) kvm_intel(N) sg(N) i2c_i801(N) ipv6(N) kvm(N) ipv6_lib(N) i2c_core(N) i7core_edac(N) mptctl(N) iTCO_wdt(N) igb(N) pcspkr(N) edac_core(N) rtc_cmos(N) serio_raw(N) iTCO_vendor_support(N) mdio(N) dca(N) button(N) ext3(N) jbd(N) mbcache(N) usbhid(N) hid(N) uhci_hcd(N) ehci_hcd(N) usbcore(N) usb_common(N) sd_mod(N) crc_t10dif(N) processor(N) thermal_sys(N) hwmon(N) scsi_dh_alua(N) scsi_dh_hp_sw(N) scsi_dh_rdac(N) scsi_dh_emc(N) scsi_dh(N) ata_generic(N) ata_piix(N) libata(N) mptsas(N) mptscsih(N) mptbase(N) scsi_transport_sas(N) scsi_mod(N) [last unloaded: ixgbe] >> [ 2314.498465] Supported: Yes >> [ 2314.530455] >> [ 2314.548001] Pid: 10708, comm: bridge Tainted: G N 3.0.58-0.6.6-default #2 Huawei Technologies Co., Ltd. 
Tecal XH620 /BC21THSA >> [ 2314.718261] RIP: 0010:[] [] get_rps_cpu+0x44/0x390 >> [ 2314.813196] RSP: 0018:ffff881f5af75928 EFLAGS: 00010246 >> [ 2314.876137] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 >> [ 2314.960745] RDX: ffff881f5af75990 RSI: ffff881f5b1da480 RDI: ffff881f59098000 >> [ 2315.045354] RBP: ffff881f5b1da480 R08: 0000000000000000 R09: 0000000000000004 >> [ 2315.129963] R10: 0000000080042000 R11: 0000000000000001 R12: ffff881f59098000 >> [ 2315.214570] R13: ffff881f7a480000 R14: ffff881f5b1da480 R15: 00000000000003ff >> [ 2315.299179] FS: 00007f948e25c700(0000) GS:ffff88203f200000(0000) knlGS:0000000000000000 >> [ 2315.395135] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 2315.463237] CR2: 0000000000000000 CR3: 0000001f7bb55000 CR4: 00000000000026e0 >> [ 2315.547845] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> [ 2315.632454] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >> [ 2315.717064] Process bridge (pid: 10708, threadinfo ffff881f5af74000, task ffff881f5903a3c0) >> [ 2315.816120] Stack: >> [ 2315.839856] ffff881f5af7598f 0000000000000258 ffff881f81aa1280 ffffffff8137ed57 >> [ 2315.927586] ffff881f5af75990 0000000000000000 ffff881f5b1da480 0000000000000296 >> [ 2316.015317] ffff881f7a480000 ffff881f5b1da480 00000000000003ff ffffffff8138e998 >> [ 2316.103044] Call Trace: >> [ 2316.131948] [] netif_rx+0xf8/0x190 >> [ 2316.191799] [] netmap_sync_to_host+0x1de/0x2b0 [netmap_lin] >> [ 2316.277452] [] netmap_poll+0x495/0x610 [netmap_lin] >> [ 2316.354846] [] do_poll+0x115/0x2a0 >> [ 2316.414696] [] do_sys_poll+0x18e/0x200 >> [ 2316.478676] [] sys_poll+0x66/0x100 >> [ 2316.538526] [] system_call_fastpath+0x16/0x1b >> [ 2316.609726] [<00007f948d7724bf>] 0x7f948d7724be >> [ 2316.664418] Code: 24 40 49 89 fc 4c 89 74 24 48 4c 89 7c 24 50 48 89 54 24 20 0f b7 86 ac 00 00 00 66 85 c0 0f 85 d3 00 00 00 48 8b 9f d8 02 00 00 <4c> 8b 2b 4d 85 ed 0f 84 83 01 00 00 41 83 7d 00 01 0f 84 05 
01 >> [ 2316.888727] RIP [] get_rps_cpu+0x44/0x390 >> [ 2316.955804] RSP >> ------------------------- >> >> As you point out, I shouldn't use these old versions. So the BUG does not occur with netmap-20131019 and the newest qemu, which integrates the netmap backend. >> >> Btw, how can I use the bridge command for testing? >> >> Thanks, >> Wang >> >>> Cheers, >>> Giuseppe >>> >>>> Did I do something wrong? >>>> ------ >>>> >>>> thanks, >>>> >>>> Wang >>>> >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> -----------------------------------------+------------------------------- >>>> Prof. Luigi RIZZO, rizzo@iet.unipi.it . Dip. di Ing. dell'Informazione >>>> http://www.iet.unipi.it/~luigi/ . Universita` di Pisa >>>> TEL +39-050-2211611 . via Diotisalvi 2 >>>> Mobile +39-338-6809875 . 56122 PISA (Italy) >>>> -----------------------------------------+------------------------------- >>> >> >> > > From owner-freebsd-net@FreeBSD.ORG Tue Jan 21 02:01:14 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 01C8886F for ; Tue, 21 Jan 2014 02:01:14 +0000 (UTC) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 7453313A6 for ; Tue, 21 Jan 2014 02:01:12 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: X-IronPort-AV: E=Sophos;i="4.95,693,1384318800"; d="scan'208";a="89306762" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 20 Jan 2014 21:01:11 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id A4D7DB3F1D; Mon, 20 Jan 2014 21:01:11 -0500 (EST) Date: Mon, 20 Jan 2014 21:01:11 -0500 (EST) From: Rick Macklem To: J David
Message-ID: <2057911949.13372985.1390269671666.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: Terrible NFS performance under 9.2-RELEASE? MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-net@freebsd.org, Adam McDougall X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 21 Jan 2014 02:01:14 -0000 Since this is getting long winded, I'm going to "cheat" and top post. (Don't top post flame suit on;-) You could try setting net.inet.tcp.delayed_ack=0 via sysctl. I just looked and it appears that TCP delays ACKs for a while, even when TCP_NODELAY is set (I didn't know that). I honestly don't know how much/if any effect these delayed ACKs will have, but if you disable them, you can see what happens. rick > MIME-Version: 1.0 > Sender: jdavidlists@gmail.com > Received: by 10.42.170.8 with HTTP; Sun, 19 Jan 2014 20:08:04 -0800 > (PST) > In-Reply-To: > <1349281953.12559529.1390174577569.JavaMail.root@uoguelph.ca> > References: <52DC1241.7010004@egr.msu.edu> > <1349281953.12559529.1390174577569.JavaMail.root@uoguelph.ca> > Date: Sun, 19 Jan 2014 23:08:04 -0500 > Delivered-To: jdavidlists@gmail.com > X-Google-Sender-Auth: 2XgnsPkoaEEkfTqW1ZVFM_Lel3o > Message-ID: > > Subject: Re: Terrible NFS performance under 9.2-RELEASE? > From: J David > To: Rick Macklem > Content-Type: text/plain; charset=ISO-8859-1 > > On Sun, Jan 19, 2014 at 9:32 AM, Alfred Perlstein > wrote: > > I hit nearly the same problem and raising the mbufs worked for me. > > > > I'd suggest raising that and retrying. > > That doesn't seem to be an issue here; mbufs are well below max on > both client and server and all the "delayed"/"denied" lines are > 0/0/0.
> > > On Sun, Jan 19, 2014 at 12:58 PM, Adam McDougall > wrote: > > Also try rsize=32768,wsize=32768 in your mount options, made a huge > > difference for me. > > This does make a difference, but inconsistently. > > In order to test this further, I created a Debian guest on the same > host as these two FreeBSD hosts and re-ran the tests with it acting > as > both client and server, and ran them for both 32k and 64k. > > Findings:
> >
> >                       write  rewrite    read   reread  random read  random write
> > S:FBSD,C:FBSD,Z:64k   67246     2923  103295  1272407       172475           196
> > S:FBSD,C:FBSD,Z:32k   11951    99896  223787  1051948       223276         13686
> > S:FBSD,C:DEB,Z:64k    11414    14445   31554    30156        30368         13799
> > S:FBSD,C:DEB,Z:32k    11215    14442   31439    31026        29608         13769
> > S:DEB,C:FBSD,Z:64k    36844   173312  313919  1169426       188432         14273
> > S:DEB,C:FBSD,Z:32k    66928   120660  257830  1048309       225807         18103
> >
> > So the rsize/wsize makes a difference between two FreeBSD nodes, but > with a Debian node as either client or server, it no longer seems to > matter much. And /proc/mounts on the debian box confirms that it > negotiates and honors the 64k size as a client. > > On Sun, Jan 19, 2014 at 6:36 PM, Rick Macklem > wrote: > > Yes, it shouldn't make a big difference but it sometimes does. When > > it > > does, I believe that indicates there is a problem with your network > > fabric. > > Given that this is an entirely virtual environment, if your belief is > correct, where would supporting evidence be found? > > As far as I can tell, there are no interface errors reported on the > host (checking both taps and the bridge) or any of the guests, > nothing > in sysctl dev.vtnet of concern, etc. Also the improvement from using > debian on either side, even with 64k sizes, seems counterintuitive.
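[Editor's note] Reading the FreeBSD-server/FreeBSD-client rows of the quoted iozone results side by side makes the failure mode clearer; a small sketch, with values copied from the thread (units assumed to be iozone's default KB/s):

```python
# FreeBSD-server + FreeBSD-client iozone rows quoted above.
# Column order as in the table: write, rewrite, read, reread,
# random read, random write.
fbsd_64k = [67246, 2923, 103295, 1272407, 172475, 196]
fbsd_32k = [11951, 99896, 223787, 1051948, 223276, 13686]

cols = ["write", "rewrite", "read", "reread", "rand_read", "rand_write"]
ratio_64k_vs_32k = {c: a / b for c, a, b in zip(cols, fbsd_64k, fbsd_32k)}
# rewrite and random write collapse by more than 30x at 64k, while
# sequential write is ~5.6x *faster* -- the slowdown is specific to
# certain write patterns, not an across-the-board throughput loss.
```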
> > To try to help vindicate the network stack, I did iperf -d between > the > two FreeBSD nodes while the iozone was running: > > Server: > > $ iperf -s > > ------------------------------------------------------------ > > Server listening on TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > [ 4] local 172.20.20.162 port 5001 connected with 172.20.20.169 port > 37449 > > ------------------------------------------------------------ > > Client connecting to 172.20.20.169, TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > [ 6] local 172.20.20.162 port 28634 connected with 172.20.20.169 > port 5001 > > Waiting for server threads to complete. Interrupt again to force > quit. > > [ ID] Interval Transfer Bandwidth > > [ 6] 0.0-10.0 sec 15.8 GBytes 13.6 Gbits/sec > > [ 4] 0.0-10.0 sec 15.6 GBytes 13.4 Gbits/sec > > > Client: > > $ iperf -c 172.20.20.162 -d > > ------------------------------------------------------------ > > Server listening on TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > ------------------------------------------------------------ > > Client connecting to 172.20.20.162, TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > [ 5] local 172.20.20.169 port 32533 connected with 172.20.20.162 > port 5001 > > [ 4] local 172.20.20.169 port 5001 connected with 172.20.20.162 port > 36617 > > [ ID] Interval Transfer Bandwidth > > [ 5] 0.0-10.0 sec 15.6 GBytes 13.4 Gbits/sec > > [ 4] 0.0-10.0 sec 15.5 GBytes 13.3 Gbits/sec > > > mbuf usage is pretty low. 
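[Editor's note] The exhaustion check being applied here — confirming every "denied"/"delayed" counter in netstat -m is zero — can be mechanized. A minimal sketch with a hypothetical helper name, parsing output shaped like the samples quoted in this thread:

```python
import re

# Quick check of `netstat -m` output for mbuf exhaustion: on a healthy
# system every "requests ... denied" / "... delayed" counter is zero,
# as in both samples quoted in this thread.

def mbuf_denials(netstat_m_output):
    """Sum the counters on lines reporting denied/delayed requests."""
    total = 0
    for line in netstat_m_output.splitlines():
        if "denied" in line or "delayed" in line:
            # Counters (e.g. "0/0/0") appear before the word "requests";
            # only sum digits from that part of the line.
            total += sum(int(n) for n in
                         re.findall(r"\d+", line.split("requests")[0]))
    return total
```

A nonzero return would point at mbuf/cluster starvation; zero (as here) rules it out.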
> > Server: > > $ netstat -m > > 545/4075/4620 mbufs in use (current/cache/total) > > 535/1819/2354/131072 mbuf clusters in use (current/cache/total/max) > > 535/1641 mbuf+clusters out of packet secondary zone in use > (current/cache) > > 0/2034/2034/12800 4k (page size) jumbo clusters in use > (current/cache/total/max) > > 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max) > > 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max) > > 1206K/12792K/13999K bytes allocated to network (current/cache/total) > > 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) > > 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters) > > 0/0/0 requests for jumbo clusters delayed (4k/9k/16k) > > 0/0/0 requests for jumbo clusters denied (4k/9k/16k) > > 0/0/0 sfbufs in use (current/peak/max) > > 0 requests for sfbufs denied > > 0 requests for sfbufs delayed > > 0 requests for I/O initiated by sendfile > > 0 calls to protocol drain routines > > > Client: > > $ netstat -m > > 1841/3544/5385 mbufs in use (current/cache/total) > > 1172/1198/2370/32768 mbuf clusters in use (current/cache/total/max) > > 512/896 mbuf+clusters out of packet secondary zone in use > (current/cache) > > 0/2314/2314/16384 4k (page size) jumbo clusters in use > (current/cache/total/max) > > 0/0/0/8192 9k jumbo clusters in use (current/cache/total/max) > > 0/0/0/4096 16k jumbo clusters in use (current/cache/total/max) > > 2804K/12538K/15342K bytes allocated to network (current/cache/total) > > 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) > > 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters) > > 0/0/0 requests for jumbo clusters delayed (4k/9k/16k) > > 0/0/0 requests for jumbo clusters denied (4k/9k/16k) > > 0/0/0 sfbufs in use (current/peak/max) > > 0 requests for sfbufs denied > > 0 requests for sfbufs delayed > > 0 requests for I/O initiated by sendfile > > 0 calls to protocol drain routines > > > > Here's 60 seconds of netstat -ss for ip and 
tcp from the server with > 64k mount running ozone: > > ip: > > 4776 total packets received > > 4758 packets for this host > > 18 packets for unknown/unsupported protocol > > 2238 packets sent from this host > > tcp: > > 2244 packets sent > > 1427 data packets (238332 bytes) > > 5 data packets (820 bytes) retransmitted > > 812 ack-only packets (587 delayed) > > 2235 packets received > > 1428 acks (for 238368 bytes) > > 2007 packets (91952792 bytes) received in-sequence > > 225 out-of-order packets (325800 bytes) > > 1428 segments updated rtt (of 1426 attempts) > > 5 retransmit timeouts > > 587 correct data packet header predictions > > 225 SACK options (SACK blocks) sent > > > And with 32k mount: > > ip: > > 24172 total packets received > > 24167 packets for this host > > 5 packets for unknown/unsupported protocol > > 26130 packets sent from this host > > tcp: > > 26130 packets sent > > 23506 data packets (5362120 bytes) > > 2624 ack-only packets (454 delayed) > > 21671 packets received > > 18143 acks (for 5362192 bytes) > > 20278 packets (756617316 bytes) received in-sequence > > 96 out-of-order packets (145964 bytes) > > 18143 segments updated rtt (of 17469 attempts) > > 1093 correct ACK header predictions > > 3449 correct data packet header predictions > > 111 SACK options (SACK blocks) sent > > > So the 32k mount sends about 6x the packet volume. (This is on > iozone's linear write test.) > > One thing I've noticed is that when the 64k connection bogs down, it > seems to "poison" things for awhile. 
For example, iperf will start > doing this afterward: > > From the client to the server: > > $ iperf -c 172.20.20.162 > > ------------------------------------------------------------ > > Client connecting to 172.20.20.162, TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > [ 3] local 172.20.20.169 port 14337 connected with 172.20.20.162 > port 5001 > > [ ID] Interval Transfer Bandwidth > > [ 3] 0.0-10.1 sec 4.88 MBytes 4.05 Mbits/sec > > > Ouch! That's quite a drop from 13Gbit/sec. Weirdly, iperf to the > debian node is not affected: > > From the client to the debian node: > > $ iperf -c 172.20.20.166 > > ------------------------------------------------------------ > > Client connecting to 172.20.20.166, TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > [ 3] local 172.20.20.169 port 24376 connected with 172.20.20.166 > port 5001 > > [ ID] Interval Transfer Bandwidth > > [ 3] 0.0-10.0 sec 20.4 GBytes 17.5 Gbits/sec > > > From the debian node to the server: > > $ iperf -c 172.20.20.162 > > ------------------------------------------------------------ > > Client connecting to 172.20.20.162, TCP port 5001 > > TCP window size: 23.5 KByte (default) > > ------------------------------------------------------------ > > [ 3] local 172.20.20.166 port 43166 connected with 172.20.20.162 > port 5001 > > [ ID] Interval Transfer Bandwidth > > [ 3] 0.0-10.0 sec 12.9 GBytes 11.1 Gbits/sec > > > But if I let it run for longer, it will apparently figure things out > and creep back up to normal speed and stay there until NFS strikes > again. It's like the kernel is caching some sort of hint that > connectivity to that other host sucks, and it has to either expire or > be slowly overcome.
> > Client: > > $ iperf -c 172.20.20.162 -t 60 > > ------------------------------------------------------------ > > Client connecting to 172.20.20.162, TCP port 5001 > > TCP window size: 1.00 MByte (default) > > ------------------------------------------------------------ > > [ 3] local 172.20.20.169 port 59367 connected with 172.20.20.162 > port 5001 > > [ ID] Interval Transfer Bandwidth > > [ 3] 0.0-60.0 sec 56.2 GBytes 8.04 Gbits/sec > > > Server: > > $ netstat -I vtnet1 -ihw 1 > > input (vtnet1) output > > packets errs idrops bytes packets errs bytes colls > > 7 0 0 420 0 0 0 0 > > 7 0 0 420 0 0 0 0 > > 8 0 0 480 0 0 0 0 > > 8 0 0 480 0 0 0 0 > > 7 0 0 420 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 11 0 0 12k 3 0 206 0 > <--- starts here > > 17 0 0 227k 10 0 660 0 > > 17 0 0 408k 10 0 660 0 > > 17 0 0 417k 10 0 660 0 > > 17 0 0 425k 10 0 660 0 > > 17 0 0 438k 10 0 660 0 > > 17 0 0 444k 10 0 660 0 > > 16 0 0 453k 10 0 660 0 > > input (vtnet1) output > > packets errs idrops bytes packets errs bytes colls > > 16 0 0 463k 10 0 660 0 > > 16 0 0 469k 10 0 660 0 > > 16 0 0 482k 10 0 660 0 > > 16 0 0 487k 10 0 660 0 > > 16 0 0 496k 10 0 660 0 > > 16 0 0 504k 10 0 660 0 > > 18 0 0 510k 10 0 660 0 > > 16 0 0 521k 10 0 660 0 > > 17 0 0 524k 10 0 660 0 > > 17 0 0 538k 10 0 660 0 > > 17 0 0 540k 10 0 660 0 > > 17 0 0 552k 10 0 660 0 > > 17 0 0 554k 10 0 660 0 > > 17 0 0 567k 10 0 660 0 > > 16 0 0 568k 10 0 660 0 > > 16 0 0 581k 10 0 660 0 > > 16 0 0 582k 10 0 660 0 > > 16 0 0 595k 10 0 660 0 > > 16 0 0 595k 10 0 660 0 > > 16 0 0 609k 10 0 660 0 > > 16 0 0 609k 10 0 660 0 > > input (vtnet1) output > > packets errs idrops bytes packets errs bytes colls > > 16 0 0 620k 10 0 660 0 > > 16 0 0 623k 10 0 660 0 > > 17 0 0 632k 10 0 660 0 > > 17 0 0 637k 10 0 660 0 > > 8.7k 0 0 389M 4.4k 0 288k 0 > > 42k 0 0 2.1G 21k 0 1.4M 0 > > 41k 0 0 
2.1G 20k 0 1.4M 0 > > 38k 0 0 1.9G 19k 0 1.2M 0 > > 40k 0 0 2.0G 20k 0 1.3M 0 > > 40k 0 0 2.0G 20k 0 1.3M 0 > > 40k 0 0 2G 20k 0 1.3M 0 > > 39k 0 0 2G 20k 0 1.3M 0 > > 43k 0 0 2.2G 22k 0 1.4M 0 > > 42k 0 0 2.2G 21k 0 1.4M 0 > > 39k 0 0 2G 19k 0 1.3M 0 > > 38k 0 0 1.9G 19k 0 1.2M 0 > > 42k 0 0 2.1G 21k 0 1.4M 0 > > 44k 0 0 2.2G 22k 0 1.4M 0 > > 41k 0 0 2.1G 20k 0 1.3M 0 > > 41k 0 0 2.1G 21k 0 1.4M 0 > > 40k 0 0 2.0G 20k 0 1.3M 0 > > input (vtnet1) output > > packets errs idrops bytes packets errs bytes colls > > 43k 0 0 2.2G 22k 0 1.4M 0 > > 41k 0 0 2.1G 20k 0 1.3M 0 > > 40k 0 0 2.0G 20k 0 1.3M 0 > > 42k 0 0 2.2G 21k 0 1.4M 0 > > 39k 0 0 2G 19k 0 1.3M 0 > > 42k 0 0 2.1G 21k 0 1.4M 0 > > 40k 0 0 2.0G 20k 0 1.3M 0 > > 42k 0 0 2.1G 21k 0 1.4M 0 > > 38k 0 0 2G 19k 0 1.3M 0 > > 39k 0 0 2G 20k 0 1.3M 0 > > 45k 0 0 2.3G 23k 0 1.5M 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > > It almost looks like something is limiting it to 10 packets per > second. So confusing! TCP super slow start? > > Thanks! > > (Sorry Rick, forgot to reply all so you got an extra! 
:( ) > > Also, here's the netstat from the client side showing the 10 packets > per second limit and eventual recovery: > > $ netstat -I net1 -ihw 1 > > input (net1) output > > packets errs idrops bytes packets errs bytes colls > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 15 0 0 962 11 0 114k 0 > > 17 0 0 1.1k 10 0 368k 0 > > 17 0 0 1.1k 10 0 411k 0 > > 17 0 0 1.1k 10 0 425k 0 > > 17 0 0 1.1k 10 0 432k 0 > > 17 0 0 1.1k 10 0 439k 0 > > 17 0 0 1.1k 10 0 452k 0 > > 16 0 0 1k 10 0 457k 0 > > 16 0 0 1k 10 0 467k 0 > > 16 0 0 1k 10 0 477k 0 > > 16 0 0 1k 10 0 481k 0 > > 16 0 0 1k 10 0 495k 0 > > 16 0 0 1k 10 0 498k 0 > > 16 0 0 1k 10 0 510k 0 > > 16 0 0 1k 10 0 515k 0 > > 16 0 0 1k 10 0 524k 0 > > 17 0 0 1.1k 10 0 532k 0 > > input (net1) output > > packets errs idrops bytes packets errs bytes colls > > 17 0 0 1.1k 10 0 538k 0 > > 17 0 0 1.1k 10 0 548k 0 > > 17 0 0 1.1k 10 0 552k 0 > > 17 0 0 1.1k 10 0 562k 0 > > 17 0 0 1.1k 10 0 566k 0 > > 16 0 0 1k 10 0 576k 0 > > 16 0 0 1k 10 0 580k 0 > > 16 0 0 1k 10 0 590k 0 > > 17 0 0 1.1k 10 0 594k 0 > > 16 0 0 1k 10 0 603k 0 > > 16 0 0 1k 10 0 609k 0 > > 16 0 0 1k 10 0 614k 0 > > 16 0 0 1k 10 0 623k 0 > > 16 0 0 1k 10 0 626k 0 > > 17 0 0 1.1k 10 0 637k 0 > > 18 0 0 1.1k 10 0 637k 0 > > 17k 0 0 1.1M 34k 0 1.7G 0 > > 21k 0 0 1.4M 42k 0 2.1G 0 > > 20k 0 0 1.3M 39k 0 2G 0 > > 19k 0 0 1.2M 38k 0 1.9G 0 > > 20k 0 0 1.3M 41k 0 2.0G 0 > > input (net1) output > > packets errs idrops bytes packets errs bytes colls > > 20k 0 0 1.3M 40k 0 2.0G 0 > > 19k 0 0 1.2M 38k 0 1.9G 0 > > 22k 0 0 1.5M 45k 0 2.3G 0 > > 20k 0 0 1.3M 40k 0 2.1G 0 > > 20k 0 0 1.3M 40k 0 2.1G 0 > > 18k 0 0 1.2M 36k 0 1.9G 0 > > 21k 0 0 1.4M 41k 0 2.1G 0 > > 22k 0 0 1.4M 44k 0 2.2G 0 > > 21k 0 0 1.4M 43k 0 2.2G 0 > > 20k 0 0 1.3M 41k 0 2.1G 0 > > 20k 0 0 1.3M 40k 0 2.0G 0 > > 21k 0 0 1.4M 43k 0 2.2G 0 > > 21k 0 0 1.4M 43k 0 2.2G 0 > > 20k 0 0 1.3M 40k 0 2.0G 0 > > 21k 0 0 1.4M 43k 0 2.2G 0 > > 19k 0 0 1.2M 38k 0 1.9G 0 > > 21k 
0 0 1.4M 42k 0 2.1G 0 > > 20k 0 0 1.3M 40k 0 2.0G 0 > > 21k 0 0 1.4M 42k 0 2.1G 0 > > 20k 0 0 1.3M 40k 0 2.0G 0 > > 20k 0 0 1.3M 40k 0 2.0G 0 > > input (net1) output > > packets errs idrops bytes packets errs bytes colls > > 24k 0 0 1.6M 48k 0 2.5G 0 > > 6.3k 0 0 417k 12k 0 647M 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > > 6 0 0 360 0 0 0 0 > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Tue Jan 21 07:19:31 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2A1468C8 for ; Tue, 21 Jan 2014 07:19:31 +0000 (UTC) Received: from mail-we0-x234.google.com (mail-we0-x234.google.com [IPv6:2a00:1450:400c:c03::234]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id BDAAE1A4B for ; Tue, 21 Jan 2014 07:19:30 +0000 (UTC) Received: by mail-we0-f180.google.com with SMTP id q59so7653709wes.39 for ; Mon, 20 Jan 2014 23:19:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=JUh5VKmi6mL3lXcmi5hati2UoUd/PDqQaocso/Vy1nM=; b=psNbmZUpRpltuDNthqqLi/mj1zWwXeQPLSpgQbm1UrhmkQLclWJaMNppan5yOwrVFT gYiL6QK7hOfaPSL78JuXbqpLZAy7ym5h8qAkMUSU9t2iE8hYynvm0hi+djIbjPDCDVzz Pvohz2Q8iKFT4jXgsmuu+KxQOiAkL7Z2LscwWVW7fMoexX9nIOQYBFvdktil+voLqK5+ 9NJYzwdAZ158/oeT6Jr/VrD6xCZwZC08OG3N9bwO2rfjXf8A3rdQAIlVys6eher820xD v62LTCUFVtlkum5JawJIo0jPqSBJepY/8wdP0nBB2pmSpubKl53BOP+j3KOQAotxDCTO pdaA== MIME-Version: 1.0 X-Received: by 10.194.104.39 with SMTP id 
gb7mr60543wjb.69.1390288768804; Mon, 20 Jan 2014 23:19:28 -0800 (PST) Received: by 10.194.29.163 with HTTP; Mon, 20 Jan 2014 23:19:28 -0800 (PST) Date: Tue, 21 Jan 2014 07:19:28 +0000 Message-ID: Subject: Netmap in FreeBSD 10 From: "C. L. Martinez" To: freebsd-net@freebsd.org Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 21 Jan 2014 07:19:31 -0000 Hi all, Is netmap enabled by default under FreeBSD 10 or do I need to recompile GENERIC kernel using "device netmap" option?? Thanks. From owner-freebsd-net@FreeBSD.ORG Tue Jan 21 13:34:03 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 83E57F11; Tue, 21 Jan 2014 13:34:03 +0000 (UTC) Received: from cell.glebius.int.ru (glebius.int.ru [81.19.69.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 0316B1DC0; Tue, 21 Jan 2014 13:34:02 +0000 (UTC) Received: from cell.glebius.int.ru (localhost [127.0.0.1]) by cell.glebius.int.ru (8.14.7/8.14.7) with ESMTP id s0LDXxVL068199 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Tue, 21 Jan 2014 17:33:59 +0400 (MSK) (envelope-from glebius@FreeBSD.org) Received: (from glebius@localhost) by cell.glebius.int.ru (8.14.7/8.14.7/Submit) id s0LDXxVP068198; Tue, 21 Jan 2014 17:33:59 +0400 (MSK) (envelope-from glebius@FreeBSD.org) X-Authentication-Warning: cell.glebius.int.ru: glebius set sender to glebius@FreeBSD.org using -f Date: Tue, 21 Jan 2014 17:33:59 +0400 From: Gleb Smirnoff To: Olivier =?iso-8859-1?Q?Cochard-Labb=E9?= Subject: Re: Regression 
on 10-RC5 with a multicast routing daemon Message-ID: <20140121133359.GB66160@glebius.int.ru> References: <20140115113430.GK26504@FreeBSD.org> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="yEPQxsgoJgBvi8ip" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.22 (2013-10-16) Cc: "freebsd-net@freebsd.org" , "freebsd-current@freebsd.org" , Andre Oppermann X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 21 Jan 2014 13:34:03 -0000 --yEPQxsgoJgBvi8ip Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit On Sun, Jan 19, 2014 at 02:42:32AM +0100, Olivier Cochard-Labbé wrote: O> > Olivier, O> > O> > O> > TL;DR version: you need not subtract iphdrlen in 10.0. Code in O> > igmp.c:accept_igmp() O> > should be something like: O> > O> > iphdrlen = ip->ip_hl << 2; O> > #ifdef RAW_INPUT_IS_RAW /* Linux */ O> > ipdatalen = ntohs(ip->ip_len) - iphdrlen; O> > #else O> > #if __FreeBSD_version >= 1000000 O> > ipdatalen = ip->ip_len - iphdrlen; O> > #else O> > ipdatalen = ip->ip_len; O> > #endif O> > #endif O> > O> > O> With this patch I no longer get the message "warning - Received packet from O> x.x.x.x shorter (28 bytes) than hdr+data length (20+28)": Thanks! O> But there is still a regression regarding the PIM socket behavior, not O> related to the packet format. O> pim.c includes two functions (pim_read and pim_accept) that are called O> when the socket receives a packet: these functions are never triggered when O> PIM packets are received on 10.0. O> At the same time, igmp_read() and igmp_accept() are correctly triggered on O> 9.2 and 10.0.
O> tcpdump in non-promiscuous mode correctly sees incoming PIM packets: this O> should confirm that once this daemon is started, it correctly opens a PIM O> socket and the multicast filter is updated. Can you please try this patch to the kernel? If it doesn't work, can you please gather ktr(4) information with KTR_IPMF compiled into the kernel? -- Totus tuus, Glebius. --yEPQxsgoJgBvi8ip Content-Type: text/x-diff; charset=us-ascii Content-Disposition: attachment; filename="ip_mroute.c.diff" Index: sys/netinet/ip_mroute.c =================================================================== --- sys/netinet/ip_mroute.c (revision 260904) +++ sys/netinet/ip_mroute.c (working copy) @@ -2557,14 +2557,13 @@ pim_encapcheck(const struct mbuf *m, int off, int * is passed to if_simloop(). */ void -pim_input(struct mbuf *m, int off) +pim_input(struct mbuf *m, int iphlen) { struct ip *ip = mtod(m, struct ip *); struct pim *pim; int minlen; - int datalen = ntohs(ip->ip_len); + int datalen = ntohs(ip->ip_len) - iphlen; int ip_tos; - int iphlen = off; /* Keep statistics */ PIMSTAT_INC(pims_rcv_total_msgs); @@ -2594,8 +2593,7 @@ void * Get the IP and PIM headers in contiguous memory, and * possibly the PIM REGISTER header.
*/ - if ((m->m_flags & M_EXT || m->m_len < minlen) && - (m = m_pullup(m, minlen)) == 0) { + if (m->m_len < minlen && (m = m_pullup(m, minlen)) == 0) { CTR1(KTR_IPMF, "%s: m_pullup() failed", __func__); return; } --yEPQxsgoJgBvi8ip-- From owner-freebsd-net@FreeBSD.ORG Tue Jan 21 17:26:40 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 3E7D8935; Tue, 21 Jan 2014 17:26:39 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher ADH-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id B18B813DE; Tue, 21 Jan 2014 17:26:39 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 97835B94C; Tue, 21 Jan 2014 12:26:38 -0500 (EST) From: John Baldwin To: freebsd-current@freebsd.org Subject: Re: Regression on 10-RC5 with a multicast routing daemon Date: Tue, 21 Jan 2014 11:48:53 -0500 User-Agent: KMail/1.13.5 (FreeBSD/8.4-CBSD-20130906; KDE/4.5.5; amd64; ; ) References: <20140115113430.GK26504@FreeBSD.org> In-Reply-To: <20140115113430.GK26504@FreeBSD.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201401211148.53400.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 21 Jan 2014 12:26:38 -0500 (EST) Cc: Olivier =?iso-8859-1?q?Cochard-Labb=E9?= , "freebsd-net@freebsd.org" , andre@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 21 Jan 2014 17:26:40 -0000 On 
Wednesday, January 15, 2014 6:34:30 am Gleb Smirnoff wrote: > Damn, what a mess. I'd like to go towards absolutely unmodified packets > for the 11-release cycle. I agree. -- John Baldwin From owner-freebsd-net@FreeBSD.ORG Wed Jan 22 08:31:35 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id F2918E43; Wed, 22 Jan 2014 08:31:34 +0000 (UTC) Received: from mail-ve0-x22b.google.com (mail-ve0-x22b.google.com [IPv6:2607:f8b0:400c:c01::22b]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 72C841E61; Wed, 22 Jan 2014 08:31:34 +0000 (UTC) Received: by mail-ve0-f171.google.com with SMTP id pa12so40993veb.16 for ; Wed, 22 Jan 2014 00:31:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=hF981SH8BOmn81Wa6JBUy//NpZ/kX1amiX/HoLf1WdY=; b=ZFIZYKdq8FpFP95W1kAbZOGs+LWRnrT4ZDk1pNUM2QSOSBtbcfO3uwXM1VUvhZaGss 3Mt7ICdMkqfm0VP22oZSq3cGST11jV6g7VFFi8n380c0dk416adHjOz6+HD9PteCy+Pc rs1N9DKvn2Ysy42vy95gyKsXqL87fpD7daY+AptAMmAvXKn5eoUT5gVyxdEUV64qlPmm D+dkAZmj6INRKE4b6QCGtPquyd3ffuZ4VSuhext88Ff1z32Sn6Mjr7fYdF1la5u6X9Yi 8zXLNDqdq2hEWHi5/hFsKuu8+GO2MvR88Kfajg+cjKXfBxYPvlc3RoXUFtjmuopOFLkj UEEA== X-Received: by 10.58.90.202 with SMTP id by10mr109539veb.6.1390379493530; Wed, 22 Jan 2014 00:31:33 -0800 (PST) MIME-Version: 1.0 Sender: cochard@gmail.com Received: by 10.58.171.1 with HTTP; Wed, 22 Jan 2014 00:31:13 -0800 (PST) In-Reply-To: <20140121133359.GB66160@glebius.int.ru> References: <20140115113430.GK26504@FreeBSD.org> <20140121133359.GB66160@glebius.int.ru> From: =?ISO-8859-1?Q?Olivier_Cochard=2DLabb=E9?= Date: Wed, 22 Jan 2014 09:31:13 +0100 
X-Google-Sender-Auth: rR4OZUMGUahuzXByUyzVbNZoN-E Message-ID: Subject: Re: Regression on 10-RC5 with a multicast routing daemon To: Gleb Smirnoff Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.17 Cc: "freebsd-net@freebsd.org" , "freebsd-current@freebsd.org" , Andre Oppermann X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 22 Jan 2014 08:31:35 -0000 On Tue, Jan 21, 2014 at 2:33 PM, Gleb Smirnoff wrote: > On Sun, Jan 19, 2014 at 02:42:32AM +0100, Olivier Cochard-Labbé wrote: > O> But there is still a regression regarding the PIM socket behavior not > O> related to the packet format. > > Can you please try this patch to the kernel? > > Lots better with your patch: this fixes the problem with PIM messages. Thanks! From owner-freebsd-net@FreeBSD.ORG Wed Jan 22 17:03:40 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 7453CF6 for ; Wed, 22 Jan 2014 17:03:40 +0000 (UTC) Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com [209.85.214.182]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 3C32E1D56 for ; Wed, 22 Jan 2014 17:03:39 +0000 (UTC) Received: by mail-ob0-f182.google.com with SMTP id wm4so737377obc.27 for ; Wed, 22 Jan 2014 09:03:39 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type; bh=2/6WHfRmk+uKNGQxoN6p3CvDLKyceBvvGeHu+THnDBE=;
b=Trx2EfdO9sIrVtbPsTYzoGxb1SOfWCX9llQZxxTEs9OQldzkxmOayEhhSRBZn3cFwL WZ+DulPOhADEw9Gc+CRRlVpg45IQaloElY3TkQC4IOODI4vHOWQMzIF1Z8hbcGIuPNA+ eRbTyTAL6Wz234eF4hBPoV5bs4YXwQVVFkWIODwlXw6xIecv46I9/wD3vXZTz271pKYF auFV1lssaWlK+vSeTA16Mm7aKhhfb+rTI1k3xvpmcw4YR6nTuD6rZZGo6cqlHo3mC6iU D7eVC2gqx974cHlF1s7Isk56LZHgpIY0/okGzZGqdBvgKNVqPF4tUb32JrkWwJHTO1UX 0eHw== X-Gm-Message-State: ALoCoQmCAIWVMONK0b9x9m9N97k5XbTQMmragtzrGXAHbJUJnTJwwyUdoA0DB2G/sAdxHKmO4rpR MIME-Version: 1.0 X-Received: by 10.182.29.33 with SMTP id g1mr2166098obh.59.1390410219103; Wed, 22 Jan 2014 09:03:39 -0800 (PST) Received: by 10.76.197.9 with HTTP; Wed, 22 Jan 2014 09:03:38 -0800 (PST) In-Reply-To: References: Date: Wed, 22 Jan 2014 12:03:38 -0500 Message-ID: Subject: Fwd: nmap not moving on after getting reset packets From: Daniel Malament To: freebsd-net@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 22 Jan 2014 17:03:40 -0000 In the course of trying to do some comprehensive scans at work, I discovered the following behavior: nmap on SOURCE: # nmap -Pn -sS -v -n --scan-delay 20ms -p 1-65535 TARGET Starting Nmap 6.25 ( http://nmap.org ) at 2014-01-21 15:18 EST Initiating SYN Stealth Scan at 15:18 Scanning TARGET [65535 ports] Discovered open port 443/tcp on TARGET Discovered open port 80/tcp on TARGET Increasing send delay for TARGET from 20 to 40 due to max_successful_tryno increase to 4 Increasing send delay for TARGET from 40 to 80 due to 11 out of 16 dropped probes since last increase. Increasing send delay for TARGET from 80 to 160 due to max_successful_tryno increase to 5 Increasing send delay for TARGET from 160 to 320 due to 11 out of 29 dropped probes since last increase. 
SYN Stealth Scan Timing: About 2.67% done; ETC: 15:37 (0:18:51 remaining) [ctrl-c] tcpdump on SOURCE: 13:28:36.188904 IP SOURCE.59292 > TARGET.46181: Flags [S], seq 936512329, win 1024, options [mss 1460], length 0 13:28:36.209829 IP TARGET.46181 > SOURCE.59292: Flags [R.], seq 0, ack 936512330, win 1024, length 0 13:28:36.349905 IP SOURCE.59293 > TARGET.46181: Flags [S], seq 936577864, win 1024, options [mss 1460], length 0 13:28:36.370895 IP TARGET.46181 > SOURCE.59293: Flags [R.], seq 0, ack 936577865, win 1024, length 0 13:28:36.511905 IP SOURCE.59294 > TARGET.46181: Flags [S], seq 936381259, win 1024, options [mss 1460], length 0 13:28:36.537232 IP TARGET.46181 > SOURCE.59294: Flags [R.], seq 0, ack 936381260, win 1024, length 0 13:28:36.673905 IP SOURCE.59295 > TARGET.46181: Flags [S], seq 936446794, win 1024, options [mss 1460], length 0 13:28:36.694258 IP TARGET.46181 > SOURCE.59295: Flags [R.], seq 0, ack 936446795, win 1024, length 0 I'm checking on the nmap lists to see if this is expected behavior, but is it possible that something in the network stack is eating these packets between tcpdump and nmap? This is Nmap 6.25 on FreeBSD 9.2. PS: Adding --max-rtt-timeout 600ms --max-scan-delay 600ms made no difference. From owner-freebsd-net@FreeBSD.ORG Wed Jan 22 18:43:26 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id C9FA67DF for ; Wed, 22 Jan 2014 18:43:26 +0000 (UTC) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 8ABC21673 for ; Wed, 22 Jan 2014 18:43:26 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 5C09E7300A; Wed, 22 Jan 2014 19:45:52 +0100 (CET) Date: Wed, 22 Jan 2014 19:45:52 +0100 From: Luigi Rizzo To: "C. L. 
Martinez" Subject: Re: Netmap in FreeBSD 10 Message-ID: <20140122184552.GB98322@onelab2.iet.unipi.it> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 22 Jan 2014 18:43:26 -0000 On Tue, Jan 21, 2014 at 07:19:28AM +0000, C. L. Martinez wrote: > Hi all, > > Is netmap enabled by default under FreeBSD 10 or do I need to > recompile GENERIC kernel using "device netmap" option?? you need to recompile the kernel (actually just the netmap module and device driver modules if you do not have them compiled in). I also suggest to update the netmap code to the one in head, which has more features and bugfixes. cheers luigi From owner-freebsd-net@FreeBSD.ORG Wed Jan 22 19:58:57 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 65049F57 for ; Wed, 22 Jan 2014 19:58:57 +0000 (UTC) Received: from system.jails.se (system.jails.se [IPv6:2001:16d8:cc1e:1::1]) (using TLSv1 with cipher ADH-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 1E6BB1D8A for ; Wed, 22 Jan 2014 19:58:56 +0000 (UTC) Received: from localhost (system.jails.se [91.205.63.85]) by system.jails.se (Postfix) with SMTP id 8F35C332FFA for ; Wed, 22 Jan 2014 20:58:49 +0100 (CET) Received: from klein.pean.org (klein.pean.org [IPv6:2001:16d8:ff9f::60]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by system.jails.se (Postfix) with ESMTPSA id 1AC2D332FF7; Wed, 22 Jan 2014 
20:58:45 +0100 (CET) From: =?windows-1252?Q?Peter_Ankerst=E5l?= Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable Subject: Support for NX3031. Date: Wed, 22 Jan 2014 20:58:44 +0100 Message-Id: <14AA735D-B9F7-4F3A-B711-DBBB5BBC54C4@pean.org> To: FreeBSD Net Mime-Version: 1.0 (Mac OS X Mail 7.1 \(1827\)) X-Mailer: Apple Mail (2.1827) Cc: david.somayajulu@qlogic.com X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 22 Jan 2014 19:58:57 -0000 Hi! I've noticed a few QLogic drivers appeared in FreeBSD 10, but none of them supports my card, which is a quad 1G card:

none5@pci0:1:0:2: class=0x020000 card=0x1740103c chip=0x01004040 rev=0x42 hdr=0x00
    vendor = 'NetXen Incorporated'
    device = 'NX3031 Multifunction 1/10-Gigabit Server Adapter'
    class = network
    subclass = ethernet

Are there any plans for such a card in the near future? Thanks!
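The pciconf(8) "chip" field in output like the above packs both PCI IDs into one 32-bit word: the device ID in the high 16 bits and the vendor ID in the low 16 bits, which is the pair a driver's probe routine matches against. A minimal sketch of splitting that field (the helper name parse_chip is hypothetical, not part of any FreeBSD tool; the sample line is the quoted-printable-decoded pciconf output from the message above):

```python
import re

def parse_chip(line):
    """Split a pciconf(8) chip field into PCI device and vendor IDs.

    Layout: chip=0xDDDDVVVV -- device ID in the high 16 bits,
    vendor ID in the low 16 bits. Returns None if no chip field.
    """
    m = re.search(r'chip=0x([0-9a-fA-F]{8})', line)
    if m is None:
        return None
    chip = int(m.group(1), 16)
    return {"device": chip >> 16, "vendor": chip & 0xFFFF}

ids = parse_chip("none5@pci0:1:0:2: class=0x020000 card=0x1740103c "
                 "chip=0x01004040 rev=0x42 hdr=0x00")
# 0x01004040 -> device 0x0100, vendor 0x4040
```

Here the vendor ID 0x4040 belongs to NetXen, so a driver would have to claim that vendor/device pair for the NX3031 to attach.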
From owner-freebsd-net@FreeBSD.ORG Wed Jan 22 20:19:09 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 302554ED for ; Wed, 22 Jan 2014 20:19:09 +0000 (UTC) Received: from mx0a-0016ce01.pphosted.com (mx0a-0016ce01.pphosted.com [67.231.148.157]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 0F9BE1F24 for ; Wed, 22 Jan 2014 20:19:08 +0000 (UTC) Received: from pps.filterd (m0045602.ppops.net [127.0.0.1]) by mx0a-0016ce01.pphosted.com (8.14.5/8.14.5) with SMTP id s0MK9WNe010782; Wed, 22 Jan 2014 12:09:47 -0800 Received: from avcashub1.qlogic.com (avcashub3.qlogic.com [198.70.193.117]) by mx0a-0016ce01.pphosted.com with ESMTP id 1hfp0nqcy5-1 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NOT); Wed, 22 Jan 2014 12:09:47 -0800 Received: from AVMB1.qlogic.org ([fe80::c919:8cc:f3ba:c727]) by avcashub3.qlogic.org ([::1]) with mapi id 14.02.0318.001; Wed, 22 Jan 2014 12:09:46 -0800 From: David Somayajulu To: =?iso-8859-1?Q?Peter_Ankerst=E5l?= , FreeBSD Net Subject: RE: Support for NX3031. Thread-Topic: Support for NX3031. 
Thread-Index: AQHPF6xnWEH7EW5Gm0eH6mZqyHlEaJqRKYtw Date: Wed, 22 Jan 2014 20:09:45 +0000 Message-ID: <49F5640B08EAA94DAF2F6B6145E6A08A8990B2BD@AVMB1.qlogic.org> References: <14AA735D-B9F7-4F3A-B711-DBBB5BBC54C4@pean.org> In-Reply-To: <14AA735D-B9F7-4F3A-B711-DBBB5BBC54C4@pean.org> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.1.4.10] Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5600 definitions=7326 signatures=668897 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 suspectscore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=7.0.1-1305240000 definitions=main-1401220146 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 22 Jan 2014 20:19:09 -0000 Hi Peter, We do not have any plans to support the "NX3031" adapter. This product is old and has reached its End Of Life. Cheers David S. -----Original Message----- From: Peter Ankerstål [mailto:peter@pean.org] Sent: Wednesday, January 22, 2014 11:59 AM To: FreeBSD Net Cc: David Somayajulu Subject: Support for NX3031. Hi! I've noticed a few QLogic drivers appeared in FreeBSD 10, but none of them supports my card, which is a quad 1G card:

none5@pci0:1:0:2: class=0x020000 card=0x1740103c chip=0x01004040 rev=0x42 hdr=0x00
    vendor = 'NetXen Incorporated'
    device = 'NX3031 Multifunction 1/10-Gigabit Server Adapter'
    class = network
    subclass = ethernet

Are there any plans for such a card in the near future? Thanks! ________________________________ This message and any attached documents contain information from QLogic Corporation or its wholly-owned subsidiaries that may be confidential.
If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message. From owner-freebsd-net@FreeBSD.ORG Wed Jan 22 20:23:19 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 615E65E3 for ; Wed, 22 Jan 2014 20:23:19 +0000 (UTC) Received: from system.jails.se (system.jails.se [91.205.63.85]) (using TLSv1 with cipher ADH-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id DCAB31FA3 for ; Wed, 22 Jan 2014 20:23:16 +0000 (UTC) Received: from localhost (system.jails.se [91.205.63.85]) by system.jails.se (Postfix) with SMTP id DCDA5331199 for ; Wed, 22 Jan 2014 21:23:03 +0100 (CET) Received: from [172.25.0.153] (h148n9-u-a31.ias.bredband.telia.com [213.67.100.148]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by system.jails.se (Postfix) with ESMTPSA id 53B5D331198; Wed, 22 Jan 2014 21:23:03 +0100 (CET) Content-Type: multipart/signed; boundary=Apple-Mail-706A46A1-54F3-42A0-8782-EDA0AECC9E47; protocol="application/pkcs7-signature"; micalg=sha1 Content-Transfer-Encoding: 7bit From: =?utf-8?Q?Peter_Ankerst=C3=A5l?= Mime-Version: 1.0 (1.0) Subject: Re: Support for NX3031.
Date: Wed, 22 Jan 2014 21:11:53 +0100 Message-Id: <3C1183B5-8C0C-40CC-9371-DF75DEA9D609@pean.org> References: <14AA735D-B9F7-4F3A-B711-DBBB5BBC54C4@pean.org> <49F5640B08EAA94DAF2F6B6145E6A08A8990B2BD@AVMB1.qlogic.org> In-Reply-To: <49F5640B08EAA94DAF2F6B6145E6A08A8990B2BD@AVMB1.qlogic.org> To: David Somayajulu X-Mailer: iPhone Mail (11B554a) Cc: FreeBSD Net X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 22 Jan 2014 20:23:19 -0000 --Apple-Mail-706A46A1-54F3-42A0-8782-EDA0AECC9E47 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Ah, I understand. Thanks for the quick reply! Hälsningar / regards Peter Ankerstål > On 22 jan 2014, at 21:09, David Somayajulu wrote: > > Hi Peter, > We do not have any plans to support the "NX3031" adapter. This product is old and has reached its End Of Life. > Cheers > David S. > > -----Original Message----- > From: Peter Ankerstål [mailto:peter@pean.org] > Sent: Wednesday, January 22, 2014 11:59 AM > To: FreeBSD Net > Cc: David Somayajulu > Subject: Support for NX3031. > > Hi! > > I've noticed a few QLogic drivers appeared in FreeBSD 10, but none of them supports my card, which is a quad 1G card: > > none5@pci0:1:0:2: class=0x020000 card=0x1740103c chip=0x01004040 rev=0x42 hdr=0x00 > vendor = 'NetXen Incorporated' > device = 'NX3031 Multifunction 1/10-Gigabit Server Adapter' > class = network > subclass = ethernet > > Are there any plans for such a card in the near future? > > Thanks! > > > > ________________________________ > > This message and any attached documents contain information from QLogic Corporation or its wholly-owned subsidiaries that may be confidential.
If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message. > --Apple-Mail-706A46A1-54F3-42A0-8782-EDA0AECC9E47 Content-Type: application/pkcs7-signature; name=smime.p7s Content-Disposition: attachment; filename=smime.p7s Content-Transfer-Encoding: base64 [base64 S/MIME signature omitted] --Apple-Mail-706A46A1-54F3-42A0-8782-EDA0AECC9E47-- From owner-freebsd-net@FreeBSD.ORG Wed Jan 22 23:52:03 2014
Delivered-To: net@freebsd.org
Message-ID: <21216.22944.314697.179039@hergotha.csail.mit.edu>
Date: Wed, 22 Jan 2014 18:52:00 -0500
From: Garrett Wollman
To: net@freebsd.org
Subject: Use of contiguous physical memory in cxgbe driver
List-Id: Networking and TCP/IP with FreeBSD

At this point everyone is well aware that requiring contiguous physical
pages when the hardware can do scatter-gather is a very bad idea.  I have a
new server under test (running 9.2) which uses Chelsio rather than Intel
10G Ethernet controllers.  We fixed the Intel driver not to use
more-than-page-sized jumbo mbufs.  Can anyone say with certainty whether
the Chelsio hardware actually requires physically-contiguous allocations
for jumbo frames?  Has this already been fixed in a more recent driver?
(The hardware is identified specifically as a T420-CR.)

-GAWollman

From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 00:23:35 2014
Received: from [10.192.166.0]
(stargate.chelsio.com. [67.207.112.58]) by mx.google.com with ESMTPSA; Wed, 22 Jan 2014 16:23:33 -0800 (PST)
Sender: Navdeep Parhar
Message-ID: <52E06104.70600@FreeBSD.org>
Date: Wed, 22 Jan 2014 16:23:32 -0800
From: Navdeep Parhar
To: Garrett Wollman, net@freebsd.org
Subject: Re: Use of contiguous physical memory in cxgbe driver
In-Reply-To: <21216.22944.314697.179039@hergotha.csail.mit.edu>
List-Id: Networking and TCP/IP with FreeBSD

On 01/22/14 15:52, Garrett Wollman wrote:
> At this point everyone is well aware that requiring contiguous
> physical pages when the hardware can do scatter-gather is a very bad
> idea.

I wouldn't put it this way.  Using buffers with size > PAGE_SIZE has its
advantages.

> I have a new server under test (running 9.2) which uses Chelsio
> rather than Intel 10G Ethernet controllers.  We fixed the Intel driver
> not to use more-than-page-sized jumbo mbufs.  Can anyone say with
> certainty whether the Chelsio hardware actually requires
> physically-contiguous allocations for jumbo frames?

No, it doesn't.  It will do scatter DMA if the next available rx
buffer's size is less than the size of the incoming frame.

> Has this already
> been fixed in a more recent driver?  (The hardware is identified
> specifically as a T420-CR.)

I'm not sure there's anything to fix here, but I can add a knob that
would limit the driver to a maximum of PAGE_SIZE sized rx buffers.
Regards,
Navdeep

From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 00:48:30 2014
Delivered-To: freebsd-net@freebsd.org
Sender: jdavidlists@gmail.com
In-Reply-To: <800819196.13362204.1390267642625.JavaMail.root@uoguelph.ca>
Date: Wed, 22 Jan 2014 19:48:29 -0500
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
From: J David
To: Rick Macklem
Cc: freebsd-net@freebsd.org
List-Id: Networking and TCP/IP with FreeBSD

On Mon, Jan 20, 2014 at 8:27 PM, Rick Macklem wrote:
>>                                           random   random
>>    write  rewrite     read    reread        read    write
>>
>> S:FBSD,C:FBSD,Z:64k
>>    67246     2923   103295  1272407      172475      196
>>
>> S:FBSD,C:FBSD,Z:32k
>>    11951    99896   223787  1051948      223276    13686

> Since I've never used the benchmark you're using, I'll admit I have
> no idea what these numbers mean? (Are big values fast or slow or???)

These numbers are kiB/sec read or written.  Larger is better.

> I'd be looking at a packet trace in wireshark and looking for TCP retransmits
> and delays between packets. However, any significant improvement (like 50% or
> more faster for a smaller I/O size) indicates that something is broken in the
> network (in your case virtual) fabric, imho.

If Debian were also affected in the same environment, I would tend to
agree.  As it is not, the problem seems to be pretty squarely on FreeBSD's
shoulders, rather than the environment.

> (You saw that the smaller I/O
> size results in more RPCs, so that would suggest slower, not faster.

It's not clear to me what this refers to.  More RPCs would only indicate
slower for the same throughput level, right?  I.e. nominally you would
expect to see twice as many RPCs for 32k as for 64k.  But if 32k is
transferring 200x faster than 64k, the net effect would be 100x more
RPCs, right?

The packet traces for the 64k size are pretty interesting.
On the sequential write test, the client packets start out very small, and
slowly scale up to just under 64k, then they abruptly fall to 1500 bytes
and sit there for a good long while:

00:32:42.156107 IP (tos 0x0, ttl 64, id 5352, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:42.255894 IP (tos 0x0, ttl 64, id 46951, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a (incorrect -> 0xa59c), seq 15072, ack 5909085, win 29127, options [nop,nop,TS val 3447942732 ecr 316758762], length 0
00:32:42.256078 IP (tos 0x0, ttl 64, id 5353, offset 0, flags [DF], proto TCP (6), length 5844) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:42.355958 IP (tos 0x0, ttl 64, id 46953, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a (incorrect -> 0x8e34), seq 15072, ack 5914877, win 29127, options [nop,nop,TS val 3447942832 ecr 316758862], length 0
00:32:42.356152 IP (tos 0x0, ttl 64, id 5354, offset 0, flags [DF], proto TCP (6), length 8740) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:42.455891 IP (tos 0x0, ttl 64, id 46954, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a (incorrect -> 0x6b7c), seq 15072, ack 5923565, win 29127, options [nop,nop,TS val 3447942932 ecr 316758962], length 0
00:32:42.456036 IP (tos 0x0, ttl 64, id 5355, offset 0, flags [DF], proto TCP (6), length 11636) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:42.456185 IP (tos 0x0, ttl 64, id 46955, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874728916: reply ok 160
00:32:42.456370 IP (tos 0x0, ttl 64, id 5356, offset 0, flags [DF], proto TCP (6), length 14532) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:42.555949 IP (tos 0x0, ttl 64, id 46957, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a (incorrect -> 0x0440), seq 15236, ack 5949629, win 29127, options [nop,nop,TS val 3447943032 ecr 316759062], length 0
00:32:42.556139 IP (tos 0x0, ttl 64, id 5357, offset 0, flags [DF], proto TCP (6), length 17428) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:42.655929 IP (tos 0x0, ttl 64, id 46958, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a (incorrect -> 0xbf97), seq 15236, ack 5967005, win 29127, options [nop,nop,TS val 3447943132 ecr 316759162], length 0
00:32:42.656142 IP (tos 0x0, ttl 64, id 5358, offset 0, flags [DF], proto TCP (6), length 20324) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:42.755893 IP (tos 0x0, ttl 64, id 46959, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a (incorrect -> 0x6f9f), seq 15236, ack 5987277, win 29127, options [nop,nop,TS val 3447943232 ecr 316759262], length 0
00:32:42.756085 IP (tos 0x0, ttl 64, id 5359, offset 0, flags [DF], proto TCP (6), length 23220) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:42.756198 IP (tos 0x0, ttl 64, id 46960, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874728917: reply ok 160
00:32:42.756394 IP (tos 0x0, ttl 64, id 5360, offset 0, flags [DF], proto TCP (6), length 26116) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:42.855956 IP (tos 0x0, ttl 64, id 46961, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a (incorrect -> 0xade2), seq 15400, ack 6036509, win 29127, options [nop,nop,TS val 3447943332 ecr 316759362], length 0
00:32:42.856162 IP (tos 0x0, ttl 64, id 5361, offset 0, flags [DF], proto TCP (6), length 29012) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:42.856310 IP (tos 0x0, ttl 64, id 46962, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874728918: reply ok 160
00:32:42.856469 IP (tos 0x0, ttl 64, id 5362, offset 0, flags [DF], proto TCP (6), length 31908) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:42.955895 IP (tos 0x0, ttl 64, id 47036, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a (incorrect -> 0xbee5), seq 15564, ack 6097325, win 29127, options [nop,nop,TS val 3447943432 ecr 316759462], length 0
00:32:42.956089 IP (tos 0x0, ttl 64, id 5363, offset 0, flags [DF], proto TCP (6), length 34804) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:42.956199 IP (tos 0x0, ttl 64, id 47037, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874728919: reply ok 160
00:32:42.956404 IP (tos 0x0, ttl 64, id 5364, offset 0, flags [DF], proto TCP (6), length 37700) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.055906 IP (tos 0x0, ttl 64, id 47038, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a (incorrect -> 0xa2a8), seq 15728, ack 6169725, win 29127, options [nop,nop,TS val 3447943532 ecr 316759562], length 0
00:32:43.056068 IP (tos 0x0, ttl 64, id 5365, offset 0, flags [DF], proto TCP (6), length 40596) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.056221 IP (tos 0x0, ttl 64, id 47040, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874728921: reply ok 160
00:32:43.056400 IP (tos 0x0, ttl 64, id 5366, offset 0, flags [DF], proto TCP (6), length 43492) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.155935 IP (tos 0x0, ttl 64, id 47044, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a (incorrect -> 0x592b), seq 15892, ack 6253709, win 29127, options [nop,nop,TS val 3447943632 ecr 316759662], length 0
00:32:43.156133 IP (tos 0x0, ttl 64, id 5368, offset 0, flags [DF], proto TCP (6), length 46388) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.156273 IP (tos 0x0, ttl 64, id 47045, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874728933: reply ok 160
00:32:43.156481 IP (tos 0x0, ttl 64, id 5369, offset 0, flags [DF], proto TCP (6), length 49284) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.156632 IP (tos 0x0, ttl 64, id 47046, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874728934: reply ok 160
00:32:43.156835 IP (tos 0x0, ttl 64, id 5370, offset 0, flags [DF], proto TCP (6), length 52180) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.156953 IP (tos 0x0, ttl 64, id 47047, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874728937: reply ok 160
00:32:43.157144 IP (tos 0x0, ttl 64, id 5371, offset 0, flags [DF], proto TCP (6), length 55076) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.157285 IP (tos 0x0, ttl 64, id 47048, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874728938: reply ok 160
00:32:43.157459 IP (tos 0x0, ttl 64, id 5372, offset 0, flags [DF], proto TCP (6), length 57972) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.255927 IP (tos 0x0, ttl 64, id 47049, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a (incorrect -> 0x5baf), seq 16548, ack 6514349, win 29127, options [nop,nop,TS val 3447943732 ecr 316759762], length 0
00:32:43.256132 IP (tos 0x0, ttl 64, id 5373, offset 0, flags [DF], proto TCP (6), length 60868) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.256251 IP (tos 0x0, ttl 64, id 47050, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874728939: reply ok 160
00:32:43.485942 IP (tos 0x0, ttl 64, id 47056, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874728939: reply ok 160
00:32:43.486099 IP (tos 0x0, ttl 64, id 5376, offset 0, flags [DF], proto TCP (6), length 1500) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.486117 IP (tos 0x0, ttl 64, id 47057, offset 0, flags [DF], proto TCP (6), length 64) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x81a6 (incorrect -> 0xbaf5), seq 16712, ack 6575165, win 29127, options [nop,nop,TS val 3447943962 ecr 316759862,nop,nop,sack 1 {6638877:6640325}], length 0
00:32:43.486271 IP (tos 0x0, ttl 64, id 5377, offset 0, flags [DF], proto TCP (6), length 1500) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.486288 IP (tos 0x0, ttl 64, id 47058, offset 0, flags [DF], proto TCP (6), length 64) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x81a6 (incorrect -> 0xb54d), seq 16712, ack 6575165, win 29127, options [nop,nop,TS val 3447943962 ecr 316759862,nop,nop,sack 1 {6638877:6641773}], length 0
00:32:43.486440 IP (tos 0x0, ttl 64, id 5378, offset 0, flags [DF], proto TCP (6), length 1500) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.486470 IP (tos 0x0, ttl 64, id 47059, offset 0, flags [DF], proto TCP (6), length 64) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x81a6 (incorrect -> 0xaec2), seq 16712, ack 6576613, win 29124, options [nop,nop,TS val 3447943962 ecr 316760092,nop,nop,sack 1 {6638877:6641773}], length 0
00:32:43.486622 IP (tos 0x0, ttl 64, id 5379, offset 0, flags [DF], proto TCP (6), length 1500) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:32:43.486652 IP (tos 0x0, ttl 64, id 47060, offset 0, flags [DF], proto TCP (6), length 64) 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x81a6 (incorrect -> 0xa91a), seq 16712, ack 6578061, win 29124, options [nop,nop,TS val 3447943962 ecr 316760092,nop,nop,sack 1 {6638877:6641773}], length 0

After a while, it tries again.  Wash, rinse, repeat.
With the 32k size, they scale up to between 50000 and 59000 bytes per
packet and sit there for long periods of time, seeming to reset only after
some other intervening RPC activity:

00:38:07.932518 IP (tos 0x0, ttl 64, id 38911, offset 0, flags [DF], proto TCP (6), length 53628) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:38:07.932530 IP (tos 0x0, ttl 64, id 22486, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686965: reply ok 160
00:38:07.932622 IP (tos 0x0, ttl 64, id 22487, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686941: reply ok 160
00:38:07.932732 IP (tos 0x0, ttl 64, id 38912, offset 0, flags [DF], proto TCP (6), length 53628) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:38:07.932747 IP (tos 0x0, ttl 64, id 22488, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686968: reply ok 160
00:38:07.932857 IP (tos 0x0, ttl 64, id 22489, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686967: reply ok 160
00:38:07.932963 IP (tos 0x0, ttl 64, id 38913, offset 0, flags [DF], proto TCP (6), length 53628) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:38:07.933060 IP (tos 0x0, ttl 64, id 22490, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686946: reply ok 160
00:38:07.933120 IP (tos 0x0, ttl 64, id 22491, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686971: reply ok 160
00:38:07.933176 IP (tos 0x0, ttl 64, id 38914, offset 0, flags [DF], proto TCP (6), length 53628) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:38:07.933270 IP (tos 0x0, ttl 64, id 22492, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686970: reply ok 160
00:38:07.933329 IP (tos 0x0, ttl 64, id 22493, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686972: reply ok 160
00:38:07.933386 IP (tos 0x0, ttl 64, id 38915, offset 0, flags [DF], proto TCP (6), length 53628) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:38:07.933483 IP (tos 0x0, ttl 64, id 22494, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686974: reply ok 160
00:38:07.933602 IP (tos 0x0, ttl 64, id 38916, offset 0, flags [DF], proto TCP (6), length 53628) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:38:07.933689 IP (tos 0x0, ttl 64, id 22495, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686973: reply ok 160
00:38:07.933754 IP (tos 0x0, ttl 64, id 22496, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686975: reply ok 160
00:38:07.933820 IP (tos 0x0, ttl 64, id 38917, offset 0, flags [DF], proto TCP (6), length 53628) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:38:07.933892 IP (tos 0x0, ttl 64, id 22497, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686976: reply ok 160
00:38:07.934017 IP (tos 0x0, ttl 64, id 38918, offset 0, flags [DF], proto TCP (6), length 53628) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:38:07.934112 IP (tos 0x0, ttl 64, id 22498, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686977: reply ok 160
00:38:07.934202 IP (tos 0x0, ttl 64, id 22499, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686978: reply ok 160
00:38:07.934234 IP (tos 0x0, ttl 64, id 38919, offset 0, flags [DF], proto TCP (6), length 53628) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:38:07.934312 IP (tos 0x0, ttl 64, id 22500, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686942: reply ok 160
00:38:07.934400 IP (tos 0x0, ttl 64, id 22501, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686979: reply ok 160
00:38:07.934426 IP (tos 0x0, ttl 64, id 38920, offset 0, flags [DF], proto TCP (6), length 53628) 172.20.20.169.0 > 172.20.20.162.2049: 0 null
00:38:07.934520 IP (tos 0x0, ttl 64, id 22502, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.169.1874686980: reply ok 160

It does seem weird that the client doubles up the 216-byte replies all the
time, but the throughput is still very good.

With Debian as the client, it's so fast tcpdump can't keep up, but this is
what it does see:

00:43:10.310980 IP (tos 0x0, ttl 64, id 30448, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2036438470 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824033280
00:43:10.311002 IP (tos 0x0, ttl 64, id 30450, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.311013 IP (tos 0x0, ttl 64, id 52178, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0x376f), seq 38376, ack 988417, win 29118, options [nop,nop,TS val 3637794932 ecr 78651075], length 0
00:43:10.311266 IP (tos 0x0, ttl 64, id 52179, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2036438470: reply ok 160 write PRE: sz 824033280 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824037376 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.311356 IP (tos 0x0, ttl 64, id 30451, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2053215686 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824037376
00:43:10.311378 IP (tos 0x0, ttl 64, id 30453, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.311389 IP (tos 0x0, ttl 64, id 52180, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect ->
0x264b), seq 38540, ack 992641, win 29118, options [nop,nop,TS val 3637794932 ecr 78651075], length 0
00:43:10.311648 IP (tos 0x0, ttl 64, id 52181, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2053215686: reply ok 160 write PRE: sz 824037376 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824041472 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.311734 IP (tos 0x0, ttl 64, id 30454, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2069992902 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824041472
00:43:10.311756 IP (tos 0x0, ttl 64, id 30456, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.311767 IP (tos 0x0, ttl 64, id 52182, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0x1527), seq 38704, ack 996865, win 29118, options [nop,nop,TS val 3637794932 ecr 78651075], length 0
00:43:10.312021 IP (tos 0x0, ttl 64, id 52183, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2069992902: reply ok 160 write PRE: sz 824041472 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824045568 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.312107 IP (tos 0x0, ttl 64, id 30457, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2086770118 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824045568
00:43:10.312129 IP (tos 0x0, ttl 64, id 30459, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.312140 IP (tos 0x0, ttl 64, id 52184, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0x0403), seq 38868, ack 1001089, win 29118, options [nop,nop,TS val 3637794932 ecr 78651075], length 0
00:43:10.312393 IP (tos 0x0, ttl 64, id 52185, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2086770118: reply ok 160 write PRE: sz 824045568 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824049664 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.312485 IP (tos 0x0, ttl 64, id 30460, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2103547334 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824049664
00:43:10.312507 IP (tos 0x0, ttl 64, id 30462, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.312517 IP (tos 0x0, ttl 64, id 52186, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0xf2de), seq 39032, ack 1005313, win 29118, options [nop,nop,TS val 3637794932 ecr 78651075], length 0
00:43:10.312775 IP (tos 0x0, ttl 64, id 52187, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2103547334: reply ok 160 write PRE: sz 824049664 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824053760 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.312861 IP (tos 0x0, ttl 64, id 30463, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2120324550 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824053760
00:43:10.312882 IP (tos 0x0, ttl 64, id 30465, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.312893 IP (tos 0x0, ttl 64, id 52188, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0xe1b3), seq 39196, ack 1009537, win 29124, options [nop,nop,TS val 3637794932 ecr 78651076], length 0
00:43:10.313148 IP (tos 0x0, ttl 64, id 52189, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2120324550: reply ok 160 write PRE: sz 824053760 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824057856 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.313233 IP (tos 0x0, ttl 64, id 30466, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2137101766 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824057856
00:43:10.313259 IP (tos 0x0, ttl 64, id 30468, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.313280 IP (tos 0x0, ttl 64, id 52190, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0xd08f), seq 39360, ack 1013761, win 29124, options [nop,nop,TS val 3637794932 ecr 78651076], length 0
00:43:10.313536 IP (tos 0x0, ttl 64, id 52191, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2137101766: reply ok 160 write PRE: sz 824057856 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824061952 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.313628 IP (tos 0x0, ttl 64, id 30469, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2153878982 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824061952
00:43:10.313650 IP (tos 0x0, ttl 64, id 30471, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.313661 IP (tos 0x0, ttl 64, id 52192, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0xbf71), seq 39524, ack 1017985, win 29118, options [nop,nop,TS val 3637794932 ecr 78651076], length 0
00:43:10.313915 IP (tos 0x0, ttl 64, id 52193, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2153878982: reply ok 160 write PRE: sz 824061952 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824066048 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.314000 IP (tos 0x0, ttl 64, id 30472, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2170656198 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824066048
00:43:10.314022 IP (tos 0x0, ttl 64, id 30474, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.314032 IP (tos 0x0, ttl 64, id 52194, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0xae4d), seq 39688, ack 1022209, win 29118, options [nop,nop,TS val 3637794932 ecr 78651076], length 0
00:43:10.314286 IP (tos 0x0, ttl 64, id 52195, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2170656198: reply ok 160 write PRE: sz 824066048 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824070144 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.314373 IP (tos 0x0, ttl 64, id 30475, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2187433414 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824070144
00:43:10.314395 IP (tos 0x0, ttl 64, id 30477, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.314406 IP (tos 0x0, ttl 64, id 52196, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0x9d29), seq 39852, ack 1026433, win 29118, options [nop,nop,TS val 3637794932 ecr 78651076], length 0
00:43:10.314665 IP (tos 0x0, ttl 64, id 52197, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2187433414: reply ok 160 write PRE: sz 824070144 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824074240 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.314751 IP (tos 0x0, ttl 64, id 30478, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2204210630 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824074240
00:43:10.314772 IP (tos 0x0, ttl 64, id 30480, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.314783 IP (tos 0x0, ttl 64, id 52198, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0x8c05), seq 40016, ack 1030657, win 29118, options [nop,nop,TS val 3637794932 ecr 78651076], length 0
00:43:10.315036 IP (tos 0x0, ttl 64, id 52199, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2204210630: reply ok 160 write PRE: sz 824074240 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824078336 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.315122 IP (tos 0x0, ttl 64, id 30481, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2220987846 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824078336
00:43:10.315144 IP (tos 0x0, ttl 64, id 30483, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.315154 IP (tos 0x0, ttl 64, id 52200, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0x7ae1), seq 40180, ack 1034881, win 29118, options [nop,nop,TS val 3637794932 ecr 78651076], length 0
00:43:10.315407 IP (tos 0x0, ttl 64, id 52201, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2220987846: reply ok 160 write PRE: sz 824078336 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824082432 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.315510 IP (tos 0x0, ttl 64, id 30484, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2237765062 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824082432
00:43:10.315532 IP (tos 0x0, ttl 64, id 30486, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.315543 IP (tos 0x0, ttl 64, id 52202, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0x69bd), seq 40344, ack 1039105, win 29118, options [nop,nop,TS val 3637794932 ecr 78651076], length 0
00:43:10.315781 IP (tos 0x0, ttl 64, id 52203, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2237765062: reply ok 160 write PRE: sz 824082432 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824086528 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes
00:43:10.315904 IP (tos 0x0, ttl 64, id 30487, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2254542278 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824086528
00:43:10.315927 IP (tos 0x0, ttl 64, id 30489, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null
00:43:10.315937 IP (tos 0x0, ttl 64, id 52204, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0x5889), seq 40508, ack 1043329, win 29124, options [nop,nop,TS
val 3637794942 ecr 78651076], length 0 00:43:10.316191 IP (tos 0x0, ttl 64, id 52205, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2254542278: reply ok 160 write PRE: sz 824086528 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824090624 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes 00:43:10.316274 IP (tos 0x0, ttl 64, id 30490, offset 0, flags [DF], proto TCP (6), length 2948) 172.20.20.166.2271319494 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824090624 00:43:10.316296 IP (tos 0x0, ttl 64, id 30492, offset 0, flags [DF], proto TCP (6), length 1380) 172.20.20.166.0 > 172.20.20.162.2049: 0 null 00:43:10.316307 IP (tos 0x0, ttl 64, id 52206, offset 0, flags [DF], proto TCP (6), length 52) 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 (incorrect -> 0x476b), seq 40672, ack 1047553, win 29118, options [nop,nop,TS val 3637794942 ecr 78651076], length 0 00:43:10.316552 IP (tos 0x0, ttl 64, id 52207, offset 0, flags [DF], proto TCP (6), length 216) 172.20.20.162.2049 > 172.20.20.166.2271319494: reply ok 160 write PRE: sz 824090624 mtime 1390437790.000000 ctime 1390437790.000000 POST: REG 640 ids 0/0 sz 824094720 nlink 1 rdev 3/130 fsid 59 fileid 4 a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 bytes This seems quite different. But the 1500 byte fallback with the FreeBSD 64k case seems like the smoking gun. So far I have not been able to test whether the 1500 bytes case correlates to the 10 pps case, but I will attempt to do so. Thanks! 
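As a back-of-the-envelope check on the "1500 bytes at ~10 pps" hypothesis (an illustrative aside, not from the thread itself; the 1448-byte MSS is inferred from the quoted traces, where pure ACKs have IP length 52 and full data packets 1500):

```python
# Illustrative only: relate the packet sizes quoted in these traces to
# goodput. The dumps show 52-byte IP packets for pure ACKs (20 IP + 20
# TCP + 12 timestamp-option bytes), so a 1500-byte packet carries a
# 1448-byte TCP payload.

IP_TCP_OVERHEAD = 52

def payload(ip_len):
    """TCP payload bytes carried by a packet of the given IP length."""
    return ip_len - IP_TCP_OVERHEAD

MSS = payload(1500)
print(MSS)                                # 1448

# If the stalled phase really is ~10 such packets per second, goodput is:
print(10 * MSS)                           # 14480 bytes/s, i.e. ~14 KB/s

# The larger frames in the traces (e.g. IP length 2948) are exact trains
# of full segments behind one header, consistent with TSO/LRO coalescing:
print((2948 - IP_TCP_OVERHEAD) % MSS)     # 0 -> 2948 = 52 + 2 * 1448
```

If the 1500-byte fallback and the 10 pps observation are the same event, ~14 KB/s is exactly the kind of "terrible" figure being reported.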
From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 03:45:07 2014
Date: Wed, 22 Jan 2014 22:45:04 -0500
From: Garrett Wollman
To: Navdeep Parhar
Cc: net@freebsd.org
Subject: Re: Use of contiguous physical memory in cxgbe driver
Message-ID: <21216.36928.132606.318491@hergotha.csail.mit.edu>
In-Reply-To: <52E06104.70600@FreeBSD.org>
References: <21216.22944.314697.179039@hergotha.csail.mit.edu> <52E06104.70600@FreeBSD.org>

< said:
> On 01/22/14 15:52, Garrett Wollman wrote:
>> At this point everyone is well aware that requiring contiguous
>> physical pages when the hardware can do scatter-gather is a very bad
>> idea.
> I wouldn't put it this way. Using buffers with size > PAGE_SIZE has its
> advantages.

These advantages do not come close to balancing out the disadvantage of
"your server eventually falls off the network due to physmem
fragmentation; better hope you can reset it remotely, because driving in
to work at 3 AM sucks." If any free pages are available at all, the
allocation of a 4k jumbo mbuf will succeed. A 9k jumbo mbuf requires
three physically contiguous pages, and it's very, very easy for physical
memory to get fragmented to the point where that is impossible.

-GAWollman

From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 04:13:03 2014
Date: Wed, 22 Jan 2014 23:12:54 -0500 (EST)
From: Rick Macklem
To: J David
Cc: freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <1891524918.14888294.1390450374695.JavaMail.root@uoguelph.ca>

J David wrote:
> On Mon, Jan 20, 2014 at 8:27 PM, Rick Macklem wrote:
> >>                                          random   random
> >>     write  rewrite     read    reread      read    write
> >> S:FBSD,C:FBSD,Z:64k
> >>     67246     2923   103295  1272407    172475      196
> >> S:FBSD,C:FBSD,Z:32k
> >>     11951    99896   223787  1051948    223276    13686
> > Since I've never used the benchmark you're using, I'll admit I have
> > no idea what these numbers mean? (Are big values fast or slow or???)
> These numbers are kiB/sec read or written. Larger is better.
So, do you consider the 32K results as reasonable or terrible
performance? (They are obviously much better than 64K, except for the
reread case.)

> > I'd be looking at a packet trace in wireshark and looking for TCP
> > retransmits and delays between packets. However, any significant
> > improvement (like 50% or more faster for a smaller I/O size)
> > indicates that something is broken in the network (in your case
> > virtual) fabric, imho.
> If Debian were also affected in the same environment, I would tend to
> agree. As it is not, the problem seems to be pretty squarely on
> FreeBSD's shoulders, rather than the environment.
I didn't mean to imply it had nothing to do with FreeBSD. When I said
"network fabric", I meant that to include everything below the TCP
socket that the NFS krpc code uses for RPC transport (i.e. including
FreeBSD's TCP stack and the device driver for the virtual network
interface you are using). Btw, I don't think you've mentioned which
network device driver gets used for this virtual environment. That
might be useful, in case the maintainer of that driver is aware of some
issue/patch.

> > (You saw that the smaller I/O size results in more RPCs, so that
> > would suggest slower, not faster.
> It's not clear to me what this refers to. More RPCs would only
> indicate slower for the same throughput level, right?
>
> I.e. nominally you would expect to see twice as many RPCs for 32k as
> for 64k. But if 32k is transferring 200x faster than 64k, the net
> effect would be 100x more RPCs, right?
>
> The packet traces for the 64k size are pretty interesting. On the
> sequential write test, the client packets start out very small, and
> slowly scale up to just under 64k, then they abruptly fall to 1500
> bytes and sit there for a good long while:
>
> 00:32:42.156107 IP (tos 0x0, ttl 64, id 5352, offset 0, flags [DF],
> proto TCP (6), length 2948)
> 172.20.20.169.0 > 172.20.20.162.2049: 0 null
> 00:32:42.255894 IP (tos 0x0, ttl 64, id 46951, offset 0, flags [DF],
> proto TCP (6), length 52)
> 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a
> (incorrect -> 0xa59c), seq 15072, ack 5909085, win 29127, options
> [nop,nop,TS val 3447942732 ecr 316758762], length 0
I don't know why tcpdump keeps flagging these as having incorrect
checksums. Maybe someone else has an answer for this.
> 00:32:42.256078 IP (tos 0x0, ttl 64, id 5353, offset 0, flags [DF], > proto TCP (6), length 5844) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:42.355958 IP (tos 0x0, ttl 64, id 46953, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a > (incorrect -> 0x8e34), seq 15072, ack 5914877, win 29127, options > [nop,nop,TS val 3447942832 ecr 316758862], length 0 > > 00:32:42.356152 IP (tos 0x0, ttl 64, id 5354, offset 0, flags [DF], > proto TCP (6), length 8740) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:42.455891 IP (tos 0x0, ttl 64, id 46954, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a > (incorrect -> 0x6b7c), seq 15072, ack 5923565, win 29127, options > [nop,nop,TS val 3447942932 ecr 316758962], length 0 > > 00:32:42.456036 IP (tos 0x0, ttl 64, id 5355, offset 0, flags [DF], > proto TCP (6), length 11636) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:42.456185 IP (tos 0x0, ttl 64, id 46955, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874728916: reply ok 160 > > 00:32:42.456370 IP (tos 0x0, ttl 64, id 5356, offset 0, flags [DF], > proto TCP (6), length 14532) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:42.555949 IP (tos 0x0, ttl 64, id 46957, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a > (incorrect -> 0x0440), seq 15236, ack 5949629, win 29127, options > [nop,nop,TS val 3447943032 ecr 316759062], length 0 > > 00:32:42.556139 IP (tos 0x0, ttl 64, id 5357, offset 0, flags [DF], > proto TCP (6), length 17428) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:42.655929 IP (tos 0x0, ttl 64, id 46958, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a > (incorrect -> 0xbf97), seq 15236, 
ack 5967005, win 29127, options > [nop,nop,TS val 3447943132 ecr 316759162], length 0 > > 00:32:42.656142 IP (tos 0x0, ttl 64, id 5358, offset 0, flags [DF], > proto TCP (6), length 20324) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:42.755893 IP (tos 0x0, ttl 64, id 46959, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a > (incorrect -> 0x6f9f), seq 15236, ack 5987277, win 29127, options > [nop,nop,TS val 3447943232 ecr 316759262], length 0 > > 00:32:42.756085 IP (tos 0x0, ttl 64, id 5359, offset 0, flags [DF], > proto TCP (6), length 23220) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:42.756198 IP (tos 0x0, ttl 64, id 46960, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874728917: reply ok 160 > > 00:32:42.756394 IP (tos 0x0, ttl 64, id 5360, offset 0, flags [DF], > proto TCP (6), length 26116) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:42.855956 IP (tos 0x0, ttl 64, id 46961, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a > (incorrect -> 0xade2), seq 15400, ack 6036509, win 29127, options > [nop,nop,TS val 3447943332 ecr 316759362], length 0 > > 00:32:42.856162 IP (tos 0x0, ttl 64, id 5361, offset 0, flags [DF], > proto TCP (6), length 29012) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:42.856310 IP (tos 0x0, ttl 64, id 46962, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874728918: reply ok 160 > > 00:32:42.856469 IP (tos 0x0, ttl 64, id 5362, offset 0, flags [DF], > proto TCP (6), length 31908) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:42.955895 IP (tos 0x0, ttl 64, id 47036, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a > (incorrect -> 0xbee5), seq 15564, ack 6097325, win 29127, options > 
[nop,nop,TS val 3447943432 ecr 316759462], length 0 > > 00:32:42.956089 IP (tos 0x0, ttl 64, id 5363, offset 0, flags [DF], > proto TCP (6), length 34804) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:42.956199 IP (tos 0x0, ttl 64, id 47037, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874728919: reply ok 160 > > 00:32:42.956404 IP (tos 0x0, ttl 64, id 5364, offset 0, flags [DF], > proto TCP (6), length 37700) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.055906 IP (tos 0x0, ttl 64, id 47038, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a > (incorrect -> 0xa2a8), seq 15728, ack 6169725, win 29127, options > [nop,nop,TS val 3447943532 ecr 316759562], length 0 > > 00:32:43.056068 IP (tos 0x0, ttl 64, id 5365, offset 0, flags [DF], > proto TCP (6), length 40596) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.056221 IP (tos 0x0, ttl 64, id 47040, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874728921: reply ok 160 > > 00:32:43.056400 IP (tos 0x0, ttl 64, id 5366, offset 0, flags [DF], > proto TCP (6), length 43492) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.155935 IP (tos 0x0, ttl 64, id 47044, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a > (incorrect -> 0x592b), seq 15892, ack 6253709, win 29127, options > [nop,nop,TS val 3447943632 ecr 316759662], length 0 > > 00:32:43.156133 IP (tos 0x0, ttl 64, id 5368, offset 0, flags [DF], > proto TCP (6), length 46388) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.156273 IP (tos 0x0, ttl 64, id 47045, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874728933: reply ok 160 > > 00:32:43.156481 IP (tos 0x0, ttl 64, id 5369, offset 0, flags [DF], > proto TCP (6), length 49284) > > 
172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.156632 IP (tos 0x0, ttl 64, id 47046, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874728934: reply ok 160 > > 00:32:43.156835 IP (tos 0x0, ttl 64, id 5370, offset 0, flags [DF], > proto TCP (6), length 52180) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.156953 IP (tos 0x0, ttl 64, id 47047, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874728937: reply ok 160 > > 00:32:43.157144 IP (tos 0x0, ttl 64, id 5371, offset 0, flags [DF], > proto TCP (6), length 55076) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.157285 IP (tos 0x0, ttl 64, id 47048, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874728938: reply ok 160 > > 00:32:43.157459 IP (tos 0x0, ttl 64, id 5372, offset 0, flags [DF], > proto TCP (6), length 57972) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.255927 IP (tos 0x0, ttl 64, id 47049, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x819a > (incorrect -> 0x5baf), seq 16548, ack 6514349, win 29127, options > [nop,nop,TS val 3447943732 ecr 316759762], length 0 > > 00:32:43.256132 IP (tos 0x0, ttl 64, id 5373, offset 0, flags [DF], > proto TCP (6), length 60868) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.256251 IP (tos 0x0, ttl 64, id 47050, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874728939: reply ok 160 > > 00:32:43.485942 IP (tos 0x0, ttl 64, id 47056, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874728939: reply ok 160 > > 00:32:43.486099 IP (tos 0x0, ttl 64, id 5376, offset 0, flags [DF], > proto TCP (6), length 1500) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.486117 IP (tos 0x0, ttl 64, id 47057, offset 0, flags [DF], > 
proto TCP (6), length 64) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x81a6 > (incorrect -> 0xbaf5), seq 16712, ack 6575165, win 29127, options > [nop,nop,TS val 3447943962 ecr 316759862,nop,nop,sack 1 > {6638877:6640325}], length 0 > > 00:32:43.486271 IP (tos 0x0, ttl 64, id 5377, offset 0, flags [DF], > proto TCP (6), length 1500) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.486288 IP (tos 0x0, ttl 64, id 47058, offset 0, flags [DF], > proto TCP (6), length 64) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x81a6 > (incorrect -> 0xb54d), seq 16712, ack 6575165, win 29127, options > [nop,nop,TS val 3447943962 ecr 316759862,nop,nop,sack 1 > {6638877:6641773}], length 0 > > 00:32:43.486440 IP (tos 0x0, ttl 64, id 5378, offset 0, flags [DF], > proto TCP (6), length 1500) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.486470 IP (tos 0x0, ttl 64, id 47059, offset 0, flags [DF], > proto TCP (6), length 64) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x81a6 > (incorrect -> 0xaec2), seq 16712, ack 6576613, win 29124, options > [nop,nop,TS val 3447943962 ecr 316760092,nop,nop,sack 1 > {6638877:6641773}], length 0 > > 00:32:43.486622 IP (tos 0x0, ttl 64, id 5379, offset 0, flags [DF], > proto TCP (6), length 1500) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:32:43.486652 IP (tos 0x0, ttl 64, id 47060, offset 0, flags [DF], > proto TCP (6), length 64) > > 172.20.20.162.2049 > 172.20.20.169.689: Flags [.], cksum 0x81a6 > (incorrect -> 0xa91a), seq 16712, ack 6578061, win 29124, options > [nop,nop,TS val 3447943962 ecr 316760092,nop,nop,sack 1 > {6638877:6641773}], length 0 > > > After a while, it tries again. Wash, rinse, repeat. > Hopefully someone who is conversant with TCP and tcpdump can make sense of this. I suggested wireshark because it recognizes the NFS requests, flags retransmits etc, so I find it much easier to understand.
I normally use tcpdump to capture the packets into a file: # tcpdump -s 0 -w .pcap host - then I read this file into wireshark. A raw capture like the above usually keeps up and doesn't miss much. > With the 32k size, they scale up to between 50000 and 59000 bytes per > packet and sit there for long periods of time, seeming to reset only > after some other intervening RPC activity: > > 00:38:07.932518 IP (tos 0x0, ttl 64, id 38911, offset 0, flags [DF], > proto TCP (6), length 53628) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:38:07.932530 IP (tos 0x0, ttl 64, id 22486, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686965: reply ok 160 > > 00:38:07.932622 IP (tos 0x0, ttl 64, id 22487, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686941: reply ok 160 > > 00:38:07.932732 IP (tos 0x0, ttl 64, id 38912, offset 0, flags [DF], > proto TCP (6), length 53628) > I don't know why this would be so large. A 32K write should be under 33Kbytes in size, not 53Kbytes. I suspect tcpdump is confused? 
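One possible reading of the 53628-byte entries (an interpretation, not confirmed in the thread): tcpdump captures above the driver, so with TSO/LRO enabled it can report coalesced segment trains far larger than either the MTU or a single 32K write. 53628 works out to an exact train of full-size segments, assuming the 1448-byte MSS visible elsewhere in these traces:

```python
# Why an IP "length 53628" can appear in tcpdump even though a 32K NFS
# write is under 33 KB: with TSO/LRO the capture point sees coalesced
# segment trains, not on-the-wire frames. (Interpretation mine; MSS of
# 1448 bytes and 52 header bytes are taken from the quoted dumps.)

MSS = 1448
HDR = 52                       # IP + TCP header bytes, as in the dumps

payload = 53628 - HDR          # 53576 bytes of TCP payload
print(payload // MSS, payload % MSS)   # 37 full segments, no remainder

# 37 * 1448 = 53576 > 32768, so one such train spans parts of at least
# two back-to-back 32K write RPCs:
print(payload > 32768)                 # True
```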
> 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:38:07.932747 IP (tos 0x0, ttl 64, id 22488, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686968: reply ok 160 > > 00:38:07.932857 IP (tos 0x0, ttl 64, id 22489, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686967: reply ok 160 > > 00:38:07.932963 IP (tos 0x0, ttl 64, id 38913, offset 0, flags [DF], > proto TCP (6), length 53628) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:38:07.933060 IP (tos 0x0, ttl 64, id 22490, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686946: reply ok 160 > > 00:38:07.933120 IP (tos 0x0, ttl 64, id 22491, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686971: reply ok 160 > > 00:38:07.933176 IP (tos 0x0, ttl 64, id 38914, offset 0, flags [DF], > proto TCP (6), length 53628) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:38:07.933270 IP (tos 0x0, ttl 64, id 22492, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686970: reply ok 160 > > 00:38:07.933329 IP (tos 0x0, ttl 64, id 22493, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686972: reply ok 160 > > 00:38:07.933386 IP (tos 0x0, ttl 64, id 38915, offset 0, flags [DF], > proto TCP (6), length 53628) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:38:07.933483 IP (tos 0x0, ttl 64, id 22494, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686974: reply ok 160 > > 00:38:07.933602 IP (tos 0x0, ttl 64, id 38916, offset 0, flags [DF], > proto TCP (6), length 53628) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:38:07.933689 IP (tos 0x0, ttl 64, id 22495, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686973: reply ok 160 > > 
00:38:07.933754 IP (tos 0x0, ttl 64, id 22496, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686975: reply ok 160 > > 00:38:07.933820 IP (tos 0x0, ttl 64, id 38917, offset 0, flags [DF], > proto TCP (6), length 53628) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:38:07.933892 IP (tos 0x0, ttl 64, id 22497, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686976: reply ok 160 > > 00:38:07.934017 IP (tos 0x0, ttl 64, id 38918, offset 0, flags [DF], > proto TCP (6), length 53628) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:38:07.934112 IP (tos 0x0, ttl 64, id 22498, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686977: reply ok 160 > > 00:38:07.934202 IP (tos 0x0, ttl 64, id 22499, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686978: reply ok 160 > > 00:38:07.934234 IP (tos 0x0, ttl 64, id 38919, offset 0, flags [DF], > proto TCP (6), length 53628) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:38:07.934312 IP (tos 0x0, ttl 64, id 22500, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686942: reply ok 160 > > 00:38:07.934400 IP (tos 0x0, ttl 64, id 22501, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686979: reply ok 160 > > 00:38:07.934426 IP (tos 0x0, ttl 64, id 38920, offset 0, flags [DF], > proto TCP (6), length 53628) > > 172.20.20.169.0 > 172.20.20.162.2049: 0 null > > 00:38:07.934520 IP (tos 0x0, ttl 64, id 22502, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.169.1874686980: reply ok 160 > > > It does seem weird that the client seems to double up the 216 replies > all the time, but the throughput is still very good. 
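To put numbers on "much better than 64K": the ratios implied by the iozone table quoted at the top of this message (kiB/sec, larger is better) are easy to compute. A quick sketch using only the figures from that table:

```python
# Ratios implied by the iozone results quoted earlier in this thread
# (values are kiB/sec as posted; larger is better).

z64k = {"write": 67246, "rewrite": 2923, "read": 103295,
        "reread": 1272407, "random_read": 172475, "random_write": 196}
z32k = {"write": 11951, "rewrite": 99896, "read": 223787,
        "reread": 1051948, "random_read": 223276, "random_write": 13686}

for test in z64k:
    print(f"{test:13s} 32k/64k = {z32k[test] / z64k[test]:7.2f}x")

# rewrite is roughly 34x faster at 32k and random_write roughly 70x
# faster, while reread (and sequential write) favour 64k.
```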
> > With Debian as the client, it's so fast tcpdump can't keep up, but > this is what it does see: > > 00:43:10.310980 IP (tos 0x0, ttl 64, id 30448, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2036438470 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824033280 > Btw, here the Debian NFS client is writing 4096 bytes, so it is nowhere near 64K. It appears to be writing single pages at this point. Notice that, for the above FreeBSD traces, it never figures out that there is an NFS write. Wireshark is much better at NFS related stuff, from my limited experience. > 00:43:10.311002 IP (tos 0x0, ttl 64, id 30450, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.311013 IP (tos 0x0, ttl 64, id 52178, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0x376f), seq 38376, ack 988417, win 29118, options > [nop,nop,TS val 3637794932 ecr 78651075], length 0 > > 00:43:10.311266 IP (tos 0x0, ttl 64, id 52179, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2036438470: reply ok 160 write > PRE: sz 824033280 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824037376 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.311356 IP (tos 0x0, ttl 64, id 30451, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2053215686 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824037376 > > 00:43:10.311378 IP (tos 0x0, ttl 64, id 30453, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.311389 IP (tos 0x0, ttl 64, id 52180, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect 
-> 0x264b), seq 38540, ack 992641, win 29118, options > [nop,nop,TS val 3637794932 ecr 78651075], length 0 > > 00:43:10.311648 IP (tos 0x0, ttl 64, id 52181, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2053215686: reply ok 160 write > PRE: sz 824037376 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824041472 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.311734 IP (tos 0x0, ttl 64, id 30454, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2069992902 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824041472 > > 00:43:10.311756 IP (tos 0x0, ttl 64, id 30456, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.311767 IP (tos 0x0, ttl 64, id 52182, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0x1527), seq 38704, ack 996865, win 29118, options > [nop,nop,TS val 3637794932 ecr 78651075], length 0 > > 00:43:10.312021 IP (tos 0x0, ttl 64, id 52183, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2069992902: reply ok 160 write > PRE: sz 824041472 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824045568 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.312107 IP (tos 0x0, ttl 64, id 30457, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2086770118 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824045568 > > 00:43:10.312129 IP (tos 0x0, ttl 64, id 30459, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.312140 IP (tos 0x0, ttl 64, id 52184, offset 0, flags [DF], > proto 
TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0x0403), seq 38868, ack 1001089, win 29118, options > [nop,nop,TS val 3637794932 ecr 78651075], length 0 > > 00:43:10.312393 IP (tos 0x0, ttl 64, id 52185, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2086770118: reply ok 160 write > PRE: sz 824045568 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824049664 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.312485 IP (tos 0x0, ttl 64, id 30460, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2103547334 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824049664 > > 00:43:10.312507 IP (tos 0x0, ttl 64, id 30462, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.312517 IP (tos 0x0, ttl 64, id 52186, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0xf2de), seq 39032, ack 1005313, win 29118, options > [nop,nop,TS val 3637794932 ecr 78651075], length 0 > > 00:43:10.312775 IP (tos 0x0, ttl 64, id 52187, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2103547334: reply ok 160 write > PRE: sz 824049664 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824053760 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.312861 IP (tos 0x0, ttl 64, id 30463, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2120324550 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824053760 > > 00:43:10.312882 IP (tos 0x0, ttl 64, id 30465, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 
172.20.20.162.2049: 0 null > > 00:43:10.312893 IP (tos 0x0, ttl 64, id 52188, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0xe1b3), seq 39196, ack 1009537, win 29124, options > [nop,nop,TS val 3637794932 ecr 78651076], length 0 > > 00:43:10.313148 IP (tos 0x0, ttl 64, id 52189, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2120324550: reply ok 160 write > PRE: sz 824053760 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824057856 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.313233 IP (tos 0x0, ttl 64, id 30466, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2137101766 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824057856 > > 00:43:10.313259 IP (tos 0x0, ttl 64, id 30468, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.313280 IP (tos 0x0, ttl 64, id 52190, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0xd08f), seq 39360, ack 1013761, win 29124, options > [nop,nop,TS val 3637794932 ecr 78651076], length 0 > > 00:43:10.313536 IP (tos 0x0, ttl 64, id 52191, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2137101766: reply ok 160 write > PRE: sz 824057856 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824061952 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.313628 IP (tos 0x0, ttl 64, id 30469, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2153878982 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824061952 > > 00:43:10.313650 IP 
(tos 0x0, ttl 64, id 30471, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.313661 IP (tos 0x0, ttl 64, id 52192, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0xbf71), seq 39524, ack 1017985, win 29118, options > [nop,nop,TS val 3637794932 ecr 78651076], length 0 > > 00:43:10.313915 IP (tos 0x0, ttl 64, id 52193, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2153878982: reply ok 160 write > PRE: sz 824061952 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824066048 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.314000 IP (tos 0x0, ttl 64, id 30472, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2170656198 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824066048 > > 00:43:10.314022 IP (tos 0x0, ttl 64, id 30474, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.314032 IP (tos 0x0, ttl 64, id 52194, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0xae4d), seq 39688, ack 1022209, win 29118, options > [nop,nop,TS val 3637794932 ecr 78651076], length 0 > > 00:43:10.314286 IP (tos 0x0, ttl 64, id 52195, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2170656198: reply ok 160 write > PRE: sz 824066048 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824070144 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.314373 IP (tos 0x0, ttl 64, id 30475, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2187433414 > 
172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824070144 > > 00:43:10.314395 IP (tos 0x0, ttl 64, id 30477, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.314406 IP (tos 0x0, ttl 64, id 52196, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0x9d29), seq 39852, ack 1026433, win 29118, options > [nop,nop,TS val 3637794932 ecr 78651076], length 0 > > 00:43:10.314665 IP (tos 0x0, ttl 64, id 52197, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2187433414: reply ok 160 write > PRE: sz 824070144 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824074240 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.314751 IP (tos 0x0, ttl 64, id 30478, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2204210630 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824074240 > > 00:43:10.314772 IP (tos 0x0, ttl 64, id 30480, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.314783 IP (tos 0x0, ttl 64, id 52198, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0x8c05), seq 40016, ack 1030657, win 29118, options > [nop,nop,TS val 3637794932 ecr 78651076], length 0 > > 00:43:10.315036 IP (tos 0x0, ttl 64, id 52199, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2204210630: reply ok 160 write > PRE: sz 824074240 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824078336 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.315122 IP (tos 0x0, 
ttl 64, id 30481, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2220987846 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824078336 > > 00:43:10.315144 IP (tos 0x0, ttl 64, id 30483, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.315154 IP (tos 0x0, ttl 64, id 52200, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0x7ae1), seq 40180, ack 1034881, win 29118, options > [nop,nop,TS val 3637794932 ecr 78651076], length 0 > > 00:43:10.315407 IP (tos 0x0, ttl 64, id 52201, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2220987846: reply ok 160 write > PRE: sz 824078336 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824082432 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.315510 IP (tos 0x0, ttl 64, id 30484, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2237765062 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824082432 > > 00:43:10.315532 IP (tos 0x0, ttl 64, id 30486, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.315543 IP (tos 0x0, ttl 64, id 52202, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0x69bd), seq 40344, ack 1039105, win 29118, options > [nop,nop,TS val 3637794932 ecr 78651076], length 0 > > 00:43:10.315781 IP (tos 0x0, ttl 64, id 52203, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2237765062: reply ok 160 write > PRE: sz 824082432 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824086528 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 
1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.315904 IP (tos 0x0, ttl 64, id 30487, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2254542278 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824086528 > > 00:43:10.315927 IP (tos 0x0, ttl 64, id 30489, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.315937 IP (tos 0x0, ttl 64, id 52204, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0x5889), seq 40508, ack 1043329, win 29124, options > [nop,nop,TS val 3637794942 ecr 78651076], length 0 > > 00:43:10.316191 IP (tos 0x0, ttl 64, id 52205, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2254542278: reply ok 160 write > PRE: sz 824086528 mtime 1390437790.000000 ctime 1390437790.000000 > POST: REG 640 ids 0/0 sz 824090624 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > > 00:43:10.316274 IP (tos 0x0, ttl 64, id 30490, offset 0, flags [DF], > proto TCP (6), length 2948) > > 172.20.20.166.2271319494 > 172.20.20.162.2049: 2892 write fh > 1325,752613/4 4096 (4096) bytes @ 824090624 > > 00:43:10.316296 IP (tos 0x0, ttl 64, id 30492, offset 0, flags [DF], > proto TCP (6), length 1380) > > 172.20.20.166.0 > 172.20.20.162.2049: 0 null > > 00:43:10.316307 IP (tos 0x0, ttl 64, id 52206, offset 0, flags [DF], > proto TCP (6), length 52) > > 172.20.20.162.2049 > 172.20.20.166.997: Flags [.], cksum 0x8197 > (incorrect -> 0x476b), seq 40672, ack 1047553, win 29118, options > [nop,nop,TS val 3637794942 ecr 78651076], length 0 > > 00:43:10.316552 IP (tos 0x0, ttl 64, id 52207, offset 0, flags [DF], > proto TCP (6), length 216) > > 172.20.20.162.2049 > 172.20.20.166.2271319494: reply ok 160 write > PRE: sz 824090624 mtime 1390437790.000000 ctime 
1390437790.000000 > POST: REG 640 ids 0/0 sz 824094720 nlink 1 rdev 3/130 fsid 59 fileid > 4 > a/m/ctime 1390437723.000000 1390437790.000000 1390437790.000000 4096 > bytes > Well, it seems Debian is doing 4096 byte writes, which won't have anywhere near the effect on the network driver/virtual hardware that a 64K (about 45 IP datagrams) in one NFS RPC will. > > This seems quite different. > > But the 1500 byte fallback with the FreeBSD 64k case seems like the > smoking gun. > Yea, looking at this case in wireshark might make what is going on apparent. At least it would build up the actual 64K writes and would flag retransmits, etc. (It also lists timestamps, so you can scan down the screen and see where there are large time delays.) Have fun with it, rick > So far I have not been able to test whether the 1500 bytes case > correlates to the 10 pps case, but I will attempt to do so. > > Thanks! > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 05:04:11 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 3D994578 for ; Thu, 23 Jan 2014 05:04:11 +0000 (UTC) Received: from mail-ig0-x22f.google.com (mail-ig0-x22f.google.com [IPv6:2607:f8b0:4001:c05::22f]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 0B29F1DC8 for ; Thu, 23 Jan 2014 05:04:11 +0000 (UTC) Received: by mail-ig0-f175.google.com with SMTP id uq10so15962629igb.2 for ; Wed, 22 Jan 2014 21:04:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; 
h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=JqSSVCva9wq8BBEN7o1M9gjsLMRHpg4AgfW1p9Si4Zk=; b=f80P/oLHtI1fWOxIHpq8yP/XAeuwiqIEeHu7IHM86XRkcAVhPY7dnvvCOPoDezhsc7 VKYA2PwZCN65zUdJGmRizvTddepmFITbCPt3iL0Qkxb9HwRiEzwWu54pEuzXZjTTH4n/ Tyl0FHR2Czrd2QqHepytxV69xv5u7KUKizGabq3E6OM/KhGrOQy8y+qSZ6a4R4u9iPx0 uj1oiPibHePI+kNr+lxEkWdNOOlab/9FB/Ig8qNAxVS7SyqPkgF5DoU0R9t/AuyiPq0Y VEy8Avd4a5KSawY8e1inTSmiFDqb0Jb1R5c29GN4PNY9adnllFfAdEuDBz22CnKjfurL I4zQ== MIME-Version: 1.0 X-Received: by 10.43.49.1 with SMTP id uy1mr4350440icb.48.1390453450441; Wed, 22 Jan 2014 21:04:10 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Wed, 22 Jan 2014 21:04:10 -0800 (PST) In-Reply-To: <1891524918.14888294.1390450374695.JavaMail.root@uoguelph.ca> References: <1891524918.14888294.1390450374695.JavaMail.root@uoguelph.ca> Date: Thu, 23 Jan 2014 00:04:10 -0500 X-Google-Sender-Auth: aUsw749YtA2pZlL8NYaStWz9fCI Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? From: J David To: Rick Macklem Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 05:04:11 -0000 On Wed, Jan 22, 2014 at 11:12 PM, Rick Macklem wrote: > So, do you consider the 32K results as reasonable or terrible performance? > (They are obviously much better than 64K, except for the reread case.) It's the 64k numbers that prompted the "terrible" thread title. A 196K/sec write speed on a 10+ Gbit network is pretty disastrous. The 32k numbers are, as you say, better. Possibly reasonable, but I'm not sure if they're optimal. It's hard to tell the latency of the virtual network, which would be needed to make that determination. 
It would be best if FreeBSD out of the box blows the doors off Debian out of the box and FreeBSD tuned to the gills blows the doors off Debian tuned to the gills. Right now, Debian seems to be the one with the edge, and with FreeBSD's illustrious history as the NFS performance king for so many years, that just won't do. :)

> Btw, I don't think you've mentioned what network device driver gets used
> for this virtual environment. That might be useful, in case the maintainer
> of that driver is aware of some issue/patch.

KVM uses virtio.

>> 00:38:07.932732 IP (tos 0x0, ttl 64, id 38912, offset 0, flags [DF],
>> proto TCP (6), length 53628)
>
> I don't know why this would be so large. A 32K write should be under
> 33Kbytes in size, not 53Kbytes. I suspect tcpdump is confused?

Since TCP is stream-oriented, is there a reason to expect a 1:1 correlation between NFS writes and TCP packets?

> Well, it seems Debian is doing 4096 byte writes, which won't have anywhere
> near the effect on the network driver/virtual hardware that a 64K (about
> 45 IP datagrams) in one NFS RPC will.

Debian's kernel says it is doing 64k reads/writes on that mount. So again, possibly an expectation of a 1:1 correlation between NFS writes and TCP packets is not being satisfied. However, iozone is doing 4k reads/writes for these tests, so it's also possible that Debian is not coalescing them at all (which FreeBSD apparently is) and the 4k writes are hitting the virtual wire as-is.

Also, both sides have TSO and LRO, so it would be surprising (and incorrect?) behavior if a 64k packet were actually fragmented into 45 IP datagrams. Although if something is happening to temporarily provoke exactly that behavior, it might explain the 1500-byte packets, so that's definitely a lead. Maybe it would be possible for me to peek at the stream from various different points and establish who is doing the fragmenting.
It could be that if Debian is basically disregarding the 64k setting and using only 4k packets, it's simply not hitting whatever large-packet bad behavior is harming FreeBSD. However, it also performs better in the server role, with the client requesting the larger packets. So that's not definitive.

> Yea, looking at this case in wireshark might make what is going on
> apparent.

Possibly, but that would likely have to be done by someone with more NFS protocol familiarity than I. Also, the incorrect checksums on outbound packets are normal because the interface supports checksum offloading. The checksum simply hasn't been calculated yet when tcpdump sees it.

Thanks!

From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 06:49:30 2014 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 898CE6F4; Thu, 23 Jan 2014 06:49:30 +0000 (UTC) Received: from mail-pa0-x234.google.com (mail-pa0-x234.google.com [IPv6:2607:f8b0:400e:c03::234]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 56D331770; Thu, 23 Jan 2014 06:49:30 +0000 (UTC) Received: by mail-pa0-f52.google.com with SMTP id bj1so1479340pad.11 for ; Wed, 22 Jan 2014 22:49:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:mail-followup-to :references:mime-version:content-type:content-disposition :in-reply-to:user-agent; bh=8nu8C46jFUizQcoaDjLJhLMlEUal0qiexfe+nx02CSg=; b=vOh+r/KiVMkiGp8MHmVkpahCaviq9KUH/QiKjmJdqjWA+GcZih7TIK4v54ZCARtloO g/nI6zGxGb+omp1i2zmv39E3A30Jcz/Uc0ZGOJ7ma+iNYMytTqFgrEadV5Y2MyoQybVC 5BnpPddEThU6iU22FfY4HfZCTwCSqb+T/M1VI0lQBXf5HREWvkBSgV8xKgkmwaDRXcdJ uz842eEccHeRv9SjbGNEzmFWbdJchCpz4VPqSHjFeuU+ZFqqSIqTkr0AnemvJT6oC3+2
BbUWvgTsZOFlO7gDJD+nd8PyN2N6raD5C0ckXNk255PQJaHSxArrykhoRZMlfaKxWa0Z CFhg== X-Received: by 10.66.139.8 with SMTP id qu8mr6299850pab.157.1390459769370; Wed, 22 Jan 2014 22:49:29 -0800 (PST) Received: from ox (c-24-6-44-228.hsd1.ca.comcast.net. [24.6.44.228]) by mx.google.com with ESMTPSA id tu3sm57821624pab.1.2014.01.22.22.49.28 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Wed, 22 Jan 2014 22:49:28 -0800 (PST) Sender: Navdeep Parhar Date: Wed, 22 Jan 2014 22:49:23 -0800 From: Navdeep Parhar To: Garrett Wollman Subject: Re: Use of contiguous physical memory in cxgbe driver Message-ID: <20140123064923.GA6501@ox> Mail-Followup-To: Garrett Wollman , net@freebsd.org References: <21216.22944.314697.179039@hergotha.csail.mit.edu> <52E06104.70600@FreeBSD.org> <21216.36928.132606.318491@hergotha.csail.mit.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <21216.36928.132606.318491@hergotha.csail.mit.edu> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 06:49:30 -0000 On Wed, Jan 22, 2014 at 10:45:04PM -0500, Garrett Wollman wrote: > < said: > > > On 01/22/14 15:52, Garrett Wollman wrote: > >> At this point everyone is well aware that requiring contiguous > >> physical page when the hardware can do scatter-gather is a very bad > >> idea. > > > I wouldn't put it this way. Using buffers with size > PAGE_SIZE has its > > advantages. > > These advantages do not come close to balancing out the disadvantage > of "your server eventually falls off the network due to physmem > fragmentation, better hope you can reset it remotely because driving > in to work at 3 AM sucks." 
This seems to imply that the only alternate to a successful allocation of the preferred size (which may be > PAGE_SIZE) is to not allocate any rx buffer at all and thus fall off the network eventually. This is a false choice. The driver can always fall back to allocating PAGE_SIZE sized buffers iff larger allocations fail. I'm considering implementing something along these lines in cxgbe(4), with exponential backoffs to avoid repeated allocation attempts from jumbo zones that are depleted. Regards, Navdeep > If any free pages are available at all, the allocation of a 4k jumbo > mbuf will succeed. A 9k jumbo mbuf requires three physically > contiguous pages, and it's very, very easy for physical memory to get > fragmented to the point where that is impossible. > > -GAWollman From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 07:30:14 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 083E86F4 for ; Thu, 23 Jan 2014 07:30:14 +0000 (UTC) Received: from mail-we0-x22b.google.com (mail-we0-x22b.google.com [IPv6:2a00:1450:400c:c03::22b]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 97A5F1AB1 for ; Thu, 23 Jan 2014 07:30:13 +0000 (UTC) Received: by mail-we0-f171.google.com with SMTP id w61so811523wes.16 for ; Wed, 22 Jan 2014 23:30:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=D7SxhCbY4McG2OPBxPy5OCUaatRkEL2IbfMunnflgsQ=; b=zaXLof6AjRnY5KeHWyiSxvMcfG1/5RcpKheeHDD1JyVs9lFJihkqBf6dfZEPcqyodo sW4BuihrqJ+B3psBlIJYYhFk7TetNDh5vfjGIvfxndf7zelHsG0Qc63TlO3rSlL8kVqS hbqK9kpOHCIKiExhVzagR1gUECYTGLX7EEitKKv5yS8zFrCoVqJ2L/WhWCakVa9x+bHa 
JUG2eVF7CMn3L9ZHXHtM3Meu7Po+7My9IXXsb5EUkiHivz/yoiNVw+9wnKb43K1SlOSi W5lFJ4D8O2x4Q3MknNVzKMLWcte6jsa1/Yrcdncf1DVjv/RhOn8FMOczephJpsd4twnu oaxQ== MIME-Version: 1.0 X-Received: by 10.194.2.70 with SMTP id 6mr5312666wjs.25.1390462212053; Wed, 22 Jan 2014 23:30:12 -0800 (PST) Received: by 10.194.29.163 with HTTP; Wed, 22 Jan 2014 23:30:11 -0800 (PST) In-Reply-To: <20140122184552.GB98322@onelab2.iet.unipi.it> References: <20140122184552.GB98322@onelab2.iet.unipi.it> Date: Thu, 23 Jan 2014 07:30:11 +0000 Message-ID: Subject: Re: Netmap in FreeBSD 10 From: "C. L. Martinez" To: freebsd-net@freebsd.org Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 07:30:14 -0000 On Wed, Jan 22, 2014 at 6:45 PM, Luigi Rizzo wrote: > On Tue, Jan 21, 2014 at 07:19:28AM +0000, C. L. Martinez wrote: >> Hi all, >> >> Is netmap enabled by default under FreeBSD 10 or do I need to >> recompile GENERIC kernel using "device netmap" option?? > > you need to recompile the kernel (actually just the netmap > module and device driver modules if you do not have > them compiled in). > > I also suggest to update the netmap code to the one in head, > which has more features and bugfixes. > > cheers > luigi Thanks luigi, but from where do I need to download/update netmap?? And what about when I launch freebsd-update?? If I am not wrong, freebsd-update overwrite the code. 
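For reference, getting "device netmap" into a 10.0 kernel follows the standard custom-kernel procedure; a minimal config sketch (the ident string here is arbitrary):

```
include GENERIC
ident   NETMAP
device  netmap
```

Built and installed with the usual config(8)/make cycle; note that freebsd-update tracks only unmodified release binaries, so a custom kernel generally has to be rebuilt after a source update.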
From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 07:32:52 2014 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E367D86E; Thu, 23 Jan 2014 07:32:52 +0000 (UTC) Received: from mail-qc0-x231.google.com (mail-qc0-x231.google.com [IPv6:2607:f8b0:400d:c01::231]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 7E6E41B29; Thu, 23 Jan 2014 07:32:52 +0000 (UTC) Received: by mail-qc0-f177.google.com with SMTP id i8so1965540qcq.36 for ; Wed, 22 Jan 2014 23:32:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=abTkJrb4s8C2K/ZWTGcl+36nhF+VYmihowOClThVR9c=; b=MGcbWHiWk6sb/uUQ48en7jsjh/GYmhBAi4vsmEDPDJWdl0ETq3Z5L+jMzqv7NOZm/o pZ/wIVGAr3TjcHdAOnK6xTStOBDNB8mN0tzjlvEigvJ0Sw3xzXs+lJ3Mi4uqFmWs/tvo uH9XGtJmLiw47zZcTmOAIrOY4TqMi1oPUd3S40LBQMU4wrqVJjKkvRGz6KtjTGW8a3Cs rSOEB0heKcJTr0LNBBdO2ibKpeqzejtPemz+f68Ju5bt+/tP1pgtHR1nPWSH2k581k2c ykYJ00ghKm6kJWkQ1CIob5H4IpqcxTSvk9TTX+z818pwUG3ygBckhK1NG/FpncTCJG6F LcLw== MIME-Version: 1.0 X-Received: by 10.224.127.131 with SMTP id g3mr9189396qas.98.1390462371718; Wed, 22 Jan 2014 23:32:51 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.224.52.8 with HTTP; Wed, 22 Jan 2014 23:32:51 -0800 (PST) In-Reply-To: <21216.36928.132606.318491@hergotha.csail.mit.edu> References: <21216.22944.314697.179039@hergotha.csail.mit.edu> <52E06104.70600@FreeBSD.org> <21216.36928.132606.318491@hergotha.csail.mit.edu> Date: Wed, 22 Jan 2014 23:32:51 -0800 X-Google-Sender-Auth: k9lf16qRTSJDXM5a2XqxOd2A3gU Message-ID: Subject: Re: Use of contiguous physical memory in cxgbe driver From: Adrian Chadd To: Garrett Wollman Content-Type: 
text/plain; charset=ISO-8859-1 Cc: Navdeep Parhar , "net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 07:32:53 -0000 It's about time we taught the physmem allocator to be more conducive to physically contiguous allocations. A server with gigabytes of memory should be able to keep a couple tens of megabytes of 64k sized allocation chunks around for exactly this. -a On 22 January 2014 19:45, Garrett Wollman wrote: > < said: > >> On 01/22/14 15:52, Garrett Wollman wrote: >>> At this point everyone is well aware that requiring contiguous >>> physical page when the hardware can do scatter-gather is a very bad >>> idea. > >> I wouldn't put it this way. Using buffers with size > PAGE_SIZE has its >> advantages. > > These advantages do not come close to balancing out the disadvantage > of "your server eventually falls off the network due to physmem > fragmentation, better hope you can reset it remotely because driving > in to work at 3 AM sucks." > > If any free pages are available at all, the allocation of a 4k jumbo > mbuf will succeed. A 9k jumbo mbuf requires three physically > contiguous pages, and it's very, very easy for physical memory to get > fragmented to the point where that is impossible. 
> > -GAWollman > > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 09:41:57 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 92B40EB3 for ; Thu, 23 Jan 2014 09:41:57 +0000 (UTC) Received: from mail-wi0-x234.google.com (mail-wi0-x234.google.com [IPv6:2a00:1450:400c:c05::234]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 2C7931856 for ; Thu, 23 Jan 2014 09:41:57 +0000 (UTC) Received: by mail-wi0-f180.google.com with SMTP id d13so1544018wiw.13 for ; Thu, 23 Jan 2014 01:41:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=MtD0vbNVQpP8KkZNcbUuSsrw7SmYkBKfw+LDNNbmr40=; b=xdyGxqBhQYcHJYuCXsn2eialaAIRofKXVsh1rZUIsGByCj9FWoejg2n3QtfFoxcyn3 tdpW7gchmK8no8PhwMZe9vra7enPxAefYi9hCtmU7aCUJey3dxAbXXGUnPNkh6Zla+Fn 1SX3CDzNoa7NNtAuA10vsLICbamHOIn5x9waWlTsMm9g+UVZLL6as3zjFR7jlhEANbfo Rb0qS704SkBc98eHN3IaMXlMgTriMdIBPSg1gJgOMgnoTGiMU5fIc041mq07I8vm25xL Vh7HTkeC57dzdMMfjE0fjcAcUdKByBEDPq+cowcqpPW94YhhK8djL1vDGJou+LPxOrmA 6DMg== MIME-Version: 1.0 X-Received: by 10.194.2.70 with SMTP id 6mr5778273wjs.25.1390470115452; Thu, 23 Jan 2014 01:41:55 -0800 (PST) Received: by 10.194.29.163 with HTTP; Thu, 23 Jan 2014 01:41:55 -0800 (PST) In-Reply-To: References: <20140122184552.GB98322@onelab2.iet.unipi.it> Date: Thu, 23 Jan 2014 09:41:55 +0000 Message-ID: Subject: Re: Netmap in FreeBSD 10 From: "C. L. 
Martinez" To: freebsd-net@freebsd.org Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 09:41:57 -0000 On Thu, Jan 23, 2014 at 7:30 AM, C. L. Martinez wrote: > On Wed, Jan 22, 2014 at 6:45 PM, Luigi Rizzo wrote: >> On Tue, Jan 21, 2014 at 07:19:28AM +0000, C. L. Martinez wrote: >>> Hi all, >>> >>> Is netmap enabled by default under FreeBSD 10 or do I need to >>> recompile GENERIC kernel using "device netmap" option?? >> >> you need to recompile the kernel (actually just the netmap >> module and device driver modules if you do not have >> them compiled in). >> >> I also suggest to update the netmap code to the one in head, >> which has more features and bugfixes. >> >> cheers >> luigi > > Thanks luigi, but from where do I need to download/update netmap?? And > what about when I launch freebsd-update?? If I am not wrong, > freebsd-update overwrite the code. And another question. What about libpcap that comes out of the box with FreeBSD?? Do I need to recompile ?? Thanks. From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 09:54:05 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id DE8BD15F for ; Thu, 23 Jan 2014 09:54:05 +0000 (UTC) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 9C932193A for ; Thu, 23 Jan 2014 09:54:05 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 042587300B; Thu, 23 Jan 2014 10:56:39 +0100 (CET) Date: Thu, 23 Jan 2014 10:56:39 +0100 From: Luigi Rizzo To: "C. L. 
Martinez" Subject: Re: Netmap in FreeBSD 10 Message-ID: <20140123095638.GA7274@onelab2.iet.unipi.it> References: <20140122184552.GB98322@onelab2.iet.unipi.it> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 09:54:05 -0000 On Thu, Jan 23, 2014 at 09:41:55AM +0000, C. L. Martinez wrote: > On Thu, Jan 23, 2014 at 7:30 AM, C. L. Martinez wrote: > > On Wed, Jan 22, 2014 at 6:45 PM, Luigi Rizzo wrote: > >> On Tue, Jan 21, 2014 at 07:19:28AM +0000, C. L. Martinez wrote: > >>> Hi all, > >>> > >>> Is netmap enabled by default under FreeBSD 10 or do I need to > >>> recompile GENERIC kernel using "device netmap" option?? > >> > >> you need to recompile the kernel (actually just the netmap > >> module and device driver modules if you do not have > >> them compiled in). > >> > >> I also suggest to update the netmap code to the one in head, > >> which has more features and bugfixes. > >> > >> cheers > >> luigi > > > > Thanks luigi, but from where do I need to download/update netmap?? And > > what about when I launch freebsd-update?? If I am not wrong, > > freebsd-update overwrite the code. > > And another question. What about libpcap that comes out of the box > with FreeBSD?? Do I need to recompile ?? the libpcap support is not committed yet. you need to bring in the code manually (e.g. from http://code.google.com/p/netmap/ ) or wait until i merge it. 
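[For reference, Luigi's advice upthread amounts to a one-line kernel configuration change. A hypothetical custom config is sketched below; the config name MYNETMAP is a placeholder, and the `include GENERIC` idiom is the usual way to extend the stock configuration:]

```
# Hypothetical sys/amd64/conf/MYNETMAP, extending GENERIC with netmap:
include GENERIC
ident   MYNETMAP
device  netmap
```

Built the usual way (`make buildkernel KERNCONF=MYNETMAP && make installkernel KERNCONF=MYNETMAP`). Note that freebsd-update only updates the stock GENERIC binaries; a custom kernel has to be rebuilt from source after each source update.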
From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 16:51:23 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id EEA0EB1D for ; Thu, 23 Jan 2014 16:51:23 +0000 (UTC) Received: from mail-pd0-x22b.google.com (mail-pd0-x22b.google.com [IPv6:2607:f8b0:400e:c02::22b]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id C49B01284 for ; Thu, 23 Jan 2014 16:51:23 +0000 (UTC) Received: by mail-pd0-f171.google.com with SMTP id g10so1989811pdj.30 for ; Thu, 23 Jan 2014 08:51:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:content-type:content-transfer-encoding:subject:date:message-id :to:mime-version; bh=LHQW7WDSKK9Aeg2j6I+cJM2o0w6oG3wGSAgd2s60L6E=; b=tS8YqAkzNaywlsbU6jfhbi1Z4x2nPFpa2LXfQpXV8OEDQhXYleAvBLS0224akJT+wB cEicyGRk0dbI+jHCyFifM+306thoP0TB5+Gvm95oI7KuK3halUJC3FxOe9S53vb9BZC5 DEkOpWo0UHXUFRFLQQjSdwQIrKiXwaRbwuv8Wx8Hjg65SehxWWt+xkha+YWbbO0wfeOc tXKgk13eOeHEBKt3r/hgkN6prh8ccI1o88ResapEpSffDp+4YKLgisbf9vOq2OWBdpFN fECYlrHM3Q7LzemqI41vGmhlS6Aw0k/xNW0NmRuT5s6K06QgL+Lz96v+Atw6twU7Y/i2 6wKw== X-Received: by 10.68.172.196 with SMTP id be4mr9225625pbc.12.1390495883344; Thu, 23 Jan 2014 08:51:23 -0800 (PST) Received: from [192.168.0.2] (h116-0-130-004.catv02.itscom.jp. 
[116.0.130.4]) by mx.google.com with ESMTPSA id qz9sm38625879pbc.3.2014.01.23.08.51.21 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Jan 2014 08:51:22 -0800 (PST) From: Kenichi Mori Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Subject: Urdu Language Fuzzer over peak function ova attack default account., design on., Date: Fri, 24 Jan 2014 01:51:21 +0900 Message-Id: <7BFE4AE4-79F9-41D2-A935-C61B3AE66A55@gmail.com> To: freebsd-net@freebsd.org Mime-Version: 1.0 (Apple Message framework v1283) X-Mailer: Apple Mail (2.1283) X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 16:51:24 -0000 Kenichi Mori UGM/OpenWall.,inc/GNU/Linux/BSD/SunOS/Solaris/MacOS/Windows/ OpenVMS/AIX-IBM.,Statistics.,Domani/Hot.,IBM.,co.,ltd., doing statistics/ FreeSoftware Foundation.,inc.,core
member math.,muesure.,Prof., HousouDaigaku KyouyouGakubu Zenkarisyusei Name Kenichi Mori., Birth Shouwa57Nen7gatsu24niti Gakusei Bangou 071-035570-0 Kyoujyu Bangou 071-035570-0 www.ouj.ac.jp/ Last Limited Heisei27nen3gatsu mats Student and Professor., From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 18:02:13 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 79AFD4A7 for ; Thu, 23 Jan 2014 18:02:13 +0000 (UTC) Received: from mail-we0-x232.google.com (mail-we0-x232.google.com [IPv6:2a00:1450:400c:c03::232]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 15D1518F6 for ; Thu, 23 Jan 2014 18:02:12 +0000 (UTC) Received: by mail-we0-f178.google.com with SMTP id t60so1593516wes.9 for ; Thu, 23 Jan 2014 10:02:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:date:message-id:subject:from:to:content-type; bh=iBDqdSdhrPRmwhu9ukg94xiHN9oC+sJAqYPZlwDMxjk=; b=AtR/YkI2URsRK8pGlMgH4kTgopplCIv//xatlLdt3xmGk3NEJ7gQNN8GFx/5Ema6Ba 0KSQIuKVZkHKa65IEB6HN2v7D3aRzpiaU/9exBBmE4ZH2vk+0YfBQOJmBOfkoUZklaXw S+uTSBEmSSz/7A/r/LY/0DeGQDWFhp/rSNY23T6iBDnEtG/VLMwv3/vYjTXN+5Hk04gD OFqYl369J94AuJUwnZRYOSysmXtj49LCYZm42MdSwFX5y5CEf8fNOxcnXRJcOqU7LJt/ ewPefkpls2NTgMSdK4TNOM+rqbruYYq+cj/cejiBH9hEkFOYdAuqXzobvB5DDzlwoNJA gdVA== MIME-Version: 1.0 X-Received: by 10.194.219.132 with SMTP id po4mr7615856wjc.7.1390500131538; Thu, 23 Jan 2014 10:02:11 -0800 (PST) Sender: asomers@gmail.com Received: by 10.194.22.35 with HTTP; Thu, 23 Jan 2014 10:02:11 -0800 (PST) Date: Thu, 23 Jan 2014 11:02:11 -0700 X-Google-Sender-Auth: bNyZH3Yn41pAFb3ZKyZUZfalKIg Message-ID: Subject: kern/185813: SOCK_SEQPACKET AF_UNIX sockets with asymmetrical buffers drop packets From: Alan 
Somers To: freebsd-net@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 18:02:13 -0000 There is a buffer space calculation bug in the send path for SOCK_SEQPACKET AF_UNIX sockets. The result is that, if the sending and receiving buffer sizes are different, the kernel will drop messages and leak mbufs. A more detailed description is available in the PR. The labyrinthine nature of the networking code makes it difficult to directly fix the space calculation. It's especially hard due to the optimization that AF_UNIX sockets have only a single socket buffer. As implemented, they store data in the receiving sockbuf, but use the transmitting sockbuf for space calculations. That's even true of SOCK_STREAM sockets. They only work due to an accident; they don't end up doing the same space calculation that trips up SOCK_SEQPACKET sockets. Instead, I propose modifying the kernel to force an AF_UNIX socket pair's buffers to always have the same size. That is, if you call setsockopt(s, SOL_SOCKET, SO_SNDBUF, whatever, whatever), the kernel will adjust both s's send buffer and the connected socket's receive buffer. This solution also solves another annoying problem: currently there is no way for a program to effectively change the size of its receiving buffers. If you call setsockopt(s, SOL_SOCKET, SO_RCVBUF, whatever, whatever) on an AF_UNIX socket, it will have no effect on how packets are actually handled. The attached patch implements my suggestion for setsockopt. It's obviously not perfect; it doesn't handle the case where you call setsockopt() before connect() and it introduces an unfortunate #include, but it's a working proof of concept. 
With this patch, the recently added ATF test case sys/kern/unix_seqpacket_test:pipe_simulator_128k_8k passes. Does this look like the correct approach?

Index: uipc_socket.c
===================================================================
--- uipc_socket.c (revision 261055)
+++ uipc_socket.c (working copy)
@@ -133,6 +133,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
 #include
@@ -2382,6 +2384,8 @@
 int
 sosetopt(struct socket *so, struct sockopt *sopt)
 {
+	struct socket* so2;
+	struct unpcb *unpcb, *unpcb2;
 	int error, optval;
 	struct linger l;
 	struct timeval tv;
@@ -2503,6 +2507,32 @@
 		}
 		(sopt->sopt_name == SO_SNDBUF ? &so->so_snd :
 		    &so->so_rcv)->sb_flags &= ~SB_AUTOSIZE;
+		if (so->so_proto->pr_domain->dom_family !=
+		    PF_LOCAL ||
+		    so->so_type != SOCK_SEQPACKET)
+			break;
+		/*
+		 * For unix domain seqpacket sockets, we set the
+		 * bufsize on both ends of the socket. PR
+		 * kern/185813
+		 */
+		unpcb = (struct unpcb*)(so->so_pcb);
+		if (NULL == unpcb)
+			break;	/* Shouldn't ever happen */
+		unpcb2 = unpcb->unp_conn;
+		if (NULL == unpcb2)
+			break;	/* For unconnected sockets */
+		so2 = unpcb2->unp_socket;
+		if (NULL == so2)
+			break;	/* Shouldn't ever happen? */
+		if (sbreserve(sopt->sopt_name == SO_SNDBUF ?
+		    &so2->so_rcv : &so2->so_snd, (u_long)optval,
+		    so, curthread) == 0) {
+			error = ENOBUFS;
+			goto bad;
+		}
+		(sopt->sopt_name == SO_SNDBUF ? &so2->so_rcv :
+		    &so2->so_snd)->sb_flags &= ~SB_AUTOSIZE;
 		break;

	/*

-Alan

From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 18:27:00 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 563DABE9; Thu, 23 Jan 2014 18:27:00 +0000 (UTC) Received: from mail-qc0-x236.google.com (mail-qc0-x236.google.com [IPv6:2607:f8b0:400d:c01::236]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 035C71ACB; Thu, 23 Jan 2014 18:26:59 +0000 (UTC) Received: by mail-qc0-f182.google.com with SMTP id c9so3002720qcz.27 for ; Thu, 23 Jan 2014 10:26:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=TevHMeTD5l9crIcsMhaaMZS6c7mh4zOAOqiN1VIk/yU=; b=OCVnbv/98unvDkk/j0BGFfLqFteK1mLcX4WQxrheh8su/jNXSj5zmCxce4CpezmTVS Y7DfV6k0JwS6WwFP5QmofWgmq6W6fBnxhWDikpRX79EtR9cUj6yTrKNbbsTMPwZl3LR/ auqVRO3Os0Bh8QrHUMi1mJOdjoHBrBI2nsNBHFnLdDAy8yb7bRDSCq5pV+3p1ezLq142 5r+JP5hOgvTpL8sEyIRxZV4zwPrxiuckd5dSTo35NbsSEbNleUVEJLy2lPzp3BT6Obmw JpZ+JRTDxqxAvzjyYAJ19BNnQuZ1aINK4Ob+JiuZ4U8N9YH16jEhR9vwRWHl5p9EXhvY +dFQ== MIME-Version: 1.0 X-Received: by 10.224.89.71 with SMTP id d7mr14080640qam.26.1390501619094; Thu, 23 Jan 2014 10:26:59 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.224.52.8 with HTTP; Thu, 23 Jan 2014 10:26:59 -0800 (PST) In-Reply-To: References: Date: Thu, 23 Jan 2014 10:26:59 -0800 X-Google-Sender-Auth: R8CGHhA_3fv-1emt5RHLqVrIWjU Message-ID: Subject: Re: kern/185813: SOCK_SEQPACKET AF_UNIX sockets with asymmetrical buffers drop packets From: Adrian Chadd To: Alan Somers Content-Type: text/plain; charset=ISO-8859-1 Cc: FreeBSD Net X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 18:27:00 -0000 Well, shouldn't we fix the API/code so it doesn't drop packets, regardless of the sensibility or non-sensibility of different transmit/receive buffer sizes? -a On 23 January 2014 10:02, Alan Somers wrote: > There is a buffer space calculation bug in the send path for > SOCK_SEQPACKET AF_UNIX sockets. The result is that, if the sending > and receiving buffer sizes are different, the kernel will drop > messages and leak mbufs. A more detailed description is available in > the PR. > > The labyrinthine nature of the networking code makes it difficult to > directly fix the space calculation. It's especially hard due to the > optimization that AF_UNIX sockets have only a single socket buffer. > As implemented, they store data in the receiving sockbuf, but use the > transmitting sockbuf for space calculations. That's even true of > SOCK_STREAM sockets. They only work due to an accident; they don't > end up doing the same space calculation that trips up SOCK_SEQPACKET > sockets. > > Instead, I propose modifying the kernel to force an AF_UNIX socket > pair's buffers to always have the same size. That is, if you call > setsockopt(s, SOL_SOCKET, SO_SNDBUF, whatever, whatever), the kernel > will adjust both s's send buffer and the connected socket's receive > buffer. This solution also solves another annoying problem: currently > there is no way for a program to effectively change the size of its > receiving buffers. If you call setsockopt(s, SOL_SOCKET, SO_RCVBUF, > whatever, whatever) on an AF_UNIX socket, it will have no effect on > how packets are actually handled. > > The attached patch implements my suggestion for setsockopt. 
It's > obviously not perfect; it doesn't handle the case where you call > setsockopt() before connect() and it introduces an unfortunate > #include, but it's a working proof of concept. With this patch, the > recently added ATF test case > sys/kern/unix_seqpacket_test:pipe_simulator_128k_8k passes. Does this > look like the correct approach? > > > Index: uipc_socket.c > =================================================================== > --- uipc_socket.c (revision 261055) > +++ uipc_socket.c (working copy) > @@ -133,6 +133,8 @@ > #include > #include > #include > +#include > +#include > #include > #include > #include > @@ -2382,6 +2384,8 @@ > int > sosetopt(struct socket *so, struct sockopt *sopt) > { > + struct socket* so2; > + struct unpcb *unpcb, *unpcb2; > int error, optval; > struct linger l; > struct timeval tv; > @@ -2503,6 +2507,32 @@ > } > (sopt->sopt_name == SO_SNDBUF ? &so->so_snd : > &so->so_rcv)->sb_flags &= ~SB_AUTOSIZE; > + if (so->so_proto->pr_domain->dom_family != > + PF_LOCAL || > + so->so_type != SOCK_SEQPACKET) > + break; > + /* > + * For unix domain seqpacket sockets, we set the > + * bufsize on both ends of the socket. PR > + * kern/185813 > + */ > + unpcb = (struct unpcb*)(so->so_pcb); > + if (NULL == unpcb) > + break; /* Shouldn't ever happen */ > + unpcb2 = unpcb->unp_conn; > + if (NULL == unpcb2) > + break; /* For unconnected sockets */ > + so2 = unpcb2->unp_socket; > + if (NULL == so2) > + break; /* Shouldn't ever happen? */ > + if (sbreserve(sopt->sopt_name == SO_SNDBUF ? > + &so2->so_rcv : &so2->so_snd, (u_long)optval, > + so, curthread) == 0) { > + error = ENOBUFS; > + goto bad; > + } > + (sopt->sopt_name == SO_SNDBUF ? 
&so2->so_rcv : > + &so2->so_snd)->sb_flags &= ~SB_AUTOSIZE; > break; > > /* > > > -Alan > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 19:49:02 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mandree.no-ip.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 9E12A93 for ; Thu, 23 Jan 2014 19:49:01 +0000 (UTC) Received: from [IPv6:::1] (localhost6.localdomain6 [IPv6:::1]) by apollo.emma.line.org (Postfix) with ESMTP id 69FCD23D0FE for ; Thu, 23 Jan 2014 09:20:01 +0100 (CET) Message-ID: <52E0D0B1.3090104@FreeBSD.org> Date: Thu, 23 Jan 2014 09:20:01 +0100 From: Matthias Andree User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: freebsd-net@freebsd.org Subject: Re: Terrible NFS performance under 9.2-RELEASE? References: <2057911949.13372985.1390269671666.JavaMail.root@uoguelph.ca> In-Reply-To: <2057911949.13372985.1390269671666.JavaMail.root@uoguelph.ca> X-Enigmail-Version: 1.5.2 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 19:49:02 -0000 Am 21.01.2014 03:01, schrieb Rick Macklem: > Since this is getting long winded, I'm going to "cheat" and top post. > (Don't top post flame suit on;-) > > You could try setting > net.inet.tcp.delayed_ack=0 > via sysctl. > I just looked and it appears that TCP delays ACKs for a while, even > when TCP_NODELAY is set (I didn't know that). 
I honestly don't know > how much/if any effect these delayed ACKs will have, but if you > disable them, you can see what happens. These are separate mechanisms. Not sure if it applies here, but anyways: TCP_NODELAY (opposite is "TCP Cork" or "corking") is for the sending end and means "send data even if you have unacknowledged data in flight and the buffer is not yet full" - and probably does not have much of an impact on bulk sends because you stuff packets full, and then they get sent. Delayed ACK is a feature of the receiving end of the stream so you can bundle return/response data with the ACK (because the "ACK" basically is only any TCP packet with the bumped sequence number in the header). http://www.stuartcheshire.org/papers/NagleDelayedAck/ Mind the warnings at the end of the document - make sure to not impair congestion control and fairness, if you want to avoid harder-to-debug performance issues with mixed traffic in the network (i. e. more than your synthetic NFS tests or NFS-dominated use). It'd be interesting to see if there has been research on the Minshall modification vs. TCP fairness (TCP fairness is paramount through all the TCP research papers, but I have not looked at that for a few years now, and I predominantly looked at other aspects of congestion control at the time). Links (not necessarily URLs) welcome, especially if not delaying acks turns out to improve the situation.
From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 22:30:00 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E19C93CA for ; Thu, 23 Jan 2014 22:29:59 +0000 (UTC) Received: from mail-vb0-x234.google.com (mail-vb0-x234.google.com [IPv6:2607:f8b0:400c:c02::234]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 9ADBD1359 for ; Thu, 23 Jan 2014 22:29:59 +0000 (UTC) Received: by mail-vb0-f52.google.com with SMTP id p14so1394518vbm.25 for ; Thu, 23 Jan 2014 14:29:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=azXr0p4cbZzxQDTxzwKO5hzK825t8nSYMhalu0wxflM=; b=Fj16dNL0BKMtzjfV36WOwFzbXDH0GTdxxUkf0rzB6YjnsUuyWC5TjPp4HzhSmhHHEZ whGFVaE+jKSNqPY7+0846JKNPkffbD9VwhwcBuO0xTnwJVmPVpbOkGvZMqdu8GeDemn9 rHH10gYjgQIeB1wvXF6y1+BoUW3ejc1hn7VbNT+mZaLhri/luwBAdbKQq170kkyE/uCM isQwnMKpC6iwezS0CyYRnVLAeXdyPVgoNoSJcFzvMVd1ahOxa3JGK02ztyEb4d3reO2b SrFHwWl4lJ0p/TiReSCPTnlLLJlcuVSsJD13o6qjyT/yswLd0JhBZqH05iYMYAWRcLsF 4qIQ== X-Received: by 10.220.161.132 with SMTP id r4mr5769523vcx.29.1390516198686; Thu, 23 Jan 2014 14:29:58 -0800 (PST) MIME-Version: 1.0 Sender: cochard@gmail.com Received: by 10.58.171.1 with HTTP; Thu, 23 Jan 2014 14:29:38 -0800 (PST) In-Reply-To: <20140122184552.GB98322@onelab2.iet.unipi.it> References: <20140122184552.GB98322@onelab2.iet.unipi.it> From: =?ISO-8859-1?Q?Olivier_Cochard=2DLabb=E9?= Date: Thu, 23 Jan 2014 23:29:38 +0100 X-Google-Sender-Auth: kBqcKD39q6e7j-kK4X7fW5XP0u8 Message-ID: Subject: Re: Netmap in FreeBSD 10 To: Luigi Rizzo Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.17 Cc: 
"C. L. Martinez" , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 22:30:00 -0000 On Wed, Jan 22, 2014 at 7:45 PM, Luigi Rizzo wrote: > > I also suggest to update the netmap code to the one in head, > which has more features and bugfixes. > > I've tried to copy the netmap code from head to 10.0: it's a little bit more complex than a simple copy :-) The linker is not happy because some declarations are just declared and not included in some header file, here is a limited output: /src/sys/dev/e1000/if_em.c:(.text+0x51ee): undefined reference to `netmap_buffer_lut' /src/sys/dev/e1000/if_em.c:(.text+0x5204): undefined reference to `netmap_buffer_lut' if_em.o: In function `em_netmap_txsync': /src/sys/dev/e1000/if_em.c:(.text+0xa170): undefined reference to `netmap_buffer_lut' if_em.o:/src/sys/dev/e1000/if_em.c:(.text+0xa184): more undefined references to `netmap_buffer_lut' follow netmap.o: In function `netmap_get_memory': /src/sys/dev/netmap/netmap.c:(.text+0xab8): undefined reference to `nm_mem' /src/sys/dev/netmap/netmap.c:(.text+0xaea): undefined reference to `netmap_mem_finalize' netmap.o: In function `netmap_dtor_locked': /src/sys/dev/netmap/netmap.c:(.text+0xb77): undefined reference to `netmap_mem_deref' netmap.o: In function `netmap_do_unregif': /src/sys/dev/netmap/netmap.c:(.text+0xc74): undefined reference to `netmap_mem_rings_delete' /src/sys/dev/netmap/netmap.c:(.text+0xc88): undefined reference to `netmap_mem_if_delete' netmap.o: In function `netmap_txsync_to_host': /src/sys/dev/netmap/netmap.c:(.text+0xeb1): undefined reference to `mbq_init' netmap.o: In function `netmap_grab_packets': /src/sys/dev/netmap/netmap.c:(.text+0x1800): undefined reference to `mbq_enqueue' netmap.o: In function `netmap_send_up': 
/src/sys/dev/netmap/netmap.c:(.text+0x1849): undefined reference to `mbq_dequeue' /src/sys/dev/netmap/netmap.c:(.text+0x18ce): undefined reference to `mbq_dequeue' /src/sys/dev/netmap/netmap.c:(.text+0x18de): undefined reference to `mbq_destroy' (etc...) From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 22:39:28 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 7EFA3AC7 for ; Thu, 23 Jan 2014 22:39:28 +0000 (UTC) Received: from mail-la0-x229.google.com (mail-la0-x229.google.com [IPv6:2a00:1450:4010:c03::229]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 021EB1481 for ; Thu, 23 Jan 2014 22:39:27 +0000 (UTC) Received: by mail-la0-f41.google.com with SMTP id mc6so2036427lab.28 for ; Thu, 23 Jan 2014 14:39:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=GdXqxKIxOn9v8zMV8b1Vfa6DHgUA4inpbehnQusZgSQ=; b=FVtlSx0BwwrOOm6AXLIMTie6I5vjGfm0SIZQSY2itbU/PWUm36aKyZE3FskLqnbBbJ /IiwE8mJrlOY4vv5Fk3aKX1jUl1732vp/j2xUKAKSHyX6rkTUTUaU2Kr3tbE4h5JQiIz hfpWlU1sK2SKxDIiEhhCTIIoWzhVqk9YRxbyWAD/fIX1ggSVMHC/DtD0MTcIG3t8cbOw R4LpHgUoza1uHdGRIyc4WkmVPlZuOvZE/oT2dpKRYB6Y5lvdThNrvPMilVMnVKp7eUij 5fusJNZ7jjaN4ICM0hlegAzR+uw8PNCOXOmLkPix+ilCcfEjECBKKf9wMVZaf2F0EdzO 5AbQ== MIME-Version: 1.0 X-Received: by 10.152.5.136 with SMTP id s8mr75601las.55.1390516766031; Thu, 23 Jan 2014 14:39:26 -0800 (PST) Sender: rizzo.unipi@gmail.com Received: by 10.115.4.162 with HTTP; Thu, 23 Jan 2014 14:39:25 -0800 (PST) In-Reply-To: References: <20140122184552.GB98322@onelab2.iet.unipi.it> Date: Thu, 23 Jan 2014 14:39:25 -0800 X-Google-Sender-Auth: 
6Zis__1e-JIjlBzxoxjdtx06XEI Message-ID: Subject: Re: Netmap in FreeBSD 10 From: Luigi Rizzo To: =?ISO-8859-1?Q?Olivier_Cochard=2DLabb=E9?= Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.17 Cc: "C. L. Martinez" , "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 22:39:28 -0000 On Thu, Jan 23, 2014 at 2:29 PM, Olivier Cochard-Labb=E9 wrote: > On Wed, Jan 22, 2014 at 7:45 PM, Luigi Rizzo wrote: > >> >> I also suggest to update the netmap code to the one in head, >> which has more features and bugfixes. >> >> > I've tried to copy the netmap code from head to 10.0: it's a little bit > more complex than a simple copy :-) > you definitely need to bring in some small device driver modifications from head (such as svn 260368 and 257529; look for my commits after 250108 in the drivers), and probably also update sys/conf/files; but the errors below suggest that you are compiling with some of the old headers. FWIW the same applies to stable/9 as well. The reason i have not MFC'ed it yet is that we have another couple of additional features that require a small ABI change, and we'd like to do it all at once in the stable branches. Hopefully it should be only a few days now. 
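[The undefined symbols Olivier hit (mbq_*, netmap_mem_*) live in source files that the head netmap code splits out, so the kernel build glue has to list them. A hypothetical sketch of the kind of sys/conf/files entries required follows; the exact file list should be taken from head's own sys/conf/files rather than from here:]

```
# Hypothetical sys/conf/files additions for the newer netmap code:
dev/netmap/netmap.c          optional netmap
dev/netmap/netmap_freebsd.c  optional netmap
dev/netmap/netmap_generic.c  optional netmap
dev/netmap/netmap_mbq.c      optional netmap
dev/netmap/netmap_mem2.c     optional netmap
dev/netmap/netmap_vale.c     optional netmap
```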
cheers luigi From owner-freebsd-net@FreeBSD.ORG Thu Jan 23 23:30:25 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id AFF411B9; Thu, 23 Jan 2014 23:30:25 +0000 (UTC) Received: from luigi.brtsvcs.net (luigi.brtsvcs.net [IPv6:2607:fc50:1000:1f00::2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 634AF18B5; Thu, 23 Jan 2014 23:30:25 +0000 (UTC) Received: from chombo.houseloki.net (unknown [IPv6:2601:7:880:bd0:21c:c0ff:fe7f:96ee]) by luigi.brtsvcs.net (Postfix) with ESMTPSA id 9C9FF2D4FAE; Thu, 23 Jan 2014 15:30:16 -0800 (PST) Received: from [IPv6:2601:7:880:bd0:b40d:eb14:4f81:f42f] (unknown [IPv6:2601:7:880:bd0:b40d:eb14:4f81:f42f]) by chombo.houseloki.net (Postfix) with ESMTPSA id AAF3C370; Thu, 23 Jan 2014 15:30:13 -0800 (PST) Message-ID: <52E1A60D.3080002@bluerosetech.com> Date: Thu, 23 Jan 2014 15:30:21 -0800 From: Darren Pilgrim User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: freebsd-stable , freebsd-net Subject: Supermicro A1SRi-2758F, no NICs detected Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jan 2014 23:30:25 -0000 I just got a Supermicro A1SRi-2758F mainboard (Atom C2758 CPU). It boots FreeBSD 9.2 amd64 just fine, but the kernel doesn't attach a driver to any of the four ethernet devices. 
The pciconf -lv output looks like: none3@pci0:0:20:0: class=0x020000 card=0x1f4115d9 chip=0x1f418086 rev=0x03 hdr=0x00 vendor = 'Intel Corporation' class = network subclass = ethernet The other three differ only by PCI selector: pci0:0:20:1, pci0:0:20:2, and pci0:0:20:3. These are the SoC gigabit controllers on the C2758, which I believe makes them i354's. Some rummaging about in forums tells me support for these might be in FreeBSD 10? Is that the case? If so, will it be backported to 9? From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 00:05:39 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id B8ABF290; Fri, 24 Jan 2014 00:05:39 +0000 (UTC) Received: from luigi.brtsvcs.net (luigi.brtsvcs.net [204.109.60.246]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 952611C85; Fri, 24 Jan 2014 00:05:39 +0000 (UTC) Received: from chombo.houseloki.net (unknown [IPv6:2601:7:880:bd0:21c:c0ff:fe7f:96ee]) by luigi.brtsvcs.net (Postfix) with ESMTPSA id 2D4282D4FAE; Thu, 23 Jan 2014 16:05:38 -0800 (PST) Received: from [IPv6:2601:7:880:bd0:b40d:eb14:4f81:f42f] (unknown [IPv6:2601:7:880:bd0:b40d:eb14:4f81:f42f]) by chombo.houseloki.net (Postfix) with ESMTPSA id 8D86A374; Thu, 23 Jan 2014 16:05:36 -0800 (PST) Message-ID: <52E1AE57.5070503@bluerosetech.com> Date: Thu, 23 Jan 2014 16:05:43 -0800 From: Darren Pilgrim User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: freebsd-stable , freebsd-net Subject: Re: Supermicro A1SRi-2758F, no NICs detected References: <52E1A60D.3080002@bluerosetech.com> In-Reply-To: <52E1A60D.3080002@bluerosetech.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed 
Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Jan 2014 00:05:39 -0000 On 1/23/2014 3:30 PM, Darren Pilgrim wrote: > I just got a Supermicro A1SRi-2758F mainboard (Atom C2758 CPU). It > boots FreeBSD 9.2 amd64 just fine, but the kernel doesn't attach a > driver to any of the four ethernet devices. The pciconf -lv output > looks like: > > none3@pci0:0:20:0: class=0x020000 card=0x1f4115d9 chip=0x1f418086 > rev=0x03 hdr=0x00 > vendor = 'Intel Corporation' > class = network > subclass = ethernet > > The other three differ only by PCI selector: pci0:0:20:1, pci0:0:20:2, > and pci0:0:20:3. > > These are the SoC gigabit controllers on the C2758, which I believe > makes them i354's. Some rummaging about in forums tells me support for > these might be in FreeBSD 10? Is that the case? If so, will it be > backported to 9? Of course, right after I sent the above, I noticed the 10.0-R announcement. Loaded up a memstick and 10.0 attached igb to them, so it's installing 10.0 as I type this. I'm a little uneasy doing a .0 release on brand-new hardware, but we'll see! I would still like to know if there are plans to backport the i354/C2000 SoC gigabit support in igb to 9.x?
From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 00:13:04 2014
From: Jim Thompson
Date: Thu, 23 Jan 2014 16:12:18 -0800
To: Darren Pilgrim
Cc: freebsd-net, freebsd-stable
Subject: Re: Supermicro A1SRi-2758F, no NICs detected
Message-Id: <54717B7D-9B83-4074-81AC-7F7ED3ACB251@netgate.com>
In-Reply-To: <52E1AE57.5070503@bluerosetech.com>

> On Jan 23, 2014, at 16:05, Darren Pilgrim wrote:
>
> I would still like to know if there are plans to backport the
> i354/C2000 SoC gigabit support in igb to 9.x?

We recompiled it for pfSense (8.3 based), but you're likely better off with 10.0.

From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 00:29:59 2014
From: Darren Pilgrim
Date: Thu, 23 Jan 2014 16:30:03 -0800
To: Jim Thompson
Cc: freebsd-net, freebsd-stable
Subject: Re: Supermicro A1SRi-2758F, no NICs detected
Message-ID: <52E1B40B.9060009@bluerosetech.com>
In-Reply-To: <54717B7D-9B83-4074-81AC-7F7ED3ACB251@netgate.com>
On 1/23/2014 4:12 PM, Jim Thompson wrote:
>> On Jan 23, 2014, at 16:05, Darren Pilgrim wrote:
>>
>> I would still like to know if there are plans to backport the
>> i354/C2000 SoC gigabit support in igb to 9.x?
>
> We recompiled it for pfSense (8.3 based), but you're likely better
> off with 10.0.

I saw that. It was actually a pfSense forum thread that tipped me off about support in 10.0. :)

From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 00:31:07 2014
Sender: asomers@gmail.com
Date: Thu, 23 Jan 2014 17:31:04 -0700
From: Alan Somers
To: Adrian Chadd
Cc: FreeBSD Net
Subject: Re: kern/185813: SOCK_SEQPACKET AF_UNIX sockets with asymmetrical buffers drop packets

On Thu, Jan 23, 2014 at 11:26 AM, Adrian Chadd wrote:
> Well, shouldn't we fix the API/code so it doesn't drop packets,
> regardless of the sensibility or non-sensibility of different
> transmit/receive buffer sizes?

That would be nice, but it may be beyond my ability to do so. The
relevant code is very complicated, and most of it is in domain-agnostic
code where we can't introduce AF_UNIX-specific special cases. It may be
possible to change the single-buffer optimization to use the receiving
sockbuf's size for space calculations in uipc_send() instead of the
transmitting sockbuf's size. I could try to do that, though it may
cause existing programs to fail if they depend on setsockopt(s,
SOL_SOCKET, SO_SNDBUF, ...) having an effect.

-Alan

>
> -a
>
> On 23 January 2014 10:02, Alan Somers wrote:
>> There is a buffer space calculation bug in the send path for
>> SOCK_SEQPACKET AF_UNIX sockets. The result is that, if the sending
>> and receiving buffer sizes are different, the kernel will drop
>> messages and leak mbufs. A more detailed description is available in
>> the PR.
>>
>> The labyrinthine nature of the networking code makes it difficult to
>> directly fix the space calculation. It's especially hard due to the
>> optimization that AF_UNIX sockets have only a single socket buffer.
>> As implemented, they store data in the receiving sockbuf, but use the
>> transmitting sockbuf for space calculations.
>> That's even true of SOCK_STREAM sockets. They only work due to an
>> accident; they don't end up doing the same space calculation that
>> trips up SOCK_SEQPACKET sockets.
>>
>> Instead, I propose modifying the kernel to force an AF_UNIX socket
>> pair's buffers to always have the same size. That is, if you call
>> setsockopt(s, SOL_SOCKET, SO_SNDBUF, whatever, whatever), the kernel
>> will adjust both s's send buffer and the connected socket's receive
>> buffer. This solution also solves another annoying problem: currently
>> there is no way for a program to effectively change the size of its
>> receiving buffers. If you call setsockopt(s, SOL_SOCKET, SO_RCVBUF,
>> whatever, whatever) on an AF_UNIX socket, it will have no effect on
>> how packets are actually handled.
>>
>> The attached patch implements my suggestion for setsockopt. It's
>> obviously not perfect; it doesn't handle the case where you call
>> setsockopt() before connect() and it introduces an unfortunate
>> #include, but it's a working proof of concept. With this patch, the
>> recently added ATF test case
>> sys/kern/unix_seqpacket_test:pipe_simulator_128k_8k passes. Does this
>> look like the correct approach?
>>
>> Index: uipc_socket.c
>> ===================================================================
>> --- uipc_socket.c (revision 261055)
>> +++ uipc_socket.c (working copy)
>> @@ -133,6 +133,8 @@
>>  #include
>>  #include
>>  #include
>> +#include
>> +#include
>>  #include
>>  #include
>>  #include
>> @@ -2382,6 +2384,8 @@
>>  int
>>  sosetopt(struct socket *so, struct sockopt *sopt)
>>  {
>> +	struct socket* so2;
>> +	struct unpcb *unpcb, *unpcb2;
>>  	int error, optval;
>>  	struct linger l;
>>  	struct timeval tv;
>> @@ -2503,6 +2507,32 @@
>>  			}
>>  			(sopt->sopt_name == SO_SNDBUF ? &so->so_snd :
>>  			    &so->so_rcv)->sb_flags &= ~SB_AUTOSIZE;
>> +			if (so->so_proto->pr_domain->dom_family !=
>> +			    PF_LOCAL ||
>> +			    so->so_type != SOCK_SEQPACKET)
>> +				break;
>> +			/*
>> +			 * For unix domain seqpacket sockets, we set the
>> +			 * bufsize on both ends of the socket. PR
>> +			 * kern/185813
>> +			 */
>> +			unpcb = (struct unpcb*)(so->so_pcb);
>> +			if (NULL == unpcb)
>> +				break;	/* Shouldn't ever happen */
>> +			unpcb2 = unpcb->unp_conn;
>> +			if (NULL == unpcb2)
>> +				break;	/* For unconnected sockets */
>> +			so2 = unpcb2->unp_socket;
>> +			if (NULL == so2)
>> +				break;	/* Shouldn't ever happen? */
>> +			if (sbreserve(sopt->sopt_name == SO_SNDBUF ?
>> +			    &so2->so_rcv : &so2->so_snd, (u_long)optval,
>> +			    so, curthread) == 0) {
>> +				error = ENOBUFS;
>> +				goto bad;
>> +			}
>> +			(sopt->sopt_name == SO_SNDBUF ? &so2->so_rcv :
>> +			    &so2->so_snd)->sb_flags &= ~SB_AUTOSIZE;
>>  			break;
>>
>>  		/*
>>
>> -Alan
>> _______________________________________________
>> freebsd-net@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 01:06:30 2014
Date: Fri, 24 Jan 2014 01:06:29 GMT
From: linimon@FreeBSD.org
To: linimon@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-net@FreeBSD.org
Subject: Re: kern/185909: [altq] [patch] ALTQ activation problem
Message-Id: <201401240106.s0O16TjW086138@freefall.freebsd.org>

Old Synopsis: ALTQ activation problem
New Synopsis: [altq] [patch] ALTQ activation problem

Responsible-Changed-From-To: freebsd-bugs->freebsd-net
Responsible-Changed-By: linimon
Responsible-Changed-When: Fri Jan 24 01:06:14 UTC 2014
Responsible-Changed-Why: Over to maintainer(s).

http://www.freebsd.org/cgi/query-pr.cgi?pr=185909

From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 02:27:25 2014
Date: Thu, 23 Jan 2014 21:27:17 -0500 (EST)
From: Rick Macklem
To: J David
Cc: freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <390483613.15499210.1390530437153.JavaMail.root@uoguelph.ca>

J David wrote:
> On Wed, Jan 22, 2014 at 11:12 PM, Rick Macklem wrote:
> > So, do you consider the 32K results as reasonable or terrible
> > performance? (They are obviously much better than 64K, except for
> > the reread case.)
>
> It's the 64k numbers that prompted the "terrible" thread title. A
> 196K/sec write speed on a 10+ Gbit network is pretty disastrous.
>
> The 32k numbers are, as you say, better. Possibly reasonable, but I'm
> not sure if they're optimal. It's hard to tell the latency of the
> virtual network, which would be needed to make that determination. It
> would be best if FreeBSD out of the box blows the doors off Debian
> out of the box, and FreeBSD tuned to the gills blows the doors off
> Debian tuned to the gills. Right now, Debian seems to be the one with
> the edge, and with FreeBSD's illustrious history as the NFS
> performance king for so many years, that just won't do. :)
>
> > Btw, I don't think you've mentioned what network device driver gets
> > used for this virtual environment. That might be useful, in case
> > the maintainer of that driver is aware of some issue/patch.
>
> KVM uses virtio.
>
> > > 00:38:07.932732 IP (tos 0x0, ttl 64, id 38912, offset 0, flags
> > > [DF], proto TCP (6), length 53628)
> >
> > I don't know why this would be so large. A 32K write should be
> > under 33Kbytes in size, not 53Kbytes. I suspect tcpdump is confused?
>
> Since TCP is stream oriented, is there a reason to expect a 1:1
> correlation between NFS writes and TCP packets?

Well, my TCP is pretty rusty, but... Since your stats didn't show any
jumbo frames, each IP datagram needs to fit in the MTU of 1500 bytes.
NFS hands an mbuf list of just over 64K (or 32K) to TCP in a single
sosend(), then TCP will generate about 45 (or about 23 for 32K) TCP
segments, put each in an IP datagram, and hand them to the network
device driver for transmission. (wireshark figures this out and shows
you the 45 TCP/IP packets plus a summary of the NFS RPC message they
make up. tcpdump doesn't know how to do this stuff, at least not any
version I've used.) So, in summary, no, unless you use a very small
1Kbyte rsize/wsize.

> > Well, it seems Debian is doing 4096 byte writes, which won't have
> > anywhere near the effect on the network driver/virtual hardware
> > that a 64K (about 45 IP datagrams) in one NFS RPC will.
>
> Debian's kernel says it is doing 64k reads/writes on that mount. So
> again, possibly an expectation of 1:1 correlation between NFS writes
> and TCP packets is not being satisfied.

The tcpdump you posted was showing 4Kbyte NFS writes, not a 4K TCP/IP
datagram. As above, the 4K NFS RPC message will be 3 TCP/IP packets.
tcpdump did succeed in figuring this out, unlike the large writes.
(Look for the entries with "filesync" mentioned in them, in the
tcpdump stuff you posted.)

> However, iozone is doing 4k reads/writes for these tests, so it's
> also possible that Debian is not coalescing them at all (which
> FreeBSD apparently is) and the 4k writes are hitting the virtual
> wire as-is.
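As a back-of-the-envelope check on the segment counts mentioned above: with a 1500-byte MTU, the TCP maximum segment size is roughly 1460 bytes (1500 minus 20 bytes of IPv4 header and 20 bytes of TCP header, ignoring options), so a single NFS RPC of a given size splits into about ceil(size / 1460) wire packets. This sketch ignores RPC framing overhead, so real counts can differ by a packet or two:

```python
import math

MTU = 1500
MSS = MTU - 20 - 20  # 1460: MTU minus IPv4 and TCP headers, no options

def tcp_segments(rpc_bytes, mss=MSS):
    """Roughly how many TCP segments one NFS RPC send becomes."""
    return math.ceil(rpc_bytes / mss)

print(tcp_segments(64 * 1024))  # 64K write RPC -> 45 segments
print(tcp_segments(32 * 1024))  # 32K write RPC -> 23 segments
print(tcp_segments(4 * 1024))   # 4K write RPC  -> 3 segments
```

These match the "about 45", "about 23", and "3 TCP/IP packets" figures in the discussion.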
Yep, or the Linux client might be writing a page at a time.

> Also, both sides have TSO and LRO, so it would be surprising (and
> incorrect?) behavior if a 64k packet were actually fragmented into 45
> IP datagrams. Although if something is happening to temporarily
> provoke exactly that behavior, it might explain the 1500 byte
> packets, so that's definitely a lead. Maybe it would be possible for
> me to peek at the stream from various different points and establish
> who is doing the fragmenting.

The stuff you posted didn't list any jumbo frames, so 1500-byte TCP/IP
datagrams must be what TCP generates and sends via the network device
driver.

> It could be that if Debian is basically disregarding the 64k setting
> and using only 4k packets, it's simply not hitting whatever
> large-packet bad behavior that is harming FreeBSD. However it also
> performs better in the server role, with the client requesting the
> larger packets. So that's not definitive.

A little nit here: it isn't a large packet, it is a burst of 1500-byte
packets resulting from a send of a large NFS RPC message.

> > Yea, looking at this case in wireshark might make what is going on
> > apparent.
>
> Possibly, but that would likely have to be done by someone with more
> NFS protocol familiarity than I.

Well, wireshark is pretty good at pointing out stuff like retransmits,
which are mostly what you are looking for. (And you can easily scan
the timestamps column and look for large delays.) It also reports
relative sequence numbers, so you don't have to do the math in your
head.

> Also, the incorrect checksums on outbound packets are normal because
> the interface supports checksum offloading. The checksum simply
> hasn't been calculated yet when tcpdump sees it.

Ok, that makes sense. I never use TSO or checksum offload, since I've
seen them broken too many times (but that doesn't mean they're broken
in this case). I recall you saying you tried turning off TSO with no
effect. You might also try turning off checksum offload. I doubt it
will be where things are broken, but it might be worth a try.

Again, if you take the packet trace for a FreeBSD 64K test and put it
in wireshark, you might be able to see how things are broken.
(wireshark is your friend, believe me on this one;-)

rick

> Thanks!

From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 02:34:12 2014
Date: Thu, 23 Jan 2014 21:34:10 -0500 (EST)
From: Rick Macklem
To: J David
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <1858487115.15503128.1390530850665.JavaMail.root@uoguelph.ca>
Cc: freebsd-net@freebsd.org

J. David wrote:
> On Wed, Jan 22, 2014 at 11:12 PM, Rick Macklem wrote:
> > So, do you consider the 32K results as reasonable or terrible
> > performance? (They are obviously much better than 64K, except for
> > the reread case.)
>
> It's the 64k numbers that prompted the "terrible" thread title. A
> 196K/sec write speed on a 10+ Gbit network is pretty disastrous.
>
> The 32k numbers are, as you say, better. Possibly reasonable, but I'm
> not sure if they're optimal. It's hard to tell the latency of the
> virtual network, which would be needed to make that determination. It
> would be best if FreeBSD out of the box blows the doors off Debian
> out of the box, and FreeBSD tuned to the gills blows the doors off
> Debian tuned to the gills. Right now, Debian seems to be the one with
> the edge, and with FreeBSD's illustrious history as the NFS
> performance king for so many years, that just won't do. :)

Btw, did you see this recent thread, where disabling TSO resolved the
problem?

http://docs.FreeBSD.org/cgi/mid.cgi?2C287272-7B57-4AAD-B22F-6A65D9F8677B

rick

> [...]
From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 02:51:09 2014
Date: Thu, 23 Jan 2014 21:51:08 -0500
References: <1891524918.14888294.1390450374695.JavaMail.root@uoguelph.ca>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
From: J David
To: Rick Macklem
Cc: freebsd-net@freebsd.org

Rick,

If iozone wants to do a 4kiB write in the middle of a 1GiB file and
the rsize/wsize are both set to 32kiB, does NFS read a 32k block over
the wire, modify it, and send it back? Or does it just send "write
this 4k at offset X" and let the server sort it out?

Some of the tests I'm running are producing very strange results, and
I'm trying to understand what might be happening.

Thanks!

From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 03:18:03 2014
Date: Thu, 23 Jan 2014 22:18:02 -0500 (EST)
From: Rick Macklem
To: J David
Cc: freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <58591523.15519962.1390533482068.JavaMail.root@uoguelph.ca>

J David wrote:
> Rick,
>
> If iozone wants to do a 4kiB write in the middle of a 1GiB file and
> the rsize/wsize are both set to 32kiB, does NFS read a 32k block over
> the wire, modify it, and send it back? Or does it just send "write
> this 4k at offset X" and let the server sort it out?

This depends on the client. For FreeBSD, if the rest of the 32K block
has not been modified recently, it will mark that 4K byte range as
dirty (b_dirtyoff, b_dirtyend in "struct buf") and it will do a 4Kbyte
write at the correct offset. For Linux, I don't know. (Take a look at
a wireshark trace of it and find out. You could write a simple program
that does this once, so that the packet trace is short. Sorry, I
couldn't resist;-)

An NFS server must always be able to handle a write of any length
starting at any byte offset. The performance effects on the server are
largely defined by the type of exported file system on the server.

I didn't mention this before, but using UFS will give you more
realistic results than using mfs, since mfs would never be used for a
real NFS server and never gets tested as an exported NFS file system.

rick

> Some of the tests I'm running are producing very strange results, and
> I'm trying to understand what might be happening.
>
> Thanks!
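The b_dirtyoff/b_dirtyend behavior Rick describes can be modeled in a few lines: each cached block remembers a single contiguous dirty byte range, and a flush sends only that range over the wire. This is a simplified userland sketch (the class and its names are illustrative, not the kernel code):

```python
class BufModel:
    """Toy model of one NFS client buffer-cache block (e.g. 32K).

    Like struct buf's b_dirtyoff/b_dirtyend, it tracks one contiguous
    dirty range and flushes only that range, not the whole block.
    """
    def __init__(self, block_off, block_size=32 * 1024):
        self.block_off = block_off   # file offset of this block
        self.block_size = block_size
        self.dirtyoff = None
        self.dirtyend = None

    def write(self, off, length):
        # Merge this write into the single dirty range.
        if self.dirtyoff is None:
            self.dirtyoff, self.dirtyend = off, off + length
        else:
            self.dirtyoff = min(self.dirtyoff, off)
            self.dirtyend = max(self.dirtyend, off + length)

    def flush(self):
        # Return the (file_offset, length) of the WRITE RPC to send.
        rpc = (self.block_off + self.dirtyoff,
               self.dirtyend - self.dirtyoff)
        self.dirtyoff = self.dirtyend = None
        return rpc

# A 4K write in the middle of a 32K block only sends 4K over the wire.
b = BufModel(block_off=1024 * 1024)
b.write(off=8192, length=4096)
print(b.flush())  # -> (1056768, 4096)
```

Note the trade-off the single range implies: two distant small writes to the same block get merged into one larger range, so clean bytes between them go over the wire too.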
> _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 03:49:26 2014 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 58F4EFB2 for ; Fri, 24 Jan 2014 03:49:26 +0000 (UTC) Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [119.145.14.65]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id BF8DD1EAA for ; Fri, 24 Jan 2014 03:49:24 +0000 (UTC) Received: from 172.24.2.119 (EHLO szxeml214-edg.china.huawei.com) ([172.24.2.119]) by szxrg02-dlp.huawei.com (MOS 4.3.7-GA FastPath queued) with ESMTP id BPB18953; Fri, 24 Jan 2014 11:48:53 +0800 (CST) Received: from SZXEML408-HUB.china.huawei.com (10.82.67.95) by szxeml214-edg.china.huawei.com (172.24.2.29) with Microsoft SMTP Server (TLS) id 14.3.158.1; Fri, 24 Jan 2014 11:48:07 +0800 Received: from [127.0.0.1] (10.177.18.75) by szxeml408-hub.china.huawei.com (10.82.67.95) with Microsoft SMTP Server id 14.3.158.1; Fri, 24 Jan 2014 11:48:03 +0800 Message-ID: <52E1E272.8060009@huawei.com> Date: Fri, 24 Jan 2014 11:48:02 +0800 From: Wang Weidong User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: Giuseppe Lettieri , =?UTF-8?B?ZmFjb2x0w6A=?= Subject: Re: netmap: I got some troubles with netmap References: <52D74E15.1040909@huawei.com> <92C7725B-B30A-4A19-925A-A93A2489A525@iet.unipi.it> <52D8A5E1.9020408@huawei.com> <52DD1914.7090506@iet.unipi.it> In-Reply-To: <52DD1914.7090506@iet.unipi.it> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit X-Originating-IP: [10.177.18.75] 
X-CFilter-Loop: Reflected Cc: Luigi Rizzo , Vincenzo Maffione , net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Jan 2014 03:49:26 -0000

On 2014/1/20 20:39, Giuseppe Lettieri wrote:
> Hi Wang,
>
> OK, you are using the netmap support in the upstream qemu git. That does not yet include all our modifications, some of which are very important for high throughput with VALE. In particular, the upstream qemu does not include the batching improvements in the frontend/backend interface, and it does not include the "map ring" optimization of the e1000 frontend. Please find attached a gzipped patch that contains all of our qemu code. The patch is against the latest upstream master (commit 1cf892ca).
>
> Please ./configure the patched qemu with the following option, in addition to any other option you may need:
>
> --enable-e1000-paravirt --enable-netmap \
> --extra-cflags=-I/path/to/netmap/sys/directory
>
> Note that --enable-e1000-paravirt is needed to enable the "map ring" optimization in the e1000 frontend, even if you are not going to use the e1000-paravirt device.
>
> Now you should be able to rerun your tests. I am also attaching a README file that describes some more tests you may want to run.
>

Yes, I applied the qemu-netmap-bc767e701.patch to qemu and downloaded 20131019-tinycore-netmap.hdd. Then I ran some tests:

1. I used the bridge below:
qemu-system-x86_64 -m 2048 -boot c -net nic -net bridge,br=br1 -hda /home/wwd/tinycores/20131019-tinycore-netmap.hdd -enable-kvm -vnc :0
testing between two VMs, with no device attached to br1.
Using pkt-gen, I got 237.95 kpps.
Using netserver/netperf I got 1037 Mbits/sec with TCP_STREAM; the peak was up to 1621M.
Using netserver/netperf I got 3296 transactions/sec with TCP_RR.
Using netserver/netperf I got 234M/86M bits/sec with UDP_STREAM.

When I add a device from the host to br1, the speed is 159.86 kpps.
Using netserver/netperf I got 720 Mbits/sec with TCP_STREAM; the peak was up to 1000M.
Using netserver/netperf I got 3556 transactions/sec with TCP_RR.
Using netserver/netperf I got 181M/181M bits/sec with UDP_STREAM.

What do you think of these data?

2. I used the VALE switch below:
qemu-system-x86_64 -m 2048 -boot c -net nic -net netmap,vale0:0 -hda /home/wwd/tinycores/20131019-tinycore-netmap.hdd -enable-kvm -vnc :0

Testing with 2 VMs on the same host, vale0 without a device:
Using pkt-gen, the speed is 938 Kpps.
Using netperf -H 10.0.0.2 -t UDP_STREAM, I got 195M/195M; after adding -- -m 8, I only got 1.07M/1.07M. With a smaller message size, does the speed always drop like this?
With vale-ctl -a vale0:eth2: pkt-gen gives 928 Kpps; netperf -H 10.0.0.2 -t UDP_STREAM gives 209M/208M, and with -- -m 8 only 1.06M/1.06M.
With vale-ctl -h vale0:eth2: pkt-gen gives 928 Kpps; netperf -H 10.0.0.2 -t UDP_STREAM gives 192M/192M, and with -- -m 8 only 1.06M/1.06M.

Testing with 2 VMs on two hosts, I can only test with vale-ctl -h vale0:eth2 and eth2 set to promiscuous mode:
pkt-gen with the default params gives about 750 Kpps.
netperf -H 10.0.0.2 -t UDP_STREAM gives 160M/160M.
Is this right?

3. I can't use the l2 utils. When I run "sudo l2open -t eth0 l2recv[l2send]", I get "l2open ioctl(TUNSETIFF...): Invalid argument", and with "l2open -r eth0 l2recv", after waiting a moment (only several seconds), I get the result:
TEST-RESULT: 0.901 kpps 1pkts select/read=100.00 err=0
Also, I can't find the l2 utils on the net - are they implemented by your team?

All of this was tested on VMs.

Cheers.
Wang > > Cheers, > Giuseppe > > Il 17/01/2014 04:39, Wang Weidong ha scritto: >> On 2014/1/16 18:24, facoltà wrote: [...] >> >> > > From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 05:06:50 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 65459FB0 for ; Fri, 24 Jan 2014 05:06:50 +0000 (UTC) Received: from mail-ie0-x22c.google.com (mail-ie0-x22c.google.com [IPv6:2607:f8b0:4001:c03::22c]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 341871438 for ; Fri, 24 Jan 2014 05:06:50 +0000 (UTC) Received: by mail-ie0-f172.google.com with SMTP id e14so2353491iej.17 for ; Thu, 23 Jan 2014 21:06:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=VHCtYxYbxKg9hU3ZlP69fzTncVdbKNholfqHChBSp3k=; b=n27AMdgVjMa8+X3Vh4u34cGOqRil+sNjcgmr4G2GKILnSQfGuBlPiTePNLCy1EedZq 8L0bt7S9IoZ6imaFEVJZZqHdlGFpI9144iKXy/xuQ88fD+Zd18YqxVFTfwBTSJVHM1Mn imhWI+M+Mzns7Hk0qEzNnx8hk1d1MOhtPgIwQ1UE+WklML39jmJVodrEaSlxVYtJYytH h21nt+uIntGblycUE6uISyF3G6w7pV1p9NwDnRdbsYzfwRQCVES2Hskc7OgJrDJVeTKM b1yM5K+qaPuyLfpMnH+IMT4LJuFMm5fW1lmmu37heER2FdbWrKTNFJ3QrRBC5dYdYLfC CLqw== MIME-Version: 1.0 X-Received: by 10.42.122.146 with SMTP id n18mr8956971icr.41.1390540009750; Thu, 23 Jan 2014 21:06:49 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Thu, 23 Jan 2014 21:06:49 -0800 (PST) In-Reply-To: <390483613.15499210.1390530437153.JavaMail.root@uoguelph.ca> References: <390483613.15499210.1390530437153.JavaMail.root@uoguelph.ca> Date: Fri, 24 Jan 2014 00:06:49 -0500 X-Google-Sender-Auth: J-RDA1etKcWy_3X8t-BB1Pnxidc Message-ID: Subject: Re: Terrible NFS performance under 
9.2-RELEASE? From: J David To: Rick Macklem Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Jan 2014 05:06:50 -0000 On Thu, Jan 23, 2014 at 9:27 PM, Rick Macklem wrote: > Well, my TCP is pretty rusty, but... > Since your stats didn't show any jumbo frames, each IP > datagram needs to fit in the MTU of 1500bytes. NFS hands an mbuf > list of just over 64K (or 32K) to TCP in a single sosend(), then TCP > will generate about 45 (or about 23 for 32K) TCP segments and put > each in an IP datagram, then hand it to the network device driver > for transmission. This is *not* what happens with TSO/LRO. With TSO, TCP generates IP datagrams of up to 64k which are passed directly to the driver, which passes them directly to the hardware. Furthermore, in this unique case (two virtual machines on the same host and bridge with both TSO and LRO enabled end-to-end), the packet is *never* fragmented. The host takes the 64k packet off of one guest's output ring and puts it onto the other guest's input ring, intact. This is, as you might expect, a *massive* performance win. 
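Rick's segment counts are easy to sanity-check. The sketch below (helper name invented) assumes the usual arithmetic for a 1500-byte MTU: 20 bytes of IPv4 header plus 20 bytes of TCP header plus 12 bytes of timestamp options leave a 1448-byte MSS, which is exactly the per-packet "length 1448" visible in non-TSO packet traces.

```c
#include <assert.h>

/*
 * Number of TCP segments needed to carry "payload" bytes at a given
 * MSS (ceiling division).  Illustrative helper, invented name.
 */
static int
tcp_segments(int payload, int mss)
{
	return ((payload + mss - 1) / mss);
}
```

tcp_segments(65536, 1448) is 46 and tcp_segments(32768, 1448) is 23, matching Rick's "about 45" and "about 23" segments for 64K and 32K writes; with TSO the whole chain is handed to the driver as one ~64K datagram instead.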
With TSO & LRO:

$ time iperf -c 172.20.20.162 -d
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 172.20.20.162, TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  5] local 172.20.20.169 port 60889 connected with 172.20.20.162 port 5001
[  4] local 172.20.20.169 port 5001 connected with 172.20.20.162 port 44101
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  17.0 GBytes  14.6 Gbits/sec
[  4]  0.0-10.0 sec  17.4 GBytes  14.9 Gbits/sec

real	0m10.061s
user	0m0.229s
sys	0m7.711s

Without TSO & LRO:

$ time iperf -c 172.20.20.162 -d
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 172.20.20.162, TCP port 5001
TCP window size: 1.26 MByte (default)
------------------------------------------------------------
[  5] local 172.20.20.169 port 22088 connected with 172.20.20.162 port 5001
[  4] local 172.20.20.169 port 5001 connected with 172.20.20.162 port 48615
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec   637 MBytes   534 Mbits/sec
[  4]  0.0-10.0 sec   767 MBytes   642 Mbits/sec

real	0m10.057s
user	0m0.231s
sys	0m3.935s

Look at the difference. In this bidirectional test, TSO is over 25x faster using not even 2x the CPU. This shows how essential TSO/LRO is if you plan to move data at real world speeds and still have enough CPU left to operate on that data.

> I recall you saying you tried turning off TSO with no
> effect. You might also try turning off checksum offload. I doubt it will
> be where things are broken, but might be worth a try.
That was not me, that was someone else. If there is a problem with NFS and TSO, the solution is *not* to disable TSO. That is, at best, a workaround that produces much more CPU load and much less throughput. The solution is to find the problem and fix it. More data to follow. Thanks! From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 05:10:36 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id D7B64E0 for ; Fri, 24 Jan 2014 05:10:36 +0000 (UTC) Received: from mail-ie0-x22d.google.com (mail-ie0-x22d.google.com [IPv6:2607:f8b0:4001:c03::22d]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id A6AAD144A for ; Fri, 24 Jan 2014 05:10:36 +0000 (UTC) Received: by mail-ie0-f173.google.com with SMTP id e14so2350263iej.4 for ; Thu, 23 Jan 2014 21:10:35 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=RIDSEQZS3Ao8mYEnqUqkWU/PW5PvMfMVnRB+6nejkVI=; b=e0j5M3A8v5nFyFdVob/iIw+VY7Gpkg9XDkTTOfY/ezF9eXcEG9I1nD2FRwTlFMkvwB UXNj7bzXfIRmsihzwuGy4D0IMSgE8ChUwLwmfiCX80QGayCV39zfIE4sFdcXYJMKwzyb zAnqaqFUNxDz8Y6jEMftRrneLlVaoNIMag0w0jQHQIx3YtH/dvCuBkQlimEB/HolL8OY iJDv6hDRTPjXGDCMVxvAeXSeKcd1nXb2Bhkkp0Cw20o/gzCbwMd1mViy+irFy3kUJc+5 14HRrrnhBLcGctuyJeRohNQ6VxtrBfofHM3es7jOhYsLbnGS4FuzljnQo1epEkVxC9Jk QQAg== MIME-Version: 1.0 X-Received: by 10.43.161.2 with SMTP id me2mr9398038icc.20.1390540235829; Thu, 23 Jan 2014 21:10:35 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Thu, 23 Jan 2014 21:10:35 -0800 (PST) In-Reply-To: <58591523.15519962.1390533482068.JavaMail.root@uoguelph.ca> References: 
<58591523.15519962.1390533482068.JavaMail.root@uoguelph.ca> Date: Fri, 24 Jan 2014 00:10:35 -0500 X-Google-Sender-Auth: wwCUo9SZWjjCfOtsvpA1hibDDaA Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? From: J David To: Rick Macklem Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Jan 2014 05:10:36 -0000

On Thu, Jan 23, 2014 at 10:18 PM, Rick Macklem wrote:
> I didn't mention this before, but using UFS will give you more realistic
> results than using mfs, since mfs would never be used for a real NFS
> server and never gets tested as an exported NFS file system.

That's my mistake; I said "mfs" but what I meant was:

$ sudo mdconfig -a -t swap -s 2g md0
$ sudo newfs -U /dev/md0
/dev/md0: 2048.0MB (4194304 sectors) block size 32768, fragment size 4096
	using 4 cylinder groups of 512.03MB, 16385 blks, 65664 inodes.
	with soft updates
super-block backups (for fsck_ffs -b #) at:
 192, 1048832, 2097472, 3146112
$ sudo mount /dev/md0 /mnt
$ cat /etc/exports
/mnt -alldirs -maproot=root 172.20.20.166 172.20.20.168 172.20.20.169

So it absolutely is UFS, just backed by RAM, and I just forgot that "mfs" is only properly used to refer to that read-only memory blob filesystem used for miniroots. Sorry for any confusion.

Thanks!
From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 05:26:34 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id C8175363 for ; Fri, 24 Jan 2014 05:26:34 +0000 (UTC) Received: from mail-ig0-x22b.google.com (mail-ig0-x22b.google.com [IPv6:2607:f8b0:4001:c05::22b]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 93A84158B for ; Fri, 24 Jan 2014 05:26:34 +0000 (UTC) Received: by mail-ig0-f171.google.com with SMTP id uy17so1645462igb.4 for ; Thu, 23 Jan 2014 21:26:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=+hSWJutBikVB9HWuxq1OYAIZPsvsEBwpNs8Abtkg+OM=; b=QR6iqBylkByHG9qHerS2NdDOfwTL0PbiJNvahmDYry6CfEB7s/C7zJkTLQrFxvJt7E fbZ6pDKuh1cE6tbRl6m1vhigIu8ta2lHDHy/rHoUtyIAs/cG1Q1bkC4yfDONegzU9xxa m0AhQlLfR3yfEyyRf9SN8qG1fHLR8bolpME9hm3J9m6V7993KFEXDtiwIxPnoR71lcNv cjLutc+XOZEuBjg039Y8A57KQoZFG2RbVh+Q8iNUaDNb9fZ8D0Aj8nylL5Wqh6S8W+DT iGp9rNQuY+dMq5KTJI3NVXbK7If5LUVNAgULrhryFOSXQAT/1aYKIr1pslYmSBVRB8i0 huCA== MIME-Version: 1.0 X-Received: by 10.51.17.101 with SMTP id gd5mr2788844igd.25.1390541194139; Thu, 23 Jan 2014 21:26:34 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Thu, 23 Jan 2014 21:26:34 -0800 (PST) In-Reply-To: References: <58591523.15519962.1390533482068.JavaMail.root@uoguelph.ca> Date: Fri, 24 Jan 2014 00:26:34 -0500 X-Google-Sender-Auth: DW6G79NOx8ZeC5Uv9HMbs1UVSiI Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
From: J David To: Rick Macklem Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Jan 2014 05:26:34 -0000 Here's a pair of quick tcpdumps with TSO/LRO on and off showing that tcpdump does not reassemble small packets into larger ones: TSO/LRO on: 05:21:54.956061 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [S], seq 2538331145, win 65535, options [mss 1460,nop,wscale 9,sackOK,TS val 2932122 ecr 0], length 0 05:21:54.956239 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [S.], seq 3371756775, ack 2538331146, win 65535, options [mss 1460,nop,wscale 9,sackOK,TS val 2833562423 ecr 2932122], length 0 05:21:54.956292 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], ack 1, win 2050, options [nop,nop,TS val 2932132 ecr 2833562423], length 0 05:21:54.956372 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [P.], seq 1:25, ack 1, win 2050, options [nop,nop,TS val 2932132 ecr 2833562423], length 24 05:21:54.956432 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 1, win 2050, options [nop,nop,TS val 2833562423 ecr 2932132], length 0 05:21:54.956495 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 25:4381, ack 1, win 2050, options [nop,nop,TS val 2932132 ecr 2833562423], length 4356 05:21:54.956604 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 4381, win 2041, options [nop,nop,TS val 2833562423 ecr 2932132], length 0 05:21:54.956620 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 4381:11657, ack 1, win 2050, options [nop,nop,TS val 2932132 ecr 2833562423], length 7276 05:21:55.050656 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 11657, win 2050, options [nop,nop,TS val 2833562523 ecr 2932132], length 0 05:21:55.050686 IP 172.20.20.169.24265 > 
172.20.20.162.5001: Flags [.], seq 11657:21829, ack 1, win 2050, options [nop,nop,TS val 2932222 ecr 2833562523], length 10172 05:21:55.150644 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 21829, win 2050, options [nop,nop,TS val 2833562623 ecr 2932222], length 0 05:21:55.150674 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 21829:34897, ack 1, win 2050, options [nop,nop,TS val 2932322 ecr 2833562623], length 13068 05:21:55.250629 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 34897, win 2050, options [nop,nop,TS val 2833562723 ecr 2932322], length 0 05:21:55.250658 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 34897:50861, ack 1, win 2050, options [nop,nop,TS val 2932422 ecr 2833562723], length 15964 05:21:55.350647 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 50861, win 2050, options [nop,nop,TS val 2833562823 ecr 2932422], length 0 05:21:55.350677 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 50861:69721, ack 1, win 2050, options [nop,nop,TS val 2932522 ecr 2833562823], length 18860 05:21:55.450656 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 69721, win 2050, options [nop,nop,TS val 2833562923 ecr 2932522], length 0 05:21:55.450686 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 69721:91477, ack 1, win 2050, options [nop,nop,TS val 2932622 ecr 2833562923], length 21756 05:21:55.550577 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 91477, win 2050, options [nop,nop,TS val 2833563023 ecr 2932622], length 0 05:21:55.550608 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 91477:116129, ack 1, win 2050, options [nop,nop,TS val 2932722 ecr 2833563023], length 24652 05:21:55.650645 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 116129, win 2050, options [nop,nop,TS val 2833563123 ecr 2932722], length 0 05:21:55.650676 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 116129:143677, ack 1, win 2050, options 
[nop,nop,TS val 2932822 ecr 2833563123], length 27548 05:21:55.750643 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 143677, win 2050, options [nop,nop,TS val 2833563223 ecr 2932822], length 0 05:21:55.750675 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 143677:174121, ack 1, win 2050, options [nop,nop,TS val 2932922 ecr 2833563223], length 30444 05:21:55.850636 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 174121, win 2050, options [nop,nop,TS val 2833563323 ecr 2932922], length 0 05:21:55.850667 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 174121:207461, ack 1, win 2050, options [nop,nop,TS val 2933022 ecr 2833563323], length 33340 TSO/LRO off: 05:19:34.556302 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [S], seq 529322163, win 65535, options [mss 1460,nop,wscale 9,sackOK,TS val 2791722 ecr 0], length 0 05:19:34.556414 IP 172.20.20.162.5001 > 172.20.20.169.53485: Flags [S.], seq 3835815533, ack 529322164, win 65535, options [mss 1460,nop,wscale 9,sackOK,TS val 1931664416 ecr 2791722], length 0 05:19:34.556443 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 0 05:19:34.556505 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [P.], seq 1:25, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 24 05:19:34.556604 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 25:1473, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448 05:19:34.556621 IP 172.20.20.162.5001 > 172.20.20.169.53485: Flags [.], ack 1, win 2050, options [nop,nop,TS val 1931664416 ecr 2791732], length 0 05:19:34.556648 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 1473:2921, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448 05:19:34.556672 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 2921:4369, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 
1931664416], length 1448 05:19:34.556711 IP 172.20.20.162.5001 > 172.20.20.169.53485: Flags [.], ack 1473, win 2047, options [nop,nop,TS val 1931664416 ecr 2791732], length 0 05:19:34.556730 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 4369:5817, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448 05:19:34.556743 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 5817:7265, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448 05:19:34.556755 IP 172.20.20.162.5001 > 172.20.20.169.53485: Flags [.], ack 4369, win 2041, options [nop,nop,TS val 1931664416 ecr 2791732], length 0 05:19:34.556774 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 7265:8713, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448 05:19:34.556793 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 8713:10161, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448 05:19:34.556813 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 10161:11609, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448 05:19:34.556830 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 11609:13057, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448 05:19:34.556848 IP 172.20.20.162.5001 > 172.20.20.169.53485: Flags [.], ack 7265, win 2036, options [nop,nop,TS val 1931664416 ecr 2791732], length 0 05:19:34.556865 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 13057:14505, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448 05:19:34.556881 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 14505:15953, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448 05:19:34.556893 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 15953:17401, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448 05:19:34.556912 IP 
172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 17401:18849, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448 05:19:34.556929 IP 172.20.20.162.5001 > 172.20.20.169.53485: Flags [.], ack 10161, win 2030, options [nop,nop,TS val 1931664416 ecr 2791732], length 0 The stated lengths represent the length of each IP packet as it appears on the "wire." The result is the same whether it is obtained from the client, the server, or the KVM host snooping directly on the bridge. Thanks! From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 07:36:24 2014 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 978D1B30; Fri, 24 Jan 2014 07:36:24 +0000 (UTC) Received: from forward-corp1e.mail.yandex.net (forward-corp1e.mail.yandex.net [IPv6:2a02:6b8:0:202::10]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id EA6B91E5B; Fri, 24 Jan 2014 07:36:23 +0000 (UTC) Received: from smtpcorp4.mail.yandex.net (smtpcorp4.mail.yandex.net [95.108.252.2]) by forward-corp1e.mail.yandex.net (Yandex) with ESMTP id EEAFD640156; Fri, 24 Jan 2014 11:36:08 +0400 (MSK) Received: from smtpcorp4.mail.yandex.net (localhost [127.0.0.1]) by smtpcorp4.mail.yandex.net (Yandex) with ESMTP id C9BB92C074A; Fri, 24 Jan 2014 11:36:08 +0400 (MSK) Received: from 95.108.170.36-red.dhcp.yndx.net (95.108.170.36-red.dhcp.yndx.net [95.108.170.36]) by smtpcorp4.mail.yandex.net (nwsmtp/Yandex) with ESMTPSA id CgjK8oGxfP-a8vOdV3E; Fri, 24 Jan 2014 11:36:08 +0400 (using TLSv1 with cipher CAMELLIA256-SHA (256/256 bits)) (Client certificate not present) X-Yandex-Uniq: 2899ebf9-2281-439c-b2c4-dd8a763c3304 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yandex-team.ru; s=default; t=1390548968; 
bh=XstNle/Nib50Wr/KnaHqTfMQgR6GwXxknNCPO1w7158=; h=Message-ID:Date:From:User-Agent:MIME-Version:To:CC:Subject: Content-Type; b=NzrnCFjPf0allBuiNwQH4bNA+J1u+097C6xHJ9uP9x9DH0CrhEJl8oL4Hwq4gL7Ti PmIqsYTMj42p/8vgHLuLULAwfAiM9yEA6/D7Q1G6+t2bKeheTbw2AWYmSvXE6OtxIC w+SQGebWUBVgfKMW2c0CyONl54SnHlHxUvnWpkWc= Authentication-Results: smtpcorp4.mail.yandex.net; dkim=pass header.i=@yandex-team.ru Message-ID: <52E21721.5010309@yandex-team.ru> Date: Fri, 24 Jan 2014 11:32:49 +0400 From: "Alexander V. Chernikov" User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1 MIME-Version: 1.0 To: "net@freebsd.org" Subject: "slow path" in network code || IPv6 panic on interface removal Content-Type: multipart/mixed; boundary="------------050804080408080705010802" Cc: arch@freebsd.org, hackers@freebsd.org, "Andrey V. Elsukov" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Jan 2014 07:36:24 -0000 This is a multi-part message in MIME format. --------------050804080408080705010802 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit

Hello guys!

Typically we're mostly interested in making the "fast" paths in our code run faster. However, it seems it is time to take care of code which is either called rarely or is quite complex in terms of relative code size and/or locking. Some good examples from the current codebase are probably:

* L3->L2 mapping like ARP handling - while doing arpresolve we discover there is no valid entry, so we start complex locking and ARP request preparation/sending in the same piece of code. This washes out both the i- and d-caches and makes the sending process _more_ unpredictable. Here we could queue the given mbuf for delayed processing and return quickly.

* ip_fastfwd() handling corner cases.
This is already optimized in terms of splitting "fast" and "slow" code paths for all cases.

* ipfw(4) (and probably other pfil consumers) generating/sending various icmp/icmp6 packets for an inbound mbuf

What exactly is proposed:
- another netisr queue for handling different types of packets
- metainfo is stored in an mbuf_tag attached to the packet
- an ifnet departure handler taking care of packets queued from/to the killed ifnet
- an API to register/unregister/dispatch a given type of traffic

The real problem solved by this approach (traced by ae@): we're using per-LLE IPv6 timers for various purposes, most of which require LLE modifications, so the timer function starts with the LLE write lock held. Some timer events require us to send neighbour solicitation messages, which involves a) source address selection (requiring the LLE lock to be held) and b) calling ip6_output(), which requires the LLE lock NOT to be held. It is solved exactly as in the IPv4 ARP handling code: the timer function drops the write lock before calling nd6_ns_output(). Dropping and reacquiring the lock is error-prone; for example, the following scenario is possible (traced by ae@): we're calling if_detach(ifp) (thread 1) and nd6_llinfo_timer (thread 2). Then the following can happen:

#1 T2 releases the LLE lock and runs nd6_ns_output().
#2 T1 proceeds with detaching: in6_ifdetach() -> in6_purgeaddr() -> nd6_rem_ifa_lle() -> in6_lltable_prefix_free(), which removes all LLEs for the given prefix, acquiring each LLE write lock. "Our" LLE is not destroyed since it is refcounted by nd6_llinfo_settimer_locked().
#3 T2 proceeds with nd6_ns_output(), selecting a source address (which involves acquiring the LLE read lock).
#4 T1 finishes detaching the interface addresses and sets ifp->if_addr to NULL.
#5 T2 calls nd6_ifptomac(), which reads the interface MAC from ifp->if_addr.
#6 User inspects the core generated by the previous call.

Using the new API, we can avoid #6 by making the following code changes:
* the LLE timer does not drop/reacquire the LLE lock
* we require nd6_ns_output() callers to lock the LLE if it is provided
* nd6_ns_output() uses the "slow" path instead of sending the mbuf to ip6_output() immediately if the LLE is not NULL.

What do you think?

--------------050804080408080705010802
Content-Type: text/x-patch; name="dly_fin2.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="dly_fin2.diff"

Index: sys/conf/files
===================================================================
--- sys/conf/files	(revision 260983)
+++ sys/conf/files	(working copy)
@@ -3044,6 +3044,7 @@
 net/bpf_filter.c	optional bpf | netgraph_bpf
 net/bpf_zerocopy.c	optional bpf
 net/bridgestp.c		optional bridge | if_bridge
 net/flowtable.c		optional flowtable inet | flowtable inet6
+net/delayed_dispatch.c	standard
 net/ieee8023ad_lacp.c	optional lagg
 net/if.c		standard
 net/if_arcsubr.c	optional arcnet
Index: sys/net/netisr.c
===================================================================
--- sys/net/netisr.c	(revision 260983)
+++ sys/net/netisr.c	(working copy)
@@ -555,6 +555,81 @@ netisr_setqlimit(const struct netisr_handler *nhp,
 }
 
 /*
+ * Scan workqueue and delete mbufs pointed to handler.
+ */
+static int
+netisr_scan_workqueue(struct netisr_work *npwp, netisr_scan_t *scan_f,
+    void *data)
+{
+	struct mbuf *m, *m_prev;
+	int deleted;
+
+	deleted = 0;
+	m_prev = NULL;
+	m = npwp->nw_head;
+	while (m != NULL) {
+		if (scan_f(m, data) == 0) {
+			m_prev = m;
+			m = m->m_nextpkt;
+			continue;
+		}
+
+		/* Handler requested item deletion */
+		if (m_prev == NULL)
+			npwp->nw_head = m->m_nextpkt;
+		else
+			m_prev->m_nextpkt = m->m_nextpkt;
+
+		if (m->m_nextpkt == NULL)
+			npwp->nw_tail = m_prev;
+
+		npwp->nw_len--;
+		m_freem(m);
+		deleted++;
+
+		if (m_prev == NULL)
+			m = npwp->nw_head;
+		else
+			m = m_prev->m_nextpkt;
+	}
+
+	return (deleted);
+}
+
+int
+netisr_scan(unsigned int proto, netisr_scan_t *scan_f, void *data)
+{
+#ifdef NETISR_LOCKING
+	struct rm_priotracker tracker;
+#endif
+	struct netisr_proto *np;
+	struct netisr_work *npwp;
+	unsigned int i;
+	int deleted;
+
+#ifdef NETISR_LOCKING
+	NETISR_RLOCK(&tracker);
+#endif
+
+	deleted = 0;
+
+	KASSERT(scan_f != NULL, ("%s: scan function is NULL", __func__));
+
+	np = &netisr_proto[proto];
+
+	CPU_FOREACH(i) {
+		npwp = &(DPCPU_ID_PTR(i, nws))->nws_work[proto];
+		deleted += netisr_scan_workqueue(npwp, scan_f, data);
+	}
+
+#ifdef NETISR_LOCKING
+	NETISR_RUNLOCK(&tracker);
+#endif
+
+	return (deleted);
+}
+
+/*
  * Drain all packets currently held in a particular protocol work queue.
  */
 static void
Index: sys/net/netisr.h
===================================================================
--- sys/net/netisr.h	(revision 260983)
+++ sys/net/netisr.h	(working copy)
@@ -61,6 +61,7 @@
 #define	NETISR_IPV6	10
 #define	NETISR_NATM	11
 #define	NETISR_EPAIR	12	/* if_epair(4) */
+#define	NETISR_SLOWPATH	13	/* delayed dispatch */
 
 /*
  * Protocol ordering and affinity policy constants.
See the detailed @@ -178,6 +179,7 @@ struct sysctl_netisr_work { */ struct mbuf; typedef void netisr_handler_t(struct mbuf *m); +typedef int netisr_scan_t(struct mbuf *m, void *); typedef struct mbuf *netisr_m2cpuid_t(struct mbuf *m, uintptr_t source, u_int *cpuid); typedef struct mbuf *netisr_m2flow_t(struct mbuf *m, uintptr_t source); @@ -212,6 +214,7 @@ void netisr_getqlimit(const struct netisr_handler void netisr_register(const struct netisr_handler *nhp); int netisr_setqlimit(const struct netisr_handler *nhp, u_int qlimit); void netisr_unregister(const struct netisr_handler *nhp); +int netisr_scan(u_int proto, netisr_scan_t *, void *); /* * Process a packet destined for a protocol, and attempt direct dispatch. Index: sys/netinet6/nd6.c =================================================================== --- sys/netinet6/nd6.c (revision 260983) +++ sys/netinet6/nd6.c (working copy) @@ -153,6 +153,8 @@ nd6_init(void) callout_init(&V_nd6_slowtimo_ch, 0); callout_reset(&V_nd6_slowtimo_ch, ND6_SLOWTIMER_INTERVAL * hz, nd6_slowtimo, curvnet); + + nd6_nbr_init(); } #ifdef VIMAGE @@ -160,6 +162,7 @@ void nd6_destroy() { + nd6_nbr_destroy(); callout_drain(&V_nd6_slowtimo_ch); callout_drain(&V_nd6_timer_ch); } @@ -500,9 +503,7 @@ nd6_llinfo_timer(void *arg) if (ln->la_asked < V_nd6_mmaxtries) { ln->la_asked++; nd6_llinfo_settimer_locked(ln, (long)ndi->retrans * hz / 1000); - LLE_WUNLOCK(ln); nd6_ns_output(ifp, NULL, dst, ln, 0); - LLE_WLOCK(ln); } else { struct mbuf *m = ln->la_hold; if (m) { @@ -547,9 +548,7 @@ nd6_llinfo_timer(void *arg) ln->la_asked = 1; ln->ln_state = ND6_LLINFO_PROBE; nd6_llinfo_settimer_locked(ln, (long)ndi->retrans * hz / 1000); - LLE_WUNLOCK(ln); nd6_ns_output(ifp, dst, dst, ln, 0); - LLE_WLOCK(ln); } else { ln->ln_state = ND6_LLINFO_STALE; /* XXX */ nd6_llinfo_settimer_locked(ln, (long)V_nd6_gctimer * hz); @@ -559,9 +558,7 @@ nd6_llinfo_timer(void *arg) if (ln->la_asked < V_nd6_umaxtries) { ln->la_asked++; nd6_llinfo_settimer_locked(ln, 
(long)ndi->retrans * hz / 1000); - LLE_WUNLOCK(ln); nd6_ns_output(ifp, dst, dst, ln, 0); - LLE_WLOCK(ln); } else { EVENTHANDLER_INVOKE(lle_event, ln, LLENTRY_EXPIRED); (void)nd6_free(ln, 0); Index: sys/netinet6/nd6.h =================================================================== --- sys/netinet6/nd6.h (revision 260983) +++ sys/netinet6/nd6.h (working copy) @@ -421,6 +421,8 @@ int nd6_storelladdr(struct ifnet *, struct mbuf *, const struct sockaddr *, u_char *, struct llentry **); /* nd6_nbr.c */ +void nd6_nbr_init(void); +void nd6_nbr_destroy(void); void nd6_na_input(struct mbuf *, int, int); void nd6_na_output(struct ifnet *, const struct in6_addr *, const struct in6_addr *, u_long, int, struct sockaddr *); Index: sys/netinet6/nd6_nbr.c =================================================================== --- sys/netinet6/nd6_nbr.c (revision 260983) +++ sys/netinet6/nd6_nbr.c (working copy) @@ -74,6 +74,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #define SDL(s) ((struct sockaddr_dl *)s) @@ -87,12 +88,37 @@ static void nd6_dad_ns_input(struct ifaddr *); static void nd6_dad_na_input(struct ifaddr *); static void nd6_na_output_fib(struct ifnet *, const struct in6_addr *, const struct in6_addr *, u_long, int, struct sockaddr *, u_int); +static int nd6_ns_output2(struct mbuf *, int, uintptr_t, struct ifnet *); VNET_DEFINE(int, dad_ignore_ns) = 0; /* ignore NS in DAD - specwise incorrect*/ VNET_DEFINE(int, dad_maxtry) = 15; /* max # of *tries* to transmit DAD packet */ #define V_dad_ignore_ns VNET(dad_ignore_ns) #define V_dad_maxtry VNET(dad_maxtry) +static struct dly_dispatcher dly_d = { + .name = "nd6_ns", + .dly_dispatch = nd6_ns_output2, +}; + +static int nd6_dlyid; + +void +nd6_nbr_init() +{ + + if (IS_DEFAULT_VNET(curvnet)) + nd6_dlyid = dly_register(&dly_d); +} + +void +nd6_nbr_destroy() +{ + + if (IS_DEFAULT_VNET(curvnet)) + dly_unregister(nd6_dlyid); +} + + /* * Input a Neighbor Solicitation Message. 
* @@ -366,11 +392,34 @@ nd6_ns_input(struct mbuf *m, int off, int icmp6len m_freem(m); } +static int +nd6_ns_output2(struct mbuf *m, int dad, uintptr_t _data, struct ifnet *ifp) +{ + struct ip6_moptions im6o; + + if (m->m_flags & M_MCAST) { + im6o.im6o_multicast_ifp = ifp; + im6o.im6o_multicast_hlim = 255; + im6o.im6o_multicast_loop = 0; + } + + /* Zero ingress interface not to fool PFIL consumers */ + m->m_pkthdr.rcvif = NULL; + + ip6_output(m, NULL, NULL, dad ? IPV6_UNSPECSRC : 0, &im6o, NULL, NULL); + icmp6_ifstat_inc(ifp, ifs6_out_msg); + icmp6_ifstat_inc(ifp, ifs6_out_neighborsolicit); + ICMP6STAT_INC(icp6s_outhist[ND_NEIGHBOR_SOLICIT]); + + return (0); +} + /* * Output a Neighbor Solicitation Message. Caller specifies: * - ICMP6 header source IP6 address * - ND6 header target IP6 address * - ND6 header source datalink address + * Note llentry has to be locked if specified * * Based on RFC 2461 * Based on RFC 2462 (duplicate address detection) @@ -386,11 +435,9 @@ nd6_ns_output(struct ifnet *ifp, const struct in6_ struct m_tag *mtag; struct ip6_hdr *ip6; struct nd_neighbor_solicit *nd_ns; - struct ip6_moptions im6o; int icmp6len; int maxlen; caddr_t mac; - struct route_in6 ro; if (IN6_IS_ADDR_MULTICAST(taddr6)) return; @@ -413,13 +460,8 @@ nd6_ns_output(struct ifnet *ifp, const struct in6_ if (m == NULL) return; - bzero(&ro, sizeof(ro)); - if (daddr6 == NULL || IN6_IS_ADDR_MULTICAST(daddr6)) { m->m_flags |= M_MCAST; - im6o.im6o_multicast_ifp = ifp; - im6o.im6o_multicast_hlim = 255; - im6o.im6o_multicast_loop = 0; } icmp6len = sizeof(*nd_ns); @@ -468,7 +510,6 @@ nd6_ns_output(struct ifnet *ifp, const struct in6_ hsrc = NULL; if (ln != NULL) { - LLE_RLOCK(ln); if (ln->la_hold != NULL) { struct ip6_hdr *hip6; /* hold ip6 */ @@ -483,7 +524,6 @@ nd6_ns_output(struct ifnet *ifp, const struct in6_ hsrc = &hip6->ip6_src; } } - LLE_RUNLOCK(ln); } if (hsrc && (ifa = (struct ifaddr *)in6ifa_ifpwithaddr(ifp, hsrc)) != NULL) { @@ -502,7 +542,7 @@ nd6_ns_output(struct ifnet 
*ifp, const struct in6_ oifp = ifp; error = in6_selectsrc(&dst_sa, NULL, - NULL, &ro, NULL, &oifp, &src_in); + NULL, NULL, NULL, &oifp, &src_in); if (error) { char ip6buf[INET6_ADDRSTRLEN]; nd6log((LOG_DEBUG, @@ -572,20 +612,16 @@ nd6_ns_output(struct ifnet *ifp, const struct in6_ m_tag_prepend(m, mtag); } - ip6_output(m, NULL, &ro, dad ? IPV6_UNSPECSRC : 0, &im6o, NULL, NULL); - icmp6_ifstat_inc(ifp, ifs6_out_msg); - icmp6_ifstat_inc(ifp, ifs6_out_neighborsolicit); - ICMP6STAT_INC(icp6s_outhist[ND_NEIGHBOR_SOLICIT]); + if (ln == NULL) + nd6_ns_output2(m, dad, 0, ifp); + else { + m->m_pkthdr.rcvif = ifp; /* Save VNET */ + dly_queue(nd6_dlyid, m, dad, 0, ifp); + } - /* We don't cache this route. */ - RO_RTFREE(&ro); - return; bad: - if (ro.ro_rt) { - RTFREE(ro.ro_rt); - } m_freem(m); return; } Index: sys/sys/mbuf.h =================================================================== --- sys/sys/mbuf.h (revision 260983) +++ sys/sys/mbuf.h (working copy) @@ -1022,6 +1022,7 @@ struct mbuf *m_unshare(struct mbuf *, int); #define PACKET_TAG_CARP 28 /* CARP info */ #define PACKET_TAG_IPSEC_NAT_T_PORTS 29 /* two uint16_t */ #define PACKET_TAG_ND_OUTGOING 30 /* ND outgoing */ +#define PACKET_TAG_DISPATCH_INFO 31 /* Netist slow dispatch */ /* Specific cookies and tags. */ --- /dev/null 2014-01-24 00:33:00.000000000 +0400 +++ sys/net/delayed_dispatch.c 2014-01-24 00:17:05.573964680 +0400 @@ -0,0 +1,364 @@ +/*- + * Copyright (c) 2014 Alexander V. Chernikov + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. 
Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include +__FBSDID("$FreeBSD: head/sys/net/delayed_dispatch.c$"); + +/* + * Delayed dispatch is a so-called "slowpath" for packets, which permits + * enqueueing mbufs requiring complex dispatch (and/or possibly complex locking) + * into a separate netisr queue instead of trying to deal with them in the "fast" code path.
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include +#include + +struct dly_info { + struct dly_dispatcher *index; + int alloc; + int count; + struct rmlock lock; +}; +#define DLY_ALLOC_ITEMS 16 + +static struct dly_info dly; + +#define DLY_LOCK_INIT() rm_init(&dly.lock, "dly_lock") +#define DLY_RLOCK() rm_rlock(&dly.lock, &tracker) +#define DLY_RUNLOCK() rm_runlock(&dly.lock, &tracker) +#define DLY_WLOCK() rm_wlock(&dly.lock) +#define DLY_WUNLOCK() rm_wunlock(&dly.lock) +#define DLY_READER struct rm_priotracker tracker + +static eventhandler_tag ifdetach_tag; + +/* + * Adds mbuf to slowpath queue. Additional information + * is stored in PACKET_TAG_DISPATCH_INFO mbuf tag. + * Returns 0 if successful, an error code otherwise. + */ +int +dly_queue(int dtype, struct mbuf *m, int dsubtype, uintptr_t data, + struct ifnet *ifp) +{ + struct dly_item *item; + struct m_tag *dtag; + DLY_READER; + + /* Ensure we're not going to cycle packet */ + if ((dtag = m_tag_find(m, PACKET_TAG_DISPATCH_INFO, NULL)) != NULL) { + printf("tag found: %p\n", dtag); + return (EINVAL); + } + + DLY_RLOCK(); + if (dtype < 0 || dtype >= dly.alloc || dly.index[dtype].name == NULL) { + DLY_RUNLOCK(); + printf("invalid dtype: 0..%d..%d\n", dtype, dly.alloc); + return (EINVAL); + } + DLY_RUNLOCK(); + + VNET_ASSERT(m->m_pkthdr.rcvif != NULL, + ("%s:%d rcvif == NULL: m=%p", __func__, __LINE__, m)); + + /* + * Do not allocate tag for basic IPv4/IPv6 output + */ + if (dtype != 0) { + dtag = m_tag_get(PACKET_TAG_DISPATCH_INFO, + sizeof(struct dly_item), M_NOWAIT); + + if (dtag == NULL) + return (ENOBUFS); + + item = (struct dly_item *)(dtag + 1); + + item->type = dtype; + item->subtype = dsubtype; + item->data = data; + item->ifp = ifp; + + m_tag_prepend(m, dtag); + } + + netisr_queue(NETISR_SLOWPATH, m); + + return (0); +} + +/* + * Adds mbuf to slowpath queue.
User-provided buffer + * of size @size is stored inside the PACKET_TAG_DISPATCH_INFO + * mbuf tag. The buffer needs to embed a properly filled + * dly_item structure at its beginning. Such buffers + * need to be dispatched by a dly_pdispatch() handler. + * + * Returns 0 if successful, an error code otherwise. + */ +int +dly_pqueue(int dtype, struct mbuf *m, struct dly_item *item, size_t size) +{ + struct m_tag *dtag; + DLY_READER; + + /* Ensure we're not going to cycle packet */ + if ((dtag = m_tag_find(m, PACKET_TAG_DISPATCH_INFO, NULL)) != NULL) { + return (EINVAL); + } + + DLY_RLOCK(); + if (dtype < 0 || dtype >= dly.alloc || dly.index[dtype].name == NULL) { + DLY_RUNLOCK(); + return (EINVAL); + } + DLY_RUNLOCK(); + + VNET_ASSERT(m->m_pkthdr.rcvif != NULL, + ("%s:%d rcvif == NULL: m=%p", __func__, __LINE__, m)); + + dtag = m_tag_get(PACKET_TAG_DISPATCH_INFO, size, M_NOWAIT); + + if (dtag == NULL) + return (ENOBUFS); + + memcpy(dtag + 1, item, size); + m_tag_prepend(m, dtag); + netisr_queue(NETISR_SLOWPATH, m); + + return (0); +} + +/* + * Base netisr handler for slowpath + */ +static void +dly_dispatch_item(struct mbuf *m) +{ + struct m_tag *dtag; + struct dly_item *item; + int dtype; + struct dly_dispatcher *dld; + DLY_READER; + + item = NULL; + dtype = 0; + + if ((dtag = m_tag_find(m, PACKET_TAG_DISPATCH_INFO, NULL)) != NULL) { + item = (struct dly_item *)(dtag + 1); + dtype = item->type; + } + + DLY_RLOCK(); + if (dtype < 0 || dtype >= dly.alloc || dly.index[dtype].name == NULL) { + DLY_RUNLOCK(); + m_freem(m); + return; + } + + dld = &dly.index[dtype]; + + if (dld->dly_dispatch != NULL) + dld->dly_dispatch(m, item->subtype, item->data, item->ifp); + else + dld->dly_pdispatch(m, item); + + DLY_RUNLOCK(); + + return; +} + + +/* + * Check if a queued item was received on, or is going to be + * transmitted via, the interface being destroyed.
+ */ +static int +dly_scan_ifp(struct mbuf *m, void *_data) +{ + struct m_tag *dtag; + struct dly_item *item; + struct ifnet *difp; + + difp = (struct ifnet *)_data; + + if (m->m_pkthdr.rcvif == difp) + return (1); + + if ((dtag = m_tag_find(m, PACKET_TAG_DISPATCH_INFO, NULL)) != NULL) { + item = (struct dly_item *)(dtag + 1); + if (item->ifp == difp) + return (1); + } + + return (0); +} + +/* + * Registers new slowpath handler. + * Returns handler id to use in the dly_queue() or + * dly_pqueue() functions. + */ +int +dly_register(struct dly_dispatcher *dld) +{ + int i, alloc; + struct dly_dispatcher *dd, *tmp; + +again: + DLY_WLOCK(); + + if (dly.count < dly.alloc) { + i = dly.count++; + dly.index[i] = *dld; + DLY_WUNLOCK(); + return (i); + } + + alloc = dly.alloc + DLY_ALLOC_ITEMS; + + DLY_WUNLOCK(); + + /* No spare room, need to increase */ + dd = malloc(sizeof(struct dly_dispatcher) * alloc, M_TEMP, + M_ZERO|M_WAITOK); + + DLY_WLOCK(); + if (dly.alloc >= alloc) { + /* Lost the race, try again */ + DLY_WUNLOCK(); + free(dd, M_TEMP); + goto again; + } + + memcpy(dd, dly.index, sizeof(struct dly_dispatcher) * dly.alloc); + tmp = dly.index; + dly.index = dd; + dly.alloc = alloc; + i = dly.count++; + dly.index[i] = *dld; + DLY_WUNLOCK(); + + free(tmp, M_TEMP); + + return (i); +} + +/* + * Checks if given netisr queue item is of type which + * needs to be unregistered. + */ +static int +dly_scan_unregistered(struct mbuf *m, void *_data) +{ + struct m_tag *dtag; + struct dly_item *item; + int i; + + i = *((int *)(intptr_t)_data); + + if ((dtag = m_tag_find(m, PACKET_TAG_DISPATCH_INFO, NULL)) != NULL) { + item = (struct dly_item *)(dtag + 1); + if (item->type == i) + return (1); + } + + return (0); +} + +/* + * Unregisters slow handler registered previously by dly_register(). + * Caller needs to ensure that no new items of given type can be queued + * prior to calling this function.
+ */ +void +dly_unregister(int dtype) +{ + + netisr_scan(NETISR_SLOWPATH, dly_scan_unregistered, &dtype); + + DLY_WLOCK(); + if (dtype < 0 || dtype >= dly.alloc || dly.index[dtype].name == NULL) { + DLY_WUNLOCK(); + return; + } + + KASSERT(dly.index[dtype].name != NULL, + ("%s: unregistering non-existent protocol %d", __func__, dtype)); + + memset(&dly.index[dtype], 0, sizeof(struct dly_dispatcher)); + DLY_WUNLOCK(); +} + + +static void +dly_ifdetach(void *arg __unused, struct ifnet *ifp) +{ + + netisr_scan(NETISR_SLOWPATH, dly_scan_ifp, ifp); +} + +static struct netisr_handler dly_nh = { + .nh_name = "slow", + .nh_handler = dly_dispatch_item, + .nh_proto = NETISR_SLOWPATH, + .nh_policy = NETISR_POLICY_SOURCE, +}; + +static void +dly_init(__unused void *arg) +{ + + memset(&dly, 0, sizeof(dly)); + dly.index = malloc(sizeof(struct dly_dispatcher) * DLY_ALLOC_ITEMS, + M_TEMP, M_ZERO|M_WAITOK); + dly.alloc = DLY_ALLOC_ITEMS; + dly.count = 1; + + DLY_LOCK_INIT(); + + netisr_register(&dly_nh); + ifdetach_tag = EVENTHANDLER_REGISTER(ifnet_departure_event, + dly_ifdetach, NULL, EVENTHANDLER_PRI_ANY); +} + +/* Exactly after netisr */ +SYSINIT(dly_init, SI_SUB_SOFTINTR, SI_ORDER_SECOND, dly_init, NULL); + --- /dev/null 2014-01-24 00:33:00.000000000 +0400 +++ sys/net/delayed_dispatch.h 2014-01-23 23:54:50.166594749 +0400 @@ -0,0 +1,57 @@ +/*- + * Copyright (c) 2014 Alexander V. Chernikov + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution.
+ * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * $FreeBSD: head/sys/net/netisr.h 222249 2011-05-24 12:34:19Z rwatson $ + */ + +#ifndef _NET_DELAYED_DISPATCH_H_ +#define _NET_DELAYED_DISPATCH_H_ + +struct dly_item { + int type; + int subtype; + struct ifnet *ifp; + uintptr_t data; +}; + +typedef int dly_dispatch_t(struct mbuf *, int, uintptr_t, struct ifnet *); +typedef int dly_pdispatch_t(struct mbuf *, struct dly_item *); +typedef int dly_free_t(struct mbuf *, int, uintptr_t, struct ifnet *); + +struct dly_dispatcher { + const char *name; + dly_dispatch_t *dly_dispatch; + dly_pdispatch_t *dly_pdispatch; + dly_free_t *dly_free; +}; + + +int dly_register(struct dly_dispatcher *); +void dly_unregister(int); +int dly_queue(int, struct mbuf *, int, uintptr_t, struct ifnet *); +int dly_pqueue(int, struct mbuf *, struct dly_item *, size_t); + +#endif + --------------050804080408080705010802-- From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 07:56:29 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 08055324 for ; Fri, 24 Jan 2014 07:56:29 +0000 
(UTC) Received: from mail-ee0-x236.google.com (mail-ee0-x236.google.com [IPv6:2a00:1450:4013:c00::236]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 93BAD1FAC for ; Fri, 24 Jan 2014 07:56:28 +0000 (UTC) Received: by mail-ee0-f54.google.com with SMTP id e53so793717eek.27 for ; Thu, 23 Jan 2014 23:56:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=Chlkk9F5+Ux9Q0EWdp3WAeRUqN+MXmS2jGTqswERgl8=; b=cNa+7pkG+oOiFTRPBZHst1Z/1N9uVVbRkOxvDNrRKOWd4mhju0RRyaWkWW1M2ywI7g Z5l1qLbASInDBMQxWcLsMTMWk/giGuG21V/PtQtviddOaJkQgs4drh/10f+LXEiHI6n2 zVOWfL7bHmBx1Bw7g6snmQlp5crUfvzuM303bRmdpYDJejBcPGK+g1ZYcmMwEci1lr8t FQw/0P/fLSKXdQuaTqF5BqOWvspmpaGvPlqHlOqo7e3qhML6/Jfeae2BZXuBu6Pz8lhz jN6jXp9mNuYz/WJMnYogg1pggEppi/7vPHUlxJ4trsrixM70d4tO3nMuiNvHdWrHOjgq YKfg== MIME-Version: 1.0 X-Received: by 10.14.94.69 with SMTP id m45mr1010543eef.95.1390550186479; Thu, 23 Jan 2014 23:56:26 -0800 (PST) Received: by 10.14.2.66 with HTTP; Thu, 23 Jan 2014 23:56:26 -0800 (PST) In-Reply-To: References: Date: Thu, 23 Jan 2014 23:56:26 -0800 Message-ID: Subject: Re: Port mirroring on FreeBSD From: hiren panchasara To: Luigi Rizzo Content-Type: text/plain; charset=UTF-8 Cc: "freebsd-net@freebsd.org" X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Jan 2014 07:56:29 -0000 On Sat, Jan 18, 2014 at 8:29 AM, Luigi Rizzo wrote: > > > > On Fri, Jan 17, 2014 at 10:58 PM, hiren panchasara > wrote: >> >> I have this weird requirement that I am juggling right now and I >> wanted to reach out to larger audience: >> >> In this box I have 2 dualport ixgbe 10G cards. 
On ingress, I want to >> get data off of 2 ports of the first 10G card and lagg/lacp them into 1 >> stream of data. But for outgoing, I want to have 2 identical streams >> of data going out on 2 ports of the second 10G card. (not >> load-balancing but more of a mirroring). >> >> The reason for this is, I need to be able to provide same data to 2 >> different application hosts downstream for monitoring. Something like: >> >> http://www.juniper.net/techpubs/en_US/junos13.2/topics/concept/port-mirroring-ex-series.html >> >> I believe a regular switch might be perfect, but I could not find >> anything simple in FreeBSD to do that. >> >> Luigi: Can netmap/vale be helpful here? > > > for this and other custom applications what I would > do is build a userspace application that puts the nics in > netmap mode and does the necessary juggling. What I am thinking right now is: open all 4 (2 ingress and 2 egress) ports in netmap and then copy each packet from both ingress ports to both of the egress ports via netmap. I see some packet move/copy code between 2 ports in the tools bridge example. I am thinking of tweaking that right now. Should that work? Also, initially I thought of trunking 2 ingress ports via lagg(4) but then I don't think I can open that lagged interface into netmap so I dropped that idea. cheers, Hiren > > Note that since the host is going to be the performance bottleneck, > you can probably do the same with just bpf without too much > impact on performance (and some advantage since you do not > need to handle the input traffic; at least, if i understand > your description the monitor does not need to see a > replica of the incoming traffic). > > Some time ago the answer to this type of question used to be > "use netgraph". Maybe it is also a valid option but i do not > know if there are modules that suit your need.
> > cheers > luigi From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 11:18:21 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 401A66E9 for ; Fri, 24 Jan 2014 11:18:21 +0000 (UTC) Received: from mail-ig0-x232.google.com (mail-ig0-x232.google.com [IPv6:2607:f8b0:4001:c05::232]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 09DE31224 for ; Fri, 24 Jan 2014 11:18:20 +0000 (UTC) Received: by mail-ig0-f178.google.com with SMTP id uq10so2264056igb.5 for ; Fri, 24 Jan 2014 03:18:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type:content-transfer-encoding; bh=WC1r/JkyEbBWIE4tGMLBUQgdVmlTAz1QLBVVfeLtq1U=; b=IOLRCTzb6Y/j6Him0O1vnWHCeTy2PSWwIl7NO8Dep5jArjRQqn9ptU2s7CMUs9iHYa IpDc/0y0AZPzsU9zLDglqcJmBrV4g04YbHt47jnK7EOQQ53rXIlF/rKaV5RzigJ5Y6QR kugEvI3v7hLa36AI7sXDpnz7L6tdh8/kzE0Q2cLnnzlJ0gWrC+AFoM4Nd2byPwsAzuO8 uxeK2BbJGJhvt6fqgztebDkH+WZWL1azgM+28LrnoZT2QVHZWM6+FKNqex4wu4VyrkIJ XsMtO0Lz7sel6FT1BHFjrUPjS5B1BYTr71oirzaltD9yz50y3lR+hJC+SpQNb4O7TzFg SuVg== MIME-Version: 1.0 X-Received: by 10.43.156.18 with SMTP id lk18mr41396icc.77.1390562300430; Fri, 24 Jan 2014 03:18:20 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.170.8 with HTTP; Fri, 24 Jan 2014 03:18:20 -0800 (PST) In-Reply-To: References: <58591523.15519962.1390533482068.JavaMail.root@uoguelph.ca> Date: Fri, 24 Jan 2014 06:18:20 -0500 X-Google-Sender-Auth: wnERsfEYiHrmAaZpXGyBSNTAftc Message-ID: Subject: Re: Terrible NFS performance under 9.2-RELEASE? 
From: J David To: Rick Macklem Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable Cc: freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Jan 2014 11:18:21 -0000 INTRODUCTION While researching NFS performance problems in our environment, I ran a series of four iozone tests twice, once with rsize=4096,wsize=4096 and again with rsize=8192,wsize=8192. That produced additional avenues of inquiry leading to the identification of two separate potential problems with NFS performance. This message discusses the first problem, apparently extraneous fixed-sized reads before writes. In order to maximize network throughput (>16Gbit/sec) and minimize network latency (<0.1ms), these tests were performed using two FreeBSD guests on the same KVM host node with bridged virtio adapters having TSO and LRO enabled. To remove local or network-backed virtual disks as a bottleneck, a 2GiB memory-backed-UFS-over-NFS filesystem served as the NFS export from the server to the client. BENCHMARK DATA Here is the iozone against the 4k NFS mount: $ iozone -e -I -s 1g -r 4k -i 0 -i 2 Iozone: Performance Test of File I/O Version $Revision: 3.420 $ [...] Include fsync in write timing File size set to 1048576 KB Record Size 4 KB Command line used: iozone -e -s 1g -r 4k -i 0 -i 2 Output is in Kbytes/sec Time Resolution = 0.000005 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. random random KB reclen write rewrite read reread read write 1048576 4 71167 73714 27525 46349 iozone test complete.
Now, here are the read and write columns from nfsstat -s -w 1 over the entire run: Read Write 0 0 0 0 0 0 0 11791 <-- sequential write test begins 0 17215 0 18260 0 17968 0 17810 0 17839 0 17959 0 17912 0 18180 0 18285 0 18636 0 18554 0 18178 0 17361 0 17803 <-- sequential rewrite test begins 0 18358 0 18188 0 18817 0 17757 0 18153 0 18924 0 19444 0 18775 0 18995 0 18198 0 18949 0 17978 0 18879 0 19055 7904 67 <-- random read test begins 7502 0 7194 0 7432 0 6995 0 6844 0 6730 0 6761 0 7011 0 7058 0 7477 0 7139 0 6793 0 7047 0 6402 0 6621 0 7111 0 6911 0 7413 0 7431 0 7047 0 7002 0 7104 0 6987 0 6849 0 6580 0 6268 0 6868 0 6775 0 6335 0 6588 0 6595 0 6587 0 6512 0 6861 0 6953 0 7273 0 5184 1688 <-- random write test begins 0 11795 0 11915 0 11916 0 11838 0 12035 0 11408 0 11780 0 11488 0 11836 0 11787 0 11824 0 12099 0 11863 0 12154 0 11127 0 11434 0 11815 0 11960 0 11510 0 11623 0 11714 0 11896 0 1637 <-- test finished 0 0 0 0 0 0 This looks exactly like you would expect. Now, we re-run the exact same test with rsize/wsize set to 8k: $ iozone -e -I -s 1g -r 4k -i 0 -i 2 Iozone: Performance Test of File I/O Version $Revision: 3.420 $ Compiled for 64 bit mode. Build: freebsd Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins Al Slater, Scott Rhine, Mike Wisner, Ken Goss Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner, Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone, Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root, Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer, Vangel Bojaxhi, Ben England, Vikentsi Lapa. Run began: Fri Jan 24 07:45:06 2014 Include fsync in write timing File size set to 1048576 KB Record Size 4 KB Command line used: iozone -e -s 1g -r 4k -i 0 -i 2 Output is in Kbytes/sec Time Resolution = 0.000005 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size.
random random KB reclen write rewrite read reread read write 1048576 4 124820 34850 35996 16258 iozone test complete. And here are the second-by-second counts of corresponding read/write operations from nfsstat: Read Write 0 0 0 0 0 0 11 10200 <-- sequential write test begins 13 16109 10 15360 12 16055 4 15623 8 15533 9 16241 11 16045 1891 11874 <-- sequential rewrite test begins 4508 4508 4476 4476 4353 4353 4350 4351 4230 4229 4422 4422 4731 4731 4600 4600 4550 4550 4314 4314 4273 4273 4422 4422 4284 4284 4609 4609 4600 4600 4362 4361 4386 4387 4451 4451 4467 4467 4370 4370 4394 4394 4252 4252 4155 4155 4427 4428 4436 4435 4300 4300 3989 3989 4438 4438 4632 4632 6440 1836 <-- random read test begins 6681 0 6416 0 6290 0 6560 0 6569 0 6237 0 6331 0 6403 0 6310 0 6010 0 6023 0 6271 0 6068 0 6044 0 6003 0 5957 0 5683 0 5833 0 5774 0 5708 0 5810 0 5829 0 5787 0 5791 0 5838 0 5922 0 6002 0 6191 0 4162 2624 <-- random write test begins 3998 4068 4056 4100 3929 3969 4164 4128 4188 4173 4256 4164 3899 4068 4018 3933 4077 4114 4210 4222 4050 4116 4157 3983 4066 4122 4129 4162 4303 4199 4235 4412 4100 3932 4056 4254 4192 4186 4204 4120 4085 4138 4103 4146 4272 4193 3943 4028 4161 4163 4153 4131 4218 4017 4132 4233 4195 4101 4041 4145 3994 4086 4297 4193 4205 4341 4101 4103 4134 3962 4297 4408 4282 4242 4180 4175 4286 4216 4217 4397 4364 4253 3673 3720 3843 3807 4147 4183 4171 4181 4280 4224 4126 4158 3977 4074 4146 3919 4147 4362 4079 4060 3755 3760 4157 4130 4087 4109 4006 3873 3860 3967 3982 4048 4146 3963 4188 4203 4040 4063 3976 4046 3859 3815 4114 4193 393 447 <-- test finished 0 0 0 0 0 0 To go through the tests individually... 
SEQUENTIAL WRITE TEST The first test is sequential write, the ktrace for this test looks like this: 3648 iozone CALL write(0x3,0x801500000,0x1000) 3648 iozone GIO fd 3 wrote 4096 bytes 3648 iozone RET write 4096/0x1000 3648 iozone CALL write(0x3,0x801500000,0x1000) 3648 iozone GIO fd 3 wrote 4096 bytes 3648 iozone RET write 4096/0x1000 It's just writing 4k in a loop 262,144 times. This test is substantially faster with wsize=8192 than wsize=4096 (124,820 KiB/sec at 8k vs 71,167 KiB/sec at 4k). We see from the nfsstat output that this is because writes are being coalesced: there are half as many writes in the 8k test as in the 4k test (8k = 131,072, 4k = 262,144). So far so good. SEQUENTIAL REWRITE TEST But when we move on to the rewrite test, there's trouble in River City. Here's what it looks like in ktrace: 3648 iozone CALL write(0x3,0x801500000,0x1000) 3648 iozone GIO fd 3 wrote 4096 bytes 3648 iozone RET write 4096/0x1000 3648 iozone CALL write(0x3,0x801500000,0x1000) 3648 iozone GIO fd 3 wrote 4096 bytes 3648 iozone RET write 4096/0x1000 As with the write test, it just writes 4k in a loop 262,144 times. Yet despite being exactly the same sequence of calls, this test is less than half as fast with 8k NFS as 4k NFS (34,850 KiB/sec at 8k vs 73,714 KiB/sec at 4k). The only difference between this test and the previous one is that now the file exists. And although the 4k NFS client writes it the same way as the first test and therefore gets almost exactly the same speed, the 8k NFS client is reading and writing in roughly equal proportion, and drops by a factor of 3.5x from its own performance on the first write test. Note that the total number of writes/sec on the 4k NFS rewrite test hovers around 18,500. On the 8k NFS test, the reads and writes are roughly balanced at around 4242. So the interleaved read/writes performed by the NFS client cut the total IOPs available by more than 50%, twice, leaving less than a quarter available for the application.
And these numbers roughly correlate to the observed speeds: 4252 * 8 KiB = 33,936 KiB, 18,500 * 4 KiB = 74,000 KiB. A total of 131,072 NFS writes are issued during the 8k NFS test for a maximum possible write of 1 GiB. Iozone is writing a total of 1 GiB (262,144 x 4 KiB). So none of the read data is being used. So those extraneous reads appear to be the source of the performance drop. RANDOM READ TEST Considering the random read test, things look better again. Here's what it looks like in ktrace: 3947 iozone CALL lseek(0x3,0x34caa000,SEEK_SET) 3947 iozone RET lseek 885694464/0x34caa000 3947 iozone CALL read(0x3,0x801500000,0x1000) 3947 iozone GIO fd 3 read 4096 bytes 3947 iozone RET read 4096/0x1000 3947 iozone CALL lseek(0x3,0x3aa53000,SEEK_SET) 3947 iozone RET lseek 983904256/0x3aa53000 3947 iozone CALL read(0x3,0x801500000,0x1000) 3947 iozone GIO fd 3 read 4096 bytes 3947 iozone RET read 4096/0x1000 This test lseeks to a random location (on a 4k boundary) then reads 4k, in a loop, 262,144 times. On this test, the 8k NFS moderately outperforms the 4k NFS (35,996 KiB/sec at 8k vs 27,525 KiB/sec at 4k). This is probably attributable to lucky adjacent requests, which are visible in the total number of reads. 4k NFS produces the expected 262,144 NFS read ops, but 8k NFS performs about 176,000, an average of about 6k per read, suggesting it was doing 8k reads of which the whole 8k was useful about 50% of the time. RANDOM WRITE TEST Finally, the random write test.
Here's the ktrace output from this one:

  3262 iozone  CALL  lseek(0x3,0x29665000,SEEK_SET)
  3262 iozone  RET   lseek 694571008/0x29665000
  3262 iozone  CALL  write(0x3,0x801500000,0x1000)
  3262 iozone  GIO   fd 3 wrote 4096 bytes
  3262 iozone  RET   write 4096/0x1000
  3262 iozone  CALL  lseek(0x3,0x486c000,SEEK_SET)
  3262 iozone  RET   lseek 75939840/0x486c000
  3262 iozone  CALL  write(0x3,0x801500000,0x1000)
  3262 iozone  GIO   fd 3 wrote 4096 bytes
  3262 iozone  RET   write 4096/0x1000
  3262 iozone  CALL  lseek(0x3,0x33310000,SEEK_SET)
  3262 iozone  RET   lseek 858849280/0x33310000
  3262 iozone  CALL  write(0x3,0x801500000,0x1000)
  3262 iozone  GIO   fd 3 wrote 4096 bytes
  3262 iozone  RET   write 4096/0x1000
  3262 iozone  CALL  lseek(0x3,0x2ae3000,SEEK_SET)

That's all it is: lseek() then write(), repeated 262,144 times. No reads, and again, all the lseeks are on 4k boundaries. This test is dramatically slower with 8k NFS, falling from 46,349 KiB/sec for 4k NFS to 16,258 KiB/sec for 8k NFS.

Like the rewrite test, the nfsstat output shows an even mix of reads and writes during the 8k NFS run of this test, even though the client never issues any reads. Unlike the 8k NFS rewrite test, which performs 131,072 NFS write operations, the 8k NFS random write test performs 262,025. This strongly suggests it was converting each 4 KiB write into an 8 KiB read-modify-write cycle, with the commensurate performance penalty in almost every case, with a handful of writes (about 119) lucky enough to be adjacent and coalesced. Since the NFS server is supposed to be able to handle arbitrary-length writes at arbitrary file offsets, this appears to be unnecessary. To confirm that this is the case, and to determine whether it is doing 4 KiB or 8 KiB writes, we will have to go to the packets.
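The suspected behaviour can be captured in a short sketch (purely illustrative; the function name and example offsets are mine, and the final assertions just re-run the arithmetic quoted above):

```python
def nfs_ops_for_write(offset, length, wsize=8192, read_before_write=True):
    """Model of the observed 8k-mount behaviour: a partial-block write
    is preceded by a read of the enclosing wsize-aligned block, even
    though the WRITE that follows still carries only `length` bytes."""
    ops = []
    if read_before_write:
        block = offset - (offset % wsize)       # round down to block start
        ops.append(("read", block, wsize))
    ops.append(("write", offset, length))
    return ops

# One 4 KiB write landing in the second half of an 8 KiB block:
print(nfs_ops_for_write(12288, 4096))
# [('read', 8192, 8192), ('write', 12288, 4096)]

# Cross-checks against the measured figures:
assert 4242 * 8 == 33936        # rewrite, 8k: ops/sec * KiB, vs ~34,850 KiB/s seen
assert 18500 * 4 == 74000       # rewrite, 4k: vs ~73,714 KiB/s seen
assert 262144 - 119 == 262025   # random write: op count after 119 lucky coalesces
```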
Here is a sample from the random write test:

09:31:52.338749 IP 172.20.20.169.1877003014 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 606494720
09:31:52.338877 IP 172.20.20.162.2049 > 172.20.20.169.1877003014: reply ok 8320 read
09:31:52.338944 IP 172.20.20.169.1877003015 > 172.20.20.162.2049: 4232 write fh 1326,127488/4 4096 (4096) bytes @ 606498816
09:31:52.338964 IP 172.20.20.169.1877003016 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 85237760
09:31:52.339037 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.], ack 4008446365, win 29118, options [nop,nop,TS val 4241448191 ecr 17929512], length 0
09:31:52.339076 IP 172.20.20.162.2049 > 172.20.20.169.1877003015: reply ok 160 write [|nfs]
09:31:52.339117 IP 172.20.20.162.2049 > 172.20.20.169.1877003016: reply ok 8320 read
09:31:52.339142 IP 172.20.20.169.719 > 172.20.20.162.2049: Flags [.], ack 8488, win 515, options [nop,nop,TS val 17929512 ecr 4241448191], length 0
09:31:52.339183 IP 172.20.20.169.1877003017 > 172.20.20.162.2049: 4232 write fh 1326,127488/4 4096 (4096) bytes @ 85241856
09:31:52.339201 IP 172.20.20.169.1877003018 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 100843520
09:31:52.339271 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.], ack 4369, win 29118, options [nop,nop,TS val 4241448191 ecr 17929512], length 0
09:31:52.339310 IP 172.20.20.162.2049 > 172.20.20.169.1877003017: reply ok 160 write [|nfs]
09:31:52.339332 IP 172.20.20.162.2049 > 172.20.20.169.1877003018: reply ok 8320 read
09:31:52.339355 IP 172.20.20.169.719 > 172.20.20.162.2049: Flags [.], ack 16976, win 515, options [nop,nop,TS val 17929512 ecr 4241448191], length 0
09:31:52.339408 IP 172.20.20.169.1877003019 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 330153984
09:31:52.339514 IP 172.20.20.162.2049 > 172.20.20.169.1877003019: reply ok 8320 read
09:31:52.339562 IP 172.20.20.169.1877003020 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 500056064
09:31:52.339669 IP 172.20.20.162.2049 > 172.20.20.169.1877003020: reply ok 8320 read
09:31:52.339728 IP 172.20.20.169.1877003022 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 778158080
09:31:52.339758 IP 172.20.20.169.1877003021 > 172.20.20.162.2049: 4232 write fh 1326,127488/4 4096 (4096) bytes @ 500060160
09:31:52.339834 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.], ack 9001, win 29118, options [nop,nop,TS val 4241448191 ecr 17929512], length 0

This confirms that the NFS client is indeed performing a read-before-write cycle. However, the packets also show that the writes are 4 KiB, not 8 KiB, so no modify is actually occurring. For example, the client reads 8 KiB at offset 606494720 and then immediately writes the 4 KiB supplied by the application at offset 606498816 (which is 606494720 + 4096). It's not clear what purpose the read serves in this scenario.

FOLLOW-UP TESTS

To help determine which direction to go next, I conducted two follow-up tests. First, I re-ran the test with the default 64k rsize/wsize values, in an effort to confirm that it would do 64k reads followed by 4k writes. It was necessary to do this with a much smaller file size due to the poor performance, and with net.inet.tcp.delacktime=10 on both client and server to work around a separate issue with TCP writes >64 KiB, but the results are sufficient for this purpose:

$ iozone -e -s 32m -r 4k -i 0 -i 2
	Iozone: Performance Test of File I/O
	        Version $Revision: 3.420 $
	        Compiled for 64 bit mode.
	        Build: freebsd

	Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
	              Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
	              Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
	              Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
	              Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
	              Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
	              Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
	              Vangel Bojaxhi, Ben England, Vikentsi Lapa.
	Run began: Fri Jan 24 10:35:14 2014

	Include fsync in write timing
	File size set to 32768 KB
	Record Size 4 KB
	Command line used: iozone -e -s 32m -r 4k -i 0 -i 2
	Output is in Kbytes/sec
	Time Resolution = 0.000005 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.

	      KB  reclen   write  rewrite  random read  random write
	   32768       4   12678     5292         5355           741

iozone test complete.

Although the use of TCP writes >64 KiB largely mangles tcpdump's ability to interpret the NFS messages due to fragmentation, it was possible to grep out enough to see that the suspected 64 KiB read / 4 KiB write pairs are indeed occurring:

10:36:11.356519 IP 172.20.20.169.1874848588 > 172.20.20.162.2049: 128 read fh 1326,127488/5 65536 bytes @ 18087936
10:36:11.496905 IP 172.20.20.169.1874848606 > 172.20.20.162.2049: 4232 write fh 1326,127488/5 4096 (4096) bytes @ 18096128

As before, the 64 KiB read appears not to be used. The result is that roughly 94% of the network I/O spent on a 4 KiB write is wasted, with commensurate impact on performance.

In an attempt to verify the theory that the data being read is not necessary, the 8k NFS test was retried with a Debian Linux guest. The hypothesis was that if the Debian Linux client did not perform the reads, that would indicate that they are not necessary. The results appeared to support that hypothesis. Not only did the Linux client handily outperform the FreeBSD client on all four tests:

# iozone -e -s 1g -r 4k -i 0 -i 2
	Iozone: Performance Test of File I/O
	        Version $Revision: 3.397 $
	        Compiled for 64 bit mode.
	        Build: linux-AMD64
[...]
	Include fsync in write timing
	File size set to 1048576 KB
	Record Size 4 KB
	Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
	Output is in Kbytes/sec
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.

	      KB  reclen   write  rewrite  random read  random write
	 1048576       4  142393   158101        79471        133974

iozone test complete.

But it also did not perform any reads during the rewrite and random write tests, as shown by the server's nfsstat output (Read / Write columns):

   Read   Write
      0       0
      0       0
      0       0
      0    1905   <-- sequential write test begins
      0   19571
      0   20066
      0   20214
      0   20036
      0   20051
      0   17221
      0   13311   <-- rewrite test begins
      0   20721
      0   22237
      0   21710
      0   21619
      0   21104
      0   18339
   6363    4040   <-- random read test begins
   8906       0
   9885       0
  10921       0
  12487       0
  13311       0
  14141       0
  16598       0
  16178       0
  17138       0
  17796       0
  18765       0
  16760       0
   2816    6787   <-- random write test begins
      0   20892
      0   19528
      0   20879
      0   20758
      0   17040
      0   21327
      0   19713   <-- tests finished
      0       0
      0       0
      0       0

CONCLUSION

The rewrite and random write iozone tests appear to demonstrate that in some cases the FreeBSD NFS client treats the rsize/wsize settings as a fixed block size rather than a not-to-exceed limit. The data suggests that it requests a full rsize-sized block read from the server before performing a 4 KiB write issued by an application, and that the result of the read is never used. Linux does not perform this read-before-write, and consequently achieves much higher throughput. The next step would probably be to examine the NFS client implementation to determine the source of the apparently unnecessary reads. If they could be eliminated, it might lead to performance improvements of up to 20x on some workloads.

Thanks!
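As a closing sanity check, the figures quoted above can be re-derived in a few lines (my arithmetic; the FreeBSD numbers are the 8k-mount results from the earlier tests):

```python
# Default-mount case: every 4 KiB application write is preceded by an
# unused 64 KiB read, so the wasted share of NFS data traffic is:
rsize, app_write = 64 * 1024, 4 * 1024
print(f"{rsize / (rsize + app_write):.1%}")   # 94.1%

# Linux vs. FreeBSD (wsize=8192) iozone results, in KiB/sec:
linux      = {"write": 142393, "rewrite": 158101,
              "random read": 79471, "random write": 133974}
freebsd_8k = {"write": 124820, "rewrite": 34850,
              "random read": 35996, "random write": 16258}
for test in linux:
    print(test, round(linux[test] / freebsd_8k[test], 1))
# ratios: ~1.1 (write), ~4.5 (rewrite), ~2.2 (random read), ~8.2 (random write)
```

The largest gaps line up with the two tests where the spurious reads appear, which is consistent with the read-before-write being the bottleneck.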
From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 14:56:18 2014
Date: Fri, 24 Jan 2014 15:56:17 +0100
In-Reply-To: <52E1E272.8060009@huawei.com>
Subject: Re: netmap: I got some troubles with netmap
From: Vincenzo Maffione
To: Wang Weidong
Content-Type: multipart/mixed; boundary=047d7b604cbc65dca304f0b88f04
Cc: facoltà, Giuseppe Lettieri, Luigi Rizzo, net@freebsd.org
List-Id: Networking and TCP/IP with FreeBSD

--047d7b604cbc65dca304f0b88f04
Content-Type: text/plain; charset=ISO-8859-1

2014/1/24 Wang Weidong

> On 2014/1/20 20:39, Giuseppe Lettieri wrote:
> >
> > Hi Wang,
> >
> > OK, you are using the netmap support in the upstream qemu git. That does
> > not yet include all our modifications, some of which are very important
> > for high throughput with VALE. In particular, the upstream qemu does not
> > include the batching improvements in the frontend/backend interface, and
> > it does not include the "map ring" optimization of the e1000 frontend.
> > Please find attached a gzipped patch that contains all of our qemu code.
> > The patch is against the latest upstream master (commit 1cf892ca).
> >
> > Please ./configure the patched qemu with the following options, in
> > addition to any other options you may need:
> >
> >     --enable-e1000-paravirt --enable-netmap \
> >     --extra-cflags=-I/path/to/netmap/sys/directory
> >
> > Note that --enable-e1000-paravirt is needed to enable the "map ring"
> > optimization in the e1000 frontend, even if you are not going to use
> > the e1000-paravirt device.
> >
> > Now you should be able to rerun your tests. I am also attaching a README
> > file that describes some more tests you may want to run.
>
> Hello,
> Yes, I applied the qemu-netmap-bc767e701.patch to qemu and downloaded
> 20131019-tinycore-netmap.hdd.
> And I did some tests:
>
> 1.
I use the bridge below:
>
>     qemu-system-x86_64 -m 2048 -boot c -net nic -net bridge,br=br1 -hda
>     /home/wwd/tinycores/20131019-tinycore-netmap.hdd -enable-kvm -vnc :0
>
> to test between two VMs; br1 has no physical device attached.
> Using pktgen, I got 237.95 kpps.
> Using netserver/netperf I got 1037 Mbit/s with TCP_STREAM (the maximum
> reached was 1621 Mbit/s), 3296 transactions/s with TCP_RR, and
> 234M/86M bits/sec with UDP_STREAM.
>
> When I add a device from the host to br1, the speed is 159.86 kpps.
> Using netserver/netperf I got 720 Mbit/s with TCP_STREAM (the maximum
> reached was 1000 Mbit/s), 3556 transactions/s with TCP_RR, and
> 181M/181M bits/sec with UDP_STREAM.
>
> What do you think of these data?

You are using the old/deprecated QEMU command line syntax (-net), so honestly it's not clear to me what kind of network configuration you are running. Please use our scripts "launch-qemu.sh" and "prep-taps.sh", as described in the README.images file (attached). Alternatively, use a syntax like in the following examples:

(#1) qemu-system-x86_64 archdisk.qcow -enable-kvm -device
     virtio-net-pci,netdev=mynet -netdev
     tap,ifname=tap01,id=mynet,script=no,downscript=no -smp 2

(#2) qemu-system-x86_64 archdisk.qcow -enable-kvm -device
     e1000,mitigation=off,mac=00:AA:BB:CC:DD:01,netdev=mynet -netdev
     netmap,ifname=vale0:01,id=mynet -smp 2

so that it's clear to us what network frontend (e.g. emulated NIC) and network backend (e.g. netmap, tap, vde, etc.) you are using. In example #1 we are using virtio-net as the frontend and tap as the backend, while in example #2 we are using e1000 as the frontend and netmap as the backend.

Also consider giving more than one core (e.g. -smp 2) to each guest, to mitigate receiver-livelock problems.

> 2.
I use the vale below:
>
>     qemu-system-x86_64 -m 2048 -boot c -net nic -net netmap,vale0:0 -hda
>     /home/wwd/tinycores/20131019-tinycore-netmap.hdd -enable-kvm -vnc :0

Same here, it's not clear what you are using. I guess each guest has an e1000 device and is connected to a different port of the same vale switch (e.g. vale0:0 and vale0:1)?

> Test with 2 VMs on the same host.
> vale0 without a physical device.
> Using pkt-gen, the speed is 938 Kpps.

You should get ~4 Mpps with the e1000 frontend + netmap backend on a reasonably good machine. Make sure you have ./configure'd QEMU with --enable-e1000-paravirt.

> I use netperf -H 10.0.0.2 -t UDP_STREAM and get 195M/195M; when I add
> "-- -m 8" I only get 1.07M/1.07M.
> With a smaller message size, will the speed always be smaller?

If you use e1000 with netperf (without pkt-gen) your performance is doomed to be horrible. Use e1000-paravirt as the frontend instead if you are interested in netperf experiments. Also consider that the point of the "-- -m 8" option is to experiment with high packet rates, so what you should measure here is not the throughput in Mbps but the packet rate: netperf reports the number of packets sent and received, so you can obtain the packet rate by dividing by the running time. The throughput in Mbps is uninteresting; if you want high bulk throughput, just don't use "-- -m 8" and leave the defaults. Using virtio-net in this case will help because of the TSO offloading.

cheers
Vincenzo

> With vale-ctl -a vale0:eth2:
> pkt-gen gives 928 Kpps; netperf -H 10.0.0.2 -t UDP_STREAM gives 209M/208M,
> and with "-- -m 8" only 1.06M/1.06M.
>
> With vale-ctl -h vale0:eth2:
> pkt-gen gives 928 Kpps; netperf -H 10.0.0.2 -t UDP_STREAM gives 192M/192M,
> and with "-- -m 8" only 1.06M/1.06M.
> Test with 2 VMs from two hosts:
> I can only test it with vale-ctl -h vale0:eth2 and eth2 set to
> promiscuous mode.
> Using pkt-gen with the default parameters, the speed is about 750 Kpps;
> netperf -H 10.0.0.2 -t UDP_STREAM gives 160M/160M.
> Is this right?
>
> 3. I can't use the l2 utils.
> When I run "sudo l2open -t eth0 l2recv" (or l2send), I get "l2open
> ioctl(TUNSETIFF...): Invalid argument",
> and with "l2open -r eth0 l2recv", after waiting a moment (only a few
> seconds), I get the result:
>     TEST-RESULT: 0.901 kpps 1pkts
>     select/read=100.00 err=0
>
> And I can't find the l2 utils on the net. Are they implemented by your
> team?
>
> All of this was tested on VMs.
>
> Cheers.
> Wang

--
Vincenzo Maffione
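The packet-rate calculation Vincenzo suggests for the "-- -m 8" runs can be sketched as follows (a hypothetical helper, not part of netperf; it assumes the throughput figure is payload-only Mbit/s):

```python
def payload_rate_pps(throughput_mbps, msg_bytes):
    """Convert a netperf UDP_STREAM throughput figure into a message
    rate, counting payload bits only (msg_bytes per message)."""
    return throughput_mbps * 1e6 / (msg_bytes * 8)

# Under the stated assumption, Wang's "-- -m 8" result of ~1.07 Mbit/s
# would correspond to roughly:
print(round(payload_rate_pps(1.07, 8)))   # 16719 messages/sec
```

This is why the Mbps figure collapses at tiny message sizes: the interesting quantity there is the per-packet cost, not the bulk throughput.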
cm1hbmNlIGJldHdlZW4gdGhlIFZNIGFuZCB0aGUgaG9zdCBtYWNoaW5lLiAgSW4gdGhpcwpjYXNl IHRoZSBleHBlcmltZW50IHNldHVwIGRlcGVuZHMgb24gdGhlIGJhY2tlbmQgeW91IGFyZSB1c2lu Zy4KCldpdGggdGhlIHRhcCBiYWNrZW5kLCB5b3UgY2FuIHVzZSB0aGUgYnJpZGdlIGludGVyZmFj ZSAiYnIwIiBhcyBhCmNvbW11bmljYXRpb24gZW5kcG9pbnQuIFlvdSBjYW4gcnVuIG5vcm1hbC9y YXcgc29ja2V0cyBleHBlcmltZW50cywKYnV0IHlvdSBjYW5ub3QgdXNlIHBrdC1nZW4gb24gdGhl ICJicjAiIGludGVyZmFjZSwgc2luY2UgdGhlIExpbnV4CmJyaWRnZSBpbnRlcmZhY2UgaXMgbm90 IHN1cHBvcnRlZCBieSBuZXRtYXAuCgpFeGFtcGxlcyB3aXRoIHRoZSB0YXAgYmFja2VuZDoKCiAg ICBbMV0gIyMjIFRlc3QgVENQIHRocm91Z2hwdXQgb3ZlciB0cmFkaXRpb25hbCBzb2NrZXRzICMj IwogICAgICAgICMgT24gdGhlIGhvc3QgcnVuCiAgICAgICAgICAgIG5ldHNlcnZlcgogICAgICAg ICMgb24gdGhlIGd1ZXN0IDEwLjAuMC4xIHJ1bgogICAgICAgICAgICBuZXRwZXJmIC1IMTAuMC4w LjIwMCAtdFRDUF9TVFJFQU0KCiAgICBbMl0gIyMjIFRlc3QgVURQIHNob3J0IHBhY2tldHMgd2l0 aCBwa3QtZ2VuIGFuZCBsMiAjIyMKICAgICAgICAjIE9uIHRoZSBob3N0IHJ1bgogICAgICAgICAg ICBsMm9wZW4gLXIgYnIwIGwycmVjdgogICAgICAgICMgT24gdGhlIGd1ZXN0IDEwLjAuMC4xIHJ1 biAoeHg6eXk6eno6d3c6dXU6dnYgaXMgdGhlCiAgICAgICAgIyAiYnIwIiBoYXJkd2FyZSBhZGRy ZXNzKQogICAgICAgICAgICBwa3QtZ2VuIC1pZXRoMCAtZnR4IC1kMTAuMC4wLjIwMDo3Nzc3IC1E eHg6eXk6eno6d3c6dXU6dnYKCgpXaXRoIHRoZSBWQUxFIGJhY2tlbmQgeW91IGNhbiBwZXJmb3Jt IG9ubHkgVURQIHRlc3RzLCBzaW5jZSB3ZSBkb24ndCBoYXZlCmEgbmV0bWFwIGFwcGxpY2F0aW9u IHdoaWNoIGltcGxlbWVudHMgYSBUQ1AgZW5kcG9pbnQ6IHBrdC1nZW4gZ2VuZXJhdGVzClVEUCBw YWNrZXRzLgpBcyBhIGNvbW11bmljYXRpb24gZW5kcG9pbnQgb24gdGhlIGhvc3QsIHlvdSBjYW4g dXNlIGEgdmlydHVhbCBWQUxFIHBvcnQKb3BlbmVkIG9uIHRoZSBmbHkgYnkgYSBwa3QtZ2VuIGlu c3RhbmNlLgoKRXhhbXBsZXMgd2l0aCB0aGUgVkFMRSBiYWNrZW5kOgoKICAgIFsxXSAjIyMgVGVz dCBVRFAgc2hvcnQgcGFja2V0cyAjIyMKICAgICAgICAjIE9uIHRoZSBob3N0IHJ1bgogICAgICAg ICAgICBwa3QtZ2VuIC1pdmFsZTA6OTkgLWZyeAogICAgICAgICMgT24gdGhlIGd1ZXN0IDEwLjAu MC4xIHJ1bgogICAgICAgICAgICBwa3QtZ2VuIC1pZXRoMCAtZnR4CgogICAgWzJdICMjIyBUZXN0 IFVEUCBiaWcgcGFja2V0cyAocmVjZWl2ZXIgb24gdGhlIGd1ZXN0KSAjIyMKICAgICAgICAjIE9u 
IHRoZSBndWVzdCAxMC4wLjAuMSBydW4KICAgICAgICAgICAgcGt0LWdlbiAtaWV0aDAgLWZyeAog ICAgICAgICMgT24gdGhlIGhvc3QgcnVuIHBrdC1nZW4gLWl2YWxlMDo5OSAtZnR4IC1sMTQ2MAoK --047d7b604cbc65dca304f0b88f04-- From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 21:26:59 2014 Return-Path: Delivered-To: freebsd-net@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id D7C7A549; Fri, 24 Jan 2014 21:26:59 +0000 (UTC) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id AA50319FD; Fri, 24 Jan 2014 21:26:59 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id s0OLQx31073659; Fri, 24 Jan 2014 21:26:59 GMT (envelope-from jmg@freefall.freebsd.org) Received: (from jmg@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id s0OLQxDf073658; Fri, 24 Jan 2014 21:26:59 GMT (envelope-from jmg) Date: Fri, 24 Jan 2014 21:26:59 GMT Message-Id: <201401242126.s0OLQxDf073658@freefall.freebsd.org> To: marii.vasile@gmail.com, jmg@FreeBSD.org, freebsd-net@FreeBSD.org From: jmg@FreeBSD.org Subject: Re: kern/132277: [crypto] [ipsec] poor performance using cryptodevice for IPSEC X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Jan 2014 21:26:59 -0000 Synopsis: [crypto] [ipsec] poor performance using cryptodevice for IPSEC State-Changed-From-To: open->closed State-Changed-By: jmg State-Changed-When: Fri Jan 24 21:26:35 UTC 2014 State-Changed-Why: closed because kern/132622 has a patch... 
http://www.freebsd.org/cgi/query-pr.cgi?pr=132277

From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 22:54:32 2014
Date: Fri, 24 Jan 2014 17:54:29 -0500 (EST)
From: Rick Macklem
To: J David
Message-ID: <659117348.16015750.1390604069888.JavaMail.root@uoguelph.ca>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Cc: freebsd-net@freebsd.org

J David wrote:
> On Thu, Jan 23, 2014 at 9:27 PM, Rick Macklem wrote:
> > Well, my TCP is pretty rusty, but...
> > Since your stats didn't show any jumbo frames, each IP
> > datagram needs to fit in the MTU of 1500 bytes. NFS hands an mbuf
> > list of just over 64K (or 32K) to TCP in a single sosend(), then TCP
> > will generate about 45 (or about 23 for 32K) TCP segments and put
> > each in an IP datagram, then hand it to the network device driver
> > for transmission.
>
> This is *not* what happens with TSO/LRO.
>
> With TSO, TCP generates IP datagrams of up to 64k which are passed
> directly to the driver, which passes them directly to the hardware.
>
> Furthermore, in this unique case (two virtual machines on the same
> host and bridge with both TSO and LRO enabled end-to-end), the packet
> is *never* fragmented. The host takes the 64k packet off of one
> guest's output ring and puts it onto the other guest's input ring,
> intact.
>
> This is, as you might expect, a *massive* performance win.
>
Ok, I mistakenly assumed that this driver emulated an ethernet. It does
not (at least w.r.t. MTU). It appears that it allows an MTU of up to 64K
(I had never heard of such a thing until now). So, who knows what effect
that has when an NFS RPC message is just over 64K.

The largest jumbo packet supported by the generic mbuf code is 16K
(or maybe 9K for 9.2). I have no idea if this matters or not.

I've cc'd glebius, since he's the last guy to make commits to the
virtio network driver. Maybe he can guess at what is going on.

rick

> With TSO & LRO:
>
> $ time iperf -c 172.20.20.162 -d
> ------------------------------------------------------------
> Server listening on TCP port 5001
> TCP window size: 1.00 MByte (default)
> ------------------------------------------------------------
> ------------------------------------------------------------
> Client connecting to 172.20.20.162, TCP port 5001
> TCP window size: 1.00 MByte (default)
> ------------------------------------------------------------
> [  5] local 172.20.20.169 port 60889 connected with 172.20.20.162 port 5001
> [  4] local 172.20.20.169 port 5001 connected with 172.20.20.162 port 44101
> [ ID] Interval       Transfer     Bandwidth
> [  5]  0.0-10.0 sec  17.0 GBytes  14.6 Gbits/sec
> [  4]  0.0-10.0 sec  17.4 GBytes  14.9 Gbits/sec
>
> real    0m10.061s
> user    0m0.229s
> sys     0m7.711s
>
> Without TSO & LRO:
>
> $ time iperf -c 172.20.20.162 -d
> ------------------------------------------------------------
> Server listening on TCP port 5001
> TCP window size: 1.00 MByte (default)
> ------------------------------------------------------------
> ------------------------------------------------------------
> Client connecting to 172.20.20.162, TCP port 5001
> TCP window size: 1.26 MByte (default)
> ------------------------------------------------------------
> [  5] local 172.20.20.169 port 22088 connected with 172.20.20.162 port 5001
> [  4] local 172.20.20.169 port 5001 connected with 172.20.20.162 port 48615
> [ ID] Interval       Transfer     Bandwidth
> [  5]  0.0-10.0 sec   637 MBytes   534 Mbits/sec
> [  4]  0.0-10.0 sec   767 MBytes   642 Mbits/sec
>
> real    0m10.057s
> user    0m0.231s
> sys     0m3.935s
>
> Look at the difference. In this bidirectional test, TSO is over 25x
> faster using not even 2x the CPU. This shows how essential TSO/LRO is
> if you plan to move data at real world speeds and still have enough
> CPU left to operate on that data.
>
> > I recall you saying you tried turning off TSO with no
> > effect. You might also try turning off checksum offload. I doubt it
> > will be where things are broken, but might be worth a try.
>
> That was not me, that was someone else. If there is a problem with
> NFS and TSO, the solution is *not* to disable TSO. That is, at best,
> a workaround that produces much more CPU load and much less
> throughput. The solution is to find the problem and fix it.
>
But disabling it will identify if that is causing the problem. And it
is a workaround that often helps people get things to work. (With real
hardware, there may be no way to "fix" such things, depending on the
chipset, etc.)

rick

ps: If you had looked at the link I had in the email, you would have
seen that he gets very good performance once he disables TSO. As they
say, your mileage may vary.

> More data to follow.
>
> Thanks!
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 22:59:12 2014
Date: Fri, 24 Jan 2014 17:59:06 -0500 (EST)
From: Rick Macklem
To: J David
Message-ID: <835300790.16017425.1390604346511.JavaMail.root@uoguelph.ca>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Cc: freebsd-net@freebsd.org

J David wrote:
> Here's a pair of quick tcpdumps with TSO/LRO on and off showing that
> tcpdump does not reassemble small packets into larger ones:
>
> TSO/LRO on:
>
> 05:21:54.956061 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [S], seq 2538331145, win 65535, options [mss 1460,nop,wscale 9,sackOK,TS val 2932122 ecr 0], length 0
> 05:21:54.956239 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [S.], seq 3371756775, ack 2538331146, win 65535, options [mss 1460,nop,wscale 9,sackOK,TS val 2833562423 ecr 2932122], length 0
> 05:21:54.956292 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], ack 1, win 2050, options [nop,nop,TS val 2932132 ecr 2833562423], length 0
> 05:21:54.956372 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [P.], seq 1:25, ack 1, win 2050, options [nop,nop,TS val 2932132 ecr 2833562423], length 24
> 05:21:54.956432 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 1, win 2050, options [nop,nop,TS val 2833562423 ecr 2932132], length 0
> 05:21:54.956495 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 25:4381, ack 1, win 2050, options [nop,nop,TS val 2932132 ecr 2833562423], length 4356
> 05:21:54.956604 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 4381, win 2041, options [nop,nop,TS val 2833562423 ecr 2932132], length 0
> 05:21:54.956620 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 4381:11657, ack 1, win 2050, options [nop,nop,TS val 2932132 ecr 2833562423], length 7276
> 05:21:55.050656 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 11657, win 2050, options [nop,nop,TS val 2833562523 ecr 2932132], length 0
> 05:21:55.050686 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 11657:21829, ack 1, win 2050, options [nop,nop,TS val 2932222 ecr 2833562523], length 10172
> 05:21:55.150644 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 21829, win 2050, options [nop,nop,TS val 2833562623 ecr 2932222], length 0
> 05:21:55.150674 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 21829:34897, ack 1, win 2050, options [nop,nop,TS val 2932322 ecr 2833562623], length 13068
> 05:21:55.250629 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 34897, win 2050, options [nop,nop,TS val 2833562723 ecr 2932322], length 0
> 05:21:55.250658 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 34897:50861, ack 1, win 2050, options [nop,nop,TS val 2932422 ecr 2833562723], length 15964
> 05:21:55.350647 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 50861, win 2050, options [nop,nop,TS val 2833562823 ecr 2932422], length 0
> 05:21:55.350677 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 50861:69721, ack 1, win 2050, options [nop,nop,TS val 2932522 ecr 2833562823], length 18860
> 05:21:55.450656 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 69721, win 2050, options [nop,nop,TS val 2833562923 ecr 2932522], length 0
> 05:21:55.450686 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 69721:91477, ack 1, win 2050, options [nop,nop,TS val 2932622 ecr 2833562923], length 21756
> 05:21:55.550577 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 91477, win 2050, options [nop,nop,TS val 2833563023 ecr 2932622], length 0
> 05:21:55.550608 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 91477:116129, ack 1, win 2050, options [nop,nop,TS val 2932722 ecr 2833563023], length 24652
> 05:21:55.650645 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 116129, win 2050, options [nop,nop,TS val 2833563123 ecr 2932722], length 0
> 05:21:55.650676 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 116129:143677, ack 1, win 2050, options [nop,nop,TS val 2932822 ecr 2833563123], length 27548
> 05:21:55.750643 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 143677, win 2050, options [nop,nop,TS val 2833563223 ecr 2932822], length 0
> 05:21:55.750675 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 143677:174121, ack 1, win 2050, options [nop,nop,TS val 2932922 ecr 2833563223], length 30444
> 05:21:55.850636 IP 172.20.20.162.5001 > 172.20.20.169.24265: Flags [.], ack 174121, win 2050, options [nop,nop,TS val 2833563323 ecr 2932922], length 0
> 05:21:55.850667 IP 172.20.20.169.24265 > 172.20.20.162.5001: Flags [.], seq 174121:207461, ack 1, win 2050, options [nop,nop,TS val 2933022 ecr 2833563323], length 33340
>
> TSO/LRO off:
>
> 05:19:34.556302 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [S], seq 529322163, win 65535, options [mss 1460,nop,wscale 9,sackOK,TS val 2791722 ecr 0], length 0
> 05:19:34.556414 IP 172.20.20.162.5001 > 172.20.20.169.53485: Flags [S.], seq 3835815533, ack 529322164, win 65535, options [mss 1460,nop,wscale 9,sackOK,TS val 1931664416 ecr 2791722], length 0
> 05:19:34.556443 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 0
> 05:19:34.556505 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [P.], seq 1:25, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 24
> 05:19:34.556604 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 25:1473, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556621 IP 172.20.20.162.5001 > 172.20.20.169.53485: Flags [.], ack 1, win 2050, options [nop,nop,TS val 1931664416 ecr 2791732], length 0
> 05:19:34.556648 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 1473:2921, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556672 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 2921:4369, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556711 IP 172.20.20.162.5001 > 172.20.20.169.53485: Flags [.], ack 1473, win 2047, options [nop,nop,TS val 1931664416 ecr 2791732], length 0
> 05:19:34.556730 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 4369:5817, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556743 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 5817:7265, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556755 IP 172.20.20.162.5001 > 172.20.20.169.53485: Flags [.], ack 4369, win 2041, options [nop,nop,TS val 1931664416 ecr 2791732], length 0
> 05:19:34.556774 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 7265:8713, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556793 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 8713:10161, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556813 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 10161:11609, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556830 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 11609:13057, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556848 IP 172.20.20.162.5001 > 172.20.20.169.53485: Flags [.], ack 7265, win 2036, options [nop,nop,TS val 1931664416 ecr 2791732], length 0
> 05:19:34.556865 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 13057:14505, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556881 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 14505:15953, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556893 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 15953:17401, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556912 IP 172.20.20.169.53485 > 172.20.20.162.5001: Flags [.], seq 17401:18849, ack 1, win 2050, options [nop,nop,TS val 2791732 ecr 1931664416], length 1448
> 05:19:34.556929 IP 172.20.20.162.5001 > 172.20.20.169.53485: Flags [.], ack 10161, win 2030, options [nop,nop,TS val 1931664416 ecr 2791732], length 0
>
> The stated lengths represent the length of each IP packet as it
> appears on the "wire." The result is the same whether it is obtained
> from the client, the server, or the KVM host snooping directly on the
> bridge.
>
Yes, although no real wire would ever have packets this big. I now
realize that you have a virtual pseudo ethernet that handles 64K
packets. Sorry for the confusion this has caused. Since I know
absolutely nothing about what the effects of such a virtual device are,
I will be useless to you in helping with this. (I don't even know if
wireshark could figure this out?)

rick

> Thanks!
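As an aside on the arithmetic in this thread: Rick's estimate of roughly 45 segments per 64K sosend() and the 1448-byte segment lengths visible in the non-TSO trace both follow from the same MSS calculation. A small illustrative sketch (plain Python, mine rather than anything from the thread; the constants are the standard IPv4/TCP header sizes plus the TCP timestamp option):

```python
import math

# MSS for a 1500-byte MTU with the TCP timestamp option, as in the traces:
# 1500 - 20 (IPv4 header) - 20 (TCP header) - 12 (timestamp option) = 1448.
MTU, IP_HDR, TCP_HDR, TS_OPT = 1500, 20, 20, 12
mss = MTU - IP_HDR - TCP_HDR - TS_OPT

def segments(payload_len, mss=mss):
    """TCP segments needed to carry payload_len bytes without TSO."""
    return math.ceil(payload_len / mss)

print(mss)              # 1448, matching the non-TSO tcpdump lengths above
print(segments(65536))  # 46 wire segments for a 64K sosend()
                        # (a rougher 1460-byte MSS gives the "about 45")
print(segments(32768))  # 23 for 32K, matching the "about 23" estimate
```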
From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 23:05:59 2014
Date: Fri, 24 Jan 2014 15:05:57 -0800
From: John-Mark Gurney
To: Rick Macklem
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <20140124230557.GF75135@funkthat.com>
In-Reply-To: <659117348.16015750.1390604069888.JavaMail.root@uoguelph.ca>
Cc: freebsd-net@freebsd.org, J David

Rick Macklem wrote this message on Fri, Jan 24, 2014 at 17:54 -0500:
> The largest jumbo packet supported by the generic mbuf code is 16K
> (or maybe 9K for 9.2). I have no idea if this matters or not.

This is only partly true. Our allocator only supports mbuf clusters of
2k (standard size), 4k (page size), 9216 and 16384 bytes... If you
allocate a 9k or 16k mbuf, it is guaranteed that the data will be
physically contiguous, so that cards that can't do scatter/gather DMA
can handle larger frames... But if the card can handle S/G DMA, they
can send a 32KB packet made up of normal 2k clusters, or any other odd
sized mbufs...

There are only a couple of drivers (and I plan on working to remove the
limit) that limit the size of the MTU, but if the driver hardware
supports it, there is nothing in our stack preventing large, as in
64KB, MTU use...
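A toy restatement (mine, not code from FreeBSD) of the scatter/gather point above: a card with S/G DMA can carry a large frame as a chain of ordinary 2k clusters, while a card without it needs a single physically contiguous jumbo cluster, the largest of which is 16k:

```python
import math

# FreeBSD mbuf cluster sizes: standard, page-size, 9k jumbo, 16k jumbo.
CLUSTER_SIZES = (2048, 4096, 9216, 16384)

def sg_chain(frame_len, cluster=2048):
    """Clusters in a scatter/gather chain carrying frame_len bytes."""
    return math.ceil(frame_len / cluster)

print(sg_chain(32 * 1024))     # 16 standard 2k clusters for a 32KB packet
print(sg_chain(65536, 16384))  # 4 contiguous 16k jumbo clusters for 64KB
print(max(CLUSTER_SIZES))      # 16384: largest single contiguous cluster
```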
--
John-Mark Gurney                Voice: +1 415 225 5579
     "All that I will do, has been done, All that I have, has not."

From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 23:36:58 2014
Date: Fri, 24 Jan 2014 18:36:55 -0500 (EST)
From: Rick Macklem
To: J David
Message-ID: <496663147.16031504.1390606615194.JavaMail.root@uoguelph.ca>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Cc: freebsd-net@freebsd.org

J David wrote:
> INTRODUCTION
>
> While researching NFS performance problems in our environment, I ran a
> series of four iozone tests twice, once with rsize=4096,wsize=4096 and
> again with rsize=8192,wsize=8192. That produced additional avenues of
> inquiry leading to the identification of two separate potential
> problems with NFS performance. This message discusses the first
> problem, apparently extraneous fixed-sized reads before writes.
>
> In order to maximize network throughput (>16Gbit/sec) and minimize
> network latency (<0.1ms), these tests were performed using two FreeBSD
> guests on the same KVM host node with bridged virtio adapters having
> TSO and LRO enabled. To remove local or network-backed virtual disks
> as a bottleneck, a 2GiB memory-backed-UFS-over-NFS filesystem served
> as the NFS export from the server to the client.
>
> BENCHMARK DATA
>
> Here is the iozone against the 4k NFS mount:
>
> $ iozone -e -I -s 1g -r 4k -i 0 -i 2
>     Iozone: Performance Test of File I/O
>             Version $Revision: 3.420 $
>     [...]
>     Include fsync in write timing
>     File size set to 1048576 KB
>     Record Size 4 KB
>     Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
>     Output is in Kbytes/sec
>     Time Resolution = 0.000005 seconds.
>     Processor cache size set to 1024 Kbytes.
>     Processor cache line size set to 32 bytes.
>     File stride size set to 17 * record size.
>                                                     random   random
>            KB  reclen   write  rewrite    read  reread    read    write
>       1048576       4   71167    73714                   27525    46349
>
> iozone test complete.
>
> Now, here are the read and write columns from nfsstat -s -w 1 over the
> entire run:
>
>     Read   Write
>        0       0
>        0       0
>        0       0
>        0   11791  <-- sequential write test begins
>        0   17215
>        0   18260
>        0   17968
>        0   17810
>        0   17839
>        0   17959
>        0   17912
>        0   18180
>        0   18285
>        0   18636
>        0   18554
>        0   18178
>        0   17361
>        0   17803  <-- sequential rewrite test begins
>        0   18358
>        0   18188
>        0   18817
>        0   17757
>        0   18153
>        0   18924
>        0   19444
>        0   18775
>        0   18995
>        0   18198
>        0   18949
>        0   17978
>        0   18879
>        0   19055
>     7904      67  <-- random read test begins
>     7502       0
>     7194       0
>     7432       0
>     6995       0
>     6844       0
>     6730       0
>     6761       0
>     7011       0
>     7058       0
>     7477       0
>     7139       0
>     6793       0
>     7047       0
>     6402       0
>     6621       0
>     7111       0
>     6911       0
>     7413       0
>     7431       0
>     7047       0
>     7002       0
>     7104       0
>     6987       0
>     6849       0
>     6580       0
>     6268       0
>     6868       0
>     6775       0
>     6335       0
>     6588       0
>     6595       0
>     6587       0
>     6512       0
>     6861       0
>     6953       0
>     7273       0
>     5184    1688  <-- random write test begins
>        0   11795
>        0   11915
>        0   11916
>        0   11838
>        0   12035
>        0   11408
>        0   11780
>        0   11488
>        0   11836
>        0   11787
>        0   11824
>        0   12099
>        0   11863
>        0   12154
>        0   11127
>        0   11434
>        0   11815
>        0   11960
>        0   11510
>        0   11623
>        0   11714
>        0   11896
>        0    1637  <-- test finished
>        0       0
>        0       0
>        0       0
>
> This looks exactly like you would expect.
>
> Now, we re-run the exact same test with rsize/wsize set to 8k:
>
> $ iozone -e -I -s 1g -r 4k -i 0 -i 2
>     Iozone: Performance Test of File I/O
>             Version $Revision: 3.420 $
>     Compiled for 64 bit mode.
>     Build: freebsd
>
>     Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
>                   Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
>                   Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
>                   Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
>                   Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
>                   Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
>                   Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
>                   Vangel Bojaxhi, Ben England, Vikentsi Lapa.
>
> Run began: Fri Jan 24 07:45:06 2014
>
>     Include fsync in write timing
>     File size set to 1048576 KB
>     Record Size 4 KB
>     Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
>     Output is in Kbytes/sec
>     Time Resolution = 0.000005 seconds.
>     Processor cache size set to 1024 Kbytes.
>     Processor cache line size set to 32 bytes.
>     File stride size set to 17 * record size.
>                                                     random   random
>            KB  reclen   write  rewrite    read  reread    read    write
>       1048576       4  124820    34850                   35996    16258
>
> iozone test complete.
>
> And here are the second-by-second counts of corresponding read/write
> operations from nfsstat:
>
>     Read   Write
>        0       0
>        0       0
>        0       0
>       11   10200  <-- sequential write test begins
>       13   16109
>       10   15360
>       12   16055
>        4   15623
>        8   15533
>        9   16241
>       11   16045
>     1891   11874  <-- sequential rewrite test begins
>     4508    4508
>     4476    4476
>     4353    4353
>     4350    4351
>     4230    4229
>     4422    4422
>     4731    4731
>     4600    4600
>     4550    4550
>     4314    4314
>     4273    4273
>     4422    4422
>     4284    4284
>     4609    4609
>     4600    4600
>     4362    4361
>     4386    4387
>     4451    4451
>     4467    4467
>     4370    4370
>     4394    4394
>     4252    4252
>     4155    4155
>     4427    4428
>     4436    4435
>     4300    4300
>     3989    3989
>     4438    4438
>     4632    4632
>     6440    1836  <-- random read test begins
>     6681       0
>     6416       0
>     6290       0
>     6560       0
>     6569       0
>     6237       0
>     6331       0
>     6403       0
>     6310       0
>     6010       0
>     6023       0
>     6271       0
>     6068       0
>     6044       0
>     6003       0
>     5957       0
>     5683       0
>     5833       0
>     5774       0
>     5708       0
>     5810       0
>     5829       0
>     5787       0
>     5791       0
>     5838       0
>     5922       0
>     6002       0
>     6191       0
>     4162    2624  <-- random write test begins
>     3998    4068
>     4056    4100
>     3929    3969
>     4164    4128
>     4188    4173
>     4256    4164
>     3899    4068
>     4018    3933
>     4077    4114
>     4210    4222
>     4050    4116
>     4157    3983
>     4066    4122
>     4129    4162
>     4303    4199
>     4235    4412
>     4100    3932
>     4056    4254
>     4192    4186
>     4204    4120
>     4085    4138
>     4103    4146
>     4272    4193
>     3943    4028
>     4161    4163
>     4153    4131
>     4218    4017
>     4132    4233
>     4195    4101
>     4041    4145
>     3994    4086
>     4297    4193
>     4205    4341
>     4101    4103
>     4134    3962
>     4297    4408
>     4282    4242
>     4180    4175
>     4286    4216
>     4217    4397
>     4364    4253
>     3673    3720
>     3843    3807
>     4147    4183
>     4171    4181
>     4280    4224
>     4126    4158
>     3977    4074
>     4146    3919
>     4147    4362
>     4079    4060
>     3755    3760
>     4157    4130
>     4087    4109
>     4006    3873
>     3860    3967
>     3982    4048
>     4146    3963
>     4188    4203
>     4040    4063
>     3976    4046
>     3859    3815
>     4114    4193
>      393     447  <-- test finished
>        0       0
>        0       0
>        0       0
>
> To go through the tests individually...
>
> SEQUENTIAL WRITE TEST
>
> The first test is sequential write, the ktrace for this test looks
> like this:
>
>   3648 iozone  CALL  write(0x3,0x801500000,0x1000)
>   3648 iozone  GIO   fd 3 wrote 4096 bytes
>   3648 iozone  RET   write 4096/0x1000
>   3648 iozone  CALL  write(0x3,0x801500000,0x1000)
>   3648 iozone  GIO   fd 3 wrote 4096 bytes
>   3648 iozone  RET   write 4096/0x1000
>
> It's just writing 4k in a loop 262,144 times.
>
> This test is substantially faster with wsize=8192 than wsize=4096
> (124,820 KiB/sec at 8k vs 71,167 KiB/sec at 4k). We see from the
> nfsstat output that this is because writes are being coalesced: there
> are half as many writes in the 8k test as in the 4k test (8k = 131,072,
> 4k = 262,144).
>
> So far so good.
>
> SEQUENTIAL REWRITE TEST
>
> But when we move on to the rewrite test, there's trouble in River
> City. Here's what it looks like in ktrace:
>
>   3648 iozone  CALL  write(0x3,0x801500000,0x1000)
>   3648 iozone  GIO   fd 3 wrote 4096 bytes
>   3648 iozone  RET   write 4096/0x1000
>   3648 iozone  CALL  write(0x3,0x801500000,0x1000)
>   3648 iozone  GIO   fd 3 wrote 4096 bytes
>   3648 iozone  RET   write 4096/0x1000
>
> As with the write test, it just writes 4k in a loop 262,144 times.
>
> Yet despite being exactly the same sequence of calls, this test is
> less than half as fast with 8k NFS as with 4k NFS (34,850 KiB/sec at
> 8k vs 73,714 KiB/sec at 4k). The only difference between this test
> and the previous one is that now the file exists. And although the 4k
> NFS client writes it the same way as the first test and therefore gets
> almost exactly the same speed, the 8k NFS client is reading and
> writing in roughly equal proportion, and drops by a factor of 3.5x
> from its own performance on the first write test. Note that the total
> number of writes/sec on the 4k NFS rewrite test hovers around 18,500.
> On the 8k NFS test, the reads and writes are roughly balanced at
> around 4,242. So the interleaved read/writes performed by the NFS
> client cut the total IOPS available by more than 50%, twice, leaving
> less than a quarter available for the application. And these numbers
> roughly correlate to the observed speeds: 4,242 * 8 KiB = 33,936 KiB,
> 18,500 * 4 KiB = 74,000 KiB. A total of 131,072 NFS writes are issued
> during the 8k NFS test, for a maximum possible write of 1 GiB. Iozone
> is writing a total of 1 GiB (262,144 x 4 KiB). So none of the read
> data is being used, and those extraneous reads appear to be the source
> of the performance drop.
>
Yes, this is to be expected, because NFS will read in the buffer cache
block (8K) before modifying part of it. Doing it this way wasn't always
how BSD (as in 4.4BSD) did it. Way back when I wrote this code in the
1980s, it wouldn't first read in a block before doing a partial write.
The problem with not reading the block before a partial write is that
the dirty block must first be written synchronously to the server
if/when a read of a part of the block not just written happens, and
only then can the read be done. Someone changed this a long time ago
(well over a decade ago, I believe). (I looked in the commit log and
hit the end of it before I found the commit.)
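The behaviour described here can be sketched as a toy model (an
illustration only, not the actual FreeBSD buffer-cache code; the
function and names below are invented). A write triggers a read of the
enclosing block unless the block is already cached or the write covers
the whole buffer, which is the exemption spelled out in the code
comment quoted below:

```python
# Simplified model of NFS read-before-write on partial buffer writes.
# Illustration only; not the real nfs_write()/getblk() logic.

BSIZE = 8192  # buffer cache block size (the 8K mentioned above)

def nfs_buffered_write(cache, offset, length):
    """Return the list of NFS RPCs a buffered write would generate."""
    rpcs = []
    bn = offset // BSIZE
    covers_whole_buffer = (offset % BSIZE == 0 and length == BSIZE)
    # Read the block first unless it is cached (B_CACHE set) or the
    # write covers the entire buffer.
    if bn not in cache and not covers_whole_buffer:
        rpcs.append(("read", bn * BSIZE, BSIZE))
    cache.add(bn)  # buffer is now valid
    rpcs.append(("write", offset, length))
    return rpcs

cache = set()
# A 4K write into an uncached 8K buffer: read-modify-write.
print(nfs_buffered_write(cache, 0, 4096))
# The adjacent 4K write hits the now-cached buffer: no read.
print(nfs_buffered_write(cache, 4096, 4096))
```

Running the two sample writes shows one read RPC for the first uncached
4K write and none for the second, which is the read/write interleaving
visible in the 8k nfsstat columns above.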
Here's the comment out of the old client code (cloned for the new
client):

	 * Issue a READ if B_CACHE is not set.  In special-append
	 * mode, B_CACHE is based on the buffer prior to the write
	 * op and is typically set, avoiding the read.  If a read
	 * is required in special append mode, the server will
	 * probably send us a short-read since we extended the file
	 * on our end, resulting in b_resid == 0 and, thusly,
	 * B_CACHE getting set.
	 *
	 * We can also avoid issuing the read if the write covers
	 * the entire buffer.  We have to make sure the buffer state
	 * is reasonable in this case since we will not be initiating
	 * I/O.  See the comments in kern/vfs_bio.c's getblk() for
	 * more information.
	 *
	 * B_CACHE may also be set due to the buffer being cached
	 * normally.
	 */

If a real client was predominantly running an app that did 4K writes
sequentially then, yes, wsize=4096 might be the best w.r.t.
performance. (I doubt most real client mounts have predominantly apps
doing 4K writes, but the mount option is there to handle that case.)

Everything in the NFS client's caching is a compromise that experience
over time (much of it contributed by others, not me) has shown to be a
good compromise for most real client file activity.

rick

> RANDOM READ TEST
>
> Considering the random read test, things look better again.
> Here's what it looks like in ktrace:
>
> 3947 iozone CALL lseek(0x3,0x34caa000,SEEK_SET)
> 3947 iozone RET lseek 885694464/0x34caa000
> 3947 iozone CALL read(0x3,0x801500000,0x1000)
> 3947 iozone GIO fd 3 read 4096 bytes
> 3947 iozone RET read 4096/0x1000
> 3947 iozone CALL lseek(0x3,0x3aa53000,SEEK_SET)
> 3947 iozone RET lseek 983904256/0x3aa53000
> 3947 iozone CALL read(0x3,0x801500000,0x1000)
> 3947 iozone GIO fd 3 read 4096 bytes
> 3947 iozone RET read 4096/0x1000
>
> This test lseeks to a random location (on a 4k boundary) then reads
> 4k, in a loop, 262,144 times.
>
> On this test, the 8k NFS moderately outperforms the 4k NFS (35,996
> KiB/sec at 8k vs 27,525 KiB/sec at 4k). This is probably attributable
> to lucky adjacent requests, which are visible in the total number of
> reads. 4k NFS produces the expected 262,144 NFS read ops, but 8k NFS
> performs about 176,000, an average of about 6k per read, suggesting
> it was doing 8k reads of which the whole 8k was useful about 50% of
> the time.
>
> RANDOM WRITE TEST
>
> Finally, the random write test. Here's the ktrace output from this
> one:
>
> 3262 iozone CALL lseek(0x3,0x29665000,SEEK_SET)
> 3262 iozone RET lseek 694571008/0x29665000
> 3262 iozone CALL write(0x3,0x801500000,0x1000)
> 3262 iozone GIO fd 3 wrote 4096 bytes
> 3262 iozone RET write 4096/0x1000
> 3262 iozone CALL lseek(0x3,0x486c000,SEEK_SET)
> 3262 iozone RET lseek 75939840/0x486c000
> 3262 iozone CALL write(0x3,0x801500000,0x1000)
> 3262 iozone GIO fd 3 wrote 4096 bytes
> 3262 iozone RET write 4096/0x1000
> 3262 iozone CALL lseek(0x3,0x33310000,SEEK_SET)
> 3262 iozone RET lseek 858849280/0x33310000
> 3262 iozone CALL write(0x3,0x801500000,0x1000)
> 3262 iozone GIO fd 3 wrote 4096 bytes
> 3262 iozone RET write 4096/0x1000
> 3262 iozone CALL lseek(0x3,0x2ae3000,SEEK_SET)
>
> That's all it is: lseek() then write(), repeated 262,144 times.
> No reads, and again, all the lseeks are on 4k boundaries.
>
> This test is dramatically slower with the 8k NFS, falling from 46,349
> KiB/sec for 4k NFS to 16,258 KiB/sec for 8k NFS.
>
> Like the rewrite test, the nfsstat output shows an even mix of reads
> and writes during the 8k NFS run of this test, even though no reads
> are ever issued by the application. Unlike the 8k NFS rewrite test,
> which performs 131,072 NFS write operations, the 8k NFS random write
> test performs 262,025 operations. This strongly suggests it was
> converting each 4kiB write into an 8kiB read-modify-write cycle, with
> the commensurate performance penalty in almost every case, with a
> handful of writes (about 119) lucky enough to be adjacent. Since the
> NFS server is supposed to be able to handle arbitrary-length writes
> at arbitrary file offsets, this appears to be unnecessary.
>
> To confirm that this is the case, and to determine whether it is
> doing 4kiB or 8kiB writes, we will have to go to the packets.
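Before going to the packets, the op counts quoted above are worth a
sanity check; this is only a back-of-the-envelope recomputation of the
figures reported in this message:

```python
# Back-of-the-envelope check of the NFS op counts discussed above,
# using the numbers reported in this thread.

app_writes = 262_144   # 4 KiB writes issued by iozone (1 GiB total)
wsize = 8192           # NFS block size on the 8k mount
app_io = 4096          # application write size

# Sequential write: adjacent 4 KiB writes coalesce into 8 KiB RPCs.
coalesced_writes = app_writes * app_io // wsize
print(coalesced_writes)  # matches the 131,072 writes of the 8k run

# Random write: almost every 4 KiB write becomes its own RPC, plus a
# read of the enclosing 8 KiB block (read-modify-write).
observed_random_writes = 262_025
lucky_adjacent = app_writes - observed_random_writes
print(lucky_adjacent)    # the ~119 writes that merged with a neighbour
```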
Here is a > sample from the random write test: >=20 > 09:31:52.338749 IP 172.20.20.169.1877003014 > 172.20.20.162.2049: 128 > read fh 1326,127488/4 8192 bytes @ 606494720 > 09:31:52.338877 IP 172.20.20.162.2049 > 172.20.20.169.1877003014: > reply ok 8320 read > 09:31:52.338944 IP 172.20.20.169.1877003015 > 172.20.20.162.2049: > 4232 > write fh 1326,127488/4 4096 (4096) bytes @ 606498816 > 09:31:52.338964 IP 172.20.20.169.1877003016 > 172.20.20.162.2049: 128 > read fh 1326,127488/4 8192 bytes @ 85237760 > 09:31:52.339037 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.], > ack 4008446365, win 29118, options [nop,nop,TS val 4241448191 ecr > 17929512], length 0 > 09:31:52.339076 IP 172.20.20.162.2049 > 172.20.20.169.1877003015: > reply ok 160 write [|nfs] > 09:31:52.339117 IP 172.20.20.162.2049 > 172.20.20.169.1877003016: > reply ok 8320 read > 09:31:52.339142 IP 172.20.20.169.719 > 172.20.20.162.2049: Flags [.], > ack 8488, win 515, options [nop,nop,TS val 17929512 ecr 4241448191], > length 0 > 09:31:52.339183 IP 172.20.20.169.1877003017 > 172.20.20.162.2049: > 4232 > write fh 1326,127488/4 4096 (4096) bytes @ 85241856 > 09:31:52.339201 IP 172.20.20.169.1877003018 > 172.20.20.162.2049: 128 > read fh 1326,127488/4 8192 bytes @ 100843520 > 09:31:52.339271 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.], > ack 4369, win 29118, options [nop,nop,TS val 4241448191 ecr > 17929512], > length 0 > 09:31:52.339310 IP 172.20.20.162.2049 > 172.20.20.169.1877003017: > reply ok 160 write [|nfs] > 09:31:52.339332 IP 172.20.20.162.2049 > 172.20.20.169.1877003018: > reply ok 8320 read > 09:31:52.339355 IP 172.20.20.169.719 > 172.20.20.162.2049: Flags [.], > ack 16976, win 515, options [nop,nop,TS val 17929512 ecr 4241448191], > length 0 > 09:31:52.339408 IP 172.20.20.169.1877003019 > 172.20.20.162.2049: 128 > read fh 1326,127488/4 8192 bytes @ 330153984 > 09:31:52.339514 IP 172.20.20.162.2049 > 172.20.20.169.1877003019: > reply ok 8320 read > 09:31:52.339562 IP 
172.20.20.169.1877003020 > 172.20.20.162.2049:
>     128 read fh 1326,127488/4 8192 bytes @ 500056064
> 09:31:52.339669 IP 172.20.20.162.2049 > 172.20.20.169.1877003020:
>     reply ok 8320 read
> 09:31:52.339728 IP 172.20.20.169.1877003022 > 172.20.20.162.2049:
>     128 read fh 1326,127488/4 8192 bytes @ 778158080
> 09:31:52.339758 IP 172.20.20.169.1877003021 > 172.20.20.162.2049:
>     4232 write fh 1326,127488/4 4096 (4096) bytes @ 500060160
> 09:31:52.339834 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.],
>     ack 9001, win 29118, options [nop,nop,TS val 4241448191 ecr
>     17929512], length 0
>
> This confirms that the NFS client is indeed performing a read-write
> cycle. However, the packets also show that the writes are 4kiB, not
> 8kiB, so no modify is occurring. E.g. it reads 8kiB at offset
> 606494720 and then immediately writes the 4kiB from the application
> at offset 606498816 (which is 606494720+4096). It's not clear what
> the purpose of the read is in this scenario.
>
What is needed is knowledge of what real apps do w.r.t. I/O. I would
suspect they would use larger I/O sizes than 4K, but I don't know. The
default read/write size done by the stdio library would be a good
start, but I'll admit I don't know what that is, either.

> FOLLOW-UP TESTS
>
> To help determine which direction to go next, I conducted two
> follow-up tests. First, I re-ran the test with the default 64k
> rsize/wsize values, in an effort to confirm that it would do 64k
> reads followed by 4k writes. It was necessary to do this with a much
> smaller file size due to the poor performance, and with
> net.inet.tcp.delacktime=10 on both client and server to work around a
> separate issue with TCP writes >64kiB, but the results are sufficient
> for this purpose:
>
> $ iozone -e -s 32m -r 4k -i 0 -i 2
> Iozone: Performance Test of File I/O
> Version $Revision: 3.420 $
> Compiled for 64 bit mode.
> Build: freebsd
> [...]
> Run began: Fri Jan 24 10:35:14 2014
>
> Include fsync in write timing
> File size set to 32768 KB
> Record Size 4 KB
> Command line used: iozone -e -s 32m -r 4k -i 0 -i 2
> Output is in Kbytes/sec
> Time Resolution = 0.000005 seconds.
> Processor cache size set to 1024 Kbytes.
> Processor cache line size set to 32 bytes.
> File stride size set to 17 * record size.
>
>                KB  reclen    write  rewrite  random read  random write
>             32768       4    12678     5292         5355           741
>
> iozone test complete.
>
> Although the use of TCP writes >64kiB largely mangles tcpdump's
> ability to interpret the NFS messages due to fragmentation, it was
> possible to grep out enough to see that the suspected 64kiB read /
> 4kiB write pairs are indeed occurring:
>
> 10:36:11.356519 IP 172.20.20.169.1874848588 > 172.20.20.162.2049:
>     128 read fh 1326,127488/5 65536 bytes @ 18087936
> 10:36:11.496905 IP 172.20.20.169.1874848606 > 172.20.20.162.2049:
>     4232 write fh 1326,127488/5 4096 (4096) bytes @ 18096128
>
> As before, the 64kiB read appears not to be used. The result is that
> roughly 94% of the network I/O spent on a 4kiB write is wasted, with
> commensurate impact on performance.
>
> In an attempt to verify the theory that the data being read is not
> necessary, the 8k NFS test was retried with a Debian Linux guest.
> The hypothesis was that if the Debian Linux client did not perform
> the reads, that would indicate that they were not necessary. The
> results appeared to support that hypothesis. Not only did the Linux
> client handily outperform the FreeBSD client on all four tests:
>
> # iozone -e -s 1g -r 4k -i 0 -i 2
> Iozone: Performance Test of File I/O
> Version $Revision: 3.397 $
> Compiled for 64 bit mode.
> Build: linux-AMD64
> [...]
> Include fsync in write timing
> File size set to 1048576 KB
> Record Size 4 KB
> Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
> Output is in Kbytes/sec
> Time Resolution = 0.000001 seconds.
> Processor cache size set to 1024 Kbytes.
> Processor cache line size set to 32 bytes.
> File stride size set to 17 * record size.
>
>                KB  reclen    write  rewrite  random read  random write
>           1048576       4   142393   158101        79471        133974
>
> iozone test complete.
>
> But it also did not perform any reads during the rewrite and random
> write tests, as shown by the server's nfsstat output:
>
>  Read  Write
>     0      0
>     0      0
>     0      0
>     0   1905  <-- sequential write test begins
>     0  19571
>     0  20066
>     0  20214
>     0  20036
>     0  20051
>     0  17221
>     0  13311  <-- rewrite test begins
>     0  20721
>     0  22237
>     0  21710
>     0  21619
>     0  21104
>     0  18339
>  6363   4040  <-- random read test begins
>  8906      0
>  9885      0
> 10921      0
> 12487      0
> 13311      0
> 14141      0
> 16598      0
> 16178      0
> 17138      0
> 17796      0
> 18765      0
> 16760      0
>  2816   6787  <-- random write test begins
>     0  20892
>     0  19528
>     0  20879
>     0  20758
>     0  17040
>     0  21327
>     0  19713  <-- tests finished
>     0      0
>     0      0
>     0      0
>
>
> CONCLUSION
>
> The rewrite and random write iozone tests appear to demonstrate that
> in some cases the FreeBSD NFS client treats the rsize/wsize settings
> as a fixed block size rather than a not-to-exceed size.
> The data suggests that it requests a full rsize-sized block read
> from the server before performing a 4kiB write issued by an
> application, and that the result of the read is not used. Linux does
> not perform the same read-before-write strategy, and consequently
> achieves much higher throughput.
>
I shouldn't say this, because I am not sure, but I believe that Linux
caches files in pages (which happen to be 4K for x86). As such, "-r 4k"
is optimal for Linux. Try some iozone tests with other "-r" sizes. I
suspect that with 1K or 2K you will see Linux doing the read before
write, but I could be wrong.

You should also try both Linux and FreeBSD with "-r 32k" and
rsize=32768,wsize=32768 for both and see how they compare (you should
collect nfsstat RPC counts as well as iozone's transfer rates).

The 64K case seems to be broken for FreeBSD, I suspect due to some
issue related to the virtio driver using a 64K MTU, so I left that out
of the above. (It would be nice to resolve this, but I have no idea if
glebius@ can help w.r.t. this one.)

> The next step would probably be to examine the NFS implementation to
> determine the source of the apparently unnecessary reads. If they
> could be eliminated, it may lead to performance improvements of up to
> 20x on some workloads.
>
> Thanks!
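As an aside, the ~94% waste figure from the 64k follow-up test above
follows directly from the transfer sizes involved. This is illustrative
arithmetic only, ignoring RPC and TCP header overhead:

```python
# Quick check of the waste estimate quoted earlier in this message:
# with the default 64k rsize, each 4 KiB application write triggers a
# 64 KiB read plus a 4 KiB write on the wire.

rsize = 65536    # bytes read from the server before the write
payload = 4096   # bytes the application actually wrote

wire_bytes = rsize + payload
wasted = rsize / wire_bytes
print(f"{wasted:.0%}")  # fraction of the transfer carrying no useful data
```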
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

From owner-freebsd-net@FreeBSD.ORG Fri Jan 24 23:37:38 2014
In-Reply-To: <659117348.16015750.1390604069888.JavaMail.root@uoguelph.ca>
References: <659117348.16015750.1390604069888.JavaMail.root@uoguelph.ca>
Date: Fri, 24 Jan 2014 18:37:38 -0500
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
From: J David
To: Rick Macklem
Cc: freebsd-net@freebsd.org

On Fri, Jan 24, 2014 at 5:54 PM, Rick Macklem wrote:
> But disabling it will identify if that is causing the problem. And it
> is a workaround that often helps people get things to work. (With
> real hardware, there may be no way to "fix" such things, depending on
> the chipset, etc.)

There are two problems that are crippling NFS performance with large
block sizes.

One is the extraneous NFS read-on-write issue I documented earlier
today; it has nothing to do with network topology or packet size. You
might have more interest in that one.

The other is a five-way negative interaction between 64k NFS, TSO,
LRO, delayed ack, and congestion control. Disabling *any* one of them
is sufficient to see significant improvement, but that does not serve
to identify which one is causing the problem, since it is not a unique
characteristic. (Even if it were, that would not determine whether the
problem is with component X or with component Y's ability to interact
with component X.) Figuring out what's really happening has proven
very difficult for me, largely due to my limited knowledge of these
areas. And the learning curve on the TCP code is pretty steep.

The "simple" explanation appears to be that NFS generates two packets,
one just under 64k and one containing "the rest", and the alternating
sizes prevent the delayed-ack code from ever seeing two full-size
segments in a row, so traffic gets pinned down to one packet per
net.inet.tcp.delacktime (100ms default), for 10pps, as observed
earlier.
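For what it's worth, the arithmetic behind that explanation does line
up with the observed rate. This is a rough model of the claimed stall,
not a statement about where the slowdown actually comes from:

```python
# Rough model of the stall described above: if every data exchange
# waits out one delayed-ack timer, throughput is capped at one
# transfer per delack interval, regardless of transfer size.

delacktime_ms = 100      # net.inet.tcp.delacktime default
transfer_bytes = 65536   # one ~64k NFS read/write per exchange

exchanges_per_sec = 1000 / delacktime_ms
throughput_kib = exchanges_per_sec * transfer_bytes / 1024
print(exchanges_per_sec, throughput_kib)  # 10 exchanges/sec, 640 KiB/sec
```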
But unfortunately, like a lot of simple explanations, this one appears
to have the disadvantage of being more or less completely wrong.

> ps: If you had looked at the link I had in the email, you would have
> seen that he gets very good performance once he disables TSO. As
> they say, your mileage may vary.

Pretty much every word written on this subject has come across my
screens at this point. "Very good performance" is relative. Yes, you
can get about 10-20x better performance by disabling TSO, at the
expense of using vastly more CPU. Which is definitely a big
improvement, and may be sufficient for many applications. But in
absolute terms, the overall performance, and particularly the
efficiency, remains unsatisfactory.

Thanks!

From owner-freebsd-net@FreeBSD.ORG Sat Jan 25 00:10:08 2014
Date: Fri, 24 Jan 2014 19:10:00 -0500 (EST)
From: Rick Macklem
To: J David
Message-ID: <179007387.16041087.1390608600325.JavaMail.root@uoguelph.ca>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
J David wrote:
> On Fri, Jan 24, 2014 at 5:54 PM, Rick Macklem wrote:
> > But disabling it will identify if that is causing the problem. And
> > it is a workaround that often helps people get things to work.
> > (With real hardware, there may be no way to "fix" such things,
> > depending on the chipset, etc.)
>
> There are two problems that are crippling NFS performance with large
> block sizes.
>
> One is the extraneous NFS read-on-write issue I documented earlier
> today that has nothing to do with network topology or packet size.
> You might have more interest in that one.
>
Afraid not. Here is the commit message for the commit where the read
before partial write was added. It is r46349, dated May 2, 1999. (As
you will see, there is a lot to this, and I am not the guy to try and
put it back the old way without breaking anything.)

The VFS/BIO subsystem contained a number of hacks in order to
optimize piecemeal, middle-of-file writes for NFS.  These hacks have
caused no end of trouble, especially when combined with mmap().  I've
removed them.  Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write.  NFS does, however,
optimize piecemeal appends to files.  For most common file operations,
you will not notice the difference.  The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations.
NFS also optimizes the write-covers-entire-buffer case by avoiding the
read-before-write.  There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault.  This
is not correct operation.  The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid.  A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid.  This operation is
necessary to properly support mmap().  The zeroing occurs most often
when dealing with file-EOF situations.  Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten.  B_CACHE operation is now
formally defined in comments and more straightforward in
implementation.  B_CACHE for VMIO buffers is based on the validity of
the backing store.  B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice-versa).  biodone() is now responsible for setting B_CACHE
when a successful read completes.  B_CACHE is also set when a
bdwrite() is initiated and when a bwrite() is initiated.  VFS
VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite())
are now expected to set B_CACHE.  This means that bowrite() and
bawrite() also set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been
using buf->b_bcount.  These have been fixed.  getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made.
A server-side bug could cause requests to be lost by the server due to
nfs_realign() overwriting other rpc's in the same TCP mbuf chain.  The
server's kernel must be recompiled to get the benefit of the fixes.

Submitted by:	Matthew Dillon

I would like to hear if you find Linux doing read before write when
you use "-r 2k", since I think that is writing less than a page.

> This other thing is a five-way negative interaction between 64k NFS,
> TSO, LRO, delayed ack, and congestion control. Disabling *any* one
> of them is sufficient to see significant improvement, but does not
> serve to identify that it is causing the problem since it is not a
> unique characteristic. (Even if it was, that would not determine
> whether a problem was with component X or with component Y's ability
> to interact with component X.) Figuring out what's really happening
> has proven very difficult for me, largely due to my limited
> knowledge of these areas. And the learning curve on the TCP code is
> pretty steep.
>
> The "simple" explanation appears to be that NFS generates two
> packets, one just under 64k and one containing "the rest" and the
> alternating sizes prevent the delayed ack code from ever seeing two
> full-size segments in a row, so traffic gets pinned down to one
> packet per net.inet.tcp.delacktime (100ms default), for 10pps, as
> observed earlier. But unfortunately, like a lot of simple
> explanations, this one appears to have the disadvantage of being
> more or less completely wrong.
>
This simple explanation sounds interesting to me. Have you tried a 64K
test with the delayed ACK disabled entirely by setting
net.inet.tcp.delayed_ack=0? (I thought someone mentioned that the ACK
was only delayed when there wasn't any data to send, but I may be
wrong.)

Also, I believe that the above is specific to the virtio driver (or
possibly others that handle a 64K MTU).
I'm afraid that various issues will pop up (like the one I pointed out
where disabling TSO was the "magic bullet") for different network
interfaces.

Glebius, J David is using the virtio driver to do NFS perf testing and
gets very poor performance when the rsize/wsize is 64K. I think he
might be willing to send you a packet capture of this, if you think it
might explain what is going on and whether changing something in the
virtio network driver might help.

Thanks, rick

> > ps: If you had looked at the link I had in the email, you would
> > have seen that he gets very good performance once he disables TSO.
> > As they say, your mileage may vary.
>
> Pretty much every word written on this subject has come across my
> screens at this point. "Very good performance" is relative. Yes, you
> can get about 10-20x better performance by disabling TSO, at the
> expense of using vastly more CPU. Which is definitely a big
> improvement, and may be sufficient for many applications. But in
> absolute terms, the overall performance, and particularly the
> efficiency, remains unsatisfactory.
>
> Thanks!
From owner-freebsd-net@FreeBSD.ORG Sat Jan 25 01:02:57 2014
Date: Fri, 24 Jan 2014 20:02:56 -0500 (EST)
From: Rick Macklem
To: J David
Message-ID: <635382404.16057591.1390611776054.JavaMail.root@uoguelph.ca>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Cc: freebsd-net@freebsd.org
List-Id: Networking and TCP/IP with FreeBSD

J David wrote:
> On Fri, Jan 24, 2014 at 5:54 PM, Rick Macklem wrote:
> > But disabling it will identify if that is causing the problem. And it
> > is a workaround that often helps people get things to work. (With real
> > hardware, there may be no way to "fix" such things, depending on the
> > chipset, etc.)
>
> There are two problems that are crippling NFS performance with large
> block sizes.
>
Btw, just in case it wasn't obvious, I would like to see large reads/writes (at least as large as the server's file system block size) going to the server. For example, here is a write for your Linux case (from your previous post):

172.20.20.166.2036438470 > 172.20.20.162.2049: 2892 write fh 1325,752613/4 4096 (4096) bytes @ 824033280

This is writing 4096 bytes with filesync. Filesync means that the data and metadata must be written to stable storage before the NFS server replies, so that the write won't be lost if the server crashes just after sending the reply.

Now, unlike your test case, a typical real NFS server will be using disks as stable storage, and the multiple disk writes needed for each of these will take a long time. As with most disk file systems, doing fewer writes of larger blocks makes a big difference for NFS server performance. (Your test does the unrealistic case of putting the file system in memory.)

Now, I would agree that I would like to see 64Kbyte rsize/wsize work well with the underlying network fabric, but I don't know how to do that in general.
I would actually like to see MAXBSIZE increased to 128K, so that it can be the default rsize/wsize. (I've been told that 128K is typically the block size used by ZFS. I know nothing about ZFS, but I think the person who emailed this knows ZFS pretty well.)

This comes back to my suggestion of testing with "-r 32k", since that seems to be closest to what would be desirable for a real NFS server. (But, if you have a major application that loves to do 4k reads/writes, then I understand why you would use "-r 4k".)

rick

> One is the extraneous NFS read-on-write issue I documented earlier
> today that has nothing to do with network topology or packet size.
> You might have more interest in that one.
>
> The other thing is a five-way negative interaction between 64k NFS,
> TSO, LRO, delayed ack, and congestion control. Disabling *any* one of
> them is sufficient to see significant improvement, but that does not
> serve to identify it as the cause, since it is not a unique
> characteristic. (Even if it were, that would not determine whether a
> problem was with component X or with component Y's ability to interact
> with component X.) Figuring out what's really happening has proven
> very difficult for me, largely due to my limited knowledge of these
> areas. And the learning curve on the TCP code is pretty steep.
>
> The "simple" explanation appears to be that NFS generates two packets,
> one just under 64k and one containing "the rest", and the alternating
> sizes prevent the delayed ack code from ever seeing two full-size
> segments in a row, so traffic gets pinned down to one packet per
> net.inet.tcp.delacktime (100ms default), for 10pps, as observed
> earlier. But unfortunately, like a lot of simple explanations, this
> one appears to have the disadvantage of being more or less completely
> wrong.
>
> > ps: If you had looked at the link I had in the email, you would have
> > seen that he gets very good performance once he disables TSO.
> > As they say, your mileage may vary.
>
> Pretty much every word written on this subject has come across my
> screens at this point. "Very good performance" is relative. Yes, you
> can get about 10-20x better performance by disabling TSO, at the
> expense of using vastly more CPU. Which is definitely a big
> improvement, and may be sufficient for many applications. But in
> absolute terms, the overall performance and particularly the
> efficiency remains unsatisfactory.
>
> Thanks!

From owner-freebsd-net@FreeBSD.ORG Sat Jan 25 01:07:06 2014
Date: Fri, 24 Jan 2014 20:07:05 -0500
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
From: J David
To: Rick Macklem
In-Reply-To: <179007387.16041087.1390608600325.JavaMail.root@uoguelph.ca>

On Fri, Jan 24, 2014 at 7:10 PM, Rick Macklem wrote:
> I would like to hear if you find Linux doing read before write when
> you use "-r 2k", since I think that is writing less than a page.

It doesn't. As I reported in the original test, I used an 8k rsize/wsize and a 4k write size on the Linux test, and no read-before-write was observed.

And just now I did as you asked: a 2k test with Linux mounting with 32k rsize/wsize. No extra reads, excellent performance. FreeBSD, with the same mount options, does reads even on the appends in this case and can't.
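The behaviour J David expects (and observes from the Linux client) can be captured in a toy predicate. This is a conceptual sketch, not the FreeBSD buffer-cache code, and the function name is made up: a client that caches whole blocks only needs a read when it partially overwrites a block that already holds data; a pure append past EOF should need no read at all.

```python
def needs_read_before_write(offset, length, blocksize, file_size):
    """Toy model: True if a whole-block-caching NFS client must fetch a
    block before applying this write. Appends at or past EOF need no
    read, since there is no existing data to merge with."""
    if offset >= file_size:
        return False                       # pure append: nothing to fetch
    end = offset + length
    partial_head = offset % blocksize != 0
    partial_tail = end % blocksize != 0 and end < file_size
    return partial_head or partial_tail
```

Under this model, sequential appends to a growing file never trigger reads, which matches the Linux trace; the FreeBSD client issuing reads on appends is the anomaly being reported.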
                                                     random  random
              KB  reclen   write  rewrite    read  reread     read   write
Linux    1048576       2  281082   358672  125687  121964
FreeBSD  1048576       2   59042    22624   10304    1933

For comparison, here's the same test with 32k reclen (again, both Linux and FreeBSD using 32k rsize/wsize):

                                                     random  random
              KB  reclen   write  rewrite    read  reread     read   write
Linux    1048576      32  319387   373021  411106  364393
FreeBSD  1048576      32   74892    73703   34889   66350

Unfortunately it sounds like this state of affairs isn't really going to improve, at least in the near future. If there was one area where I never thought Linux would surpass us, it was NFS. :(

Thanks!

From owner-freebsd-net@FreeBSD.ORG Sat Jan 25 01:25:23 2014
Date: Fri, 24 Jan 2014 20:25:22 -0500
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
From: J David
To: Rick Macklem
In-Reply-To: <635382404.16057591.1390611776054.JavaMail.root@uoguelph.ca>

On Fri, Jan 24, 2014 at 8:02 PM, Rick Macklem wrote:
> This comes back to my suggestion of testing with "-r 32k", since that
> seems to be closest to what would be desirable for a real NFS server.
> (But, if you have a major application that loves to do 4k reads/writes,
> then I understand why you would use "-r 4k".)

There are -r 32k examples in my previous message. The testing I am doing covers a broad spectrum of sizes from 1k to 128k, in an attempt to find which NFS settings give the best overall results across a variety of sizes, since general-purpose file storage is anything but one consistent block size. The 4k examples demonstrate the problems I am encountering vividly, so the focus has been on them.

Thanks!
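A sweep like the one J David describes can be scripted with iozone, the benchmark whose output format appears earlier in the thread. This is a sketch: `/mnt/nfs` is a hypothetical NFS mount point, and the file size is illustrative.

```shell
# Sweep record sizes over an NFS mount (mount point assumed).
# iozone tests: -i 0 = write/rewrite, -i 1 = read/reread,
# -i 2 = random read/write; -r = record size, -s = file size.
for r in 1k 4k 32k 64k 128k; do
    iozone -r "$r" -s 1g -i 0 -i 1 -i 2 -f /mnt/nfs/iozone.tmp
done
```

Running the same sweep against mounts with different rsize/wsize options is what exposes the 4k-vs-32k differences shown in the tables above.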
From owner-freebsd-net@FreeBSD.ORG Sat Jan 25 15:28:15 2014
From: Eric Masson
To: Mailing List FreeBSD Network
Subject: [FreeBSD 10.0] nat before vpn, incoming packets not translated
Date: Sat, 25 Jan 2014 16:28:10 +0100
Message-ID: <868uu4rshh.fsf@srvbsdfenssv.interne.associated-bears.org>

Hi,

I've set up a lab to experiment with a nat-before-ipsec scenario.

Architecture:
- 3 host-only interfaces have been set up on the host
- 4 FreeBSD 10 guests have been set up:
  - 2 clients connected to their respective gateways via dedicated host-only interfaces,
  - 2 gateways connected together via a dedicated host-only interface

Client 1 setup:
<----------------------------------------------------------------->
emss@client1:~ % more /etc/rc.conf
hostname="client1"
keymap="fr.iso.acc.kbd"
ifconfig_em0="inet 192.168.11.100 netmask 255.255.255.0"
ifconfig_em0_ipv6="inet6 accept_rtadv"
defaultrouter="192.168.11.15"
sshd_enable="YES"
dumpdev="AUTO"
sendmail_enable="NO"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"
<----------------------------------------------------------------->

Gateway 1 setup:
<----------------------------------------------------------------->
emss@gateway1:~ % more /etc/rc.conf
hostname="gateway1"
keymap="fr.iso.acc.kbd"
ifconfig_em1="inet 192.168.11.15 netmask 255.255.255.0"
ifconfig_em1_ipv6="inet6 accept_rtadv"
ifconfig_em0="inet 10.0.0.5 netmask 255.255.255.0"
gateway_enable="YES"
ipsec_enable="YES"
ipsec_file="/etc/ipsec.conf"
firewall_enable="YES"
firewall_script="/etc/ipfw.rules"
firewall_logging="YES"
sshd_enable="YES"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="AUTO"
sendmail_enable="NO"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"

emss@gateway1:~ % more /etc/ipfw.rules
#!/bin/sh
cmd="/sbin/ipfw"
$cmd -f flush
$cmd add 00100 nat 100 all from 192.168.11.0/24 to 192.168.21.0/24
$cmd nat 100 config log ip 172.16.0.1 reverse

emss@gateway1:~ % more /etc/ipsec.conf
flush;
spdflush;
add 10.0.0.5 10.0.0.6 esp 0x1000 -E 3des-cbc "123456789012345678901234";
add 10.0.0.6 10.0.0.5 esp 0x1001 -E 3des-cbc "432109876543210987654321";
add 10.0.0.5 10.0.0.6 ipcomp 0x2000 -C deflate;
add 10.0.0.6 10.0.0.5 ipcomp 0x2001 -C deflate;
spdadd 192.168.21.0/24 172.16.0.1/32 any -P in ipsec
  ipcomp/tunnel/10.0.0.6-10.0.0.5/require esp/tunnel/10.0.0.6-10.0.0.5/require;
spdadd 172.16.0.1/32 192.168.21.0/24 any -P out ipsec
  ipcomp/tunnel/10.0.0.5-10.0.0.6/require esp/tunnel/10.0.0.5-10.0.0.6/require;
emss@gateway1:~ % more /boot/loader.conf
ipfw_load="YES"
ipfw_nat_load="YES"
net.inet.ip.fw.default_to_accept="1"
<----------------------------------------------------------------->

Gateway 2 setup:
<----------------------------------------------------------------->
emss@gateway2:~ % more /etc/rc.conf
hostname="gateway2"
keymap="fr.iso.acc.kbd"
ifconfig_em1="inet 10.0.0.6 netmask 255.255.255.0"
ifconfig_em0="inet 192.168.21.15 netmask 255.255.255.0"
ifconfig_em0_ipv6="inet6 accept_rtadv"
gateway_enable="YES"
ipsec_enable="YES"
ipsec_file="/etc/ipsec.conf"
sshd_enable="YES"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="AUTO"
sendmail_enable="NO"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"

emss@gateway2:~ % more /etc/ipsec.conf
flush;
spdflush;
add 10.0.0.5 10.0.0.6 esp 0x1000 -E 3des-cbc "123456789012345678901234";
add 10.0.0.6 10.0.0.5 esp 0x1001 -E 3des-cbc "432109876543210987654321";
add 10.0.0.5 10.0.0.6 ipcomp 0x2000 -C deflate;
add 10.0.0.6 10.0.0.5 ipcomp 0x2001 -C deflate;
spdadd 192.168.21.0/24 172.16.0.1/32 any -P out ipsec
  ipcomp/tunnel/10.0.0.6-10.0.0.5/require esp/tunnel/10.0.0.6-10.0.0.5/require;
spdadd 172.16.0.1/32 192.168.21.0/24 any -P in ipsec
  ipcomp/tunnel/10.0.0.5-10.0.0.6/require esp/tunnel/10.0.0.5-10.0.0.6/require;
<----------------------------------------------------------------->

Client 2 setup:
<----------------------------------------------------------------->
emss@client2:~ % more /etc/rc.conf
hostname="client2"
keymap="fr.iso.acc.kbd"
ifconfig_em0="inet 192.168.21.100 netmask 255.255.255.0"
ifconfig_em0_ipv6="inet6 accept_rtadv"
defaultrouter="192.168.21.15"
sshd_enable="YES"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="AUTO"
sendmail_enable="NO"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"
<----------------------------------------------------------------->

Test setup by pinging client2 from
client1:

On client1:
emss@client1:~ % ping 192.168.21.100
PING 192.168.21.100 (192.168.21.100): 56 data bytes

On gateway1's inside interface:
root@gateway1:~ # tcpdump -i em1
17:16:08.600154 IP 192.168.11.100 > 192.168.21.100: ICMP echo request, id 10499, seq 7207, length 64
17:16:08.600660 IP 192.168.11.100 > 192.168.21.100: ICMP echo request, id 59651, seq 213, length 64
...

On gateway1's outside interface:
root@gateway1:~ # tcpdump -i em0
17:16:48.501317 IP 10.0.0.5 > 10.0.0.6: ESP(spi=0x00001000,seq=0x1ed4), length 128
17:16:48.501612 IP 10.0.0.5 > 10.0.0.6: ESP(spi=0x00001000,seq=0x1ed5), length 128
17:16:48.502665 IP 10.0.0.6 > 10.0.0.5: ESP(spi=0x00001001,seq=0x1e67), length 128
17:16:48.502938 IP 10.0.0.6 > 10.0.0.5: ESP(spi=0x00001001,seq=0x1e68), length 128
...

On client2:
root@client2:~ # tcpdump -i em0
17:14:17.671181 IP 172.16.0.1 > 192.168.21.100: ICMP echo request, id 59651, seq 107, length 64
17:14:17.671230 IP 192.168.21.100 > 172.16.0.1: ICMP echo reply, id 59651, seq 107, length 64
...

So the only remaining issue is that gateway1 doesn't NAT back the IPsec-decapsulated packets (with no NAT in the scenario, everything works fine). Setting net.inet.ip.fw.one_pass to 0 doesn't change anything.

Any idea, please?

Regards

Éric Masson
From owner-freebsd-net@FreeBSD.ORG Sat Jan 25 20:43:20 2014
Date: Sat, 25 Jan 2014 20:43:19 GMT
Message-Id: <201401252043.s0PKhJi1029537@freefall.freebsd.org>
To: beorn@binaries.fr, glebius@FreeBSD.org, freebsd-net@FreeBSD.org
From: glebius@FreeBSD.org
Subject: Re: kern/185909: [altq] [patch] ALTQ activation problem

Synopsis: [altq] [patch] ALTQ activation problem

State-Changed-From-To: open->patched
State-Changed-By: glebius
State-Changed-When: Sat Jan 25 20:39:27 UTC 2014
State-Changed-Why: Committed, thanks!

Responsible-Changed-From-To: freebsd-net->glebius
Responsible-Changed-By: glebius
Responsible-Changed-When: Sat Jan 25 20:39:27 UTC 2014
Responsible-Changed-Why: Committed, thanks!
http://www.freebsd.org/cgi/query-pr.cgi?pr=185909