From: Bob Healey
Date: Tue, 08 Jul 2014 09:14:53 -0400
To: freebsd-stable@freebsd.org
Subject: Re: Interactions with mxge, pf, nfsd, and the kernel
Message-ID: <53BBEECD.6000709@rpi.edu>
References: <53B42139.302@rpi.edu>
In-Reply-To: <53B42139.302@rpi.edu>

I've been running one of these machines without pf, and it has ceased
responding on all interfaces (mxge and bce). The console still works
fine, and a reboot will clear the problems for now. I'm running out of
ideas.

root@helo:~ # netstat -i
Name    Mtu Network       Address                Ipkts  Ierrs Idrop      Opkts Oerrs  Coll
mxge0  9000               00:60:dd:44:d2:07   44838061 164399     0   31944144     0     0
mxge0  9000 fe80::260:ddf fe80::260:ddff:fe          0      -     -          3     -     -
bce0   1500               08:9e:01:50:a3:08      97018      0     0          0     0     0
bce0   1500 fe80::a9e:1ff fe80::a9e:1ff:fe5          0      -     -          3     -     -
bce1   1500               08:9e:01:50:a3:09  889442915   1791     0  557044449     0     0
bce1   1500 128.113.12.0  helo               888129846      -     -  676300451     -     -
bce1   1500 fe80::a9e:1ff fe80::a9e:1ff:fe5          0      -     -          4     -     -
lo0   16384                                      28448      0     0      28448     0     0
lo0   16384 localhost     ::1                       59      -     -         59     -     -
lo0   16384 fe80::1%lo0   fe80::1                    0      -     -          0     -     -
lo0   16384 your-net      localhost              28389      -     -      28389     -     -
vlan2  9000               00:60:dd:44:d2:07   28107520      0     0   19859118     0     0
vlan2  9000 10.2.3.0      helo.galactica.lo   28088754      -     -   24433917     -     -
vlan2  9000 fe80::260:ddf fe80::260:ddff:fe          0      -     -          3     -     -
vlan2  9000               00:60:dd:44:d2:07   16730541      0     0   12084894     0     0
vlan2  9000 10.2.4.0      helo.enterprise.l   16724370      -     -   12924742     -     -
vlan2  9000 fe80::260:ddf fe80::260:ddff:fe          0      -     -          3     -     -
root@helo:~ # netstat -m
7632/6798/14430 mbufs in use (current/cache/total)
4186/2886/7072/1018944 mbuf clusters in use (current/cache/total/max)
4080/1420 mbuf+clusters out of packet secondary zone in use (current/cache)
0/6/6/509472 4k (page size) jumbo clusters in use
(current/cache/total/max)
593/25/618/150954 9k jumbo clusters in use (current/cache/total/max)
0/0/0/84912 16k jumbo clusters in use (current/cache/total/max)
15617K/7720K/23337K bytes allocated to network (current/cache/total)
3/72461/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
122/391912/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
root@helo:~ # uptime
 9:07AM up 12 days, 8:15, 1 user, load averages: 0.19, 0.19, 0.20
root@helo:~ # ifconfig
mxge0: flags=8843 metric 0 mtu 9000
        options=6c03bb
        ether 00:60:dd:44:d2:07
        inet6 fe80::260:ddff:fe44:d207%mxge0 prefixlen 64 scopeid 0x1
        nd6 options=29
        media: Ethernet 10Gbase-CX4
        status: active
bce0: flags=8843 metric 0 mtu 1500
        options=c01bb
        ether 08:9e:01:50:a3:08
        inet6 fe80::a9e:1ff:fe50:a308%bce0 prefixlen 64 scopeid 0x2
        nd6 options=29
        media: Ethernet autoselect (1000baseT )
        status: active
bce1: flags=8843 metric 0 mtu 1500
        options=c01bb
        ether 08:9e:01:50:a3:09
        inet 128.113.12.134 netmask 0xffffff00 broadcast 128.113.12.255
        inet6 fe80::a9e:1ff:fe50:a309%bce1 prefixlen 64 scopeid 0x3
        nd6 options=29
        media: Ethernet autoselect (1000baseT )
        status: active
lo0: flags=8049 metric 0 mtu 16384
        options=600003
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x4
        inet 127.0.0.1 netmask 0xff000000
        nd6 options=21
vlan23: flags=8843 metric 0 mtu 9000
        options=303
        ether 00:60:dd:44:d2:07
        inet 10.2.3.244 netmask 0xffffff00 broadcast 10.2.3.255
        inet6 fe80::260:ddff:fe44:d207%vlan23 prefixlen 64 scopeid 0x5
        nd6 options=29
        media: Ethernet 10Gbase-CX4
        status: active
        vlan: 23 parent interface: mxge0
vlan24: flags=8843 metric 0 mtu 9000
        options=303
        ether 00:60:dd:44:d2:07
        inet 10.2.4.244 netmask 0xffffff00 broadcast 10.2.4.255
        inet6 fe80::260:ddff:fe44:d207%vlan24 prefixlen 64 scopeid 0x6
        nd6 options=29
        media: Ethernet 10Gbase-CX4
        status: active
        vlan: 24 parent interface: mxge0

rc.conf:
hostname="helo.bio.rpi.edu"
ifconfig_bce1=" inet 128.113.12.134 netmask 0xffffff00"
ifconfig_mxge0="up mtu 9000"
ifconfig_bce0="up"
cloned_interfaces="vlan23 vlan24"
ifconfig_vlan23="inet 10.2.3.244 netmask 255.255.255.0 vlan 23 vlandev mxge0"
ifconfig_vlan24="inet 10.2.4.244 netmask 255.255.255.0 vlan 24 vlandev mxge0"
defaultrouter="128.113.12.254"
sshd_enable="YES"
ntpd_enable="YES"
powerd_enable="YES"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="NO"
zfs_enable="YES"
nisdomainname="GALACTICA.BIO.RPI.EDU"
ntpdate_enable="YES"
ntpdate_hosts="ntp.rpi.edu"
rpc_lockd_enable="YES"
rpc_statd_enable="YES"
rpcbind_enable="YES"
nis_client_enable="YES"
nis_client_flags="-m -S GALACTICA.BIO.RPI.EDU,adama.galactica.local"
nfs_server_enable="YES"
mountd_enable="YES"
nfsd_enable="YES"
apcupsd_enable="YES"
#pf_enable="YES"
netwait_enable="YES"
netwait_ip="128.113.12.254"
netwait_if="mxge0"
static_routes="management"
route_management="-net 10.1.1.0/24 10.2.3.254"
amd_enable="YES" # Run amd service with $amd_flags (or NO).
amd_flags="-a /.amd_mnt -l syslog /home amd.home"
amd_map_program="NO" # Can be set to "ypcat -k amd.master"

root@helo:~ # uname -a
FreeBSD helo.bio.rpi.edu 10.0-RELEASE-p4 FreeBSD 10.0-RELEASE-p4 #0: Tue Jun 3 13:14:57 UTC 2014 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

Bob Healey
Systems Administrator
Biocomputation and Bioinformatics
Constellation and Molecularium
healer@rpi.edu
(518) 276-4407

On 7/2/2014 11:11 AM, Bob Healey wrote:
> Hello.
>
> I've been wrestling with this on and off for a few months now. I have
> an assortment of systems (some Dell Poweredge R515, R610, and IBM
> x3630M3) with 10 gig Myricom ethernet cards acting as nfs servers to
> Linux HPC compute clusters (12-36 nodes, 384 - 480 cores) connected
> via gigabit ethernet.
> They are also connected to the outside world
> via onboard bce (Dell) or igb (IBM). After a variable length of time,
> I will lose all network access to a host. Connecting via console, the
> machine tends to be fully responsive. A reboot clears the problem, but
> I have yet to figure out any sysctls/loader.conf tunables to clear the
> problem and make it stay away. PF is in use to restrict access to the
> host to a pair of public /24's, and to 10/8. If there is a way in
> zfs's sharenfs property to make that restriction, I'd be happy to
> change, but I really don't like leaving nfs open to the university's
> quartet of /16's, so PF it is. The vlan2 interface has mxge0 as its
> parent.
>
> Thanks for any help.
>
> This host is getting ready to crash soon, based on netstat.
> root@husker:~ # netstat -i
> Name    Mtu Network       Address                Ipkts  Ierrs Idrop      Opkts Oerrs  Coll
> mxge0  9000               00:60:dd:44:d2:0a    6358280    262     0    4061637     0     0
> mxge0  9000 fe80::260:ddf fe80::260:ddff:fe          0      -     -          2     -     -
> bce0   1500               08:9e:01:50:a1:ac     276391      0     0          0     0     0
> bce0   1500 fe80::a9e:1ff fe80::a9e:1ff:fe5          0      -     -          3     -     -
> bce1   1500               08:9e:01:50:a1:ad 2229709391  16921     0 1182942116     0     0
> bce1   1500 128.113.12.0  husker            2226254093      -     - 1183962005     -     -
> bce1   1500 fe80::a9e:1ff fe80::a9e:1ff:fe5          0      -     -          3     -     -
> lo0   16384                                       2030      0     0       2030     0     0
> lo0   16384 localhost     ::1                        4      -     -          4     -     -
> lo0   16384 fe80::1%lo0   fe80::1                    0      -     -          0     -     -
> lo0   16384 your-net      localhost               2026      -     -       2026     -     -
> vlan2  9000               00:60:dd:44:d2:0a    4387250      0     0    3060586     0     0
> vlan2  9000 10.2.3.0      husker.galactica.    4370309      -     -    3963931     -     -
> vlan2  9000 fe80::260:ddf fe80::260:ddff:fe          0      -     -          2     -     -
> vlan2  9000               00:60:dd:44:d2:0a    1971034      0     0    1001061     0     0
> vlan2  9000 10.2.4.0      husker.enterprise    1700742      -     -    1961891     -     -
> vlan2  9000 fe80::260:ddf fe80::260:ddff:fe          0      -     -          4     -     -
> root@husker:~ # netstat -im
> 6157/3233/9390 mbufs in use (current/cache/total)
> 4081/1883/5964/1018800 mbuf clusters in use (current/cache/total/max)
> 4080/795 mbuf+clusters out of packet secondary zone in use (current/cache)
> 0/5/5/509399 4k (page size) jumbo clusters in use (current/cache/total/max)
> 512/23/535/150933 9k jumbo clusters in use (current/cache/total/max)
> 0/0/0/84899 16k jumbo clusters in use (current/cache/total/max)
> 14309K/4801K/19110K bytes allocated to network (current/cache/total)
> 10/1883/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
> 2/1736/0 requests for jumbo clusters denied (4k/9k/16k)
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> root@husker:~ # uptime
> 11:07AM up 23 days, 19:27, 1 user, load averages: 0.14, 0.17, 0.13
> root@husker:~ # sysctl -a | grep nmb
> kern.ipc.nmbclusters: 1018800
> kern.ipc.nmbjumbop: 509399
> kern.ipc.nmbjumbo9: 452799
> kern.ipc.nmbjumbo16: 339596
> kern.ipc.nmbufs: 6520320
> root@husker:~ # cat /boot/loader.conf
> zfs_load="YES"
> amdtemp_load="YES"
> if_mxge_load="YES"
> mxge_ethp_z8e_load="YES"
> mxge_eth_z8e_load="YES"
> mxge_rss_ethp_z8e_load="YES"
> mxge_rss_eth_z8e_load="YES"
> vfs.zfs.arc_max="12288M"
> root@husker:~ # cat /var/run/dmesg.boot | head -16
> Copyright (c) 1992-2014 The FreeBSD Project.
> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>         The Regents of the University of California. All rights reserved.
> FreeBSD is a registered trademark of The FreeBSD Foundation.
> FreeBSD 10.0-RELEASE-p4 #0: Tue Jun 3 13:14:57 UTC 2014
>     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
> FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610
> CPU: AMD Opteron(tm) Processor 4122 (2200.07-MHz K8-class CPU)
>   Origin = "AuthenticAMD" Id = 0x100f80 Family = 0x10 Model = 0x8 Stepping = 0
>   Features=0x178bfbff
>   Features2=0x802009
>   AMD Features=0xee500800
>   AMD Features2=0x837ff
>   TSC: P-state invariant
> real memory = 17179869184 (16384 MB)
> avail memory = 16588054528 (15819 MB)
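
For anyone wanting to track the "denied" counters from the netstat -m
output over time, a minimal Python sketch follows. It is not part of the
original report: the function names (denied_counters, nonzero_denials)
and the key format are made up for illustration, and the sample text is
simply the four request lines copied from the helo output in this
message. It only parses text and makes no claim about what causes the
denials.

```python
import re

# Sample lines copied from the helo `netstat -m` output in this message.
SAMPLE = """\
3/72461/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
122/391912/0 requests for jumbo clusters denied (4k/9k/16k)
"""

# Matches "a/b/c requests for <kind> denied (x/y/z)", optionally behind
# one or more "> " mail-quote prefixes. "delayed" lines are ignored.
DENIED_RE = re.compile(
    r"^(?:>\s*)*\s*(\d+)/(\d+)/(\d+) requests for "
    r"(mbufs|jumbo clusters) denied \(([^)]*)\)"
)


def denied_counters(text):
    """Return {"<kind>:<label>": count} for each 'denied' line (hypothetical helper)."""
    counters = {}
    for line in text.splitlines():
        match = DENIED_RE.match(line)
        if match is None:
            continue
        counts = [int(match.group(i)) for i in (1, 2, 3)]
        kind = match.group(4)
        labels = match.group(5).split("/")
        for label, count in zip(labels, counts):
            counters[f"{kind}:{label}"] = count
    return counters


def nonzero_denials(text):
    """Keep only the counters that are greater than zero (hypothetical helper)."""
    return {key: value for key, value in denied_counters(text).items() if value > 0}


if __name__ == "__main__":
    for key, value in sorted(nonzero_denials(SAMPLE).items()):
        print(f"{key}: {value}")
```

Feeding it the output of netstat -m at intervals and comparing the
results would show whether the cluster and 9k jumbo denials keep
climbing before a host drops off the network.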