From owner-freebsd-stable@FreeBSD.ORG Wed Jul 9 14:47:36 2014
Message-ID: <53BD5607.9080406@rpi.edu>
Date: Wed, 09 Jul 2014 10:47:35 -0400
From: Bob Healey
To: FreeBSD Stable Mailing List
Cc: Hans Petter Selasky
Subject: Re: Interactions with mxge, pf, nfsd, and the kernel
References: <2035928323.8860293.1404849282793.JavaMail.root@uoguelph.ca>
In-Reply-To: <2035928323.8860293.1404849282793.JavaMail.root@uoguelph.ca>
List-Id: Production branch of FreeBSD source code

These machines' primary purpose is to provide nfs services to an HPC
cluster (12 hosts, 384 cores) on a gigabit interconnect. mxge0 is a
10G link to an HP ProCurve 2910al, which the hpc nodes are connected to,
in RFC 1918 space. bce1 is connected to the public network. bce0 is
only used for accessing the IPMI chipset. 9K packets are enabled in what
appears to be a futile attempt to improve nfs and mpi performance within
the cluster. I can revert to 1500 byte packets, and just chalk this up
to being one of those things mere mortals are not meant to meddle with.

Bob Healey
Systems Administrator
Biocomputation and Bioinformatics Constellation
and Molecularium
healer@rpi.edu
(518) 276-4407

On 7/8/2014 3:54 PM, Rick Macklem wrote:
> Bob Healey wrote:
>> I've been running one of these machines without pf, and it has ceased
>> responding on all interfaces (mxge and bce). The console still works
>> fine, and a reboot will clear the problems for now. I'm running out of
>> ideas.
>> root@helo:~ # netstat -i
>> Name    Mtu Network       Address                Ipkts  Ierrs Idrop      Opkts Oerrs  Coll
>> mxge0  9000               00:60:dd:44:d2:07   44838061 164399     0   31944144     0     0
>> mxge0  9000 fe80::260:ddf fe80::260:ddff:fe          0      -     -          3     -     -
>> bce0   1500               08:9e:01:50:a3:08      97018      0     0          0     0     0
>> bce0   1500 fe80::a9e:1ff fe80::a9e:1ff:fe5          0      -     -          3     -     -
>> bce1   1500               08:9e:01:50:a3:09  889442915   1791     0  557044449     0     0
>> bce1   1500 128.113.12.0  helo               888129846      -     -  676300451     -     -
>> bce1   1500 fe80::a9e:1ff fe80::a9e:1ff:fe5          0      -     -          4     -     -
>> lo0   16384                                      28448      0     0      28448     0     0
>> lo0   16384 localhost     ::1                       59      -     -         59     -     -
>> lo0   16384 fe80::1%lo0   fe80::1                    0      -     -          0     -     -
>> lo0   16384 your-net      localhost              28389      -     -      28389     -     -
>> vlan2  9000               00:60:dd:44:d2:07   28107520      0     0   19859118     0     0
>> vlan2  9000 10.2.3.0      helo.galactica.lo   28088754      -     -   24433917     -     -
>> vlan2  9000 fe80::260:ddf fe80::260:ddff:fe          0      -     -          3     -     -
>> vlan2  9000               00:60:dd:44:d2:07   16730541      0     0   12084894     0     0
>> vlan2  9000 10.2.4.0      helo.enterprise.l   16724370      -     -   12924742     -     -
>> vlan2  9000 fe80::260:ddf fe80::260:ddff:fe          0      -     -          3     -     -
>> root@helo:~ # netstat -m
>> 7632/6798/14430 mbufs in use (current/cache/total)
>> 4186/2886/7072/1018944 mbuf clusters in use (current/cache/total/max)
>> 4080/1420 mbuf+clusters out of packet secondary zone in use (current/cache)
>> 0/6/6/509472 4k (page size) jumbo clusters in use (current/cache/total/max)
>> 593/25/618/150954 9k jumbo clusters in use (current/cache/total/max)
> Hmm, since you are using jumbo clusters, running out of kernel address
> space such that it can no longer allocate boundary tags might be a
> possibility.
>
> Do a "ps axHl" and look for any threads with a WCHAN of "btallo". If you
> find any of those, this is definitely what is happening. (Unfortunately,
> for the case of M_NOWAIT, the threads will just be in "R" state when this
> happens, since they never do a pause("btalloc").)
>
> From what little I understand (and saw when I had it happen while testing
> NFS using PAGE_SIZE clusters), once this happens, pretty well all
> uma_zalloc()s fail (which implies all mbuf allocations). As such, the
> machine is pretty well dead w.r.t. networking.
>
> If you can run without jumbo clusters, I think that would be worth a try.
>
> Good luck with it, rick
> ps: I've added Hans to the cc list, since he is proposing a case where
> jumbo clusters would be used more and this can be problematic.
>
>> 0/0/0/84912 16k jumbo clusters in use (current/cache/total/max)
>> 15617K/7720K/23337K bytes allocated to network (current/cache/total)
>> 3/72461/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
>> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
>> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
>> 122/391912/0 requests for jumbo clusters denied (4k/9k/16k)
>> 0 requests for sfbufs denied
>> 0 requests for sfbufs delayed
>> 0 requests for I/O initiated by sendfile
>> root@helo:~ # uptime
>> 9:07AM up 12 days, 8:15, 1 user, load averages: 0.19, 0.19, 0.20
>> root@helo:~ # ifconfig
>> mxge0: flags=8843 metric 0 mtu 9000
>>         options=6c03bb
>>         ether 00:60:dd:44:d2:07
>>         inet6 fe80::260:ddff:fe44:d207%mxge0 prefixlen 64 scopeid 0x1
>>         nd6 options=29
>>         media: Ethernet 10Gbase-CX4
>>         status: active
>> bce0: flags=8843 metric 0 mtu 1500
>>         options=c01bb
>>         ether 08:9e:01:50:a3:08
>>         inet6 fe80::a9e:1ff:fe50:a308%bce0 prefixlen 64 scopeid 0x2
>>         nd6 options=29
>>         media: Ethernet autoselect (1000baseT )
>>         status: active
>> bce1: flags=8843 metric 0 mtu 1500
>>         options=c01bb
>>         ether 08:9e:01:50:a3:09
>>         inet 128.113.12.134 netmask 0xffffff00 broadcast 128.113.12.255
>>         inet6 fe80::a9e:1ff:fe50:a309%bce1 prefixlen 64 scopeid 0x3
>>         nd6 options=29
>>         media: Ethernet autoselect (1000baseT )
>>         status: active
>> lo0: flags=8049 metric 0 mtu 16384
>>         options=600003
>>         inet6 ::1 prefixlen 128
>>         inet6 fe80::1%lo0 prefixlen 64 scopeid 0x4
>>         inet 127.0.0.1 netmask 0xff000000
>>         nd6 options=21
>> vlan23: flags=8843 metric 0 mtu 9000
>>         options=303
>>         ether 00:60:dd:44:d2:07
>>         inet 10.2.3.244 netmask 0xffffff00 broadcast 10.2.3.255
>>         inet6 fe80::260:ddff:fe44:d207%vlan23 prefixlen 64 scopeid 0x5
>>         nd6 options=29
>>         media: Ethernet 10Gbase-CX4
>>         status: active
>>         vlan: 23 parent interface: mxge0
>> vlan24: flags=8843 metric 0 mtu 9000
>>         options=303
>>         ether 00:60:dd:44:d2:07
>>         inet 10.2.4.244 netmask 0xffffff00 broadcast 10.2.4.255
>>         inet6 fe80::260:ddff:fe44:d207%vlan24 prefixlen 64 scopeid 0x6
>>         nd6 options=29
>>         media: Ethernet 10Gbase-CX4
>>         status: active
>>         vlan: 24 parent interface: mxge0
>> rc.conf:
>> hostname="helo.bio.rpi.edu"
>> ifconfig_bce1=" inet 128.113.12.134 netmask 0xffffff00"
>> ifconfig_mxge0="up mtu 9000"
>> ifconfig_bce0="up"
>> cloned_interfaces="vlan23 vlan24"
>> ifconfig_vlan23="inet 10.2.3.244 netmask 255.255.255.0 vlan 23 vlandev mxge0"
>> ifconfig_vlan24="inet 10.2.4.244 netmask 255.255.255.0 vlan 24 vlandev mxge0"
>> defaultrouter="128.113.12.254"
>> sshd_enable="YES"
>> ntpd_enable="YES"
>> powerd_enable="YES"
>> # Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
>> dumpdev="NO"
>> zfs_enable="YES"
>> nisdomainname="GALACTICA.BIO.RPI.EDU"
>> ntpdate_enable="YES"
>> ntpdate_hosts="ntp.rpi.edu"
>> rpc_lockd_enable="YES"
>> rpc_statd_enable="YES"
>> rpcbind_enable="YES"
>> nis_client_enable="YES"
>> nis_client_flags="-m -S GALACTICA.BIO.RPI.EDU,adama.galactica.local"
>> nfs_server_enable="YES"
>> mountd_enable="YES"
>> nfsd_enable="YES"
>> apcupsd_enable="YES"
>> #pf_enable="YES"
>> netwait_enable="YES"
>> netwait_ip="128.113.12.254"
>> netwait_if="mxge0"
>> static_routes="management"
>> route_management="-net 10.1.1.0/24 10.2.3.254"
>> amd_enable="YES" # Run amd service with $amd_flags (or NO).
>> amd_flags="-a /.amd_mnt -l syslog /home amd.home"
>> amd_map_program="NO" # Can be set to "ypcat -k amd.master"
>> root@helo:~ # uname -a
>> FreeBSD helo.bio.rpi.edu 10.0-RELEASE-p4 FreeBSD 10.0-RELEASE-p4 #0: Tue Jun 3 13:14:57 UTC 2014
>>     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
>>
>> Bob Healey
>> Systems Administrator
>> Biocomputation and Bioinformatics Constellation
>> and Molecularium
>> healer@rpi.edu
>> (518) 276-4407
>>
>> On 7/2/2014 11:11 AM, Bob Healey wrote:
>>> Hello.
>>>
>>> I've been wrestling with this on and off for a few months now. I have
>>> an assortment of systems (some Dell PowerEdge R515, R610, and IBM
>>> x3630M3) with 10 gig Myricom ethernet cards acting as nfs servers to
>>> Linux HPC compute clusters (12-36 nodes, 384-480 cores) connected
>>> via gigabit ethernet. They are also connected to the outside world
>>> via onboard bce (Dell) or igb (IBM). After a variable length of time,
>>> I will lose all network access to a host. When I connect via the
>>> console, the machine tends to be fully responsive. A reboot clears the
>>> problem, but I have yet to figure out any sysctls/loader.conf tunables
>>> to clear the problem and make it stay away. PF is in use to restrict
>>> access to the host to a pair of public /24's, and to 10/8. If there is
>>> a way in zfs's sharenfs property to make that restriction, I'd be happy
>>> to change, but I really don't like leaving nfs open to the university's
>>> quartet of /16's, so PF it is. The vlan2 interface has mxge0 as its
>>> parent.
>>>
>>> Thanks for any help.
>>>
>>> This host is getting ready to crash soon, based on netstat.
>>> root@husker:~ # netstat -i
>>> Name    Mtu Network       Address                Ipkts  Ierrs Idrop      Opkts Oerrs  Coll
>>> mxge0  9000               00:60:dd:44:d2:0a    6358280    262     0    4061637     0     0
>>> mxge0  9000 fe80::260:ddf fe80::260:ddff:fe          0      -     -          2     -     -
>>> bce0   1500               08:9e:01:50:a1:ac     276391      0     0          0     0     0
>>> bce0   1500 fe80::a9e:1ff fe80::a9e:1ff:fe5          0      -     -          3     -     -
>>> bce1   1500               08:9e:01:50:a1:ad 2229709391  16921     0 1182942116     0     0
>>> bce1   1500 128.113.12.0  husker            2226254093      -     - 1183962005     -     -
>>> bce1   1500 fe80::a9e:1ff fe80::a9e:1ff:fe5          0      -     -          3     -     -
>>> lo0   16384                                       2030      0     0       2030     0     0
>>> lo0   16384 localhost     ::1                        4      -     -          4     -     -
>>> lo0   16384 fe80::1%lo0   fe80::1                    0      -     -          0     -     -
>>> lo0   16384 your-net      localhost               2026      -     -       2026     -     -
>>> vlan2  9000               00:60:dd:44:d2:0a    4387250      0     0    3060586     0     0
>>> vlan2  9000 10.2.3.0      husker.galactica.    4370309      -     -    3963931     -     -
>>> vlan2  9000 fe80::260:ddf fe80::260:ddff:fe          0      -     -          2     -     -
>>> vlan2  9000               00:60:dd:44:d2:0a    1971034      0     0    1001061     0     0
>>> vlan2  9000 10.2.4.0      husker.enterprise    1700742      -     -    1961891     -     -
>>> vlan2  9000 fe80::260:ddf fe80::260:ddff:fe          0      -     -          4     -     -
>>> root@husker:~ # netstat -im
>>> 6157/3233/9390 mbufs in use (current/cache/total)
>>> 4081/1883/5964/1018800 mbuf clusters in use (current/cache/total/max)
>>> 4080/795 mbuf+clusters out of packet secondary zone in use (current/cache)
>>> 0/5/5/509399 4k (page size) jumbo clusters in use (current/cache/total/max)
>>> 512/23/535/150933 9k jumbo clusters in use (current/cache/total/max)
>>> 0/0/0/84899 16k jumbo clusters in use (current/cache/total/max)
>>> 14309K/4801K/19110K bytes allocated to network (current/cache/total)
>>> 10/1883/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
>>> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
>>> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
>>> 2/1736/0 requests for jumbo clusters denied (4k/9k/16k)
>>> 0 requests for sfbufs denied
>>> 0 requests for sfbufs delayed
>>> 0 requests for I/O initiated by sendfile
>>> root@husker:~ # uptime
>>> 11:07AM up 23 days, 19:27, 1 user, load averages: 0.14, 0.17, 0.13
>>> root@husker:~ # sysctl -a | grep nmb
>>> kern.ipc.nmbclusters: 1018800
>>> kern.ipc.nmbjumbop: 509399
>>> kern.ipc.nmbjumbo9: 452799
>>> kern.ipc.nmbjumbo16: 339596
>>> kern.ipc.nmbufs: 6520320
>>> root@husker:~ # cat /boot/loader.conf
>>> zfs_load="YES"
>>> amdtemp_load="YES"
>>> if_mxge_load="YES"
>>> mxge_ethp_z8e_load="YES"
>>> mxge_eth_z8e_load="YES"
>>> mxge_rss_ethp_z8e_load="YES"
>>> mxge_rss_eth_z8e_load="YES"
>>> vfs.zfs.arc_max="12288M"
>>> root@husker:~ # cat /var/run/dmesg.boot | head -16
>>> Copyright (c) 1992-2014 The FreeBSD Project.
>>> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>>>         The Regents of the University of California. All rights reserved.
>>> FreeBSD is a registered trademark of The FreeBSD Foundation.
>>> FreeBSD 10.0-RELEASE-p4 #0: Tue Jun 3 13:14:57 UTC 2014
>>>     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
>>> FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610
>>> CPU: AMD Opteron(tm) Processor 4122 (2200.07-MHz K8-class CPU)
>>>   Origin = "AuthenticAMD"  Id = 0x100f80  Family = 0x10  Model = 0x8  Stepping = 0
>>>   Features=0x178bfbff
>>>   Features2=0x802009
>>>   AMD Features=0xee500800
>>>   AMD Features2=0x837ff
>>>   TSC: P-state invariant
>>> real memory  = 17179869184 (16384 MB)
>>> avail memory = 16588054528 (15819 MB)
>>>
>>>
>> _______________________________________________
>> freebsd-stable@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
>> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
>>
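
The "requests ... denied" counters in the netstat -m output quoted in this
thread are the numbers worth tracking between reboots. What follows is only
a minimal Python sketch for pulling them out of saved output. The function
name parse_denied and the regular expression are assumptions written against
the 10.0-RELEASE output shown in this thread; they are not part of any
FreeBSD tool, and other releases may word these lines differently.

```python
import re

# Matches lines such as:
#   122/391912/0 requests for jumbo clusters denied (4k/9k/16k)
# Assumption: this wording follows the FreeBSD 10.0 output quoted in the thread.
DENIED_RE = re.compile(
    r"^(?P<counts>\d+(?:/\d+)*) requests for (?P<what>.+?) denied "
    r"\((?P<labels>[^)]+)\)$"
)


def parse_denied(netstat_m_text):
    """Return {"<what>:<label>": count} for each labelled 'denied' line.

    Lines without a parenthesised label list (for example
    "0 requests for sfbufs denied") are skipped.
    """
    denied = {}
    for raw in netstat_m_text.splitlines():
        # Tolerate mail quoting such as ">> " in front of the line.
        line = raw.strip().lstrip(">").strip()
        match = DENIED_RE.match(line)
        if match is None:
            continue
        counts = [int(c) for c in match.group("counts").split("/")]
        labels = match.group("labels").split("/")
        for label, count in zip(labels, counts):
            denied[match.group("what") + ":" + label] = count
    return denied
```

Feeding it the helo output from this thread should yield a large value under
the "jumbo clusters:9k" key, which fits the suggestion to try running without
jumbo clusters. Comparing two snapshots taken some hours apart would show
whether the counters are still climbing.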