From: Dr Josef Karthauser
Date: Mon, 2 Jul 2018 22:11:28 +0100
To: freebsd-net@freebsd.org
Cc: David Athay
Subject: epair failure in production on 11.1-STABLE (r328930) ? weird!

We’re experiencing a strange failure in production with epair (which we’re using to talk to VIMAGE jails).

    FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb 6 16:05:59 GMT 2018 root@s5:/usr/obj/usr/src/sys/TRUESPEED amd64

Looks like epair has suddenly stopped forwarding packets between the pair interfaces. Our server has been up for 82 days and it’s been working fine, but packets have suddenly stopped being forwarded between epairs across the entire system. (We’ve got around 30 epairs on the host.) The result is a sudden ARP resolution failure which is affecting all services. :(

Here’s the test. On a working machine this works fine:

    # Create an epair and put an IP address on it, so we can generate ARP traffic with ping.
    root@magnesium:/usr/home/systems # ifconfig epair create
    epair7a
    root@magnesium:/usr/home/systems # ifconfig epair7a up
    root@magnesium:/usr/home/systems # ifconfig epair7b up
    root@magnesium:/usr/home/systems # ifconfig epair7a inet 10.140.0.1/30

    # Generate ARP traffic over the epair… should see ARP requests on epair7b.
    root@magnesium:/usr/home/systems # ping 10.140.0.2
    PING 10.140.0.2 (10.140.0.2): 56 data bytes

    # Watch traffic coming in from the epair
    root@magnesium:/usr/home/systems # tcpdump -i epair7b
    10:22:27.446651 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 28
    10:22:28.475086 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 28
    ^C
    2 packets captured
    2 packets received by filter
    0 packets dropped by kernel

Works fine.

However, on the failing machine we don’t get any packets forwarded any more (remember, it’s been working fine for a few months and has only now stopped working :( ).

    root@s5:/usr/home/systems # ifconfig epair create
    epair19a
    root@s5:/usr/home/systems # ifconfig epair19a up
    root@s5:/usr/home/systems # ifconfig epair7b up
    root@s5:/usr/home/systems # ifconfig epair7a inet 10.140.0.1/30
    root@s5:/usr/home/systems # ping 10.140.0.2
    PING 10.140.0.2 (10.140.0.2): 56 data bytes

    root@s5:/usr/home/systems # tcpdump -ni epair19a
    09:24:20.396384 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 28
    09:24:21.404737 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 28
    ^C

    root@s5:/usr/home/systems # tcpdump -ni epair19b
    [Tumbleweed - no traffic seen]
    ^C

Has anyone seen this before? We’re going to reboot and see if that fixes the problem.

The failing kernel in question is:

    FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb 6 16:05:59 GMT 2018 root@s5:/usr/obj/usr/src/sys/TRUESPEED amd64

Break break. We’ve just seen Bugzilla report 22710, which reports that epair fails when the queue limit (net.link.epair.netisr_maxqlen) is hit. We’ve just introduced a high-bandwidth service on this machine, so that’s probably what caused the issue.
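
One way to check whether the epair netisr queue has ever filled up is to pull the epair row out of the workstream table printed by `netstat -Q` and look at its high-watermark and drop counters. The following is only a sketch: it assumes the ten-column workstream layout (WSID, CPU, Name, Len, WMark, Disp'd, HDisp'd, QDrops, Queued, Handled) that netstat prints on FreeBSD 11, and the function name `epair_queue_stats` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -Q` output on stdin and print the
# high watermark and drop count for each epair workstream row.
# Assumed column layout of a workstream row:
#   WSID CPU Name Len WMark Disp'd HDisp'd QDrops Queued Handled
epair_queue_stats() {
    # NF == 10 skips the six-column "Protocols" row, which also names epair.
    awk '$3 == "epair" && NF == 10 {
        printf "cpu=%s wmark=%s qdrops=%s\n", $2, $5, $8
    }'
}

# Usage on a live host (assumes FreeBSD netstat and sysctl are available):
#   netstat -Q | epair_queue_stats
#   sysctl net.link.epair.netisr_maxqlen
```

A watermark equal to the configured limit, together with a non-zero drop count, would suggest that the queue has filled at least once since boot.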

We’ve currently got a value of:

    net.link.epair.netisr_maxqlen: 2100

    root@s5:/usr/home/systems # netstat -Q
    Configuration:
    Setting                        Current        Limit
    Thread count                         1            1
    Default queue limit                256        10240
    Dispatch policy                 direct          n/a
    Threads bound to CPUs         disabled          n/a

    Protocols:
    Name   Proto QLimit Policy Dispatch Flags
    ip         1    256   flow  default   ---
    igmp       2    256 source  default   ---
    rtsock     3    256 source  default   ---
    arp        4    256 source  default   ---
    ether      5    256 source   direct   ---
    ip6        6    256   flow  default   ---
    epair      8   2100    cpu  default   CD-

    Workstreams:
    WSID CPU   Name     Len WMark      Disp'd  HDisp'd   QDrops      Queued     Handled
       0   0   ip         0   253   385468689        0        0    49360754   434829441
       0   0   igmp       0     0           0        0        0           0           0
       0   0   rtsock     0     5           0        0        0        1144        1144
       0   0   arp        0     0     5573045        0        0           0     5573045
       0   0   ether      0     0  1125223166        0        0           0  1125223166
       0   0   ip6        0     4          90        0        0     1220274     1220364
       0   0   epair      0  2100           0        0      214  4994675481  4994675481

But we can’t see how much of the queue is currently being used, or what size we need to set it to.

And why has hitting the queue limit broken it entirely?

Help!

Cheers,
Joe

--
Dr Josef Karthauser
Chief Technical Officer
(01225) 300371 / (07703) 596893
www.truespeed.com / theTRUESPEED / @theTRUESPEED