Date: Mon, 2 Jul 2018 22:11:28 +0100
From: Dr Josef Karthauser <joe@truespeed.com>
To: freebsd-net@freebsd.org
Cc: David Athay <davida@truespeed.com>
Subject: epair failure in production on 11.1-STABLE (r328930) ? weird!
Message-ID: <F58994AA-5012-482D-9D80-3DB9EEC16F71@truespeed.com>
References: <20180620095844.9182416723@smtp-relay2.localdomain>

We're experiencing a strange production failure with epair (which we're using to connect vimage jails).

    FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb 6 16:05:59 GMT 2018 root@s5:/usr/obj/usr/src/sys/TRUESPEED amd64

It looks like epair has suddenly stopped forwarding packets between the pair interfaces. Our server has been up for 82 days and it's been working fine, but suddenly packets have stopped being forwarded between epairs across the entire system. (We've got around 30 epairs on the host.) The result is a sudden ARP resolution failure which is affecting all services. :(

Here's the test. On a working machine this works fine:

    # Create an epair and put an IP address on it, so we can generate ARP traffic with ping.
    root@magnesium:/usr/home/systems # ifconfig epair create
    epair7a
    root@magnesium:/usr/home/systems # ifconfig epair7a up
    root@magnesium:/usr/home/systems # ifconfig epair7b up
    root@magnesium:/usr/home/systems # ifconfig epair7a inet 10.140.0.1/30

    # Generate ARP traffic over the epair... should see ARP requests on epair7b.
    root@magnesium:/usr/home/systems # ping 10.140.0.2
    PING 10.140.0.2 (10.140.0.2): 56 data bytes

    # Watch traffic coming in from the epair
    root@magnesium:/usr/home/systems # tcpdump -i epair7b
    10:22:27.446651 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 28
    10:22:28.475086 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 28
    ^C
    2 packets captured
    2 packets received by filter
    0 packets dropped by kernel

Works fine.

However, on the failing machine no packets are forwarded any more. Remember, it had been working fine for a few months and then suddenly stopped. :(

    root@s5:/usr/home/systems # ifconfig epair create
    epair19a
    root@s5:/usr/home/systems # ifconfig epair19a up
    root@s5:/usr/home/systems # ifconfig epair7b up
    root@s5:/usr/home/systems # ifconfig epair7a inet 10.140.0.1/30
    root@s5:/usr/home/systems # ping 10.140.0.2
    PING 10.140.0.2 (10.140.0.2): 56 data bytes

    root@s5:/usr/home/systems # tcpdump -ni epair19a
    09:24:20.396384 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 28
    09:24:21.404737 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 28
    ^C

    root@s5:/usr/home/systems # tcpdump -ni epair19b
    [Tumbleweed - no traffic seen]
    ^C

Has anyone seen this before? We're going to reboot and see if that fixes the problem. The failing kernel in question is:

    FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb 6 16:05:59 GMT 2018 root@s5:/usr/obj/usr/src/sys/TRUESPEED amd64

Break break. We've just seen Bugzilla report 22710, which reports that epair fails when the queue limit (net.link.epair.netisr_maxqlen) is hit. We've just introduced a high-bandwidth service on this machine, so that's probably what caused the issue.

We've currently got a value of:

    net.link.epair.netisr_maxqlen: 2100

    root@s5:/usr/home/systems # netstat -Q
    Configuration:
    Setting                        Current        Limit
    Thread count                         1            1
    Default queue limit                256        10240
    Dispatch policy                 direct          n/a
    Threads bound to CPUs         disabled          n/a

    Protocols:
    Name   Proto QLimit Policy Dispatch Flags
    ip         1    256   flow  default   ---
    igmp       2    256 source  default   ---
    rtsock     3    256 source  default   ---
    arp        4    256 source  default   ---
    ether      5    256 source   direct   ---
    ip6        6    256   flow  default   ---
    epair      8   2100    cpu  default   CD-

    Workstreams:
    WSID CPU   Name     Len WMark     Disp'd  HDisp'd   QDrops     Queued    Handled
       0   0   ip         0   253  385468689        0        0   49360754  434829441
       0   0   igmp       0     0          0        0        0          0          0
       0   0   rtsock     0     5          0        0        0       1144       1144
       0   0   arp        0     0    5573045        0        0          0    5573045
       0   0   ether      0     0 1125223166        0        0          0 1125223166
       0   0   ip6        0     4         90        0        0    1220274    1220364
       0   0   epair      0  2100          0        0      214 4994675481 4994675481

But we can't see how much of the queue is currently being used, or what size we need to set it to. And why has hitting the queue limit broken it entirely?

Help!

Cheers,
Joe
-- 
Dr Josef Karthauser
Chief Technical Officer
(01225) 300371 / (07703) 596893
www.truespeed.com <http://www.truespeed.com/> / theTRUESPEED <http://www.facebook.com/theTRUESPEED> / @theTRUESPEED <https://twitter.com/thetruespeed>
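
On the question of how much of the epair queue is in use: a minimal sketch that pulls one protocol's queue figures out of `netstat -Q` output. The function name `netisr_queue_summary` is made up for illustration, and the script assumes the column layout shown in this message (one row per workstream, with Len, WMark and QDrops as fields 4, 5 and 8). It also assumes Len is the current queue length and WMark the peak length seen so far, which is worth checking against netstat(1).

```shell
#!/bin/sh
# Hypothetical helper: summarise one netisr protocol's queue usage.
# Reads `netstat -Q` output on stdin; protocol name defaults to epair.
netisr_queue_summary() {
    awk -v proto="${1:-epair}" '
        /^Protocols:/   { section = "protocols";   next }
        /^Workstreams:/ { section = "workstreams"; next }
        # Protocols rows: field 1 is the name, field 3 the queue limit.
        section == "protocols" && $1 == proto { qlimit = $3 }
        # Workstreams rows: field 3 is the name; fields 4, 5 and 8 are
        # Len, WMark and QDrops (layout assumed from the output above).
        section == "workstreams" && $3 == proto {
            printf "%s cpu=%s len=%s wmark=%s qlimit=%s qdrops=%s\n",
                proto, $2, $4, $5, qlimit, $8
        }
    '
}

# Sample run against the epair rows quoted in this message.
netisr_queue_summary epair <<'EOF'
Protocols:
Name   Proto QLimit Policy Dispatch Flags
epair      8   2100    cpu  default   CD-

Workstreams:
WSID CPU   Name     Len WMark     Disp'd  HDisp'd   QDrops     Queued    Handled
   0   0   epair      0  2100          0        0      214 4994675481 4994675481
EOF
```

The sample run prints `epair cpu=0 len=0 wmark=2100 qlimit=2100 qdrops=214`. On a live FreeBSD host the same function could be fed with `netstat -Q | netisr_queue_summary epair`. Under the assumptions above, a wmark equal to qlimit together with a non-zero qdrops would suggest the queue has filled at least once, while len=0 would suggest it is empty right now.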