Date:      Mon, 2 Jul 2018 22:11:28 +0100
From:      Dr Josef Karthauser <joe@truespeed.com>
To:        freebsd-net@freebsd.org
Cc:        David Athay <davida@truespeed.com>
Subject:   epair failure in production on 11.1-STABLE (r328930) ? weird!
Message-ID:  <F58994AA-5012-482D-9D80-3DB9EEC16F71@truespeed.com>
References:  <20180620095844.9182416723@smtp-relay2.localdomain>

We're experiencing a strange production failure with epair (which
we're using to talk vimage to jails).

FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb  6 16:05:59 GMT 2018     root@s5:/usr/obj/usr/src/sys/TRUESPEED  amd64

Looks like epair has suddenly stopped forwarding packets between the
pair interfaces. Our server has been up for 82 days and it's
been working fine, but suddenly packets have stopped being forwarded
between epairs across the entire system. (We've got around 30
epairs on the host.) So, we've got a sudden ARP resolution
failure which is affecting all services. :(

Here's the test. On a working machine this works fine:

	# Create an epair and put an IP address on it, so we can generate ARP traffic with ping.
	root@magnesium:/usr/home/systems # ifconfig epair create
	epair7a
	root@magnesium:/usr/home/systems # ifconfig epair7a up
	root@magnesium:/usr/home/systems # ifconfig epair7b up
	root@magnesium:/usr/home/systems # ifconfig epair7a inet 10.140.0.1/30

	# Generate ARP traffic over the epair... should see ARP requests on epair7b.
	root@magnesium:/usr/home/systems # ping 10.140.0.2
	PING 10.140.0.2 (10.140.0.2): 56 data bytes

	# Watch traffic coming in from the epair
	root@magnesium:/usr/home/systems # tcpdump -i epair7b
	10:22:27.446651 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 28
	10:22:28.475086 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 28
	^C
	2 packets captured
	2 packets received by filter
	0 packets dropped by kernel

Works fine.

However, on the failing machine we don't get any packets
forwarded (any more; remember it's been working fine
for a few months, and suddenly stopped working :( ).

	root@s5:/usr/home/systems # ifconfig epair create
	epair19a
	root@s5:/usr/home/systems # ifconfig epair19a up
	root@s5:/usr/home/systems # ifconfig epair19b up
	root@s5:/usr/home/systems # ifconfig epair19a inet 10.140.0.1/30

	root@s5:/usr/home/systems # ping 10.140.0.2
	PING 10.140.0.2 (10.140.0.2): 56 data bytes

	root@s5:/usr/home/systems # tcpdump -ni epair19a
	09:24:20.396384 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 28
	09:24:21.404737 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 28
	^C

	root@s5:/usr/home/systems # tcpdump -ni epair19b
	[Tumbleweed - no traffic seen]
	^C

Has anyone seen this before? We're going to reboot and see if
that fixes the problem.

The failing kernel in question is:

FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb  6 16:05:59 GMT 2018     root@s5:/usr/obj/usr/src/sys/TRUESPEED  amd64


Break break. We've just seen Bugzilla report 22710, which
reports that epair fails when the queue limit
(net.link.epair.netisr_maxqlen) is hit. We've just introduced a
high-bandwidth service on this machine, so that's probably
what has caused the issue.

We've currently got a value of:

	net.link.epair.netisr_maxqlen: 2100

root@s5:/usr/home/systems # netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         1            1
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs         disabled          n/a

Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1    256   flow  default   ---
igmp       2    256 source  default   ---
rtsock     3    256 source  default   ---
arp        4    256 source  default   ---
ether      5    256 source   direct   ---
ip6        6    256   flow  default   ---
epair      8   2100    cpu  default   CD-

Workstreams:
WSID CPU   Name     Len WMark     Disp'd  HDisp'd   QDrops     Queued    Handled
   0   0   ip         0   253  385468689        0        0   49360754  434829441
   0   0   igmp       0     0          0        0        0          0          0
   0   0   rtsock     0     5          0        0        0       1144       1144
   0   0   arp        0     0    5573045        0        0          0    5573045
   0   0   ether      0     0 1125223166        0        0          0 1125223166
   0   0   ip6        0     4         90        0        0    1220274    1220364
   0   0   epair      0  2100          0        0      214 4994675481 4994675481

But we can't see how much of the queue is currently being used,
or what size we need to set it to.
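
One rough way to keep an eye on this is sketched below. It assumes that, in the
Workstreams section of `netstat -Q`, Len is the current queue length, WMark is
the high-water mark the queue has reached, and QDrops counts packets dropped
because the queue was full. The function name netisr_queue_report is made up
for illustration; it only reads text on stdin, so it can also be run against
saved output.

```shell
#!/bin/sh
# Summarise netisr queue usage for one protocol from `netstat -Q` output.
# Assumption: the Workstreams columns are
#   WSID CPU Name Len WMark Disp'd HDisp'd QDrops Queued Handled
# netisr_queue_report is a hypothetical helper, not part of FreeBSD.
netisr_queue_report() {
    proto="${1:-epair}"
    awk -v proto="$proto" '
        # Only look at rows after the "Workstreams:" heading, so the
        # per-protocol configuration lines above it are ignored.
        /^Workstreams:/ { in_ws = 1; next }
        in_ws && $3 == proto {
            printf "wsid=%s cpu=%s len=%s wmark=%s qdrops=%s\n", $1, $2, $4, $5, $8
        }
    '
}

# Usage on a live host:
#   netstat -Q | netisr_queue_report epair
```

Fed the output above, this prints `wsid=0 cpu=0 len=0 wmark=2100 qdrops=214`.
If the column assumption holds, the queue has at some point filled to the
configured limit of 2100 and dropped 214 packets, although it is empty now.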

But why has hitting the queue limit broken it entirely?

Help!

Cheers,
Joe
--
Dr Josef Karthauser
Chief Technical Officer
(01225) 300371 / (07703) 596893
www.truespeed.com <http://www.truespeed.com/>
  / theTRUESPEED <http://www.facebook.com/theTRUESPEED>
  @theTRUESPEED <https://twitter.com/thetruespeed>

This email contains TrueSpeed information, which may be privileged or
confidential. It's meant only for the individual(s) or entity named
above. If you're not the intended recipient, note that disclosing,
copying, distributing or using this information is prohibited. If you've
received this email in error, please let me know immediately on the
email address above. Thank you.
We monitor our email system, and may record your emails.



