Date:      Sat, 19 Oct 2019 08:41:46 +0200
From:      Michael Tuexen <tuexen@freebsd.org>
To:        Paul <devgs@ukr.net>
Cc:        freebsd-net@freebsd.org, freebsd-stable@freebsd.org
Subject:   Re: Network anomalies after update from 11.2 STABLE to 12.1 STABLE
Message-ID:  <D699E191-6FDB-4054-B169-FC081861AFD3@freebsd.org>
In-Reply-To: <1571398510.796520000.8iwbi4pd@frv39.fwdcdn.com>
References:  <1571398510.796520000.8iwbi4pd@frv39.fwdcdn.com>

> On 18. Oct 2019, at 14:57, Paul <devgs@ukr.net> wrote:
>
> Our current version is:
>
>   FreeBSD 11.2-STABLE #0 r340725
>
> New version that we have problems with:
>
>   FreeBSD 12.1-STABLE #5 r352893
>
>
> After update to new version we have started to observe an incredible number of
> errors in HTTP requests in between various services in our system. This problem
> appeared on all the servers that were upgraded, and seems to not be specific to
> concrete network card: we use different models, all are affected.
>
> During various tests, we observed a lot of spontaneous TCP stream abortions,
> including at the establishment stage (SYN) in cases that were 100% issue free
> on 11.2-STABLE. Concrete test cases will be shown below.
>
> We also want to highlight that, on numerous occasions, we have observed random,
> huge ACK indices in a first response to a SYN packet, instead of 1, as expected.
> This forces client to abort connection via RST.
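For reference, a minimal sketch of the arithmetic behind "instead of 1", assuming the "ACK index" above is the relative acknowledgment number shown by a packet analyzer. A SYN consumes one sequence number, so the SYN|ACK should acknowledge the client's initial sequence number plus one, modulo 2^32. The function names are made up for illustration.

```python
# TCP sequence numbers are 32 bits wide and wrap around.
MOD = 2 ** 32


def expected_synack_ack(client_isn: int) -> int:
    """Absolute ACK number a SYN|ACK should carry for a SYN with this ISN."""
    return (client_isn + 1) % MOD


def relative_ack(client_isn: int, ack: int) -> int:
    """ACK number relative to the client's ISN (1 for a correct SYN|ACK)."""
    return (ack - client_isn) % MOD


def synack_is_valid(client_isn: int, ack: int) -> bool:
    """True if the SYN|ACK acknowledges exactly the client's SYN."""
    return ack == expected_synack_ack(client_isn)
```

Any relative value other than 1 in the first SYN|ACK is unacceptable to the client, which is consistent with the RSTs described above.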
>
> At first glance it looks like races in the kernel, because the problem disappears when:
>  * we use `dev.ixl.0.iflib.override_nrxqs=1` and `dev.ixl.0.iflib.override_ntxqs=1`
>  * we use `dev.ixl.0.iflib.override_nrxqs=0` and `dev.ixl.0.iflib.override_ntxqs=0`, but don't issue concurrent TCP streams
>
> These are some debug log messages, emitted by 12.1-STABLE:
>
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:16304 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:16326 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:16402 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:16652 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:16686 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:18562 to [10.10.10.92]:80 tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, no action
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:18918 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19331 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19340 to [10.10.10.92]:80 tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, no action
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19340 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19340 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19489 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19580 to [10.10.10.92]:80 tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, no action
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19580 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19580 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> Oct 18 14:59:02 test kernel: TCP: [10.10.10.39]:17705 to [10.10.10.92]:80; syncache_timer: Response timeout, retransmitting (1) SYN|ACK
> Oct 18 14:59:02 test kernel: TCP: [10.10.10.39]:18066 to [10.10.10.92]:80; syncache_timer: Response timeout, retransmitting (1) SYN|ACK
> Oct 18 14:59:02 test kernel: TCP: [10.10.10.39]:18066 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Our SYN|ACK was rejected, connection attempt aborted by remote endpoint
> Oct 18 14:59:02 test kernel: TCP: [10.10.10.39]:17705 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Our SYN|ACK was rejected, connection attempt aborted by remote endpoint
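When there are many such messages, it may help to count them by reporting function and reason. The following is only a sketch, assuming one complete message per line in the format shown above; the regular expression and the name `tally` are illustrative and not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Captures the kernel function name and the message text that follow the
# first "; " in a "kernel: TCP: ..." debug line.
LINE_RE = re.compile(r"kernel: TCP: .*?; (?P<func>\w+): (?P<msg>.+)$")


def tally(lines):
    """Count TCP debug messages by (function, message); skip other lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match:
            counts[(match.group("func"), match.group("msg"))] += 1
    return counts
```

Feeding it the lines of the log file and printing `tally(...).most_common()` gives a quick view of which reasons dominate.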
>
> Here, 10.10.10.92 runs 12.1-STABLE, while 10.10.10.39 is a client that runs 11.2-STABLE.
>
>
> In our test case we use nginx and wrk, with a minimal config, where nginx always returns
> error page 404. nginx is on the 12.1-STABLE, while wrk is on 11.2-STABLE.
>
> We run wrk like so:
>
>   wrk -c 10 --header "Connection: close" -d 10 -t 1 --latency http://10.10.10.92:80/missing
>
> and often see errors like these:
>
>   Socket errors: connect 12, read 4, write 4, timeout 0
>
> If we reverse the test by switching the two servers' places, i.e. 12.1-STABLE becomes a client and
> issues requests via wrk, we see no problems at all. The same is true between two
> 11.2-STABLE machines.
>
>
> It seems like the issue appears only when the same local port is used for multiple connections
> on 12.1-STABLE. Currently this is possible only when 12.1-STABLE is a server and accepts
> connections on a port, say 80, as in our case. To confirm this, we made another test. We've
> configured nginx to listen on 10 different ports, 80 through 89, and then launched 10
> different wrk processes, each using only one concurrent connection, meaning that we will
> have only 10 TCP streams, each having its own unique port on the 12.1-STABLE's side:
>
>   for I in {0..9}; do wrk -c 1 --header "Connection: close" -d 10 -t 1 --latency http://10.10.10.92:8${I}/missing & done
>
> Socket errors stopped appearing. We ran this test many many times, errors just don't appear.
>
> Though, whenever we repeat a previous test, using a single port:
>
>   wrk -c 10 --header "Connection: close" -d 10 -t 1 --latency http://10.10.10.92:80/missing
>
> errors start appearing again and again:
>
>   Socket errors: connect 8, read 14, write 9, timeout 0
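To compare repeated runs such as the ones quoted above, the "Socket errors" line can be pulled out of saved wrk output. This is a rough sketch, assuming the line has exactly the form shown above and that wrk prints no such line when a run has no socket errors; `parse_socket_errors` is a made-up helper name.

```python
import re

# Matches the summary line in the form quoted above.
SOCKET_ERRORS_RE = re.compile(
    r"Socket errors: connect (\d+), read (\d+), write (\d+), timeout (\d+)"
)

FIELDS = ("connect", "read", "write", "timeout")


def parse_socket_errors(output: str) -> dict:
    """Return the socket error counters found in wrk output.

    All counters are zero if the line is absent (assumed to mean no errors).
    """
    match = SOCKET_ERRORS_RE.search(output)
    if match is None:
        return dict.fromkeys(FIELDS, 0)
    return {name: int(value) for name, value in zip(FIELDS, match.groups())}
```

Summing the values per run makes it easy to see whether a change, such as a sysctl setting, brings the error count to zero.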
>
>
> We've tested different drivers with the same outcome:
>
> em driver:
> em0@pci0:10:0:0:        class=0x020000 card=0x000015d9 chip=0x10d38086 rev=0x00 hdr=0x00
>   vendor     = 'Intel Corporation'
>   device     = '82574L Gigabit Network Connection'
>
> ixl driver:
> ixl0@pci0:4:0:0:        class=0x020000 card=0x00078086 chip=0x15728086 rev=0x01 hdr=0x00
>   vendor     = 'Intel Corporation'
>   device     = 'Ethernet Controller X710 for 10GbE SFP+'
>
> Even the driver from ports (/usr/ports/net/intel-ixl-kmod): ixl-1.11.9
>
>
> Help with this matter would be really appreciated.
I would like to reproduce this locally.

Could you send me (privately) the config of nginx such that I can set up two machines?
Are your client/server physical machines or virtual machines?  Are there any middleboxes
(NAT/Firewall/whatever) involved?

One thing (no idea if it is relevant or not):
Could you set
sudo sysctl -w net.inet.tcp.ts_offset_per_conn=0
on the 12.1 machine and test and report if it helps?

Best regards
Michael
>
> Best regards,
> -Paul
>=20
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"