Date: Thu, 6 Mar 2014 21:58:16 +0100
From: Markus Gebert <markus.gebert@hostpoint.ch>
To: Jack Vogel <jfvogel@gmail.com>
Cc: Johan Kooijman <mail@johankooijman.com>, FreeBSD Net <freebsd-net@freebsd.org>,
    Rick Macklem <rmacklem@uoguelph.ca>, John Baldwin <jhb@freebsd.org>
Subject: Re: 9.2 ixgbe tx queue hang (was: Network loss)
Message-ID: <02AD7510-C862-4C29-9420-25ABF1A6E744@hostpoint.ch>
In-Reply-To: <CAFOYbcmrVms7VJmPCZHCTMDvBfsV775aDFkHhMrGAEAtPx8-Mw@mail.gmail.com>
References: <9C5B43BD-9D80-49EA-8EDC-C7EF53D79C8D@hostpoint.ch>
 <CAFOYbcmrVms7VJmPCZHCTMDvBfsV775aDFkHhMrGAEAtPx8-Mw@mail.gmail.com>
On 06.03.2014, at 19:33, Jack Vogel <jfvogel@gmail.com> wrote:

> You did not make it explicit before, but I noticed in your dtrace
> info that you are using lagg; it's been the source of lots of
> problems, so take it out of the setup and see if this queue problem
> still happens, please.
>
> Jack

Well, last year when upgrading another batch of servers (same
hardware) to 9.2, we tried to find a solution to this network problem,
and we eliminated lagg where we had used it before, which did not help
at all. That's why I didn't mention it explicitly.

My point is, I can confirm that 9.2 has network problems on this same
hardware with or without lagg, so it's unlikely that removing it will
bring immediate success. OTOH, I didn't have this tx queue theory back
then, so I cannot be sure that what we saw then without lagg and what
we see now with lagg really are the same problem.

I guess, for the sake of simplicity, I will remove lagg on these new
systems. But before I do that, to save time, I wanted to ask whether I
should remove the vlan interfaces too. While that didn't help either
last year, my guess is that I should take them out of the picture,
unless you say otherwise.

Thanks for looking into this.

Markus


> On Thu, Mar 6, 2014 at 2:24 AM, Markus Gebert
> <markus.gebert@hostpoint.ch> wrote:
>
>> (creating a new thread, because I'm no longer sure this is related
>> to Johan's thread that I originally used to discuss this)
>>
>> On 27.02.2014, at 18:02, Jack Vogel <jfvogel@gmail.com> wrote:
>>
>>> I would make SURE that you have enough mbuf resources of whatever
>>> size pool you are using (2, 4, 9K), and I would try the code in
>>> HEAD if you have not.
>>>
>>> Jack
>>
>> Jack, we've upgraded some other systems on which I get more time to
>> debug (no impact for customers). Although those systems use the
>> nfsclient too, I no longer think that NFS is the source of the
>> problem (hence the new thread). I think it's the ixgbe driver and/or
>> card. When our problem occurs, it looks like a single tx queue gets
>> stuck somehow (its buf_ring remains full).
>>
>> I tracked ping using dtrace to determine the source of the ENOBUFS
>> it returns every few packets when things get weird:
>>
>> # dtrace -n 'fbt:::return / arg1 == ENOBUFS && execname == "ping" / { stack(); }'
>> dtrace: description 'fbt:::return ' matched 25476 probes
>> CPU     ID                    FUNCTION:NAME
>>  26   7730           ixgbe_mq_start:return
>>               if_lagg.ko`lagg_transmit+0xc4
>>               kernel`ether_output_frame+0x33
>>               kernel`ether_output+0x4fe
>>               kernel`ip_output+0xd74
>>               kernel`rip_output+0x229
>>               kernel`sosend_generic+0x3f6
>>               kernel`kern_sendit+0x1a3
>>               kernel`sendit+0xdc
>>               kernel`sys_sendto+0x4d
>>               kernel`amd64_syscall+0x5ea
>>               kernel`0xffffffff80d35667
>>
>> The only way ixgbe_mq_start() could return ENOBUFS is when
>> drbr_enqueue() encounters a full tx buf_ring. Since a new ping
>> packet probably has no flow id, it should be assigned to a queue
>> based on curcpu, which made me try pinning ping to single cpus to
>> check whether it's always the same tx buf_ring that reports being
>> full.
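[For context: the queue selection described above sits at the top of
ixgbe_mq_start(). The following is a condensed sketch of the 9.2-era
logic, a paraphrase rather than the literal driver source; the
try-lock fast path, the deferred taskqueue and all locking are
elided.]

    /*
     * Condensed sketch of the 9.2-era ixgbe_mq_start() queue
     * selection (paraphrase, not the literal source).
     */
    static int
    ixgbe_mq_start(struct ifnet *ifp, struct mbuf *m)
    {
            struct adapter *adapter = ifp->if_softc;
            struct tx_ring *txr;
            int i;

            /* Packets carrying a flow id pick their queue by hash... */
            if ((m->m_flags & M_FLOWID) != 0)
                    i = m->m_pkthdr.flowid % adapter->num_queues;
            else
                    /* ...raw-socket traffic like ping falls back to curcpu. */
                    i = curcpu % adapter->num_queues;

            txr = &adapter->tx_rings[i];
            /* drbr_enqueue() returns ENOBUFS when this queue's buf_ring is full. */
            return (drbr_enqueue(ifp, txr->br, m));
    }

[With 8 tx queues on a 32-core box, curcpu % 8 maps cpus 2, 10, 18 and
26 to queue index 2, which is exactly the pattern in the test below.]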
>> This turned out to be true:
>>
>> # cpuset -l 0 ping 10.0.4.5
>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.347 ms
>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.135 ms
>>
>> # cpuset -l 1 ping 10.0.4.5
>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.184 ms
>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.232 ms
>>
>> # cpuset -l 2 ping 10.0.4.5
>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>> ping: sendto: No buffer space available
>> ping: sendto: No buffer space available
>> ping: sendto: No buffer space available
>> ping: sendto: No buffer space available
>> ping: sendto: No buffer space available
>>
>> # cpuset -l 3 ping 10.0.4.5
>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.130 ms
>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.126 ms
>> [...snip...]
>>
>> The system has 32 cores. If ping runs on cpu 2, 10, 18 or 26, all of
>> which use the third tx buf_ring, ping reliably returns ENOBUFS. If
>> ping is run on any other cpu, using any other tx queue, it runs
>> without any packet loss.
>>
>> So, when ENOBUFS is returned, this is not due to an mbuf shortage,
>> it's because the buf_ring is full. Not surprisingly, netstat -m
>> looks pretty normal:
>>
>> # netstat -m
>> 38622/11823/50445 mbufs in use (current/cache/total)
>> 32856/11642/44498/132096 mbuf clusters in use (current/cache/total/max)
>> 32824/6344 mbuf+clusters out of packet secondary zone in use (current/cache)
>> 16/3906/3922/66048 4k (page size) jumbo clusters in use (current/cache/total/max)
>> 0/0/0/33024 9k jumbo clusters in use (current/cache/total/max)
>> 0/0/0/16512 16k jumbo clusters in use (current/cache/total/max)
>> 75431K/41863K/117295K bytes allocated to network (current/cache/total)
>> 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
>> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
>> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
>> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
>> 0/0/0 sfbufs in use (current/peak/max)
>> 0 requests for sfbufs denied
>> 0 requests for sfbufs delayed
>> 0 requests for I/O initiated by sendfile
>> 0 calls to protocol drain routines
>>
>> In the meantime, I've checked the commit log of the ixgbe driver in
>> HEAD, and besides the fact that there are few differences between
>> HEAD and 9.2, I don't see a commit that fixes anything related to
>> what we're seeing...
>>
>> So, what's the conclusion here? A firmware bug that's only triggered
>> under 9.2? A driver bug introduced between 9.1 and 9.2 when the new
>> multiqueue stuff was added? Jack, how should we proceed?
>>
>>
>> Markus
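[One way a single queue can wedge like this: if the card stops writing
back completed descriptors on one ring, ixgbe_txeof() reclaims
nothing, the drain loop can no longer hand packets to the hardware,
and that ring's buf_ring fills and stays full, so every later enqueue
fails. Below is a condensed sketch of the 9.2-era locked drain path,
paraphrased from the driver's general shape rather than quoted;
watchdog and statistics handling are elided.]

    /*
     * Condensed sketch (paraphrase) of the 9.2-era locked drain path.
     * ENOBUFS can only come from drbr_enqueue(); buf_ring slots free
     * up only while ixgbe_xmit() finds free hardware descriptors, and
     * those are only reclaimed when ixgbe_txeof() sees the card's
     * write-back.
     */
    static int
    ixgbe_mq_start_locked(struct ifnet *ifp, struct tx_ring *txr,
        struct mbuf *m)
    {
            struct mbuf *next;
            int err = 0;

            if (m != NULL && (err = drbr_enqueue(ifp, txr->br, m)) != 0)
                    return (err);           /* buf_ring full -> ENOBUFS */

            while ((next = drbr_peek(ifp, txr->br)) != NULL) {
                    if ((err = ixgbe_xmit(txr, &next)) != 0) {
                            if (next == NULL)   /* mbuf was consumed/freed */
                                    drbr_advance(ifp, txr->br);
                            else                /* no descriptors; retry later */
                                    drbr_putback(ifp, txr->br, next);
                            break;              /* stuck until txeof reclaims */
                    }
                    drbr_advance(ifp, txr->br); /* handed to hardware */
            }
            return (err);
    }

[Under that reading, the open question is why descriptor write-back,
and hence ixgbe_txeof(), stops making progress on exactly one ring.
The earlier report quoted below describes the same class of symptom.]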
>> On Thu, Feb 27, 2014 at 8:05 AM, Markus Gebert
>> <markus.gebert@hostpoint.ch> wrote:
>>
>>> On 27.02.2014, at 02:00, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>>
>>>> John Baldwin wrote:
>>>>> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I have a weird situation here where I can't get my head around.
>>>>>>
>>>>>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients. Once
>>>>>> in a while the Linux clients lose their NFS connection:
>>>>>>
>>>>>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not
>>>>>> responding, timed out
>>>>>>
>>>>>> Not all boxes, just one out of the cluster. The weird part is
>>>>>> that when I try to ping a Linux client from the FreeBSD box, I
>>>>>> have between 10 and 30% packet loss - all day long, no specific
>>>>>> timeframe. If I ping the Linux clients - no loss. If I ping back
>>>>>> from the Linux clients to the FBSD box - no loss.
>>>>>>
>>>>>> The error I get when pinging a Linux client is this one:
>>>>>> ping: sendto: File too large
>>>
>>> We were facing similar problems when upgrading to 9.2 and have
>>> stayed with 9.1 on affected systems for now. We've seen this on HP
>>> G8 blades with 82599EB controllers:
>>>
>>> ix0@pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01 hdr=0x00
>>>     vendor   = 'Intel Corporation'
>>>     device   = '82599EB 10 Gigabit Dual Port Backplane Connection'
>>>     class    = network
>>>     subclass = ethernet
>>>
>>> We didn't find a way to trigger the problem reliably. But when it
>>> occurs, it usually affects only one interface. Symptoms include:
>>>
>>> - socket functions return the 'File too large' error mentioned by
>>>   Johan
>>> - socket functions return 'No buffer space available'
>>> - heavy to full packet loss on the affected interface
>>> - "stuck" TCP connections, i.e. ESTABLISHED TCP connections that
>>>   should have timed out stick around forever (the socket on the
>>>   other side may have been closed hours ago)
>>> - userland programs using the corresponding sockets usually get
>>>   stuck too (can't find kernel traces right now, but always in
>>>   network-related syscalls)
>>>
>>> The network is only lightly loaded on the affected systems (usually
>>> 5-20 mbit, capped at 200 mbit, per server), and netstat never
>>> showed any indication of resource shortage (like mbufs).
>>>
>>> What made the problem go away temporarily was to ifconfig down/up
>>> the affected interface.
>>>
>>> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not
>>> really stable. Also, we tested a few revisions between 9.1 and 9.2
>>> to find out when the problem started. Unfortunately, the ixgbe
>>> driver turned out to be mostly unstable on our systems between
>>> these releases, worse than on 9.2. The instability was introduced
>>> shortly after 9.1 and fixed only very shortly before the 9.2
>>> release. So no luck there. We ended up using 9.1 with backports of
>>> the 9.2 features we really need.
>>>
>>> What we can't tell is whether it's the 9.2 kernel or the 9.2 ixgbe
>>> driver, or a combination of both, that causes these problems.
>>> Unfortunately, we ran out of time (and ideas).
>>>
>>>>> EFBIG is sometimes used by drivers when a packet takes too many
>>>>> scatter/gather entries. Since you mentioned NFS, one thing you
>>>>> can try is to disable TSO on the interface you are using for NFS
>>>>> to see if that "fixes" it.
>>>>
>>>> And please email if you try it and let us know if it helps.
>>>>
>>>> I think I've figured out how 64K NFS read replies can do this, but
>>>> I'll admit "ping" is a mystery? (Doesn't it just send a single
>>>> packet that would be in a single mbuf?)
>>>>
>>>> I think the EFBIG is replied by bus_dmamap_load_mbuf_sg(), but I
>>>> don't know if it can happen for an mbuf chain with < 32 entries?
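[For readers wondering where that EFBIG enters the picture: this is
roughly the shape of the DMA-mapping step in the 9.2-era ixgbe_xmit(),
paraphrased. The function name here is invented for illustration, and
IXGBE_82599_SCATTER (32 segments) and the retry-once-after-m_defrag()
policy should be checked against the real source.]

    /*
     * Rough sketch of the mapping step in ixgbe_xmit() (paraphrase).
     * An mbuf chain with too many scatter/gather segments makes
     * bus_dmamap_load_mbuf_sg() fail with EFBIG; the driver then
     * linearizes the chain with m_defrag() and retries once.
     */
    static int
    ixgbe_dma_map_sketch(struct tx_ring *txr, bus_dmamap_t map,
        struct mbuf **m_headp)
    {
            bus_dma_segment_t segs[IXGBE_82599_SCATTER]; /* 32 on 82599 */
            int nsegs, error;

            error = bus_dmamap_load_mbuf_sg(txr->txtag, map, *m_headp,
                segs, &nsegs, BUS_DMA_NOWAIT);
            if (error == EFBIG) {
                    /* Too many segments: collapse the chain, retry once. */
                    struct mbuf *m = m_defrag(*m_headp, M_NOWAIT);
                    if (m == NULL) {
                            m_freem(*m_headp);
                            *m_headp = NULL;
                            return (ENOBUFS);
                    }
                    *m_headp = m;
                    error = bus_dmamap_load_mbuf_sg(txr->txtag, map,
                        *m_headp, segs, &nsegs, BUS_DMA_NOWAIT);
            }
            /* A second EFBIG bubbles up to the caller unchanged. */
            return (error);
    }

[A 64K NFS read reply can easily arrive as a long mbuf chain, which
fits John's TSO theory; a single ICMP echo is one mbuf, so Rick's
puzzlement about ping stands.]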
>>> We don't use the nfs server on our systems, but they're
>>> (new)nfsclients. So I don't think our problem is nfs related,
>>> unless the default rsize/wsize for client mounts is not 8K, which I
>>> thought it was. Can you confirm this, Rick?
>>>
>>> IIRC, disabling TSO did not make any difference in our case.
>>>
>>>
>>> Markus
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"