Date: Thu, 6 Mar 2014 21:58:16 +0100
From: Markus Gebert <markus.gebert@hostpoint.ch>
To: Jack Vogel <jfvogel@gmail.com>
Cc: Johan Kooijman <mail@johankooijman.com>, FreeBSD Net <freebsd-net@freebsd.org>,
    Rick Macklem <rmacklem@uoguelph.ca>, John Baldwin <jhb@freebsd.org>
Subject: Re: 9.2 ixgbe tx queue hang (was: Network loss)
Message-ID: <02AD7510-C862-4C29-9420-25ABF1A6E744@hostpoint.ch>
In-Reply-To: <CAFOYbcmrVms7VJmPCZHCTMDvBfsV775aDFkHhMrGAEAtPx8-Mw@mail.gmail.com>
References: <9C5B43BD-9D80-49EA-8EDC-C7EF53D79C8D@hostpoint.ch>
 <CAFOYbcmrVms7VJmPCZHCTMDvBfsV775aDFkHhMrGAEAtPx8-Mw@mail.gmail.com>
On 06.03.2014, at 19:33, Jack Vogel <jfvogel@gmail.com> wrote:

> You did not make it explicit before, but I noticed in your dtrace
> info that you are using lagg; it's been the source of lots of
> problems, so take it out of the setup and see if this queue problem
> still happens, please.
>
> Jack

Well, last year when upgrading another batch of servers (same
hardware) to 9.2, we tried to find a solution to this network problem,
and we eliminated lagg where we had used it before, which did not help
at all. That's why I didn't mention it explicitly.

My point is, I can confirm that 9.2 has network problems on this same
hardware with or without lagg, so it's unlikely that removing it will
bring immediate success. OTOH, I didn't have this tx queue theory back
then, so I cannot be sure that what we saw then without lagg and what
we see now with lagg really are the same problem.

I guess, for the sake of simplicity, I will remove lagg on these new
systems. But before I do that, to save time, I wanted to ask whether I
should remove the vlan interfaces too. While that didn't help either
last year, my guess is that I should take them out of the picture,
unless you say otherwise.

Thanks for looking into this.

Markus


> On Thu, Mar 6, 2014 at 2:24 AM, Markus Gebert
> <markus.gebert@hostpoint.ch> wrote:
>
>> (creating a new thread, because I'm no longer sure this is related
>> to Johan's thread that I originally used to discuss this)
>>
>> On 27.02.2014, at 18:02, Jack Vogel <jfvogel@gmail.com> wrote:
>>
>>> I would make SURE that you have enough mbuf resources of whatever
>>> size pool you are using (2, 4, 9K), and I would try the code in
>>> HEAD if you have not.
>>>
>>> Jack
>>
>> Jack, we've upgraded some other systems on which I get more time to
>> debug (no impact for customers). Although those systems use the
>> nfsclient too, I no longer think that NFS is the source of the
>> problem (hence the new thread). I think it's the ixgbe driver and/or
>> card. When our problem occurs, it looks like a single tx queue gets
>> stuck somehow (its buf_ring remains full).
>>
>> I tracked ping using dtrace to determine the source of the ENOBUFS
>> it returns every few packets when things get weird:
>>
>> # dtrace -n 'fbt:::return / arg1 == ENOBUFS && execname == "ping" / { stack(); }'
>> dtrace: description 'fbt:::return ' matched 25476 probes
>> CPU     ID                    FUNCTION:NAME
>>  26   7730           ixgbe_mq_start:return
>>               if_lagg.ko`lagg_transmit+0xc4
>>               kernel`ether_output_frame+0x33
>>               kernel`ether_output+0x4fe
>>               kernel`ip_output+0xd74
>>               kernel`rip_output+0x229
>>               kernel`sosend_generic+0x3f6
>>               kernel`kern_sendit+0x1a3
>>               kernel`sendit+0xdc
>>               kernel`sys_sendto+0x4d
>>               kernel`amd64_syscall+0x5ea
>>               kernel`0xffffffff80d35667
>>
>> The only way ixgbe_mq_start() could return ENOBUFS is when
>> drbr_enqueue() encounters a full tx buf_ring. Since a new ping
>> packet probably has no flow id, it should be assigned to a queue
>> based on curcpu, which made me try pinning ping to single cpus to
>> check whether it's always the same tx buf_ring that reports being
>> full.
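[For context: the queue selection described above sits at the top of
ixgbe_mq_start(). The following is a condensed sketch of the 9.2-era
logic, a paraphrase rather than the literal driver source; the
try-lock fast path, the deferred taskqueue and all locking are
elided.]

    /*
     * Condensed sketch of the 9.2-era ixgbe_mq_start() queue
     * selection (paraphrase, not the literal source).
     */
    static int
    ixgbe_mq_start(struct ifnet *ifp, struct mbuf *m)
    {
            struct adapter *adapter = ifp->if_softc;
            struct tx_ring *txr;
            int i;

            /* Packets carrying a flow id pick their queue by hash... */
            if ((m->m_flags & M_FLOWID) != 0)
                    i = m->m_pkthdr.flowid % adapter->num_queues;
            else
                    /* ...raw-socket traffic like ping falls back to curcpu. */
                    i = curcpu % adapter->num_queues;

            txr = &adapter->tx_rings[i];
            /* drbr_enqueue() returns ENOBUFS when this queue's buf_ring is full. */
            return (drbr_enqueue(ifp, txr->br, m));
    }

[With 8 tx queues on a 32-core box, curcpu % 8 maps cpus 2, 10, 18 and
26 to queue index 2, which is exactly the pattern in the test below.]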
>> This turned out to be true:
>>
>> # cpuset -l 0 ping 10.0.4.5
>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.347 ms
>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.135 ms
>>
>> # cpuset -l 1 ping 10.0.4.5
>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.184 ms
>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.232 ms
>>
>> # cpuset -l 2 ping 10.0.4.5
>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>> ping: sendto: No buffer space available
>> ping: sendto: No buffer space available
>> ping: sendto: No buffer space available
>> ping: sendto: No buffer space available
>> ping: sendto: No buffer space available
>>
>> # cpuset -l 3 ping 10.0.4.5
>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.130 ms
>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.126 ms
>> [...snip...]
>>
>> The system has 32 cores. If ping runs on cpu 2, 10, 18 or 26, all of
>> which use the third tx buf_ring, ping reliably returns ENOBUFS. If
>> ping is run on any other cpu, using any other tx queue, it runs
>> without any packet loss.
>>
>> So, when ENOBUFS is returned, this is not due to an mbuf shortage,
>> it's because the buf_ring is full. Not surprisingly, netstat -m
>> looks pretty normal:
>>
>> # netstat -m
>> 38622/11823/50445 mbufs in use (current/cache/total)
>> 32856/11642/44498/132096 mbuf clusters in use (current/cache/total/max)
>> 32824/6344 mbuf+clusters out of packet secondary zone in use (current/cache)
>> 16/3906/3922/66048 4k (page size) jumbo clusters in use (current/cache/total/max)
>> 0/0/0/33024 9k jumbo clusters in use (current/cache/total/max)
>> 0/0/0/16512 16k jumbo clusters in use (current/cache/total/max)
>> 75431K/41863K/117295K bytes allocated to network (current/cache/total)
>> 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
>> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
>> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
>> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
>> 0/0/0 sfbufs in use (current/peak/max)
>> 0 requests for sfbufs denied
>> 0 requests for sfbufs delayed
>> 0 requests for I/O initiated by sendfile
>> 0 calls to protocol drain routines
>>
>> In the meantime, I've checked the commit log of the ixgbe driver in
>> HEAD, and besides the fact that there are few differences between
>> HEAD and 9.2, I don't see a commit that fixes anything related to
>> what we're seeing...
>>
>> So, what's the conclusion here? A firmware bug that's only triggered
>> under 9.2? A driver bug introduced between 9.1 and 9.2 when the new
>> multiqueue stuff was added? Jack, how should we proceed?
>>
>>
>> Markus
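[One way a single queue can wedge like this: if the card stops writing
back completed descriptors on one ring, ixgbe_txeof() reclaims
nothing, the drain loop can no longer hand packets to the hardware,
and that ring's buf_ring fills and stays full, so every later enqueue
fails. Below is a condensed sketch of the 9.2-era locked drain path,
paraphrased from the driver's general shape rather than quoted;
watchdog and statistics handling are elided.]

    /*
     * Condensed sketch (paraphrase) of the 9.2-era locked drain path.
     * ENOBUFS can only come from drbr_enqueue(); buf_ring slots free
     * up only while ixgbe_xmit() finds free hardware descriptors, and
     * those are only reclaimed when ixgbe_txeof() sees the card's
     * write-back.
     */
    static int
    ixgbe_mq_start_locked(struct ifnet *ifp, struct tx_ring *txr,
        struct mbuf *m)
    {
            struct mbuf *next;
            int err = 0;

            if (m != NULL && (err = drbr_enqueue(ifp, txr->br, m)) != 0)
                    return (err);           /* buf_ring full -> ENOBUFS */

            while ((next = drbr_peek(ifp, txr->br)) != NULL) {
                    if ((err = ixgbe_xmit(txr, &next)) != 0) {
                            if (next == NULL)   /* mbuf was consumed/freed */
                                    drbr_advance(ifp, txr->br);
                            else                /* no descriptors; retry later */
                                    drbr_putback(ifp, txr->br, next);
                            break;              /* stuck until txeof reclaims */
                    }
                    drbr_advance(ifp, txr->br); /* handed to hardware */
            }
            return (err);
    }

[Under that reading, the open question is why descriptor write-back,
and hence ixgbe_txeof(), stops making progress on exactly one ring.
The earlier report quoted below describes the same class of symptom.]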
>> On Thu, Feb 27, 2014 at 8:05 AM, Markus Gebert
>> <markus.gebert@hostpoint.ch> wrote:
>>
>>> On 27.02.2014, at 02:00, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>>
>>>> John Baldwin wrote:
>>>>> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I have a weird situation here where I can't get my head around.
>>>>>>
>>>>>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients. Once
>>>>>> in a while the Linux clients lose their NFS connection:
>>>>>>
>>>>>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not
>>>>>> responding, timed out
>>>>>>
>>>>>> Not all boxes, just one out of the cluster. The weird part is
>>>>>> that when I try to ping a Linux client from the FreeBSD box, I
>>>>>> have between 10 and 30% packet loss - all day long, no specific
>>>>>> timeframe. If I ping the Linux clients - no loss. If I ping back
>>>>>> from the Linux clients to the FBSD box - no loss.
>>>>>>
>>>>>> The error I get when pinging a Linux client is this one:
>>>>>> ping: sendto: File too large
>>>
>>> We were facing similar problems when upgrading to 9.2 and have
>>> stayed with 9.1 on affected systems for now. We've seen this on HP
>>> G8 blades with 82599EB controllers:
>>>
>>> ix0@pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01 hdr=0x00
>>>     vendor   = 'Intel Corporation'
>>>     device   = '82599EB 10 Gigabit Dual Port Backplane Connection'
>>>     class    = network
>>>     subclass = ethernet
>>>
>>> We didn't find a way to trigger the problem reliably. But when it
>>> occurs, it usually affects only one interface. Symptoms include:
>>>
>>> - socket functions return the 'File too large' error mentioned by
>>>   Johan
>>> - socket functions return 'No buffer space available'
>>> - heavy to full packet loss on the affected interface
>>> - "stuck" TCP connections, i.e. ESTABLISHED TCP connections that
>>>   should have timed out stick around forever (the socket on the
>>>   other side may have been closed hours ago)
>>> - userland programs using the corresponding sockets usually get
>>>   stuck too (can't find kernel traces right now, but always in
>>>   network-related syscalls)
>>>
>>> The network is only lightly loaded on the affected systems (usually
>>> 5-20 mbit, capped at 200 mbit, per server), and netstat never
>>> showed any indication of resource shortage (like mbufs).
>>>
>>> What made the problem go away temporarily was to ifconfig down/up
>>> the affected interface.
>>>
>>> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not
>>> really stable. Also, we tested a few revisions between 9.1 and 9.2
>>> to find out when the problem started. Unfortunately, the ixgbe
>>> driver turned out to be mostly unstable on our systems between
>>> these releases, worse than on 9.2. The instability was introduced
>>> shortly after 9.1 and fixed only very shortly before the 9.2
>>> release. So no luck there. We ended up using 9.1 with backports of
>>> the 9.2 features we really need.
>>>
>>> What we can't tell is whether it's the 9.2 kernel or the 9.2 ixgbe
>>> driver, or a combination of both, that causes these problems.
>>> Unfortunately, we ran out of time (and ideas).
>>>
>>>>> EFBIG is sometimes used by drivers when a packet takes too many
>>>>> scatter/gather entries. Since you mentioned NFS, one thing you
>>>>> can try is to disable TSO on the interface you are using for NFS
>>>>> to see if that "fixes" it.
>>>>
>>>> And please email if you try it and let us know if it helps.
>>>>
>>>> I think I've figured out how 64K NFS read replies can do this, but
>>>> I'll admit "ping" is a mystery? (Doesn't it just send a single
>>>> packet that would be in a single mbuf?)
>>>>
>>>> I think the EFBIG is replied by bus_dmamap_load_mbuf_sg(), but I
>>>> don't know if it can happen for an mbuf chain with < 32 entries?
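[For readers wondering where that EFBIG enters the picture: this is
roughly the shape of the DMA-mapping step in the 9.2-era ixgbe_xmit(),
paraphrased. The function name here is invented for illustration, and
IXGBE_82599_SCATTER (32 segments) and the retry-once-after-m_defrag()
policy should be checked against the real source.]

    /*
     * Rough sketch of the mapping step in ixgbe_xmit() (paraphrase).
     * An mbuf chain with too many scatter/gather segments makes
     * bus_dmamap_load_mbuf_sg() fail with EFBIG; the driver then
     * linearizes the chain with m_defrag() and retries once.
     */
    static int
    ixgbe_dma_map_sketch(struct tx_ring *txr, bus_dmamap_t map,
        struct mbuf **m_headp)
    {
            bus_dma_segment_t segs[IXGBE_82599_SCATTER]; /* 32 on 82599 */
            int nsegs, error;

            error = bus_dmamap_load_mbuf_sg(txr->txtag, map, *m_headp,
                segs, &nsegs, BUS_DMA_NOWAIT);
            if (error == EFBIG) {
                    /* Too many segments: collapse the chain, retry once. */
                    struct mbuf *m = m_defrag(*m_headp, M_NOWAIT);
                    if (m == NULL) {
                            m_freem(*m_headp);
                            *m_headp = NULL;
                            return (ENOBUFS);
                    }
                    *m_headp = m;
                    error = bus_dmamap_load_mbuf_sg(txr->txtag, map,
                        *m_headp, segs, &nsegs, BUS_DMA_NOWAIT);
            }
            /* A second EFBIG bubbles up to the caller unchanged. */
            return (error);
    }

[A 64K NFS read reply can easily arrive as a long mbuf chain, which
fits John's TSO theory; a single ICMP echo is one mbuf, so Rick's
puzzlement about ping stands.]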
>>> We don't use the nfs server on our systems, but they're
>>> (new)nfsclients. So I don't think our problem is nfs related,
>>> unless the default rsize/wsize for client mounts is not 8K, which I
>>> thought it was. Can you confirm this, Rick?
>>>
>>> IIRC, disabling TSO did not make any difference in our case.
>>>
>>>
>>> Markus
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"