Date:      Wed, 19 Mar 2014 16:17:34 -0300
From:      Christopher Forgeron <csforgeron@gmail.com>
To:        freebsd-net@freebsd.org
Subject:   Re: 9.2 ixgbe tx queue hang
Message-ID:  <CAB2_NwDG=gB1WCJ7JKTHpkJCrvPuAhipkn+vPyT+xXzOBrTGkg@mail.gmail.com>

Hello,



I can report this problem as well on 10.0-RELEASE.



I think it's the same as kern/183390?



I have two physically identical machines, one running 9.2-STABLE and one
running 10.0-RELEASE.



My 10.0 machine previously ran 9.0-STABLE for over a year without any
problems.



I'm not seeing the problem on 9.2-STABLE as far as I can tell, but it does
seem to be a load-related issue more than anything. Since my 9.2 system is
in production, I can't load it heavily enough to see whether the problem
exists there. I have ping_logger.py running on it now to see whether it is
hitting the problem briefly or not.



I am able to reproduce it fairly reliably within 15 min of a reboot by
loading the server via NFS with iometer and some large NFS file copies at
the same time. I seem to need to sustain ~2 Gbps for a few minutes.
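
In case it helps anyone trying to reproduce this, a few parallel writers
pushing large files at the NFS mount should generate similar sustained
traffic. Here is a rough Python sketch (hypothetical; my actual load comes
from iometer plus large NFS file copies, and the mount point below is just
a placeholder):

#!/usr/bin/env python
# Hypothetical NFS load generator: several threads streaming large files
# onto an NFS mount to keep a few Gbps of write traffic on the wire.
import os
import threading

MOUNTPOINT = "/mnt/nfs_test"     # placeholder: NFS export from the test box
FILE_SIZE = 8 * 1024 ** 3        # 8 GiB per file
BLOCK = b"\xa5" * (1024 * 1024)  # 1 MiB writes
THREADS = 4                      # parallel streams

def writer(n):
    # Stream FILE_SIZE bytes into this thread's own file on the NFS mount.
    path = os.path.join(MOUNTPOINT, "load_%d.bin" % n)
    written = 0
    with open(path, "wb") as f:
        while written < FILE_SIZE:
            f.write(BLOCK)
            written += len(BLOCK)

def main():
    threads = [threading.Thread(target=writer, args=(i,)) for i in range(THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    main()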



It will happen with just ix0 (no lagg) or with lagg enabled across ix0 and
ix1.



I've been load-testing new FreeBSD-10.0-RELEASE SANs for production use
here, so I'm quite willing to put time into this to help find out where
it's coming from. It took me a day to track my iometer issues down to the
network, and another day to isolate the problem and write scripts to
reproduce it.



The symptoms I notice are:

- A running flood ping (ping -f 172.16.0.31) to the identical machine
  (running 9.2) comes back with "ping: sendto: File too large" when the
  problem occurs.
- Network connectivity is very spotty during these incidents.
- It can run with sporadic ping errors, or it can produce a solid run of
  errors for minutes at a time.
- After a long run of ping errors, ESXi shows a disconnect from the NFS
  stores hosted on this machine.
- I've yet to see it happen right after boot. The fastest is around 5
  minutes; normally it's within 15 minutes.



System Specs:

- Dell PowerEdge M610x Blade
- 2 x Xeon 6600 @ 2.40GHz (24 cores total)
- 96 GB RAM
- 35.3 TB mirrored ZFS pool, lz4 compression on my test pool (ZFS pool is
  at the latest version)
- Intel 520-DA2 10 Gb dual-port blade mezzanine cards



Currently this 10.0 test machine has all sysctls at their defaults other
than hw.intr_storm_threshold=9900. I see the problem whether or not that's
set, so I leave it on.



(I used to set nmbclusters, etc. manually as per the Intel README, but I
notice that the defaults on the new 10.0 system are already larger. I did
try using all of the old sysctls from an older 9.0-STABLE install and
still had the problem, though it did seem to take longer to occur. I
haven't run enough tests to confirm that timing observation.)



What logs / info can I provide to help?



I have written a small script called ping_logger.py that pings an IP and
checks for an error. On error it will execute and log:

- netstat -m
- sysctl hw.ix
- sysctl dev.ix

and then go back to pinging. It also logs those values at script startup
and every 5 minutes (so you can see the progression on the system). I can
add any number of things to the reporting, so I'm looking for suggestions.



This results in some large log files, but I can email a .gz directly to
anyone who needs them, or perhaps put it up on a website.



I will also make the ping_logger.py script available if anyone else wants
it.
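
For reference, the core logic is roughly the following. This is a
simplified sketch, not the script verbatim; the real one logs more detail,
and the target IP, log path, and ping invocation here are placeholders:

#!/usr/bin/env python
# Simplified sketch of ping_logger.py: ping a target in a loop, and on any
# ping failure (e.g. "ping: sendto: File too large") dump netstat -m and
# the ix sysctls to a log. Also snapshot the same values at startup and
# every 5 minutes so the progression is visible.
import subprocess
import time

TARGET = "172.16.0.31"        # placeholder target IP
LOGFILE = "ping_logger.log"   # placeholder log path
SNAPSHOT_INTERVAL = 300       # periodic snapshot every 5 minutes

COMMANDS = (
    ["netstat", "-m"],
    ["sysctl", "hw.ix"],
    ["sysctl", "dev.ix"],
)

def run(cmd):
    # Run a command and return its combined stdout/stderr as text.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return out.decode("utf-8", "replace")

def snapshot(log, reason):
    log.write("==== %s  %s ====\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), reason))
    for cmd in COMMANDS:
        log.write("--- %s ---\n%s\n" % (" ".join(cmd), run(cmd)))
    log.flush()

def main():
    with open(LOGFILE, "a") as log:
        snapshot(log, "startup")
        last = time.time()
        while True:
            # One ping per iteration; a non-zero exit status counts as an error.
            rc = subprocess.call(["ping", "-c", "1", TARGET],
                                 stdout=subprocess.DEVNULL,
                                 stderr=subprocess.DEVNULL)
            if rc != 0:
                snapshot(log, "ping error (exit %d)" % rc)
            if time.time() - last >= SNAPSHOT_INTERVAL:
                snapshot(log, "periodic")
                last = time.time()
            time.sleep(0.2)

if __name__ == "__main__":
    main()

The point is simply to capture the mbuf and ix counters at the moment the
pings start failing, alongside a periodic baseline.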





LASTLY:



The one thing I can see that differs between my 10.0 system and my 9.2
system is:



9.2's netstat -m:



37965/16290/54255 mbufs in use (current/cache/total)
4080/8360/12440/524288 mbuf clusters in use (current/cache/total/max)
4080/4751 mbuf+clusters out of packet secondary zone in use (current/cache)
0/452/452/262144 4k (page size) jumbo clusters in use (current/cache/total/max)
32773/4129/36902/96000 9k jumbo clusters in use (current/cache/total/max)
0/0/0/508538 16k jumbo clusters in use (current/cache/total/max)
312608K/59761K/372369K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines





10.0's netstat -m:



21512/24448/45960 mbufs in use (current/cache/total)
4080/16976/21056/6127254 mbuf clusters in use (current/cache/total/max)
4080/16384 mbuf+clusters out of packet secondary zone in use (current/cache)
0/23/23/3063627 4k (page size) jumbo clusters in use (current/cache/total/max)
16384/158/16542/907741 9k jumbo clusters in use (current/cache/total/max)
0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
160994K/41578K/202572K bytes allocated to network (current/cache/total)
17488/13290/20464 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
7/16462/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile



Way more mbuf clusters are in use, but also I never get denied/delayed
requests on 9.2, whereas on 10.0 I have them right away after a reboot.



Thanks for any help.


