From owner-freebsd-net@FreeBSD.ORG Thu Mar 20 15:34:50 2014
Date: Thu, 20 Mar 2014 12:34:49 -0300
From: Christopher Forgeron
To: Markus Gebert
Cc: freebsd-net@freebsd.org, Rick Macklem, Jack Vogel
Subject: Re: 9.2 ixgbe tx queue hang

On Thu, Mar 20, 2014 at 7:40 AM, Markus Gebert wrote:

> Possible. We still see this on nfsclients only, but I'm not convinced
> that nfs is the only trigger.

Just to clarify: I'm experiencing this error with NFS, but also with iSCSI. I turned off my NFS server in rc.conf and rebooted, and I'm still able to create the error, so this is not just an NFS issue on my machine.

> In our case, when it happens, the problem persists for quite some time
> (minutes or hours) if we don't interact (ifconfig or reboot).

The first few times I ran into it, I had similar issues, because I was keeping the system up and treating it as a temporary problem. The worst cases required reboots to reset the NIC. Then again, I find the ix interfaces cranky if you ifconfig them too much. Now that I'm trying to find a root cause, I abort and reboot the machine as soon as I start seeing any errors, so I can test the next theory.

Additionally, I'm often able to create the problem with just one VM running iometer against the SAN storage. When the problem occurs, that connection is broken temporarily, taking network load off the SAN, which may improve my chances of keeping the system running.

> > I am able to reproduce it fairly reliably within 15 min of a reboot by
> > loading the server via NFS with iometer and some large NFS file copies
> > at the same time. I seem to need to sustain ~2 Gbps for a few minutes.
> That's probably why we can't reproduce it reliably here. Although we
> have 10gig cards in our blade servers, the ones affected are connected
> to a 1gig switch.

It does seem to need a lot of traffic. I have a 10 gig backbone between my SANs and my ESXi machines, so I can saturate quite quickly (just now I hit a record: the error occurred within ~5 minutes of reboot and testing). In your case, I recommend firing up multiple VMs running iometer on different 1 gig connections and seeing if you can make it pop. I also often turn off ix1 to drive all traffic through ix0; I've noticed the problem appears faster this way, but once again I haven't made enough observations to offer decent time predictions.

> Can you try this when the problem occurs?
>
> for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto; done
>
> It will tie ping to certain cpus to test the different tx queues of
> your ix interface. If the pings reliably fail only on some queues, then
> your problem is more likely to be the same as ours.
>
> Also, if you have dtrace available:
>
> kldload dtraceall
> dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack(); }'
>
> while you run pings over the affected interface. This will give you
> hints about where the EFBIG error comes from.
>
> [...]
>
> Markus

Will do. I'm not sure what shell the first script was written for; it doesn't work in csh. Here's a rewrite that does work in csh, in case others are using the default shell:

#!/bin/csh
foreach CPU (`seq 0 23`)
    echo "CPU$CPU"
    cpuset -l $CPU ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto
end

Thanks for your input. I should have results to post to the list shortly.
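
P.S. Markus's one-liner looks bash-specific (the {0..7} brace expansion isn't POSIX), which is probably why it fails in csh. For anyone who would rather stay in plain /bin/sh than use the csh version above, here is an untested sketch of an equivalent; adjust the CPU range and the target address (10.0.0.1, carried over from Markus's example) to your own setup:

#!/bin/sh
# Probe each tx queue by pinning ping to one CPU at a time.
# The idea (per Markus) is that the CPU a packet is sent from
# determines which ix tx queue it uses, so a wedged queue should
# produce "sendto" errors only for the CPUs mapped to it.
for CPU in $(seq 0 23); do
    echo "CPU${CPU}"
    cpuset -l "${CPU}" ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto
done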
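
P.P.S. If the stack() output from the dtrace one-liner turns out to be too noisy to read, a variant I'd try (untested) is aggregating on the name of the returning function instead, which condenses everything into a per-function count printed when dtrace exits:

dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { @[probefunc] = count(); }'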