Date:      Fri, 20 Oct 2006 14:52:19 -0700
From:      "Jack Vogel" <jfvogel@gmail.com>
To:        "Bill Paul" <wpaul@freebsd.org>
Cc:        freebsd-stable@freebsd.org, kris@obsecurity.org
Subject:   Re: em network issues
Message-ID:  <2a41acea0610201452v22f2bae9mcc0e71d2157d8bbb@mail.gmail.com>
In-Reply-To: <20061020212138.5893C16A492@hub.freebsd.org>
References:  <4538FF57.1070109@samsco.org> <20061020212138.5893C16A492@hub.freebsd.org>

On 10/20/06, Bill Paul <wpaul@freebsd.org> wrote:

> > This is exactly the test that Andre and I were running, though only in
> > one direction (I think due to lack of hardware for a full test).
>
> Yes, but did you do it with a Smartbits though, or just with a couple of
> other FreeBSD machines? Unfortunately, a typical FreeBSD system on its own
> won't generate frames anywhere near fast enough to really torture test a
> gigE interface. At best you might hit around 200000 to 300000 frames/sec.
>
> A given Smartbits system doesn't need special hardware to run a
> bi-directional forwarding test. If you're using SmartApps, you just
> have to click the "Bi-Directional" checkbox on the main setup window.
> (At least, that's how it is with the ones at work.)
>
> Being able to flood the link with the Smartbits is also handy for
> provoking error conditions (RX overruns and TX underruns, mostly), which
> shows you how well (or not) the driver's error recovery works.
>
> In the past I considered creating a kernel module that would grab hold
> of a given interface and blast traffic through it with as little software
> overhead as possible (e.g. sending the same mbuf over and over) in order
> to create my own test system that could hopefully rival the Smartbits,
> but I never got around to it. I'm not sure that it's really possible
> without custom hardware though.

Our Linux team has this. As far as I know it's only been used by our
internal test types, and I have not seen the code, but I take it
as evidence that it IS doable :)

> > Prior to the INTR_FAST change, the machine would live-lock.  Now it
> > survives, stays responsive, and drops packets as needed.
>
> The wide range of failures people seem to be reporting might mean that
> the driver code itself is not the issue, but that there's an interaction
> with some other part of the system. This means torture testing the driver
> itself might not be enough to provoke the problems.
>
> Unfortunately, nobody seems to have nailed down a good test case for
> any of these failures. I strongly suspect people are leaving out details
> which seem obvious and/or trivial to them, but which are critical to
> finding the problem. ("Oh, I was using SCHED_ULE... was I not supposed
> to do that? Tee-hee." *curls finger in blonde hair*)
>
> Another thing that might be handy is improving the watchdog timeout
> message so that it dumps the state of the ICR and ICM registers (and
> maybe some other interesting driver and/or device state). The timeout
> implies no interrupts were delivered for a Long Time (tm). If the
> ICM register indicates interrupts have been masked, then that means
> em_intr_fast() was triggered by an interrupt and it scheduled work,
> but that work never executed. If that really is what happened, then
> I can understand the watchdog error occurring. If that's _not_ what
> happened, then something else is screwed up.
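
To make the decision rule in the quoted paragraph concrete, here is a
small sketch of how a richer watchdog message could classify what it
sees. The struct, the field names, and the convention that a zero mask
snapshot means "all interrupts masked" are assumptions for illustration
only; this is not the em driver's actual register layout or code.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot taken when the watchdog fires. */
struct wd_snapshot {
    uint32_t icr;   /* interrupt cause bits seen at timeout */
    uint32_t mask;  /* enabled-interrupt bits; 0 is assumed to mean "all masked" */
};

enum wd_verdict {
    WD_DEFERRED_WORK_NEVER_RAN, /* handler masked interrupts, queued work was never run */
    WD_SOMETHING_ELSE           /* interrupts still enabled: look elsewhere */
};

/* Apply the rule from the quoted text to one snapshot. */
static enum wd_verdict wd_classify(const struct wd_snapshot *s)
{
    return (s->mask == 0) ? WD_DEFERRED_WORK_NEVER_RAN : WD_SOMETHING_ELSE;
}

/* Build the extended watchdog message; returns snprintf's result. */
static int wd_format(char *buf, size_t len, const struct wd_snapshot *s)
{
    const char *why = (wd_classify(s) == WD_DEFERRED_WORK_NEVER_RAN)
        ? "deferred work never ran"
        : "something else";

    return snprintf(buf, len,
        "watchdog timeout: icr=0x%08" PRIx32 " mask=0x%08" PRIx32 " (%s)",
        s->icr, s->mask, why);
}
```

In a real driver the two values would come from register reads at the
point where the timeout is reported.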

Jesse Brandeburg just did an interesting hack for the Linux driver, and I
was considering coding up an equivalent for us. We have evidence that on
some AMD-based systems writebacks get lost. Since the TX cleanup relies
on the DD (descriptor done) bit being set, you are hosed when this
happens. What he did was write a cleanup routine that uses ONLY the head
and tail pointers and NOT the done bit. Then, in the watchdog routine, if
there is evidence of this problem, it switches the cleanup function
pointer to this alternate cleanup code.
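
A rough user-space sketch of the idea follows. It is not Jesse's patch
and not em driver code: the ring structure, the field and function
names, and the test the watchdog uses to decide that a writeback was
lost are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define RING_SIZE 8

/* Hypothetical, simplified TX descriptor. */
struct tx_desc {
    bool dd;      /* "descriptor done", normally written back by hardware */
    bool in_use;  /* driver still holds a buffer for this slot */
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    unsigned head;                  /* hardware head: next slot the NIC will process */
    unsigned tail;                  /* next slot the driver will fill */
    unsigned next_to_clean;         /* oldest slot not yet reclaimed */
    int (*clean)(struct tx_ring *); /* currently active cleanup routine */
};

static void reclaim_slot(struct tx_ring *r)
{
    r->desc[r->next_to_clean].dd = false;
    r->desc[r->next_to_clean].in_use = false;
    r->next_to_clean = (r->next_to_clean + 1) % RING_SIZE;
}

/* Normal cleanup: trust the DD bit. Stalls if a writeback is lost. */
static int clean_by_dd(struct tx_ring *r)
{
    int n = 0;
    while (r->next_to_clean != r->tail && r->desc[r->next_to_clean].dd) {
        reclaim_slot(r);
        n++;
    }
    return n;
}

/* Alternate cleanup: ignore DD, reclaim everything behind the head. */
static int clean_by_head(struct tx_ring *r)
{
    int n = 0;
    while (r->next_to_clean != r->head) {
        reclaim_slot(r);
        n++;
    }
    return n;
}

/*
 * Watchdog check: the head has moved past the oldest slot, yet that
 * slot's DD bit is still clear. Treat this as a lost writeback and
 * switch the cleanup function pointer. Returns true if it switched.
 */
static bool watchdog_check(struct tx_ring *r)
{
    if (r->next_to_clean != r->head && !r->desc[r->next_to_clean].dd) {
        r->clean = clean_by_head;
        return true;
    }
    return false;
}

static void tx_ring_init(struct tx_ring *r)
{
    memset(r, 0, sizeof(*r));
    r->clean = clean_by_dd;
}

/* Queue one frame; the caller must not overfill the ring. */
static void tx_enqueue(struct tx_ring *r)
{
    assert((r->tail + 1) % RING_SIZE != r->next_to_clean);
    r->desc[r->tail].in_use = true;
    r->desc[r->tail].dd = false;
    r->tail = (r->tail + 1) % RING_SIZE;
}
```

If four frames are queued, the head advances past all four, but only
the first DD bit is written back, then clean_by_dd() frees one slot and
stalls. After watchdog_check() swaps the pointer, the next call through
r->clean frees the remaining three.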

At least one user who was having a problem has reported that this solved
it. It may be one of the issues hitting us as well.

Jack


