Date:      Thu, 9 Aug 2012 08:25:35 -0700 (PDT)
From:      Barney Cordoba <barney_cordoba@yahoo.com>
To:        John Baldwin <jhb@freebsd.org>, Konstantin Belousov <kostikbel@gmail.com>
Cc:        jfv@freebsd.org, Jack Vogel <jfvogel@gmail.com>, net@freebsd.org
Subject:   Re: 82574L hangs (with r233708 e1000 driver).
Message-ID:  <1344525935.85341.YahooMailClassic@web121605.mail.ne1.yahoo.com>
In-Reply-To: <1336775069.17927.YahooMailClassic@web126002.mail.ne1.yahoo.com>

--- On Fri, 5/11/12, Barney Cordoba <barney_cordoba@yahoo.com> wrote:

> From: Barney Cordoba <barney_cordoba@yahoo.com>
> Subject: Re: 82574L hangs (with r233708 e1000 driver).
> To: "John Baldwin" <jhb@freebsd.org>, "Konstantin Belousov" <kostikbel@gmail.com>
> Cc: jfv@freebsd.org, "Jack Vogel" <jfvogel@gmail.com>, net@freebsd.org
> Date: Friday, May 11, 2012, 6:24 PM
>
> --- On Tue, 5/8/12, Konstantin Belousov <kostikbel@gmail.com> wrote:
>
> > From: Konstantin Belousov <kostikbel@gmail.com>
> > Subject: Re: 82574L hangs (with r233708 e1000 driver).
> > To: "John Baldwin" <jhb@freebsd.org>
> > Cc: jfv@freebsd.org, "Jack Vogel" <jfvogel@gmail.com>, net@freebsd.org
> > Date: Tuesday, May 8, 2012, 4:24 AM
> >
> > On Mon, May 07, 2012 at 01:44:57PM -0400, John Baldwin wrote:
> > > On Friday, May 04, 2012 6:18:19 pm Konstantin Belousov wrote:
> > > > On Fri, May 04, 2012 at 11:30:22AM -0400, John Baldwin wrote:
> > > > > On Tuesday, May 01, 2012 12:21:21 pm Konstantin Belousov wrote:
> > > > > > On Thu, Apr 12, 2012 at 09:38:49PM +0300, Konstantin Belousov wrote:
> > > > > > > On Mon, Apr 09, 2012 at 12:19:39PM -0400, John Baldwin wrote:
> > > > > > > > On Sunday, April 08, 2012 1:11:25 am Konstantin Belousov wrote:
> > > > > > > > > On Sat, Apr 07, 2012 at 04:22:07PM -0700, Jack Vogel wrote:
> > > > > > > > > > Make sure you have any firmware up to the latest
> > > > > > > > > > available, if that doesn't help let me know and
> > > > > > > > > > I'll check internally to see if there are any
> > > > > > > > > > outstanding issues in shared code, that will be
> > > > > > > > > > after the weekend.
> > > > > > > > >
> > > > > > > > > I had BIOS rev. 151, after your hint I found rev. 154
> > > > > > > > > on the site. Now BIOS reports itself as
> > > > > > > > > MTCDT10N.86A.0154.2012.0323.1601, March 23.
> > > > > > > > >
> > > > > > > > > Unfortunately, upgrade did not change anything in
> > > > > > > > > regard of hanging interface.
> > > > > > > >
> > > > > > > > Does reverting 233708 make any difference?  Have you tried
> > > > > > > > futzing around with kgdb when it is hung to see what
> > > > > > > > state the device is in (software state at least)?
> > > > > > > It does, in a sense that without r233708 the interface
> > > > > > > becomes stuck almost immediately. I just upgraded to the
> > > > > > > e1000@r234154, which does not change much.
> > > > > > >
> > > > > > > I fiddled with the adapter state after the hang in kgdb
> > > > > > > more, and I noted something interesting. Apparently, tx
> > > > > > > works. When I ping the remote host from my suffering atom
> > > > > > > machine, remote host sees the packet. Also remote machine
> > > > > > > sees some udp traffic originating from the atom, like ntp
> > > > > > > queries.
> > > > > > >
> > > > > > > And, on receive, the atom board does receive interrupts,
> > > > > > > em0:rx 0 counter in vmstat -i increases. Even more fun,
> > > > > > > the sysctl dev.em.0.debug shows increasing hw rdh (as I
> > > > > > > understand, this is hardware 'last received' packet
> > > > > > > pointer for rx ring). So I looked at the packet descriptor
> > > > > > > at hw rdt index, and there I see
> > > > > > > (kgdb) p/x ((struct adapter *)0xffffff80010e4000)->rx_rings->rx_base[78]
> > > > > > > $11 = {buffer_addr = 0x12a128800, length = 0x5ea, csum = 0x3c2b, status = 0x0,
> > > > > > >    errors = 0x0, special = 0x0}
> > > > > > >
> > > > > > > Apparently, the Descriptor Done bit is clear, so the
> > > > > > > em_rxeof() function breaks from the loop, not consuming
> > > > > > > the current packet. Also, it returns false due to DD bit
> > > > > > > clear. This prevents em_msix_rx() from scheduling
> > > > > > > taskqueue for processing. So apparent cause for the hang
> > > > > > > is missing DD bit in descriptor.
> > > > > > >
> > > > > > > I am not sure whether all this is obvious to anybody who
> > > > > > > knows em internals, or where to go from there.
> > > > > >
> > > > > > Ok, nobody cares.
> > > > > >
> > > > > > Below is the workaround I use to prevent the interface
> > > > > > wedging. It seems that the sole PCI register read (namely,
> > > > > > the rx ring head read) and consequent recheck of the
> > > > > > descriptor status greatly reduce the likelihood of the
> > > > > > issue. Unfortunately, the read does not eliminate the hang
> > > > > > completely. So it is not some PCIe coherency problem.
> > > > > >
> > > > > > With the patch applied, I am able to copy around blu-ray
> > > > > > images, while previously the interface hung within 20-30
> > > > > > seconds of 100Mbit/s traffic. Sometimes the messages are
> > > > > > printed:
> > > > > > em0: Workaround: head 1018 tail 1002 cur 1010
> > > > > > em0: Workaround: head 976 tail 973 cur 974
> > > > > > em0: Workaround: head 950 tail 939 cur 946
> > > > > > em0: Workaround: head 435 tail 419 cur 426
> > > > > >
> > > > > > Machine is still dead due to random memory corruption which
> > > > > > I see, in particular, pmap sometimes reads garbage from
> > > > > > PTEs. I have no idea whether it is related to em0 rx
> > > > > > descriptor missed writes, or is a different issue.
> > > > >
> > > > > Humm, so if I'm reading this correctly, the card "skips" a
> > > > > receive descriptor and stores a packet at the next descriptor?
> > > > > That's just bizarre.
> > > > Either this, or it does store the packet but 'forgets' to update
> > > > the rx descriptor. I think that your interpretation is closer to
> > > > reality, since I get sustained 20MB/s over ssh with the patch
> > > > even when workaround activates. The lost packets probably should
> > > > cause retransmit and speed drop.
> > >
> > > This is just weird.  I wonder if there is a known errata for this?
> > > This really seems to be broken hardware and not a driver issue.
> > I was not able to find anything even remotely resembling the
> > described behaviour, in the publicly available 82574L specification
> > update. I looked at rev. 3.5, dated January 2012.
> >
> > I may indeed give up and relocate the hardware into trash, but it
> > would be a pity, since this is new shiny Intel Atom 2800 m/b. I am
> > not sure I can give convincing arguments to supplier for warranty
> > replacement.
> >
> > And, while I booted Debian to apply f/w fix Jack recommended, I did
> > quick test and interface looked stable.
>
> FWIW, I've got an X7SPE-HF-D525 MB with 82574L running on a 7.0 driver
> that seems to work pretty well. It panics once in a blue moon when we
> overload it (like 200Mb/s of traffic) but it generally works ok.
>
> BC

Has anything been done or patched regarding this problem?

BC
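
[Editor's sketch, not part of the original message.] The condition described in the quoted thread (the current rx descriptor has its Descriptor Done bit clear although the hardware ring head has already moved past it) can be illustrated with a small standalone C fragment. This is not the em(4) driver code and not the workaround patch mentioned in the thread, which is not reproduced in this message. The names rx_classify, rx_state and RXD_STAT_DD are hypothetical and chosen for this illustration only. The descriptor fields mirror the kgdb dump in the thread, and the DD bit is assumed to be bit 0 of the status byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: Descriptor Done is bit 0 of the status byte. */
#define RXD_STAT_DD 0x01

/* Field names follow the kgdb dump quoted in the thread. */
struct rx_desc {
    uint64_t buffer_addr;
    uint16_t length;
    uint16_t csum;
    uint8_t  status;
    uint8_t  errors;
    uint16_t special;
};

/* Hypothetical classification of what software sees at index 'cur'. */
enum rx_state {
    RX_READY,   /* DD set: the packet can be consumed */
    RX_IDLE,    /* DD clear and head == cur: nothing pending */
    RX_STALLED  /* DD clear but head has moved on: the reported symptom */
};

/*
 * 'head' stands for a value already read from the hardware ring head
 * register; this sketch performs no register access itself.
 */
static enum rx_state
rx_classify(const struct rx_desc *ring, size_t ring_size,
    size_t cur, size_t head)
{
    assert(cur < ring_size && head < ring_size);

    if (ring[cur].status & RXD_STAT_DD)
        return RX_READY;
    if (head == cur)
        return RX_IDLE;
    return RX_STALLED;
}
```

With the values from the first logged line (head 1018, cur 1010) and a descriptor whose status is 0, rx_classify() reports RX_STALLED. A loop that stops on a clear DD bit, as em_rxeof() is described as doing in the thread, would stay at that index even though the head has advanced.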



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1344525935.85341.YahooMailClassic>