Date:      Fri, 11 May 2012 15:24:29 -0700 (PDT)
From:      Barney Cordoba <barney_cordoba@yahoo.com>
To:        John Baldwin <jhb@freebsd.org>, Konstantin Belousov <kostikbel@gmail.com>
Cc:        jfv@freebsd.org, Jack Vogel <jfvogel@gmail.com>, net@freebsd.org
Subject:   Re: 82574L hangs (with r233708 e1000 driver).
Message-ID:  <1336775069.17927.YahooMailClassic@web126002.mail.ne1.yahoo.com>
In-Reply-To: <20120508082403.GS2358@deviant.kiev.zoral.com.ua>

--- On Tue, 5/8/12, Konstantin Belousov <kostikbel@gmail.com> wrote:

> From: Konstantin Belousov <kostikbel@gmail.com>
> Subject: Re: 82574L hangs (with r233708 e1000 driver).
> To: "John Baldwin" <jhb@freebsd.org>
> Cc: jfv@freebsd.org, "Jack Vogel" <jfvogel@gmail.com>, net@freebsd.org
> Date: Tuesday, May 8, 2012, 4:24 AM
> On Mon, May 07, 2012 at 01:44:57PM -0400, John Baldwin wrote:
> > On Friday, May 04, 2012 6:18:19 pm Konstantin Belousov wrote:
> > > On Fri, May 04, 2012 at 11:30:22AM -0400, John Baldwin wrote:
> > > > On Tuesday, May 01, 2012 12:21:21 pm Konstantin Belousov wrote:
> > > > > On Thu, Apr 12, 2012 at 09:38:49PM +0300, Konstantin Belousov wrote:
> > > > > > On Mon, Apr 09, 2012 at 12:19:39PM -0400, John Baldwin wrote:
> > > > > > > On Sunday, April 08, 2012 1:11:25 am Konstantin Belousov wrote:
> > > > > > > > On Sat, Apr 07, 2012 at 04:22:07PM -0700, Jack Vogel wrote:
> > > > > > > > > Make sure you have any firmware up to the latest available, if that doesn't help
> > > > > > > > > let me know and I'll check internally to see if there are any outstanding issues
> > > > > > > > > in shared code,  that will be after the weekend.
> > > > > > > >
> > > > > > > > I had BIOS rev. 151, after you hint I found rev. 154 on the site.
> > > > > > > > Now BIOS reports itself as MTCDT10N.86A.0154.2012.0323.1601,
> > > > > > > > March 23.
> > > > > > > >
> > > > > > > > Unfortunately, upgrade did not changed anything in regard of hanging
> > > > > > > > interface.
> > > > > > >
> > > > > > > Does reverting 233708 make any difference?  Have you tried futzing around with
> > > > > > > kgdb when it is hung to see what state the device is in (software state at
> > > > > > > least)?
> > > > > > It does, in a sense that without r233708 the interface becomes stuck
> > > > > > almost immediately. I just upgraded to the e1000@r234154, which does not
> > > > > > change much.
> > > > > >
> > > > > > I fiddled with the adapter state after the hang in kgdb more, and I
> > > > > > noted something interesting. Apparently, tx works. When I ping the remote
> > > > > > host from my suffering atom machine, remote host sees the packet. Also
> > > > > > remote machine sees some udp traffic originating from the tom, like
> > > > > > ntp queries.
> > > > > >
> > > > > > And, on receive, the atom board does receive interrupts, em0:rx 0 counter
> > > > > > in vmstat -i increases. Even more fun, the sysctl dev.em.0.debug
> > > > > > shows increasing hw rdh (as I understand, this is hardware 'last
> > > > > > received' packet pointer for rx ring). So I looked at the packet
> > > > > > descriptor at hw rdt index, and there I see
> > > > > > (kgdb) p/x ((struct adapter *)0xffffff80010e4000)->rx_rings->rx_base[78]
> > > > > > $11 = {buffer_addr = 0x12a128800, length = 0x5ea, csum = 0x3c2b, status = 0x0,
> > > > > >   errors = 0x0, special = 0x0}
> > > > > >
> > > > > > Apparently, the Descriptor Done bit is clear, so the em_rxeof() function
> > > > > > breaks from the loop, not consuming the current packet. Also, it returns
> > > > > > false due to DD bit clear. This prevents em_msix_rx() from scheduling
> > > > > > taskqueue for processing. So apparent cause for the hang is missing
> > > > > > DD bit in descriptor.
> > > > > >
> > > > > > I am not sure isn't all this is obvious for anybody who knows em
> > > > > > internals, and were to go from there.
> > > > >
> > > > > Ok, nobody cares.
> > > > >
> > > > > Below is the workaround I use to prevent the interface wedging.
> > > > > It seems that the sole PCI register read (namely, the rx ring head read)
> > > > > and consequent recheck of the descriptor status greatly reduce the
> > > > > likelihood of the issue. Unfortunately, the read does not eliminate
> > > > > the hang completely. So it is not some PCIe coherency problem.
> > > > >
> > > > > With the patch applied, I am able to copy around blu-ray images, while
> > > > > previously the interface hang in 20-30 seconds of 100Mbit/s traffic.
> > > > > Sometimes the messages are printed:
> > > > > em0: Workaround: head 1018 tail 1002 cur 1010
> > > > > em0: Workaround: head 976 tail 973 cur 974
> > > > > em0: Workaround: head 950 tail 939 cur 946
> > > > > em0: Workaround: head 435 tail 419 cur 426
> > > > >
> > > > > Machine is still dead due to random memory corruption which I see, in
> > > > > particular, pmap sometimes read garbage from PTEs. I have no idea is
> > > > > it related to em0 rx descriptor missed writes, or is a different issue.
> > > >
> > > > Humm, so if I'm reading this correctly, the card "skips" a receive
> > > > descriptor and stores a packet at the next descriptor?  That's just
> > > > bizarre.
> > > Either this, or it does store the packet but 'forgots' to update the
> > > rx descriptor. I think that your interpretation is closer to reality,
> > > since I get sustained 20MB/s over ssh with the patch even when workaround
> > > activates. The lost packets probably should cause retransmit and speed
> > > drop.
> >
> > This is just weird.  I wonder if there is a known errata for this?
> > This really seems to be broken hardware and not a driver issue.
> I was not able to find anything even remotely resembling the described
> behaviour, in the publically available 82574L specification update. I looked
> at rev. 3.5, dated January 2012.
>
> I may indeed give up and relocate the hardware into trash, but it would be
> pity, since this is new shiny Intel Atom 2800 m/b. I am not sure I can give
> convincing arguments to supplier for warranty replacement.
>
> And, while I booted Debian to apply f/w fix Jack recommended, I did
> quick test and interface looked stable.
>
>

FWIW, I've got an X7SPE-HF-D525 MB with 82574L running on a 7.0 driver
that seems to work pretty well. It panics once in a blue moon when we
overload it (like 200Mb/s of traffic) but it generally works ok.

BC