Date:      Tue, 8 May 2012 11:24:03 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        jfv@freebsd.org, Jack Vogel <jfvogel@gmail.com>, net@freebsd.org
Subject:   Re: 82574L hangs (with r233708 e1000 driver).
Message-ID:  <20120508082403.GS2358@deviant.kiev.zoral.com.ua>
In-Reply-To: <201205071344.58041.jhb@freebsd.org>
References:  <20120407133715.GU2358@deviant.kiev.zoral.com.ua> <201205041130.22202.jhb@freebsd.org> <20120504221819.GS2358@deviant.kiev.zoral.com.ua> <201205071344.58041.jhb@freebsd.org>



On Mon, May 07, 2012 at 01:44:57PM -0400, John Baldwin wrote:
> On Friday, May 04, 2012 6:18:19 pm Konstantin Belousov wrote:
> > On Fri, May 04, 2012 at 11:30:22AM -0400, John Baldwin wrote:
> > > On Tuesday, May 01, 2012 12:21:21 pm Konstantin Belousov wrote:
> > > > On Thu, Apr 12, 2012 at 09:38:49PM +0300, Konstantin Belousov wrote:
> > > > > On Mon, Apr 09, 2012 at 12:19:39PM -0400, John Baldwin wrote:
> > > > > > On Sunday, April 08, 2012 1:11:25 am Konstantin Belousov wrote:
> > > > > > > On Sat, Apr 07, 2012 at 04:22:07PM -0700, Jack Vogel wrote:
> > > > > > > > > Make sure you have any firmware up to the latest available, if that
> > > > > > > > > doesn't help let me know and I'll check internally to see if there are any
> > > > > > > > > outstanding issues in shared code, that will be after the weekend.
> > > > > > >=20
> > > > > > > I had BIOS rev. 151, after you hint I found rev. 154 on the s=
ite.
> > > > > > > Now BIOS reports itself as MTCDT10N.86A.0154.2012.0323.1601,
> > > > > > > March 23.
> > > > > > >=20
> > > > > > > Unfortunately, upgrade did not changed anything in regard of =
hanging
> > > > > > > interface.
> > > > > >=20
> > > > > > Does reverting 233708 make any difference?  Have you tried futz=
ing=20
> > > around with
> > > > > > kgdb when it is hung to see what state the device is in (softwa=
re state=20
> > > at
> > > > > > least)?
> > > > > > It does, in the sense that without r233708 the interface becomes stuck
> > > > > > almost immediately. I just upgraded to the e1000@r234154, which does not
> > > > > > change much.
> > > > > >
> > > > > > I fiddled with the adapter state after the hang in kgdb more, and I
> > > > > > noted something interesting. Apparently, tx works. When I ping the remote
> > > > > > host from my suffering atom machine, remote host sees the packet. Also
> > > > > > remote machine sees some udp traffic originating from the atom, like
> > > > > > ntp queries.
> > > > >=20
> > > > > And, on receive, the atom board does receive interrupts, em0:rx 0=
 counter
> > > > > in vmstat -i increases. Even more fun, the sysctl dev.em.0.debug
> > > > > shows increasing hw rdh (as I understand, this is hardware 'last
> > > > > received' packet pointer for rx ring). So I looked at the packet
> > > > > descriptor at hw rdt index, and there I see
> > > > > (kgdb) p/x ((struct adapter *)0xffffff80010e4000)->rx_rings->rx_b=
ase[78]
> > > > > $11 =3D {buffer_addr =3D 0x12a128800, length =3D 0x5ea, csum =3D =
0x3c2b, status =3D=20
> > > 0x0,=20
> > > > >   errors =3D 0x0, special =3D 0x0}
> > > > >=20
> > > > > > Apparently, the Descriptor Done bit is clear, so the em_rxeof() function
> > > > > > breaks from the loop, not consuming the current packet. Also, it returns
> > > > > > false due to DD bit clear. This prevents em_msix_rx() from scheduling
> > > > > > taskqueue for processing. So apparent cause for the hang is missing
> > > > > > DD bit in descriptor.
> > > > > >
> > > > > > I am not sure whether all this is obvious to anybody who knows em
> > > > > > internals, and where to go from there.
> > > >=20
> > > > Ok, nobody cares.
> > > >=20
> > > > Below is the workaround I use to prevent the interface wedging.
> > > > It seems that the sole PCI register read (namely, the rx ring head =
read)
> > > > and consequent recheck of the descriptor status greatly reduce the
> > > > likelihood of the issue. Unfortunately, the read does not eliminate
> > > > the hang completely. So it is not some PCIe coherency problem.
> > > >=20
> > > > With the patch applied, I am able to copy around blu-ray images, wh=
ile
> > > > previously the interface hang in 20-30 seconds of 100Mbit/s traffic.
> > > > Sometimes the messages are printed:
> > > > em0: Workaround: head 1018 tail 1002 cur 1010
> > > > em0: Workaround: head 976 tail 973 cur 974
> > > > em0: Workaround: head 950 tail 939 cur 946
> > > > em0: Workaround: head 435 tail 419 cur 426
> > > >=20
> > > > Machine is still dead due to random memory corruption which I see, =
in
> > > > particular, pmap sometimes read garbage from PTEs. I have no idea is
> > > > it related to em0 rx descriptor missed writes, or is a different is=
sue.
> > >=20
> > > Humm, so if I'm reading this correctly, the card "skips" a receive
> > > descriptor and stores a packet at the next descriptor?  That's just
> > > bizarre.
> > > Either this, or it does store the packet but 'forgets' to update the
> > > rx descriptor. I think that your interpretation is closer to reality,
> > > since I get sustained 20MB/s over ssh with the patch even when workaround
> > > activates. The lost packets probably should cause retransmit and speed
> > > drop.
> >
> This is just weird.  I wonder if there is a known errata for this?
> This really seems to be broken hardware and not a driver issue.
I was not able to find anything even remotely resembling the described
behaviour in the publicly available 82574L specification update. I looked
at rev. 3.5, dated January 2012.

I may indeed give up and relocate the hardware into the trash, but it would be
a pity, since this is a shiny new Intel Atom 2800 m/b. I am not sure I can give
convincing arguments to the supplier for a warranty replacement.

And, while I booted Debian to apply the f/w fix Jack recommended, I did a
quick test and the interface looked stable.
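
For anybody following along without the em sources at hand, here is a rough
standalone model of the failure mode discussed in the quoted text: the rx
loop stops at the first descriptor with DD clear, and a stall shows up as
the hardware head having moved on while that descriptor still has DD clear.
This is only a sketch under assumptions, not the driver code and not the
workaround patch. The names rx_ring_model, rx_poll(), rx_looks_stalled() and
the hw_head field (a stand-in for the value read from the rx ring head
register) are made up for illustration; the descriptor field names are taken
from the kgdb output quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RXD_STAT_DD 0x01        /* Descriptor Done bit in this model */

struct rx_desc {                /* field names follow the kgdb dump */
	uint64_t buffer_addr;
	uint16_t length;
	uint16_t csum;
	uint8_t  status;
	uint8_t  errors;
	uint16_t special;
};

struct rx_ring_model {          /* hypothetical, not the driver's struct */
	struct rx_desc *base;
	unsigned size;          /* number of descriptors in the ring */
	unsigned next;          /* next descriptor software will examine */
	unsigned hw_head;       /* stand-in for a read of the rx head register */
};

/*
 * Consume completed descriptors, stopping at the first one whose DD bit
 * is clear. Returns the number of descriptors consumed.
 */
static unsigned
rx_poll(struct rx_ring_model *r, unsigned budget)
{
	unsigned done = 0;

	assert(r->size > 0);
	while (done < budget) {
		struct rx_desc *d = &r->base[r->next];

		if ((d->status & RXD_STAT_DD) == 0)
			break;          /* nothing more to consume */
		d->status = 0;          /* hand the slot back */
		r->next = (r->next + 1) % r->size;
		done++;
	}
	return (done);
}

/*
 * The symptom: hardware head has moved past 'next', yet the descriptor
 * at 'next' still has DD clear, so rx_poll() will never advance.
 */
static bool
rx_looks_stalled(const struct rx_ring_model *r)
{
	return (r->hw_head != r->next &&
	    (r->base[r->next].status & RXD_STAT_DD) == 0);
}
```

In this model a caller would run rx_poll() and, when it consumes nothing,
call rx_looks_stalled() to tell an idle ring from a wedged one. What to do
after that (recheck the status, skip the slot, reset the ring) is left open
here on purpose.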




