From: Konstantin Belousov
To: John Baldwin
Cc: jfv@freebsd.org, Jack Vogel, net@freebsd.org
Date: Tue, 8 May 2012 11:24:03 +0300
Subject: Re: 82574L hangs (with r233708 e1000 driver).
Message-ID: <20120508082403.GS2358@deviant.kiev.zoral.com.ua>
In-Reply-To: <201205071344.58041.jhb@freebsd.org>

On Mon, May 07, 2012 at 01:44:57PM -0400, John Baldwin wrote:
> On Friday, May 04, 2012 6:18:19 pm Konstantin Belousov wrote:
> > On Fri, May 04, 2012 at 11:30:22AM -0400, John Baldwin wrote:
> > > On Tuesday, May 01, 2012 12:21:21 pm Konstantin Belousov wrote:
> > > > On Thu, Apr 12, 2012 at 09:38:49PM +0300, Konstantin Belousov wrote:
> > > > > On Mon, Apr 09, 2012 at 12:19:39PM -0400, John Baldwin wrote:
> > > > > > On Sunday, April 08, 2012 1:11:25 am Konstantin Belousov wrote:
> > > > > > > On Sat, Apr 07, 2012 at 04:22:07PM -0700, Jack Vogel wrote:
> > > > > > > > Make sure you have any firmware up to the latest available, if that
> > > > > > > > doesn't help let me know and I'll check internally to see if there
> > > > > > > > are any outstanding issues in shared code, that will be after the
> > > > > > > > weekend.
> > > > > > >
> > > > > > > I had BIOS rev. 151, after your hint I found rev. 154 on the site.
> > > > > > > Now BIOS reports itself as MTCDT10N.86A.0154.2012.0323.1601,
> > > > > > > March 23.
> > > > > > >
> > > > > > > Unfortunately, the upgrade did not change anything with regard to
> > > > > > > the hanging interface.
> > > > > >
> > > > > > Does reverting 233708 make any difference? Have you tried futzing
> > > > > > around with kgdb when it is hung to see what state the device is in
> > > > > > (software state at least)?
> > > > > It does, in the sense that without r233708 the interface becomes stuck
> > > > > almost immediately. I just upgraded to the e1000@r234154, which does not
> > > > > change much.
> > > > >
> > > > > I fiddled with the adapter state after the hang in kgdb more, and I
> > > > > noted something interesting. Apparently, tx works. When I ping the remote
> > > > > host from my suffering atom machine, the remote host sees the packet. Also
> > > > > the remote machine sees some udp traffic originating from the atom, like
> > > > > ntp queries.
> > > > >
> > > > > And, on receive, the atom board does receive interrupts, the em0:rx 0
> > > > > counter in vmstat -i increases. Even more fun, the sysctl dev.em.0.debug
> > > > > shows increasing hw rdh (as I understand, this is the hardware 'last
> > > > > received' packet pointer for the rx ring). So I looked at the packet
> > > > > descriptor at hw rdt index, and there I see
> > > > > (kgdb) p/x ((struct adapter *)0xffffff80010e4000)->rx_rings->rx_base[78]
> > > > > $11 = {buffer_addr = 0x12a128800, length = 0x5ea, csum = 0x3c2b,
> > > > > status = 0x0, errors = 0x0, special = 0x0}
> > > > >
> > > > > Apparently, the Descriptor Done bit is clear, so the em_rxeof() function
> > > > > breaks from the loop, not consuming the current packet. Also, it returns
> > > > > false due to the DD bit being clear. This prevents em_msix_rx() from
> > > > > scheduling the taskqueue for processing. So the apparent cause of the
> > > > > hang is the missing DD bit in the descriptor.
> > > > >
> > > > > I am not sure whether all this is obvious to anybody who knows em
> > > > > internals, or where to go from there.
> > > >
> > > > Ok, nobody cares.
> > > >
> > > > Below is the workaround I use to prevent the interface wedging.
> > > > It seems that the sole PCI register read (namely, the rx ring head read)
> > > > and consequent recheck of the descriptor status greatly reduce the
> > > > likelihood of the issue.
> > > > Unfortunately, the read does not eliminate
> > > > the hang completely. So it is not some PCIe coherency problem.
> > > >
> > > > With the patch applied, I am able to copy around blu-ray images, while
> > > > previously the interface hung within 20-30 seconds of 100Mbit/s traffic.
> > > > Sometimes the messages are printed:
> > > > em0: Workaround: head 1018 tail 1002 cur 1010
> > > > em0: Workaround: head 976 tail 973 cur 974
> > > > em0: Workaround: head 950 tail 939 cur 946
> > > > em0: Workaround: head 435 tail 419 cur 426
> > > >
> > > > Machine is still dead due to random memory corruption which I see, in
> > > > particular, pmap sometimes reads garbage from PTEs. I have no idea whether
> > > > it is related to em0 rx descriptor missed writes, or is a different issue.
> > >
> > > Humm, so if I'm reading this correctly, the card "skips" a receive
> > > descriptor and stores a packet at the next descriptor? That's just
> > > bizarre.
> > Either this, or it does store the packet but 'forgets' to update the
> > rx descriptor. I think that your interpretation is closer to reality,
> > since I get sustained 20MB/s over ssh with the patch even when the workaround
> > activates. The lost packets probably should cause retransmit and speed
> > drop.
>
> This is just weird. I wonder if there is a known errata for this?
> This really seems to be broken hardware and not a driver issue.

I was not able to find anything even remotely resembling the described
behaviour in the publicly available 82574L specification update.
I looked at rev. 3.5, dated January 2012.

I may indeed give up and relocate the hardware into the trash, but it would
be a pity, since this is a new shiny Intel Atom 2800 m/b. I am not sure I can
give convincing arguments to the supplier for a warranty replacement.

And, while I booted Debian to apply the f/w fix Jack recommended, I did a
quick test and the interface looked stable.
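
The receive-path symptom discussed in the quoted text can be illustrated
with a small standalone C model. This is only a sketch under stated
assumptions, not the em(4) driver code and not the workaround patch from
this thread: the descriptor fields follow the kgdb dump above, while
RING_SIZE, the STATUS_DD value, the hw_head field (standing in for a read
of the rx ring head register) and the functions rx_classify() and
rx_consume() are hypothetical names chosen for illustration.

```c
#include <stdint.h>

#define RING_SIZE 8      /* assumed small ring, for illustration only */
#define STATUS_DD 0x01   /* assumed bit value for "Descriptor Done" */

/* Field names mirror the descriptor printed by kgdb in the quoted text. */
struct rx_desc {
	uint64_t buffer_addr;
	uint16_t length;
	uint16_t csum;
	uint8_t  status;
	uint8_t  errors;
	uint16_t special;
};

/* Hypothetical software view of one receive ring. */
struct rx_ring {
	struct rx_desc base[RING_SIZE];
	unsigned next_to_check; /* next descriptor software will examine */
	unsigned hw_head;       /* stand-in for the head value read from hardware */
};

enum rx_state { RX_READY, RX_IDLE, RX_STALLED };

/*
 * Classify the current descriptor:
 *  - DD set: a packet is ready to be consumed;
 *  - DD clear and hardware head equals our index: nothing pending;
 *  - DD clear but hardware head has moved on: the symptom described in
 *    the thread (hardware advanced, descriptor status never updated).
 */
static enum rx_state
rx_classify(const struct rx_ring *r)
{
	const struct rx_desc *d = &r->base[r->next_to_check];

	if (d->status & STATUS_DD)
		return (RX_READY);
	if (r->hw_head == r->next_to_check)
		return (RX_IDLE);
	return (RX_STALLED);
}

/*
 * Consume descriptors until the first one without DD, the way the
 * receive loop is described above. Returns the number consumed.
 */
static unsigned
rx_consume(struct rx_ring *r)
{
	unsigned n = 0;

	while (n < RING_SIZE) {
		struct rx_desc *d = &r->base[r->next_to_check];

		if ((d->status & STATUS_DD) == 0)
			break;          /* loop stops here, packet or not */
		d->status = 0;          /* hand the descriptor back */
		r->next_to_check = (r->next_to_check + 1) % RING_SIZE;
		n++;
	}
	return (n);
}
```

In this model a ring whose hw_head is ahead of next_to_check while the
current descriptor has DD clear is reported as RX_STALLED, which
corresponds to the "head ... cur ..." pairs in the workaround messages
quoted above. What to do in that state (recheck, skip, or reset) is left
open here, since the thread does not settle it.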