Date:      Fri, 17 Mar 2017 16:41:01 +0100
From:      "O. Hartmann" <ohartmann@walstatt.org>
To:        Alexander Leidinger <Alexander@leidinger.net>
Cc:        freebsd-current@freebsd.org, sbruno@freebsd.org, mmacy@nextbsd.org
Subject:   Re: CURRENT: massive em0 NIC problems since IFLIB changes/introduction
Message-ID:  <20170317164101.0518ac67@thor.intern.walstatt.dynvpn.de>
In-Reply-To: <20170317141501.Horde.YTCr8GuMV2yI1YaUkdRTLlu@webmail.leidinger.net>
References:  <20170317122018.21384497@freyja.zeit4.iv.bundesimmobilien.de> <20170317141501.Horde.YTCr8GuMV2yI1YaUkdRTLlu@webmail.leidinger.net>


On Fri, 17 Mar 2017 14:15:01 +0100
Alexander Leidinger <Alexander@leidinger.net> wrote:

> Quoting "O. Hartmann" <ohartmann@walstatt.org> (from Fri, 17 Mar 2017
> 12:20:18 +0100):
>
> > Since the introduction of the IFLIB changes, I have been seeing severe
> > problems on CURRENT.
>
> I already reported something like this to sbruno@ and M. Macy (in copy).
>
> > Running the most recent CURRENT (FreeBSD 12.0-CURRENT #27 r315442: Fri Mar 17
> > 10:46:04 CET 2017  amd64), the problems on a workstation got severe within
> > the past two days:
> >
> > For a couple of weeks the em0 NIC (Intel i217-LM, see below) has been dying
> > under heavy I/O. I first noticed this when "rsync"ing poudriere repositories
> > to a remote NFSv4 (automounted) folder. The em0 device could be revived by an
> > ifconfig down/up procedure.
> > But not only the i217-LM chip is affected. On another box equipped with an
> > i350 dual port GBit NIC I observed similar behaviour under (artificially)
> > high I/O load (but I didn't investigate that further since it occurred very
> > seldom).
>
> It's not only those chipsets.
>
> It may be beneficial if you could provide the pciconf output for those
> devices. Mine is:
> ---snip---
> em0@pci0:2:6:0: class=0x020000 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
>      vendor     = 'Intel Corporation'
>      device     = '82541PI Gigabit Ethernet Controller'
> ---snip---
>
> > Now, since around yesterday, the i217-LM dies and cannot be revived with
> > ifconfig down/up: Doing so, my FreeBSD CURRENT machine (Fujitsu Celsius M740)
>
> I don't know whether a simple down/up would help for the chip I see this
> issue with (it's a headless server in a remote datacenter). For the
> moment I'm using the workaround of something like
> "ping -C 1 <gateway> || shutdown -r now" in crontab.
>
> The system in question is at r314137.
>
> > remains with a dead em0 device, reporting "no route" on some occasions but
> > stuck in the dead state. Every attempt to re-establish the route manually
> > fails; only rebooting the box gives some relief.
> >
> > On the console, I see some very strange reports:
> >
> > - ping suddenly reports no buffer space
> > - or I sometimes see massive occurrences of
> >   "em0: TX(0) desc avail = 1024, pidx = 0" on the console
>
> I don't see this in messages or console log, but I see in the logs that
> ntpd can't resolve hostnames.
>
> > Either way, sending/receiving large files over an established GBit network
> > line, which can be saturated at approx. 100 MBytes/s, tends to make the NIC
> > fail.
>
> I can report that an "svnlite update" of the FreeBSD src tree on the box
> is able to trigger the issue in my case.
>
> I have to add that before the iflib changes I saw frequent
> em-watchdog timeouts in the logs / dmesg. So for me we have two issues
> here:
>   - the hardware wasn't 100% supported before the iflib changes (it seems)
>   - the iflib changes have lost some watchdog functionality /
>     auto-failure-recovery feature
>
> Bye,
> Alexander.
>
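
The crontab workaround quoted above (ping the gateway, reboot on failure) could be extended to try the ifconfig down/up cycle first and reboot only if that does not help. What follows is only a minimal sketch under assumptions: the names GATEWAY, IFACE, SETTLE, probe, bounce, reboot_box and watchdog are hypothetical, 192.0.2.1 is a placeholder address, and it uses "ping -c 1" to send a single echo request. It has not been run against the failing em0 hardware.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; none of these names come from the thread.
GATEWAY="${GATEWAY:-192.0.2.1}"   # assumption: placeholder gateway address
IFACE="${IFACE:-em0}"             # interface to bounce
SETTLE="${SETTLE:-10}"            # seconds to wait after bringing the link up

# Each step is its own function so it can be replaced for a dry run.
probe()      { ping -c 1 "$GATEWAY" >/dev/null 2>&1; }
bounce()     { ifconfig "$IFACE" down; sleep 2; ifconfig "$IFACE" up; }
reboot_box() { shutdown -r now; }

# Prints what it decided: "ok", "recovered" or "reboot".
watchdog() {
    if probe; then echo ok; return 0; fi
    bounce
    sleep "$SETTLE"
    if probe; then echo recovered; return 0; fi
    echo reboot
    reboot_box
}

# Only act when the script is called with the argument "run".
if [ "${1:-}" = "run" ]; then
    watchdog
fi
```

Saved as a script and started from crontab with the argument "run", this would reboot only when the link stays dead after one down/up attempt. On a box where down/up no longer revives the NIC, it simply degrades to the original ping-or-reboot behaviour.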

In January (18.01.2017), I reported some strange behaviour of the same box to
Sean Bruno, along with some details of the hardware (I forgot to include them
in the email you're responding to, sorry), so here it is again:

[...]
Again, here is the pciconf output of the device:

em0@pci0:0:25:0:        class=0x020000 card=0x11ed1734 chip=0x153a8086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Connection I217-LM'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xfb300000, size 131072, enabled
    bar   [14] = type Memory, range 32, base 0xfb339000, size 4096, enabled
    bar   [18] = type I/O Port, range 32, base 0xf020, size 32, enabled
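
For comparing affected chips across reports, note that the chip= value in the pciconf line packs two 16-bit PCI IDs: the upper half is the device ID and the lower half is the vendor ID (0x8086 is Intel). A small sketch that splits it, where the function name split_chip is hypothetical and the input is assumed to be the first line of a pciconf -lv entry:

```shell
#!/bin/sh
# Hypothetical helper: print "vendor device" for a pciconf line
# that contains a chip=0xDDDDVVVV field.
split_chip() {
    # Pull out the eight hex digits that follow "chip=0x".
    chip=$(printf '%s\n' "$1" | sed -n 's/.*chip=0x\([0-9a-fA-F]\{8\}\).*/\1/p')
    [ -n "$chip" ] || return 1
    device=$(printf '%s' "$chip" | cut -c1-4)   # upper 16 bits
    vendor=$(printf '%s' "$chip" | cut -c5-8)   # lower 16 bits
    printf '0x%s 0x%s\n' "$vendor" "$device"
}
```

Called with the em0 line above, split_chip prints "0x8086 0x153a". For the 82541PI quoted earlier (chip=0x107c8086) it prints "0x8086 0x107c".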

[...]
The problem has become severe within the past two days. I ran CURRENT
buildworlds on a daily basis, did poudriere builds several times and tried to
sync them to the package repository server - and that failed dramatically as
described above, starting yesterday.

-- 
O. Hartmann

I object to the use or transfer of my data for advertising purposes or for
market or opinion research (§ 28 para. 4 BDSG).



