Date:      Tue, 24 Jan 2017 17:57:06 +0000
From:      Eric Joyner <erj@freebsd.org>
To:        Daniel Genis <daniel.genis@gmx.de>, freebsd-stable@freebsd.org
Subject:   Re: intel 10gbe nic bug in 10.3 - no carrier
Message-ID:  <CA%2Bb0zg8pu7_K1Aqj3wifZb%2BzcB5iGshq7eVEMHKu30y994QzaA@mail.gmail.com>
In-Reply-To: <11f0e9e6-cfe7-1cc7-49a0-4bc42fd0f99a@gmx.de>
References:  <11f0e9e6-cfe7-1cc7-49a0-4bc42fd0f99a@gmx.de>

On Tue, Jan 10, 2017 at 2:38 AM Daniel Genis <daniel.genis@gmx.de> wrote:

> Hello everyone,
>
> we're trying to tackle a rare bug that is very hard to debug.
>
> Our 10.3-RELEASE servers can, when rebooting after a panic, come up
> without network (both NICs report "no carrier"). We've seen this on
> 10.3-RELEASE-p0; we had never seen it before.
>
> root@storage ~ # pciconf -lv | grep -B3 network
> ix0@pci0:2:0:0:    class=0x020000 card=0xd10f19e5 chip=0x10fb8086 rev=0x01 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
>     class      = network
> --
> ix1@pci0:2:0:1:    class=0x020000 card=0xd10f19e5 chip=0x10fb8086 rev=0x01 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
>     class      = network
>
> Our network is configured as active/passive using lagg (/etc/rc.conf):
>
> ifconfig_ix0="up"
> ifconfig_ix1="up"
> cloned_interfaces="lagg0"
> ifconfig_lagg0="laggproto failover laggport ix0 laggport ix1 10.1.2.31/16"
>
> After a reboot following a panic, the interfaces show up like this:
>
> ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>     options=8407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO>
>     ether 60:08:10:d0:4e:9f
>     nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
>     media: <unknown type> (autoselect)
>     status: no carrier
> ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>     options=8407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO>
>     ether 60:08:10:d0:4e:9f
>     nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
>     media: <unknown type> (autoselect)
>     status: no carrier
>
> The network switch sees the link as up. The NIC LEDs suggest the same
> (they are lit as in normal operation). Unplugging and replugging the
> network cable brings the network online. Shutting the port on the switch
> and re-enabling it also brings the network online. However, another
> reboot returns the machine to the no-carrier state.
>
> I've built various kernels trying to find where the regression is,
> without success. I tried porting the 10.2 NIC driver (2.8.3) to 10.3 and
> subsequently the lagg code as well. I also ported NIC driver 3.1.14 from
> pfSense into 10.3-STABLE (December 2 kernel), to no avail. When rebooting
> with any of those kernels, the server remains in the no-carrier state.
>
> We install our systems using mfsbsd for PXE boot. If I boot a machine
> that is in the "no carrier" state using the 10.3 PXE boot, both NICs
> come online. If I then boot from disk again, the machine returns to the
> "no carrier" state.
>
> Recently we got some new machines, which lets us shuffle machines around
> and try to debug this. We did a base install on one using mfsbsd 10.3 PXE
> and configured it as usual. Interestingly, one of its two NICs entered
> the "no carrier" state while the other NIC remained operational. This
> persisted across reboots.
>
> The issue can disappear after many reboots, but this is not conclusive.
> I've had 2 machines to experiment with.
>
> On one, the problem disappeared on its own after a reboot (kernel
> 10.3-STABLE git hash d99ba5c aka r299900(?)).
>
> On the other one I PXE booted a 10.1 live environment and then booted
> into kernel 10.3-STABLE (git hash 3673260fc9 aka r308456(?)). But I don't
> think anything can be concluded from that. That was the machine which had
> both NICs online after booting into the 10.3 PXE environment but
> subsequently returned to the no-carrier state when booting 10.3 from
> disk.
>
> We also tried many sysctl flags (and many reboots), but without success.
> For example: hw.ix.enable_msix=0 and hw.ix.enable_msi=0
>
> At the moment I have no spare/empty machine in this state. We will,
> however, empty one machine that is currently in this state (but is in
> production right now).
> I don't know how to trigger this state manually, which doesn't help with
> debugging.
>
> I can link references where others report similar issues, for example
> https://www.reddit.com/r/PFSENSE/comments/45bcuq/10_gig_woes/
> There they suggest that the new Intel NIC driver 3.1.14 fixes it, though
> I was not able to resolve the state by booting into a kernel with this
> driver.
>
> If I can provide any additional information please do not hesitate to ask.
>
> Any tips and suggestions for debugging are most welcome!
>
> With kind regards,
>
> Daniel

This is a late follow-up, but could you file this as a bug on
bugs.freebsd.org?

- Eric


