From owner-freebsd-stable@freebsd.org Thu Jan 26 16:51:10 2017
Return-Path:
Delivered-To: freebsd-stable@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 345EECC3607
 for ; Thu, 26 Jan 2017 16:51:10 +0000 (UTC)
 (envelope-from daniel.genis@gmx.de)
Received: from mout.gmx.net (mout.gmx.net [212.227.15.15])
 (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "mout.gmx.net", Issuer "TeleSec ServerPass DE-2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 9C1051AC3;
 Thu, 26 Jan 2017 16:51:09 +0000 (UTC)
 (envelope-from daniel.genis@gmx.de)
Received: from [192.168.101.182] ([80.113.31.106])
 by mail.gmx.com (mrgmx003 [212.227.17.190])
 with ESMTPSA (Nemesis) id 0LaWlT-1c57em2VcF-00mHLC;
 Thu, 26 Jan 2017 17:45:47 +0100
Subject: Re: intel 10gbe nic bug in 10.3 - no carrier
To: Eric Joyner , freebsd-stable@freebsd.org, s.munaut@whatever-company.com
References: <11f0e9e6-cfe7-1cc7-49a0-4bc42fd0f99a@gmx.de>
From: Daniel Genis
Message-ID:
Date: Thu, 26 Jan 2017 17:45:47 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Production branch of FreeBSD source code
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 26 Jan 2017 16:51:10 -0000

Hi Eric and Todd,

Thanks for your answers.

I will try the Linux boot, that's an excellent suggestion!

I will file a bug report once I have a bit more info.

The issue sounds very similar to the one described in the thread
"Re: [E1000-devel] X550EM_X_SFP "lockup" / link issues".
The problem is not always present and I'm unable to trigger it manually,
which makes it tough to debug :-)
Slowly but surely I'm getting closer to triaging the bug.
It sounds oh so eerily similar to the "lockup" issue Sylvain is describing.

Next week we'll empty a machine which is in this "state", then I have
another test machine to play with.

For the record, we have multiple link partners (2 Juniper models and one
Huawei model). I've seen this issue with all link partners.

Maybe we never had this issue on 10.1/10.2-STABLE because those kernels
never panicked on us.

With kind regards,

Daniel

On 01/24/2017 06:57 PM, Eric Joyner wrote:
> On Tue, Jan 10, 2017 at 2:38 AM Daniel Genis wrote:
>
>> Hello everyone,
>>
>> we're trying to tackle a rare bug that is very hard to debug.
>>
>> Our 10.3-RELEASE servers can panic boot and subsequently can come up
>> without network (2x - no carrier). We've seen this on 10.3-RELEASE-p0;
>> we have never seen this before.
>>
>> root@storage ~ # pciconf -lv | grep -B3 network
>> ix0@pci0:2:0:0: class=0x020000 card=0xd10f19e5 chip=0x10fb8086 rev=0x01 hdr=0x00
>>     vendor     = 'Intel Corporation'
>>     device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
>>     class      = network
>> --
>> ix1@pci0:2:0:1: class=0x020000 card=0xd10f19e5 chip=0x10fb8086 rev=0x01 hdr=0x00
>>     vendor     = 'Intel Corporation'
>>     device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
>>     class      = network
>>
>> Our network is configured as active/passive using lagg (/etc/rc.conf):
>>
>> ifconfig_ix0="up"
>> ifconfig_ix1="up"
>> cloned_interfaces="lagg0"
>> ifconfig_lagg0="laggproto failover laggport ix0 laggport ix1 10.1.2.31/16"
>>
>> After a panic boot the network shows up like this:
>>
>> ix0: flags=8843 metric 0 mtu 1500
>>         options=8407bb
>>         ether 60:08:10:d0:4e:9f
>>         nd6 options=29
>>         media: (autoselect)
>>         status: no carrier
>> ix1: flags=8843 metric 0 mtu 1500
>>         options=8407bb
>>         ether 60:08:10:d0:4e:9f
>>         nd6 options=29
>>         media: (autoselect)
>>         status: no carrier
>>
>> The network switch sees the connection as online. The LEDs of the NICs
>> suggest the same, they see the network as online (LEDs are on like in
>> normal operation). Unplugging/replugging the network cable will get the
>> network online. Shutting the port on the switch and reenabling it will
>> also get the network online. However, another reboot will return the
>> machine into the no-carrier state.
>>
>> I've built various kernels trying to find where the regression is,
>> without success. I tried porting the 10.2 NIC driver (2.8.3) to 10.3 and
>> subsequently the lagg code as well. I ported NIC driver 3.1.14 from
>> pfSense into 10.3-STABLE (2 December kernel) to no avail; porting the
>> lagg code from 10.2 did not make any difference either. Rebooting with
>> those kernels, the server remains in the no-carrier state.
>>
>> We install our systems using mfsbsd for PXE boot. If I boot a machine
>> which has the "no carrier" state using the 10.3 PXE boot, both NICs
>> come online. If I then boot from disk again the machine returns into the
>> "no carrier" state.
>>
>> Recently we got some new machines, so we can shuffle more around and
>> also try to debug this. We base-installed one using mfsbsd 10.3 PXE and
>> configured it like always. Here, interestingly, one of the two NICs
>> entered the "no carrier" state while the other NIC remained operational.
>> This remained persistent across reboots.
>>
>> The issue disappears after many reboots, but it's not conclusive. I've
>> had 2 machines with which I could experiment.
>>
>> On one, the problem disappeared on its own after a reboot (kernel
>> 10.3-STABLE git hash d99ba5c aka r299900(?)).
>>
>> On the other one I PXE booted a 10.1 live environment and then
>> subsequently booted into kernel 10.3-STABLE (git hash 3673260fc9 aka
>> r308456(?)). But I don't think anything can be concluded from that. That
>> was the machine which had both NICs online after booting into the 10.3
>> PXE environment but subsequently returned into the no-carrier state when
>> booting 10.3 from disk.
>>
>> We also tried many sysctl flags (and many reboots), but without success.
>> For example: hw.ix.enable_msix=0 and hw.ix.enable_msi=0
>>
>> At the moment I have no spare/empty machine in this state. We will empty
>> one machine, though, which currently has this state (but is in production
>> right now).
>> I don't know how to trigger this state manually, which doesn't help for
>> debugging.
>>
>> I could link references where others report similar issues, for example
>> https://www.reddit.com/r/PFSENSE/comments/45bcuq/10_gig_woes/
>> Here they suggest that the new Intel NIC driver 3.1.14 fixes it. Though
>> I was not able to resolve the state by booting into a kernel with this
>> driver.
>>
>> If I can provide any additional information please do not hesitate to ask.
>>
>> Any tips and suggestions for debugging are most welcome!
>>
>> With kind regards,
>>
>> Daniel
>> _______________________________________________
>> freebsd-stable@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
>> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
>>
> This is a late follow-up, but could you file this as a bug on
> bugs.freebsd.org?
>
> - Eric
>
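
Since the "no carrier" state cannot be triggered on demand, it may help to
detect it automatically after each boot. The following is only a minimal
sketch, not something taken from this thread: the function name
no_carrier_ifaces is hypothetical, and it assumes ifconfig output laid out
like the ix0/ix1 blocks quoted above (an unindented "name:" header line
followed by indented detail lines, one of which reads "status: no carrier").

```shell
#!/bin/sh
# Hypothetical helper (not from the original message): read ifconfig-style
# text on stdin and print the name of every interface whose block contains
# "status: no carrier".
no_carrier_ifaces() {
    awk '
        {
            first = substr($0, 1, 1)
            # An unindented, non-empty line starts a new interface block,
            # e.g. "ix0: flags=8843 metric 0 mtu 1500".
            if (first != "" && first != " " && first != "\t") {
                ifname = $1
                sub(/:$/, "", ifname)   # drop the trailing colon
            }
        }
        /status: no carrier/ { print ifname }
    '
}

# Example use on a live system (assumption: run as part of a boot-time check):
#   ifconfig | no_carrier_ifaces
```

The output could then be recorded with a timestamp so that the affected
interface and the boot on which it happened are known when filing the bug
report.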