From: Harald Schmalzbauer
Date: Wed, 11 Feb 2015 17:55:17 +0100
To: Jack Vogel
Cc: FreeBSD Stable
Subject: Re: igb(4) watchdog timeout, lagg(4) fails

Regarding Harald Schmalzbauer's message of 10.01.2015 11:51 (localtime):
> Regarding Jack Vogel's message of 09.01.2015 18:46 (localtime):
>> The tuneable interrupt rate code is not mine, and looking at it I'm not
>> entirely
>> sure it works. Why are you focused on the interrupt rate anyway, do you have
>> some reason to tie it to the watchdog?
>>
>> You could turn AIM off (enable_aim) and see if that changed anything?
>>
>> It seems most the time problems show up they involve the use of lagg, if you
>> take it out of the mix does the problem go away?
…
> Is there a way to reset the interface without rebooting the machine? The
> watchdog doesn't really reset the device, it's in non-operating state
> afterwards. I need to 'ifconfig down' it for bringin lagg(4) back into
> operational state.
> Some kind of D3D0-state switch for a single address? kldunloading would
> destroy the remaining interface too…

I could isolate the igb watchdog timeout problem a bit.
It only happens on NICs which take the PCH-PCIe route. NICs that are
connected to the CPU's PCIe root complex never show this issue.
Currently, I suffer from one unresponsive NIC which shows the following
states:

dev.igb.1.%desc: Intel(R) PRO/1000 Network Connection version - 2.4.0
dev.igb.1.%driver: igb
dev.igb.1.%location: slot=0 function=0 handle=\_SB_.PCI0.PE70.S1F0
dev.igb.1.%pnpinfo: vendor=0x8086 device=0x10c9 subvendor=0x8086 subdevice=0xa03c class=0x020000
dev.igb.1.%parent: pci11
dev.igb.1.nvm: -1
dev.igb.1.enable_aim: 1
dev.igb.1.fc: 3
dev.igb.1.rx_processing_limit: 250
dev.igb.1.link_irq: 848
^^^^^^^^^^^^^^ 848???
dev.igb.1.dropped: 0
dev.igb.1.tx_dma_fail: 0
dev.igb.1.rx_overruns: 0
dev.igb.1.watchdog_timeouts: 414
dev.igb.1.device_control: 1488978497
dev.igb.1.rx_control: 67272738
dev.igb.1.interrupt_mask: 4
dev.igb.1.extended_int_mask: 2147483655
dev.igb.1.tx_buf_alloc: 0
dev.igb.1.rx_buf_alloc: 0
dev.igb.1.fc_high_water: 47488
dev.igb.1.fc_low_water: 47472
dev.igb.1.queue0.interrupt_rate: 0
dev.igb.1.queue0.txd_head: 0
dev.igb.1.queue0.txd_tail: 0
dev.igb.1.queue0.no_desc_avail: 2520
dev.igb.1.queue0.tx_packets: 43894
dev.igb.1.queue0.rxd_head: 0
dev.igb.1.queue0.rxd_tail: 0
dev.igb.1.queue0.rx_packets: 1918054
dev.igb.1.queue0.rx_bytes: 0
dev.igb.1.queue0.lro_queued: 0
dev.igb.1.queue0.lro_flushed: 0
dev.igb.1.queue1.interrupt_rate: 0
dev.igb.1.queue1.txd_head: 0
dev.igb.1.queue1.txd_tail: 0
dev.igb.1.queue1.no_desc_avail: 17
dev.igb.1.queue1.tx_packets: 36813
dev.igb.1.queue1.rxd_head: 0
dev.igb.1.queue1.rxd_tail: 0
dev.igb.1.queue1.rx_packets: 63738
dev.igb.1.queue1.rx_bytes: 0
dev.igb.1.queue1.lro_queued: 0
dev.igb.1.queue1.lro_flushed: 0
…
dev.igb.1.interrupts.asserts: 5890499
dev.igb.1.interrupts.rx_pkt_timer: 2103707
dev.igb.1.interrupts.rx_abs_timer: 0
dev.igb.1.interrupts.tx_pkt_timer: 0
dev.igb.1.interrupts.tx_abs_timer: 1983448
dev.igb.1.interrupts.tx_queue_empty: 50493
dev.igb.1.interrupts.tx_queue_min_thresh: 0
dev.igb.1.interrupts.rx_desc_min_thresh: 0
dev.igb.1.interrupts.rx_overrun: 0

The dev.igb.1.link_irq value doesn't really make sense, does it?
The rest isn't unusual; it just shows the watchdog timeout problem, because
of dev.igb.1.queue0.no_desc_avail I guess.

I manually adjusted:
hw.igb.num_queues: 2
hw.igb.rx_process_limit: 250
hw.igb.rxd: 4096
hw.igb.txd: 4096

As mentioned, the NICs not going through PCH-PCIe don't show this
problem (cross-checked).

This is the regular timeout interval for the last 24h (~3 minutes):
Feb 11 17:26:53 vega kernel: igb1: Watchdog timeout -- resetting
Feb 11 17:26:53 vega kernel: igb1: Queue(911600000) tdh = 2068077355, hw tdt = 397078446
Feb 11 17:26:53 vega kernel: igb1: TX(911600000) desc avail = 0,Next TX to Clean = 0
Feb 11 17:26:53 vega kernel: igb1: link state changed to DOWN
Feb 11 17:26:56 vega kernel: igb1: link state changed to UP
Feb 11 17:26:56 vega devd: Executing '/etc/rc.d/dhclient quietstart igb1'
Feb 11 17:30:10 vega kernel: igb1: Watchdog timeout -- resetting
Feb 11 17:30:10 vega kernel: igb1: Queue(911600000) tdh = 2068077355, hw tdt = 397078446
Feb 11 17:30:10 vega kernel: igb1: TX(911600000) desc avail = 0,Next TX to Clean = 0
Feb 11 17:30:10 vega kernel: igb1: link state changed to DOWN
Feb 11 17:30:13 vega kernel: igb1: link state changed to UP

But these resets don't bring the interface back into a working state :-(

The "Queue" value remains constant, but "tdh" and "tdt" vary from time to
time, for example:
igb1: Queue(911600000) tdh = -337225283, hw tdt = 398180458

Unfortunately I don't know what they stand for. Is there an explanation
for people who don't want to dig for it in the driver's code?

Any idea where the problem could be?
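
A minimal sketch for cross-checking the watchdog log lines above. It assumes (not confirmed anywhere in this thread) that tdh and tdt are the hardware transmit descriptor head and tail indices, in which case they should stay below the ring size set via hw.igb.txd (4096 here). The helper names `timeout_intervals` and `out_of_range_indices` are hypothetical, and the year has to be supplied because syslog timestamps omit it.

```python
import re
from datetime import datetime

# hw.igb.txd as set in this message; used as the assumed TX ring size.
TXD_RING_SIZE = 4096

# Excerpt of the log lines quoted in this message.
SAMPLE_LOG = """\
Feb 11 17:26:53 vega kernel: igb1: Watchdog timeout -- resetting
Feb 11 17:26:53 vega kernel: igb1: Queue(911600000) tdh = 2068077355, hw tdt = 397078446
Feb 11 17:30:10 vega kernel: igb1: Watchdog timeout -- resetting
Feb 11 17:30:10 vega kernel: igb1: Queue(911600000) tdh = 2068077355, hw tdt = 397078446
"""

TIMEOUT_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: \w+: Watchdog timeout"
)
QUEUE_RE = re.compile(r"Queue\((-?\d+)\) tdh = (-?\d+), hw tdt = (-?\d+)")


def timeout_intervals(log_text, year=2015):
    """Return the seconds between consecutive 'Watchdog timeout' lines."""
    stamps = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.match(line)
        if match:
            stamps.append(
                datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
            )
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


def out_of_range_indices(log_text, ring_size=TXD_RING_SIZE):
    """Return (queue, tdh, tdt) tuples whose tdh or tdt lies outside the ring.

    Assumption: tdh/tdt are descriptor indices, so valid values are
    0 <= value < ring_size.
    """
    suspicious = []
    for line in log_text.splitlines():
        match = QUEUE_RE.search(line)
        if match:
            queue, tdh, tdt = (int(group) for group in match.groups())
            if not (0 <= tdh < ring_size and 0 <= tdt < ring_size):
                suspicious.append((queue, tdh, tdt))
    return suspicious


if __name__ == "__main__":
    print("seconds between timeouts:", timeout_intervals(SAMPLE_LOG))
    print("out-of-range entries:", out_of_range_indices(SAMPLE_LOG))
```

If that assumption holds, every tdh/tdt pair logged here is far outside a 4096-entry ring, which would suggest the register reads themselves return garbage rather than the ring merely being full.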

Thanks,

-Harry