From: Jack Vogel
To: Harald Schmalzbauer
Cc: FreeBSD Stable
Date: Wed, 11 Feb 2015 09:31:06 -0800
Subject: Re: igb(4) watchdog timeout, lagg(4) fails
In-Reply-To: <54DB8975.2030001@omnilan.de>

tdh and tdt are the head and tail indices of the transmit descriptor
ring, and these values are obviously severely borked :)

I'm buried with some other issues, but I'll try and find some time to
look at this a bit more.

Jack

On Wed, Feb 11, 2015 at 8:55 AM, Harald Schmalzbauer
<h.schmalzbauer@omnilan.de> wrote:

> Regarding Harald Schmalzbauer's message of 10.01.2015 11:51
> (localtime):
> > Regarding Jack Vogel's message of 09.01.2015 18:46 (localtime):
> >> The tunable interrupt rate code is not mine, and looking at it I'm
> >> not entirely sure it works. Why are you focused on the interrupt
> >> rate anyway, do you have some reason to tie it to the watchdog?
> >>
> >> You could turn AIM off (enable_aim) and see if that changed
> >> anything?
> >>
> >> It seems most of the time problems show up they involve the use of
> >> lagg, if you take it out of the mix does the problem go away?
> ...
>
> > Is there a way to reset the interface without rebooting the machine?
> > The watchdog doesn't really reset the device, it's in a non-operating
> > state afterwards. I need to 'ifconfig down' it to bring lagg(4) back
> > into an operational state.
> > Some kind of D3D0-state switch for a single address? kldunloading
> > would destroy the remaining interface too...
>
> I could isolate the igb watchdog timeout problem a bit.
> It only happens on NICs which take the PCH-PCIe route. NICs that are
> connected to the CPU's PCIe root complex never show this issue.
>
> Currently, I suffer from one unresponsive NIC which shows the following
> states:
> dev.igb.1.%desc: Intel(R) PRO/1000 Network Connection version - 2.4.0
> dev.igb.1.%driver: igb
> dev.igb.1.%location: slot=0 function=0 handle=\_SB_.PCI0.PE70.S1F0
> dev.igb.1.%pnpinfo: vendor=0x8086 device=0x10c9 subvendor=0x8086 subdevice=0xa03c class=0x020000
> dev.igb.1.%parent: pci11
> dev.igb.1.nvm: -1
> dev.igb.1.enable_aim: 1
> dev.igb.1.fc: 3
> dev.igb.1.rx_processing_limit: 250
> dev.igb.1.link_irq: 848
> ^^^^^^^^^^^^^^ 848???
> dev.igb.1.dropped: 0
> dev.igb.1.tx_dma_fail: 0
> dev.igb.1.rx_overruns: 0
> dev.igb.1.watchdog_timeouts: 414
> dev.igb.1.device_control: 1488978497
> dev.igb.1.rx_control: 67272738
> dev.igb.1.interrupt_mask: 4
> dev.igb.1.extended_int_mask: 2147483655
> dev.igb.1.tx_buf_alloc: 0
> dev.igb.1.rx_buf_alloc: 0
> dev.igb.1.fc_high_water: 47488
> dev.igb.1.fc_low_water: 47472
> dev.igb.1.queue0.interrupt_rate: 0
> dev.igb.1.queue0.txd_head: 0
> dev.igb.1.queue0.txd_tail: 0
> dev.igb.1.queue0.no_desc_avail: 2520
> dev.igb.1.queue0.tx_packets: 43894
> dev.igb.1.queue0.rxd_head: 0
> dev.igb.1.queue0.rxd_tail: 0
> dev.igb.1.queue0.rx_packets: 1918054
> dev.igb.1.queue0.rx_bytes: 0
> dev.igb.1.queue0.lro_queued: 0
> dev.igb.1.queue0.lro_flushed: 0
> dev.igb.1.queue1.interrupt_rate: 0
> dev.igb.1.queue1.txd_head: 0
> dev.igb.1.queue1.txd_tail: 0
> dev.igb.1.queue1.no_desc_avail: 17
> dev.igb.1.queue1.tx_packets: 36813
> dev.igb.1.queue1.rxd_head: 0
> dev.igb.1.queue1.rxd_tail: 0
> dev.igb.1.queue1.rx_packets: 63738
> dev.igb.1.queue1.rx_bytes: 0
> dev.igb.1.queue1.lro_queued: 0
> dev.igb.1.queue1.lro_flushed: 0
> ...
> dev.igb.1.interrupts.asserts: 5890499
> dev.igb.1.interrupts.rx_pkt_timer: 2103707
> dev.igb.1.interrupts.rx_abs_timer: 0
> dev.igb.1.interrupts.tx_pkt_timer: 0
> dev.igb.1.interrupts.tx_abs_timer: 1983448
> dev.igb.1.interrupts.tx_queue_empty: 50493
> dev.igb.1.interrupts.tx_queue_min_thresh: 0
> dev.igb.1.interrupts.rx_desc_min_thresh: 0
> dev.igb.1.interrupts.rx_overrun: 0
>
> The dev.igb.1.link_irq value doesn't really make sense, does it?
>
> The rest isn't unusual, it just shows the watchdog timeout problem
> because of dev.igb.1.queue0.no_desc_avail, I guess.
>
> I manually adjusted:
> hw.igb.num_queues: 2
> hw.igb.rx_process_limit: 250
> hw.igb.rxd: 4096
> hw.igb.txd: 4096
>
> As mentioned, the NICs not going through PCH-PCIe don't show this
> problem, falsified.
>
> This is the regular timeout interval for the last 24h (~3 minutes):
> Feb 11 17:26:53 vega kernel: igb1: Watchdog timeout -- resetting
> Feb 11 17:26:53 vega kernel: igb1: Queue(911600000) tdh = 2068077355, hw tdt = 397078446
> Feb 11 17:26:53 vega kernel: igb1: TX(911600000) desc avail = 0,Next TX to Clean = 0
> Feb 11 17:26:53 vega kernel: igb1: link state changed to DOWN
> Feb 11 17:26:56 vega kernel: igb1: link state changed to UP
> Feb 11 17:26:56 vega devd: Executing '/etc/rc.d/dhclient quietstart igb1'
> Feb 11 17:30:10 vega kernel: igb1: Watchdog timeout -- resetting
> Feb 11 17:30:10 vega kernel: igb1: Queue(911600000) tdh = 2068077355, hw tdt = 397078446
> Feb 11 17:30:10 vega kernel: igb1: TX(911600000) desc avail = 0,Next TX to Clean = 0
> Feb 11 17:30:10 vega kernel: igb1: link state changed to DOWN
> Feb 11 17:30:13 vega kernel: igb1: link state changed to UP
>
> But these resets don't bring the interface back into a working state :-(
> The "Queue" value remains constant, but "tdh" and "tdt" vary from time
> to time, for example:
> igb1: Queue(911600000) tdh = -337225283, hw tdt = 398180458
>
> Unfortunately I don't know what they stand for. Is there an explanation
> for people who don't just look it up in the driver's code?
> Any idea where the problem could be?
>
> Thanks,
>
> -Harry
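
To illustrate why the tdh/tdt values in the log look corrupted: a ring index has to be smaller than the ring size, which is 4096 here according to hw.igb.txd. The following is a minimal Python sketch, assuming the usual convention that the device consumes descriptors at the head and the driver produces them at the tail. The function names are hypothetical and are not taken from the igb(4) driver.

```python
def ring_index_valid(index, ring_size):
    """Return True if index can be a position in a ring of ring_size slots."""
    return 0 <= index < ring_size


def descriptors_pending(head, tail, ring_size):
    """Count descriptors between head and tail, allowing for wrap-around.

    Assumes the device consumes at head and the driver produces at tail.
    """
    if not (ring_index_valid(head, ring_size)
            and ring_index_valid(tail, ring_size)):
        raise ValueError("head or tail is outside the ring")
    return (tail - head) % ring_size


RING_SIZE = 4096  # hw.igb.txd from the report

# (tdh, tdt) pairs copied from the watchdog messages in the report
for tdh, tdt in [(2068077355, 397078446), (-337225283, 398180458)]:
    print("tdh =", tdh, "valid:", ring_index_valid(tdh, RING_SIZE))
    print("tdt =", tdt, "valid:", ring_index_valid(tdt, RING_SIZE))
```

Neither reported pair falls inside 0..4095, which fits the reading that the values printed by the watchdog are not real ring positions.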