From: Maciej Wierzbicki
Date: Tue, 23 Feb 2010 13:54:57 +0100
To: freebsd-net@freebsd.org
Cc: Jack Vogel
Subject: Re: Intel em0: watchdog timeout

Jack Vogel wrote on 2010-02-22 20:13:
> 7.2 seems to be a stable base OS and driver, 8 is
> better in some respects, but has not been without its reported problems.
> I leave the choice to you.

Let me join this thread, as I am also suffering from em watchdog timeouts.

In my case it is a 7.2-RELEASE box doing HAProxy load balancing for several
web servers. As far as I can tell, the watchdogs are not related to the
traffic rate: I can have a low rate of around 50 Mbps with timeouts every
minute, and 200-300 Mbps with long periods without any timeouts. There is no
visible pattern. em is built into the kernel.

Typical watchdog timeout log:

Feb 22 21:21:31 CSBP kernel: em0: watchdog timeout -- resetting
Feb 22 21:21:31 CSBP kernel: em0: link state changed to DOWN
Feb 22 21:21:34 CSBP kernel: em0: link state changed to UP
Feb 22 21:43:33 CSBP kernel: em0: watchdog timeout -- resetting
Feb 22 21:43:33 CSBP kernel: em0: link state changed to DOWN
Feb 22 21:43:36 CSBP kernel: em0: link state changed to UP

OK, here is some data:

FreeBSD 7.2-RELEASE-p5 #2: Thu Dec 10 14:21:26 CET 2009

kern.ipc.nmbclusters="262144"

I never saw anything close to resource exhaustion in netstat -m:

5999/28441/34440 mbufs in use (current/cache/total)
3240/18468/21708/262144 mbuf clusters in use (current/cache/total/max)
3239/17881 mbuf+clusters out of packet secondary zone in use (current/cache)
2673/10297/12970/204800 4k (page size) jumbo clusters in use (current/cache/total/max)
18796K/85234K/104030K bytes allocated to network (current/cache/total)

em0: port 0xa000-0xa01f mem 0xe9080000-0xe909ffff,0xe9000000-0xe907ffff,0xe90a0000-0xe90a3fff irq 16 at device 0.0 on pci2
em0: Using MSIX interrupts
em1: port 0xb000-0xb01f mem 0xeb020000-0xeb03ffff,0xeb000000-0xeb01ffff irq 16 at device 0.0 on pci3
em1: Using MSI interrupt

Feb 23 13:20:43 CSBP kernel: em0: Excessive collisions = 0
Feb 23 13:20:43 CSBP kernel: em0: Sequence errors = 0
Feb 23 13:20:43 CSBP kernel: em0: Defer count = 0
Feb 23 13:20:43 CSBP kernel: em0: Missed Packets = 3371167
Feb 23 13:20:43 CSBP kernel: em0: Receive No Buffers = 257
Feb 23 13:20:43 CSBP kernel: em0: Receive Length Errors = 1
Feb 23 13:20:43 CSBP kernel: em0: Receive errors = 0
Feb 23 13:20:43 CSBP kernel: em0: Crc errors = 0
Feb 23 13:20:43 CSBP kernel: em0: Alignment errors = 0
Feb 23 13:20:43 CSBP kernel: em0: Collision/Carrier extension errors = 0
Feb 23 13:20:43 CSBP kernel: em0: RX overruns = 416328
Feb 23 13:20:43 CSBP kernel: em0: watchdog timeouts = 1210
Feb 23 13:20:43 CSBP kernel: em0: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
Feb 23 13:20:43 CSBP kernel: em0: XON Rcvd = 0
Feb 23 13:20:43 CSBP kernel: em0: XON Xmtd = 0
Feb 23 13:20:43 CSBP kernel: em0: XOFF Rcvd = 0
Feb 23 13:20:43 CSBP kernel: em0: XOFF Xmtd = 0
Feb 23 13:20:43 CSBP kernel: em0: Good Packets Rcvd = 9534885245
Feb 23 13:20:43 CSBP kernel: em0: Good Packets Xmtd = 12866598217
Feb 23 13:20:43 CSBP kernel: em0: TSO Contexts Xmtd = 3515091251
Feb 23 13:20:43 CSBP kernel: em0: TSO Contexts Failed = 0

Feb 23 13:21:14 CSBP kernel: em1: Excessive collisions = 0
Feb 23 13:21:14 CSBP kernel: em1: Sequence errors = 0
Feb 23 13:21:14 CSBP kernel: em1: Defer count = 0
Feb 23 13:21:14 CSBP kernel: em1: Missed Packets = 171
Feb 23 13:21:14 CSBP kernel: em1: Receive No Buffers = 1112
Feb 23 13:21:14 CSBP kernel: em1: Receive Length Errors = 0
Feb 23 13:21:14 CSBP kernel: em1: Receive errors = 0
Feb 23 13:21:14 CSBP kernel: em1: Crc errors = 0
Feb 23 13:21:14 CSBP kernel: em1: Alignment errors = 0
Feb 23 13:21:14 CSBP kernel: em1: Collision/Carrier extension errors = 0
Feb 23 13:21:14 CSBP kernel: em1: RX overruns = 5
Feb 23 13:21:14 CSBP kernel: em1: watchdog timeouts = 0
Feb 23 13:21:14 CSBP kernel: em1: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
Feb 23 13:21:14 CSBP kernel: em1: XON Rcvd = 0
Feb 23 13:21:14 CSBP kernel: em1: XON Xmtd = 0
Feb 23 13:21:14 CSBP kernel: em1: XOFF Rcvd = 0
Feb 23 13:21:14 CSBP kernel: em1: XOFF Xmtd = 0
Feb 23 13:21:14 CSBP kernel: em1: Good Packets Rcvd = 11350337360
Feb 23 13:21:14 CSBP kernel: em1: Good Packets Xmtd = 9594728760
Feb 23 13:21:14 CSBP kernel: em1: TSO Contexts Xmtd = 30554321
Feb 23 13:21:14 CSBP kernel: em1: TSO Contexts Failed = 0

This is neither a faulty em0 card nor a problem with that particular card
model: I tested both cases by replacing em0 with a different card (the same
model as my em1 above), with the same result.

One more thing I should mention: with the current em0 card, each watchdog
timeout left the network unresponsive for 1-2 minutes, i.e. after the
watchdog fired, the box was unreachable for 1 to 2 minutes. I managed to cut
that down to an "acceptable" 2-3 seconds with:

kern.ipc.nmbjumbop=204800

With a NIC of the same type as em1 in place of em0, the watchdogs still
occur, but the box is unresponsive for only 2-3 seconds "by default", without
modifying kern.ipc.nmbjumbop.

What else can I do (or report) to narrow down the problem? Are there any
patches I should try? :-)

Thanks & regards
-- 
* Maciej Wierzbicki * At paranoia's poison door *
* VOO1-RIPE *
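The report hinges on two numbers: how often em0 resets, and how long the link stays down after each reset. A small script can pull both out of /var/log/messages. The following is a minimal sketch, not a FreeBSD tool: the helper names (watchdog_events, gaps_between_resets, SAMPLE_LOG) are hypothetical, it assumes the exact syslog line format quoted in the report, and because classic syslog timestamps carry no year, the caller must supply one (2010 here is an assumption). It only measures link DOWN to UP time; the 1-2 minute unreachability described above may outlast the link coming back up, so an external ping log would still be needed to capture that.

```python
import re
from datetime import datetime

# Hypothetical helper (not part of FreeBSD). Assumes syslog lines like:
#   Feb 22 21:21:31 CSBP kernel: em0: watchdog timeout -- resetting
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(?P<ifname>[a-z]+\d+): (?P<msg>.*)$"
)

# The log excerpt quoted in the report, used as sample input.
SAMPLE_LOG = """\
Feb 22 21:21:31 CSBP kernel: em0: watchdog timeout -- resetting
Feb 22 21:21:31 CSBP kernel: em0: link state changed to DOWN
Feb 22 21:21:34 CSBP kernel: em0: link state changed to UP
Feb 22 21:43:33 CSBP kernel: em0: watchdog timeout -- resetting
Feb 22 21:43:33 CSBP kernel: em0: link state changed to DOWN
Feb 22 21:43:36 CSBP kernel: em0: link state changed to UP
"""


def parse_stamp(stamp, year):
    # Syslog omits the year, so it is prepended before parsing.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def watchdog_events(lines, ifname="em0", year=2010):
    """Return (reset_time, seconds_down) for each watchdog reset on ifname.

    seconds_down runs from the 'link state changed to DOWN' line that follows
    a reset to the next 'link state changed to UP' line on the same interface.
    """
    events = []
    reset_at = down_at = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None or m.group("ifname") != ifname:
            continue  # not a kernel line for this interface
        t = parse_stamp(m.group("stamp"), year)
        msg = m.group("msg")
        if msg.startswith("watchdog timeout"):
            reset_at, down_at = t, None
        elif msg == "link state changed to DOWN" and reset_at is not None:
            down_at = t
        elif msg == "link state changed to UP" and down_at is not None:
            events.append((reset_at, (t - down_at).total_seconds()))
            reset_at = down_at = None
    return events


def gaps_between_resets(events):
    """Seconds elapsed between consecutive watchdog resets."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]


if __name__ == "__main__":
    events = watchdog_events(SAMPLE_LOG.splitlines())
    for when, down in events:
        print(f"{when:%b %d %H:%M:%S}: link down for {down:.0f} s")
    print("gaps between resets (s):", gaps_between_resets(events))
```

On the quoted excerpt this reports two resets, each followed by 3 seconds of link-down time, 1322 seconds apart. Pointing it at the full log before and after a tuning change such as kern.ipc.nmbjumbop would give a comparable summary of both.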