Date:      Tue, 23 Feb 2010 13:54:57 +0100
From:      Maciej Wierzbicki <voovoos-fnet@killfile.pl>
To:        freebsd-net@freebsd.org
Cc:        Jack Vogel <jfvogel@gmail.com>
Subject:   Re: Intel em0: watchdog timeout
Message-ID:  <4B83D021.7020201@killfile.pl>
In-Reply-To: <2a41acea1002221113v26804200q4f3971c3359dffab@mail.gmail.com>
References:  <529374128DC1B04D9D037911B8E8F05301C17A51@Exchange26.EDU.epsb.ca>	<43416_1266864062_4B82CFBE_43416_81_1_2a41acea1002221043k1b8742c9m8fb484a8e8a4fdda@mail.gmail.com>	<529374128DC1B04D9D037911B8E8F05301C17A54@Exchange26.EDU.epsb.ca> <2a41acea1002221113v26804200q4f3971c3359dffab@mail.gmail.com>

Jack Vogel wrote on 2010-02-22 20:13:

> 7.2 seems to be a stable base OS and driver, 8 is better in some respects,
> but has not been without its reported problems. I leave the choice to you.

Let me sneak into this thread, as I am also suffering from em watchdog
timeouts. In my case a 7.2-RELEASE box runs HAProxy as a load balancer
for several webservers. As far as I can tell, the watchdogs are not
related to traffic rate: at a low rate near 50 Mbps I can get a timeout
every minute, while at 200-300 Mbps I can go long periods without any.
I see no regularity in it. em is built into the kernel. A typical
watchdog timeout log:

Feb 22 21:21:31 CSBP kernel: em0: watchdog timeout -- resetting
Feb 22 21:21:31 CSBP kernel: em0: link state changed to DOWN
Feb 22 21:21:34 CSBP kernel: em0: link state changed to UP
Feb 22 21:43:33 CSBP kernel: em0: watchdog timeout -- resetting
Feb 22 21:43:33 CSBP kernel: em0: link state changed to DOWN
Feb 22 21:43:36 CSBP kernel: em0: link state changed to UP
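
To check the "no visible regularity" impression against a longer log, the gaps between resets and the length of each link outage can be pulled out of the messages file. The following is only a rough sketch: the function names are made up for illustration, the year 2010 is assumed because syslog timestamps omit it, and the sample is limited to the six lines shown above.

```python
from datetime import datetime

# Sample lines copied from the log above. Syslog omits the year, so 2010
# (the year of this message) is an assumption used only for parsing.
LOG = """\
Feb 22 21:21:31 CSBP kernel: em0: watchdog timeout -- resetting
Feb 22 21:21:31 CSBP kernel: em0: link state changed to DOWN
Feb 22 21:21:34 CSBP kernel: em0: link state changed to UP
Feb 22 21:43:33 CSBP kernel: em0: watchdog timeout -- resetting
Feb 22 21:43:33 CSBP kernel: em0: link state changed to DOWN
Feb 22 21:43:36 CSBP kernel: em0: link state changed to UP
"""

def parse_time(line, year=2010):
    # The first three whitespace-separated fields are month, day, HH:MM:SS.
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")

def watchdog_gaps(log, ifname="em0"):
    """Return (seconds between consecutive resets, link-down durations in seconds)."""
    resets, outages, down_at = [], [], None
    for line in log.splitlines():
        if f"{ifname}:" not in line:
            continue
        if "watchdog timeout" in line:
            resets.append(parse_time(line))
        elif "changed to DOWN" in line:
            down_at = parse_time(line)
        elif "changed to UP" in line and down_at is not None:
            outages.append((parse_time(line) - down_at).total_seconds())
            down_at = None
    gaps = [(b - a).total_seconds() for a, b in zip(resets, resets[1:])]
    return gaps, outages

gaps, outages = watchdog_gaps(LOG)
print("seconds between resets:", gaps)
print("seconds of link down:  ", outages)
```

For the excerpt above this gives one gap of 1322 seconds (about 22 minutes) and two link outages of 3 seconds each. Run over a full day of messages, the list of gaps would show whether the timeouts cluster at all.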

OK, here is some data:
FreeBSD 7.2-RELEASE-p5 #2: Thu Dec 10 14:21:26 CET 2009
kern.ipc.nmbclusters="262144"

I have never seen anything close to resource exhaustion in netstat -m:
5999/28441/34440 mbufs in use (current/cache/total)
3240/18468/21708/262144 mbuf clusters in use (current/cache/total/max)
3239/17881 mbuf+clusters out of packet secondary zone in use (current/cache)
2673/10297/12970/204800 4k (page size) jumbo clusters in use (current/cache/total/max)
18796K/85234K/104030K bytes allocated to network (current/cache/total)
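
To put a number on "nothing close to exhaustion", the cluster lines can be reduced to a percentage of the configured maximum. This is a small sketch, not an official tool: the regular expression and function name are made up for illustration, and it assumes the four-field current/cache/total/max layout shown above.

```python
import re

# Lines copied from the netstat -m output above (jumbo line on one line).
NETSTAT = """\
3240/18468/21708/262144 mbuf clusters in use (current/cache/total/max)
2673/10297/12970/204800 4k (page size) jumbo clusters in use (current/cache/total/max)
"""

# Four slash-separated counters, then the pool name, then "in use".
PATTERN = re.compile(r"^(\d+)/(\d+)/(\d+)/(\d+) (.+?) in use")

def cluster_usage(text):
    """Map each pool name to (current, total, max, total as a percentage of max)."""
    usage = {}
    for line in text.splitlines():
        m = PATTERN.match(line)
        if m:
            current, _cache, total, limit = (int(m.group(i)) for i in range(1, 5))
            usage[m.group(5)] = (current, total, limit, 100.0 * total / limit)
    return usage

for name, (current, total, limit, pct) in cluster_usage(NETSTAT).items():
    print(f"{name}: {total} of {limit} allocated ({pct:.2f}%), {current} in active use")
```

With the figures above, standard clusters sit at roughly 8.3% of the limit and 4k jumbo clusters at roughly 6.3%, counting cached clusters as allocated.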


em0: <Intel(R) PRO/1000 Network Connection 6.9.6> port 0xa000-0xa01f mem 0xe9080000-0xe909ffff,0xe9000000-0xe907ffff,0xe90a0000-0xe90a3fff irq 16 at device 0.0 on pci2
em0: Using MSIX interrupts
em1: <Intel(R) PRO/1000 Network Connection 6.9.6> port 0xb000-0xb01f mem 0xeb020000-0xeb03ffff,0xeb000000-0xeb01ffff irq 16 at device 0.0 on pci3
em1: Using MSI interrupt

Feb 23 13:20:43 CSBP kernel: em0: Excessive collisions = 0
Feb 23 13:20:43 CSBP kernel: em0: Sequence errors = 0
Feb 23 13:20:43 CSBP kernel: em0: Defer count = 0
Feb 23 13:20:43 CSBP kernel: em0: Missed Packets = 3371167
Feb 23 13:20:43 CSBP kernel: em0: Receive No Buffers = 257
Feb 23 13:20:43 CSBP kernel: em0: Receive Length Errors = 1
Feb 23 13:20:43 CSBP kernel: em0: Receive errors = 0
Feb 23 13:20:43 CSBP kernel: em0: Crc errors = 0
Feb 23 13:20:43 CSBP kernel: em0: Alignment errors = 0
Feb 23 13:20:43 CSBP kernel: em0: Collision/Carrier extension errors = 0
Feb 23 13:20:43 CSBP kernel: em0: RX overruns = 416328
Feb 23 13:20:43 CSBP kernel: em0: watchdog timeouts = 1210
Feb 23 13:20:43 CSBP kernel: em0: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
Feb 23 13:20:43 CSBP kernel: em0: XON Rcvd = 0
Feb 23 13:20:43 CSBP kernel: em0: XON Xmtd = 0
Feb 23 13:20:43 CSBP kernel: em0: XOFF Rcvd = 0
Feb 23 13:20:43 CSBP kernel: em0: XOFF Xmtd = 0
Feb 23 13:20:43 CSBP kernel: em0: Good Packets Rcvd = 9534885245
Feb 23 13:20:43 CSBP kernel: em0: Good Packets Xmtd = 12866598217
Feb 23 13:20:43 CSBP kernel: em0: TSO Contexts Xmtd = 3515091251
Feb 23 13:20:43 CSBP kernel: em0: TSO Contexts Failed = 0

Feb 23 13:21:14 CSBP kernel: em1: Excessive collisions = 0
Feb 23 13:21:14 CSBP kernel: em1: Sequence errors = 0
Feb 23 13:21:14 CSBP kernel: em1: Defer count = 0
Feb 23 13:21:14 CSBP kernel: em1: Missed Packets = 171
Feb 23 13:21:14 CSBP kernel: em1: Receive No Buffers = 1112
Feb 23 13:21:14 CSBP kernel: em1: Receive Length Errors = 0
Feb 23 13:21:14 CSBP kernel: em1: Receive errors = 0
Feb 23 13:21:14 CSBP kernel: em1: Crc errors = 0
Feb 23 13:21:14 CSBP kernel: em1: Alignment errors = 0
Feb 23 13:21:14 CSBP kernel: em1: Collision/Carrier extension errors = 0
Feb 23 13:21:14 CSBP kernel: em1: RX overruns = 5
Feb 23 13:21:14 CSBP kernel: em1: watchdog timeouts = 0
Feb 23 13:21:14 CSBP kernel: em1: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
Feb 23 13:21:14 CSBP kernel: em1: XON Rcvd = 0
Feb 23 13:21:14 CSBP kernel: em1: XON Xmtd = 0
Feb 23 13:21:14 CSBP kernel: em1: XOFF Rcvd = 0
Feb 23 13:21:14 CSBP kernel: em1: XOFF Xmtd = 0
Feb 23 13:21:14 CSBP kernel: em1: Good Packets Rcvd = 11350337360
Feb 23 13:21:14 CSBP kernel: em1: Good Packets Xmtd = 9594728760
Feb 23 13:21:14 CSBP kernel: em1: TSO Contexts Xmtd = 30554321
Feb 23 13:21:14 CSBP kernel: em1: TSO Contexts Failed = 0
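
For a side-by-side view of the two interfaces, the receive-side counters can be turned into a rate. The sketch below is only illustrative: the function names are made up, only a subset of the lines above is included, and it assumes that "Good Packets Rcvd" does not already include the missed packets, which I have not verified against the driver.

```python
import re

# A subset of the statistics lines above, copied verbatim.
STATS = """\
Feb 23 13:20:43 CSBP kernel: em0: Missed Packets = 3371167
Feb 23 13:20:43 CSBP kernel: em0: RX overruns = 416328
Feb 23 13:20:43 CSBP kernel: em0: watchdog timeouts = 1210
Feb 23 13:20:43 CSBP kernel: em0: Good Packets Rcvd = 9534885245
Feb 23 13:21:14 CSBP kernel: em1: Missed Packets = 171
Feb 23 13:21:14 CSBP kernel: em1: RX overruns = 5
Feb 23 13:21:14 CSBP kernel: em1: watchdog timeouts = 0
Feb 23 13:21:14 CSBP kernel: em1: Good Packets Rcvd = 11350337360
"""

# Matches "kernel: emN: <counter name> = <value>" with a single trailing value.
COUNTER = re.compile(r"kernel: (em\d+): (.+?) = (\d+)$")

def parse_counters(text):
    """Return {interface: {counter name: value}} for every matching line."""
    counters = {}
    for line in text.splitlines():
        m = COUNTER.search(line)
        if m:
            counters.setdefault(m.group(1), {})[m.group(2)] = int(m.group(3))
    return counters

def missed_per_million(c):
    # Assumption: good and missed packets are disjoint counts.
    missed, good = c["Missed Packets"], c["Good Packets Rcvd"]
    return 1e6 * missed / (missed + good)

counters = parse_counters(STATS)
for ifname, c in sorted(counters.items()):
    print(f"{ifname}: {missed_per_million(c):.2f} missed per million received, "
          f"{c['RX overruns']} RX overruns, {c['watchdog timeouts']} watchdog timeouts")
```

Under that assumption em0 misses on the order of 350 packets per million while em1 stays well below one per million, so the receive-side drops sit on the same interface as all of the watchdog timeouts.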

This is neither a fault in the em0 hardware itself nor a problem with
that particular type of card, because I tested both cases: I replaced
em0 with a different card (the same model as my em1 above) and got the
same result.

There is one more thing I should mention: with the current em0 card, a
watchdog timeout results in 1-2 minutes of unresponsive network. That
is, whenever the watchdog fired, the box was unreachable for 1 to 2
minutes. I managed to reduce that to an "acceptable" 2-3 seconds by
setting kern.ipc.nmbjumbop=204800

When I install a NIC of the same type as em1, the watchdogs still occur,
but the box is unresponsive for only 2-3 seconds "by default", without
modifying kern.ipc.nmbjumbop.

What else can I do (or report) to narrow down the problem, or are there
any patches I should try? :-)

Thanks & regards
-- 
*   Maciej Wierzbicki * At paranoia's poison door  *
*   VOO1-RIPE   *


