From owner-freebsd-stable@FreeBSD.ORG  Tue Jan 16 18:04:24 2007
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
X-Original-To: freebsd-stable@freebsd.org
Delivered-To: freebsd-stable@freebsd.org
Date: Tue, 16 Jan 2007 12:44:11 -0500 (EST)
From: Mike Andrews <mandrews@bit0.com>
To: freebsd-stable@freebsd.org
Message-ID: <20070116123019.I46509@bit0.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Subject: 6.2-RELEASE em0 watchdog timeouts -- sometimes (w/ partial
	workaround)

I have a strange issue with em0 watchdog timeouts that I think is not the 
same as the ones everyone was having during the 6.2 beta cycle...

I have six systems, each with two Intel GigE ports onboard:

Systems A and B: Supermicro PDSMi+
Systems C and D: Supermicro PDSMi (without the plus)
System E: Tyan S2730U3GN
System F: Supermicro X5DPA-GG

On each system:
em0 is connected to a Cisco Catalyst 2960G layer 2 gigabit Ethernet switch.
em1 is connected to a Foundry ServerIron XL layer 4-7 Fast Ethernet switch.

All six run FreeBSD 6.2-RELEASE i386, even though the first four are 
capable of running amd64.  They all have 2 GB of memory, except E which 
has 4 GB.  The kernel configs are all identical, and are not that far from 
GENERIC + SMP.

Several times a day, em0 will go down, log a watchdog timeout error on 
the console, then come right back up on its own a few seconds later.  But 
here's the weird twist: it ONLY happens on systems A and B, and ONLY when 
running at gigabit speed.  If I force the two switch ports down to 
100 Mbit/s, the problem goes away.
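
To put a number on "several times a day", something like the following 
could tally the timeouts per interface and hour.  This is only a rough 
sketch: it assumes the console messages also land in /var/log/messages in 
the usual syslog layout and contain the text "emN: watchdog timeout"; the 
function name and the regular expression are mine, not anything from the 
driver.

```python
import re
from collections import Counter

# Assumed syslog layout: "Mon DD HH:MM:SS host kernel: emN: watchdog timeout ..."
# The exact message text may differ between driver versions.
WATCHDOG = re.compile(
    r"^(\w{3}\s+\d+)\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+kernel:\s+(em\d+): watchdog timeout"
)


def count_watchdog_timeouts(lines):
    """Count watchdog timeouts, keyed by (interface, 'Mon DD', hour)."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG.match(line)
        if match:
            day, hour, iface = match.groups()
            # Collapse the padding syslog uses for single-digit days.
            counts[(iface, " ".join(day.split()), int(hour))] += 1
    return counts


if __name__ == "__main__":
    # The log path is an assumption; adjust it to wherever kernel messages go.
    with open("/var/log/messages", errors="replace") as log:
        for key, n in sorted(count_watchdog_timeouts(log).items()):
            print(key, n)
```

Running it on A and B before and after forcing the switch ports to 
100 Mbit/s would show whether the timeouts really stop or just get rarer.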

The other four systems, C through F, never have watchdog timeout issues; 
they always work perfectly, even at gigabit speed.

So I'm trying to figure out whether there are any other obvious hardware 
differences between the plus and non-plus versions of the PDSMi that could 
be causing issues on the plus version.  Fortunately, at the moment we are 
not (yet) pushing anywhere near even 100 Mbit/s of traffic through these 
ports, so running them at 100 Mbit/s is a tolerable workaround...  just 
kinda annoying. :)

The chipset is a bit different: the PDSMi uses the Intel E7230 chipset for 
Pentium D servers, whereas the PDSMi+ uses the E3000, which adds Core 2 Duo 
support.  But apparently the NIC chips are identical: 82573V for em0 and 
82573L for em1.  The BIOS is identical too, so the chipsets must be pretty 
similar.  Nothing shares an IRQ with the NICs.  (USB is disabled in the 
BIOS.)  They do have different disk systems; A and B are SATA gmirror 
setups, while C and D use LSI MegaRAID SCSI cards for their mirrors.
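
One way to hunt for less obvious differences would be to save a hardware 
inventory from a plus and a non-plus board (for example the output of 
`pciconf -lv`) and compare the two.  A minimal sketch, assuming the two 
inventories are already saved as text; the function name and the file 
names in the usage note are hypothetical:

```python
import difflib


def inventory_diff(text_a, text_b):
    """Return only the lines that differ between two inventories.

    Lines present only in text_a are prefixed with "- ",
    lines present only in text_b with "+ ".
    """
    a = [line.rstrip() for line in text_a.splitlines()]
    b = [line.rstrip() for line in text_b.splitlines()]
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; drop those.
    return [line for line in difflib.ndiff(a, b) if line[:2] in ("- ", "+ ")]
```

For example, `inventory_diff(open("pdsmi-plus.txt").read(), 
open("pdsmi.txt").read())` would list the devices and revisions that 
appear on only one of the two boards.  The comparison is order-sensitive, 
so both files should come from the same command.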

I have tried the obvious: swapping the cables out.  No difference at all.

I have NOT yet tried a different gigabit switch.

Hopefully that's enough detail to start; I can get into more specifics as 
needed.  (Kernel configs, dmesg output, IRQ details, disk details, IPMI, 
running apps, serial console access if needed...)