From: Mike Andrews
To: freebsd-stable@freebsd.org
Date: Tue, 16 Jan 2007 12:44:11 -0500 (EST)
Message-ID: <20070116123019.I46509@bit0.com>
Subject: 6.2-RELEASE em0 watchdog timeouts -- sometimes (w/ partial workaround)

I have a strange issue with em0 watchdog timeouts that I think is not the same as the ones everyone was having during the 6.2 beta cycle...

I have six systems, each with two Intel GigE ports onboard:

Systems A and B: Supermicro PDSMi+
Systems C and D: Supermicro PDSMi (without the plus)
System E: Tyan S2730U3GN
System F: Supermicro X5DPA-GG

On each system:

em0 is connected to a Cisco Catalyst 2960G layer 2 gigabit ethernet switch.
em1 is connected to a Foundry ServerIron XL layer 4-7 fast ethernet switch.

All six run FreeBSD 6.2-RELEASE i386, even though the first four are capable of running amd64. They all have 2 GB of memory, except E, which has 4 GB. The kernel configs are all identical, and are not that far from GENERIC + SMP.

Several times a day, em0 will go down, give a watchdog timeout error on the console, then come right back up on its own a few seconds later.

But here's the weird twist: it ONLY happens on systems A and B, and ONLY when running at gigabit speed. If I knock the two switch ports down to 100 meg, the problem goes away. The other four systems, C through F, never have watchdog timeout issues; they always work perfectly, even at gigabit speed.

So I'm trying to figure out whether there are any other obvious hardware differences between the plus and non-plus versions of the PDSMi that would cause issues on the plus version. Fortunately, at the moment we are not (yet) pushing anywhere near even 100 meg worth of traffic through these ports, so it's a tolerable workaround... just kinda annoying. :)

The chipset is a bit different: the PDSMi uses the Intel E7230 chipset for Pentium D servers, where the PDSMi+ uses the E3000, which adds Core 2 Duo support. But apparently the NIC chips are identical: 82573V for em0 and 82573L for em1. The BIOS is identical too, so the chipsets must be pretty similar.

Nothing shares an IRQ with the NICs. (USB is disabled in the BIOS.)

They do have different disk systems: A and B are SATA gmirror setups, while C and D use LSI MegaRAID SCSI cards for their mirrors.

I have tried the obvious: switching the cables out. No difference at all.
I have NOT yet tried a different gigabit switch.

Hopefully that's enough detail to start; I can get into more specifics as needed. (Kernel configs, dmesg output, IRQ details, disk details, IPMI, running apps, serial console access if needed...)