From owner-freebsd-stable@FreeBSD.ORG Fri May 15 08:25:03 2009 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7433D1065678 for ; Fri, 15 May 2009 08:25:03 +0000 (UTC) (envelope-from byshenknet@byshenk.net) Received: from core.byshenk.net (core.byshenk.net [62.58.73.230]) by mx1.freebsd.org (Postfix) with ESMTP id EF8B98FC1D for ; Fri, 15 May 2009 08:25:02 +0000 (UTC) (envelope-from byshenknet@byshenk.net) Received: from core.byshenk.net (localhost.aoes.com [127.0.0.1]) by core.byshenk.net (8.14.3/8.14.3) with ESMTP id n4F8P0kF025036 for ; Fri, 15 May 2009 10:25:00 +0200 (CEST) (envelope-from byshenknet@core.byshenk.net) Received: (from byshenknet@localhost) by core.byshenk.net (8.14.3/8.14.3/Submit) id n4F8P0W8025035 for freebsd-stable@freebsd.org; Fri, 15 May 2009 10:25:00 +0200 (CEST) (envelope-from byshenknet) Date: Fri, 15 May 2009 10:25:00 +0200 From: Greg Byshenk To: freebsd-stable@freebsd.org Message-ID: <20090515082500.GB2571@core.byshenk.net> References: <20090426125008.GK1550@core.byshenk.net> <20090513164207.GD67116@core.byshenk.net> <20090513164438.GE67116@core.byshenk.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090513164438.GE67116@core.byshenk.net> User-Agent: Mutt/1.4.2.3i X-Spam-Status: No, score=-1.4 required=5.0 tests=ALL_TRUSTED autolearn=failed version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on core.byshenk.net Subject: Re: em? watchdog timeout 7-stable X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 May 2009 08:25:03 -0000 Following up to myself, I experienced a watchdog timout followed by lockuup again early this morning. Strangely, rather than happening at a time of heavy activity, it seems to have occurred when there was very little activity. I was running 'systat' in a window when the watchdog timeout occurred and the network disappeared, and it showed: === begin systat output === 2 users Load 0.46 1.36 1.32 May 15 05:29 Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER Tot Share Tot Share Free in out in out Act 50544 5484 471736 8504 1789768 count All 151492 8748 13158556 21348 pages Proc: Interrupts r p d s w Csw Trp Sys Int Sof Flt cow 16006 total 1 162 360 4 132 6 1126 zfod sio0 irq4 ozfod fdc0 irq6 12.9%Sys 12.5%Intr 0.0%User 0.0%Nice 74.7%Idle %ozfod ata0 irq14 | | | | | | | | | | | daefr 6 skc0 em0 1 ======+++++++ prcfr twa0 irq18 39 dtbuf totfr em1 irq24 Namei Name-cache Dir-cache 100000 desvn react 2000 cpu0: time Calls hits % hits % 84258 numvn pdwak 2000 cpu3: time 25000 frevn pdpgs 2000 cpu1: time intrn 2000 cpu2: time Disks da0 da1 pass0 pass1 541836 wire 2000 cpu7: time KB/t 0.00 0.00 0.00 0.00 51628 act 2000 cpu6: time tps 0 0 0 0 13865008 inact 2000 cpu4: time MB/s 0.00 0.00 0.00 0.00 458052 cache 2000 cpu5: time %busy 0 0 0 0 1331716 free === end systat output === This time, I was able to break into the debugger from my console, with the following result: === begin kdb output === KDB: enter: Line break on console [thread pid 17 tid 100009 ] Stopped at kdb_enter_why+0x3d: movq $0,0x5d70d8(%rip) db> panic panic: from debugger cpuid = 1 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2a panic() at panic+0x182 db_panic() at db_panic+0x17 db_command() at db_command+0x1ef db_command_loop() at db_command_loop+0x50 db_trap() at db_trap+0x89 kdb_trap() at kdb_trap+0x95 trap() at trap+0x264 calltrap() at calltrap+0x8 --- trap 0x3, rip = 0xffffffff804d07cd, rsp = 0xfffffffe800819d0, rbp = 0xfffffffe800819f0 --- kdb_enter_why() at kdb_enter_why+0x3d siointr1() at siointr1+0x2c5 siointr() at siointr+0x58 intr_execute_handlers() at intr_execute_handlers+0x8b Xapic_isr1() at Xapic_isr1+0x7f --- interrupt, rip = 0xffffffff80727c36, rsp = 0xfffffffe80081b90, rbp = 0xfffffffe80081ba0 --- acpi_cpu_c1() at acpi_cpu_c1+0x6 acpi_cpu_idle() at acpi_cpu_idle+0x19c sched_idletd() at sched_idletd+0x46 fork_exit() at fork_exit+0x11f fork_trampoline() at fork_trampoline+0xe --- trap 0, rip = 0, rsp = 0xfffffffe80081d30, rbp = 0 --- KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2a mi_switch() at mi_switch+0x2a8 sched_bind() at sched_bind+0x58 boot() at boot+0x3f panic() at panic+0x16c db_panic() at db_panic+0x17 db_command() at db_command+0x1ef db_command_loop() at db_command_loop+0x50 db_trap() at db_trap+0x89 kdb_trap() at kdb_trap+0x95 trap() at trap+0x264 calltrap() at calltrap+0x8 --- trap 0x3, rip = 0xffffffff804d07cd, rsp = 0xfffffffe800819d0, rbp = 0xfffffffe800819f0 --- kdb_enter_why() at kdb_enter_why+0x3d siointr1() at siointr1+0x2c5 siointr() at siointr+0x58 intr_execute_handlers() at intr_execute_handlers+0x8b Xapic_isr1() at Xapic_isr1+0x7f --- interrupt, rip = 0xffffffff80727c36, rsp = 0xfffffffe80081b90, rbp = 0xfffffffe80081ba0 --- acpi_cpu_c1() at acpi_cpu_c1+0x6 acpi_cpu_idle() at acpi_cpu_idle+0x19c sched_idletd() at sched_idletd+0x46 fork_exit() at fork_exit+0x11f fork_trampoline() at fork_trampoline+0xe --- trap 0, rip = 0, rsp = 0xfffffffe80081d30, rbp = 0 --- db> bt Tracing pid 17 tid 100009 td 0xffffff00013f3a50 kdb_enter_why() at kdb_enter_why+0x3d siointr1() at siointr1+0x2c5 siointr() at siointr+0x58 intr_execute_handlers() at intr_execute_handlers+0x8b Xapic_isr1() at Xapic_isr1+0x7f --- interrupt, rip = 0xffffffff80727c36, rsp = 0xfffffffe80081b90, rbp = 0xfffffffe80081ba0 --- acpi_cpu_c1() at acpi_cpu_c1+0x6 acpi_cpu_idle() at acpi_cpu_idle+0x19c sched_idletd() at sched_idletd+0x46 fork_exit() at fork_exit+0x11f fork_trampoline() at fork_trampoline+0xe --- trap 0, rip = 0, rsp = 0xfffffffe80081d30, rbp = 0 --- === end kdb output === Kernel/world are 7-STABLE amd64, in sync, from sources csup'ed Thursday, 14 May 2009. Other information re em1 on this machine: # pciconf -lvb em1@pci0:7:1:0: class=0x020000 card=0x10028086 chip=0x10118086 rev=0x01 hdr=0x00 vendor = 'Intel Corporation' device = '82545EM Gigabit Ethernet Controller (Fiber)' class = network subclass = ethernet bar [10] = type Memory, range 64, base 0xda300000, size 131072, enabled bar [20] = type I/O Port, range 32, base 0x5000, size 64, enabled # vmstat -i interrupt total rate irq4: sio0 1479 0 irq6: fdc0 10 0 irq14: ata0 58 0 irq16: skc0 em0 758850 85 irq18: twa0 2085338 234 irq24: em1 1 0 cpu0: timer 17806226 1999 cpu3: timer 17798161 1998 cpu2: timer 17798127 1998 cpu1: timer 17798043 1998 cpu5: timer 17798058 1998 cpu6: timer 17798161 1998 cpu4: timer 17798160 1998 cpu7: timer 17798160 1998 Total 145238832 16311 # ifconfig em1 em1: flags=8843 metric 0 mtu 1500 options=db ether 00:07:e9:1a:ae:dc inet 192.168.1.62 netmask 0xfffff800 broadcast 192.168.7.255 media: Ethernet autoselect (1000baseLX ) status: active Any ideas? On Wed, May 13, 2009 at 06:44:38PM +0200, Greg Byshenk wrote: > On Wed, May 13, 2009 at 06:42:07PM +0200, Greg Byshenk wrote: > > > As a followup to my own previous message, I continue to have annoying > > problems with "em?: watchdog timeout" on one of my machines (now running > > 7.2-STABLE as of 2009-05-08). > > > > I have discontinued using the on-board (em, copper) NICs, and replaced > > the original fibre NIC with a newer model, but the problem persists. > > I've also set > > > > hw.pci.enable_msix=0 > > hw.pci.enable_msi=0 > > hw.em.rxd=1024 > > hw.em.txd=1024 > > net.inet.tcp.tso=0 > > > > ...as suggested in some discussions of this problem, and set the em1 > > interface to 'polling', all to no avail. Frequently, though irregularly > > (once or twice a day), the console begins to display > > > > em1: watchdog timeout -- resetting > > em1: watchdog timeout -- resetting > > em1: watchdog timeout -- resetting > > > > the nework is down, and the machine locks up. > > > > [Note: I am getting 'em1' now instead of 'em0' as previously, but this > > is due to changing all of the nics, which led to a different numbering; > > the timeout is still occurring on the (main) interface, the fibre > > gigabit connection.] > > > > What is particularly perverse (IMO) is that, since changing the NIC to > > the newer model (and updating the kernel), I can no longer break to the > > debugger when the lockup occurs (there is no response to the break) -- > > bit I _can_ shut the machine down cleanly via hardware (a touch of the > > power switch sends 'shutdown', and the machine shuts down cleanly -- > > after killing off processes waiting on network i/o). > > > > The machine is running nfs and samba (3.2.10, from ports), and pretty > > much nothing else. > > > > > > Anyone have any ideas about this...? I'm going mad with this. -- greg byshenk - gbyshenk@byshenk.net - Leiden, NL