From owner-freebsd-questions@FreeBSD.ORG Fri Oct 31 08:06:17 2014 Return-Path: Delivered-To: freebsd-questions@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 845ACD93 for ; Fri, 31 Oct 2014 08:06:17 +0000 (UTC) Received: from sam.nabble.com (sam.nabble.com [216.139.236.26]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4E8BE9FE for ; Fri, 31 Oct 2014 08:06:17 +0000 (UTC) Received: from [192.168.236.26] (helo=sam.nabble.com) by sam.nabble.com with esmtp (Exim 4.72) (envelope-from ) id 1Xk7Dm-00036d-1i for freebsd-questions@freebsd.org; Fri, 31 Oct 2014 01:06:10 -0700 Date: Fri, 31 Oct 2014 01:06:10 -0700 (PDT) From: BBlister To: freebsd-questions@freebsd.org Message-ID: <1414742770032-5960935.post@n5.nabble.com> Subject: Every day my FreeBSD 9.3 machines reboot by watchdog timeout MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-questions@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: User questions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Oct 2014 08:06:17 -0000 Greetings, I have a very strange problem. I am administering a number of FreeBSD machines (64bit) with: 9.3-STABLE and after the upgrade to 9.3 for the past months until now I have noticed that every day at approximately the same time the machines reboot by watchdog timeout. I believe the problem lies on an entry on the crontab which is difficult to debug because the crontab has too many entries [280 lines]. All the previous versions of freebsd worked fine (even 9.2) and had uptime for many months. After going to 9.3 the problem arose. I have made debugging symbols for all the kernels and I have some logs of it, in case you can help me debug this strange problem. The watchdog fires almost every day in the morning (not exactly at the same minute), as the log files bellow show. Some days they skip the reboot. NOTE: This happens in every FreeBSD server that I have updated to 9-3 even with different hardware. If I kill the crontab no reboot will happen (at least for 3 days that I tried it). All the servers bear similar crontab. The watchdog flags: watchdogd_flags="-s 30 -t 60 -e /bin/ps" As you know after 3:00 a lot of house cleaning jobs take place and cpu raises load, but I believe that the kernel should not reboot in such situations. I am reluctant to try: -disable watchdog, because these servers are required to operate and if they go down and do not reboot, some users will be upset -comment one-by-one every process in the crontab, because this will required a couple of months debugging. ==>Any clever way how to spot the bug? 1st Machine [This is a real physical machine]: {NOTE: For many days I had disabled crontab and thus the reboot are less in this month.} % last | grep boot boot time Fri Oct 31 06:34 boot time Thu Oct 30 06:43 boot time Wed Oct 29 06:34 boot time Sun Oct 26 06:45 boot time Sun Oct 19 07:00 boot time Sat Oct 11 03:35 boot time Tue Oct 7 07:39 boot time Mon Oct 6 07:46 boot time Sun Oct 5 07:40 boot time Thu Oct 2 07:40 boot time Wed Oct 1 07:34 Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation. FreeBSD 9.3-STABLE #2 r268794: Sat Oct 11 09:59:45 EEST 2014 root@XXXXXXXX:/usr/obj/usr/src/sys/bigb5 amd64 FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512 CPU: Intel(R) Core(TM)2 Duo CPU E7600 @ 3.06GHz (3066.70-MHz K8-class CPU) Origin = "GenuineIntel" Id = 0x1067a Family = 0x6 Model = 0x17 Stepping = 10 Features=0xbfebfbff Features2=0xc08e3bd AMD Features=0x20100800 AMD Features2=0x1 TSC: P-state invariant, performance statistics real memory = 4294967296 (4096 MB) avail memory = 4103790592 (3913 MB) Event timer "LAPIC" quality 400 ACPI APIC Table: FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs FreeBSD/SMP: 1 package(s) x 2 core(s) cpu0 (BSP): APIC ID: 0 cpu1 (AP): APIC ID: 1 ioapic0: Changing APIC ID to 2 ioapic0 irqs 0-23 on motherboard acpi0: on motherboard acpi0: Power Button (fixed) acpi0: reservation of 0, a0000 (3) failed acpi0: reservation of 100000, cfde0000 (3) failed cpu0: on acpi0 cpu1: on acpi0 attimer0: port 0x40-0x43 on acpi0 Timecounter "i8254" frequency 1193182 Hz quality 0 Event timer "i8254" frequency 1193182 Hz quality 100 hpet0: iomem 0xfed00000-0xfed003ff irq 0,8 on acpi0 Timecounter "HPET" frequency 14318180 Hz quality 950 Event timer "HPET" frequency 14318180 Hz quality 450 Event timer "HPET1" frequency 14318180 Hz quality 440 Event timer "HPET2" frequency 14318180 Hz quality 440 atrtc0: port 0x70-0x73 on acpi0 Event timer "RTC" frequency 32768 Hz quality 0 Timecounter "ACPI-fast" frequency 3579545 Hz quality 900 acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0 acpi_button0: on acpi0 pcib0: port 0xcf8-0xcff on acpi0 pci0: on pcib0 pcib1: irq 16 at device 1.0 on pci0 pci1: on pcib1 vgapci0: port 0xc000-0xc07f mem 0xe6000000-0xe6ffffff,0xd0000000-0xdfffffff,0xe4000000-0xe5ffffff irq 16 at device 0.0 on pci1 vgapci0: Boot video device pci0: at device 27.0 (no driver attached) pcib2: irq 16 at device 28.0 on pci0 pci2: on pcib2 pcib3: irq 19 at device 28.3 on pci0 pci3: on pcib3 re0: port 0xd000-0xd0ff mem 0xe9010000-0xe9010fff,0xe9000000-0xe900ffff irq 19 at device 0.0 on pci3 re0: Using 1 MSI-X message re0: Chip rev. 0x3c000000 re0: MAC rev. 0x00400000 miibus0: on re0 rgephy0: PHY 1 on miibus0 rgephy0: none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow re0: Ethernet address: 00:24:1d:71:e4:83 uhci0: port 0xe000-0xe01f irq 23 at device 29.0 on pci0 usbus0 on uhci0 uhci1: port 0xe100-0xe11f irq 19 at device 29.1 on pci0 usbus1 on uhci1 uhci2: port 0xe200-0xe21f irq 18 at device 29.2 on pci0 usbus2 on uhci2 uhci3: port 0xe300-0xe31f irq 16 at device 29.3 on pci0 usbus3 on uhci3 ehci0: mem 0xe9104000-0xe91043ff irq 23 at device 29.7 on pci0 usbus4: EHCI version 1.0 usbus4 on ehci0 pcib4: at device 30.0 on pci0 pci4: on pcib4 isab0: at device 31.0 on pci0 isa0: on isab0 atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0 ata0: at channel 0 on atapci0 ata1: at channel 1 on atapci0 uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0 uart0: console (9600,n,8,1) atkbdc0: port 0x60,0x64 irq 1 on acpi0 atkbd0: irq 1 on atkbdc0 kbd0 at atkbd0 atkbd0: [GIANT-LOCKED] ichwd0 on isa0 orm0: at iomem 0xc0000-0xcdfff on isa0 wbwd0: HEFRAS and EFER do not align: EFER 0x2e DevID 0xff DevRev 0xff CR26 0xff acpi_perf0: on cpu0 coretemp0: on cpu0 p4tcc0: on cpu0 coretemp1: on cpu1 est1: on cpu1 est: CPU supports Enhanced Speedstep, but is not recognized. est: cpu_vendor GenuineIntel, msr 6164b2106000921 device_attach: est1 attach returned 6 p4tcc1: on cpu1 Timecounters tick every 1.000 msec vboxdrv: fAsync=0 offMin=0x192 offMax=0x325 ipfw2 (+ipv6) initialized, divert enabled, nat enabled, default to accept, logging disabled usbus0: 12Mbps Full Speed USB v1.0 usbus1: 12Mbps Full Speed USB v1.0 ugen0.1: at usbus0 uhub0: on usbus0 ugen1.1: at usbus1 uhub1: on usbus1 usbus2: 12Mbps Full Speed USB v1.0 usbus3: 12Mbps Full Speed USB v1.0 ugen2.1: at usbus2 uhub2: on usbus2 ugen3.1: at usbus3 uhub3: on usbus3 uhub0: 2 ports with 2 removable, self powered usbus4: 480Mbps High Speed USB v2.0 ad0: 715403MB at ata0-master UDMA100 SATA uhub1: 2 ports with 2 removable, self powered ugen4.1: at usbus4 uhub4: on usbus4 uhub2: 2 ports with 2 removable, self powered uhub3: 2 ports with 2 removable, self powered SMP: AP CPU #1 Launched! Timecounter "TSC-low" frequency 1533348854 Hz quality 1000 Root mount waiting for: usbus4 Root mount waiting for: usbus4 Root mount waiting for: usbus4 uhub4: 8 ports with 8 removable, self powered Root mount waiting for: usbus4 Root mount waiting for: usbus4 ugen4.2: at usbus4 umass0: on usbus4 Trying to mount root from ufs:/dev/ad0s1a [ro]... da0 at umass-sim0 bus 0 scbus0 target 0 lun 0 da0: Removable Direct Access SCSI-4 device da0: Serial Number 001478544883AC5107B80394 da0: 40.000MB/s transfers da0: 7392MB (15138816 512 byte sectors: 255H 63S/T 942C) da0: quirks=0x2 WARNING: / was not properly dismounted ZFS filesystem version: 5 ZFS storage pool version: features support (5000) cryptosoft0: on motherboard vboxnet0: Ethernet address: 0a:00:27:00:00:00 tun0: link state changed to UP nd6_setmtu0: new link MTU on tun0 (1130) is too small for IPv6 tap0: Ethernet address: 00:bd:3a:fc:06:00 tap0: link state changed to UP vboxnet0: promiscuous mode enabled Unread portion of the kernel message buffer: interrupt total irq1: atkbd0 2 irq11: em0 uhci0+ 4037862 irq15: ata1 3 cpu0:timer 8849845 irq257: virtio_pci1 686513 Total 13574225 panic: watchdog timeout cpuid = 0 Uptime: 1d1h18m3s Dumping 107 out of 502 MB:..15%..30%..45%..60%..75%..90% (kgdb) bt #0 doadump (textdump=) at pcpu.h:235 #1 0xffffffff8041554f in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:454 #2 0xffffffff80415997 in panic (fmt=) at /usr/src/sys/kern/kern_shutdown.c:642 #3 0xffffffff803c96b4 in watchdog_fire () at /usr/src/sys/kern/kern_clock.c:898 #4 0xffffffff803c99e5 in hardclock_cnt (cnt=, usermode=) at /usr/src/sys/kern/kern_clock.c:570 #5 0xffffffff806297bc in handleevents (now=0xffffff8000231978, fake=0) at /usr/src/sys/kern/kern_clocksource.c:215 #6 0xffffffff8062a3b9 in timercb (et=, arg=) at /usr/src/sys/kern/kern_clocksource.c:390 #7 0xffffffff8065c70c in lapic_handle_timer (frame=0xffffff80002319f0) at /usr/src/sys/x86/x86/local_apic.c:818 #8 0xffffffff805e868c in Xtimerint () at apic_vector.S:103 #9 0xffffffff805e49e6 in acpi_cpu_c1 () at /usr/src/sys/amd64/acpica/acpi_machdep.c:95 #10 0xffffffff802bce7a in acpi_cpu_idle () at /usr/src/sys/dev/acpica/acpi_cpu.c:1023 #11 0xffffffff805ebfc8 in cpu_idle_acpi (busy=0) at /usr/src/sys/amd64/amd64/machdep.c:678 #12 0xffffffff805ec082 in cpu_idle (busy=0) at /usr/src/sys/amd64/amd64/machdep.c:805 #13 0xffffffff8043d832 in sched_idletd (dummy=) at /usr/src/sys/kern/sched_ule.c:2610 #14 0xffffffff803e7455 in fork_exit (callout=0xffffffff8043d680 , arg=0x0, frame=0xffffff8000231c40) at /usr/src/sys/kern/kern_fork.c:996 #15 0xffffffff805e806e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:606 #16 0x0000000000000000 in ?? () 2nd machine [This is a VPS machine running in the cloud]: % last | grep boot boot time Fri Oct 31 06:33 boot time Thu Oct 30 06:33 boot time Wed Oct 29 06:33 boot time Tue Oct 28 06:33 boot time Mon Oct 27 06:33 boot time Sun Oct 26 06:04 boot time Sat Oct 25 06:33 boot time Fri Oct 24 06:33 boot time Thu Oct 23 06:33 boot time Wed Oct 22 06:33 boot time Tue Oct 21 06:33 boot time Mon Oct 20 06:33 boot time Sun Oct 19 06:32 boot time Sat Oct 18 06:33 boot time Fri Oct 17 06:32 boot time Thu Oct 16 06:33 boot time Wed Oct 15 06:33 boot time Tue Oct 14 06:33 boot time Mon Oct 13 06:33 boot time Sun Oct 12 07:04 boot time Sat Oct 11 07:04 boot time Fri Oct 10 07:04 boot time Thu Oct 9 07:04 boot time Wed Oct 8 07:04 boot time Tue Oct 7 07:04 boot time Mon Oct 6 07:04 boot time Sun Oct 5 07:04 boot time Sat Oct 4 07:04 boot time Fri Oct 3 07:04 boot time Thu Oct 2 07:04 boot time Wed Oct 1 07:04 Copyright (c) 1992-2014 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation. FreeBSD 9.3-STABLE #6 r271376: Sun Oct 12 10:25:52 EEST 2014 root@XXXXXXX:/usr/obj/usr/src/sys/arch amd64 FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512 CPU: QEMU Virtual CPU version 1.1.2 (2266.76-MHz K8-class CPU) Origin = "GenuineIntel" Id = 0x623 Family = 0x6 Model = 0x2 Stepping = 3 Features=0x783fbfd Features2=0x80802001 AMD Features=0x20100800 AMD Features2=0x1 real memory = 536870912 (512 MB) avail memory = 502841344 (479 MB) Event timer "LAPIC" quality 400 ACPI APIC Table: ioapic0: Changing APIC ID to 1 ioapic0 irqs 0-23 on motherboard acpi0: on motherboard acpi0: Power Button (fixed) cpu0: on acpi0 atrtc0: port 0x70-0x71,0x72-0x77 irq 8 on acpi0 Event timer "RTC" frequency 32768 Hz quality 0 hpet0: iomem 0xfed00000-0xfed003ff on acpi0 Timecounter "HPET" frequency 100000000 Hz quality 950 Timecounter "ACPI-fast" frequency 3579545 Hz quality 900 acpi_timer0: <24-bit timer at 3.579545MHz> port 0xb008-0xb00b on acpi0 pcib0: port 0xcf8-0xcff on acpi0 pci0: on pcib0 pci_link4: Unable to route IRQs: AE_NOT_FOUND isab0: at device 1.0 on pci0 isa0: on isab0 atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xc0c0-0xc0cf at device 1.1 on pci0 ata0: at channel 0 on atapci0 ata1: at channel 1 on atapci0 uhci0: port 0xc080-0xc09f irq 11 at device 1.2 on pci0 usbus0: controller did not stop usbus0 on uhci0 pci0: at device 1.3 (no driver attached) vgapci0: mem 0xfc000000-0xfdffffff,0xfebf0000-0xfebf0fff at device 2.0 on pci0 vgapci0: Boot video device virtio_pci0: port 0xc0a0-0xc0bf irq 11 at device 3.0 on pci0 vtballoon0: on virtio_pci0 virtio_pci0: host features: 0x71000002 virtio_pci0: negotiated features: 0 em0: port 0xc000-0xc03f mem 0xfeba0000-0xfebbffff irq 11 at device 4.0 on pci0 em0: Ethernet address: de:db:ee:d9:35:32 virtio_pci1: port 0xc040-0xc07f mem 0xfebf1000-0xfebf1fff irq 10 at device 5.0 on pci0 vtblk0: on virtio_pci1 virtio_pci1: host features: 0x710006d4 virtio_pci1: negotiated features: 0x10000254 vtblk0: 80000MB (163840000 512 byte sectors) atkbdc0: port 0x60,0x64 irq 1 on acpi0 atkbd0: irq 1 on atkbdc0 kbd0 at atkbd0 atkbd0: [GIANT-LOCKED] sc0: at flags 0x100 on isa0 sc0: VGA <16 virtual consoles, flags=0x300> vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0 attimer0: at port 0x40 on isa0 Timecounter "i8254" frequency 1193182 Hz quality 0 Event timer "i8254" frequency 1193182 Hz quality 100 wbwd0: HEFRAS and EFER do not align: EFER 0x2e DevID 0xff DevRev 0xff CR26 0xff Timecounters tick every 10.000 msec ipfw2 (+ipv6) initialized, divert enabled, nat enabled, default to accept, logging disabled usbus0: 12Mbps Full Speed USB v1.0 ugen0.1: at usbus0 uhub0: on usbus0 pass0 at ata1 bus 0 scbus1 target 0 lun 0 pass0: Removable CD-ROM SCSI-0 device pass0: Serial Number QM00003 pass0: 16.700MB/s transfers (WDMA2, ATAPI 12bytes, PIO 65534bytes) Timecounter "TSC-low" frequency 1133381229 Hz quality 800 uhub0: 2 ports with 2 removable, self powered Root mount waiting for: usbus0 ugen0.2: at usbus0 Trying to mount root from ufs:/dev/vtbd0s1a [ro]... uhid0: on usbus0 tap0: Ethernet address: 00:bd:ae:4b:01:00 tap0: link state changed to UP tap0: link state changed to DOWN tap0: Ethernet address: 00:bd:43:20:06:00 tap0: link state changed to UP Unread portion of the kernel message buffer: interrupt total irq4: uart0 1176 irq14: ata0 2924496 irq15: ata1 1 irq20: hpet0 86588821 irq23: uhci0 ehci0 8813832 irq256: re0 6199647 Total 104527973 panic: watchdog timeout cpuid = 0 Uptime: 23h48m59s Dumping 933 out of 4085 MB: (CTRL-C to abort) ..2%..11%..21%..31%..42%..52%..62%..71%..81%..91% (kgdb) bt #0 doadump (textdump=) at pcpu.h:235 #1 0xffffffff803bfd6f in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:454 #2 0xffffffff803c01b7 in panic (fmt=) at /usr/src/sys/kern/kern_shutdown.c:642 #3 0xffffffff80373e84 in watchdog_fire () at /usr/src/sys/kern/kern_clock.c:898 #4 0xffffffff803741b5 in hardclock_cnt (cnt=, usermode=) at /usr/src/sys/kern/kern_clock.c:570 #5 0xffffffff805bb90c in handleevents (now=0xffffff8000245858, fake=0) at /usr/src/sys/kern/kern_clocksource.c:215 #6 0xffffffff805bc498 in timercb (et=, arg=) at /usr/src/sys/kern/kern_clocksource.c:390 #7 0xffffffff802bb667 in hpet_intr_single (arg=) at /usr/src/sys/dev/acpica/acpi_hpet.c:260 #8 0xffffffff802bb70e in hpet_intr (arg=0xffffff8000648000) at /usr/src/sys/dev/acpica/acpi_hpet.c:278 #9 0xffffffff8039414b in intr_event_handle (ie=0xfffffe0001944a00, frame=0xffffff80002459b0) at /usr/src/sys/kern/kern_intr.c:1446 #10 0xffffffff805eb3d8 in intr_execute_handlers (isrc=0xfffffe0001927ce8, frame=0xffffff80002459b0) at /usr/src/sys/x86/x86/intr_machdep.c:266 #11 0xffffffff805ee78f in lapic_handle_intr (vector=, frame=0x0) at /usr/src/sys/x86/x86/local_apic.c:780 #12 0xffffffff805905d4 in Xapic_isr1 () at apic_vector.S:86 #13 0xffffffff8058cde6 in acpi_cpu_c1 () at /usr/src/sys/amd64/acpica/acpi_machdep.c:95 #14 0xffffffff802b7fda in acpi_cpu_idle () at /usr/src/sys/dev/acpica/acpi_cpu.c:1023 #15 0xffffffff805943c8 in cpu_idle_acpi (busy=0) at /usr/src/sys/amd64/amd64/machdep.c:678 #16 0xffffffff80594482 in cpu_idle (busy=0) at /usr/src/sys/amd64/amd64/machdep.c:805 #17 0xffffffff803e8052 in sched_idletd (dummy=) at /usr/src/sys/kern/sched_ule.c:2609 #18 0xffffffff80391c25 in fork_exit (callout=0xffffffff803e7ea0 , arg=0x0, frame=0xffffff8000245c00) at /usr/src/sys/kern/kern_fork.c:996 #19 0xffffffff8059046e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:606 #20 0x0000000000000000 in ?? () -- View this message in context: http://freebsd.1045724.n5.nabble.com/Every-day-my-FreeBSD-9-3-machines-reboot-by-watchdog-timeout-tp5960935.html Sent from the freebsd-questions mailing list archive at Nabble.com.