From: Eugene Grosbein
To: FreeBSD Stable
Subject: stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock
Date: Sun, 23 Jul 2017 20:02:16 +0700
Message-ID: <59749E58.2060408@grosbein.net>

Hi!

Long story short: stable/11 r321371 started to panic at the moment smartd is started, after my SSD died.

I have an Intel motherboard with graid-supported pseudo-RAID. I use it in RAID1 mode with one HDD and one SSD. Yesterday the SSD died: it is detected neither by the BIOS nor by the FreeBSD kernel (timeouts). I did not notice this, because graid just disconnected it on the fly:

kernel: ahcich5: Timeout on slot 24 port 0
kernel: ahcich5: is 00000000 cs 00000000 ss 01000000 rs 01000000 tfd 40 serr 00000000 cmd 0000d817
kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b 00 00 00 00 00
kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout
kernel: (ada1:ahcich5:0:0:0): Retrying command
kernel: ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
[skip]
kernel: ada1 at ahcich5 bus 0 scbus2 target 0 lun 0
kernel: ada1: s/n JYKJ550855860139 detached
[skip]
kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b 00 00 00 00 00
kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout
kernel: (ada1:ahcich5:0:0:0): Error 5, Periph was invalidated
kernel: GEOM_RAID: Write failed: failing subdisk. ada1[WRITE(offset=269389066240, length=32768)]
kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from ACTIVE to FAILED.
kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-ada1 state changed from ACTIVE to FAILED.
kernel: GEOM_RAID: Intel-c291fe96: Volume r0 state changed from OPTIMAL to DEGRADED.
kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from FAILED to OFFLINE.
kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-[unknown] state changed from FAILED to NONE.
kernel: GEOM_RAID: Write failed: failing subdisk. ada1[WRITE(offset=270699851776, length=32768)]
kernel: GEOM_RAID: Intel-c291fe96: Warning! Fail request to a disk in a wrong state (OFFLINE)!

Unaware of that, I performed a standard source upgrade from 11.1-PRERELEASE r318692 to stable/11 r321371, which went smoothly. After the reboot, the BIOS was unable to detect the SSD, reported the degraded state of the mirror and booted FreeBSD from the second mirror component (HDD). After a long timeout, the kernel could not detect the dead SSD either and continued to run with the degraded mirror just fine: the system went to multiuser mode and had almost finished booting when rcNG started smartd. The kernel panics at that moment.

This is repeatable: I can cold-boot to single-user mode, start smartd and get the same panic. This is a debugging kernel and I managed to obtain a crash dump. The kgdb session follows:

<118>Starting smartd.
Kernel page fault with the following non-sleepable locks held:
exclusive sleep mutex CAM device lock (CAM device lock) r = 0 (0xfffff8000cf71c60) locked @ /home/src/sys/cam/scsi/scsi_pass.c:1766
stack backtrace:
#0 0xffffffff80a12620 at witness_debugger+0x70
#1 0xffffffff80a13a4e at witness_warn+0x45e
#2 0xffffffff80e4b693 at trap_pfault+0x53
#3 0xffffffff80e4ae3e at trap+0x29e
#4 0xffffffff80e2ed91 at calltrap+0x8
#5 0xffffffff8033873a at passsendccb+0x6a
#6 0xffffffff80337836 at passdoioctl+0x3c6
#7 0xffffffff80337052 at passioctl+0x22
#8 0xffffffff80878c78 at devfs_ioctl_f+0x138
#9 0xffffffff80a18184 at kern_ioctl+0x2c4
#10 0xffffffff80a17e4f at sys_ioctl+0x16f
#11 0xffffffff80e4c05a at amd64_syscall+0x53a
#12 0xffffffff80e2f07b at Xfast_syscall+0xfb

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0xa
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80e494e1
stack pointer = 0x28:0xfffffe04675ff670
frame pointer = 0x28:0xfffffe04675ff670
code segment = base 0x0, limit 0xfffff, type 0x1b
 = DPL 0, pres 1, long 1,
def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 1387 (smartd)
trap number = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe04675ff250
vpanic() at vpanic+0x186/frame 0xfffffe04675ff2d0
panic() at panic+0x43/frame 0xfffffe04675ff330
trap_fatal() at trap_fatal+0x322/frame 0xfffffe04675ff380
trap_pfault() at trap_pfault+0x62/frame 0xfffffe04675ff3e0
trap() at trap+0x29e/frame 0xfffffe04675ff5a0
calltrap() at calltrap+0x8/frame 0xfffffe04675ff5a0
--- trap 0xc, rip = 0xffffffff80e494e1, rsp = 0xfffffe04675ff670, rbp = 0xfffffe04675ff670 ---
copyin() at copyin+0x41/frame 0xfffffe04675ff670
passsendccb() at passsendccb+0x6a/frame 0xfffffe04675ff6f0
passdoioctl() at passdoioctl+0x3c6/frame 0xfffffe04675ff7a0
passioctl() at passioctl+0x22/frame 0xfffffe04675ff7e0
devfs_ioctl_f() at devfs_ioctl_f+0x138/frame 0xfffffe04675ff840
kern_ioctl() at kern_ioctl+0x2c4/frame 0xfffffe04675ff8a0
sys_ioctl() at sys_ioctl+0x16f/frame 0xfffffe04675ff980
amd64_syscall() at amd64_syscall+0x53a/frame 0xfffffe04675ffab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe04675ffab0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80174c18a, rsp = 0x7fffffff0308, rbp = 0x7fffffff0900 ---
Uptime: 2m46s
Dumping 723 out of 16285 MB:..3%..12%..23%..31%..43%..51%..62%..71%..82%..91%

(kgdb) bt
#0 doadump (textdump=1) at /home/src/sys/kern/kern_shutdown.c:299
#1 0xffffffff809b1ab1 in kern_reboot (howto=260) at /home/src/sys/kern/kern_shutdown.c:376
#2 0xffffffff809b2100 in vpanic (fmt=, ap=) at /home/src/sys/kern/kern_shutdown.c:770
#3 0xffffffff809b2143 in panic (fmt=) at /home/src/sys/kern/kern_shutdown.c:701
#4 0xffffffff80e4b632 in trap_fatal (frame=0xfffffe04675ff5b0, eva=10) at /home/src/sys/amd64/amd64/trap.c:801
#5 0xffffffff80e4b6a2 in trap_pfault (frame=0xfffffe04675ff5b0, usermode=0) at pcpu.h:222
#6 0xffffffff80e4ae3e in trap (frame=0xfffffe04675ff5b0) at
/home/src/sys/amd64/amd64/trap.c:421
#7 0xffffffff80e2ed91 in calltrap () at /home/src/sys/amd64/amd64/exception.S:236
#8 0xffffffff80e494e1 in copyin () at /home/src/sys/amd64/amd64/support.S:304
#9 0xffffffff8033873a in passsendccb (periph=0xfffff8000cfe5e00, ccb=0xfffff800662f9000, inccb=0xfffff80066d91000) at /home/src/sys/cam/scsi/scsi_pass.c:2172
#10 0xffffffff80337836 in passdoioctl (dev=, cmd=, addr=0xfffff80066d91000 "", flag=, td=) at /home/src/sys/cam/scsi/scsi_pass.c:1823
#11 0xffffffff80337052 in passioctl (dev=0xfffff8000cfc4c00, cmd=3303020802, addr=0xfffff80066d91000 "", flag=3, td=0xfffff8000e1a3000) at /home/src/sys/cam/scsi/scsi_pass.c:1751
#12 0xffffffff80878c78 in devfs_ioctl_f (fp=0xfffff8000e50d3c0, com=3303020802, data=0xfffff80066d91000, cred=0xfffff800066c7200, td=0xfffff8000e1a3000) at /home/src/sys/fs/devfs/devfs_vnops.c:791
#13 0xffffffff80a18184 in kern_ioctl (td=, fd=, com=, data=) at file.h:323
#14 0xffffffff80a17e4f in sys_ioctl (td=, uap=0xfffff8000e1a3538) at /home/src/sys/kern/sys_generic.c:745
#15 0xffffffff80e4c05a in amd64_syscall (td=0xfffff8000e1a3000, traced=0) at subr_syscall.c:131
#16 0xffffffff80e2f07b in Xfast_syscall () at /home/src/sys/amd64/amd64/exception.S:396
#17 0x000000080174c18a in ?? ()
Previous frame inner to this frame (corrupt stack?)

I can keep the dead SSD in place to debug this. Looking for help.
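
Note on the backtrace: the ioctl request that smartd issued can be read off the cmd=3303020802 argument of passioctl() in frame #11. The sketch below splits that number into its fields using the bit layout from <sys/ioccom.h> (direction bits, 13-bit parameter length, group byte, command byte). The helper name decode_ioctl is made up for illustration. Mapping the result to a named ioctl is an assumption that should be checked against the source tree: the decoded group 0x19 and number 2 look like _IOWR(CAM_VERSION, 2, union ccb), i.e. CAMIOCOMMAND, if CAM_VERSION is 0x19 on this branch.

```python
# Decode a FreeBSD ioctl command word using the bit layout of <sys/ioccom.h>.
# decode_ioctl is a hypothetical helper, not part of any FreeBSD tool.
IOCPARM_MASK = (1 << 13) - 1  # the parameter length field is 13 bits wide
IOC_VOID = 0x20000000         # no parameters
IOC_OUT = 0x40000000          # parameters are copied out to userland
IOC_IN = 0x80000000           # parameters are copied in from userland
IOC_DIRMASK = IOC_VOID | IOC_OUT | IOC_IN

DIRECTION_NAMES = {
    IOC_VOID: "void",
    IOC_OUT: "out",
    IOC_IN: "in",
    IOC_IN | IOC_OUT: "inout",
}


def decode_ioctl(cmd):
    """Split an ioctl command word into direction, length, group and number."""
    return {
        "direction": DIRECTION_NAMES.get(cmd & IOC_DIRMASK, "unknown"),
        "length": (cmd >> 16) & IOCPARM_MASK,  # size of the argument in bytes
        "group": (cmd >> 8) & 0xFF,            # first argument of _IOWR()
        "number": cmd & 0xFF,                  # second argument of _IOWR()
    }


# cmd value taken from frame #11 (passioctl) of the kgdb backtrace.
decoded = decode_ioctl(3303020802)
print(hex(3303020802), decoded)
```

The value is 0xc4e01902: direction inout, a 1248-byte argument, group 0x19, number 2. That only identifies which ioctl reached passsendccb(); it does not by itself explain why copyin() was handed the address 0xa.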