From owner-freebsd-bugs@FreeBSD.ORG Fri Jan 7 18:50:06 2011 Return-Path: Delivered-To: freebsd-bugs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 153341065674 for ; Fri, 7 Jan 2011 18:50:06 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id E69778FC19 for ; Fri, 7 Jan 2011 18:50:05 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id p07Io5Kk050773 for ; Fri, 7 Jan 2011 18:50:05 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id p07Io5mF050772; Fri, 7 Jan 2011 18:50:05 GMT (envelope-from gnats) Resent-Date: Fri, 7 Jan 2011 18:50:05 GMT Resent-Message-Id: <201101071850.p07Io5mF050772@freefall.freebsd.org> Resent-From: FreeBSD-gnats-submit@FreeBSD.org (GNATS Filer) Resent-To: freebsd-bugs@FreeBSD.org Resent-Reply-To: FreeBSD-gnats-submit@FreeBSD.org, Martin Simmons Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 46C151065672 for ; Fri, 7 Jan 2011 18:40:49 +0000 (UTC) (envelope-from nobody@FreeBSD.org) Received: from red.freebsd.org (unknown [IPv6:2001:4f8:fff6::22]) by mx1.freebsd.org (Postfix) with ESMTP id 0EA898FC17 for ; Fri, 7 Jan 2011 18:40:49 +0000 (UTC) Received: from red.freebsd.org (localhost [127.0.0.1]) by red.freebsd.org (8.14.4/8.14.4) with ESMTP id p07Iembf020756 for ; Fri, 7 Jan 2011 18:40:48 GMT (envelope-from nobody@red.freebsd.org) Received: (from nobody@localhost) by red.freebsd.org (8.14.4/8.14.4/Submit) id p07Iem20020755; Fri, 7 Jan 2011 18:40:48 GMT (envelope-from nobody) Message-Id: <201101071840.p07Iem20020755@red.freebsd.org> Date: Fri, 7 Jan 2011 18:40:48 GMT From: Martin Simmons To: freebsd-gnats-submit@FreeBSD.org X-Send-Pr-Version: www-3.1 Cc: Subject: kern/153771: Unkillable process after a SCSI tape write error X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 07 Jan 2011 18:50:06 -0000 >Number: 153771 >Category: kern >Synopsis: Unkillable process after a SCSI tape write error >Confidential: no >Severity: non-critical >Priority: low >Responsible: freebsd-bugs >State: open >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Fri Jan 07 18:50:05 UTC 2011 >Closed-Date: >Last-Modified: >Originator: Martin Simmons >Release: 8.0-RELEASE-p6 >Organization: >Environment: FreeBSD lwfs1-cam.cam.lispworks.com 8.0-RELEASE-p6 FreeBSD 8.0-RELEASE-p6 #0: Wed Dec 22 16:21:50 GMT 2010 root@lwfs1-cam.cam.lispworks.com:/usr/obj/usr/src/sys/GENERICLOCAL amd64 >Description: I'm not 100% sure of the sequence of system calls, because they are generated by Bacula, but here is what I think happened. The kernel is built from the 8.0-RELEASE-p6 GENERIC with a small patch for nfs. The hardware consists of an Intel chassis with an HP LTO-1 tape drive connected via an LSI U160 SCSI controller: Jan 3 23:06:24 lwfs1-cam kernel: sym0: <1010-33> port 0x1100-0x11ff mem 0xb1904000-0xb19043ff,0xb1900000-0xb1901fff irq 16 at device 0.0 on pci7 Jan 3 23:06:24 lwfs1-cam kernel: sym0: Symbios NVRAM, ID 7, Fast-80, LVD, parity checking Jan 3 23:06:24 lwfs1-cam kernel: sym0: open drain IRQ line driver, using on-chip SRAM Jan 3 23:06:24 lwfs1-cam kernel: sym0: using LOAD/STORE-based firmware. Jan 3 23:06:24 lwfs1-cam kernel: sym0: handling phase mismatch from SCRIPTS. Jan 3 23:06:24 lwfs1-cam kernel: sym0: [ITHREAD] Jan 3 23:06:24 lwfs1-cam kernel: sym1: <1010-33> port 0x1000-0x10ff mem 0xb1905000-0xb19053ff,0xb1902000-0xb1903fff irq 17 at device 0.1 on pci7 Jan 3 23:06:24 lwfs1-cam kernel: sym1: Symbios NVRAM, ID 7, Fast-80, SE, parity checking Jan 3 23:06:24 lwfs1-cam kernel: sym1: open drain IRQ line driver, using on-chip SRAM Jan 3 23:06:24 lwfs1-cam kernel: sym1: using LOAD/STORE-based firmware. Jan 3 23:06:24 lwfs1-cam kernel: sym1: handling phase mismatch from SCRIPTS. Jan 3 23:06:24 lwfs1-cam kernel: sym1: [ITHREAD] Jan 3 23:06:24 lwfs1-cam kernel: sa0 at sym0 bus 0 target 3 lun 0 Jan 3 23:06:24 lwfs1-cam kernel: sa0: Removable Sequential Access SCSI-3 device Jan 3 23:06:24 lwfs1-cam kernel: sa0: 160.000MB/s transfers (80.000MHz DT, offset 62, 16bit) There was a MEDIUM ERROR during a backup to the LTO-1 drive, but the write system call returned as if the error hadn't happened (AFAICS). The next write was reported as failing with EPERM and a subsequent ioctl to rewind the tape hung in an unkillable state. These syslog messages were generated Jan 3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): WRITE FILEMARKS(6). CDB: 10 0 0 0 1 0 Jan 3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): CAM Status: SCSI Status Error Jan 3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): SCSI Status: Check Condition Jan 3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): MEDIUM ERROR info:1 asc:c,0 Jan 3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): Write error Jan 3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): Retries Exhausted Jan 3 00:27:18 lwfs1-cam kernel: sym0:3:control msgout: 80 6. Jan 3 00:31:18 lwfs1-cam kernel: sym0:3:control msgout: 80 6. Jan 3 00:35:18 lwfs1-cam kernel: sym0:3:control msgout: 80 6. Jan 3 01:23:43 lwfs1-cam kernel: sym0:3:control msgout: 80 6. Jan 3 02:12:08 lwfs1-cam kernel: sym0:3:control msgout: 80 6. Jan 3 03:00:33 lwfs1-cam kernel: sym0:3:control msgout: 80 6. Jan 3 03:49:02 lwfs1-cam kernel: sym0:3:control msgout: 80 6. Jan 3 04:37:27 lwfs1-cam kernel: sym0:3:control msgout: 80 6. The message at 00:31:18 corresponds to when it tried to rewind the tape and the last message then repeats roughly every 50 minutes. The output of procstat -k -k for the unkillable backup process is: PID TID COMM TDNAME KSTACK 1166 100179 bacula-sd - mi_switch+0x16f sleepq_catch_signals+0x31f sleepq_wait_sig+0xc _cv_wait_sig+0x120 seltdwait+0xf6 kern_select+0x62f select+0x5d syscall+0x246 Xfast_syscall+0xe1 1166 100259 bacula-sd - mi_switch+0x16f sleepq_catch_signals+0x31f sleepq_timedwait_sig+0x19 _sleep+0x1a3 do_cv_wait+0x640 __umtx_op_cv_wait+0x5c syscall+0x246 Xfast_syscall+0xe1 1166 100316 bacula-sd - mi_switch+0x16f sleepq_wait+0x42 _sleep+0x31c cam_periph_runccb+0x4e sarewind+0x6c saioctl+0x977 giant_ioctl+0x7d devfs_ioctl_f+0x77 kern_ioctl+0xf6 ioctl+0xfd syscall+0x246 Xfast_syscall+0xe1 Attempts to attach gdb to the process caused gdb to hang (but it was killable). Attempts to use "camcontrol periphlist sa0" and "camcontrol reset 0:3:0" also hung in an unkillable way. "camcontrol reset 0" gave "Reset of bus 0 was successful" but didn't fix the hanging processes. The machine has been rebooted since then so I can't test anything else directly. >How-To-Repeat: I don't know if it is repeatable. >Fix: >Release-Note: >Audit-Trail: >Unformatted: