Date:      Thu, 13 May 1999 22:06:41 +0200 (CEST)
From:      Tor Egge <tegge@not.fast.no>
To:        FreeBSD-gnats-submit@freebsd.org
Subject:   kern/11697: Disk failure hangs system
Message-ID:  <199905132006.WAA59935@not.fast.no>


>Number:         11697
>Category:       kern
>Synopsis:       Disk failure hangs system
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu May 13 13:10:02 PDT 1999
>Closed-Date:
>Last-Modified:
>Originator:     Tor Egge
>Release:        FreeBSD 3.1-STABLE i386
>Organization:
Fast Search & Transfer ASA
>Environment:

FreeBSD 3.1-STABLE #0: Sat May  1 19:00:19 CEST 1999     root@response.fast.no:/usr/src/sys/compile/INDEX_SMP_SERIAL_DDB  i386

ahc1: <Adaptec 2940 Ultra2 SCSI adapter> rev 0x00 int a irq 17 on pci0.14.0
ahc1: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs

da13 at ahc1 bus 0 target 9 lun 0
da13: <QUANTUM QM318000TD-SCA N1K0> Fixed Direct Access SCSI-2 device 
da13: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da13: 17366MB (35566499 512 byte sectors: 255H 63S/T 2213C)

>Description:

----------------------
Unexpected busfree.  LASTPHASE == 0x80
SEQADDR == 0x15b
(da13:ahc1:0:9:0): Invalidating pack
(da13:ahc1:0:9:0): Invalidating pack
(da13:ahc1:0:9:0): Invalidating pack
vm_fault: pager read error, pid 63486 (mkserv)
(da13:ahc1:0:9:0): Invalidating pack
Stopped at      siointr1+0x6d:  jmp     siointr1+0x159
db> trace
siointr1(e3c8d800,e02890b0,0,f2e0da2c,e0206144) at siointr1+0x6d
siointr(0,f2e00010,0,1,e0289014) at siointr+0x1d
Xfastintr4(ebd13528,e3e12800,ebd13528,c8000040,e0e7e8c8) at Xfastintr4+0x24
biodone(ebd13528,ebd13528,ebd13528,c8000040,e3e08000) at biodone+0x2d0
dastrategy(ebd13528,200202b4,f2e0daa8,e018167d,f2e0dacc) at dastrategy+0xab
spec_strategy(f2e0dacc,f2e0dab4,e01e73a9,f2e0dacc,f2e0dad8) at spec_strategy+0x3e
spec_vnoperate(f2e0dacc,f2e0dad8,e016d46f,f2e0dacc,2000) at spec_vnoperate+0x15
ufs_vnoperatespec(f2e0dacc) at ufs_vnoperatespec+0x15
bwrite(ebd13528,f2e0daf0,e0171879,f2e0db34,f2e0dafc) at bwrite+0xaf
vop_stdbwrite(f2e0db34,f2e0dafc,e018167d,f2e0db34,f2e0db08) at vop_stdbwrite+0xe
vop_defaultop(f2e0db34,f2e0db08,e01e73a9,f2e0db34,f2e0db3c) at vop_defaultop+0x15
spec_vnoperate(f2e0db34,f2e0db3c,e016de03,f2e0db34,200) at spec_vnoperate+0x15
ufs_vnoperatespec(f2e0db34,200,ebd13528,1,0) at ufs_vnoperatespec+0x15
vfs_bio_awrite(ebd13528,200,a200a000,1,f2e00010) at vfs_bio_awrite+0x103
getnewbuf(f1cea900,d10050,0,0,2000) at getnewbuf+0x2ec
getblk(f1cea900,d10050,2000,0,0) at getblk+0x244
bread(f1cea900,d10050,2000,0,f2e0dc48) at bread+0x21
ffs_vget(e3e8c200,54ee7,f2e0dccc,f283ee40,f2e0df14) at ffs_vget+0x1bc
ufs_lookup(f2e0dd24,f2e0dd38,e017055c,f2e0dd24,f3009c47) at ufs_lookup+0x936
ufs_vnoperate(f2e0dd24,f3009c47,f283ee40,f2e0df14,0) at ufs_vnoperate+0x15
vfs_cache_lookup(f2e0dd80,f2e0dd90,e01729fd,f2e0dd80,f1c6ce00) at vfs_cache_lookup+0x248
ufs_vnoperate(f2e0dd80,f1c6ce00,f2e0df14,f2e0def0,0) at ufs_vnoperate+0x15
lookup(f2e0def0,0,f2e0df84,f2e0def0,7273752f) at lookup+0x2c1
namei(f2e0def0,0,f2e0df84,f2d5c840,286) at namei+0x133
vn_open(f2e0def0,3,584,f2d5c840,e0254064) at vn_open+0x1f6
open(f2d5c840,f2e0df84,dfbfd594,dfbfc7e0,dfbfbfe4) at open+0xad
syscall(27,27,dfbfbfe4,dfbfc7e0,dfbfc7b4) at syscall+0x187
Xint0x80_syscall() at Xint0x80_syscall+0x4c
db> panic
panic: from debugger
mp_lock = 01000002; cpuid = 1; lapic.id = 00000000
boot() called on cpu#1

syncing disks... 
-------------

The SCSI bus goes free unexpectedly, probably because the device is
resetting.  The command is then retried, but is aborted again by
a selection timeout (indicating that the device had not finished
resetting).  This might be caused by bad firmware on the disk or
by an inadequate power supply.  I assume it is bad firmware.

The VFS code is conservative: it does not throw away buffer contents
on fatal write errors, because doing so could corrupt the file system
if the error turns out to be transient.  As a result, the buffer
queues sometimes fill up with dirty buffers associated with the
invalidated disk pack.
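
The decision described above can be modelled in user space.  This is a
minimal sketch, not kernel code: struct model_buf, redirty_original()
and redirty_patched() are hypothetical names, and the flag values are
made up for the model (the real ones come from <sys/buf.h>).  It
assumes, as the proposed fix does, that ENXIO marks a write to a device
that is gone.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this model only. */
#define B_READ  0x1
#define B_ERROR 0x2

/* Reduced stand-in for struct buf: only the fields the test looks at. */
struct model_buf {
    int b_flags;
    int b_error;
};

/* Original test: every failed write is marked dirty again for a retry. */
static bool redirty_original(const struct model_buf *bp)
{
    assert(bp != NULL);
    return (bp->b_flags & (B_READ | B_ERROR)) == B_ERROR;
}

/* Proposed test: a failed write is retried unless the error is ENXIO. */
static bool redirty_patched(const struct model_buf *bp)
{
    assert(bp != NULL);
    return (bp->b_flags & (B_READ | B_ERROR)) == B_ERROR &&
        bp->b_error != ENXIO;
}
```

With b_flags = B_ERROR and b_error = ENXIO, redirty_original() returns
true and redirty_patched() returns false.  With b_error = EIO both
return true, so errors that may be transient are still retried.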

Combined with what appears to be a bug in the routine waitfreebuffers(),
this can lead to an infinite busy loop in the kernel inside a region
of code protected by splbio().
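
The suspected waitfreebuffers() problem can be illustrated with a small
user-space model.  This is only a sketch under stated assumptions:
struct model_state, model_flush(), model_sleep() and wait_model() are
hypothetical, the flush and the sleep are replaced by counters, and
max_iter exists only to keep the model finite.  The loop shape and the
two variants of the break test follow the diff in this report.

```c
#include <stdbool.h>

/* Hypothetical model state; the kernel uses globals of the same names. */
struct model_state {
    int numfreebuffers;
    int hifreebuffers;
    int flush_gain;   /* buffers freed by each flush in this model */
    int wake_gain;    /* buffers freed while the caller sleeps */
    int flushes;      /* number of flush calls made */
    int sleeps;       /* number of times the caller slept */
};

/* Stand-in for flushdirtybuffers(). */
static void model_flush(struct model_state *s)
{
    s->flushes++;
    s->numfreebuffers += s->flush_gain;
}

/* Stand-in for tsleep(); assumes a wakeup arrives after wake_gain frees. */
static void model_sleep(struct model_state *s)
{
    s->sleeps++;
    s->numfreebuffers += s->wake_gain;
}

/* Loop with either the original or the proposed break test. */
static void wait_model(struct model_state *s, bool patched, int max_iter)
{
    int i = 0;

    while (s->numfreebuffers < s->hifreebuffers && i++ < max_iter) {
        model_flush(s);
        bool still_short = s->numfreebuffers < s->hifreebuffers;
        /* Original: leave when still short.  Proposed: leave when satisfied. */
        if (patched ? !still_short : still_short)
            break;
        model_sleep(s);
    }
}
```

When the flush frees nothing (flush_gain = 0), the original test returns
immediately without sleeping and still short of buffers, so a caller
that retries until buffers are available never yields the CPU.  The
proposed test sleeps instead and returns once the wakeup has supplied
enough free buffers.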

>How-To-Repeat:

Use Quantum disks.


>Fix:
	
Index: vfs_bio.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.193.2.5
diff -u -r1.193.2.5 vfs_bio.c
--- vfs_bio.c	1999/04/20 19:54:20	1.193.2.5
+++ vfs_bio.c	1999/05/12 19:57:13
@@ -577,7 +577,8 @@
 	if (bp->b_flags & B_LOCKED)
 		bp->b_flags &= ~B_ERROR;
 
-	if ((bp->b_flags & (B_READ | B_ERROR)) == B_ERROR) {
+	if ((bp->b_flags & (B_READ | B_ERROR)) == B_ERROR &&
+		bp->b_error != ENXIO) {
 		bp->b_flags &= ~B_ERROR;
 		bdirty(bp);
 	} else if ((bp->b_flags & (B_NOCACHE | B_INVAL | B_ERROR | B_FREEBUF)) ||
@@ -1219,7 +1220,7 @@
 waitfreebuffers(int slpflag, int slptimeo) {
 	while (numfreebuffers < hifreebuffers) {
 		flushdirtybuffers(slpflag, slptimeo);
-		if (numfreebuffers < hifreebuffers)
+		if (numfreebuffers >= hifreebuffers)
 			break;
 		needsbuffer |= VFS_BIO_NEED_FREE;
 		if (tsleep(&needsbuffer, (PRIBIO + 4)|slpflag, "biofre", slptimeo))



>Release-Note:
>Audit-Trail:
>Unformatted:





