From: Andrew Gallatin
Date: Sun, 17 Oct 1999 22:00:23 -0400 (EDT)
To: sos@freebsd.org
Cc: alpha@freebsd.org, "Erik H. Bakke"
Subject: workaround for ata driver woes on alpha
Message-ID: <14346.31193.248797.237477@grits.cs.duke.edu>

Søren,

There's a problem with the ata-driver on alphas.
Under heavy disk load, the machine will complain "ad_timeout: lost disk
contact - resetting" and then promptly panic & leave something like the
following stack trace:

panic: trap
#0  0xfffffc0000386c2c in boot (howto=260) at ../../kern/kern_shutdown.c:278
278             savectx(&dumppcb);
(kgdb) bt
#0  0xfffffc0000386c2c in boot (howto=260) at ../../kern/kern_shutdown.c:278
#1  0xfffffc0000344530 in db_fncall (dummy1=0, dummy2=0, dummy3=0, dummy4=0x0)
    at ../../ddb/db_command.c:532
#2  0xfffffc00003441a4 in db_command (last_cmdp=0xfffffc00005b1a60,
    cmd_table=0x0, aux_cmd_tablep=0xfffffc00005d6990)
    at ../../ddb/db_command.c:333
#3  0xfffffc0000344320 in db_command_loop () at ../../ddb/db_command.c:455
#4  0xfffffc0000347ff8 in db_trap (type=0, code=0) at ../../ddb/db_trap.c:71
#5  0xfffffc00005051c8 in kdb_trap (a0=1, a1=1, a2=9600, entry=3,
    regs=0xfffffe0011955500) at ../../alpha/alpha/db_interface.c:194
#6  0xfffffc0000512d58 in trap (a0=1, a1=15, a2=9600, entry=3,
    framep=0xfffffe0011955500) at ../../alpha/alpha/trap.c:285
#7  0xfffffc0000505ad0 in XentIF () at ../../alpha/alpha/exception.s:63
#8  0xfffffc000050538c in Debugger (msg=0x0)
    at ../../alpha/alpha/db_interface.c:256
#9  0xfffffc0000387354 in panic (fmt=0xfffffc00005a76fc "trap")
    at ../../kern/kern_shutdown.c:528
#10 0xfffffc00005131ec in trap (a0=40, a1=1, a2=0, entry=2,
    framep=0xfffffe0011955740) at ../../alpha/alpha/trap.c:530
#11 0xfffffc0000505b2c in XentMM () at ../../alpha/alpha/exception.s:94
#12 0xfffffc0000523b04 in ad_transfer (request=0xfffffe00087e3c00)
    at ../../dev/ata/ata-disk.c:431
#13 0xfffffc0000521d38 in ata_start (scp=0xfffffe0008713400)
    at ../../dev/ata/ata-all.c:583
#14 0xfffffc0000522338 in ata_reinit (scp=0xfffffe0008713400)
    at ../../dev/ata/ata-all.c:716
#15 0xfffffc000052448c in ad_timeout (request=0xfffffe00087e3c00)
    at ../../dev/ata/ata-disk.c:648
#16 0xfffffc000039025c in softclock () at ../../kern/kern_timeout.c:131
#17 0xfffffc0000376d70 in hardclock (frame=0xfffffe00119559e0)
    at ../../kern/kern_clock.c:253
#18 0xfffffc000051564c in handleclock (arg=0xfffffe00119559e0)
    at ../../alpha/alpha/clock.c:266
#19 0xfffffc0000513e34 in interrupt (a0=0, a1=1536, a2=18446739675668704635,
    framep=0xfffffe00119559e0) at ../../alpha/alpha/interrupt.c:101
#20 0xfffffc0000505afc in XentInt () at ../../alpha/alpha/exception.s:78

I admit to not understanding callouts, so you might want to take this
theory with a grain of salt:

I believe what is happening is that ad_timeout() gets called (quite
prematurely) at spl0. While ad_timeout() is executing, the interrupt
comes in for the request in question. The interrupt handler frees the
request that the ad_timeout() call chain is currently operating on (or
otherwise messes with it). The request is then corrupted, and chaos
(a machine check, or a trap for an invalid access) ensues.

I'm tempted to wrap ad_timeout() in splbio(), but there is still a
window when ad_callout() is being called that we'll be at spl0 (is this
right, is it called at spl0? this is what I don't know..)

Anyway, we see this on the alpha because the timeout is hardcoded to
fire after 300 ticks. This is roughly 3 seconds on an x86 (typically
hz<=128) but it is less than 1/3 of a second on an alpha (typically
hz>=1024). The following patch levels the playing field & seems to
"fix" the problem on alpha (at least I'm now able to untar ports &
then rm -rf the tree).
Index: sys/dev/ata/ata-disk.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/ata/ata-disk.c,v
retrieving revision 1.31
diff -u -r1.31 ata-disk.c
--- ata-disk.c  1999/10/10 18:08:36     1.31
+++ ata-disk.c  1999/10/18 01:13:48
@@ -417,7 +417,7 @@
     if (request->donecount == 0) {
 
         /* start timeout for this transfer */
-        request->timeout_handle = timeout((timeout_t*)ad_timeout, request, 300);
+        request->timeout_handle = timeout((timeout_t*)ad_timeout, request, 3*hz);
 
         /* setup transfer parameters */
         count = howmany(request->bytecount, DEV_BSIZE);

Drew

------------------------------------------------------------------------------
Andrew Gallatin, Sr Systems Programmer  http://www.cs.duke.edu/~gallatin
Duke University                         Email: gallatin@cs.duke.edu
Department of Computer Science          Phone: (919) 660-6590