From owner-freebsd-current@FreeBSD.ORG Thu Apr 9 07:23:36 2009 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0A23C1065680 for ; Thu, 9 Apr 2009 07:23:36 +0000 (UTC) (envelope-from rmtodd@ichotolot.servalan.com) Received: from mx1.synetsystems.com (mx1.synetsystems.com [76.10.206.14]) by mx1.freebsd.org (Postfix) with ESMTP id CAF1E8FC1B for ; Thu, 9 Apr 2009 07:23:35 +0000 (UTC) (envelope-from rmtodd@ichotolot.servalan.com) Received: by mx1.synetsystems.com (Postfix, from userid 66) id 590DDCCA; Thu, 9 Apr 2009 03:03:07 -0400 (EDT) Received: from rmtodd by servalan.servalan.com with local (Exim 4.69 (FreeBSD)) (envelope-from ) id 1Lrney-000EB8-Ta; Thu, 09 Apr 2009 01:22:48 -0500 To: freebsd-current@freebsd.org References: <20090408170231.K38445@rust.salford.ac.uk> From: Richard Todd Date: Thu, 09 Apr 2009 01:22:48 -0500 In-Reply-To: (Mark Powell's message of "Wed, 8 Apr 2009 17:36:17 +0100 (BST)") Message-ID: User-Agent: Gnus/5.1008 (Gnus v5.10.8) XEmacs/21.4.22 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: ATA related panic during ZFS scrub X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 09 Apr 2009 07:23:36 -0000 "Mark Powell" writes: > Hi, > Got a panic I'd not seen before, yesterday, whilst scrubbing one of > two pools, to fix the apparently spurious CRC errors highlighted here: > > http://kerneltrap.org/mailarchive/freebsd-current/2009/4/7/5428764 [...] > Sorry, but I don't have a serial console or dump device valid in a > panic. Here are screenshots: > > http://www.rootshell.be/~msp/IMG_4393.JPG > > Here's the edited gocr of the above: > > ----- > Fatal trap 9 general protection fault while in kernel mode > cpuid -- 1; apic id = 01 > instruction pointer = Ox8:Oxffffffff807db306 > stack pointer = Ox10:OxfffffffeeS79faaO > frame pointer = 0x10:Oxfffffffee579faeO > Code Segment = base Ox0, limit Oxfffff, type Ox1b > = DPL 0, pres 1, long 1, def32 0, gran 1 > processor eflags = interrupt enabled, resume, IOPL O > current process = 12 (irq19: atacpi1++) > [thread pid 12 tid 100032 ] > Stopped at bcopy+Ox16 repe movsq (%rsi),%es:(%rdi) Hmm, interesting, I got a panic, also apparently somewhere in the bounce buffer code: Fatal trap 12: page fault while in kernel mode cpuid = 0; apic id = 00 fault virtual address = 0x26bffcea9f0 fault code = supervisor read data, page not present instruction pointer = 0x8:0xffffffff807f4b29 stack pointer = 0x10:0xfffffffe40049980 frame pointer = 0x10:0xfffffffe400499c0 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 4 (g_down) [thread pid 4 tid 100011 ] Stopped at add_bounce_page+0x99: movq 0x20(%rbx),%rax db> lock order reversal: (Giant after non-sleepable) 1st 0xffffffff80da4ac0 bounce pages lock (bounce pages lock) @ /usr/src/sys/amd64/amd64/busdma_machdep.c:1126 2nd 0xffffffff80bcf540 Giant (Giant) @ /usr/src/sys/dev/kbdmux/kbdmux.c:1044 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2a _witness_debugger() at _witness_debugger+0x2e witness_checkorder() at witness_checkorder+0x81e _mtx_lock_flags() at _mtx_lock_flags+0x78 kbdmux_ioctl() at kbdmux_ioctl+0x116 sc_cngetc() at sc_cngetc+0xcb cncheckc() at cncheckc+0x65 cngetc() at cngetc+0x1c db_readline() at db_readline+0x77 db_read_line() at db_read_line+0x15 db_command_loop() at db_command_loop+0x38 db_trap() at db_trap+0x89 kdb_trap() at kdb_trap+0x95 trap_fatal() at trap_fatal+0x2bd trap() at trap+0x2a5 calltrap() at calltrap+0x8 --- trap 0xc, rip = 0xffffffff807f4b29, rsp = 0xfffffffe40049980, rbp = 0xfffffffe400499c0 --- add_bounce_page() at add_bounce_page+0x99 bus_dmamap_load() at bus_dmamap_load+0x256 ata_dmaload() at ata_dmaload+0x12c ata_ahci_begin_transaction() at ata_ahci_begin_transaction+0x1fe ata_start() at ata_start+0x1c5 ata_queue_request() at ata_queue_request+0x12c g_disk_start() at g_disk_start+0xf4 g_io_schedule_down() at g_io_schedule_down+0x1e4 g_down_procbody() at g_down_procbody+0x6f fork_exit() at fork_exit+0x12a fork_trampoline() at fork_trampoline+0xe --- trap 0, rip = 0, rsp = 0xfffffffe40049d30, rbp = 0 --- Note that, as others have noted in this thread, the "lock order reversal" business is all stuff that happens after the system dropped into the debugger; the actual panic is one that happens in the bounce buffer code, in add_bounce_page. I tried to get a core dump, but alas, no go: db> call doadump Physical memory: 4012 MB Dumping 1991 MB:lock order reversal: 1st 0xfffffffe405e89a8 ATA queue lock (ATA queue lock) @ /usr/src/sys/dev/ata/ata-queue.c:184 2nd 0xffffffff80da4ac0 bounce pages lock (bounce pages lock) @ /usr/src/sys/amd64/amd64/busdma_machdep.c:1126 3rd 0xfffffffe406109a8 ATA queue lock (ATA queue lock) @ /usr/src/sys/dev/ata/ata-queue.c:86 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2a _witness_debugger() at _witness_debugger+0x2e witness_checkorder() at witness_checkorder+0x81e _mtx_lock_flags() at _mtx_lock_flags+0x78 ata_queue_request() at ata_queue_request+0xae ad_dump() at ad_dump+0xaf minidumpsys() at minidumpsys+0x3e3 dumpsys() at dumpsys+0x2e doadump() at doadump+0x49 db_fncall() at db_fncall+0x8c db_command() at db_command+0x201 db_command_loop() at db_command_loop+0x50 db_trap() at db_trap+0x89 kdb_trap() at kdb_trap+0x95 trap_fatal() at trap_fatal+0x2bd trap() at trap+0x2a5 calltrap() at calltrap+0x8 --- trap 0xc, rip = 0xffffffff807f4b29, rsp = 0xfffffffe40049980, rbp = 0xfffffffe400499c0 --- add_bounce_page() at add_bounce_page+0x99 bus_dmamap_load() at bus_dmamap_load+0x256 ata_dmaload() at ata_dmaload+0x12c ata_ahci_begin_transaction() at ata_ahci_begin_transaction+0x1fe ata_start() at ata_start+0x1c5 ata_queue_request() at ata_queue_request+0x12c g_disk_start() at g_disk_start+0xf4 g_io_schedule_down() at g_io_schedule_down+0x1e4 g_down_procbody() at g_down_procbody+0x6f fork_exit() at fork_exit+0x12a fork_trampoline() at fork_trampoline+0xe --- trap 0, rip = 0, rsp = 0xfffffffe40049d30, rbp = 0 --- 1976 1960 1944 1928 1912 1896 1880 1864 1848 1832 1816 1800 1784 1768 1752 1736 1720 1704 1688 1672 1656 1640 1624 1608 1592 1576 1560 1544 1528 1512 1496 1480 1464 1448 1432 1416 1400 1384 1368 1352 1336 1320 1304 1288 1272 1256 1240 1224 1208 1192 1176 1160 1144 1128 1112 1096 1080 1064 1048 1032 1016 1000 984 968 952 936 920 904 888 872 856 840 824 808 792 776 760 744 728 712 696 680 664 648 632 616 600 584 568 552 536 520 504 488 472 456 440 424 408 392 376 360 344 328 312 296 280 264 248 232 216 200 184 168 152 136 120 104 88 72 56 40 24 8Attempt to write outside dump device boundaries. ** DUMP FAILED (ERROR 6) ** = 0 db> reset (And yes, the dump device *does* have adequate room for 4G or more of dump, and dumps have worked on other occasions.) I guess it's not too surprising that if something in the bounce buffers and/or disk IO system is confused that you might not be able to get a good dump, but it's still fairly annoying. > Someone suggested that this and/or the problems, in the other thread > above, could be related to a bug in bounce buffers which occurs quite > rarely, but is causing writing the wrong blocks or data? (Just for the record, "someone" is me, in an offlist exchange I had with Mark Powell.)