Date:      Tue, 18 Feb 2003 23:16:49 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Lars Eggert <larse@ISI.EDU>
Cc:        current@freebsd.org
Subject:   Re: panic starting gnome
Message-ID:  <3E532F61.653A09B0@mindspring.com>
References:  <3E52BB14.2040309@isi.edu>

Lars Eggert wrote:
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; lapic.id = 00000000
> fault virtual address   = 0x34
                            ****
> fault code              = supervisor read, page not present
> instruction pointer     = 0x8:0xc01b28a6

[ ... ]

> kernel: type 12 trap, code=0
> Stopped at      _mtx_lock_flags+0x26:   cmpl    $0xc03884a0,0(%esi)

[ ... ]

> trap_fatal(e91a5780,34,c0372ee0,2e4,c658e780) at trap_fatal+0x250
> trap_pfault(e91a5780,0,34,c03e0758,34) at trap_pfault+0x17a
> trap(c21a0018,10,c0360010,9e,34) at trap+0x3e5
> calltrap() at calltrap+0x5
> --- trap 0xc, eip = 0xc01b28a6, esp = 0xe91a57c0, ebp = 0xe91a57e0 ---
> _mtx_lock_flags(34,0,c035cf5f,9e,c658e780) at _mtx_lock_flags+0x26
                  **

Attempt to dereference the value "0x34" as if it were a pointer.

> namei(e91a5a44,c0207d5a,c749458c,0,c658e780) at namei+0x134

Called from here.

Debug:

	1)	Make sure that the kernel that has the fault was
		created with "config -g", so that there is a debug
		version of it lying around in the build directory.

	2)	Make sure that the kernel you installed is the
		stripped version of the debug kernel.  There are two
		kernels created as a result of "config -g": one is
		"kernel.debug" (the debug version) and the other is
		"kernel" (the stripped version).

	3)	If #1 and #2 are not true, then make them true, and
		reproduce the problem.

	4)	Boot a kernel that doesn't crash instead, so that you
		can run the debugger.

	5)	Go to the build directory, and look at the faulting
		code to see where it gets the value "0x34" to pass in
		to the _mtx_lock_flags(); this is the bogus value.  For
		example, if you had a debug kernel for the kernel that
		has the problem, and it was config'ed from i386 GENERIC,
		you would use the following sequence of commands:

			cd /sys/i386/compile/GENERIC
			gdb -k kernel.debug
			list *(namei+0x134)

	6)	Change the code so the bogus value is no longer being
		passed.

	7)	Live happily ever after.


Note that, to me, this looks like a problem with a dereference of a
"current" process which is not really current, as a result of a
wakeup occurring in an interrupt handler for an outstanding request
which was satisfied by the interrupt handler.

Note:	Under no circumstances should a page 0 address be passed
	around to anyone, since page zero is typically unmapped in
	order to trigger NULL pointer dereference faults and/or
	structure member reference faults for structure elements
	(at least in the initial 4K: range 0x00000000-0x00000fff)
	when a structure pointer itself is NULL.

	IMO, the most likely cause is that you have a null structure
	pointer, and the element at offset 0x34 into the structure is
	being referenced out of it, without checking that the pointer
	is not NULL, and the most likely culprit is a proc/kse/thread
	type structure that's not guaranteed to be valid in interrupt
	context.

	Probably, the scheduler is switching directly from interrupt
	of a process context "Q" to a wakeup of the same process "Q",
	without restoring a register value that should normally be
	restored following an interrupt.  I have no idea which of the
	schedulers you are using, so I have no idea if this should be
	an expected omission; my best guess is you are using the new
	one, though, because this is an unlikely problem with the old
	one, if it's really a scheduler wakeup problem.

> namei(e91a5a44,c0207d5a,c749458c,0,c658e780) at namei+0x134
                                   ^
                                   |
> vn_open_cred(e91a5a44,e91a5a0c,0,c2195e80,0) at vn_open_cred+0x53c
                                 ^          ^
                                 |          |
  ...all three of these are also incredibly suspicious, at first sight...


Until you are willing to list out the code where the bogus value is
being passed to the function call, there's no way any of us are
going to be able to correlate your stack traceback to our own source
trees, in order to be able to help you, unless you are running a
tagged version (e.g. 5.0-RELEASE) with no modifications.

Just saying "the most recent current" or "I CVS'up'ed on xxx date" is
really useless to us, because CVS mirrors don't contain well known
information relative to a CVS'up date.  In many cases, we will need
you to check out (at least!) a fresh /sys source tree from the CVS
repository, using a date tag, if you are not running a -RELEASE
version.  Yes, this is a long-standing problem with the FreeBSD
project itself.

If you can do this, and repeat the problem, then we can check out with
the same date tag, and determine what the code is supposed to be doing,
and what code you actually have, so we can narrow it down to setup, and
maybe fix it without having to rebuild an entire copy of the Internet,
from your machine's point of view.  8-).

Also, if your kernel configuration is different than the default, you
need to provide *DIFFS* -- DO NOT SEND THE WHOLE CONFIG FILE TO THE
LIST -- OR TO ME -- UNLESS YOU WANT TO BE IGNORED!

For a modified GENERIC config file from a checked out copy of the local
source tree, here is how you perform a context diff:

	cd /sys/i386/conf
	cvs diff -c GENERIC

-- Terry




