Date:      Sun, 20 Sep 1998 15:22:50 -0400 (EDT)
From:      Bill Paul <wpaul@skynet.ctr.columbia.edu>
To:        ben@rosengart.com
Cc:        current@FreeBSD.ORG
Subject:   Re: the fs fun never stops
Message-ID:  <199809201922.PAA21926@skynet.ctr.columbia.edu>
In-Reply-To: <Pine.GSO.4.02.9809201209530.3220-100000@echonyc.com> from "Snob Art Genre" at Sep 20, 98 12:11:56 pm

Of all the gin joints in all the towns in all the world, Snob Art Genre 
had to walk into mine and say:

> I went from yesterday's kernel to today's, and immediately after the
> "mounting NFS filesystems" (of which I have none):
> 
> Fatal trap 12: page fault while in kernel mode
> fault virtual address   = 0x40
> fault code              = supervisor read, page not present
> instruction pointer     = 0x8:0xf014a7e5
                                ^^^^^^^^^^
> stack pointer           = 0x10:0xf4ed6f24
> frame pointer           = 0x10:0xf4ed6f28
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, def32 1, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 80 (mount)
> interrupt mask          =
> trap number             = 12
> panic: page fault
> 
> syncing disks... panic: lockmgr: not holding exclusive lock

I'm confused: did the 'lockmgr' message really come up after it
said 'syncing disks?' That's a little odd...

In any case, when you see a message like this, it's not enough to
just reproduce it and send it in. The instruction pointer value that
I highlighted up there is important; unfortunately, it's also
configuration dependent. In other words, the value varies depending
on the exact kernel image that you're using. If you're using a
GENERIC kernel image from one of the snapshots, then it's possible
for somebody else to track down the offending function, but if you're
running a custom kernel then only _you_ can tell us where the fault
occurred.

What you should do is this:

- Write down the instruction pointer value. Note that the "0x8:" part
  at the beginning is not significant in this case: it's the 0xf0xxxxxx
  part that we want.
- When the system reboots, do the following:

  % nm /kernel.that.caused.the.panic | grep f0xxxxxx

  where f0xxxxxx is the instruction pointer value. The odds are you will
  not get an exact match since the symbols in the kernel symbol table are
  for the entry points of functions and the instruction pointer address
  will be somewhere inside a function, not at the start. If you don't
  get an exact match, omit the last digit from the instruction pointer
  value and try again, i.e.:

  % nm /kernel.that.caused.the.panic | grep f0xxxxx

  If that doesn't yield any results, chop off another digit. Repeat until
  you get some sort of output. The result will be a list of functions that
  may have caused the panic. This is a less than exact mechanism
  for tracking down the point of failure, but it's better than nothing.
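
The digit-chopping above can also be replaced by a direct search for the
closest text symbol at or below the faulting address. What follows is only
a sketch of that variation, not part of the procedure described above:
nearest_symbol is a made-up helper name, and it assumes that nm prints
lines of the form "address type name" with fixed-width hex addresses, so
that plain string comparison puts them in the right order.

```sh
# nearest_symbol ADDR
# Hypothetical helper: reads nm(1) output on stdin and prints the text
# symbol with the highest address that is still <= ADDR, i.e. the function
# most likely to contain the faulting instruction.
nearest_symbol() {
    # Drop a leading 0x and lower-case the hex digits.
    target=$(printf '%s\n' "$1" | sed 's/^0[xX]//' | tr 'ABCDEF' 'abcdef')
    LC_ALL=C awk -v target="$target" '
        BEGIN { best = "" }
        # Keep only "address type name" lines for text symbols whose address
        # has the same width as the target; undefined symbols carry no
        # address and so have fewer fields.
        NF == 3 && ($2 == "T" || $2 == "t") && length($1) == length(target) {
            addr = tolower($1) ""          # concatenation forces a string compare
            if (addr <= (target "") && addr >= best) {
                best = addr
                name = $3
            }
        }
        END { if (name != "") print best, name }
    '
}
```

It would be used like this, with the instruction pointer from the panic
message as the argument:

  % nm /kernel.that.caused.the.panic | nearest_symbol 0xf014a7e5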

I see people constantly show panic messages like this but rarely do I
see someone take the time to match up the instruction pointer with a
function in the kernel symbol table.

The best way to track down the cause of a panic is by capturing a crash
dump, then using gdb to do a stack trace on the crash dump. Of course,
this depends on gdb in -current working correctly, which I can't
guarantee (I recall somebody saying that the new ELF-ized gdb didn't
handle kernel crash dumps correctly: somebody should check this before
3.0 goes out of beta or there'll be a lot of red faces after the CDs
ship).

In any case, the method I normally use is this:

- Set up a kernel config file, optionally adding 'options DDB' if you
  think you need the kernel debugger for something. (I use this mainly
  for setting breakpoints if I suspect an infinite loop condition of
  some kind.)
- Use 'config -g KERNELCONFIG' to set up the build directory.
- cd /sys/compile/KERNELCONFIG; make
- Wait for kernel to finish compiling.
- cp kernel kernel.debug
- strip -d kernel
- mv /kernel /kernel.orig
- cp kernel /
- reboot

Note that YOU DO _NOT_ WANT TO ACTUALLY BOOT THE KERNEL WITH ALL THE
DEBUG SYMBOLS IN IT. A kernel compiled with -g can easily be close to
10MB in size. You don't have to actually boot this massive image: you
only need it later for gdb (gdb wants the symbol table). Instead, you
want to keep a copy of the full image and create a second image with
the debug symbols stripped out using strip -d. It is this second
stripped image that you want to boot.
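
The build-and-install steps listed above could be wrapped up roughly as
follows. This is a sketch under assumptions, not a tested installer:
install_debug_kernel and the DRY_RUN variable are names invented here, the
function expects 'config -g' to have been run already, and it leaves the
reboot to you. Each step only runs if the previous one succeeded, so a
failed build never touches /kernel.

```sh
# install_debug_kernel KERNELCONFIG
# Hypothetical wrapper for the steps above. Set DRY_RUN=1 to print the
# commands instead of executing them. The body runs in a subshell so the
# cd does not affect the caller.
install_debug_kernel() (
    [ -n "${1:-}" ] || { echo "usage: install_debug_kernel KERNELCONFIG" >&2; exit 1; }

    run() {
        if [ -n "${DRY_RUN:-}" ]; then
            echo "$*"
        else
            "$@"
        fi
    }

    run cd "/sys/compile/$1" &&
    run make &&
    run cp kernel kernel.debug &&
    run strip -d kernel &&
    run mv /kernel /kernel.orig &&
    run cp kernel /
)
```

Running it with DRY_RUN=1 first shows exactly which commands would be
executed, which is worth checking before you let it move /kernel aside.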

To make sure you capture a crash dump, you need to edit /etc/rc.conf and
set 'dumpdev' to point to your swap partition. This will cause the
rc scripts to use the dumpon command to enable crash dumps. You can
also run dumpon manually. After a panic, the crash dump can be
recovered using savecore; if dumpdev is set in /etc/rc.conf, the
rc scripts will run savecore automatically and put the crash dump
in /var/crash.
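
For illustration, the rc.conf entry might look like the line below. The
device name is only an assumption (the swap partition on a first IDE
disk); substitute whatever your own swap device is called.

```sh
# /etc/rc.conf fragment. /dev/wd0s1b is a placeholder for your swap partition.
dumpdev="/dev/wd0s1b"
```

The manual equivalent would be 'dumpon /dev/wd0s1b' before the panic and
'savecore /var/crash' after the reboot, again with the placeholder device
name replaced by your own.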

NOTE: FreeBSD crash dumps are usually the same size as the physical
RAM size of your machine. That is, if you have 64MB of RAM, you will
get a 64MB crash dump. Therefore you must make sure there's enough
space in /var/crash to hold the dump. Alternatively, you can run savecore
manually and have it recover the crash dump to another directory where
you have more room. It's possible to limit the size of the crash dump
by using 'options MAXMEM=(foo)' to set the amount of memory the kernel
will use to something a little more sensible. For example, if you have
128MB of RAM, you can limit the kernel's memory usage to 16MB so that
your crash dump size will be 16MB instead of 128MB.
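
A quick way to check the space requirement before you need it is sketched
below. dump_fits is a hypothetical helper, and it assumes a df that
accepts the POSIX -k and -P flags; the RAM size is passed in by hand, in
kilobytes, rather than detected.

```sh
# dump_fits DIR RAM_KB
# Hypothetical helper: succeeds if DIR has at least RAM_KB kilobytes free.
dump_fits() {
    # df -kP prints one header line, then one line for the filesystem
    # holding DIR, with the available space (in KB) in the fourth column.
    free_kb=$(df -kP "$1" | awk 'NR == 2 { print $4 }')
    [ -n "$free_kb" ] && [ "$free_kb" -ge "$2" ]
}
```

For the 64MB machine in the example above, that would be:

  % dump_fits /var/crash 65536 || echo "not enough room for a crash dump"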

Once you have recovered the crash dump, you can get a stack trace
with gdb as follows:

% gdb -k /sys/compile/KERNELCONFIG/kernel.debug /var/crash/vmcore.0
(gdb) where

Note that there may be several screens worth of information; ideally
you should use script(1) to capture all of them. Using the unstripped
kernel image with all the debug symbols should show the exact line
of kernel source code where the panic occurred. Usually you have to read
the stack trace from the bottom up in order to trace the exact sequence
of events that led to the crash. You can also use gdb to print out
the contents of various variables or structures in order to examine
the system state at the time of the crash.

Now, if you're really insane and have a second computer, you can also
configure gdb to do remote debugging such that you can use gdb on one
system to debug the kernel on another system, including setting
breakpoints and single-stepping through the kernel code, just like
you can do with a normal user-mode program. I haven't played with
this yet as I don't often have the chance to set up two machines
side by side for debugging purposes.

-Bill

-- 
=============================================================================
-Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu
Work:         wpaul@ctr.columbia.edu | Center for Telecommunications Research
Home:  wpaul@skynet.ctr.columbia.edu | Columbia University, New York City
=============================================================================
 "It is not I who am crazy; it is I who am mad!" - Ren Hoek, "Space Madness"
=============================================================================

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message


