Date:      Fri, 12 Nov 2004 23:45:19 -0800 (PST)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        Sean Farley <sean-freebsd@farley.org>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: bugs in contigmalloc*() related to "page not found in hash" panics
Message-ID:  <200411130745.iAD7jJR5079986@apollo.backplane.com>
References:  <200411101801.iAAI1SkK061883@apollo.backplane.com> <200411110651.iAB6pekO065188@apollo.backplane.com> <20041112212917.L1667@thor.farley.org>

:Unfortunately, it is the binary driver from Nvidia.  Maybe someone using
:DragonFly is having similar problems?

    Not that I know of.  There's not much that can be done with binary-only
    drivers short of throwing them away and finding hardware that works
    with normal drivers.

:I ran the program on the vmcore and debug kernel from the recent crash
:since the vmcore with the "page not found in hash" panic has long since
:been deleted.  As expected, the program showed no problem with the
:vmcore.

    If you ever get the page not found in hash panic again while running
    FreeBSD, running that program on the kernel core may help the FreeBSD
    folks track the problem down (if it turns out not to be the contigmalloc
    bug that I pointed out earlier).

:> :     Fatal trap 12: page fault while in kernel mode
:> :     fault virtual address   = 0x30
:> :     fault code              = supervisor read, page not present
:...
:
:I will attach it, and I will also send it to Nvidia as I did once many
:moons ago.  One interesting symptom that I just noticed very close to
:the time of instability is this message from /var/log/messages:
    
    I couldn't get much out of the backtrace.  It looks like the filesystem
    is trying to generate a core file and the vn_open() call has failed and
    is trying to clean up.  The cleanup code looks ok and the vnode is nothing
    special.  It seems to have failed trying to do a NULL pointer dereference
    of the inode pointer, but it's hard to tell because most of the local
    variables look garbaged up (which is to be expected for a kernel compiled
    -O since the registers those variables are stored in are not accessible
    to the dump).  Those are code paths that are usually pretty widely
    exercised in the system, though.

:Here is near the end of strings output of vmcore just before panic:
:
:<118>Wed Nov 10 22:46:44 CST 2004
:<3>stray irq 7
:<118>Nov 10 22:47:14 thor /kernel: stray irq 7
:<3>stray irq 7
:<3>stray irq 7
:<118>Nov 10 22:47:46 thor last message repeated 2 times
:
:The parallel port is disabled, and I do not see these messages without
:the Nvidia driver.

    Yah, I'm afraid there's nothing there that rings a bell.  Again, the
    problem with using a binary-only driver is that there is never any
    visibility into it and no way to fix bugs.  It makes such drivers of
    strictly limited utility.

:>    kernel (assuming the kernel is compiled with options INVARIANTS and
:>    options INVARIANT_SUPPORT) mostly preclude an error path to this
:>    panic from the pmap code.  However, pmap panics could be related to
:>    corrupted VM pages.
:
:I have not tried compiling these options into the kernel.  Sometime this
:weekend I will give them a shot.

    I recommend that even production machines always be run with INVARIANTS
    and INVARIANT_SUPPORT.
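
    As a minimal sketch of what that looks like, assuming a custom kernel
    configuration file (the name MYKERNEL below is hypothetical), the two
    options would be enabled with lines like these, after which the kernel
    has to be rebuilt and installed:

```
# Hypothetical custom kernel config file, e.g. MYKERNEL (name is an assumption)
options 	INVARIANTS		# enable extra kernel sanity checks
options 	INVARIANT_SUPPORT	# sanity checks of internal structures, needed by INVARIANTS
```

    The extra checks cost some performance, but they tend to turn silent
    corruption into an earlier and more diagnosable panic.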

						-Matt

:Thank you for your help and the detailed description of the bug
:(tricksy, sneaky bug) you fixed.
:
:Sean



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200411130745.iAD7jJR5079986>