Date: Wed, 10 Nov 2004 22:51:40 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Sean Farley <sean-freebsd@farley.org>
Cc: freebsd-hackers@freebsd.org
Subject: Re: bugs in contigmalloc*() related to "page not found in hash" panics
Message-ID: <200411110651.iAB6pekO065188@apollo.backplane.com>
References: <200411101801.iAAI1SkK061883@apollo.backplane.com> <20041110230601.T416@thor.farley.org>
:>    Here is the DragonFly commit.
:>
:>    http://www.dragonflybsd.org/cvsweb/src/sys/vm/vm_contig.c.diff?r1=1.10&r2=1.11&f=u
:>
:> FreeBSD-4:
:>
:>    FreeBSD-4 is in the same situation that DFly was in and requires
:>    the same fixes as the above patch, though note that in FreeBSD-4
:>    the contigmalloc() code is in vm_page.c, not vm_contig.c.
:
:I tried the patch in the hopes it would fix my Nvidia-driver
:crash-on-demand system.  :)  While my system appears stable without the
:Nvidia driver but with this patch, my system can still crash easily with
:the Nvidia driver.  It usually dies with a:

    Point me at the nvidia driver source and I will do a quick audit of it
    to see if there is anything obviously broken.  This is running on
    FreeBSD-4.x?  If it's a binary-only driver there isn't much I can do,
    though.

    The 'page not found in hash' panic can ONLY occur one way:  when a
    vm_page's pindex or object fields are directly changed or (under 4.x)
    if the VM object's hash_rand field is changed.

    The only valid way to change either of these fields is to call
    vm_page_insert() or to call vm_page_remove().  That's it.  There is
    *NO* other legal way to change those fields within a vm_page that
    won't result in corruption of the VM page hash table (4.x) or
    object->root splay tree (5.x).  The fields cannot be modified
    directly, the vm_page cannot be safely bzero'd, you can't 0 or NULL
    out the fields, or assign a new index or object, etc... only
    vm_page_insert() and vm_page_remove() can do that safely.

    From looking at your bug reports and comparing them with my own
    extensive research on this particular crash I will say *DEFINITIVELY*
    that it is *NOT* a RAM problem.  It's software-caused corruption,
    period, end of story.

    I will also note that the backtrace from the panic path in the second
    PR URL is very similar to what we were seeing before we fixed the
    issue in contigmalloc... the problem is that the VM page hash table /
    splay tree gets corrupted *LONG* before the code path that actually
    causes the panic, so it's virtually impossible to glean any
    information from the panic itself.

    There is a test you can run.  If you have a kernel vmcore and related
    kernel image that contains the 'page not found in hash' panic, you can
    run this program on it to do a sanity check on the VM page array and
    hash table.  I have modified this program to work with FreeBSD-4.x
    (I'd have to rewrite it to make it work with 5.x/6.x, which I don't
    have time to do):

        fetch http://leaf.dragonflybsd.org/~dillon/vmpageinfo_4x.c

    and follow the instructions in the comments to compile it.  Run it
    with '-N kernel.x -M vmcore.X -d'.

    This program will sanity check the VM page hash table from the core
    file and tell you if there are any pages missing from the hash table
    or sitting in the wrong slot.  My expectation is that it will find a
    page sitting in the wrong slot.

:
:    Fatal trap 12: page fault while in kernel mode
:    fault virtual address   = 0x30
:    fault code              = supervisor read, page not present
:

    This is a different failure.  I'd need a backtrace or a kernel.debug
    and vmcore to play with, and a FreeBSD developer would probably be
    able to help you more with it.  It's obviously a NULL pointer
    indirection of some sort.

:Two "page not found in hash" panics that I believe are related to the
:Nvidia driver:
:http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/71086
:http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/72539

    The 'page not found in hash' bug is *NOT* likely to be related to any
    of the pmap code, simply because the sanity checks already in the
    kernel (assuming the kernel is compiled with options INVARIANTS and
    options INVARIANT_SUPPORT) mostly preclude an error path to this
    panic from the pmap code.  However, pmap panics could be related to
    corrupted VM pages.

                                        -Matt
                                        Matthew Dillon
                                        <dillon@backplane.com>

:The first PR (mine) asks about a change in pmap_remove() that was later
:removed from FreeBSD-4 but left in FreeBSD-5.  If anyone knows why this
:happened, I would be interested in knowing.
:
:Sean
:--
:sean-freebsd@farley.org
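
[Editor's illustration, not part of the original message.]  The rule
stated in the message, that a page's object and pindex fields may only
change through the insert and remove routines, can be illustrated with a
small user-space toy.  This is only a sketch under stated assumptions:
every name here (toy_page, toy_hash, toy_page_insert, toy_page_remove,
toy_count_misplaced) is hypothetical, the hash function is invented for
the example, and none of it is the FreeBSD or DragonFly kernel code or
the vmpageinfo program.  It shows only why a direct field write leaves a
page in the wrong hash slot long before anything notices.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 8                  /* toy table size, a power of two */

struct toy_object { int id; };      /* stand-in for a VM object */

struct toy_page {                   /* stand-in for a vm_page */
    struct toy_object *object;      /* owner, NULL when not inserted */
    unsigned long      pindex;      /* page index within the object */
    struct toy_page   *hnext;       /* next page in the same bucket */
};

static struct toy_page *buckets[NBUCKETS];

/* Invented hash: the slot depends on BOTH the object and the pindex. */
static unsigned toy_hash(const struct toy_object *obj, unsigned long pindex)
{
    unsigned long id = obj ? (unsigned long)obj->id : 0;
    return (unsigned)((id + pindex) & (NBUCKETS - 1));
}

/* The only legal way to give a page an (object, pindex) identity. */
static void toy_page_insert(struct toy_page *m, struct toy_object *obj,
                            unsigned long pindex)
{
    unsigned h = toy_hash(obj, pindex);

    assert(m->object == NULL);      /* must be removed before re-insert */
    m->object = obj;
    m->pindex = pindex;
    m->hnext = buckets[h];
    buckets[h] = m;
}

/* The only legal way to clear it.  Returns -1 where a real kernel
 * would panic because the page is not in the slot its fields name. */
static int toy_page_remove(struct toy_page *m)
{
    struct toy_page **pp = &buckets[toy_hash(m->object, m->pindex)];

    while (*pp != NULL && *pp != m)
        pp = &(*pp)->hnext;
    if (*pp == NULL)
        return -1;                  /* "page not found in hash" */
    *pp = m->hnext;
    m->object = NULL;
    m->pindex = 0;
    m->hnext = NULL;
    return 0;
}

static struct toy_page *toy_page_lookup(struct toy_object *obj,
                                        unsigned long pindex)
{
    struct toy_page *m;

    for (m = buckets[toy_hash(obj, pindex)]; m != NULL; m = m->hnext)
        if (m->object == obj && m->pindex == pindex)
            return m;
    return NULL;
}

/* Sanity check: count pages whose current fields hash to a different
 * bucket than the one they are actually chained in. */
static int toy_count_misplaced(void)
{
    int bad = 0;

    for (unsigned h = 0; h < NBUCKETS; h++)
        for (struct toy_page *m = buckets[h]; m != NULL; m = m->hnext)
            if (toy_hash(m->object, m->pindex) != h)
                bad++;
    return bad;
}
```

In this toy, writing pg.pindex directly after toy_page_insert() leaves
the page chained in its old bucket: toy_count_misplaced() then reports
it, lookups under the new index miss it, and a later toy_page_remove()
fails even though the faulty write happened much earlier.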