Date:      Fri, 19 Jul 2013 19:16:15 -0700
From:      Yuri <yuri@rawbw.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        Alan Cox <alc@freebsd.org>, freebsd-hackers@freebsd.org
Subject:   Re: Kernel crashes after sleep: how to debug?
Message-ID:  <51E9F2EF.6000908@rawbw.com>
In-Reply-To: <201307191704.47622.jhb@freebsd.org>
References:  <51E3A334.8020203@rawbw.com> <201307191100.08549.jhb@freebsd.org> <51E9945B.1050907@rawbw.com> <201307191704.47622.jhb@freebsd.org>

On 07/19/2013 14:04, John Baldwin wrote:
> Hmm, that definitely looks like garbage.  How are you with gdb scripting?
> You could write a script that walks the PQ_ACTIVE queue and sees if this
> pointer ends up in there.  It would then be interesting to see if the
> previous page's next pointer is corrupted, or if the pageq.tqe_prev references
> that page then it could be that this vm_page structure has been stomped on
> instead.

As you suggested, I printed the list of pages. The iteration in frame 8 
actually goes through the PQ_INACTIVE queue, so I printed those pages.
<...skipped...>
### page#2245 ###
$4492 = (struct vm_page *) 0xfffffe00b5a27658
$4493 = {pageq = {tqe_next = 0xfffffe00b5a124d8, tqe_prev = 
0xfffffe00b5b79038}, listq = {tqe_next = 0x0, tqe_prev = 
0xfffffe00b5a276e0},
   left = 0x0, right = 0x0, object = 0xfffffe005e3f7658, pindex = 5, 
phys_addr = 1884901376, md = {pv_list = {tqh_first = 0xfffffe005e439ce8,
       tqh_last = 0xfffffe00795eacc0}, pat_mode = 6}, queue = 0 '\0', 
segind = 2 '\002', hold_count = 0, order = 13 '\r', pool = 0 '\0',
   cow = 0, wire_count = 0, aflags = 1 '\001', flags = 64 '@', oflags = 
0, act_count = 9 '\t', busy = 0 '\0', valid = 255 '\377', dirty = 255 '\377'}
### page#2246 ###
$4494 = (struct vm_page *) 0xfffffe00b5a124d8
$4495 = {pageq = {tqe_next = 0xfffffe00b460abf8, tqe_prev = 
0xfffffe00b5a27658}, listq = {tqe_next = 0x0, tqe_prev = 
0xfffffe005e3f7cf8},
   left = 0x0, right = 0x0, object = 0xfffffe005e3f7cb0, pindex = 1, 
phys_addr = 1881952256, md = {pv_list = {tqh_first = 0xfffffe005e42dd48,
       tqh_last = 0xfffffe007adb03a8}, pat_mode = 6}, queue = 0 '\0', 
segind = 2 '\002', hold_count = 0, order = 13 '\r', pool = 0 '\0',
   cow = 0, wire_count = 0, aflags = 1 '\001', flags = 64 '@', oflags = 
0, act_count = 9 '\t', busy = 0 '\0', valid = 255 '\377', dirty = 255 '\377'}
### page#2247 ###
$4496 = (struct vm_page *) 0xfffffe00b460abf8
$4497 = {pageq = {tqe_next = 0xfe26, tqe_prev = 0xfffffe00b5a124d8}, 
listq = {tqe_next = 0xfffffe0081ad8f70, tqe_prev = 0xfffffe0081ad8f78},
   left = 0x6, right = 0xd00000201, object = 0x100000000, pindex = 
4294901765, phys_addr = 18446741877712530608, md = {pv_list = {
       tqh_first = 0xfffffe00b460abc0, tqh_last = 0xfffffe00b5579020}, 
pat_mode = -1268733096}, queue = 72 'H', segind = -85 '\253',
   hold_count = -19360, order = 0 '\0', pool = 254 '\376', cow = 65535, 
wire_count = 0, aflags = 0 '\0', flags = 0 '\0', oflags = 0,
   act_count = 0 '\0', busy = 176 '\260', valid = 208 '\320', dirty = 126 '~'}
### page#2248 ###
$4498 = (struct vm_page *) 0xfe26

Page #2247 is the same one that caused the problem in frame 8. Its 
tqe_next is clearly invalid, so the iteration stopped there.
It appears that this structure has been stomped on: most of its other 
fields hold garbage as well. This page was probably supposed to be a 
valid inactive page.
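
The walk above can be sketched in user-space C with sys/queue.h. Note 
that in the dump the tqe_prev of page #2247 still equals the address of 
page #2246 (pageq.tqe_next is the first field of the structure), so a 
back-link check alone would pass here; what gives the damage away is the 
implausible tqe_next. The sketch therefore checks both. `struct page`, 
`in_array` and `check_list` are hypothetical names, and the bounds test 
against a flat array is an assumption standing in for whatever 
address-range test fits the real vm_page storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct vm_page: only the linkage matters. */
struct page {
	TAILQ_ENTRY(page) pageq;	/* first member, as in the dump */
	int id;
};
TAILQ_HEAD(pagelist, page);

/* True if p points at a properly aligned element of arr[0..n). */
static int
in_array(const struct page *p, const struct page *arr, size_t n)
{
	uintptr_t a = (uintptr_t)p;
	uintptr_t lo = (uintptr_t)arr;
	uintptr_t hi = (uintptr_t)(arr + n);

	return (a >= lo && a < hi && (a - lo) % sizeof(struct page) == 0);
}

/*
 * Walk the list without trusting it.  Returns -1 if every link is
 * consistent, otherwise the 0-based position at which the walk had to
 * stop (an implausible pointer or a back link that does not match).
 */
static long
check_list(struct pagelist *head, const struct page *arr, size_t n)
{
	struct page **link;	/* the pointer we are about to follow */
	long pos = 0;

	assert(head != NULL && arr != NULL);
	link = &head->tqh_first;
	while (*link != NULL) {
		struct page *m = *link;

		if (!in_array(m, arr, n))
			return (pos);	/* garbage pointer: never dereference */
		if (m->pageq.tqe_prev != link)
			return (pos);	/* back link does not point at us */
		link = &m->pageq.tqe_next;
		pos++;
	}
	/* The head's tail pointer must name the last link we followed. */
	return (head->tqh_last == link ? -1 : pos);
}
```

With a garbage tqe_next in the element at position k, the function 
reports k + 1, which matches the dump: the walk dies on the pointer 
printed as page #2248, while the damaged structure is #2247.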


>
> Ultimately I think you will need to look at any malloc/VM/page operations
> done in the suspend and resume paths to see where this happens.  It might
> be slightly easier if the same page gets trashed every time as you could
> print out the relevant field periodically during suspend and resume to
> narrow down where the breakage occurs.

I am thinking of adding code that walks all page queues and verifies 
that they are not damaged in this way, and calling it as each device 
wakes up from sleep.
dev/acpica/acpi.c has acpi_EnterSleepState, which, as I understand, 
contains top-level code for S3 sleep. Before sleep it invokes event 
'power_suspend' on all devices, and after sleep it calls 'power_resume' 
on devices. So maybe I will call the page check procedure after 
'power_suspend' and 'power_resume'.
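
One way to structure such calls, sketched in user-space C, is to run the 
same check at named checkpoints and remember the last one that passed 
and the first one that failed; the damage then happened between the two. 
All names here (`checkpoint_log`, `checkpoint`, `canary`, 
`canary_intact`) are hypothetical, and the canary check is only a 
placeholder for a real page-queue walk.

```c
#include <assert.h>
#include <stddef.h>

/* A check returns 0 while the state it guards is intact. */
typedef int (*check_fn)(void *arg);

struct checkpoint_log {
	const char *last_good;	/* last checkpoint that passed */
	const char *first_bad;	/* first checkpoint that failed */
};

/* Run the check once; after the first failure the log is frozen. */
static void
checkpoint(struct checkpoint_log *log, const char *where,
    check_fn check, void *arg)
{
	assert(log != NULL && check != NULL);
	if (log->first_bad != NULL)
		return;
	if (check(arg) == 0)
		log->last_good = where;
	else
		log->first_bad = where;
}

/* Placeholder check: a word that must keep its expected value. */
struct canary {
	unsigned long value;
	unsigned long expected;
};

static int
canary_intact(void *arg)
{
	const struct canary *c = arg;

	return (c->value == c->expected ? 0 : -1);
}
```

Calling `checkpoint()` before and after the 'power_suspend' and 
'power_resume' steps, and again some time later, would show whether the 
damage falls inside the suspend/resume path or after it.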

But it is possible that memory gets damaged somewhere else after 
power_resume happens.
Do you have any thoughts or suggestions?

Yuri


