Date:      Thu, 24 Oct 2002 15:00:59 -0400 (EDT)
From:      Andrew Gallatin <gallatin@cs.duke.edu>
To:        Fred Clift <fclift@verio.net>
Cc:        <freebsd-alpha@FreeBSD.ORG>
Subject:   Re: debugging around machine-checks...
Message-ID:  <15800.17259.397652.862956@grasshopper.cs.duke.edu>
In-Reply-To: <20021023113324.U98807-100000@vespa.dmz.orem.verio.net>
References:  <15798.56033.844389.549256@grasshopper.cs.duke.edu> <20021023113324.U98807-100000@vespa.dmz.orem.verio.net>


Fred Clift writes:
 > by hand I run dumpon -v /dev/da0b (which is my swap partition, twice what
 > I have of ram in size)
 > 
 > and then I do my fiddling with XFree86 that gives me the machine-check and
 > I end up at the SRM prompt.  At this point, I know that just booting will
 > fail.  I have to power-cycle the box and when it comes back up, savecore
 > either doesn't find anything, or isn't being run by the rc scripts.  Once
 > I get a chance to log in /var/crash has only minfree in it...
 > 

That *should* work..

 > Should I be doing something else?
 > 
 > I just looked in /var/log/messages and saw no evidence of crashdumps being
 > written (ie dumping to.... or dump 254 253 252 251...  etc).

If you power-cycle, the message buffer is lost.

When I would crash X on an old Miata, half the time I'd get a
'machine check in pal mode' -- this doesn't even get caught by the
OS.

However, if you're seeing the message below, I do not understand
why you're not getting a crashdump.

In any case, since the problem is probably with the X server (based on
the message below), a crashdump would not help you.


 > 
 > >
 > > Can't you use the program counter from the panic output as a start?
 > > If it's in the X server, there should be a PC from userspace.
 > > (see disclaimer below)
 > >
 > 
 > So can you interpret this for me then - honestly I just don't know what all
 > the fields represent -- I should probably just go read the source code and
 > see :)
 > 
 > Oct  8 06:42:24 liron /kernel: unexpected machine check:
 > Oct  8 06:42:24 liron /kernel:
 > Oct  8 06:42:24 liron /kernel: mces    = 0x1
 > Oct  8 06:42:24 liron /kernel: vector  = 0x660
 > Oct  8 06:42:24 liron /kernel: param   = 0xfffffc0000006068
 > Oct  8 06:42:24 liron /kernel: pc      = 0x1604006ac
 > Oct  8 06:42:24 liron /kernel: ra      = 0x12006cb10
 > Oct  8 06:42:24 liron /kernel: curproc = 0xfffffe0009910200
 > Oct  8 06:42:24 liron /kernel: pid = 90765, comm = XFree86
 > Oct  8 06:42:24 liron /kernel:
 > Oct  8 06:42:24 liron /kernel: panic: machine check
 > 
 > 
 > The program counter is pc? so I should be able to, with gdb and a
 > debug-version of XFree86, figure out what code this is?

Yes, except it's in a shared lib, or other dynamically loaded text.
I don't know how you could debug that without a coredump.
The ra (return address) is at least somewhere in the main text
of the program (not a shared lib).
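
As a rough illustration of how the pc and ra above can be told apart,
here is a minimal sketch in C. The three base addresses are assumptions
chosen to match the values in this panic output (main text near
0x120000000, dynamically loaded text near 0x160000000, kernel addresses
at 0xfffffc0000000000 and up); they are not taken from the kernel source
and may differ on other setups. The names classify_addr and the region
enum are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed address-space layout; illustrative only, not from the source. */
#define ASSUMED_MAIN_TEXT_BASE 0x120000000ULL
#define ASSUMED_SHLIB_BASE     0x160000000ULL
#define ASSUMED_KERNEL_BASE    0xfffffc0000000000ULL

/* Hypothetical labels for where an address appears to live. */
enum region {
    REGION_UNKNOWN,
    REGION_MAIN_TEXT,
    REGION_SHARED_OR_DYNAMIC,
    REGION_KERNEL
};

/* Classify an address by comparing it against the assumed bases. */
enum region classify_addr(uint64_t addr)
{
    /* The comparisons below rely on the bases being in ascending order. */
    assert(ASSUMED_MAIN_TEXT_BASE < ASSUMED_SHLIB_BASE);
    assert(ASSUMED_SHLIB_BASE < ASSUMED_KERNEL_BASE);

    if (addr >= ASSUMED_KERNEL_BASE)
        return REGION_KERNEL;
    if (addr >= ASSUMED_SHLIB_BASE)
        return REGION_SHARED_OR_DYNAMIC;
    if (addr >= ASSUMED_MAIN_TEXT_BASE)
        return REGION_MAIN_TEXT;
    return REGION_UNKNOWN;
}
```

Under these assumptions, pc = 0x1604006ac falls in the dynamically
loaded range and ra = 0x12006cb10 falls in the main text, while param
and curproc are kernel addresses.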

<...>

 > Your explanation is helpful, and perhaps I'll try your suggestion of
 > turning userland machine checks into sigbus or something  - I'm sure I'm
 > just begging for trouble here, but at least this isn't a production
 > machine that other people depend on :).
 > 
 > To send a signal to a process from within the kernel, it seems I just call
 > 
 > psignal(pid, signo)
 > 
 >  - is this right?
 > 

More or less.  I think trapsignal may be more correct.

 > Thanks very much for your information - looks like a little check in
 > machine_check() in interrupt.c will do pretty much what I want - perhaps
 > I'll make sure that my hack only works on processes whose name starts
 > with 'X' or something just to be safe....
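
The kind of guard described above might look something like the
following userland sketch. It only models the decision, not the signal
delivery: the function name, its parameters, and the kernel base
constant are hypothetical, and the real check would live in the kernel's
machine-check path and use the faulting pc and the process's command
name. As far as I recall, psignal() in 4.x kernels takes a struct proc
pointer rather than a pid, so treat the call shape quoted earlier as
approximate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed start of kernel addresses; illustrative only. */
#define ASSUMED_KERNEL_BASE 0xfffffc0000000000ULL

/*
 * Hypothetical policy check: return true if a machine check should be
 * turned into a signal for the process instead of a panic.
 *   pc   - program counter reported with the machine check
 *   comm - command name of the current process (e.g. "XFree86")
 */
bool should_signal_instead_of_panic(uint64_t pc, const char *comm)
{
    assert(comm != NULL);

    /* Never mask a machine check taken while executing kernel code. */
    if (pc >= ASSUMED_KERNEL_BASE)
        return false;

    /* Restrict the hack to processes whose name starts with 'X'. */
    return comm[0] == 'X';
}
```

With the values from the panic output, pc = 0x1604006ac and comm =
"XFree86" would pass the check; the same pc from any other command, or
any kernel pc, would still panic.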

Good luck to you!!

Drew
