Date:      Wed, 23 Oct 2002 12:35:43 -0600 (MDT)
From:      Fred Clift <fclift@verio.net>
To:        Andrew Gallatin <gallatin@cs.duke.edu>
Cc:        <freebsd-alpha@freebsd.org>
Subject:   Re: debugging around machine-checks...
Message-ID:  <20021023113324.U98807-100000@vespa.dmz.orem.verio.net>
In-Reply-To: <15798.56033.844389.549256@grasshopper.cs.duke.edu>

On Wed, 23 Oct 2002, Andrew Gallatin wrote:


>
>  > that FreeBSD is instantaneously interrupted when a machine check happens
>  > and that I don't get crash-dumps.
>
> Hmm.. I haven't used a machine check generating alpha in a while, but
> from the code in interrupt.c, it looks like it *should* give you a
> crashdump.


Perhaps I'm just clueless - I build my kernel with the option

makeoptions     DEBUG=-g


(install, reboot)

By hand I run dumpon -v /dev/da0b (my swap partition, which is twice the
size of my RAM)

and then I do my fiddling with XFree86 that gives me the machine-check and
I end up at the SRM prompt.  At this point, I know that just booting will
fail.  I have to power-cycle the box and when it comes back up, savecore
either doesn't find anything, or isn't being run by the rc scripts.  Once
I get a chance to log in, /var/crash has only minfree in it...


Should I be doing something else?

I just looked in /var/log/messages and saw no evidence of crashdumps being
written (i.e. dumping to.... or dump 254 253 252 251...  etc).



>
> Can't you use the program counter from the panic output as a start?
> If it's in the X server, there should be a PC from userspace.
> (see disclaimer below)
>

So can you interpret this for me then?  Honestly, I just don't know what all
the fields represent -- I should probably just go read the source code and
see :)

Oct  8 06:42:24 liron /kernel: unexpected machine check:
Oct  8 06:42:24 liron /kernel:
Oct  8 06:42:24 liron /kernel: mces    = 0x1
Oct  8 06:42:24 liron /kernel: vector  = 0x660
Oct  8 06:42:24 liron /kernel: param   = 0xfffffc0000006068
Oct  8 06:42:24 liron /kernel: pc      = 0x1604006ac
Oct  8 06:42:24 liron /kernel: ra      = 0x12006cb10
Oct  8 06:42:24 liron /kernel: curproc = 0xfffffe0009910200
Oct  8 06:42:24 liron /kernel: pid = 90765, comm = XFree86
Oct  8 06:42:24 liron /kernel:
Oct  8 06:42:24 liron /kernel: panic: machine check


The program counter is pc?  So I should be able to figure out what code
this is, using gdb and a debug version of XFree86?
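
As a first sanity check on that idea, here is a minimal sketch that sorts
the logged values into userland and kernel addresses.  The cutoff
USER_ADDR_LIMIT is an assumption (a 2^42 boundary standing in for whatever
the kernel's vmparam.h really defines), and is_user_addr() is a made-up
helper for illustration, not something from interrupt.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: user addresses lie below this limit.  The authoritative
 * bound is whatever the kernel's vmparam.h says, not this constant.
 */
#define USER_ADDR_LIMIT	(UINT64_C(1) << 42)

/*
 * Hypothetical helper: returns 1 if addr looks like a userland
 * address, 0 if it looks like a kernel address.
 */
static int
is_user_addr(uint64_t addr)
{
	return (addr < USER_ADDR_LIMIT);
}
```

Under that assumption, pc (0x1604006ac) and ra (0x12006cb10) both land in
userland, while param and curproc are kernel addresses.  Whether the pc is
the instruction that actually caused the fault is a separate question.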


>  >
>
> Look at alpha/alpha/interrupt.c:badaddr_read().
>
> If you're feeling really lucky, you could add code to send the
> appropriate signal (sigbus?) if the PC is in a userland app.
>
> The problem with this is that machine checks are somewhat
> asynchronous, and I'm not sure the PC at the time of the fault
> corresponds to the PC that actually caused the fault.
> (that's why there are so many memory barriers all over the pci probing
> and badaddr code).


Your explanation is helpful, and perhaps I'll try your suggestion of
turning userland machine checks into sigbus or something  - I'm sure I'm
just begging for trouble here, but at least this isn't a production
machine that other people depend on :).

To send a signal to a process from within the kernel, it seems I just call

psignal(pid, signo)

 - is this right?


Thanks very much for your information - looks like a little check in
machine_check() in interrupt.c will do pretty much what I want - perhaps
I'll make sure that my hack only works on processes whose name starts
with 'X' or something just to be safe....
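
Roughly, the check I have in mind looks like the userland mock-up below.
It is only a sketch of the decision, not kernel code: mc_signal_for() and
USER_ADDR_LIMIT are names and values I made up (the 2^42 cutoff is an
assumption), and nothing here touches the real machine_check().

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: user addresses lie below this limit (see vmparam.h). */
#define USER_ADDR_LIMIT	(UINT64_C(1) << 42)

/*
 * Hypothetical predicate: returns SIGBUS if a machine check with this
 * pc, taken while the process named comm was running, should become a
 * signal; returns 0 if the kernel should panic as it does today.
 */
static int
mc_signal_for(uint64_t pc, const char *comm)
{
	if (pc >= USER_ADDR_LIMIT)	/* fault PC is in the kernel */
		return (0);
	if (comm == NULL || comm[0] != 'X')	/* only names starting with 'X' */
		return (0);
	return (SIGBUS);
}
```

In the kernel the non-zero result would be handed to psignal().  As far as
I can tell from sys/signalvar.h, psignal() wants a struct proc pointer
(curproc here) rather than a pid, but I have not tried it yet.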


Fred


--
Fred Clift - fclift@verio.net -- Remember: If brute
force doesn't work, you're just not using enough.

