Date:      Sun, 05 Nov 2017 17:24:14 +0100
From:      Andreas Longwitz <longwitz@incore.de>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: double fault on 10.3-Stable i386 during installworld
Message-ID:  <59FF3B2E.5010603@incore.de>
In-Reply-To: <20171101092619.GJ2566@kib.kiev.ua>
References:  <59D11664.1060206@incore.de> <20171001180943.GO95911@kib.kiev.ua> <59F910C5.8020709@incore.de> <20171101092619.GJ2566@kib.kiev.ua>

Thanks for the answer. I am now sure the reason for the double fault is
not a FreeBSD problem but a CPU problem.

>> On the stack we have
>>
>> 0xe437faa0:    0x00000000  R7:0xc0bc051c     0x00000020     0x00010007
>>
>> so there is an exception on the instruction "movl  PCB_CR3(%edx),%eax"
>> in function cpu_switch(). The next stack entries indicates a lot of page
>> faults, but the "double fault" happens not until the page boundary at
>> 0xe437f000 is reached. I do not really understand this, but it seems to
>> me that the thread
> 
> Can you try to recover the %ecx, %edx values for the faulted frame ?
> Note that %ecx is loaded from the on-stack argument.

From the source file swtch.s:

        /* Save is done.  Now fire up new thread. Leave old vmspace. */
        movl    4(%esp),%edi
        movl    8(%esp),%ecx                    /* New thread */
        movl    12(%esp),%esi                   /* New lock */
#ifdef INVARIANTS
        testl   %ecx,%ecx                       /* no thread? */
        jz      badsw3                          /* no, panic */
#endif
        movl    TD_PCB(%ecx),%edx

        /* switch address space */
        movl    PCB_CR3(%edx),%eax

Inspecting the stack shows that %ecx is loaded with the address of newtd
(0xc8029a20) and %edx is loaded with the address of newpcb (0xf0a3ad40).
So the exception occurred during the execution of a correct machine
instruction. At the moment of the double fault I see the same values in
the saved TSS:

(kgdb) p/x __pcpu[2]->pc_common_tss
$16 = {tss_link = 0x0, tss_esp0 = 0xe437fd30, tss_ss0 = 0x28,
  tss_esp1 = 0x0, tss_ss1 = 0x0, tss_esp2 = 0x0, tss_ss2 = 0x0,
  tss_cr3 = 0x0, tss_eip = 0xc0bacac8, tss_eflags = 0x10007,
  tss_eax = 0xc08f492f, tss_ecx = 0xc8029a20, tss_edx = 0xf0a3ad40,
  tss_ebx = 0xd3cf, tss_esp = 0xe437f000, tss_ebp = 0xe437fafc,
  tss_esi = 0xc0e43400, tss_edi = 0xc7ebd000, tss_es = 0x28,
  tss_cs = 0x20, tss_ss = 0x28, tss_ds = 0x28, tss_fs = 0x8,
  tss_gs = 0x3b, tss_ldt = 0x0, tss_ioopt = 0x680000}

In addition, tss_eax = 0xc08f492f is still the return address, so the
movl for "switch address space", which would have overwritten %eax, was
not executed.
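
The numbers in the TSS dump above can be cross-checked with a little
arithmetic. The following is only a sketch: the constants are copied from
the kgdb output and from the quoted stack address 0xe437faa0, the helper
names are made up for illustration, and the flag names are the
architectural EFLAGS bits from the Intel manuals.

```python
# Values copied from the kgdb output and the quoted stack dump in this mail.
TSS_ESP0 = 0xE437FD30     # tss_esp0: ring-0 stack pointer stored in the TSS
TSS_ESP = 0xE437F000      # tss_esp: stack pointer saved at the double fault
TSS_EFLAGS = 0x10007      # tss_eflags, same value as on the faulting frame
FIRST_FRAME = 0xE437FAA0  # address of the first exception frame on the stack

PAGE_SIZE = 4096          # i386 base page size

# Architectural EFLAGS bits; bit 1 is reserved and always reads as 1.
EFLAGS_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF", 8: "TF",
    9: "IF", 10: "DF", 11: "OF", 16: "RF", 17: "VM",
}


def decode_eflags(value):
    """Return the names of the EFLAGS bits that are set in value."""
    return [name for bit, name in sorted(EFLAGS_BITS.items())
            if value & (1 << bit)]


def is_page_aligned(addr):
    """True if addr sits exactly on a page boundary."""
    return addr % PAGE_SIZE == 0


if __name__ == "__main__":
    print("eflags bits set:      ", decode_eflags(TSS_EFLAGS))
    print("tss_esp page aligned: ", is_page_aligned(TSS_ESP))
    print("esp0 - esp:           ", hex(TSS_ESP0 - TSS_ESP))
    print("first frame - esp:    ", hex(FIRST_FRAME - TSS_ESP))
```

Under these assumptions the saved stack pointer sits exactly on the page
boundary, IF is clear in the saved flags (interrupts disabled), and 0xaa0
bytes of stack lie between the first exception frame and the boundary.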

> Do you have latest CPU microcode loaded ?  Your machine is very old,
> I believe this is P4 class processor, am I right ?

I have to correct one detail: The output

(kgdb) p/x cpu_id
$4 = 0xf29

for the CPUID was correct, but the corresponding output from dmesg was
not from the crashing server, so here is the correct one:

CPU: Intel(R) Xeon(TM) CPU 2.80GHz (2791.05-MHz 686-class CPU)
  Origin="GenuineIntel"  Id=0xf29  Family=0xf  Model=0x2  Stepping=9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4400<CNXT-ID,xTPR>
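
For reference, the Id and Features2 values in the dmesg lines above can
be decoded by hand. This is a minimal sketch, assuming the standard
CPUID leaf 1 layout (stepping in bits 0-3, model in bits 4-7, family in
bits 8-11, extended model in bits 16-19, extended family in bits 20-27);
the function and table names are hypothetical and not taken from the
FreeBSD sources.

```python
def decode_cpuid_signature(eax):
    """Split a CPUID leaf 1 EAX value into family, model and stepping."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model only counts for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


# The two CPUID.1:ECX bits that make up Features2=0x4400.
FEATURES2_BITS = {10: "CNXT-ID", 14: "xTPR"}


def decode_features2(ecx):
    """Return the names of the known Features2 bits set in ecx."""
    return [name for bit, name in sorted(FEATURES2_BITS.items())
            if ecx & (1 << bit)]


if __name__ == "__main__":
    sig = decode_cpuid_signature(0xF29)
    print({key: hex(val) for key, val in sig.items()})
    print(decode_features2(0x4400))
```

With Id=0xf29 this gives family 0xf, model 0x2 and stepping 9, matching
the dmesg line.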

kenv gives:
smbios.bios.reldate="01/25/2005"
smbios.bios.vendor="Intel Corporation"
smbios.bios.version="SWV25.86B.0250.P39.0501252032"
smbios.chassis.maker="Intel Corporation"
smbios.memory.enabled="4194304"
smbios.planar.maker="Intel     "
smbios.planar.product="SE7501WV2S"
smbios.planar.serial="000E0C5C4ADE374"
smbios.planar.version="A99386-112"
smbios.socket.enabled="2"
smbios.socket.populated="2"
smbios.system.maker="MAXDATA"
smbios.system.product="PLATINUM 2210R" (OEM, Intel SR2300)
smbios.system.serial="               "
smbios.system.uuid="d69da6f3-015e-11d9-b9dc-00108365a7e7"
smbios.version="2.3"

From the manual "Intel Xeon Processor" (Document Number 249679-056) I
found that my CPU is a Xeon 2.8B "Prestonia" (CPUID 0F29H, Core Stepping
D1) released 8.11.2002. I have the latest microcode revision m02f292d,
but my BIOS version P39 was not the latest. In the meantime I have
upgraded to BIOS version P43.

> Sure if pcb access faults, the system is in very broken state and
> since an attempt to handle the fault causes a new fault for pcb access,
> it recurses and dies due to the stack overflow.

Agreed.

-- 
Andreas Longwitz



