Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 16 Jul 2013 18:54:57 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        freebsd-acpi@freebsd.org
Subject:   Re: Revisiting FPU context resume on i386
Message-ID:  <20130716164612.P1088@besplex.bde.org>
In-Reply-To: <20130716052641.GE91021@kib.kiev.ua>
References:  <20130716070716.15b7282b9dca2cbc8a767631@tackymt.homeip.net> <20130716052641.GE91021@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, 16 Jul 2013, Konstantin Belousov wrote:

> On Tue, Jul 16, 2013 at 07:07:16AM +0900, Taku YAMAMOTO wrote:
>> sys/i386/i386/swtch.s have a big FIX ME in resumectx()
>> and I have occationally got bitten by it; resulting in SIGFPE disasters.
>>
>> After cursory looking around FPU context, I think it's the simplest way
>> to set CR0_TS on resumectx() and to let npxdna() DTRT lazilly.
>>
>> Attached is the mods that I'm currently using without a problem at the moment.
>> It at least doesn't interact with normal resume operations badly.
>>
>> Ah, of cource, we have a choice to throw i386 away and get on amd64,
>> but at least I won't because I'd miss VOCALOIDs running on wine/i386 :)

amd64 uses npx and CR0_TS too.  It just misspells npx as fpu and even
misspells CR0_TS as TS0_CR in a comment in cpu_switch.S.  The problem
seems to be just that acpi is not fully supported on i386.  The npx
state is simpler on i386 (no AVX support), so this should be easy to
fix by copying the suspend/resume logic from amd64.

>> diff --git a/sys/i386/i386/swtch.s b/sys/i386/i386/swtch.s
>> index 80aa6c4..71efae1 100644
>> --- a/sys/i386/i386/swtch.s
>> +++ b/sys/i386/i386/swtch.s
>> @@ -36,6 +36,7 @@
>>  #include "opt_sched.h"
>>
>>  #include <machine/asmacros.h>
>> +#include <machine/specialreg.h>
>>
>>  #include "assym.s"
>>
>> @@ -487,6 +488,10 @@ ENTRY(resumectx)
>>  	movl	PCB_CR3(%ecx),%eax
>>  	movl	%eax,%cr3
>>  	movl	PCB_CR0(%ecx),%eax
>> +#ifdef DEV_NPX
>> +	/* Let npxdna() restore the FPU context lazily. */
>> +	orl	$CR0_TS,%eax
>> +#endif
>>  	movl	%eax,%cr0
>>  	jmp	1f
>>  1:
>> @@ -519,10 +524,6 @@ ENTRY(resumectx)
>>  	movl	PCB_DR7(%ecx),%eax
>>  	movl	%eax,%dr7
>>
>> -#ifdef DEV_NPX
>> -	/* XXX FIX ME */
>> -#endif
>> -
>>  	/* Restore other registers */
>>  	movl	PCB_EDI(%ecx),%edi
>>  	movl	PCB_ESI(%ecx),%esi
>
> If this works, fine.  But I think that you also should make sure that
> invariants which are checked by npxdna() are met.  E.g. see the
> checks for fpcurthread at the start of the routine.

How can this possibly work?  If the invariants are satisfied, then
suspend would have saved a CR0_TS setting consistent with the invariants,
so that there is nothing to fix.  Actually, it seems to be necessary to
save the state to the pcb and set CR0_TS on suspend, since if the state
is not saved then, then nothing saves it before resume, and suspension
can apparently lose it.  If the state was saved on suspend, then CR0_TS
in the pcb was set on suspend, so the patch has no effect.  If the state
wasn't saved on suspend, then the patch forces restoration of the last
saved state, even when suspension didn't lose it.  Thus in the latter
case, a wrong (out of date) state is always restored, but usually the
state is not very out of date so it sort of works.

So the fix should be simply to save the state on suspend, or fix this if
it is already supposed to be done.

> Might be, just clear the pcpu fpcurthread as part of the resume ? Is it
> done by suspend counterpart ?

The state must be saved together with this to satisfy invariants.

The bug is hard to understand by looking at only savectx().  Strangely,
it is the i386 savectx() that saves the state.  amd64 has a special
routine ctx_fpusave() for saving the state.  This is called from
cpususpend_handler().  Its handling of CR0_TS seems to be worse than
i386's too -- where i386's core routine npxsave() handles CR0_TS and
returns with it set (as necessary for putting the state away for
suspension), amd64's core routine fpusave() requires the caller to
handle CR0_TS, and the ctx_fpusave() caller clears CR0_TS before
the call but doesn't set it on return.  ctx_fpusave() also doesn't
maintained the invariant.  So suspend/resume on amd64 already works
unusually: more details:
- the state saving for suspend/resume is completely independent of
   the pcb and the invariant.  Suspendion just saves the current
   hardware npx^Wfpu state to a special save area.  There would still
   be problems if any part of the state (hardware/pcb/invariant) changes
   before supend+resume completes.  Apparently this doesn't happen.
   i386 would have similar problems with the state saved in the pcb if
   it changed during suspend+resume (now since the save area is not
   special, changes wouldn't invalidate the saved state, but if the
   changes result in the state becoming active again then suspension
   may lose the state).
- since ctx_fpusave() is left clear by ctx_fpusave(), it will be clear
   when resume reloads CR0.
ISTR disagreeing with jkim on using a special save area.

i386 also has cpususpend_handler().  It has a special save area too,
and the only significant difference is that the npx part of the save
is done by savectx() instead of separately by ctx_fpusave().  The bug
is now fairly obvious: savectx() uses npxsave(), and npxsave() breaks
the invariant in various ways:

% void
% npxsave(addr)
% 	union savefpu *addr;
% {
% 
% 	stop_emulating();

Already the invariant is not satisfied, but this is protected by interrupts
being disabled.

% 	fpusave(addr);

If fpusave() is fxsave(), this leaves the state in the hardware.  Otherwise,
fpusave() is fnsave() and it initializes the state in the hardware.  In the
latter case, the invariant is transiently less satisfied than before.

% 
% 	start_emulating();

This restores the part of the invariant that says that the state is not
in the hardware (an npxdna() is needed to reload the state after fnsave().

% 	PCPU_SET(fpcurthread, NULL);
% }

This completes restoring the invariant, but only when 'addr' is in the
PCB.  When 'addr' is in the special save area, the invariant is very
broken.  We only call here if the state active to begin with (otherwise,
it wouldn't be in the hardware so copying it from the hardware to the
pcb would be very wrong).  We record the state change in the pcb, but
only saved the state in the special save area.  fpcurthread in the
special save area is either nonexistent or not initialized yet.  This
is not seriously broken provided we initialized fpcurthread in the
special save area soon, and never touch the npx state again until
resume completes.  We must be careful with the CR0_TS part of the
invariant too.  When npxsave() was called, it should be clear (and
the start_emulating() call should have no effect).  When npxsave()
returns, CR0_TS is set, and it must be saved somewhere soon.  Since
suspension is using a special save area, it presumably saves CR0 there
too.

So the pcb is not really used across suspend/resume.  But the above
write of fpcurthread to it is dangerous.

Now for the inconsistencies in i386 savectx()/resumectx() related to
the above:

% /*
%  * savectx(pcb)
%  * Update pcb, saving current processor state.
%  */

Wrong comment.  This saves in the special save area in most or all cases.
It is important to understand that the normal pcb is not really used here
and most updates to it are accidental.

% ENTRY(savectx)
% 	/* Fetch PCB. */
% 	movl	4(%esp),%ecx

Wrong comment.  As above, plus it spells pcb inconstently (capitalization
is technically more correct, but we often spell it pcb).

amd64 has the same wrong comments.  I think the special save area consists
mainly of a pcb, but it's very confusing when this pcb is not the usual one.

% ...
% 	movl	%cr0,%eax
% 	movl	%eax,PCB_CR0(%ecx)

It saves CR0 quite early.

% #ifdef DEV_NPX
% #ifdef DEV_NPX
% 	/*
% 	 * If fpcurthread == NULL, then the npx h/w state is irrelevant and the
% 	 * state had better already be in the pcb.  This is true for forks
% 	 * but not for dumps (the old book-keeping with FP flags in the pcb
% 	 * always lost for dumps because the dump pcb has 0 flags).
% 	 *
% 	 * If fpcurthread != NULL, then we have to save the npx h/w state to
% 	 * fpcurthread's pcb and copy it to the requested pcb, or save to the
% 	 * requested pcb and reload.  Copying is easier because we would
% 	 * have to handle h/w bugs for reloading.  We used to lose the
% 	 * parent's npx state for forks by forgetting to reload.
% 	 */

It saves npx as the very last step.

The copying stuff avoids some consistency problems, but the comment about
it is very out of date, and the copying is now incomplete.

% 	pushfl
% 	CLI
% 	movl	PCPU(FPCURTHREAD),%eax
% 	testl	%eax,%eax
% 	je	1f

The fpcurthread pointer (now in %eax) is not in the pcb.

% 
% 	pushl	%ecx
% 	movl	TD_PCB(%eax),%eax
% 	movl	PCB_SAVEFPU(%eax),%eax
% 	pushl	%eax
% 	pushl	%eax
% 	call	npxsave
% 	addl	$4,%esp
% 	popl	%eax
% 	popl	%ecx

We first save to the pcb.  Everything in the pcb is consistent now.
fpcurthread is now 0.  This is consistent too.

% 
% 	pushl	$PCB_SAVEFPU_SIZE
% 	leal	PCB_USERFPU(%ecx),%ecx
% 	pushl	%ecx
% 	pushl	%eax
% 	call	bcopy
% 	addl	$12,%esp

The copy of the pcb in the special save area is not quite consistent.
We only copied the very-npx-specific parts.  CR0 wasn't copied, so it
still has the old value for CR0_TS (clear when it should be set).
fpcurthread doesn't live in the pcb, but that doesn't seem to be a
problem, since the special state where the CR0_TS is set means that
fpcurthread must be NULL, so we can recover fpcurthread from CR0_TS
in this case.

% 1:
% 	popfl
% #endif	/* DEV_NPX */

% 
% 	movl	$1,%eax
% 	ret
% END(savectx)
% 
% /*
%  * resumectx(pcb) __fastcall
%  * Resuming processor state from pcb.
%  */
% ENTRY(resumectx)
% 	/* Restore GDT. */
% 	lgdt	PCB_GDT(%ecx)

Wrong comments, as above.

% 	/* Restore CR2, CR4, CR3 and CR0 */
% ...
% 	movl	PCB_CR0(%ecx),%eax
% 	movl	%eax,%cr0
% 	jmp	1f
% 1:

What's this magic null jump?

% ...
% #ifdef DEV_NPX
% 	/* XXX FIX ME */
% #endif

I now think that your patch works fine.  It just sets CR0 to the state
that should have been saved.  The correct fix is even simpler (assuming
that there are no races) -- either move the saving of CR0 after the
saving of NPX, or save CR0 again after changing it in npxsave().

The copying from the pcb to the save area can probably be safely removed.
npxsave() supports storing directly to the save area.  The pcb is not
really used for suspend/resume or other callers of savectx() (if any).
It doesn't help much for the NPX part of it to be consistent with the
save area when other parts are inconsistent and parts like fpucurthread
don't live in either the pcb or the save area.

savectx() seems to only be used by for dumping, stopping CPUs and
suspending CPUs.  Its save area is always special, and always an
ordinary pcb struct (but amd64 seems to have a separate save area
for suspendeded FPU state.  I can't find where the pointer for this
is initialized).  Using the primary pcb is especially unimportant
for dumping.  When I wrote the above long comment about copying the
npx state in FreeBSD-1, savectx() was only used for forking and
dumping.  It had to be careful for forking.  In FreeBSD-2+, it
wasn't used for forking or for anything else except dumping.  Then the
copying in the i386 version of it was excessively careful and the
comment was out of date.  Now savectx() is important again and must
be careful for stopping and suspending CPUs.  It is still broken for
stopping CPUs on amd64, since it doesn't save the fpu state and neither
does cpustop_handler() (unlike cpususpend_handler).

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130716164612.P1088>