Date:      Thu, 23 Nov 2000 11:03:53 -0800
From:      Julian Elischer <julian@elischer.org>
To:        John Baldwin <jhb@FreeBSD.org>
Cc:        arch@FreeBSD.org
Subject:   Re: Thread-specific data and KSEs
Message-ID:  <3A1D6A19.801BBFA5@elischer.org>
References:  <XFMail.001122141324.jhb@FreeBSD.org>

John Baldwin wrote:
> 
> On 22-Nov-00 Terry Lambert wrote:

> >
> > The %gs register already has to be saved for WINE processes,
> > so it's taken (at least when USER_LDT is defined).  So there
> > would not be an additional context switch for it.
> 
> Ok.  Since %fs is only used in the kernel and is saved/restored it might be a
> good thing to use instead.

OK so let's do a little kernel code inspection.....

first let's look at where the regs are saved..

The trapframe looks as follows (from frame.h, slightly cut down; [...] marks
the cuts). System calls are treated as a trap, so this should be a good
starting point.

/*
 * Exception/Trap Stack Frame
 */

struct trapframe {
        int     tf_fs;
        int     tf_es;
        int     tf_ds;
[...]
        int     tf_cs;
        int     tf_eflags;
        /* below only when crossing rings (e.g. user to kernel) */
        int     tf_esp;
        int     tf_ss;
};

/* Superset of trap frame, for traps from virtual-8086 mode */

struct trapframe_vm86 {
        int     tf_fs;
        int     tf_es;
        int     tf_ds;
[...]
        int     tf_cs;
        int     tf_eflags;
        /* below only when crossing rings (e.g. user to kernel) */
        int     tf_esp;
        int     tf_ss;
        /* below only when switching out of VM86 mode */
        int     tf_vm86_es;
        int     tf_vm86_ds;
        int     tf_vm86_fs;
        int     tf_vm86_gs;
};

/* Interrupt stack frame */

struct intrframe {
        int     if_vec;
        int     if_fs;
        int     if_es;
        int     if_ds;
[...]
        /* below portion defined in 386 hardware */
        int     if_eip;
        int     if_cs;
        int     if_eflags;
        /* below only when crossing rings (e.g. user to kernel) */
        int     if_esp;
        int     if_ss;
};


/* frame of clock (same as interrupt frame) */

struct clockframe {
        int     cf_vec;
        int     cf_fs;
        int     cf_es;
        int     cf_ds;
 [...]
        int     cf_cs;
        int     cf_eflags;
        /* below only when crossing rings (e.g. user to kernel) */
        int     cf_esp;
        int     cf_ss;
};



So, as you see, there is space for %fs to be saved, but in general, 
no place for %gs (except in the VM86 case). This kinda suggests that 
%fs is the way to go. (So far it appears that %gs can't be in use 
at the moment.)

In signal.h the osigcontext looks like: (showing only segment regs)
struct  osigcontext {
        int     sc_onstack;             /* sigstack state to restore */ 
        osigset_t sc_mask;              /* signal mask to restore */
[...]
        int     sc_es;
        int     sc_ds;
        int     sc_cs;
        int     sc_ss;
[...]
        int     sc_gs;
        int     sc_fs;
        int     sc_trapno;
        int     sc_err;
};

which has places for both %gs and %fs

Similarly the new sigcontext given to the process is:
/*
 * The sequence of the fields/registers in struct sigcontext should match
 * those in mcontext_t.
 */
struct  sigcontext {
        sigset_t sc_mask;               /* signal mask to restore */
        int     sc_onstack;             /* sigstack state to restore */
        int     sc_gs;                  /* machine state (struct trapframe): */
        int     sc_fs;
        int     sc_es;
        int     sc_ds;
[...]
        int     sc_cs;
        int     sc_efl;
        int     sc_esp;
        int     sc_ss;
[...]
};

Once again both %gs and %fs are supported,
so signals should be able to cope with either.

reg.h shows what /proc supports (both %fs and %gs).

proc.h includes a trapframe (see above) via machine/proc.h,
so the proc structure (and thus the KSEC eventually) holds %fs but not %gs.

In trap.c there is the following code that might have to be understood for 
this to work:

void
trap(frame)
{
[...]
        if ((ISPL(frame.tf_cs) == SEL_UPL) ||
            ((frame.tf_eflags & PSL_VM) && !in_vm86call)) {
                /* user trap */
[...]
        } else {
                /* kernel trap */
[...]
                case T_SEGNPFLT:        /* segment not present fault */
                        if (in_vm86call)
                                break;

                        if (intr_nesting_level != 0)
                                break;

                        /*
                         * Invalid %fs's and %gs's can be created using
                         * procfs or PT_SETREGS or by invalidating the
                         * underlying LDT entry.  This causes a fault
                         * in kernel mode when the kernel attempts to
                         * switch contexts.  Lose the bad context
                         * (XXX) so that we can continue, and generate
                         * a signal.
                         */
                        if (frame.tf_eip == (int)cpu_switch_load_gs) {
                                curpcb->pcb_gs = 0;
                                psignal(p, SIGBUS);
                                goto out;
                        }

I notice that %fs is not touched.. (maybe it's fixed elsewhere)
but this suggests that %gs and %fs are being loaded or the 
fault wouldn't happen.

So where is %gs being loaded from..?

in proc.h
the proc structure includes:
        struct  mdproc p_md;    /* Any machine-dependent fields. */
which from i386/include/proc.h is:
struct mdproc {
        struct trapframe *md_regs;      /* registers on current frame */
};

which, as we saw above, does not include room for %gs.
However, this is misleading, because the structure 'pcb'
in i386/include/pcb.h does include a field for %gs.
A pointer to the current pcb is part of the per-CPU global data in 
globals.h. The pcb itself appears in user.h as part of the user
structure, which is pointed to by p_addr in the proc structure. And 
lo and behold, there it is... a place to store the %gs register
as well.
(Why it's not in the proc structure I don't follow.)
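
To make the two lookup paths concrete, here is a minimal sketch using
cut-down stand-in structures, not the real kernel definitions. The field
names p_md, md_regs, p_addr, tf_fs and pcb_gs are the ones quoted above;
the member name u_pcb and the helpers saved_fs() and saved_gs() are
assumptions made only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for illustration only, NOT the real headers. */
struct trapframe { int tf_fs; int tf_es; int tf_ds; };  /* no %gs slot   */
struct pcb       { int pcb_gs; };                       /* %gs kept here */
struct user      { struct pcb u_pcb; };  /* U-area; u_pcb name is assumed */
struct mdproc    { struct trapframe *md_regs; };
struct proc      { struct user *p_addr; struct mdproc p_md; };

/* Hypothetical helper: saved %fs is reached through the trap frame. */
static int saved_fs(const struct proc *p)
{
        assert(p->p_md.md_regs != NULL);
        return p->p_md.md_regs->tf_fs;
}

/* Hypothetical helper: saved %gs is reached through the U-area's pcb. */
static int saved_gs(const struct proc *p)
{
        assert(p->p_addr != NULL);
        return p->p_addr->u_pcb.pcb_gs;
}
```

The point of the sketch is only that %fs and %gs live in different
places, so anything that duplicates per-KSE state has to copy both the
trap frame and the pcb.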

Swtch.s seems to save it nicely with:
        movl    %gs,PCB_GS(%edx)

and I'm sure that %fs is similarly saved (it's on the stack).
So it looks like you should be able to go ahead and use those registers.
We will need to duplicate the U-area for KSEs anyhow,
so assuming that, both regs would be OK.

Interestingly they use %fs in the kernel, but in fact, since they now
have a per-CPU 4MB range of memory (where each CPU sees different 
physical pages at the same address), it would now be possible for
the kernel to drop this. At the moment they are using
%fs AND mapping, so in fact they are mapping twice.

This brings up a possibility: if they have to fiddle the page 
maps for each KSE anyhow (to put the different PDE in), they could 
just as easily fiddle TWO entries and give us a 4K or 4MB (take 
your pick) KSE-dependent region within the user space.

That would not require ANY registers.
The trick would be to put a different PTE in for each KSE in the top 
page table, just above the original stack. The top page table must already
be there and loaded because the stack is in it. The trouble with this idea
is that it would require having code to keep the rest of the PTEs 
(in the other KSEs' page tables) all in sync. You could make the kernel
take 4MB from each address space, put the stack (etc.) below that,
and make it illegal to map new pages into that region. 
That way the kernel would only have to keep track of the single page 
it allocates into that space per KSE. (The VM would have to be in on 
that act.. yech.) Or alternatively, you could allow the user to access
a page in the existing PER_CPU region. (Yeah, I know it's at the top of 
memory, above where the user process can usually touch, but we could
set a segment up there and allow it to get to it.)
You'd get a 4K window at 0xffaxx000 or somewhere.
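
As a rough illustration of the "one PTE per KSE" idea, the sketch below
assumes the classic i386 two-level layout (10-bit directory index, 10-bit
table index, 12-bit offset) and models a page table as a plain array.
The helper install_kse_page() and the example addresses are hypothetical;
nothing here is taken from the pmap code.

```c
#include <stdint.h>

#define PAGE_SHIFT 12           /* 4K pages                          */
#define PDR_SHIFT  22           /* each PDE covers 4MB               */
#define NPTEPG     1024         /* PTEs per page table               */
#define PG_V       0x001u       /* present                           */
#define PG_RW      0x002u       /* writable                          */
#define PG_U       0x004u       /* user accessible                   */

/* Which page directory entry covers this virtual address. */
static unsigned pde_index(uint32_t va)
{
        return va >> PDR_SHIFT;
}

/* Which entry within that page table maps this virtual address. */
static unsigned pte_index(uint32_t va)
{
        return (va >> PAGE_SHIFT) & (NPTEPG - 1);
}

/*
 * Hypothetical: point the per-KSE window at this KSE's own physical
 * page by rewriting a single entry in a (simulated) page table.
 */
static void install_kse_page(uint32_t pt[NPTEPG], uint32_t va, uint32_t pa)
{
        pt[pte_index(va)] = (pa & ~0xfffu) | PG_V | PG_RW | PG_U;
}
```

With a made-up window address of 0xbfc01000, pde_index() selects a single
directory slot and pte_index() a single table slot, which is why only one
entry per KSE would need to differ in the 4K variant.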


> > I think that if you guys go forward with this, you should do an
> > indirect through whatever you end up using.  I realize this will
> > cost an additional 6 clock cycles, but it will let you expand
> > the list of things indefinitely, going forward, instead of having
> > to keep a register dedicated for backward compatability, and then
> > somehow "grow a new one" when you need to do something similar to
> > this again, in the future.
> 
> It will be an indirect if I have any say in it. :)  Currently we use %fs in the
> kernel to address a segment that contains per-CPU data.  I think that if we use
> a seg reg, then we should have it address a segment that contains per-KSE data.

For now I think that %fs is definitely safe,
and if you use it, it could serve as the entry into the per-KSE
area I just mentioned (with, interestingly, almost the
same contents as the kernel uses).
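
The indirection discussed above could look something like the following
in portable C. This is only a sketch: the structure names, the slot table
and kse_get() are all hypothetical, and an ordinary pointer stands in for
whatever the segment register (or mapped window) would address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical growable table of per-KSE pointers. */
struct kse_data {
        size_t  nslots;
        void  **slots;
};

/*
 * Hypothetical fixed-layout area that the register or window addresses.
 * It holds only a pointer, so the table behind it can grow later
 * without changing what the register points at.
 */
struct kse_area {
        struct kse_data *data;
};

/* One extra load (the indirection), then an index into the table. */
static void *kse_get(const struct kse_area *area, size_t i)
{
        assert(area->data != NULL);
        return i < area->data->nslots ? area->data->slots[i] : NULL;
}
```

The cost is the extra dereference on each access; the gain is that new
per-KSE items only need another slot, not another register.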




> 
> John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
> PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
> "Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-arch" in the body of the message

-- 
      __--_|\  Julian Elischer
     /       \ julian@elischer.org
    (   OZ    ) World tour 2000
---> X_.---._/  presently in:  Budapest
            v






Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3A1D6A19.801BBFA5>