Date:      Tue, 06 Aug 2002 00:39:07 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Luigi Rizzo <rizzo@icir.org>
Cc:        Peter Wemm <peter@wemm.org>, smp@freebsd.org
Subject:   Re: how to create per-cpu variables in SMP kernels ?
Message-ID:  <3D4F7D1B.4D91400A@mindspring.com>
References:  <20020805015340.A17716@iguana.icir.org> <20020805221415.B0C732A7D6@canning.wemm.org> <20020805230556.C26751@iguana.icir.org>

Luigi Rizzo wrote:
> Hi Peter,
> thanks for the explanation.
> I still have a few doubts on this (let's restrict to the -current
> case where the code seems more readable):
> 
> --- MINOR DETAIL ---
> 
>   * I wonder why the macro __PCPU_GET() in sys/i386/include/pcpu.h
>     cannot store directly into __result for operand sizes of 1,2,4
>     instead of going through a temporary variable. I.e. what would
>     be wrong in having

It could, if someone wrote the code so that the compiler could
understand the operand sizes.  8-).  The main problem is operand
sizes that don't fit in a single register (ANSI C permits
structure members).
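For illustration, a sizeof-dispatched fetch could look something
like the sketch below.  Everything here is hypothetical (the struct
layout, the "pcpup" stand-in for the %fs-relative base, the demo
helper); it only shows why register-sized operands can be loaded
directly while oversized members need a fallback copy:

```c
#include <string.h>

/* Hypothetical sketch of a sizeof-dispatched PCPU_GET: operands of
 * 1, 2, or 4 bytes are loaded straight into the result, while larger
 * members (ANSI C allows whole structures in the pcpu) fall back to
 * a byte copy.  The real i386 macro reaches the pcpu area via a %fs
 * segment override; a plain base pointer stands in for that here. */
struct vmspace {
	long	dummy[4];		/* oversized member, illustrative */
};

struct pcpu {
	int		pc_cpuid;
	struct vmspace	pc_vm;
};

static struct pcpu *pcpup;		/* stand-in for the %fs base */

#define	PCPU_GET(member, result) do {					\
	if (sizeof(pcpup->member) == 1 ||				\
	    sizeof(pcpup->member) == 2 ||				\
	    sizeof(pcpup->member) == 4)					\
		(result) = pcpup->member;    /* fits a register */	\
	else								\
		memcpy(&(result), &pcpup->member,			\
		    sizeof(pcpup->member));  /* oversized fallback */	\
} while (0)

/* Small demo helper so the macro can be exercised. */
static int
pcpu_demo_cpuid(struct pcpu *p)
{
	int id;

	pcpup = p;
	PCPU_GET(pc_cpuid, id);
	return (id);
}
```

Note that both branches of the if() must type-check for every
member, which is part of why the real macro is fussier than it
first looks.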


>     Partly following Terry's description, i thought an arrangement
>     like the following could be relatively simple to implement and not
>     require any recourse to assembly code, does not impact the compiler's
>     ability to do optimizations, and does not require an extra
>     segment descriptor to access the struct pcpu.
> 
>     It relies on the following variables, my_pcpu to access the
>     pcpu data of the local processor, all_pcpu to view all pcpu
>     data (including our own, at a different mapping in vm space):
> 
>         struct pcpu *my_pcpu;
> 
>         struct pcpu *all_pcpu[MAXCPU]; /* XXX volatile */

NO.  This cannot work.

The problem is that the per-CPU area is mapped into the same
location on each CPU -- and *totally inaccessible* to other CPUs.

The entire point of having a per-CPU area in the first place is
to avoid indexing by the CPU ID, and to be able to *know* that
the data stored there does not require the protection of a mutex
on read/write operations.  Note also that CPU IDs are not
guaranteed to form a contiguous, densely packed space.

Basically, this means your attempt to dereference all_pcpu
will only work for the local processor data area *on the local
processor*.
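To illustrate the point about non-contiguous CPU IDs: a table like
all_pcpu[] indexed directly by the hardware ID would be sparse, so
kernels that must index state by CPU typically go through a small
translation table first.  A hedged sketch, with made-up names
(MAX_HWID, cpu_register, NOCPU):

```c
/* Hypothetical sketch: APIC-style hardware CPU IDs need not be
 * contiguous, so an array indexed by the raw ID would be sparse.
 * A translation table maps the hardware ID to a dense index that
 * is safe to use for per-CPU bookkeeping arrays. */
#define	MAX_HWID	16
#define	NOCPU		(-1)

static int hwid_to_index[MAX_HWID + 1];
static int ncpus;

static void
cpu_map_init(void)
{
	int i;

	for (i = 0; i <= MAX_HWID; i++)
		hwid_to_index[i] = NOCPU;
	ncpus = 0;
}

/* Register a discovered CPU; returns its dense index. */
static int
cpu_register(int hwid)
{
	hwid_to_index[hwid] = ncpus;
	return (ncpus++);
}
```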

In general, creating anything that needs this information in
the first place is really a *big* mistake.  To ensure a
consistent view of the data being accessed, which could be a
large structure, you would need to introduce locks.  This is
true even if the other CPUs only ever read the data, unless
the data can be modified or read with atomic instructions
(which limits you to 32-bit values on Pentium-class hardware).

The only way this works without locks is if the data is
statistical.
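As a rough illustration of the statistical case, here is a sketch
using C11 atomics as a modern stand-in for the 32-bit atomic loads
and stores mentioned above; the names (pkt_count, NCPU) are
invented.  Each CPU increments only its own slot, and a reader sums
the slots without locks; the total may be momentarily stale, but no
individual read is ever torn:

```c
#include <stdatomic.h>
#include <stdint.h>

#define	NCPU	4

/* Hypothetical per-CPU packet counters: each CPU touches only its
 * own slot, so increments need no lock.  A reader on another CPU
 * sums the slots with relaxed atomic loads; the sum can lag the
 * true value, which is acceptable for statistics, but each 32-bit
 * read is indivisible, so no torn values are ever observed. */
static _Atomic uint32_t pkt_count[NCPU];

static void
pkt_count_inc(int cpu)
{
	atomic_fetch_add_explicit(&pkt_count[cpu], 1,
	    memory_order_relaxed);
}

static uint32_t
pkt_count_total(void)
{
	uint32_t sum = 0;
	int i;

	for (i = 0; i < NCPU; i++)
		sum += atomic_load_explicit(&pkt_count[i],
		    memory_order_relaxed);
	return (sum);
}
```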

In addition, any page shared this way, with no locks, would have
to be mapped non-cacheable, to keep the data out of the L1 and
L2 caches and to avoid the associated invalidates having to be
signalled to all processors on every write.  That means the cost
of each access is multiplied by your CPU-to-memory-bus clock
ratio (e.g. a 1.3GHz CPU with a 433MHz memory bus will take
three CPU clock cycles, minimum, to fetch data from the page).


>     Early in the boot process we allocate MAXCPU physical pages,
>     and MAXCPU+1 entries in the VM space. Individual pcpu structs
>     go at the beginning of each of the physical pages, and the
>     VM -> physical mapping of the first MAXCPU VM entries is the
>     same for all processors. Then all_pcpu[i] can be initialized
>     with a pointer to the beginning of the i-th VM page.
> 
>     The MAXCPU+1-th VM entry maps differently on each CPU,
>     so that it effectively permits access to the per-cpu data.
>     my_pcpu can be initialized with a pointer to the MAXCPU+1-th VM page.

It won't work, for the reasons stated above.

In addition, if you were to map it in alternate page-map entries
per CPU, you would run into TLB shootdown bugs on Intel and AMD
processors (the way you would have to use this would guarantee
that the shootdown was not delivered on most of the SMP
L2/bridge chipsets out there).


>     At this point, curproc and all other per-cpu variables for the
>     local CPU can be accessed through
> 
>         my_pcpu->curproc
> 
>     and similar, whereas we can get to other cpu's data with
> 
>         all_pcpu[i]->curproc

The variables only exist on the CPU in question.  That is why
they are called "per CPU".


>     without the need for using %fs or special assembly language to
>     access these fields.

Peter pointed out that these were "assumed volatile"; that's a
simple way of saying "not permitted to be cached" or "must be
explicitly fetched each time".  This is what I was referring to
originally, when I stated that there would be an additional
dereference that would normally not be there, as it would be
hidden by the cache hardware.

Whether this is done with explicit overrides, or by mapping the
pages as non-cacheable (and non-global), is really irrelevant:
it's six of one type of overhead, and half a dozen of the other.
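A minimal sketch of the "must be explicitly fetched each time"
semantics, with hypothetical names: the volatile qualifier forbids
the compiler from keeping the field in a register, so both reads
below are real memory fetches, and the second can observe another
CPU's intervening store:

```c
/* Hypothetical sketch: declaring the shared mapping volatile tells
 * the compiler that every access must be an explicit fetch, which
 * is the software analogue of mapping the page non-cacheable.
 * Without the qualifier, the compiler could keep "curticks" in a
 * register and never see another CPU's update. */
struct shared_pcpu {
	unsigned	curticks;
};

static unsigned
read_twice(volatile struct shared_pcpu *p, unsigned *first)
{
	*first = p->curticks;	/* explicit fetch #1 */
	/* ...another CPU may store to p->curticks here... */
	return (p->curticks);	/* explicit fetch #2, not a cached copy */
}
```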

>     Then we can discuss how/where to put "volatile" keywords.
>     In principle, all references through all_pcpu[] should be
>     readonly and treated as volatile, with perhaps the exception of
>     some section of code at machine startup. On the contrary we could
>     safely assume that references through my_pcpu are non-volatile
>     as the local processor should be the only one to mess with them
> 
> Anything wrong with this description ?

Er... how can "curproc" be read-only?

You *REALLY* want to avoid per-CPU data, if you can.  The stuff
that's there now is really only there because it's unavoidable:
any time you share a contention domain in memory, you add to the
bus contention, and decrease the value of running a shared memory
multiprocessor in the first place.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message
