Date:      Thu, 03 Jan 2002 12:43:40 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Bernd Walter <ticso@cicely9.cicely.de>
Cc:        Matthew Dillon <dillon@apollo.backplane.com>, Alfred Perlstein <bright@mu.org>, John Baldwin <jhb@FreeBSD.ORG>, arch@FreeBSD.ORG, Bernd Walter <ticso@cicely8.cicely.de>, Mike Smith <msmith@FreeBSD.ORG>, Bruce Evans <bde@zeta.org.au>, Michal Mertl <mime@traveller.cz>, Peter Jeremy <peter.jeremy@alcatel.com.au>
Subject:   Re: When to use atomic_ functions? (was: 64 bit counters)
Message-ID:  <3C34C27C.C7FE6713@mindspring.com>
References:  <XFMail.020102152920.jhb@FreeBSD.org> <200201030002.g0302Eo60575@apollo.backplane.com> <20020102180734.A82406@elvis.mu.org> <200201030012.g030Cgp60752@apollo.backplane.com> <3C33D3E3.38E758CA@mindspring.com> <20020103094239.GH53199@cicely9.cicely.de>

Bernd Walter wrote:
> > In any case, multiple reader, single write doesn't need locks
> > unless it has to be done atomically.
> 
> Single write means always with the same CPU or thread.

Yes, of course.

> If you always write with the same CPU, the other might keep an
> old value in its cache for an unknown long time.
> You simply trust that caches get synchronised automatically by
> regular events like context switching.
> I'm not sure this is valid.

Actually, I'd suggest putting it in a non-cacheable page, if
multiple CPUs need to get at it and it must be 100% accurate
at all times; if it were in a cacheable page, there would be an
Invalidate (MESI, remember?), and the cache line in the other
CPUs that had it cached would be flushed.

Personally, I'd suggest making the algorithms insensitive to
staleness in the data, out to roughly 2 times the update latency.
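As a rough sketch of what "insensitive to the data" means in practice
(all names here are invented for illustration, not FreeBSD interfaces):
a counter written only by its owning CPU or thread, and read without
locks by everyone else, where readers must tolerate a stale value:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU counter: one writer (the owning CPU or
 * thread), many lock-free readers.  Not a real FreeBSD interface. */
struct pcpu_counter {
	volatile uint64_t value;	/* written only by its owner */
};

/* Writer side: only the owner increments, so no lock or atomic
 * read-modify-write is needed. */
static void
pcpu_counter_add(struct pcpu_counter *c, uint64_t n)
{
	c->value += n;
}

/* Reader side: another CPU may see a slightly stale value (and, on
 * a 32-bit machine, a torn 64-bit read -- this thread's subject);
 * the consuming algorithm must tolerate that. */
static uint64_t
pcpu_counter_read(const struct pcpu_counter *c)
{
	return (c->value);
}
```

The scheduler example below consumes exactly this kind of "maybe
stale, good enough" number.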

This is what I suggested to Alfred when I proposed per CPU
run queues (and three years ago, at the threads design meeting
at the Whistle facility FreeBSD user group).

---
  Example application and use of variable accuracy counts in
  a non-locking scheduler with CPU migration, negaffinity,
  and variable scheduling policy
---

Basically, you would not need extremely accurate load numbers
in order to make CPU load balancing decisions, if you did the
load balancing "correctly":

1)	Per CPU run queues are created
2)	Processes do not migrate themselves between run queues
3)	A single "figure of merit" is kept on a per CPU basis,
	where other CPUs can see it; their view of this figure
	may lag behind its true value
4)	An infrequent (relative to the scheduling quantum)
	timer averages "CPUs of interest" (by keeping per CPU
	set statistics, this permits NUMA and cluster based
	migration in the future, if desired), and stores the
	average, also, as a "figure of merit" accessible to
	the other CPUs.  It might also keep a list of M CPUs,
	sorted from lowest to highest value, to make migration
	CPU target selection easier.
5)	When a CPU is ready to pull a task off its run queue,
	perhaps every N iterations of this procedure, it takes
	its "figure of merit" and compares it against the average
	"figure of merit"; if the per CPU number is not some
	configurable "high water mark" over the average, then
	nothing happens: it simply schedules the task
6)	If it decides action needs to be taken, then it polls
	the figures of merit for its interest group.  The CPU
	with the lowest "figure of merit" is selected; this
	may be mediated by negaffinity; for example, if the
	task is a thread in a thread group (process), it might
	have a 32 bit value in the process structure indicating
	which CPUs have threads on them currently.  In that case,
	only a CPU without its bit set in the 32 bit value could
	be selected, based on relative figure of merit.
7)	The "scheduler insertion queue" for the selected target
	CPU (a simple linked list) has the task added to it (the
	task is migrated).  This requires a lock on the queue;
	note that only one other CPU is tied up in this process,
	in an N CPU system, and the statistical likelihood of a
	collision is very low.
8)	Each CPU going into its scheduler looks first at its per
	CPU scheduler insertion queue; this look does not require
	a lock, in most instances, since all it is doing is a
	check to see if a pointer is NULL.  If the pointer is not
	NULL, then (and only then) is a lock on the queue
	asserted by the CPU that "owns" it, and the element is
	dequeued, the lock released, and run.
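As a rough illustration, steps 7 and 8 above might look something
like this in C.  All of the names (`cpu_sched`, `sched_migrate`, and
so on) are made up for the sketch, and a C11 `atomic_flag` spinlock
stands in for whatever lock primitive the kernel would actually use:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct task {
	struct task *next;
};

struct cpu_sched {
	struct task *runq;		/* private run queue: owner only */
	struct task *insertq;		/* scheduler insertion queue */
	atomic_flag  insertq_lock;	/* asserted only on migration */
	volatile int merit;		/* "figure of merit"; may be stale */
};

static void
sched_lock(struct cpu_sched *c)
{
	while (atomic_flag_test_and_set_explicit(&c->insertq_lock,
	    memory_order_acquire))
		;	/* spin: collisions are statistically rare */
}

static void
sched_unlock(struct cpu_sched *c)
{
	atomic_flag_clear_explicit(&c->insertq_lock, memory_order_release);
}

/* Step 7: migrate a task onto the selected target CPU's insertion
 * queue.  This is the only locked path, and it ties up at most one
 * other CPU in an N CPU system. */
static void
sched_migrate(struct cpu_sched *target, struct task *t)
{
	sched_lock(target);
	t->next = target->insertq;
	target->insertq = t;
	sched_unlock(target);
}

/* Step 8: the owning CPU first checks its insertion queue with a
 * plain NULL test; the lock is asserted only when something was
 * actually migrated over. */
static struct task *
sched_next(struct cpu_sched *self)
{
	struct task *t;

	if (self->insertq != NULL) {
		sched_lock(self);
		while ((t = self->insertq) != NULL) {
			self->insertq = t->next;
			t->next = self->runq;
			self->runq = t;
		}
		sched_unlock(self);
	}
	t = self->runq;
	if (t != NULL)
		self->runq = t->next;
	return (t);
}
```

Note that the common-case path through `sched_next()` never touches
the lock at all; only a CPU that was just made a migration target
pays for it.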

This is only an example algorithm; but note that it operates
completely without locks, except in the migration case, and the
migration case is a relatively rare event.  This approach gets
rid of all the Linux and other OS's *mess*, where they try to
ensure CPU affinity for processes by scheduler selection: the
scheduler algorithm need not be nearly as complex, in order to
achieve the same benefits of negaffinity for CPUs for threads in
a single process, etc. (the more you look at it, the more benefits
you will see).

---

The point of this example is that the issue isn't about how much
or how little locks cost, but about designs that don't require
locks in the common case.

So it really doesn't *matter* what the costs are up front, as
long as the costs are incurred so infrequently that their
amortized value is lower than the lowest cost method being hit
much more frequently.

In other words: Matt was right, when he said "Maybe we are looking
at this all wrong ... maybe we don't need locks ..." (paraphrasing).


> > An easy way to do this atomically, even for fairly large items,
> > is to toggle between an active and inactive structure of data,
> > where the pointer to the structure is assigned atomically.
> 
> Only if you use rel/acq behaviour - atomic alone isn't strong enough.

It depends on your cycle time on the objects.  So long as the
update/read cost is very small relative to the update/read
frequency, you're OK.

Basically, you need N copies, where the update/read frequency
divided by the update/read cost is greater than N*2.  For my
example of a zero system call gettimeofday() and time(), using
a page with PG_U set on it to contain a reflection of the
timecounter data, N == 2 (the minimum N permissible).  For
something else, N might be a larger value, and then instead of
two pointers for "current" and "previous", you are talking two
pointers for a two-handed clock through a circularly linked
list of structures containing the reflected data (the hands
are both moved on update-forward, so (1) the data is never
incorrect, and (2) the data is always one-behind during the
update process itself).
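A minimal sketch of the N == 2 case, with made-up names and C11
atomics standing in for the rel/acq semantics Bernd mentions: the
single writer fills the inactive copy, then republishes the pointer
with a release store, so a reader's acquire load always yields a
consistent snapshot, at worst one behind:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical reflection of timecounter data; the field names are
 * invented for the example. */
struct snapshot {
	long tv_sec;
	long tv_usec;
};

static struct snapshot slots[2];		/* N == 2 copies */
static struct snapshot *_Atomic curptr = &slots[0];

/* Single writer: fill the copy readers are NOT looking at, then
 * swing the pointer with release semantics. */
static void
publish(const struct snapshot *s)
{
	struct snapshot *cur, *other;

	cur = atomic_load_explicit(&curptr, memory_order_relaxed);
	other = (cur == &slots[0]) ? &slots[1] : &slots[0];
	*other = *s;
	atomic_store_explicit(&curptr, other, memory_order_release);
}

/* Any reader: one acquire load yields a pointer to a copy the
 * writer will not touch again until the next full cycle. */
static struct snapshot
snapshot_read(void)
{
	return (*atomic_load_explicit(&curptr, memory_order_acquire));
}
```

This only stays correct while the writer cannot lap the readers, which
is the update/read frequency versus cost condition above.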

If N gets larger than 2, in any case, you probably ought to be
reconsidering your choice of algorithms, anyway.

-- Terry
