Date:      Fri, 8 Dec 2000 20:52:20 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        msmith@FreeBSD.ORG (Mike Smith)
Cc:        tlambert@primenet.com (Terry Lambert), smp@FreeBSD.ORG
Subject:   Re: Netgraph and SMP
Message-ID:  <200012082052.NAA22447@usr01.primenet.com>
In-Reply-To: <200012080533.eB85XRN00458@mass.osd.bsdi.com> from "Mike Smith" at Dec 07, 2000 09:33:27 PM

> > Actually, you can just put it in non-cacheable memory, and the
> > penalty will only be paid by the CPU(s) doing the referencing.
> 
> Yes.  And you'll pay the penalty *all* the time.  At least when the 
> ping-pong is going on, there will be times when you'll hit the counter 
> valid in your own cache.  Marking it uncacheable (or even write-back 
> cacheable) is worse.

The absolute worst thing you can do on a multiprocessor system
is contend for shared resources, either stalling another CPU or
causing a cache invalidation.


> > Still, for a very large number of CPUs, this would work fine
> > for all but frequently contended objects.
> 
> Er.  We're talking about an object which is susceptible to being *very* 
> frequently contended.

Right.  Which is why you break the contention domain so that the
object is _not_ contended between CPUs.  That way, only one CPU
will pay the penalty.  In the UP case, you can decide not to mark
the page non-cacheable.
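
To make that concrete, here is a rough userland C sketch of the
idea: a shared counter split into per-CPU slots, each padded out
to its own cache line, so the hot-path increment never touches a
line another CPU has cached.  The line size and CPU count are
just assumptions for the example, not anything from our headers:

#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE      64      /* assumed line size */
#define MAXCPU          8       /* assumed CPU count */

struct percpu_counter {
        uint64_t        val;
        char            pad[CACHE_LINE - sizeof(uint64_t)];
};

static struct percpu_counter counters[MAXCPU];

/* Only the CPU that owns slot 'cpu' ever writes it. */
static void
counter_add(int cpu, uint64_t n)
{
        counters[cpu].val += n;
}

/* The rare reader pays the cost of touching every slot. */
static uint64_t
counter_fetch(void)
{
        uint64_t sum = 0;
        int i;

        for (i = 0; i < MAXCPU; i++)
                sum += counters[i].val;
        return (sum);
}

int
main(void)
{
        counter_add(0, 10);
        counter_add(1, 5);
        printf("%llu\n", (unsigned long long)counter_fetch());
        return (0);
}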


> > I think that it is making more and more sense to lock interrupts
> > to a single CPU.
> 
> No, it's not.  Stop this nonsense.  It's not even practical on some of 
> the platforms we're looking at.

NT does it on every platform on which it runs.  It significantly
beat both Linux and FreeBSD in the Ziff-Davis benchmarks you and
Jordan attended, using this configuration.

For the platforms where it's not possible, I agree: you eat the
synchronization overhead.

BTW: aren't some of these platforms MEI, and not MESI?


> > What happens if you write to a page that's marked non-cachable
> > on the CPU on which you are running, but cacheable on another
> > CPU?  Does it do the right thing, and update the cache on the
> > caching CPU? 
> 
> Er, what are you smoking Terry?  You never 'update' the cache on another 
> processor; the other processor snoops your cache/memory activity and 
> invalidates its own cache based on your broadcasts.

Let me explain the model: you mark the page cacheable on the
processor that will contend for the resource at interrupt time,
and you make it uncacheable on the other processors.  The question
is whether the write-through is immediate or delayed (if delayed,
the main memory value could be stale when examined by another
CPU), and whether a write to main memory by a CPU that does not
have it cached will result in a proper invalidation on the CPU
that does (if so, then the approach will work).

What this gives you is no inter-CPU contention, unless the main
memory location is written by a processor other than the one
handling interrupts for the driver whose lock is being held.  In
that case, the only invalidation is against a single CPU, not
multiple CPUs.

This isn't a strategy limited to locked interrupts.  By
allocating lock regions in cache-line-sized units, you can
practically guarantee that, on a heavily loaded system, the
invalidation triggered by a CPU will, at most, invalidate the
cache line held by _one_ other CPU (the last one to take the
interrupt before this one, assuming the interrupt is moving
around).  On a less heavily loaded system, the cache line for
the lock region may still be valid in multiple CPUs (not having
been recycled), in which case you will take additional
invalidation overhead.  But that is less problematic, since you
can afford the overhead when the system is less heavily loaded.

This is really a "virtually non-cacheable" approach.
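
As a sketch of the "lock regions in cache-line-sized units" part
(userland C again, C11 atomics purely for illustration, 64-byte
line assumed): each lock lives alone in its own line, so taking
it can invalidate at most the one line held by the previous
owner, never a line that also carries unrelated data or another
lock:

#include <stdatomic.h>

#define CACHE_LINE      64      /* assumed line size */

struct padded_lock {
        atomic_flag     lk;
        char            pad[CACHE_LINE - sizeof(atomic_flag)];
} __attribute__((aligned(CACHE_LINE)));

static struct padded_lock driver_lock = { .lk = ATOMIC_FLAG_INIT };

static void
lock_acquire(struct padded_lock *l)
{
        /* Spin; the line ping-pongs only between actual contenders. */
        while (atomic_flag_test_and_set_explicit(&l->lk,
            memory_order_acquire))
                ;
}

static void
lock_release(struct padded_lock *l)
{
        atomic_flag_clear_explicit(&l->lk, memory_order_release);
}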

If you can lock specific interrupts to a particular CPU, as NT
locked one network card per CPU in the Ziff-Davis tests, then
you achieve the same thing: the contended region is never
referenced by the other CPUs, except under exceptional conditions
like driver unload, and as a result it is effectively not cached
on them, even if the pages are marked cacheable.

Does that make more sense?
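
For reference, the per-card affinity idea can be sketched with the
cpuset_setaffinity(2) interface from much later FreeBSD releases
(it did not exist when this was written); here a thread that
services one card is pinned to one CPU, so the card's state is
only ever touched from that CPU's cache:

#include <sys/param.h>
#include <sys/cpuset.h>

#include <err.h>

/* Pin the calling thread to a single CPU. */
static void
pin_to_cpu(int cpu)
{
        cpuset_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
            sizeof(mask), &mask) != 0)
                err(1, "cpuset_setaffinity");
}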


> > If so, locking the interrupt processing for each
> > card to a particular CPU could be very worthwhile, since you
> > would never take the hit, unless you were doing something
> > extraordinary.
> 
> With the way our I/O structure is currently laid out, this blows because 
> you end up serialising everything.

Not everything; just interrupts for a single card: the locking
would occur at interrupt granularity; I'm not talking about
ASMP here.  You also only end up serializing them to a single
CPU; as long as your load is reasonably distributed, that CPU
won't be doing other work at the time.

Also, the locking need not be literal: it could be nothing more
than a "strong affinity", with any migration requiring extra
effort.  This is, I believe, how NT does it.
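
A purely hypothetical sketch of what "strong affinity" could look
like (none of these names are real interfaces): interrupts for a
card keep going to the CPU that last handled them, and migration
only happens when that CPU is demonstrably overloaded:

struct card_softc {
        int     last_cpu;       /* CPU with the warm cache for this card */
};

/* Hypothetical helpers, assumed for the sketch only. */
extern int      cpu_load(int cpu);              /* 0..100 */
extern int      least_loaded_cpu(void);
extern void     route_interrupt(struct card_softc *sc, int cpu);

#define MIGRATE_THRESHOLD       90      /* percent; arbitrary */

static void
dispatch_card_intr(struct card_softc *sc)
{
        int cpu = sc->last_cpu;

        if (cpu_load(cpu) > MIGRATE_THRESHOLD) {
                /* Pay the one-time cache cost of moving the card. */
                cpu = least_loaded_cpu();
                sc->last_cpu = cpu;
        }
        route_interrupt(sc, cpu);
}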

PS: Considering all this, it might make sense to ensure that a
disk interrupt for a DMA completion is handled on the CPU
responsible for the network card that will send the data whose
availability the completion signals.  NetWare does something
similar, in that its threads rely on voluntary, not involuntary,
preemption.  Admittedly, this means true ASMP, but it's hard to
argue with NetWare's file server performance...


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

