Date:      Tue, 31 Oct 2000 10:52:08 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        dillon@earth.backplane.com (Matt Dillon)
Cc:        dg@root.com (David Greenman), bmilekic@dsuper.net (Bosko Milekic), freebsd-net@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG
Subject:   Re: MP: per-CPU mbuf allocation lists
Message-ID:  <200010311052.DAA29909@usr02.primenet.com>
In-Reply-To: <200010310731.e9V7VUJ16831@earth.backplane.com> from "Matt Dillon" at Oct 30, 2000 11:31:30 PM

>     The biggest benefit to per-cpu allocation pools is L1-cache locality,
>     though being able to avoid locking can also be significant if you are
>     doing lots of small allocations. 

I think that the locking avoidance win is being grossly underrated.
Realize that a lock accessed by multiple processors will require
distributed cache coherency, and will either have to live in uncached
memory, or will have to pay an update cycle's worth of overhead.

I would be really interested in cycle counts in and out of the Big
Giant Lock; I think they would be quite telling, even in the
current implementation.

The "four processors, and you hit the point of diminishing returns"
myth is really about architecture.  The limit you are hitting when
you reach "the point of diminishing returns" is where your software
architecture falls over dead.  For Sequent, this was 32, and it's now
up to 64 with the new NUMA-Q system (which, since it runs on
Monterey, can run Linux programs, BTW).

The limit for SVR4, which had a slab allocator and a single contended
lock (being largely derived from the Solaris code of the day), was 4.


>     However, the schemes that I see implemented today by Linux, for example,
>     and Dynix too (as Terry pointed out) are actually considerably more
>     complex then we really need.  If those schemes can reap a 95% cache
>     hit I would argue that implementing a simpler scheme that only reaps
>     85% might be better for us.

The Linux stuff is pretty baroque, from what I've seen of it; they
are pretty much doomed by some early decisions in their VM system,
which they are now stuck with if they want to maintain their level
of platform support without a serious rethink.

>     For example, to reap most of the benefit we could simply implement a
>     5-10 slot 'quick cache' (also known as a working-set cache) in
>     MALLOC()/FREE() and zalloc[i]()/zfree().  It is not necessary to keep
>     big per-cpu pools.  With small per-cpu pools and hystersis we reap most
>     of the benefits but don't have to deal with any of the garbage collection
>     or balancing issues.  After seeing the hell the Linux folks are going 
>     through, I'd much prefer to avoid having to deal with balancing.

The example from the Dynix allocator was "three elements per list, a
running allocation of 4-6 elements max" (this was with a target of 3).

It's really trivial to implement this:

/* something to cast about; allocations can be no smaller than this*/
struct thing {
	struct thing	*next;		/* linked list of 'target'*/
	struct thing	*tail;		/* tail of 'target' we are member of*/
};

/* our system-wide freelist*/
struct global_layer {
	mutex_t		gl_lock;	/* global layer lock*/
	struct thing	*global_list;	/* list of 'target' clusters*/
	struct thing	*bucket_list;	/* list of sub-'target' individuals*/
	int		target;		/* our target size*/
};

/* our per-CPU freelist*/
struct per_cpu {
	struct thing	*main_freelist;	/* unit pool*/
	struct thing	*aux_freelist;	/* cluster of unit pool*/
	int		target;
};
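
For concreteness, here is a hypothetical helper (the name
global_add_cluster is mine, not Dynix's) that takes a NULL-terminated
chain of things and hands it to the global layer as one cluster.  It
assumes the 'globalp' pointer and the LOCK()/UNLOCK() macros used in
the sketches below; the point is just the convention the allocation
path relies on: each cluster's head records its tail, and clusters on
the global list are chained together through tail->next.

/* hypothetical: push a NULL-terminated chain onto the global list as a cluster*/
void
global_add_cluster(struct thing *head)
{
	struct thing *p;

	/* find the tail and record it in the cluster head*/
	for( p = head; p->next; p = p->next)
		;
	head->tail = p;

	/* clusters on the global list are chained through tail->next*/
	LOCK(globalp->gl_lock);
	p->next = globalp->global_list;
	globalp->global_list = head;
	UNLOCK(globalp->gl_lock);
}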


The obvious, of course:

struct myobject *
alloc_from_per_cpu()
{
	struct thing *rv;

	/* assumes the per-CPU lists were primed at init, so this is non-NULL*/
	rv = percpup->main_freelist;
	if( rv->next) {
		percpup->main_freelist = rv->next;
	} else {
		/*
		 * Note: we could just mark this, and deal with it
		 * later (e.g. if we are in an interrupt handler).
		 */
		percpup->main_freelist = percpup->aux_freelist;
		LOCK(globalp->gl_lock);
		if( ( percpup->aux_freelist = globalp->global_list) == NULL) {
			/* we really are out of memory!*/
			...
		} else {
			globalp->global_list = globalp->global_list->tail->next;
			percpup->aux_freelist->tail->next = NULL; /* detach cluster from global chain*/
		}
		UNLOCK(globalp->gl_lock);
	}

	/* maybe do some initialization, if needed; probably in caller...*/
	return( (struct myobject *)rv);
}

Deallocation goes back to the main_freelist.  When there are 'target'
items already on the list, the list on aux_freelist is given back to
the system as a cluster.  Again, this could be delayed until the CPU
is idle, or whatever.  The point is that the tail needs setting during
the clustering operation; right now that's a linear traversal, though
we could cheat and fill toward the tail while pulling off the head,
which would let us maintain the tail pointer at the head.  The
tradeoff is really incremental vs. linear cost, and whether we can
really delay, so it's partly dependent on the size of 'target'.
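
Here is a sketch of that free path, under the same assumptions as the
allocation sketch above (the 'percpup' and 'globalp' pointers and the
LOCK()/UNLOCK() macros); the free_to_per_cpu name is mine, not
Dynix's.  This is the simple linear-traversal version; the
fill-toward-the-tail trick would get rid of both walks:

void
free_to_per_cpu(struct myobject *objp)
{
	struct thing *tp = (struct thing *)objp;
	struct thing *p;
	int n;

	/* count the main freelist; a per-CPU count field would avoid this walk*/
	for( n = 0, p = percpup->main_freelist; p; p = p->next)
		n++;

	if( n < percpup->target) {
		/* common case: push onto the per-CPU list, no lock taken*/
		tp->next = percpup->main_freelist;
		percpup->main_freelist = tp;
		return;
	}

	/*
	 * main_freelist has reached 'target': set its tail (the linear
	 * traversal mentioned above), park it on aux_freelist, and give
	 * the old aux cluster back to the global layer.
	 */
	for( p = percpup->main_freelist; p->next; p = p->next)
		;
	percpup->main_freelist->tail = p;

	if( percpup->aux_freelist) {
		LOCK(globalp->gl_lock);
		percpup->aux_freelist->tail->next = globalp->global_list;
		globalp->global_list = percpup->aux_freelist;
		UNLOCK(globalp->gl_lock);
	}
	percpup->aux_freelist = percpup->main_freelist;

	/* the item being freed starts the new main list*/
	tp->next = NULL;
	percpup->main_freelist = tp;
}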

The coalescing of freed pages is a system task.  Basically, the
system frees to a coalesce-to-page layer in the Dynix model;
there's no reason FreeBSD couldn't go on using the same model it
uses now for doing this, though it will have the same unbounded
worst-case performance FreeBSD has today.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.





