Date:      Sun, 26 Jan 1997 14:05:45 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        proff@suburbia.net
Cc:        hackers@freebsd.org
Subject:   Re: SLAB stuff, and applications to current net code (fwd)
Message-ID:  <199701262105.OAA02273@phaeton.artisoft.com>
In-Reply-To: <19970126042316.10096.qmail@suburbia.net> from "proff@suburbia.net" at Jan 26, 97 03:23:16 pm

> > I'm going to try feverishly to get the SLAB allocator integrated into
> > Linus's sources over the next two days.  For the most part my
> > incentive is so that people think about it when they design memory
> > object allocation subsystems.
> > 
> > For example, even right now, look at the way struct sock's are indeed
> > allocated.  Alan's recent change to add sock_init_data() and the fact
> > that my sources already use SLAB for struct sock sparked this idea.
> > 
> > We could in this case just make sock_init_data() the constructor
> > routine for the sock SLAB.  So for a warm SLAB cache this code never
> > gets run as long as users of sock's are not forgetful and leave a sock
> > in a reasonable state when they free them.  (ie. don't leave crap on
> > the receive queue etc.)
> 
> Can anyone inform me what a SLAB allocator is, and whether FreeBSD
> would benefit from one?

In simple terms, it allocates "slabs" of memory for memory pools of
a particular type, generally in units of pages.  It is basically a
variant of the zone allocator (per MACH).
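
For the flavor of the interface, here is a rough sketch of the
Bonwick-style (Solaris) SLAB API; the signatures are from memory of
the USENIX '94 paper, not anything in the FreeBSD tree, and sock_ctor()
just stands in for DaveM's sock_init_data() idea quoted above:

	/* Rough sketch of the Bonwick (Solaris) interface; signatures
	 * are from memory.  sock_ctor() illustrates the constructor
	 * idea from the quoted message: */
	void sock_ctor(void *buf, size_t size);  /* runs once per object */

	struct kmem_cache *sock_cache;
	struct sock *sk;

	sock_cache = kmem_cache_create("sock_cache", sizeof(struct sock),
	    0, sock_ctor, NULL);	/* align, constructor, destructor */

	sk = kmem_cache_alloc(sock_cache, KM_SLEEP);
	/* ... use it, leaving it in a sane state ... */
	kmem_cache_free(sock_cache, sk); /* stays constructed for reuse */

A freed object goes back to its cache still constructed, which is why
a warm cache never reruns the constructor.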

You can implement memory zoning, per MACH, to tag object persistence
within a SLAB allocator.  Assuming you allocate the kernel space itself
with the SLAB allocator, and the kernel image is assembled on SLAB
boundaries, you can even do things like zone discard, which lets you
throw away initialization code once the system is up, and reclaim
the memory for reuse by the system.  One of the reasons I want ELF
is to allow zone discard.
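
A hypothetical sketch of why ELF matters here: you tag the init-only
code into its own section via the linker, then free those pages after
boot.  The macro, symbol, and function names below are all invented:

	/* Hypothetical: collect init-only code into a named ELF
	 * section so its pages can be reclaimed after boot. */
	#define __initcode __attribute__ ((section (".text.init")))

	static void __initcode
	machdep_startup(void)
	{
		/* one-time initialization work */
	}

	/* after boot, with start/end symbols from the linker script: */
	extern char init_start[], init_end[];

	void
	discard_init_zone(void)
	{
		kmem_free(kernel_map, (vm_offset_t)init_start,
		    round_page(init_end - init_start));
	}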


Technically, FreeBSD already does SLAB allocation, or at least its
interface looks like it does.  The sys/malloc.h values M_MBUF,
M_SOCKET, M_NAMEI, etc., used by the kernel malloc, are all SLAB
identifiers.
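
For example, this is roughly what the current interface looks like in
use; the type tag is effectively the SLAB id, selecting the per-type
bucket and statistics:

	#include <sys/param.h>
	#include <sys/malloc.h>

	char *buf;

	/* M_NAMEI tags this allocation for pathname buffers */
	buf = malloc(MAXPATHLEN, M_NAMEI, M_WAITOK);
	/* ... */
	free(buf, M_NAMEI);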

---

In a kernel multithreading or SMP environment, you don't *want* a SLAB
allocator... at least, not a pure one, as the base level for allocation.
You want a global page-pool allocator instead, and then you will
implement your allocator on top of the page pool.

There are even reasons you might want to have this on a simple UP
system.


Because the kernel is reentrant, you can reenter the allocator on a
SLAB id.  FreeBSD currently copes with this (badly) in its fault/interrupt
reentrancy case... for example, allocating an mbuf at interrupt level
requires that the mbuf SLAB be guarded against reentrancy by running
the allocation to completion at high SPL, so that it is never actually
reentered while the data structure is in an indeterminate state.  This
adds to processing latency in the kernel.
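
The guard looks roughly like this anywhere an mbuf can be taken at or
below interrupt level:

	int s;
	struct mbuf *m;

	s = splimp();	/* block net interrupts; run to completion */
	MGET(m, M_DONTWAIT, MT_DATA);
	splx(s);	/* drop back to the previous priority */
	if (m == NULL)
		return (ENOBUFS);	/* allocation failed */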


In reality, each context wants its own set of SLABs.  This may be
overkill for exception reentrancy, actually.  The way Microsoft does
this in Windows95 and NT is to divide the interrupt service routines
into "upper" and "lower" halves, with the "lower" half running in
interrupt mode and the "upper" half running in non-interrupt mode.  In
Windows95 and NT, you do not do memory allocation at interrupt level;
you must preallocate the memory at driver init, and allocate it again
in "upper" code if the preallocated object is consumed by the "lower"
code.
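
A sketch of that discipline, with all names invented; the "lower"
(interrupt) half never calls the allocator, it only consumes what the
"upper" code staged:

	#define RX_LOWAT 32		/* invented refill threshold */

	struct rxbuf { struct rxbuf *rb_next; char rb_data[2048]; };

	static struct rxbuf *rx_free_list;	/* staged buffers */
	static int rx_free_count;

	struct rxbuf *
	rx_get(void)			/* "lower": interrupt level */
	{
		struct rxbuf *rb;

		if ((rb = rx_free_list) != NULL) {
			rx_free_list = rb->rb_next;
			rx_free_count--;
		}
		return (rb);	/* NULL: drop, upper will refill */
	}

	void
	rx_replenish(void)		/* "upper": non-interrupt */
	{
		struct rxbuf *rb;
		int s;

		while (rx_free_count < RX_LOWAT) {
			rb = malloc(sizeof(*rb), M_DEVBUF, M_WAITOK);
			s = splimp();	/* guard list vs. rx_get() */
			rb->rb_next = rx_free_list;
			rx_free_list = rb;
			rx_free_count++;
			splx(s);
		}
	}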

If you don't want to impose this restriction on the use of allocators
(there are several places where the preallocation overhead would
be exorbitant, like FDDI or ATM with large card buffers), then you
must provide a separate set of SLABs per context.


In terms of context SLABs, you *do* want separate SLABs for each
processor context in SMP.  The reason is that with a global SLAB
pool, as in SVR4 or Solaris, you must line up all your processors
behind the IPI in order to synchronize access to the SLAB control
object for the object type you are allocating.

The correct method, first demonstrated in Sequent Dynix, is to have
per-processor page pools.  Synchronization is not required unless
you need to refill the page pool from (low water mark) or drain the
page pool to (high water mark) the global system page pool.
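
In outline, it could look like this; the names are hypothetical, and
ppool_refill()/ppool_drain() are the only paths that have to take the
global lock (or line up behind an IPI):

	/* Hypothetical per-engine pool, after the Dynix scheme;
	 * PP_HIWAT is an invented tunable, and ppool_refill() is
	 * assumed to leave the free list non-empty: */
	struct ppage { struct ppage *pg_next; };

	struct ppool {
		struct ppage	*pp_free;	/* private free list */
		int		pp_count;
	};

	struct ppage *
	ppool_alloc(struct ppool *pp)
	{
		struct ppage *pg;

		if (pp->pp_count == 0)		/* low water mark hit */
			ppool_refill(pp);	/* lock: fill from global */
		pg = pp->pp_free;
		pp->pp_free = pg->pg_next;
		pp->pp_count--;
		return (pg);
	}

	void
	ppool_free(struct ppool *pp, struct ppage *pg)
	{
		pg->pg_next = pp->pp_free;
		pp->pp_free = pg;
		if (++pp->pp_count > PP_HIWAT)	/* high water mark hit */
			ppool_drain(pp);	/* lock: drain to global */
	}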

This method is documented in:
	"UNIX Internals: The New Frontiers"
	Uresh Vahalia, _Prentice Hall_
	ISBN 0-13-101908-2
	Chapter 12 _Kernel Memory Allocation_
	Section 12.9 _A Hierarchical Allocator for Multiprocessors_

Sequent failed to implement SLAB allocation on top of this page pool
abstraction, and so Vahalia's analysis of it is rather harsh compared
to his analysis of SLAB allocation (covered in Section 12.10).  But it
is incorrect to call SLAB allocation itself superior on that basis.


Vahalia cites a paper:
	"Efficient Kernel Memory Allocation on Shared Memory
	 Multiprocessors"
	McKenney, P.E. and Slingwine, J.
	Proceedings of the Winter 1993 USENIX Conference

which shows the Sequent code to be faster than the McKusick-Karels
algorithm by a factor of three to five on a uniprocessor, and by a
factor of one hundred to one thousand on a 25-processor system.

Clearly, if we considered contexts as owning pools, instead of CPUs,
we should expect a three to nine times improvement for UP BSD from
having separate contexts for interrupt vs. exception vs. normal
allocations (in place of running the allocations to completion at
high SPL).

This might not amount to more than a 10% overall effect on a heavily
interrupting FreeBSD system, but a 10% improvement is an improvement.

There are a number of issues, like object garbage collection, which you
would handle using cache invalidation at IPI synchronization time to
determine whether a low water mark hit is real or not.  For instance, I
may allocate an mbuf on CPU 1 and pass its address to CPU 2 for
use by a user process context entered in the TCP code.  If the CPU 2
process then deallocates the mbuf, the cache line indicating the
allocation will not have been invalidated.  Effectively, this means
that there must be an allocation map with each object, and that it
must be dynamically scoped.  This lets CPU 2 mark the map entry as
invalid, even though CPU 1 did the allocating.  CPU 1 would sync its
picture to the global picture at the low water mark, reclaiming
released buffers at that time.

In reality, it's probably better to have a message interface per SLAB
per CPU to propagate deallocation messages... if you didn't do that,
then a deallocation on CPU 1 or CPU 3 could cause a corrupt cache line
to be written.  Rather than fighting between CPUs and being careful at
the cache line level, the IPI synchronization should allow message
delivery at IPI time; this will generally be very fast, since the
memory can be preallocated with page attributes so that the queue
pointer ownership is toggled at IPI time.  We can go into this in
detail if anyone wants to, but the SMP list is probably a better forum
for these issues.
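
For the shape of it, the per-SLAB, per-CPU inbox could be as simple as
this (hypothetical structures, just to fix the idea):

	/* Hypothetical: each CPU owns one inbox per SLAB; other CPUs
	 * queue their deallocations here rather than touching the
	 * SLAB's own lists, and the owner drains it at IPI sync: */
	struct free_msg {
		struct free_msg	*fm_next;
		void		*fm_obj;   /* the remotely freed object */
	};

	struct slab_percpu {
		struct free_msg	*sp_inbox; /* drained at IPI time */
	};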


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


