Date:      Thu, 07 Jun 2001 23:52:50 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Ian Dowse <iedowse@maths.tcd.ie>
Cc:        Matt Dillon <dillon@earth.backplane.com>, freebsd-hackers@FreeBSD.ORG
Subject:   Re: UFS large directory performance
Message-ID:  <3B207642.A7C7E7C@mindspring.com>
References:  <200106021207.aa78407@salmon.maths.tcd.ie>

Ian Dowse wrote:
> >    The only potential problem I see here is that you could
> >    end up seriously fragmenting the malloc pool you are
> >    using to allocate the slot arrays.  And, of course, the
> >    two issues you brought up in regards to regulating memory
> >    use.
> 
> Thanks for the comments :-) Yes, malloc pool fragmentation is a
> problem. I think that it can be addressed to some extent by using
> a 2-level system (an array of pointers to fixed-size arrays) instead
> of a single large array, but I'm open to any better suggestions.

Use a chain allocator.  I would suggest using the zone
allocator, but it has some fundamental problems that I
don't think are really resolvable without a rewrite.

Unlike the zone allocator, which rounds allocations to 64 byte
units, a chain allocator fits its free chain to a 32 bit boundary
(4 bytes instead of 64), so the average wastage, if there is any
at all, is 2 bytes instead of 32.

In addition, the allocator sizes its backing chunks as the least
common multiple (after 32 bit rounding) of the object size and
the backing object size.  For a 60 byte object, this means
allocating 15 pages and dividing them into 1024 objects (2048 for
the Alpha, with its 8k page size).  But the memory can be backed
by kmem_malloc() or kmem_alloc(), which allocate in page units,
instead of by the power of 2 allocator.  And there are zero bytes
wasted.
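
In a standalone sketch, the sizing arithmetic looks something like
this (the names here are made up, and malloc() stands in for
kmem_malloc() so the fragment compiles on its own):

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE	4096		/* 8192 on the Alpha */

/* Round an object size up to a 32 bit (4 byte) boundary. */
static size_t
round4(size_t size)
{
	return ((size + 3) & ~(size_t)3);
}

static size_t
gcd(size_t a, size_t b)
{
	size_t t;

	while (b != 0) {
		t = a % b;
		a = b;
		b = t;
	}
	return (a);
}

/*
 * Allocate a backing chunk whose size is the least common multiple
 * of the rounded object size and the page size, then thread every
 * object in it onto a singly linked free chain (the chain link is
 * kept in the free object itself).  Every byte of the chunk ends
 * up inside an object, so nothing is wasted.
 */
static void *
chain_create(size_t objsize, size_t *nobjp)
{
	size_t rsize, chunk, nobj, i;
	char *base;
	void *head, **obj;

	rsize = round4(objsize);
	chunk = rsize / gcd(rsize, PAGE_SIZE) * PAGE_SIZE;
	nobj = chunk / rsize;
	base = malloc(chunk);		/* kmem_malloc() in the kernel */
	if (base == NULL)
		return (NULL);
	head = NULL;
	for (i = 0; i < nobj; i++) {
		obj = (void **)(base + i * rsize);
		*obj = head;		/* link onto the free chain */
		head = obj;
	}
	*nobjp = nobj;
	return (head);
}

int
main(void)
{
	size_t nobj;

	if (chain_create(60, &nobj) == NULL)
		return (1);
	/* A 60 byte object: 15 pages carved into 1024 objects. */
	printf("%zu objects per chunk\n", nobj);
	return (0);
}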

The memory also need not be permanently type stable, like memory
in a zone, though you would need to build a reclaimer that ran in
the background.  You could make it deterministic by sorting the
freelist in chunks of the allocation unit at a time (see Knuth; a
quicksort is really not appropriate).  You could use a second
level reclaim to return memory to the system; this would
effectively make it a hybrid between the Dynix zone allocator (on
which it is modelled) and a slab allocator.

Finally, you could maintain per-cpu chains to allow the
allocations and frees to occur on a per CPU basis; that
would mean that there was no IPI or bus contention for
allocations or deallocations (whether you want to defer
migration of objects to the chains of their home CPU, or
let the free to the system at a low watermark take care
of that for you, giving preference to frees of non-local
origin elements, is really a policy decision).
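
A rough illustration of the per-CPU front end (again, the structure
names and the watermark are invented for the example, and the
global-chain refill and its lock are elided):

#include <stddef.h>
#include <stdio.h>

#define MAXCPU		2
#define CHAIN_HIWAT	64	/* above this, push a batch back */

struct cpuchain {
	void	*cc_head;	/* private free chain, no lock needed */
	int	 cc_nfree;
};

static struct cpuchain cpuchain[MAXCPU];

/*
 * Allocate from the current CPU's private chain; only when it is
 * empty does the (locked) global chain get touched for a refill,
 * which is not shown here.  In the common case neither allocation
 * nor free touches shared state, so there are no IPIs and no bus
 * contention.
 */
static void *
chain_alloc(int cpu)
{
	struct cpuchain *cc = &cpuchain[cpu];
	void *obj = cc->cc_head;

	if (obj == NULL)
		return (NULL);		/* refill from the global chain */
	cc->cc_head = *(void **)obj;
	cc->cc_nfree--;
	return (obj);
}

static void
chain_free(int cpu, void *obj)
{
	struct cpuchain *cc = &cpuchain[cpu];

	*(void **)obj = cc->cc_head;
	cc->cc_head = obj;
	if (++cc->cc_nfree > CHAIN_HIWAT) {
		/*
		 * Watermark policy: hand a batch back to the global
		 * chain, preferring objects that were freed here but
		 * "belong" to some other CPU's chain.
		 */
	}
}

int
main(void)
{
	void *slots[4];
	int i;

	/* Seed CPU 0's chain with a few fake objects, then drain it. */
	for (i = 0; i < 4; i++)
		chain_free(0, &slots[i]);
	while (chain_alloc(0) != NULL)
		;
	printf("per-CPU chain drained\n");
	return (0);
}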

I've actually written one of these, without the Dynix
style reclaimer (i.e. the memory becomes type stable only
after allocation, which makes it an improvement on the
zone allocator, at the cost of not being usable at
interrupt time for allocations), as part of some performance
improvement work in a modified FreeBSD kernel.


> If the second-level array size was fixed at around 4k, that would
> keep the variable-length first-level array small enough not to
> cause too many fragmentation issues. The per-DIRBLKSIZ free space
> summary array is probably relatively okay as it is now.

If you are going to fix it, and your object happens to be
an even power of 2 in size, at the very least, make the
backing object be the page size, so it will still work well
on the Alpha, without page fragmentation.

I rather expect that the IA64 will end up forcing our
Intel page size to be increased.

> The other main issue, that of discarding old hashes when the memory
> limit is reached, may be quite tricky to resolve. Any approach
> based on doing this from ufsdirhash_build() is likely to become a
> locking nightmare. My original idea was to have ufsdirhash_build()
> walk a list of other inodes with hashes attached and free them if
> possible, but that would involve locking the other inode, so things
> could get messy.


Add an LRU field, and lock the hashes at a "32 objects at a
time" or similarly coarse granularity.  This is more natural with
a chain block, of course, but works well in any case.
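
In sketch form, it might look like this (illustrative only; the
names, the 32-entry granularity and the pthread lock are stand-ins
for whatever the kernel would really use):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define DH_PER_LOCK	32	/* lock/reclaim granularity: 32 hashes */

struct dirhash {
	struct dirhash	*dh_next;	/* LRU list linkage in a block */
	unsigned long	 dh_lastused;	/* bumped on every lookup */
	/* ... hash array pointers, free-space summary, etc. ... */
};

struct dirhash_block {
	pthread_mutex_t	 db_lock;	/* one lock per block of hashes */
	struct dirhash	*db_lru;
};

/*
 * Reclaim a whole block at a time: take the one coarse lock, walk
 * that block's list, and release anything not used since the
 * cutoff.  ufsdirhash_build() never has to lock other inodes; it
 * just asks a background reclaimer to sweep blocks until enough
 * memory has come back.
 */
static void
dirhash_reclaim_block(struct dirhash_block *db, unsigned long cutoff)
{
	struct dirhash **dhp, *dh;

	pthread_mutex_lock(&db->db_lock);
	dhp = &db->db_lru;
	while ((dh = *dhp) != NULL) {
		if (dh->dh_lastused < cutoff) {
			*dhp = dh->dh_next;	/* unlink and release */
			free(dh);
		} else
			dhp = &dh->dh_next;
	}
	pthread_mutex_unlock(&db->db_lock);
}

int
main(void)
{
	struct dirhash_block db;
	struct dirhash *dh;

	pthread_mutex_init(&db.db_lock, NULL);
	db.db_lru = NULL;
	dh = calloc(1, sizeof(*dh));
	if (dh == NULL)
		return (1);
	dh->dh_lastused = 1;
	db.db_lru = dh;
	dirhash_reclaim_block(&db, 100);	/* stale entry gets freed */
	printf("block is %s\n", db.db_lru == NULL ? "empty" : "non-empty");
	return (0);
}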

I have been considering moving the per process open file table
into a direct/indirect object list, similar to the SVR4 method,
rather than the size doubling, for similar performance reasons.
That would mean locking a block of files at a time, and
maintaining per block freelists (instead of a hints pointer --
what the heck else are you keeping in those idle structs anyway,
such that you _shouldn't_ be concatenating them into a list?!?).
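
A toy sketch of the direct/indirect layout (all the names here are
invented; the real descriptor code is messier):

#include <stdio.h>
#include <string.h>

#define FD_NDIRECT	64	/* small fds need no indirection */
#define FD_PER_BLOCK	64	/* fds per indirect block */

struct file;			/* opaque for the sketch */

struct fdblock {
	struct file	*fb_files[FD_PER_BLOCK];
	int		 fb_nfree;	/* per-block free slot count */
};

struct fdtable {
	struct file	*fd_direct[FD_NDIRECT];
	struct fdblock	**fd_indirect;	/* grown one block at a time */
	int		  fd_nblocks;
};

/*
 * Descriptor lookup: small fds hit the direct array, larger ones
 * go through one level of indirection.  Growing the table means
 * adding a 64-entry block rather than copying and doubling the
 * whole array, and locks and free-slot lists can be kept per block.
 */
static struct file *
fd_lookup(struct fdtable *fdt, int fd)
{
	struct fdblock *fb;

	if (fd < 0)
		return (NULL);
	if (fd < FD_NDIRECT)
		return (fdt->fd_direct[fd]);
	fd -= FD_NDIRECT;
	if (fd / FD_PER_BLOCK >= fdt->fd_nblocks)
		return (NULL);
	fb = fdt->fd_indirect[fd / FD_PER_BLOCK];
	return (fb->fb_files[fd % FD_PER_BLOCK]);
}

int
main(void)
{
	struct fdtable fdt;

	memset(&fdt, 0, sizeof(fdt));
	printf("fd 200 -> %p\n", (void *)fd_lookup(&fdt, 200));
	return (0);
}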

If you poke around in that area long enough, you'll see
that there are some "magic" numbers like "NFILES", which
started out as an optimization, but which all the recent
additions for things like kqueue event chains have turned
into hopeless pessimizations.  8-(.

-- Terry


