Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 14 Jun 2001 01:03:13 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Bosko Milekic <bmilekic@technokratis.com>
Cc:        freebsd-current@FreeBSD.ORG, freebsd-alpha@FreeBSD.ORG
Subject:   Re: New SMP Mbuf Allocator (PATCH and TESTING request)
Message-ID:  <3B286FC1.5C112A5C@mindspring.com>
References:  <20010613010744.A4083@technokratis.com>

next in thread | previous in thread | raw e-mail | index | archive | help
Bosko Milekic wrote:
>         I plan to commit the new bits within the next week.
> However, as is usually the case with commits of this magnitude,
> I'd like a few more tests to be run by a few more people.
> I've been testing the allocator myself (in several different
> ways, mainly resource exhaustion simulations) for the past
> couple of weeks and have been able to catch and fix a couple
> of bugs.
>         Also, jake, jlemon, and Silby (Mike Silbersack) have
> provided me with some reviews, all of which have been integrated
> into the latest version of the patch (below).


A general comment, and then a comment on the patch:

I would like to see this code be optional, until such time
as benchmarks have proven it to be better than the existing
code, or have proben it to be worse.  Diking out the existing
code, and forcing it to be diked out by making unnecessary
function name changes seems to me to be a problem.

Overall, I like the idea of moving towards a Dynix allocator
type model to support many processors (not just 4) well, but
I'm leery of many of the changes, particularly in light of
the claim that "It has been established that we do not benefit
from different locks for different objects, so we use the same
lock, regardless of object type".

On to the specific comments... I have addressed them in patch
order, so that they can be read side-by-side:

--

The "npg" calculation has always been incorrect for large
memory systems with non-sparse physical maps, and now it
is much worse.  I commented on this before in -arch.

Consider the case of a machine with 4G of physical memory.

--

Is the policy two line or 4 line license, these days?

--

The MBALLOC_CPU stuff should be more dynamic; it appears
that the real number of CPUs is known at the time the
allocations take place.

--

I think you should spin, and only block if you don't get
the spin lock after several attempts.  The Solaris tradeoff
in this regard is 10 attempts.  Direct blocking on the cv
seems a little wasteful.

--

I'm not sure why you mutex the per-CPU containers?  Aren't
they, by definition, per-CPU, and therefore immune from
contention?

--

You realize that there is a dependency on MAXFILES, right?

Specifically, there is an mbuf allocated per open socket
for the tcptmpl (even though it only needs 60 bytes).

---

If you fix the tcptmpl allocation, a couple of additional
issues arise:

1)	Since the allocations are in page-units, you will
	end up with a lot of waste.

2)	Allocation in page units will make it hard to give
	60 byte allocations back to the system.

3)	The allocator is not really general enough for this
	(or for things like TIME_WAIT zombie structs, etc.).

4)	Yeah, this sort of also implies that mbuf cluster
	headers come from a seperate pool, instead of being
	256 bytes as well...

--

Why do you lock around things in mb_init()?  It's never
called twice, let alone reentrantly... the code will be
done before an AP is allowed to run.  Alpha has this same
restriction, from what I can see from the code.

--

Dropping the lock in mb_pop_cont() seems wrong; I guess
you are avoiding sleeping with the lock held?

I don't think you need to worry (per-CPU, again), unless
you later attempt lazy repopulation; it seems to me that
the lock that should be held is a global lock, to prevent
reentrancy on the global memory pool, and the malloc()
code does that.

Perhaps I'm missing something here.

--

The ability to specify a "how" that implies an unwillingness
to block implies interrupt allocation; yet you can't support
this, since you do not preallocate a kernel map (ala the zone
allocator).  The zone allocator is actually very misleading,
in that some of it's code never has the values it implies that
it might have; in particular, there are functions which are
passed parameters which can't really be variant, as implied by
the use of per zone variables as arguments.

--

Ah.  I see why all the locks: mb_alloc_wait attempts to be
a primitive reclaimer.  You would do much better with a per-CPU
high-watermark reclaim at free(), I think, instead of doing
this.  Really, you are in a starvation condition because of
your load, and not because someone is "hogging already free
mbufs", I think.  This is probably premature optimization.

--

In the starvation case during a free, I don't think you
really care, except perhaps to adjust the high/low watermarks.

An interesting issue here is that the average TCP window size
will end up being 16k, which comes out to 4 1-page buckets (2,
on Alpha), where you end up with multiple buckets and many
mbufs (64) per, so freeing isn't really an option, if you are
sending lots of data (serving content).  It would be different
for a desktop system, since the receive interrupt might come
to any idle CPU.

--

I like the cleanup via the use of local macros.

--

I think the m_getclr -> m_get_clrd change is gratuitous.

--

I like the "only" comment claifications.

--

I dislike the bloating of one line comments into three lines,
with the first and last blank.

--

You could resolve the statistics problems by keeping them on
a per-CPU basis, and then agregating them under a general
system lock, when necessary (when they were requested).

--

Most of the header definition changes for function declarations
are gratuitous.

--

Don't really care about netstat stuff...

--

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3B286FC1.5C112A5C>