From owner-freebsd-arch Thu Jan 23 19:13:36 2003
Date: Thu, 23 Jan 2003 19:11:50 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: Bosko Milekic
Cc: Doug Rabson, John Baldwin, arch@FreeBSD.org, Andrew Gallatin
Subject: Re: M_ flags summary.
Message-ID: <3E30AEF6.FD18CF37@mindspring.com>

Bosko Milekic wrote:
> On Thu, Jan 23, 2003 at 06:07:33PM -0800, Terry Lambert wrote:
> > This is preferable for *most* cases.  For cases where a failure of
> > an operation to complete immediately results in the operation being
> > queued, which requires an allocation, then you are doing a
> > preallocation for the failure code path.  Doing a preallocation
> > that way is incredibly expensive.  If, on the other hand, you are
> > doing the allocation on the assumption of success, then it's
> > "free".  The real question is whether the allocation is in the
> > common or the uncommon code path.
>
> In that case you shouldn't be holding the lock protecting the queue
> before actually detecting the failure.  Once you detect the failure,
> then you allocate your resource, _then_ you grab the queue lock,
> _then_ you queue the operation.  This works unless you left out some
> of the detail from your example.  The point is that I'm sure a
> reasonable solution exists for each scenario, unless the design is
> wrong to begin with... but I'm willing to accept that my intuition
> has misled me.

Sorry, I thought the problem was obvious: by doing this, you invert
the lock order that you would normally use for a free, if you are
locking one or more other things at the time.

The most common case for this inversion would be "fork", or any other
place that has to punch the scheduler.  But it's also in pretty much
every place you would do a kevent, as well as in the network stacks,
where copies or pullups happen.

Basically, you can't delay taking the lock, but you can accelerate the
allocation (i.e. doing it early means an extra free in the failure
case -- if it means one in the success case, you are pessimizing the
heck out of something you shouldn't be).  Otherwise, you would have to
tolerate "LOCK MALLOC; LOCK XXX" on allocations and "LOCK XXX; LOCK
MALLOC" on frees.
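To make the two nestings concrete, here is a rough userland sketch,
with pthread mutexes and libc malloc() standing in for the kernel
primitives; the queue, the "op" structure, and both function names are
invented for illustration, not taken from any real code.  The first
version allocates before taking the queue lock, so the allocator's
internal locking never nests inside the caller's lock; the second
allocates while holding the queue lock, which is the "LOCK XXX; LOCK
MALLOC" nesting:

#include <pthread.h>
#include <stdlib.h>

struct op {
	struct op *next;
	int	    what;
};

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static struct op *queue_head;

/*
 * Allocate before taking the queue lock: the allocator runs with no
 * caller lock held, and a failed allocation needs no unwinding under
 * the lock.  The cost is a wasted malloc/free whenever the operation
 * turns out not to need queueing.
 */
static int
enqueue_preallocated(int what)
{
	struct op *o;

	o = malloc(sizeof(*o));
	if (o == NULL)
		return (-1);
	o->what = what;

	pthread_mutex_lock(&queue_lock);
	o->next = queue_head;
	queue_head = o;
	pthread_mutex_unlock(&queue_lock);
	return (0);
}

/*
 * Allocate under the queue lock: the allocator's internal locking now
 * nests inside queue_lock, the reverse of the alloc-first ordering,
 * and a failed allocation has to back out while the lock is held.
 */
static int
enqueue_inverted(int what)
{
	struct op *o;

	pthread_mutex_lock(&queue_lock);
	o = malloc(sizeof(*o));
	if (o == NULL) {
		pthread_mutex_unlock(&queue_lock);
		return (-1);
	}
	o->what = what;
	o->next = queue_head;
	queue_head = o;
	pthread_mutex_unlock(&queue_lock);
	return (0);
}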
> > The easy way to mitigate the issue here is to maintain an object
> > free list, and use that, instead of the allocator.  Of course, if
> > you do that, you can often avoid holding a mutex altogether.  And
> > if the code tolerates a failure to allocate reasonably well, you
> > can signal a "need to refill free list", and not hold a mutex over
> > an allocation at all.
>
> Although clever, this is somewhat bogus behavior w.r.t. the
> allocator.  Remember that the allocator already keeps a cache, but if
> you instead start maintaining your own (lock-free) cache, yes, maybe
> you're improving local performance but, overall, you're doing what
> the allocator should be doing anyway and, in some cases, this hampers
> the allocator's ability to manage the resources it is responsible
> for.  But I'm sure you know this because, yes, you are technically
> correct.

IMO, the allocator should really do this on your behalf, under the
covers.

One of the things that UnixWare (SVR4.2) did internally was to
preallocate a pool of buffers for use by network drivers, to avoid
having to do allocations at interrupt time, and then refill the pool,
as necessary, in the top end of the drivers.  So while clever, I can't
claim that it's original.  8-).
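In outline, that sort of preallocated pool looks something like the
sketch below -- again plain C with pthreads rather than real driver
code, and pool_get(), pool_refill(), and the sizes are all invented
for illustration.  The consumer side never calls the allocator and
never blocks; it only flags that the pool is low, and the refill runs
later in a context where waiting for memory is acceptable:

#include <pthread.h>
#include <stdlib.h>

#define POOL_TARGET	64	/* buffers we try to keep cached */

struct buf {
	struct buf *next;
	char	    data[2048];
};

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static struct buf *pool_head;
static int	pool_count;
static int	pool_needs_refill;

/*
 * Consumer side ("interrupt time" in the driver analogy): take a
 * buffer from the cache, never call the allocator, never block.  If
 * the pool runs low, just note that a refill is wanted.
 */
static struct buf *
pool_get(void)
{
	struct buf *b;

	pthread_mutex_lock(&pool_lock);
	b = pool_head;
	if (b != NULL) {
		pool_head = b->next;
		pool_count--;
	}
	if (pool_count < POOL_TARGET / 2)
		pool_needs_refill = 1;
	pthread_mutex_unlock(&pool_lock);
	return (b);		/* NULL means the caller must cope */
}

/*
 * Producer side ("top end" of the driver): top the pool back up,
 * allocating with no pool lock held.
 */
static void
pool_refill(void)
{
	struct buf *b;

	for (;;) {
		pthread_mutex_lock(&pool_lock);
		if (pool_count >= POOL_TARGET) {
			pool_needs_refill = 0;
			pthread_mutex_unlock(&pool_lock);
			return;
		}
		pthread_mutex_unlock(&pool_lock);

		b = malloc(sizeof(*b));
		if (b == NULL)
			return;		/* try again later */

		pthread_mutex_lock(&pool_lock);
		b->next = pool_head;
		pool_head = b;
		pool_count++;
		pthread_mutex_unlock(&pool_lock);
	}
}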
> In any case, it's good that we're discussing general solution
> possibilities for these sorts of problems, but I think that we agree
> that they are rather special exception situations that, given good
> thought and MP-oriented design, can be avoided.  And that's what I
> think the allocator API should encourage: good design.  By specifying
> the wait case as the default behavior, the allocator API is
> effectively encouraging all non-ISR code to be prepared to wait, for
> whatever amount of time (the actual amount of time is irrelevant in
> making my point).

This contradicts both Jeffrey's and Warner's points, though, and I
think their points were valid.

The problem that Jeffrey wants addressed is the problem of magic
numbers; I almost made this point myself, when we were talking about
prototypes in scope for the math library functions, which were defines
instead of prototypes in the x86 case, in order to use inline
functions.  The issue there was that the place the defines and/or
prototypes belonged was actually in the machine-dependent files.  This
is because the functions took manifest constants as parameters, which
may very well be enums or something else -- and the values of the bits
could be different from platform to platform.  Magic numbers really
suck, even if they are "0".

The problem Warner wants addressed is that in order to provide certain
classes of scheduling service to applications, and in particular to
provide POSIX-conformant scheduling for parts of the POSIX
specifications, you have to be able to do deadlining.  What this boils
down to is that you have to be able to guarantee that particular
operations will either succeed or fail in a bounded amount of time
(e.g. 2ms, or whatever the bound happens to be).  For that to work,
you prefer that something fail rather than sleep.

Short of adding a parameter that gets passed down through all the
functions in the chain to the target function which might sleep,
telling it whether or not it's OK to do so, you really can't make the
type of guarantees necessary to conform to the standards.  I
understand that malloc() has a parameter for this now -- or it
defaults to that parameter being there -- but basically this means
that what *should* be the common case, bounded completion time
regardless of success or failure, ends up being unbounded by default.

So to fix this problem, a programmer would have to go out of their way
to add additional "nowait" parameters to all functions up the call
graph, until they got to the one they cared about.  This, to me, means
that there's a lot of unnecessary slogging that's going to have to
happen to get to the point where this is all heading anyway,
eventually, and along with that, a lot of additional future
opportunities for error.

--

At this point, I would almost suggest eliminating both M_NOWAIT and
M_WAIT (and M_TRYWAIT and M_WAITOK, or whatever the heck it is this
week), and splitting the functionality, so that there are two
different function calls for malloc.

This is similar to the suggestion on the table here -- however, I
would *NOT* pass a mutex that could be released and reacquired over
the wait to the allocation function; I would, instead, have an
allocation function which blocks indefinitely until it gets memory, or
the heat death of the universe, whichever comes first (the whole
"TRYWAIT" thing was an incredible mistake, IMO).

At least this way, you can look at it as clearly laying out a way of
getting rid of blocking allocation requests *some time in the future*,
and then call the entry point for blocking allocations "deprecated"
from the start, so that people will at least try to avoid using it in
new code.

I guess I should say that, on general principles, barriers should be
front-loaded, if possible.  What I mean by this is that if you make it
hard to do the initial work on a massive change, and easier to do the
later work, you are *MUCH* better off than if you make the work easy
up front, and then have to write "And Then A Miracle Happens..." in
your project plan, right at the end.  8-).

Put another way: people will only do work they are passionate about,
and passion wanes, so you have to put the hard stuff up front, or lots
of things will be started, but nothing will ever be completed.

-- Terry
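A minimal sketch of what the split entry points described above might
look like, under stated assumptions: kmem_alloc_bounded() and
kmem_alloc_blocking() are hypothetical names, not existing FreeBSD
interfaces, and libc malloc() plus a retry loop stands in for a real
kernel implementation:

#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical split of the allocator into two entry points, instead
 * of an M_NOWAIT/M_WAITOK flag argument.  Illustration only.
 */

/* Bounded-time entry point: returns NULL rather than sleeping. */
void *
kmem_alloc_bounded(size_t size)
{

	return (malloc(size));
}

/*
 * Blocking entry point, "deprecated" from day one: does not return
 * until memory shows up, or the heat death of the universe, whichever
 * comes first.
 */
void *
kmem_alloc_blocking(size_t size)
{
	void *p;

	while ((p = malloc(size)) == NULL)
		sleep(1);	/* a kernel version would sleep on the
				 * VM system instead of polling */
	return (p);
}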