From owner-freebsd-arch  Wed Jul 17  0:56: 1 2002
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.FreeBSD.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 5F18937B405
	for <arch@FreeBSD.ORG>; Wed, 17 Jul 2002 00:55:56 -0700 (PDT)
Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 16EED43E64
	for <arch@FreeBSD.ORG>; Wed, 17 Jul 2002 00:55:54 -0700 (PDT)
	(envelope-from dl-freebsd@catspoiler.org)
Received: from mousie.catspoiler.org (mousie.catspoiler.org [192.168.101.2])
	by gw.catspoiler.org (8.12.5/8.12.5) with ESMTP id g6H7t1wr029257;
	Wed, 17 Jul 2002 00:55:06 -0700 (PDT)
	(envelope-from dl-freebsd@catspoiler.org)
Message-Id: <200207170755.g6H7t1wr029257@gw.catspoiler.org>
Date: Wed, 17 Jul 2002 00:55:01 -0700 (PDT)
From: Don Lewis <dl-freebsd@catspoiler.org>
Subject: Re: wiring the sysctl output buffer
To: bde@zeta.org.au
Cc: arch@FreeBSD.ORG
In-Reply-To: <20020717135901.F3656-100000@gamplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

On 17 Jul, Bruce Evans wrote:
> On Tue, 16 Jul 2002, Don Lewis wrote:
> 
>> On 14 Jul, Bruce Evans wrote:
>> > ...
>> > `oldlen' is known ahead of time, so a large enough buffer could be
>> > allocated in most cases.  Perhaps vslock() iff oldlen is very large.
>> > The data could still change while it is being copied to a malloc()'ed
>> > buffer (or earlier) if the kernel is preemptible.
>>
>> I basically like this approach.  In most cases the data will be a small
>> and of fixed size, so the handler could even store the temporary copy on
>> the stack.  If the data is a type that can be copied atomically, like an
>> int, we don't have to worry about the data changing in mid-copy.  If it
> 
> It's not clear that ints can be copied atomically unless a suitable lock
> is held.  Longs can't be on some arches (i386's with 64-bit longs for
> example :-).

I believe that ints can be copied atomically by assignment if they are
naturally aligned and nobody is doing byte writes to them.  It's would
be likely that longs could be copied atomically as long as neither the
the reader nor the writer of the variable in question were preempted by
interrupts in the middle of the two halves of their operations.  I
wouldn't want to make any guarantees without a lot of careful study.
It's not likely that you could get an atomic snapshot with bcopy() or
copyout() without doing explicit locking.

>> is larger and we care about getting an atomic snapshot, the handler can
>> lock the source for the duration of the copy to the temporary buffer.
>>
>> For large amounts of data, we probably want to avoid copying to a
>> temporary buffer.  PHK raised the issue of the user specifying a buffer
>> size that is way too large.  Allocating a huge buffer in wired down
>> kernel memory sounds even worse than wiring down a huge buffer in user
>> space.
> 
> I think this is a small problem provided you only make the buffer large
> enough to hold what the kernel is supplying in the current SYSCTL_OUT()
> and not what the application is willing to accept for the whole sysctl.
> There may be no suppliers of huge kernel data except KERN_VNODE, and
> KERN_VNODE was turned off because it was too hard to supply that much
> data correctly even years ago when there were fewer vnodes and less
> concurrency than now.  Any suppliers of huge kernel data may have the
> same problems.

As long as we know the amount of data ahead of time ...

>> If we don't want to allocate a temporary buffer of length oldlen because
>> it is much larger than what looks to be reasonable, but the amount of
> 
> OK, but allocating if it is <= PAGE_SIZE or small enough to put on the
> stack should handle most cases and hopefully not signficantly complicate
> the problem of keeping track of what you allocated.

That sounds about right since that's probably about the crossover point
between the expense of the extra copy versus the call to vslock().

>> data is not easy to calculate ahead of time, we have another potential
>> problem.  We might have to lock a data structure, walk through this
>> data structure and calculate the total data size, and unlock the data
>> structure.  After we malloc() the buffer, relock the data structure, and
>> start traversing the data structure again to do the copy, we may
>> discover that the amount of data to be returned has increased, so we
>> would have to drop the lock, free the buffer, and start over again.
> 
> Also, the size might turn out to be too huge to allocate.  You would then
> need to backtrack and recalulate the size using a less ambitious algorithm.
> 
>> > I think the callers of SYSCTL_OUT() don't need a new API, but they
>> > should supply the data in a form suitable for copying out directly.
>> > This probably involves them making a copy while holding suitable
>> > domain-specific locks.  They can't just point to an active kernel
>> > struct since it might change.
>> >
>> > I would just make a copy at a high level for now.
>>
>> I'd vote for a hybrid approach.  I would remove the vslock() call from
>> sysctl_old_user() and provide the new API for invoking it from the
>> handler if appropriate.  Instead, I would call WITNESS_SLEEP() if
>> vslock() had not been called to wire the buffer to guard against
>> blocking while locks are held.  I would keep the call to vsunlock() in
>> the sysctl cleanup code.
> 
> It's never really appropriate, but I guess you need it for a few sysctls
> because they are too large to fix all at once.  I hope the new API won't
> be used much.  Sysctls with fixed locking can keep using the current code
> since they will have copied the data to a safe place and not hold any
> locks when they call SYSCTL_OUT() or care if SYSCTL_OUT() makes another
> copy or blocks.

I suspect that there are very few places where this would be used.  A
very large percentage of the sysctl calls use the standard handlers and
small data types whose size is known ahead of time.  All of these can be
modified to make a temporary copy of the data and can avoid the vslock()
overhead.

>> I'd also like to provide a second argument to the interface to vslock()
>> to allow the handler to specify a maximum, kernel enforced, buffer
>> length to potentially limit the amount of wired memory to something less
>> than oldlen, but unless we add a member to the sysctl_req structure to
>> hold it, we would need to overwrite oldlen so that we can pass the
>> proper value to vsunlock() in the cleanup code.  What kind impact would
>> changing sysctl_req have on binary compatibility?  I doesn't look like a
>> problem to me, since this structure seems to be mostly treated as an
>> opaque type.
> 
> Changing its size would probably not be good for some modules.
> 
> I think no time should be spent changing interfaces for this.  The
> following hack might work well: use the current code in sysctl_old_user()
> if req->oldlen is not huge.  Otherwise, vslock() only the region being
> copied out to and vsunlock() it after the copy so that we don't have to
> keep track of what is vslock()ed.

That doesn't fix sysctl_kern_function_list(), which grabs a lock and
then walks a linked list, calling SYSCTL_OUT() on each element.  In this
case we can't allow SYSCTL_OUT() to block in either vslock() or
copyout().  The incremental vslock() solution also doesn't really solve
any problems if the calls to vslock() and vsunlock() are done in
sysctl_old_user().  They neither guarantee us a coherent copy if the
source buffer is dynamic and unlocked (though they might increase the
probability), and they also prevent SYSCTL_OUT() being called while a
lock is held.

My initial solution was to allocate a temporary buffer, but that
complicates the code and the similar code might have to be added to
other parts of the kernel where similar situations might arise.  It was
not until later that I thought of pre-wiring the buffer, which looks
like a much cleaner solution.  In the latter case, I decided that it
would be best to abstact the interface so that the handler doesn't need
to become intimate with the internals of the sysctl system.  If this
solution allows us to avoid the call to vslock() in the majority of
cases, then so much the better.

The initial implementation of the interface to vslock() would probably
just ignore the size argument, but defining the API this way would allow
it to be used later if necessary without the necessity of retrofitting
the handlers that use it.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message