From owner-freebsd-arch  Tue Jul 16 22:17: 2 2002
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.FreeBSD.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 0A8F937B400
	for <arch@FreeBSD.ORG>; Tue, 16 Jul 2002 22:16:59 -0700 (PDT)
Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 1359E43E31
	for <arch@FreeBSD.ORG>; Tue, 16 Jul 2002 22:16:58 -0700 (PDT)
	(envelope-from bde@zeta.org.au)
Received: from bde.zeta.org.au (bde.zeta.org.au [203.2.228.102])
	by mailman.zeta.org.au (8.9.3/8.8.7) with ESMTP id PAA27438;
	Wed, 17 Jul 2002 15:16:47 +1000
Date: Wed, 17 Jul 2002 15:20:31 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@gamplex.bde.org
To: Don Lewis <dl-freebsd@catspoiler.org>
Cc: arch@FreeBSD.ORG
Subject: Re: wiring the sysctl output buffer
In-Reply-To: <200207160954.g6G9scwr026538@gw.catspoiler.org>
Message-ID: <20020717135901.F3656-100000@gamplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

On Tue, 16 Jul 2002, Don Lewis wrote:

> On 14 Jul, Bruce Evans wrote:
> > ...
> > `oldlen' is known ahead of time, so a large enough buffer could be
> > allocated in most cases.  Perhaps vslock() iff oldlen is very large.
> > The data could still change while it is being copied to a malloc()'ed
> > buffer (or earlier) if the kernel is preemptible.
>
> I basically like this approach.  In most cases the data will be a small
> and of fixed size, so the handler could even store the temporary copy on
> the stack.  If the data is a type that can be copied atomically, like an
> int, we don't have to worry about the data changing in mid-copy.  If it

It's not clear that ints can be copied atomically unless a suitable lock
is held.  Longs can't be on some arches (i386's with 64-bit longs for
example :-).

> is larger and we care about getting an atomic snapshot, the handler can
> lock the source for the duration of the copy to the temporary buffer.
>
> For large amounts of data, we probably want to avoid copying to a
> temporary buffer.  PHK raised the issue of the user specifying a buffer
> size that is way too large.  Allocating a huge buffer in wired down
> kernel memory sounds even worse than wiring down a huge buffer in user
> space.

I think this is a small problem provided you only make the buffer large
enough to hold what the kernel is supplying in the current SYSCTL_OUT()
and not what the application is willing to accept for the whole sysctl.
There may be no suppliers of huge kernel data except KERN_VNODE, and
KERN_VNODE was turned off because it was too hard to supply that much
data correctly even years ago when there were fewer vnodes and less
concurrency than now.  Any suppliers of huge kernel data may have the
same problems.

> If we don't want to allocate a temporary buffer of length oldlen because
> it is much larger than what looks to be reasonable, but the amount of

OK, but allocating if it is <= PAGE_SIZE or small enough to put on the
stack should handle most cases and hopefully not signficantly complicate
the problem of keeping track of what you allocated.

> data is not easy to calculate ahead of time, we have another potential
> problem.  We might have to lock a data structure, walk through this
> data structure and calculate the total data size, and unlock the data
> structure.  After we malloc() the buffer, relock the data structure, and
> start traversing the data structure again to do the copy, we may
> discover that the amount of data to be returned has increased, so we
> would have to drop the lock, free the buffer, and start over again.

Also, the size might turn out to be too huge to allocate.  You would then
need to backtrack and recalulate the size using a less ambitious algorithm.

> > I think the callers of SYSCTL_OUT() don't need a new API, but they
> > should supply the data in a form suitable for copying out directly.
> > This probably involves them making a copy while holding suitable
> > domain-specific locks.  They can't just point to an active kernel
> > struct since it might change.
> >
> > I would just make a copy at a high level for now.
>
> I'd vote for a hybrid approach.  I would remove the vslock() call from
> sysctl_old_user() and provide the new API for invoking it from the
> handler if appropriate.  Instead, I would call WITNESS_SLEEP() if
> vslock() had not been called to wire the buffer to guard against
> blocking while locks are held.  I would keep the call to vsunlock() in
> the sysctl cleanup code.

It's never really appropriate, but I guess you need it for a few sysctls
because they are too large to fix all at once.  I hope the new API won't
be used much.  Sysctls with fixed locking can keep using the current code
since they will have copied the data to a safe place and not hold any
locks when they call SYSCTL_OUT() or care if SYSCTL_OUT() makes another
copy or blocks.

> I'd also like to provide a second argument to the interface to vslock()
> to allow the handler to specify a maximum, kernel enforced, buffer
> length to potentially limit the amount of wired memory to something less
> than oldlen, but unless we add a member to the sysctl_req structure to
> hold it, we would need to overwrite oldlen so that we can pass the
> proper value to vsunlock() in the cleanup code.  What kind impact would
> changing sysctl_req have on binary compatibility?  I doesn't look like a
> problem to me, since this structure seems to be mostly treated as an
> opaque type.

Changing its size would probably not be good for some modules.

I think no time should be spent changing interfaces for this.  The
following hack might work well: use the current code in sysctl_old_user()
if req->oldlen is not huge.  Otherwise, vslock() only the region being
copied out to and vsunlock() it after the copy so that we don't have to
keep track of what is vslock()ed.

Bruce


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message